The Mandelbrot Set using the Chili Framework

Saturos · Post by **Saturos** » March 7th, 2017, 9:09 pm

Hello everyone,

I'm following the "Beginner C++ Game Programming DirectX" tutorial and I've been learning a lot of stuff (thanks, Chili

), so I decided to create a program about the Mandelbrot set using the Chili Framework.

The Mandelbrot set is a very famous fractal in math whose shape is calculated with the help of computers. Essentially, each pixel of the screen has two values (like coordinates) that are used in a formula. The pixel is then colored depending on the number of loops (or iterations) that the program internally does while a condition is satisfied. I don't want to extend this post a lot, so if anyone is curious, I can give more details in the comments.

Here is the source code of the project with the solution (the controls are listed at the end of the post): https://github.com/S4turos/Mandelbrot-Set
Note: I used Microsoft Visual Studio 2017.

And here some pictures:

Note: it's important to keep in mind that these pictures are generated automatically with a simple formula. That's the magic!

Controls:

Q: to decrease the number of iterations (less details).
E: to increase the number of iterations (more details).
W: to increase the speed of the camera.
S: to decrease the speed of the camera.
Space: to reset the image.
Arrow keys: to move the camera.
Left mouse button: to zoom in.
Right mouse button: to zoom out.

PS: English is not my first language, so forgive me if I have committed some fault

Byteraver · Post by **Byteraver** » March 7th, 2017, 9:45 pm

Cool! I remembered that, on my 486, on a resolution of 160*100, it still took some seconds to render

It's faster now but still not real time. For a fast zoom see https://www.youtube.com/watch?v=BLMUfBikxTY at 3:15 if you don't want to watch the whole video

Post by **chili** » March 8th, 2017, 3:42 am

Nice work, bro! Code looks pretty clean, easy to follow. Mandelbrot is always fun times.

Looking at this makes me wanna write my own, with multithreading & AVX bullshit

Maybe with something like quad doubles for increased precision or some GPGPU.

Gotta back down...

Saturos · Post by **Saturos** » March 8th, 2017, 12:02 pm

Byteraver wrote:Cool! I remembered that, on my 486, on a resolution of 160*100, it still took some seconds to render It's faster now but still not real time. For a fast zoom see https://www.youtube.com/watch?v=BLMUfBikxTY at 3:15 if you don't want to watch the whole video

Cool video!

Now it's easier to represent the Mandelbrot set with the modern computers, but it is still hard to get a fast zoom. It's necessary to do multithreading or something like that as Chili said, but I don't have much knowledge on that, especially in C++

Chili wrote:Nice work, bro! Code looks pretty clean, easy to follow. Mandelbrot is always fun times.

Looking at this makes me wanna write my own, with multithreading & AVX bullshit Maybe with something like quad doubles for increased precision or some GPGPU.

Gotta back down...

Thanks Chili! Your mental version of the Mandelbrot set sounds pretty awesome, at least much more efficient and precise than mine

So my project is only using the CPU, right? I don't know how to use the GPU, I would like to do that improvement to the project. Regarding multithreading, I have used it in other programming languages but I need to search on Google about multithreading in C++. By the way, I don't have any fucking idea about quad doubles and AVX

Post by **albinopapa** » March 8th, 2017, 1:47 pm

So my project is only using the CPU, right? I don't know how to use the GPU, I would like to do that improvement to the project. Regarding multithreading, I have used it in other programming languages but I need to search on Google about multithreading in C++. By the way, I don't have any fucking idea about quad doubles and AVX

Using CPU? Yes.
GPU programming has become simplified if you use C++ AMP.

Multithreading in C++ can be done a couple of different ways using std::thread or std::async. I haven't gotten the hang of threads yet, nor have I practiced them in the past few months. Std::async though is pretty simple to use and there is the parallel library ( include ppl.h ) that has a parallel_for_each function, but I don't think it's as efficient as just using std::async for some reason.

I believe chili is referring to using AVX as opposed to SSE because AVX supports 4 doubles at a time whereas SSE only supports 2.
AVX instructions can seem kind of intimidating at first, but you should be able to quickly catch on. The easiest way to deal with them is to create operator overloads for the main math operators ( +, -, *, / ). Perhaps even wrapper functions for the loading and storing of data as well, because the instructions names are long.

Loading
__m256d avxData = _mm256_load_pd( &someDoubleArray[ Idx ] ); // 32 byte aligned load
__m256d avxData = _mm256_loadu_pd( &someDoubleArray[ Idx ] ); // unaligned load
__m256d avxData = _mm256_set1_pd( 7.0 ); // Set all 4 doubles to 7.0
__m256d avxData = _mm256_set_pd(4.0, 3.0, 2.0, 1.0); // Sets the 4 doubles to 1.0, 2.0, 3.0, 4.0

Storing
alignas(32) double data[4];
_mm256_store_pd( data, avxData ); // stores data in 32 byte aligned memory address
_mm256_storeu_pd( data, avxData ); // stores unaligned data, unless the destination is already aligned
( There are a couple of other store functions, but can't remember them )

Math
Add
_mm256_add_pd( avxDataA, avxDataB );

Sub
_mm256_sub_pd( avxDataA, avxDataB );

Multiply
_mm256_mul_pd( avxDataA, avxDataB );

Divide
_mm256_div_pd( avxDataA, avxDataB );

The hardest part about SIMD ( SSE and AVX ) is the lack of branching, so no if statements. The good thing is sometimes you can do something similar to
double result = a < b ? 7.0 : 4.0;

This is a bit more involved using SIMD instructions especially AVX.

// This is the data. alignas(32) just allocates data on 32 byte boundaries
alignas( 32 ) double data0[ 4 ]{ 4.0, 3.0, 2.0, 1.0 };
alignas( 32 ) double data1[ 4 ]{ 3.0, 3.0, 3.0, 3.0 };
alignas( 32 ) double data2[ 4 ]{ 4.0, 4.0, 4.0, 4.0 };

// Using 32 byte aligned loads
__m256d avxData0 = _mm256_load_pd( data0 );
__m256d avxData1 = _mm256_load_pd( data1 );
__m256d avxData2 = _mm256_load_pd( data2 );

// Create the true mask, this mask will return a bit pattern of 0xFFFFFFFFFFFFFFFF for channels that pass the Greater than or equals check
__m256d trueMask = _mm256_cmp_pd( avxData0, avxData1, _CMP_GE_OQ );

// When doing the checks, make sure to pass the mask as the left param, especially for the andnot function, try swapping them and see what you come up with as an experiment.
// If channel is greater than 3.0
__m256d _if = _mm256_and_pd( trueMask, avxData2 );
// else
__m256d _else = _mm256_andnot_pd( trueMask, avxData1 );

// The result is { 4.0, 4.0, 3.0, 3.0 } because 4 and 3 are greater or equal to 3 while 2 and 1 aren't
// Combine results
__m256d result = _mm256_or_pd( _if, _else );

Things get even more complex trying to simulate if/else if/else conditions though.
Here's a good resource for looking up SSE/AVX instructions. Keep in mind though that not all CPUs support AVX or even SSE versions 3 or higher depending on the brand and age. For instance, my previous AMD CPU came out in 2008 and didn't support anything new than SSE 3 while the Intel Core 2 chips from around the same period supported up to SSE 4.1 or 4.2, not sure. It wasn't until the FX Bulldozer chips from AMD that got support for SSE 4 and AVX and that was back in 2012.

C++ AMP is a pretty nifty way of utilizing the GPU while still programming in C++. I've played around with it a bit using the raytracer from warrior, went from 5 frames per second on CPU to 160 frames per second on the GPU and C++ AMP.

Zedtho · Post by **Zedtho** » March 8th, 2017, 3:34 pm

What calculation do you make? I've heard of the Mandelbrot set before, something with squares and roots... but I'd love to have a go at this too!

Saturos · Post by **Saturos** » March 10th, 2017, 7:47 pm

Albinopapa, thanks for your great explanation, but I have to read AVX instructions carefully because it seems that the syntax is not user friendly at first glimpse.

Zedtho wrote:What calculation do you make? I've heard of the Mandelbrot set before, something with squares and roots... but I'd love to have a go at this too!

Yes, the main calculation is with squares and roots, but roots are usually avoided in computers since it is less efficient. The Mandelbrot set is based on complex numbers, but I used real numbers because I'm more familiar with that and I don't really know how to use complex data types in C++.

I think the best way for understanding how the calculation is made is looking at the pseudocode:

Code: Select all

For each pixel (Px, Py) on the screen, do:
{
  x0 = scaled x coordinate of pixel (scaled to lie in the Mandelbrot X scale (-2.5, 1))
  y0 = scaled y coordinate of pixel (scaled to lie in the Mandelbrot Y scale (-1, 1))
  x = 0.0
  y = 0.0
  iteration = 0
  max_iteration = 1000
  while (x*x + y*y < 2*2  AND  iteration < max_iteration) {
    xtemp = x*x - y*y + x0
    y = 2*x*y + y0
    x = xtemp
    iteration = iteration + 1
  }
  color = palette[iteration]
  plot(Px, Py, color)
}

x0 and y0 are the values for every pixel according to its position in the plane. For example, in my project the top left pixel of the corner has -2.05 (x0) and 2.05 (y0) values with no zoom. The final result of every iteration (x*x + y*y) has to be less than 4 to continue the loop (in the definition of the Mandelbrot set the limit is 2 because of roots). The color of the pixel depends on the number of iterations. In my case, it goes from blue (if the result is greater than 4 quickly) to black (if it reaches the maximum iterations). Theoretically, the black pixels are inside the Mandelbrot set, but it is possible that if you do some more iterations, that point of the plane could escape from the Mandelbrot set (greater than 4 with no roots).

When I zoom in, I simply recalculate the values for each pixel. For example, the top right pixel of the corner would be -2.04784... and 2.04784... instead of -2.05 and 2.05. All my calculations are made with doubles, so there is a limit you cannot go further.

Post by **albinopapa** » March 10th, 2017, 9:14 pm

Yeah, that's what I was saying, they can look intimidating at first glance, but that is also why I suggested overloading the math operators ( +-*/ ).

Just looking at the pseudo code you posted, it's all adding, subtracting and multiplying so the three you'd want to use are _mm256_add_pd, _mm256_sub_pd and _mm256_mul_pd. Then of course there is the loading and storing, _mm256_load_pd or _mm256_loadu_pd and _mm256_store_pd or _mm256_storeu_pd depending on if you have aligned your data before hand.

The naming of the functions are as follows

Code: Select all

_mm256              <- mm256 (256 bit multimedia)

add, sub, mul, div <- addition, subtraction, multiplication and division

pd                       <- packed double precision, meaning that the avx variable has more than one unique double value
ps                       <- packed single precision, meaning that the avx variable has more than one unique float value
(ex. _mm256_add_pd or _mm256_add_ps )

sd                       <- scalar double precision, meaning that only the first element has a double precision float value
ss                       <- scalar single precision, meaning that only the first element has a single precision float value
(ex. _mm256_mul_sd or _mm256_mul_ss)

For me, it is easier to make wrappers for SSE and AVX instructions. For instance, I made an SSE class that wraps all the instructions for loading storing and the operations from the math operators to min/max to bit mask operations like the & and |. The andnot function takes two operands, while the bitwise not (~) only takes one, so while you can overload it, it's still more efficient to just wrap the andnot in a Andnot function. Also overloaded the bool operations (==, !=, >, >=, <, <=). Makes using SIMD a heck of a lot easier and more natural.

Post by **chili** » March 11th, 2017, 4:58 am

I would recommend first maybe exploring using std::async to thread your shit.
It is an embarrassingly parallel problem, so that should be achievable with minimal effort.

Saturos · Post by **Saturos** » March 11th, 2017, 10:26 am

Thanks albinopapa again

Maybe I'll give it a try. I think the first thing I'm gonna do is what Chili has said: multithreading. I'll tell you if I make any progress

Planet Chili

The Mandelbrot Set using the Chili Framework

The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework

Re: The Mandelbrot Set using the Chili Framework