So my project is only using the CPU, right? I don't know how to use the GPU, I would like to do that improvement to the project. Regarding multithreading, I have used it in other programming languages but I need to search on Google about multithreading in C++. By the way, I don't have any fucking idea about quad doubles and AVX
Using CPU? Yes.
GPU programming has become simplified if you use C++ AMP.
Multithreading in C++ can be done a couple of different ways, using std::thread or std::async. I haven't gotten the hang of threads yet, nor have I practiced them in the past few months. std::async, though, is pretty simple to use. There is also the Parallel Patterns Library ( include ppl.h ), which has a parallel_for_each function, but I don't think it's as efficient as just using std::async for some reason.
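To give an idea of how little code std::async needs, here's a rough sketch that sums a vector by splitting it into chunks and handing each chunk to its own task. The function name parallel_sum and the chunking scheme are just my own illustration, not anything from your project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums `values` using up to `tasks` std::async tasks.
double parallel_sum(const std::vector<double>& values, std::size_t tasks)
{
    assert(tasks > 0);

    // Chunk size rounded up so that every element lands in some chunk.
    const std::size_t chunk = (values.size() + tasks - 1) / tasks;
    std::vector<std::future<double>> futures;

    for (std::size_t begin = 0; begin < values.size(); begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, values.size());

        // std::launch::async asks for the task to run on its own thread.
        futures.push_back(std::async(std::launch::async, [&values, begin, end] {
            return std::accumulate(values.begin() + begin, values.begin() + end, 0.0);
        }));
    }

    // get() blocks until the task has finished and hands back its partial sum.
    double total = 0.0;
    for (auto& f : futures)
        total += f.get();
    return total;
}
```

Calling parallel_sum( data, 4 ) would split the work into four tasks. On Linux you may need to build with -pthread.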
I believe chili is referring to using AVX as opposed to SSE because AVX supports 4 doubles at a time whereas SSE only supports 2.
AVX instructions can seem kind of intimidating at first, but you should be able to catch on quickly. The easiest way to deal with them is to create operator overloads for the main math operators ( +, -, *, / ). Perhaps even wrapper functions for the loading and storing of data as well, because the intrinsic names are long.
Loading
__m256d avxData = _mm256_load_pd( &someDoubleArray[ Idx ] ); // 32 byte aligned load
__m256d avxData = _mm256_loadu_pd( &someDoubleArray[ Idx ] ); // unaligned load
__m256d avxData = _mm256_set1_pd( 7.0 ); // Set all 4 doubles to 7.0
__m256d avxData = _mm256_set_pd(4.0, 3.0, 2.0, 1.0); // Arguments go from highest element to lowest, so in memory order the 4 doubles are 1.0, 2.0, 3.0, 4.0
Storing
alignas(32) double data[4];
_mm256_store_pd( data, avxData ); // stores data in 32 byte aligned memory address
_mm256_storeu_pd( data, avxData ); // stores to an address that doesn't have to be aligned ( also works if it happens to be aligned )
( There are a couple of other store functions, but can't remember them )
Math
Add
_mm256_add_pd( avxDataA, avxDataB );
Sub
_mm256_sub_pd( avxDataA, avxDataB );
Multiply
_mm256_mul_pd( avxDataA, avxDataB );
Divide
_mm256_div_pd( avxDataA, avxDataB );
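Here's a rough sketch of the operator overload and wrapper idea mentioned earlier. The names Vec4d, load, splat, store and scale_and_add are made up for this example. I wrapped the __m256d in a struct because some compilers ( GCC for one ) already define the math operators on the raw type and won't let you overload them. The #if falls back to a plain array when the build doesn't have AVX enabled, so the same code compiles either way.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical wrapper around 4 doubles.
struct Vec4d
{
#if defined(__AVX__)
    __m256d v;

    static Vec4d load(const double* p) { return { _mm256_loadu_pd(p) }; } // unaligned load
    static Vec4d splat(double x)       { return { _mm256_set1_pd(x) }; }  // all 4 lanes = x
    void store(double* p) const        { _mm256_storeu_pd(p, v); }        // unaligned store

    friend Vec4d operator+(Vec4d a, Vec4d b) { return { _mm256_add_pd(a.v, b.v) }; }
    friend Vec4d operator-(Vec4d a, Vec4d b) { return { _mm256_sub_pd(a.v, b.v) }; }
    friend Vec4d operator*(Vec4d a, Vec4d b) { return { _mm256_mul_pd(a.v, b.v) }; }
    friend Vec4d operator/(Vec4d a, Vec4d b) { return { _mm256_div_pd(a.v, b.v) }; }
#else
    // Scalar fallback with the same interface, used when AVX isn't enabled.
    double v[4];

    static Vec4d load(const double* p) { return { { p[0], p[1], p[2], p[3] } }; }
    static Vec4d splat(double x)       { return { { x, x, x, x } }; }
    void store(double* p) const        { for (int i = 0; i < 4; ++i) p[i] = v[i]; }

    friend Vec4d operator+(Vec4d a, Vec4d b) { for (int i = 0; i < 4; ++i) a.v[i] += b.v[i]; return a; }
    friend Vec4d operator-(Vec4d a, Vec4d b) { for (int i = 0; i < 4; ++i) a.v[i] -= b.v[i]; return a; }
    friend Vec4d operator*(Vec4d a, Vec4d b) { for (int i = 0; i < 4; ++i) a.v[i] *= b.v[i]; return a; }
    friend Vec4d operator/(Vec4d a, Vec4d b) { for (int i = 0; i < 4; ++i) a.v[i] /= b.v[i]; return a; }
#endif
};

// out[i] = a * x[i] + y[i], four elements per loop iteration.
void scale_and_add(double a, const double* x, const double* y, double* out, std::size_t n)
{
    assert(n % 4 == 0); // sketch only: no scalar loop for leftover elements

    const Vec4d va = Vec4d::splat(a);
    for (std::size_t i = 0; i < n; i += 4)
        (va * Vec4d::load(x + i) + Vec4d::load(y + i)).store(out + i);
}
```

With the wrapper in place the math reads like normal scalar code, which is the whole point.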
The hardest part about SIMD ( SSE and AVX ) is the lack of branching, so no if statements. The good thing is sometimes you can do something similar to
double result = a < b ? 7.0 : 4.0;
This is a bit more involved using SIMD instructions especially AVX.
// This is the data. alignas(32) just allocates data on 32 byte boundaries
alignas( 32 ) double data0[ 4 ]{ 4.0, 3.0, 2.0, 1.0 };
alignas( 32 ) double data1[ 4 ]{ 3.0, 3.0, 3.0, 3.0 };
alignas( 32 ) double data2[ 4 ]{ 4.0, 4.0, 4.0, 4.0 };
// Using 32 byte aligned loads
__m256d avxData0 = _mm256_load_pd( data0 );
__m256d avxData1 = _mm256_load_pd( data1 );
__m256d avxData2 = _mm256_load_pd( data2 );
// Create the true mask, this mask will return a bit pattern of 0xFFFFFFFFFFFFFFFF for channels that pass the Greater than or equals check
__m256d trueMask = _mm256_cmp_pd( avxData0, avxData1, _CMP_GE_OQ );
// When doing the checks, make sure to pass the mask as the left param. This matters most for the andnot function, which computes ( NOT left ) AND right. Try swapping them and see what you come up with as an experiment.
// If channel is greater than or equal to 3.0
__m256d _if = _mm256_and_pd( trueMask, avxData2 );
// else
__m256d _else = _mm256_andnot_pd( trueMask, avxData1 );
// Combine results
__m256d result = _mm256_or_pd( _if, _else );
// The result is { 4.0, 4.0, 3.0, 3.0 } because 4 and 3 are greater than or equal to 3 while 2 and 1 aren't
Things get even more complex trying to simulate if/else if/else conditions though.
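One way to keep an if / else if / else chain manageable is to wrap the and / andnot / or trick in a small helper and nest it, starting from the innermost branch. Below is a rough sketch that clamps 4 doubles to a range. The names select_lanes and clamp4 are made up for this example, and the #if falls back to a plain loop when the build doesn't have AVX enabled.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>

// Hypothetical helper: per lane, picks a where the mask is all ones, b where it is all zeros.
static inline __m256d select_lanes(__m256d mask, __m256d a, __m256d b)
{
    return _mm256_or_pd(_mm256_and_pd(mask, a), _mm256_andnot_pd(mask, b));
}
#endif

// Per element: if (x < lo) out = lo; else if (x > hi) out = hi; else out = x;
void clamp4(const double* x, double lo, double hi, double* out)
{
    assert(lo <= hi);
#if defined(__AVX__)
    const __m256d vx  = _mm256_loadu_pd(x);
    const __m256d vlo = _mm256_set1_pd(lo);
    const __m256d vhi = _mm256_set1_pd(hi);

    const __m256d below = _mm256_cmp_pd(vx, vlo, _CMP_LT_OQ); // x < lo
    const __m256d above = _mm256_cmp_pd(vx, vhi, _CMP_GT_OQ); // x > hi

    // Work from the inside out: resolve the "else if / else" part first,
    // then let the outer "if" override it where its mask is set.
    const __m256d inner = select_lanes(above, vhi, vx);
    _mm256_storeu_pd(out, select_lanes(below, vlo, inner));
#else
    // Scalar fallback with the same behaviour.
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = x[i] < lo ? lo : (x[i] > hi ? hi : x[i]);
#endif
}
```

AVX also has _mm256_blendv_pd( a, b, mask ), which does the select in one call. Watch the argument order there: it takes the value from b where the mask is set and from a otherwise.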
Here's a good resource for looking up SSE/AVX instructions. Keep in mind though that not all CPUs support AVX, or even SSE versions 3 or higher, depending on the brand and age. For instance, my previous AMD CPU came out in 2008 and didn't support anything newer than SSE 3, while the Intel Core 2 chips from around the same period supported up to SSE 4.1 ( SSE 4.2 came with the first Core i7 chips, as far as I know ). It wasn't until the FX Bulldozer chips that AMD got support for SSE 4 and AVX, and that was back in late 2011.
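Since support varies, it can be worth checking at runtime before taking an AVX code path. Here's a minimal sketch that assumes GCC or Clang on x86, where the __builtin_cpu_supports builtin is available. The function name cpu_has_avx is made up, and on other compilers this sketch just reports false so you'd stay on the scalar path ( MSVC would need its own check using the __cpuid intrinsic, which I've left out ).

```cpp
#include <cassert>

// Hypothetical helper: true if the CPU we're running on reports AVX support.
bool cpu_has_avx()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    // GCC / Clang builtin that queries the CPU's feature flags.
    return __builtin_cpu_supports("avx") != 0;
#else
    // Assumption for this sketch: unknown toolchain or CPU, so play it safe.
    return false;
#endif
}
```

You'd call this once at startup and pick between an AVX version and a plain version of your hot loop based on the result.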
C++ AMP is a pretty nifty way of utilizing the GPU while still programming in C++. I've played around with it a bit using the raytracer from warrior, went from 5 frames per second on CPU to 160 frames per second on the GPU and C++ AMP.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com