Looping through index rather than iterators
When coding with simd, you're going to loop through a lot of arrays at once. This proved to be easier to write down when looping with indices. That way, you can just use one index to reference the members of each array. Using iterators would require you to make the additional iterators outside the for loop if they are of different types. Plus, you need to increment each iterator individually which all ends up with having a lot of clutter in your code. Just how inefficient can looping by index be as opposed to looping with iterators that I have to learn to deal with the clutter?
As it turns out; not that much really. I created two tests where one loops through 5 vectors with indices and the other with iterators. I ran them multiple times and they end up with very similar numbers. If there was a difference, it would be too miniscule to notice. From my Assembly knowledge, the cpu is optimised for various memory access methods including [start_of_memory_segment + displacement]. This is really useful for OSs that use a lot of virtual memory for memory management and most don't use actual physical addresses all that much to access memory. So really, accessing memory through indices is how memory is accessed by default.
Typing that out really does prove how much easier and cleaner looping through indices are. When making the for loop with iterators, I ended up with an unauthorized access error. It really took me a long time to find that I accidentally incremented an iterator twice. That for loop was painful to make, ugly to look at and easy to make mistakes in.
Doing branching at a separate for loop
If statements and simd don't mix very well. To get simd to work, you need to have a group of data that are all going to run the same set of instructions. Branching instructions like if statements and switch statements(and logical operators too but simd has their own set of instructions for that) break that. As a result, I wanted to test just how much inefficient it is to separate the branching from the for loop that does the calculation. That would require 2 loops with the first actually doing the calculation and the second to make a decision based on the result. It would also require a container to store the results of those calculations so you can evaluate them. This way, simd can be used for the calculation for loop and then we can evaluate the result normally.
The tests use a simple sphere collision check that stores the number 69 in an output array if they touch and 420 if they don't. Simd optimisation was not used since I wanted the test to be fair. I need to see how much it performs without any other optimisations.
In this benchmark, the results are close but the difference is noticable. The difference wasn't that bad but it is there and you can see it if you look for it. You're basically doing the exact same thing but one loops twice instead of once. Plus, storing those calculations in additional containers to be evaluated later will give you some overhead too. I would imagine that this would be worth it once you implement simd but it's not something you want to do normally.
Separating your data into different arrays based on what you want to do with them
Continuing our quest to eliminate branching for simd friendly code, I tested out the difference in performance if you separate your data into different arrays based on what you want to do with them. This is a cited optimisation professional programmers advocate for when optimising even normal code. It also relates to the problem with heterogenous arrays(both using polymorphism and c-style polymorphism with unions and function pointers). When evaluating elements in a heterogeneous array, you have to do last minute type checking before you use them with an enum or a virtual function etc. This can easily be avoided with just using homogeneous arrays where you already know what the type is when you evaluate them. Furthermore separating them into different arrays based on the operations you will do on them will further optimise your code. So instead of...
Code: Select all
for ( auto& archer : archers )
{
switch ( archer.state )
{
case ArcherState::RETREAT:
{
//Go back to base
break;
}
case ArcherState::HUNT:
{
//find something to hunt
break;
}
}
}
Code: Select all
for ( auto& archer : hunting_archers )
{
//Go back to base
}
for ( auto& archer : retreating_archers )
{
//find something to hunt
}
After all, the previous test shows how doing more loops will hurt performance.
The professionals were right. The code that separates the data based on what instructions they will run performs nearly 12 times faster! Getting rid of all those unnecessary branching was worth the extra for loops.
Here is the code I used to benchmark. You'll have to run them multiple times since they will output a certain range of numbers. Record the numbers outputted and see which one is more likely to give lower results.
Test1
Code: Select all
#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
#include <chrono>
#include <random>
#include <memory>
#include <algorithm>
#include <limits>
#include <fstream>
void loop_test ( )
{
std::random_device seed { };
std::mt19937 engine { seed ( ) };
std::uniform_real_distribution<long double> rand { 1, 10000 };
const std::size_t size { 10000000 };
std::vector<int> container1;
std::vector<int> container2;
std::vector<int> container3;
std::vector<int> container4;
std::vector<int> container5;
container1.resize ( size );
container2.resize ( size );
container3.resize ( size );
container4.resize ( size );
container5.resize ( size );
for ( std::size_t index = 0; index < size; ++index )
{
container1 [ index ] = rand ( engine );
container2 [ index ] = rand ( engine );
container3 [ index ] = rand ( engine );
container4 [ index ] = rand ( engine );
container5 [ index ] = rand ( engine );
}
int total { };
auto it1 = container1.begin ( );
auto it2 = container2.begin ( );
auto it3 = container3.begin ( );
auto it4 = container4.begin ( );
//benchmark begin
const auto start = std::chrono::steady_clock::now ( );
for ( auto it5 = container5.begin ( ), end = container5.end ( ); it5 != end; ++it1, ++it2, ++it3, ++it4, ++it5 )
{
total += *it1 + *it2 + *it3 + *it4 + *it5;
}
const auto end = std::chrono::steady_clock::now ( );
//benchmark end
const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
std::cout << "LOOP THROUGH ITERATORS" << std::endl;
std::cout << "Total: " << total << std::endl;
std::cout << "Time: " << nanoseconds_taken << std::endl;
std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}
struct Vector3 final
{
Vector3 ( ) = default;
Vector3 ( const float x, const float y, const float z ) : x ( x ), y ( y ), z ( z )
{
}
Vector3 operator-( const Vector3 rhs ) const
{
return Vector3 { x - rhs.x, y - rhs.y, z - rhs.z };
}
float LengthSquared ( ) const
{
return x * x + y * y + z * z;
}
float x { };
float y { };
float z { };
};
namespace CollisionBody
{
struct Sphere final
{
Sphere ( ) = default;
Sphere ( const Vector3 position, float radius ) : position ( position ), radius ( radius )
{
}
Vector3 position { };
float radius { };
};
}
void do_branching_on_resolution ( )
{
using namespace CollisionBody;
std::random_device seed { };
std::mt19937 engine { seed ( ) };
std::uniform_real_distribution<float> rand_pos { 1.f, 100000.f };
std::uniform_real_distribution<float> rand_radius { 1.f, 100.f };
const std::size_t size { 10000 };
std::vector<Sphere> spheres;
spheres.resize ( size );
for ( Sphere& sphere : spheres )
{
sphere.position = Vector3 { rand_pos ( engine ), rand_pos ( engine ), rand_pos ( engine ) };
sphere.radius = rand_radius ( engine );
}
std::vector<int> results;
results.resize ( ( size * size + size ) / 2 );
auto result_it = results.begin ( );
//benchmark begin
const auto start = std::chrono::steady_clock::now ( );
for ( auto it1 = spheres.begin ( ), end = spheres.end ( ); it1 != end; ++it1 )
{
for ( auto it2 = it1 + 1; it2 != end; ++it2, ++result_it )
{
const Vector3 position1 = it1->position;
const float radius1 = it1->radius;
const Vector3 position2 = it2->position;
const float radius2 = it2->radius;
if ( ( position2 - position1 ).LengthSquared ( ) < radius1 * radius1 + radius2 * radius2 )
{
*result_it = 69;
}
else
{
*result_it = 420;
}
}
}
const auto end = std::chrono::steady_clock::now ( );
//benchmark end
const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
std::cout << "DO BRANCHING IN SAME LOOP" << std::endl;
std::cout << "Time: " << nanoseconds_taken << std::endl;
std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}
enum class Command
{
MULTIPLY,
DIVIDE,
PLUS,
MINUS
};
using TestType = int;
struct Test final
{
Test ( ) = default;
Test ( const TestType num1, const TestType num2, const Command command ) : num1 ( num1 ), num2 ( num2 ), command ( command )
{
}
TestType num1 { };
TestType num2 { };
Command command { Command::PLUS };
};
void sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( )
{
std::random_device seed { };
std::mt19937 engine { seed ( ) };
std::uniform_int_distribution<int> rand_command { 1, 4 };
std::uniform_real_distribution<long double> rand { 1, 1000000 };
const std::size_t size { 10000000 };
std::vector<Test> tests;
std::vector<TestType> results;
tests.resize ( size );
results.resize ( size );
for ( Test& test : tests )
{
Command command;
const int command_index { rand_command ( engine ) };
switch ( command_index )
{
case 1:
{
command = Command::DIVIDE;
break;
}
case 2:
{
command = Command::PLUS;
break;
}
case 3:
{
command = Command::MINUS;
break;
}
case 4:
{
command = Command::MULTIPLY;
break;
}
default:
{
throw std::exception ( "You missed one dumb dumb!" );
}
}
test = Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ), command };
}
//benchmark begin
auto it = results.begin ( );
const auto start = std::chrono::steady_clock::now ( );
for ( const Test test : tests )
{
switch ( test.command )
{
case Command::DIVIDE:
{
*it = test.num1 / test.num2;
break;
}
case Command::MINUS:
{
*it = test.num1 - test.num2;
break;
}
case Command::MULTIPLY:
{
*it = test.num1 * test.num2;
break;
}
case Command::PLUS:
{
*it = test.num1 + test.num2;
break;
}
}
++it;
}
const auto end = std::chrono::steady_clock::now ( );
//benchmark end
const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
std::cout << "STORE DATA IN ONE ARRAY" << std::endl;
std::cout << "Time: " << nanoseconds_taken << std::endl;
std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}
int main ( )
{
//loop_test ( );
do_branching_on_resolution ( );
//sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( );
system ( "pause" );
return 0;
}
Code: Select all
#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
#include <chrono>
#include <random>
#include <memory>
#include <algorithm>
#include <limits>
void loop_test ( )
{
std::random_device seed { };
std::mt19937 engine { seed ( ) };
std::uniform_real_distribution<long double> rand { 1, 10000 };
const std::size_t size { 10000000 };
std::vector<int> container1;
std::vector<int> container2;
std::vector<int> container3;
std::vector<int> container4;
std::vector<int> container5;
container1.resize ( size );
container2.resize ( size );
container3.resize ( size );
container4.resize ( size );
container5.resize ( size );
for ( std::size_t index = 0; index < size; ++index )
{
container1 [ index ] = rand ( engine );
container2 [ index ] = rand ( engine );
container3 [ index ] = rand ( engine );
container4 [ index ] = rand ( engine );
container5 [ index ] = rand ( engine );
}
int total { };
//benchmark begin
const auto start = std::chrono::steady_clock::now ( );
for ( std::size_t index = 0; index < size; ++index )
{
total += container1 [ index ] + container2 [ index ] + container3 [ index ] + container4 [ index ] + container5 [ index ];
}
const auto end = std::chrono::steady_clock::now ( );
//benchmark end
const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
std::cout << "LOOP THROUGH INDICES" << std::endl;
std::cout << "Total: " << total << std::endl;
std::cout << "Time: " << nanoseconds_taken << std::endl;
std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}
struct Vector3 final
{
Vector3 ( ) = default;
Vector3 ( const float x, const float y, const float z ) : x ( x ), y ( y ), z ( z )
{
}
Vector3 operator-( const Vector3 rhs ) const
{
return Vector3 { x - rhs.x, y - rhs.y, z - rhs.z };
}
float LengthSquared ( ) const
{
return x * x + y * y + z * z;
}
float x { };
float y { };
float z { };
};
namespace CollisionBody
{
struct Sphere final
{
Sphere ( ) = default;
Sphere ( const Vector3 position, float radius ) : position ( position ), radius ( radius )
{
}
Vector3 position { };
float radius { };
};
}
void do_branching_on_resolution ( )
{
using namespace CollisionBody;
std::random_device seed { };
std::mt19937 engine { seed ( ) };
std::uniform_real_distribution<float> rand_pos { 1.f, 100000.f };
std::uniform_real_distribution<float> rand_radius { 1.f, 100.f };
const std::size_t size { 10000 };
std::vector<Sphere> spheres;
spheres.resize ( size );
for ( Sphere& sphere : spheres )
{
sphere.position = Vector3 { rand_pos ( engine ), rand_pos ( engine ), rand_pos ( engine ) };
sphere.radius = rand_radius ( engine );
}
std::vector<float> distances;
std::vector<float> boundaries;
std::vector<int> results;
const std::size_t result_size = ( size * size + size ) / 2;
distances.resize ( result_size );
boundaries.resize ( result_size );
results.resize ( result_size );
auto distance_it = distances.begin ( );
auto boundary_it = boundaries.begin ( );
//benchmark begin
const auto start = std::chrono::steady_clock::now ( );
for ( auto it1 = spheres.begin ( ), end = spheres.end ( ); it1 != end; ++it1 )
{
for ( auto it2 = it1 + 1; it2 != end; ++it2, ++distance_it, ++boundary_it )
{
const Vector3 position1 = it1->position;
const float radius1 = it1->radius;
const Vector3 position2 = it2->position;
const float radius2 = it2->radius;
*distance_it = ( position2 - position1 ).LengthSquared ( );
*boundary_it = radius1 * radius1 + radius2 * radius2;
}
}
distance_it = distances.begin ( );
boundary_it = boundaries.begin ( );
auto result_it = results.begin ( );
for ( int& result : results )
{
if ( *distance_it <= *boundary_it )
{
*result_it = 69;
}
else
{
*result_it = 420;
}
++distance_it;
++boundary_it;
}
const auto end = std::chrono::steady_clock::now ( );
//benchmark end
const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
std::cout << "DO BRANCHING IN SEPERATE LOOP" << std::endl;
std::cout << "Time: " << nanoseconds_taken << std::endl;
std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}
using TestType = int;
struct Test final
{
Test ( ) = default;
Test ( const TestType num1, const TestType num2 ) : num1 ( num1 ), num2 ( num2 )
{
}
TestType num1 { };
TestType num2 { };
};
void sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( )
{
std::random_device seed { };
std::mt19937 engine { seed ( ) };
std::uniform_int_distribution<int> rand_command { 1, 4 };
std::uniform_real_distribution<long double> rand { 1, 1000000 };
const std::size_t size { 10000000 };
std::vector<Test> plus_tests;
std::vector<Test> minus_tests;
std::vector<Test> divide_tests;
std::vector<Test> multiply_tests;
std::vector<TestType> results;
results.resize( size );
for ( std::size_t index = 0; index < size; ++index )
{
const int command_index { rand_command ( engine ) };
switch ( command_index )
{
case 1:
{
plus_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
break;
}
case 2:
{
minus_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
break;
}
case 3:
{
divide_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
break;
}
case 4:
{
multiply_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
break;
}
}
}
//benchmark begin
auto it = results.begin ( );
const auto start = std::chrono::steady_clock::now ( );
for ( const Test test : plus_tests )
{
*it = test.num1 + test.num2;
++it;
}
for ( const Test test : minus_tests )
{
*it = test.num1 - test.num2;
++it;
}
for ( const Test test : divide_tests )
{
*it = test.num1 / test.num2;
++it;
}
for ( const Test test : multiply_tests )
{
*it = test.num1 * test.num2;
++it;
}
const auto end = std::chrono::steady_clock::now ( );
//benchmark end
const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
std::cout << "SEPARATE DATA INTO DIFFERENT ARRAYS" << std::endl;
std::cout << "Time: " << nanoseconds_taken << std::endl;
std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}
int main ( )
{
//loop_test ( );
do_branching_on_resolution ( );
//sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( );
system ( "pause" );
return 0;
}