Some benchmarking tests

cyboryxmen
Posts: 190
Joined: November 14th, 2014, 2:03 am

Some benchmarking tests

Post by cyboryxmen » February 8th, 2017, 6:56 am

Since I've been planning to refactor my code to be more and more SIMD friendly, I wanted a guideline for how I should code from now on. To that end, I made a bunch of benchmarking tests to get clearer about certain aspects of coding. I'd like to share my findings with you and see what you guys can make of them.

Looping by index rather than with iterators

When coding with SIMD, you're going to loop through a lot of arrays at once. This proved easier to write with indices: one index can reference the corresponding element of every array. Using iterators would require you to declare the extra iterators outside the for loop if they are of different types, and you need to increment each iterator individually, which adds a lot of clutter to your code. So just how inefficient can looping by index be, compared to looping with iterators, that I should have to live with that clutter?

As it turns out: not much, really. I created two tests, one looping through 5 vectors with indices and the other with iterators. I ran them multiple times and they ended up with very similar numbers; if there was a difference, it was too minuscule to notice. From my Assembly knowledge, the cpu is optimised for various memory access methods including [start_of_memory_segment + displacement]. This is really useful for OSs that use a lot of virtual memory for memory management and most don't use actual physical addresses all that much to access memory. So really, accessing memory through indices is how memory is accessed by default.

Typing that out really does prove how much easier and cleaner looping through indices is. When writing the iterator version of the for loop, I ended up with an access violation, and it took me a long time to find that I had accidentally incremented an iterator twice. That loop was painful to write, ugly to look at and easy to make mistakes in.
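To show what I mean, here are the two shapes side by side (a trimmed-down sketch with made-up vectors a, b and out; the full tests are at the bottom of this post):

Code: Select all

	// Index version: one counter drives every array.
	for ( std::size_t i = 0; i < size; ++i )
	{
		out [ i ] = a [ i ] + b [ i ];
	}

	// Iterator version: one iterator per array, each declared outside the
	// loop and incremented by hand.
	auto a_it = a.begin ( );
	auto b_it = b.begin ( );
	for ( auto out_it = out.begin ( ), end = out.end ( ); out_it != end; ++a_it, ++b_it, ++out_it )
	{
		*out_it = *a_it + *b_it;
	}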

Doing branching in a separate for loop

If statements and SIMD don't mix very well. To get SIMD to work, you need a group of data that will all run the same set of instructions. Branching constructs like if statements and switch statements (and logical operators too, though SIMD has its own set of instructions for those) break that. As a result, I wanted to test just how inefficient it is to separate the branching from the for loop that does the calculation. That requires two loops: the first actually does the calculation, and the second makes a decision based on the result. It also requires a container to store the results of those calculations so you can evaluate them. This way, SIMD can be used for the calculation loop, and the results can then be evaluated normally.

The tests use a simple sphere collision check that stores the number 69 in an output array if the spheres touch and 420 if they don't. SIMD optimisation was not used since I wanted the test to be fair; I needed to see how the split performs without any other optimisations.

In this benchmark, the results are close but the difference is noticeable. It isn't big, but it's there and you can see it if you look for it. You're doing exactly the same work, but looping twice instead of once, and storing those intermediate calculations in extra containers adds some overhead too. I would imagine this becomes worth it once you implement SIMD, but it's not something you'd want to do otherwise.
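Distilled down, the split looks like this (a sketch of the shape only, assuming the pair data has been flattened into plain arrays like in Test2 below; positions_a, radii_a and friends are made-up names):

Code: Select all

	// Pass 1: pure arithmetic, no branches -- the loop SIMD could chew through.
	for ( std::size_t i = 0; i < size; ++i )
	{
		distances [ i ] = ( positions_b [ i ] - positions_a [ i ] ).LengthSquared ( );
		boundaries [ i ] = radii_a [ i ] * radii_a [ i ] + radii_b [ i ] * radii_b [ i ];
	}

	// Pass 2: the branching, done afterwards on the stored results.
	for ( std::size_t i = 0; i < size; ++i )
	{
		results [ i ] = distances [ i ] < boundaries [ i ] ? 69 : 420;
	}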

Separating your data into different arrays based on what you want to do with them

Continuing our quest to eliminate branching for SIMD friendly code, I tested the difference in performance when you separate your data into different arrays based on what you want to do with them. This is an optimisation professional programmers advocate even when optimising normal code. It also relates to the problem with heterogeneous arrays (both polymorphism proper and C-style polymorphism with unions and function pointers): when evaluating elements of a heterogeneous array, you have to do last-minute type checking, with an enum or a virtual function etc., before you use them. This can be avoided entirely by using homogeneous arrays, where you already know the type when you evaluate the elements. Separating them further into different arrays based on the operations you will perform on them optimises your code even more. So instead of...

Code: Select all

	for ( auto& archer : archers )
	{
		switch ( archer.state )
		{
			case ArcherState::RETREAT:
			{
				//Go back to base
				break;
			}
			case ArcherState::HUNT:
			{
				//find something to hunt
				break;
			}
		}
	}
...you do this:

Code: Select all

	for ( auto& archer : hunting_archers )
	{
		//find something to hunt
	}

	for ( auto& archer : retreating_archers )
	{
		//Go back to base
	}
Plus, this means objects that will run the same set of instructions are grouped together in one array, making them very SIMD friendly. But what use is a theory if we don't test it? After all, the previous test showed that doing more loops hurts performance.

The professionals were right: the code that separates data by the instructions it will run performs nearly 12 times faster! Getting rid of all that unnecessary branching was worth the extra for loops.
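One thing the snippets above leave out is the bookkeeping when an archer changes state. A minimal sketch of how that could work (Archer and change_state are made-up names; this assumes order within each vector doesn't matter):

Code: Select all

#include <cstddef>
#include <utility>
#include <vector>

struct Archer { /* position, target, etc. */ };	// hypothetical stand-in

// Move the archer at 'index' from one state vector to another.
// Swap-and-pop makes the removal O(1), at the cost of not preserving order.
void change_state ( std::vector<Archer>& from, const std::size_t index, std::vector<Archer>& to )
{
	to.push_back ( from [ index ] );
	std::swap ( from [ index ], from.back ( ) );
	from.pop_back ( );
}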

Here is the code I used to benchmark. You'll have to run each test multiple times since the numbers vary from run to run; record the outputs and see which version tends to give lower results.

Test1

Code: Select all

#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
#include <chrono>
#include <random>
#include <memory>
#include <algorithm>
#include <limits>
#include <stdexcept>
#include <cstdlib>

void loop_test ( )
{
	std::random_device seed { };
	std::mt19937 engine { seed ( ) };
	std::uniform_real_distribution<long double> rand { 1, 10000 };

	const std::size_t size { 10000000 };

	std::vector<int> container1;
	std::vector<int> container2;
	std::vector<int> container3;
	std::vector<int> container4;
	std::vector<int> container5;
	container1.resize ( size );
	container2.resize ( size );
	container3.resize ( size );
	container4.resize ( size );
	container5.resize ( size );

	for ( std::size_t index = 0; index < size; ++index )
	{
		container1 [ index ] = rand ( engine );
		container2 [ index ] = rand ( engine );
		container3 [ index ] = rand ( engine );
		container4 [ index ] = rand ( engine );
		container5 [ index ] = rand ( engine );
	}

	int total { };

	auto it1 = container1.begin ( );
	auto it2 = container2.begin ( );
	auto it3 = container3.begin ( );
	auto it4 = container4.begin ( );
	//benchmark begin
	const auto start = std::chrono::steady_clock::now ( );
	for ( auto it5 = container5.begin ( ), end = container5.end ( ); it5 != end; ++it1, ++it2, ++it3, ++it4, ++it5 )
	{
		total += *it1 + *it2 + *it3 + *it4 + *it5;
	}
	const auto end = std::chrono::steady_clock::now ( );
	//benchmark end

	const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
	std::cout << "LOOP THROUGH ITERATORS" << std::endl;
	std::cout << "Total: " << total << std::endl;
	std::cout << "Time: " << nanoseconds_taken << std::endl;
	std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}

struct Vector3 final
{
	Vector3 ( ) = default;
	Vector3 ( const float x, const float y, const float z ) : x ( x ), y ( y ), z ( z )
	{
	}

	Vector3 operator-( const Vector3 rhs ) const
	{
		return Vector3 { x - rhs.x, y - rhs.y, z - rhs.z };
	}

	float LengthSquared ( ) const
	{
		return x * x + y * y + z * z;
	}

	float x { };
	float y { };
	float z { };
};

namespace CollisionBody
{
	struct Sphere final
	{
		Sphere ( ) = default;
		Sphere ( const Vector3 position, float radius ) : position ( position ), radius ( radius )
		{
		}

		Vector3 position { };
		float radius { };
	};
}

void do_branching_on_resolution ( )
{
	using namespace CollisionBody;

	std::random_device seed { };
	std::mt19937 engine { seed ( ) };
	std::uniform_real_distribution<float> rand_pos { 1.f, 100000.f };
	std::uniform_real_distribution<float> rand_radius { 1.f, 100.f };
	const std::size_t size { 10000 };

	std::vector<Sphere> spheres;
	spheres.resize ( size );

	for ( Sphere& sphere : spheres )
	{
		sphere.position = Vector3 { rand_pos ( engine ), rand_pos ( engine ), rand_pos ( engine ) };
		sphere.radius = rand_radius ( engine );
	}

	std::vector<int> results;
	results.resize ( ( size * size + size ) / 2 );

	auto result_it = results.begin ( );
	//benchmark begin
	const auto start = std::chrono::steady_clock::now ( );
	for ( auto it1 = spheres.begin ( ), end = spheres.end ( ); it1 != end; ++it1 )
	{
		for ( auto it2 = it1 + 1; it2 != end; ++it2, ++result_it )
		{
			const Vector3 position1 = it1->position;
			const float radius1 = it1->radius;

			const Vector3 position2 = it2->position;
			const float radius2 = it2->radius;

			if ( ( position2 - position1 ).LengthSquared ( ) < radius1 * radius1 + radius2 * radius2 )
			{
				*result_it = 69;
			}
			else
			{
				*result_it = 420;
			}
		}
	}
	const auto end = std::chrono::steady_clock::now ( );
	//benchmark end

	const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
	std::cout << "DO BRANCHING IN SAME LOOP" << std::endl;
	std::cout << "Time: " << nanoseconds_taken << std::endl;
	std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}

enum class Command
{
	MULTIPLY,
	DIVIDE,
	PLUS,
	MINUS
};

using TestType = int;

struct Test final
{
	Test ( ) = default;
	Test ( const TestType num1, const TestType num2, const Command command ) : num1 ( num1 ), num2 ( num2 ), command ( command )
	{
	}

	TestType num1 { };
	TestType num2 { };
	Command command { Command::PLUS };
};

void sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( )
{
	std::random_device seed { };
	std::mt19937 engine { seed ( ) };
	std::uniform_int_distribution<int> rand_command { 1, 4 };
	std::uniform_real_distribution<long double> rand { 1, 1000000 };

	const std::size_t size { 10000000 };

	std::vector<Test> tests;
	std::vector<TestType> results;
	tests.resize ( size );
	results.resize ( size );

	for ( Test& test : tests )
	{
		Command command;
		const int command_index { rand_command ( engine ) };

		switch ( command_index )
		{
			case 1:
			{
				command = Command::DIVIDE;
				break;
			}
			case 2:
			{
				command = Command::PLUS;
				break;
			}
			case 3:
			{
				command = Command::MINUS;
				break;
			}
			case 4:
			{
				command = Command::MULTIPLY;
				break;
			}
			default:
			{
				throw std::runtime_error ( "You missed one dumb dumb!" );
			}
		}

		test = Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ), command };
	}

	//benchmark begin
	auto it = results.begin ( );
	const auto start = std::chrono::steady_clock::now ( );
	for ( const Test test : tests )
	{
		switch ( test.command )
		{
			case Command::DIVIDE:
			{
				*it = test.num1 / test.num2;
				break;
			}
			case Command::MINUS:
			{
				*it = test.num1 - test.num2;
				break;
			}
			case Command::MULTIPLY:
			{
				*it = test.num1 * test.num2;
				break;
			}
			case Command::PLUS:
			{
				*it = test.num1 + test.num2;
				break;
			}
		}
		++it;
	}
	const auto end = std::chrono::steady_clock::now ( );
	//benchmark end

	const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
	std::cout << "STORE DATA IN ONE ARRAY" << std::endl;
	std::cout << "Time: " << nanoseconds_taken << std::endl;
	std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}

int main ( )
{
	//loop_test ( );
	do_branching_on_resolution ( );
	//sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( );

	system ( "pause" );
	return 0;
}
Test2

Code: Select all

#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
#include <chrono>
#include <random>
#include <memory>
#include <algorithm>
#include <limits>
#include <cstdlib>

void loop_test ( )
{
	std::random_device seed { };
	std::mt19937 engine { seed ( ) };
	std::uniform_real_distribution<long double> rand { 1, 10000 };

	const std::size_t size { 10000000 };

	std::vector<int> container1;
	std::vector<int> container2;
	std::vector<int> container3;
	std::vector<int> container4;
	std::vector<int> container5;
	container1.resize ( size );
	container2.resize ( size );
	container3.resize ( size );
	container4.resize ( size );
	container5.resize ( size );

	for ( std::size_t index = 0; index < size; ++index )
	{
		container1 [ index ] = rand ( engine );
		container2 [ index ] = rand ( engine );
		container3 [ index ] = rand ( engine );
		container4 [ index ] = rand ( engine );
		container5 [ index ] = rand ( engine );
	}

	int total { };

	//benchmark begin
	const auto start = std::chrono::steady_clock::now ( );
	for ( std::size_t index = 0; index < size; ++index )
	{
		total += container1 [ index ] + container2 [ index ] + container3 [ index ] + container4 [ index ] + container5 [ index ];
	}
	const auto end = std::chrono::steady_clock::now ( );
	//benchmark end

	const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
	std::cout << "LOOP THROUGH INDICES" << std::endl;
	std::cout << "Total: " << total << std::endl;
	std::cout << "Time: " << nanoseconds_taken << std::endl;
	std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}

struct Vector3 final
{
	Vector3 ( ) = default;
	Vector3 ( const float x, const float y, const float z ) : x ( x ), y ( y ), z ( z )
	{
	}

	Vector3 operator-( const Vector3 rhs ) const
	{
		return Vector3 { x - rhs.x, y - rhs.y, z - rhs.z };
	}

	float LengthSquared ( ) const
	{
		return x * x + y * y + z * z;
	}

	float x { };
	float y { };
	float z { };
};

namespace CollisionBody
{
	struct Sphere final
	{
		Sphere ( ) = default;
		Sphere ( const Vector3 position, float radius ) : position ( position ), radius ( radius )
		{
		}

		Vector3 position { };
		float radius { };
	};
}

void do_branching_on_resolution ( )
{
	using namespace CollisionBody;

	std::random_device seed { };
	std::mt19937 engine { seed ( ) };
	std::uniform_real_distribution<float> rand_pos { 1.f, 100000.f };
	std::uniform_real_distribution<float> rand_radius { 1.f, 100.f };
	const std::size_t size { 10000 };

	std::vector<Sphere> spheres;
	spheres.resize ( size );

	for ( Sphere& sphere : spheres )
	{
		sphere.position = Vector3 { rand_pos ( engine ), rand_pos ( engine ), rand_pos ( engine ) };
		sphere.radius = rand_radius ( engine );
	}

	std::vector<float> distances;
	std::vector<float> boundaries;
	std::vector<int> results;
	const std::size_t result_size = ( size * size + size ) / 2;
	distances.resize ( result_size );
	boundaries.resize ( result_size );
	results.resize ( result_size );

	auto distance_it = distances.begin ( );
	auto boundary_it = boundaries.begin ( );
	//benchmark begin
	const auto start = std::chrono::steady_clock::now ( );
	for ( auto it1 = spheres.begin ( ), end = spheres.end ( ); it1 != end; ++it1 )
	{
		for ( auto it2 = it1 + 1; it2 != end; ++it2, ++distance_it, ++boundary_it )
		{
			const Vector3 position1 = it1->position;
			const float radius1 = it1->radius;

			const Vector3 position2 = it2->position;
			const float radius2 = it2->radius;

			*distance_it = ( position2 - position1 ).LengthSquared ( );
			*boundary_it = radius1 * radius1 + radius2 * radius2;
		}
	}

	distance_it = distances.begin ( );
	boundary_it = boundaries.begin ( );
	for ( int& result : results )
	{
		if ( *distance_it < *boundary_it )
		{
			result = 69;
		}
		else
		{
			result = 420;
		}

		++distance_it;
		++boundary_it;
	}
	const auto end = std::chrono::steady_clock::now ( );
	//benchmark end

	const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
	std::cout << "DO BRANCHING IN SEPERATE LOOP" << std::endl;
	std::cout << "Time: " << nanoseconds_taken << std::endl;
	std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}

using TestType = int;

struct Test final
{
	Test ( ) = default;
	Test ( const TestType num1, const TestType num2 ) : num1 ( num1 ), num2 ( num2 )
	{
	}

	TestType num1 { };
	TestType num2 { };
};

void sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( )
{
	std::random_device seed { };
	std::mt19937 engine { seed ( ) };
	std::uniform_int_distribution<int> rand_command { 1, 4 };
	std::uniform_real_distribution<long double> rand { 1, 1000000 };

	const std::size_t size { 10000000 };

	std::vector<Test> plus_tests;
	std::vector<Test> minus_tests;
	std::vector<Test> divide_tests;
	std::vector<Test> multiply_tests;
	std::vector<TestType> results;
	results.resize( size );

	for ( std::size_t index = 0; index < size; ++index )
	{
		const int command_index { rand_command ( engine ) };

		switch ( command_index )
		{
			case 1:
			{
				plus_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
				break;
			}
			case 2:
			{
				minus_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
				break;
			}
			case 3:
			{
				divide_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
				break;
			}
			case 4:
			{
				multiply_tests.push_back ( Test { static_cast<TestType>( rand ( engine ) ), static_cast<TestType>( rand ( engine ) ) } );
				break;
			}
		}
	}

	//benchmark begin
	auto it = results.begin ( );
	const auto start = std::chrono::steady_clock::now ( );
	for ( const Test test : plus_tests )
	{
		*it = test.num1 + test.num2;
		++it;
	}
	for ( const Test test : minus_tests )
	{
		*it = test.num1 - test.num2;
		++it;
	}
	for ( const Test test : divide_tests )
	{
		*it = test.num1 / test.num2;
		++it;
	}
	for ( const Test test : multiply_tests )
	{
		*it = test.num1 * test.num2;
		++it;
	}
	const auto end = std::chrono::steady_clock::now ( );
	//benchmark end

	const auto nanoseconds_taken = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
	std::cout << "SEPARATE DATA INTO DIFFERENT ARRAYS" << std::endl;
	std::cout << "Time: " << nanoseconds_taken << std::endl;
	std::cout << "Average: " << nanoseconds_taken / size << std::endl;
}

int main ( )
{
	//loop_test ( );
	do_branching_on_resolution ( );
	//sort_your_data_into_arrays_based_on_what_you_want_to_do_with_them ( );

	system ( "pause" );
	return 0;
}

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Some benchmarking tests

Post by albinopapa » February 8th, 2017, 8:12 pm

Code: Select all

 LOOP THROUGH ITERATORS:          7.3 | LOOP THROUGH INDICES:                  6.8
 DO BRANCHING IN SAME LOOP:   24528.1 | DO BRANCHING IN SEPARATE LOOP:     34965.2
 STORE DATA IN ONE ARRAY:         9.8 | SEPARATE DATA INTO DIFFERENT ARRAYS:   5.1
Here are my results; interesting to me.

The iterator vs indices comparison I've done before, and I came to the same conclusion: really no difference, and since indices are a much more elegant way of coding, that's what I stick with.

The branching thing I hadn't heard of before. As for the test itself, if you are only worried about which value to take, 69 or 420, then you can use bit masks to flatten the if statements, both in SIMD and x86. If this was supposed to represent true branching, like calling another function, then yeah, SIMD doesn't do that... but it has a _mm_movemask_??? that returns an int. You can build a condition mask and use the result of the movemask function in if statements.
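For the 69/420 case, the mask trick could look something like this (a sketch using SSE2 intrinsics, assuming the flattened distance/boundary arrays from Test2 and a count that's a multiple of 4):

Code: Select all

#include <cstddef>
#include <emmintrin.h>

// Branch-free select over 4 pairs at a time: 69 where distance < boundary,
// 420 otherwise. The compare produces an all-ones lane where true, so the
// and/andnot/or dance picks a value per lane with no jumps anywhere.
void select_results ( const float* distances, const float* boundaries, int* results, const std::size_t size )
{
	const __m128 touch = _mm_set1_ps ( 69.f );
	const __m128 miss = _mm_set1_ps ( 420.f );
	for ( std::size_t i = 0; i < size; i += 4 )
	{
		const __m128 d = _mm_loadu_ps ( distances + i );
		const __m128 b = _mm_loadu_ps ( boundaries + i );
		const __m128 mask = _mm_cmplt_ps ( d, b );
		const __m128 selected = _mm_or_ps ( _mm_and_ps ( mask, touch ), _mm_andnot_ps ( mask, miss ) );
		_mm_storeu_si128 ( reinterpret_cast< __m128i* >( results + i ), _mm_cvtps_epi32 ( selected ) );
	}
}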

The last one I hadn't heard of either. I wonder what the result would be if you used function pointers, lambdas or the like instead of a switch statement.
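Something like this, maybe (a guess at the shape, reusing Test, TestType and Command from the tests above; whether the indirect call actually beats the switch is exactly what would need measuring):

Code: Select all

using Op = TestType ( * ) ( TestType, TestType );

// One entry per Command value, in the enum's declared order:
// MULTIPLY, DIVIDE, PLUS, MINUS.
const Op ops [ ] =
{
	[ ] ( TestType a, TestType b ) { return a * b; },
	[ ] ( TestType a, TestType b ) { return a / b; },
	[ ] ( TestType a, TestType b ) { return a + b; },
	[ ] ( TestType a, TestType b ) { return a - b; }
};

// Replaces the switch: one indirect call per element instead of a branch.
for ( const Test test : tests )
{
	*it = ops [ static_cast<std::size_t>( test.command ) ] ( test.num1, test.num2 );
	++it;
}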

reductor
Posts: 49
Joined: October 24th, 2016, 12:23 pm

Re: Some benchmarking tests

Post by reductor » February 8th, 2017, 9:09 pm

These threads on performance are becoming quite concerning, because a lot of them are wrong, incorrectly tested, or not accounting for 'real data'. I suggest you avoid going down this optimization route until you actually hit a problem; otherwise you're going to end up with some messy code that will likely be the cause of your slowness.
Looping by index rather than with iterators
The compiler is generally able to detect that you're iterating and generate the code it needs for that platform. I suggest not even thinking about this; focus on code readability instead.
Doing branching in a separate for loop
Your example is flawed: the condition can be optimized by the compiler into a conditional move. This depends on the compiler; using a ternary operator will do it for MSVC. You're also completely discarding the cost of allocating your multiple arrays, and the fragmentation caused by allocating them.

What works in one situation will not necessarily work in others. I highly suggest against going in with the assumption that what is fast in one situation will be fast in another; it depends on the data set (size, locality, etc).
SIMD optimisation was not used since I wanted the test to be fair.
This is completely flawed: you want optimization, but you want to ignore SIMD? It's easy enough to get this to auto-vectorize, even in MSVC:

Code: Select all

struct Vector3 final
{
   Vector3 ( ) = default;
   Vector3 ( const float x, const float y, const float z, float w = 1 ) : x ( x ), y ( y ), z ( z ), w ( w )
   {
   }

   Vector3 operator-( const Vector3 & rhs ) const
   {
      return Vector3 { x - rhs.x, y - rhs.y, z - rhs.z, w - rhs.w };
   }

   float LengthSquared ( ) const
   {
      return x * x + y * y + z * z + w * w;
   }

   float x { };
   float y { };
   float z { };
   float w = 0.f;
};
Separating your data into different arrays based on what you want to do with them
A simple std::sort will bring this into line and eliminate the difference (branch prediction takes care of the rest)

Code: Select all

   std::sort(tests.begin(), tests.end(), [](const Test & t1, const Test & t2) {
	   return t1.command < t2.command;
   });
Then, with this, your multiple-array version becomes the worse one, because you're allocating extra memory and likely causing extra fragmentation.

Some comments on your coding style:
  • You're missing optimizations with const &; there is a lot of copying going on
  • You're littering your code with 'final' where it's completely unnecessary (I'm guessing you read somewhere that it makes things faster; that relates to devirtualization, not what you're doing)
  • You're doing weird stuff like using a uniform_real_distribution<long double> whose result then gets converted to an int. Don't do this; float<->int conversions are expensive, and this should be a uniform_int_distribution<int> -- when you see those casts, think of the cost of them
  • All of your benchmarks ignore the side effects (extra allocations, data locality, etc); they are micro-benchmarks. Avoid applying these assumptions to anything but your specific examples
And of course, premature optimization is the root of all evil:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%
EDIT: Unless you're working on real production code, don't worry about optimization until you hit an actual performance issue, and then do profiling to work out what you need to change. My guess is you're not working on actual production code, so don't even worry about it. Focus on learning to write clean and readable code instead.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Some benchmarking tests

Post by albinopapa » February 8th, 2017, 10:07 pm

Yeah, I think he was trying to find a route ahead of time in order to reduce the amount of refactoring later. I understand where you're coming from, reductor, but it's hard not to want to write generally more optimized code upfront as opposed to write, test, refactor.

I agree about the test being flawed; I've noticed this in my own trials. I may be able to whiz through 10,000,000 to 40,000,000 operations on data arrays in these tests in 16 ms, but when it comes to actually making a program and using it, I really don't think I get half that by the time I'm through.

I noticed a few places where he doesn't use const &, but it's the same for both tests, so the results should be somewhat comparable. I messed around with the second test: moving the it1 stuff out of the inner loop and copying by value, while taking refs to the it2 members in the inner loop, produced faster times, even faster than making both const & inside the inner loop.

Code: Select all

	for( auto it1 = spheres.begin(), end = spheres.end(); it1 != end; ++it1 )
	{
		const Vector3 position1 = it1->position;
		const float radius1 = it1->radius;

		for( auto it2 = it1 + 1; it2 != end; ++it2, ++result_it )
		{
			const Vector3 &position2 = it2->position;
			const float &radius2 = it2->radius;

			if( ( position2 - position1 ).LengthSquared() < radius1 * radius1 + radius2 * radius2 )
			{
				*result_it = 69;
			}
			else
			{
				*result_it = 420;
			}
		}
	}
It's really difficult to correctly profile your actual project. I've tried before, when I was experimenting with SIMD, and all I got from it was that the functions I thought would take the longest took the longest. I did find one surprising result, though: it was faster to do a single for loop with all the conditions inside it than to write multiple conditions each with its own loop:

Code: Select all

// Faster in my tests
for(;;)
{
     if()
     {}
     else if ()
     {}
     else
     {}
}
// than
if()
{
     for(;;)
     {}
}
else if()
{
     for(;;)
     {}
}
else
{
     for(;;)
     {}
}
Not sure if this was because of branch prediction or because the jumps would cause stalls looking for the next set of instructions, or who knows what.

On that same project, I was doing some software lighting on sprites using SSE. The sprites were 320x320 and, depending on their location relative to the camera, could end up as 1x320 or 320x1. I tried two different ways of handling this. One way was to render the sprites and paste to the back buffer, using x86 code when the sprite width was less than 4. The other way was to draw the sprites normally using putpixel, with another function to copy over the normals for the sprite; then I would just blast through the system buffer calculating the lighting all in one go and not have to worry about sprite widths. I was getting about 25 fps with option 1 and 18 fps with option 2. I was completely confused by the results.

I'd be interested to see some of your code, reductor; you seem quite adamant about code readability vs optimization. What would a good performance test look like?
What factors need to go into a performance test?
Do you not believe in testing for performance, so you can get an idea of what might be slowing your code down "in production code"?

reductor
Posts: 49
Joined: October 24th, 2016, 12:23 pm

Re: Some benchmarking tests

Post by reductor » February 8th, 2017, 11:05 pm

it's hard not to want to write generally more optimized code upfront as opposed to write, test, refactor.
It's easy enough for this to end up going down the road of harder-to-read code; I find a lot of people who focus on optimization completely neglect readability. This isn't just new people, but professionals also.
Not sure if this was because of branch prediction or because the jumps would cause stalls looking for the next set of instructions, or who knows what.
A lot of this depends on the actual data and the contents of those loops and conditions, hence why I keep reiterating that you cannot make optimization decisions in a design without first profiling and understanding what the issues actually are.
I'd be interested to see some of your code, reductor; you seem quite adamant about code readability vs optimization.
Most of the code I write is done professionally; I rarely do open source stuff. You can find some things on my github, but a lot of what I write isn't available online.
What would a good performance test look like? What factors need to go into a performance test?
This all depends on what you're testing. Sometimes you want to focus on system-wide performance tests; in those situations you want a real-life scenario simulated in some consistent way (e.g. a timedemo in Quake-based engines). In other situations you might have identified a method that is costly, so you create a micro-benchmark that feeds it data similar to the real usage, to focus specifically on that one method.

Some of the key things to focus on when optimizing are:
  • Knowing the real data (random data is not real data; it does not match anything like real data)
  • Reproducible results (eliminate noise -- again, reduce random data; use defined data sets)
  • If you resort to using random data, have a constant seed (see the sketch after this list)
  • Profile before and, importantly, after optimization
  • Any performance optimization needs a before and after result (I've seen people optimize things saying they are faster when instead they made them slower -- and this is from professionals)
  • Be aware of side effects that could negate your micro-benchmarks (e.g. ignoring warm-up costs, etc)
  • Always do multiple iterations (I find Google's benchmark framework is good for this)
  • Don't believe everything you see on the internet relating to performance; some things are written without considering the above, and some things are written by professionals, to be read by professionals, who are aware that it's an isolated benchmark
There are many more that I could list.
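For the constant-seed and multiple-iterations points, a bare-bones harness might look like this (a sketch only; a real framework like Google's benchmark library handles warm-up and statistics for you):

Code: Select all

#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main ( )
{
	constexpr unsigned fixed_seed = 12345;	// constant seed: identical data every run
	constexpr int iterations = 10;	// never trust a single sample

	long long best_ns = -1;
	for ( int run = 0; run < iterations; ++run )
	{
		// Rebuild identical input each iteration, outside the timed region.
		std::mt19937 engine { fixed_seed };
		std::uniform_int_distribution<int> dist { 1, 1000 };
		std::vector<int> input ( 10000000 );
		for ( int& value : input )
		{
			value = dist ( engine );
		}

		const auto start = std::chrono::steady_clock::now ( );
		const long long total = std::accumulate ( input.begin ( ), input.end ( ), 0ll );	// stand-in for the work under test
		const auto end = std::chrono::steady_clock::now ( );

		const auto ns = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start ).count ( );
		std::cout << "Run " << run << ": " << ns << " ns (total " << total << ")" << std::endl;
		if ( best_ns < 0 || ns < best_ns )
		{
			best_ns = ns;	// keep the least-noisy sample
		}
	}
	std::cout << "Best: " << best_ns << " ns" << std::endl;
}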
Do you not believe in testing for performance, so you can get an idea of what might be slowing your code down "in production code"?
You should do performance testing when performance is an issue; you find that out by tracking things like frame time, having some form of hitch detection, etc. You could spend your entire time optimizing some effect on an entity, when that entity only appears once and the one occurrence has no noticeable impact.
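A bare-bones version of the frame-time/hitch idea (hypothetical thresholds; a real system would record percentiles over time rather than print):

Code: Select all

#include <chrono>
#include <iostream>

// Call on_frame ( ) once per frame; it flags any frame that blows
// well past the ~16.6 ms budget of 60 fps.
struct HitchDetector
{
	std::chrono::steady_clock::time_point last { std::chrono::steady_clock::now ( ) };

	void on_frame ( )
	{
		const auto now = std::chrono::steady_clock::now ( );
		const auto ms = std::chrono::duration_cast< std::chrono::milliseconds >( now - last ).count ( );
		last = now;
		if ( ms > 33 )	// roughly two missed frames at 60 fps
		{
			std::cout << "Hitch: frame took " << ms << " ms" << std::endl;
		}
	}
};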

I've worked with many different people (professionals) who try to focus on performance first; a lot of the time they end up getting it wrong and have some awful code.

The general order in which I think you should focus on things is:
1. Readability -- makes it easier to understand and debug your own code, and to change it when you need to
2. Correctness -- also makes your code easier to understand and reason about; this is about having things work correctly -- testing is important here
3. Documenting -- while self-documenting code is best, if something needs an explanation then make sure to include it
4. Performance -- when you're hitting issues and need more time in your frame, then optimize; the above steps will have just made this much easier, and when optimizing try not to compromise them

#1 and #2 are very close, and sometimes the other way around, depending on the situation (e.g. writing some library code you might want it to be correct at the sacrifice of some readability -- it should be extremely rare that you sacrifice readability).

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Some benchmarking tests

Post by albinopapa » February 9th, 2017, 12:36 am

Code: Select all

 LOOP THROUGH ITERATORS:          4.1 | LOOP THROUGH INDICES:                    7
 DO BRANCHING IN SAME LOOP:   97153.9 | DO BRANCHING IN SEPARATE LOOP:      192940
 STORE DATA IN ONE ARRAY:       191.8 | SEPARATE DATA INTO DIFFERENT ARRAYS: 115.7
This may not mean much after our discussion, but I kept all the stuff that would "in the real world" already be allocated out of the timed portion of the code, and moved the stuff that would be created and destroyed (or at least cleared) into the timed portion.

This is an average of 10 iterations of the tests, so that covers two of the scenarios you mentioned: do multiple iterations, and don't ignore the cost of setup.


I agree, setting up code to "be efficient" is a pain in the ass and gets unreadable very quickly. I'm probably going to stick to relying on frame rates in the projects I'm actually working on, for the most part.

The things I do in the name of efficiency are:
  • const & for parameters
  • const auto for almost all my vars...if I'm feeling lazy. Mostly const whatever.
  • if I have nested loops, calculate whatever I can outside the loops, then whatever I can in the outer loop; everything else that can't be precalculated goes in the inner loop.
  • ternary operator for value selection and const: const int val = condition ? 69 : 420;
I think that about covers all the things I do while developing. I don't go out of my way to do structure of arrays, for instance.

Thanks for the feedback reductor.

chili
Site Admin
Posts: 3948
Joined: December 31st, 2011, 4:53 pm
Location: Japan

Re: Some benchmarking tests

Post by chili » February 9th, 2017, 2:28 am

I don't really have much to say about this, but I wanted to point out one objective error.
cyboryxmen wrote:From my Assembly knowledge, the cpu is optimised for various memory access methods including [start_of_memory_segment + displacement]. This is really useful for OSs that use a lot of virtual memory for memory management and most don't use actual physical addresses all that much to access memory. So really, accessing memory through indices is how memory is accessed by default.
Different memory addressing modes carry with them different overheads, and indexed addressing is not the 'default'. It carries with it a small overhead of offset calculation (about 20% for L1 data cache reads, see this: http://www.7-cpu.com/cpu/Haswell.html). This has nothing to do with virtual memory.

As reductor says, the compiler will often see that you have a loop indexing sequential memory addresses and replace it with equivalent pointer incrementation. I have caught it failing at this sometimes, but compilers in general, and on average, will only get better as time progresses.

On the other hand, this latency penalty of 1 cycle will only be apparent in very tight loops where the total number of cycles per iteration is low, and it will only come into play where most of the memory accesses are L1 hits (typically linear strided access, or random access in a very small memory block that already resides in L1).

So if all of these things align, you can theoretically hope for a performance increase of AT MOST 20% (never that much in reality though, because there are going to be other instructions for the loop etc.), probably more like 10% in the tightest loops and less than 2% in a loop that actually does a bit of work.

So at the end of the day, it will seldom give you a measurable boost, and even when it does, that boost will hardly be worth it if you had to sacrifice any code quality to achieve it. Just focus on writing good code and forget about micro-optimizing index vs. pointer incrementing.

EDIT

Check out this stuff: https://godbolt.org/g/bBrgrJ

First of all, both implementations are properly vectorized, which warms the cockles of my heart. Second, notice that the indexed version is recognized by the compiler and turned into what is basically pointer incrementation, while the pointer version seems to have been compiled in such a way that it calculates an offset every iteration. Just goes to show that it is never wise to make assumptions about what the compiler will generate, and also that the more straightforward approach here actually allowed the compiler to optimize better than if we tried to force the issue.
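For reference, the comparison is between roughly this sort of pair (a sketch, not necessarily the exact code behind the link):

Code: Select all

// Index version: the compiler is free to turn this into pointer
// increments on its own.
int sum_by_index ( const int* data, const std::size_t size )
{
	int total = 0;
	for ( std::size_t i = 0; i < size; ++i )
	{
		total += data [ i ];
	}
	return total;
}

// Pointer version: hand-written incrementation, which the compiler is
// equally free to transform back into whatever it prefers.
int sum_by_pointer ( const int* data, const std::size_t size )
{
	int total = 0;
	for ( const int* p = data, *end = data + size; p != end; ++p )
	{
		total += *p;
	}
	return total;
}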

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Some benchmarking tests

Post by albinopapa » February 9th, 2017, 3:30 am

Luckily, he realizes that using iterators is far messier and more error-prone than using the simpler indices; it kind of says so in his post. I have personally made my code slower trying to use iterators instead of indices when the access pattern isn't completely linear.

For instance, when doing some image processing, I had an x_iterator and that ran fine, but when I tried adding a y_iterator from which each row's x_iterator started, it actually slowed the code down by a few frames per second.

I'm glad he posted this stuff though, because I've always wondered about making separate arrays/vectors for specific math ops like add/sub/mul/div. The part I think would be most difficult is determining the order of the results and which objects they belong to. So aside from not having figured that out, not being able to reason about how the code works is the other thing that has stopped me from using it. Doesn't mean I'm not going to try it one day, though.

cyboryxmen
Posts: 190
Joined: November 14th, 2014, 2:03 am

Re: Some benchmarking tests

Post by cyboryxmen » February 9th, 2017, 3:05 pm

I actually put a huge focus on readable and easy-to-type code when I program, regardless of what my recent posts might make it seem. The reason I did all these tests in the first place is to gain confidence in writing more readable code, rather than worrying that I'm doing something that other, more experienced programmers would yell at me for.

I always feel like I'm in the dark when I'm coding. I keep getting yelled at by people more experienced than me to follow certain practices, but I'm given no real examples of how to implement them. Textbooks certainly don't help: the examples they give, while really good at demonstrating the principles they advocate, don't resemble the actual code you will end up writing. Really, I'm just throwing shit at the wall until I find something that sticks.
reductor wrote: Some comments on your coding style:
  • You're missing optimizations with const &; there is a lot of copying going on
  • You're littering your code with 'final' where it's completely unnecessary (I'm guessing you read somewhere that it makes things faster; that relates to devirtualization, not what you're doing)
  • You're doing weird stuff like using a uniform_real_distribution<long double> whose result then gets converted to an int. Don't do this; float<->int conversions are expensive, and this should be a uniform_int_distribution<int> -- when you see those casts, think of the cost of them
  • All of your benchmarks ignore the side effects (extra allocations, data locality, etc); they are micro-benchmarks. Avoid applying these assumptions to anything but your specific examples
  • The copying is mostly in preparation for making more thread-safe code. Again, I'm just throwing shit at the wall and have no idea what I'm really supposed to write for better thread safety. If you have any input on this, I'd like to know whether it would be worth it.
  • I kinda have a standard where any type that does not have a virtual destructor becomes final. I don't want anyone to inherit from a class like that and cause undefined behaviour because of the missing virtual destructor.
  • That's mostly out of laziness. I wanted to make fast switches between testing with floats and testing with ints using aliases, but uniform_real_distribution<> does not accept ints. The casting wasn't included in the timed portion, so I considered it the lesser evil.
  • I'm actually going to make a better benchmark that takes that into account. I'm learning to make better benchmarks, and I'll try to use more realistic code in the future.
I actually like doing all these tests in my spare time. Every time I do one, I can feel the darkness creeping away as I become more familiar with the strengths and weaknesses of certain coding decisions. More often than not, it'll make me a better programmer who writes more readable code. And if anyone ever tells me to use iterators over indices again, I'll be wise enough not to listen to their hollow advice.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Some benchmarking tests

Post by albinopapa » February 9th, 2017, 7:40 pm

In this specific case, you aren't writing to the spheres vector, so not copying the spheres would still be thread safe. The results vector you do write to, but you can split your loops up so that each thread has its own range to work on. Some have said not to do it like this because some threads might finish before others, so it has been suggested to make work packets instead. I don't know if this would go down the same allocation road the rest of your tests do, but it seems like it would.

I suppose if you preallocated an amount and then processed the packets once the buffer is full, or there are no more jobs to add, you wouldn't be constantly allocating and deallocating.
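A minimal sketch of the range-splitting version (std::thread, with made-up arrays matching Test2's flattened data; the work-packet refinement is left out):

Code: Select all

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each thread writes its own disjoint slice of results: no sharing, no locks.
void process_range ( const std::vector<float>& distances, const std::vector<float>& boundaries,
	std::vector<int>& results, const std::size_t begin, const std::size_t end )
{
	for ( std::size_t i = begin; i < end; ++i )
	{
		results [ i ] = distances [ i ] < boundaries [ i ] ? 69 : 420;
	}
}

void process_all ( const std::vector<float>& distances, const std::vector<float>& boundaries, std::vector<int>& results )
{
	const std::size_t thread_count = std::max<std::size_t> ( 1, std::thread::hardware_concurrency ( ) );
	const std::size_t chunk = results.size ( ) / thread_count;

	std::vector<std::thread> threads;
	for ( std::size_t t = 0; t < thread_count; ++t )
	{
		const std::size_t begin = t * chunk;
		const std::size_t end = ( t + 1 == thread_count ) ? results.size ( ) : begin + chunk;
		threads.emplace_back ( process_range, std::cref ( distances ), std::cref ( boundaries ), std::ref ( results ), begin, end );
	}
	for ( std::thread& thread : threads )
	{
		thread.join ( );
	}
}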

Yeah man, reductor, we need multi-threading lessons lol. From setup to usage to best practices: how to avoid data races and overhead due to context switching.
