Optimization puzzle

cyboryxmen · Post by **cyboryxmen** » October 30th, 2017, 1:31 pm

So here's an example some of you would've run into at one point. Say you have a bunch of planes. These planes can shoot missiles at each other. The missiles themselves are homing meaning that it can track its target and move towards it. For some of us, we would represent their data structure as such:

Code: Select all

struct Body
{
	Vector position;
	Vector velocity;
};
struct Plane : public Body
{
};
struct Missile : public Body
{
	const Body* target;
};

We would've realistically used classes with private variables but this is just an example. As for missile, we would make an update() function that looks like this:

Code: Select all

void Update ( const float time )
{
	const auto target_position = target->position;
	const auto relative_position_to_target = target_position - position;
	const auto direction_to_target = relative_position_to_target.Normalized ( );

	velocity = direction_to_target * speed;
	position += velocity * time;
}

...and then loop through a container of missiles like this:

Code: Select all

for ( auto missile_it = missiles.begin ( ), end = missiles.end ( ); missile_it != end; ++missile_it )
{
	missile_it->Update ( delta_time );
}

Instead of doing that however, you can use these:

Code: Select all

void UpdateVelocity ( )
{
	const auto target_position = target->position;
	const auto relative_position_to_target = target_position - position;
	const auto direction_to_target = relative_position_to_target.Normalized ( );
	velocity = direction_to_target * speed;
}

void UpdatePosition ( const float time )
{
	position += velocity * time;
}

...and loop as so:

Code: Select all

for ( auto missile_it = missiles.begin ( ), end = missiles.end ( ); missile_it != end; ++missile_it )
{
	missile_it->UpdateVelocity ( );
}
for ( auto missile_it = missiles.begin ( ), end = missiles.end ( ); missile_it != end; ++missile_it )
{
	missile_it->UpdatePosition ( delta_time );
}

Doing it this way would seem foolish. You're looping through all the missiles again just so you can needlessly split the Update() into two operations. This way would've surely run slower...or at least that's what some of you would believe. In actuality, the second version is faster. With the Visual C++ compiler on an x86 build with 1000 planes and 1000 missiles, The second version would be faster by 5 milliseconds( you can reduce it to less than 1 on x64 ). The difference is very slight but it is there. Plus, the fast that the second version isn't slower let alone faster than the first is perplexing.

I started this experiment to test out what I read online on the cpu's i/o capability. You see, cpus are getting faster and faster nowadays. They're so fast that reading and writing to memory is actually incredibly slow in comparison. Today, optimization doesn't focus on making efficient algorithms but rather efficient ways to do i/o.

OoO(Out of Order) processors have one such way of optimizing for this. These processors take advantage of the fact that reads do not depend on each other to run. You do not need to know of the value of target->position to start reading your own position. Since these instructions do not rely on each other finishing to work, the reads can all be done all at at the same time. Assuming that your cpu is capable of having [N] outstanding reads, you can read [N] number of variables at once for the entire loop! Furthermore, the reads can also be processed in OoO(Out of Order) where the read that finishes first(because it happened to have already been read before/is stored closer to the cpu) gets processed first instead of waiting for the reads that come before to finish.

Writes on the other hand are different. Unlike reads, writes can have dependencies on the previous writes to work. In this case, write to position must wait until write to velocity is done. The cpu can't do multiple writes at once as easily as reads and pretty much has to do them in order.

By having two loops, first for updating velocity and second for updating position, you prevent the writes from blocking each other. By the time you need to write to position in the second loop, the write to velocity in the first loop would have already finished. Therefore, the write to position can proceed pretty much immediately. In comparison, the first version demands that the write to velocity gets completed immediately instead of allowing it to take its time.

Course, all of this is just theory. For all we know, it could also be specific to my system itself. Thus, I am giving all of you the repository containing the test itself. I added a few macros that allows you to customize the test. The test is currently configured to run at maximum efficiency. No matter how you customize it though, it seems like the two loop solution is the most optimal version. I am very interested in the results you get on your own systems compared to mine.

Post by **albinopapa** » October 30th, 2017, 8:51 pm

I have been thinking about this a bit recently. Normally, I would want to cache computation results so as not to cause a recalculation and just grab the result from registers or cache, but I've been wondering if it wouldn't be a better option to do the calculations again so that I wouldn't be waiting for the result of the first to be able to reuse it on another line.

Code: Select all

Approach 1:
int a = 2;
int b = 5;
int c = a + b;

unit[0].doSomething( c );
unit[1].doSomething( c );

Approach 2:
unit[0].doSomething( a + b );
unit[1].doSomething( a + b );

Greatly simplifying the need here, but I just can't break the habit long enough to test this out. In the first example, both functions have to wait for 'c' to be evaluated before being called. In the second, they can be called out of order. Probably a bad example since I'm not sure if function calls can be optimized in such a way unless they are inlined.

I believe the concept is called pipelining where instructions are staggered, first a load, then decode and while the decode is happening there is a load. When the decode passes the instruction on to say an adder, the second load passes it's data to the decoder. This staggering can only happen as long as there are no dependencies on the previous data. If there is a dependency on the result of one operation, then the pipeline is empty and waiting for the operation to complete before moving on to the next one.

One thing that is useful for programmers is to use the const keyword for as many things as possible as this helps the compiler optimize the code so that perhaps a variable won't be refetched from memory since you are promising the value hasn't changed since the last time it was used.

I'll give your program a try and post results when ready. I love these threads.

Post by **albinopapa** » October 30th, 2017, 11:14 pm

Release x64
Single loop
No preprocessors

Code: Select all

lowest: 16312ns
highest: 672238ns
median: 17365ns
mean: 18151ns
standard deviation: 6923.62ns

WRITE_AT_END

Code: Select all

lowest: 14471ns
highest: 610408ns
median: 15523ns
mean: 16270ns
standard deviation: 6873.44ns

DIRECT_VALUE and WRITE_AT_END

Code: Select all

lowest: 13681ns
highest: 239690ns
median: 14471ns
mean: 15042ns
standard deviation: 5557.49ns

SEPARATE_POINTER and WRITE_AT_END

Code: Select all

lowest: 13681ns
highest: 1208977ns
median: 14734ns
mean: 15323ns
standard deviation: 7708ns

DIRECT_VALUE

Code: Select all

lowest: 14734ns
highest: 1859904ns
median: 15786ns
mean: 16575ns
standard deviation: 8350.46ns

SEPARATE_POINTER

Code: Select all

lowest: 14997ns
highest: 617512ns
median: 16049ns
mean: 16721ns
standard deviation: 6223.64ns

Double loop

Code: Select all

lowest: 14470ns
highest: 494379ns
median: 15260ns
mean: 16048ns
standard deviation: 5853.36ns

WRITE_AT_END

Code: Select all

lowest: 14733ns
highest: 419130ns
median: 15524ns
mean: 16366ns
standard deviation: 6289.3ns

DIRECT_VALUE and WRITE_AT_END

Code: Select all

lowest: 13944ns
highest: 688025ns
median: 14734ns
mean: 15391ns
standard deviation: 6196.09ns

SEPARATE_POINTER and WRITE_AT_END

Code: Select all

lowest: 14997ns
highest: 491221ns
median: 16049ns
mean: 16906ns
standard deviation: 6420.2ns

DIRECT_VALUE

Code: Select all

lowest: 13418ns
highest: 442809ns
median: 14471ns
mean: 15061ns
standard deviation: 5869.32ns

SEPARATE_POINTER

Code: Select all

lowest: 14470ns
highest: 837733ns
median: 15523ns
mean: 16272ns
standard deviation: 6033.59ns

Now let's rank them using the mean:

Code: Select all

Sorted
	 1) Single loop ( DIRECT_VALUE and WRITE_AT_END ) 		= 15042ns
	 2) Double loop ( DIRECT_VALUE )						= 15061ns
	 3) Single loop ( SEPARATE_POINTER and WRITE_AT_END)		= 15323ns
	 4) Double loop ( DIRECT_VALUE and WRITE_AT_END )		= 15391ns
	 5) Double loop ( no preprocessors )						= 16048ns
	 6) Single loop ( WRITE_AT_END )						= 16270ns
	 7) Double loop ( SEPARATE_POINTER )					= 16272ns
	 8) Double loop ( WRITE_AT_END )						= 16366ns
	 9) Single loop ( DIRECT_VALUE )						= 16575ns
	10) Single loop ( SEPARATE_POINTER )					= 16721ns
	11) Double loop ( SEPARATE_POINTER and WRITE_AT_END )	= 16906ns
	12) Single loop ( no preprocessors )						= 18151ns

Summary:
For static targets that won't move, use a single loop and do calculations on temp variables and write final results to member variables.
For moving targets regardless if it's going to be the same target or passing a new target to the Update function, use a double loop and do calculations and write final result to member variables, don't use temps.

Specs:
CPU: AMD 7870K APU 3.9GHz
RAM: 8GB 1333 DDR3
L1: ICache = 2x 96KB, DCache = 4x 16KB
L2: 2 x 2MB
L3: None

Post by **albinopapa** » October 30th, 2017, 11:16 pm

I did find a "bug" in the code. For single loop SEPARATE_POINTER you aren't incrementing the target_it iterator. These results are after adding ++target_it to the end of the for loop signature. It did change the results putting the double loop at a better advantage. Otherwise, the single loop looked a lot more promising because it is more straight forward to write and call a single Update function then to have to write and call two functions for the same result.

cyboryxmen · Post by **cyboryxmen** » October 31st, 2017, 3:09 am

albinopapa wrote:

Code: Select all

Approach 1:
int a = 2;
int b = 5;
int c = a + b;

unit[0].doSomething( c );
unit[1].doSomething( c );

Approach 2:
unit[0].doSomething( a + b );
unit[1].doSomething( a + b );

Interesting. If these functions are inlined, it's better to do calculations first since the values would stay in the registers. If not, I would wonder what the performance would actually be. It might be worth testing it out.

Also, thanks for informing me about the bug. I was wondering why it causes single loop to have a performance increase even though it decreases performance for double loop. Now the single loop preforms worse just like the double loop. I guess this means that I have finally proven that Array of Structs is not always worse than Struct of Arrays. I only focused on SoA because SIMD is dependent on it to work optimally. Now that I know that CUDA works well with AoS(better that SoA in some cases too) and preforms better than SIMD, I pretty much don't have much reason to do SoA. I would more likely do a compromise between the two instead.

cyboryxmen · Post by **cyboryxmen** » October 31st, 2017, 4:24 am

About pipelining: Out of Order execution optimizes the pipelining process a little bit further. Generally, an OoO processor works as such:

1)Instruction is fetched.
2)Instruction is stored in an instruction buffer.
3)Instruction waits in the buffer until its inputs are available in the register. The instruction is then allowed to leave the buffer before earlier, older instructions.
4)The instruction is issued to the appropriate functional unit and executed by that unit.
5)The results are stored in a buffer.
6)Only after all older instructions have their results written back to the registers, then this result is written back to the registers.

Thus, non i/o(read/write) instructions can be executed as soon as their input is available in registers(a multiply can be done as soon as the read finishes fetching the data). In an OoO processor, [n] number of reads are all done ahead of other instructions all at once for the specific purpose of fetching these inputs early. This allows multiple operations across every iteration of the loop to occur simultaneously making sure that the pipeline is more frequently filled.

Unfortunately, the OoO processors must return back to the original order the program is made in eventually. Write to memory must pretty much always be done in the program's original order(to retain the program's memory consistency). Doing it otherwise can be disastrous. Thus, you have to assume that a write will always block other writes(and other reads that depend on that write too).

Post by **albinopapa** » October 31st, 2017, 7:43 am

As far as CUDA goes, look into C++ AMP, DirectCompute or OpenCL so you aren't limiting yourself to nVidia hardware. Yes, CUDA can run on CPU, but not really a gain for those that don't have nVidia GPUs. Don't be a jerk lol.

You may not get a full 4x speedup using AoS, but using SIMD instructions still offers an increase over the standard x86/x64 instructions as long as there is enough work to be done between load and store instructions.

cyboryxmen · Post by **cyboryxmen** » November 3rd, 2017, 7:31 am

I decided to take the next step and see if a multi threaded implementation would work better. Multi threading became more popular as people realized that it can be used to work around i/o stalls. If a thread stalls because of i/o, the cpu can just run another thread instead. It failed...horribly.

You can have the repo here. Be warned though: it has boost in it so the file size is a lot larger.

Planet Chili

Optimization puzzle

Optimization puzzle

Re: Optimization puzzle

Re: Optimization puzzle

Re: Optimization puzzle

Re: Optimization puzzle

Re: Optimization puzzle

Re: Optimization puzzle

Re: Optimization puzzle