Data Caching and Code Optimizations

theDevCorner
Posts: 17
Joined: July 12th, 2013, 8:46 pm

Data Caching and Code Optimizations

Post by theDevCorner » January 3rd, 2017, 6:06 am

So lately I've been reading a little about code optimization in hopes of applying it to an RPG that I've been working on. I've come across the word "cache" a lot, and I've done quite a bit of research. For those of you who don't know what the cache is, it's essentially a small, fast memory unit that sits very close to the CPU and acts as an interface between the CPU and the system's main memory. It holds a small amount of the most frequently used data and lets the CPU access that data much faster than if it had to go out to regular memory. A diagram looks a little like this:

CPU -----> Caches -------------------> System memory

If the CPU finds the address it is looking for in the cache, the result is a "cache hit"; if it doesn't, the result is a "cache miss", in which case the data is fetched from system memory into the cache and then handed to the program to use.

Just as a clarifying note, there isn't just one cache attached to the CPU; caches are found all over hardware in other places as well (including things like web caches, etc.).

I was wondering if anyone knew about this and was looking to have a discussion about it. I'm fairly new to the topic and thought that someone here might have some insight, as the forum seems to be knowledgeable in a wide range of subjects. Anyways, here's a brief list of things I've found that help improve performance in any project; they all help the cache store more relevant data and access it faster.

1) Use smaller data types. This lets the cache fit more of your data at once, as an L1 data cache is typically only about 32 KB in size.

2) Don't use linked lists; use arrays that store data contiguously where possible. This lets the CPU pull whole blocks of relevant data into the cache so your code can get at them more readily. If the addresses are scattered (non-contiguous), the cache can't hold as much relevant data for a single object, since it's spread across system memory and isn't easily fetched in one go. (See the sketch after this list.)

3) Avoid multithreading in simple applications if possible, as the cores can actually fight over addresses, resulting in many more cache hits than normal.

4) Avoid the bottleneck of OOP, and don't make EVERYTHING an object, even if it helps with organization.
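
To get a concrete feel for point 2, here's a little benchmark sketch (my own toy code, so take the numbers with a grain of salt): it sums the same values stored contiguously in a std::vector and scattered across heap-allocated linked-list nodes:

Code: Select all

#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

struct Node
{
	int value;
	Node* next;
};

int main ( )
{
	constexpr std::size_t n = 1000000;

	// Contiguous: one allocation, neighbouring elements share cache lines.
	std::vector<int> arr ( n, 1 );

	// Linked: one allocation per node; addresses need not be adjacent.
	// (A real list that grew and shrank over time would be even more scattered.)
	Node* head = nullptr;
	for ( std::size_t i = 0; i < n; ++i )
	{
		head = new Node { 1, head };
	}

	auto t0 = std::chrono::steady_clock::now ( );
	long long arraySum = std::accumulate ( arr.begin ( ), arr.end ( ), 0LL );
	auto t1 = std::chrono::steady_clock::now ( );

	long long listSum = 0;
	for ( Node* p = head; p != nullptr; p = p->next )
	{
		listSum += p->value;
	}
	auto t2 = std::chrono::steady_clock::now ( );

	using ms = std::chrono::duration<double, std::milli>;
	std::cout << "array: " << ms ( t1 - t0 ).count ( ) << " ms, sum " << arraySum << std::endl;
	std::cout << "list:  " << ms ( t2 - t1 ).count ( ) << " ms, sum " << listSum << std::endl;
	return 0; // the list is leaked; fine for a throwaway benchmark
}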

Those are just a few things that I've found through research. To me it sounds like this drives a programmer away from OOP and toward a more data-driven design. That's a challenge for someone like me who enjoys the flexibility and power of OOP. I've been looking into places to cut objects and to remove dependencies from complex relationships. If anyone has any insight into this subject area I'd love to hear it. ;)
Code more, work less.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Data Caching and Code Optimizations

Post by albinopapa » January 3rd, 2017, 8:53 am

theDevCorner wrote:1) Use smaller data types. This lets the cache fit more of your data at once, as an L1 data cache is typically only about 32 KB in size.
Can you be more specific?
theDevCorner wrote:2) Don't use linked lists; use arrays that store data contiguously where possible. This lets the CPU pull whole blocks of relevant data into the cache so your code can get at them more readily. If the addresses are scattered (non-contiguous), the cache can't hold as much relevant data for a single object, since it's spread across system memory and isn't easily fetched in one go.
It is true that linked lists are less cache friendly than arrays; in most situations arrays will win because of data locality. The CPU fetches up to 64 bytes (one cache line) at a time, so it's best to have the next bit of data you plan on using land in that initial fetch. If the array is large enough and you insert or remove something in the middle, then you will be hurting, whereas a linked list, or better yet a doubly linked list, would have no problem there. I think this is the only situation where the array would fail.
theDevCorner wrote:3) Avoid multithreading in simple applications if possible, as the cores can actually fight over addresses, resulting in many more cache hits than normal.
I think you mean cache misses. I don't think cores share L1 cache, so they wouldn't fight over the addresses. Some share L2 and L3 cache, so that might be an issue. However, I've read you can overcome some if not all of that if you keep track of the threads and only allow each one to run on data that is in its own portion of cache... again, it's about data locality.
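
Here's a minimal sketch of that idea (my own code, just to illustrate; the 64 in alignas assumes a 64-byte cache line, and it needs C++17 so the vector honours the over-alignment). Each thread sums its own contiguous slice of the data and writes its result into an accumulator padded out to a full cache line, so no two cores ever write to the same line:

Code: Select all

#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// One accumulator per thread, each padded to a whole 64-byte cache line so
// neighbouring threads never invalidate each other's line (false sharing).
struct alignas ( 64 ) PaddedSum
{
	long long value = 0;
};

int main ( )
{
	std::vector<int> data ( 1000000, 1 );
	const unsigned nThreads = std::thread::hardware_concurrency ( ) > 0 ?
		std::thread::hardware_concurrency ( ) : 4;
	std::vector<PaddedSum> sums ( nThreads );
	std::vector<std::thread> threads;

	const std::size_t chunk = data.size ( ) / nThreads;
	for ( unsigned t = 0; t < nThreads; ++t )
	{
		const std::size_t begin = t * chunk;
		const std::size_t end = ( t + 1 == nThreads ) ? data.size ( ) : begin + chunk;
		threads.emplace_back ( [&, t, begin, end] ( )
		{
			long long local = 0; // keep the running sum in a register
			for ( std::size_t i = begin; i < end; ++i )
			{
				local += data[ i ];
			}
			sums[ t ].value = local; // one write per thread, at the very end
		} );
	}
	for ( auto& th : threads )
	{
		th.join ( );
	}

	long long total = 0;
	for ( const auto& s : sums )
	{
		total += s.value;
	}
	std::cout << total << std::endl;
	return 0;
}
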
theDevCorner wrote:4) Avoid the bottleneck of OOP, and don't make EVERYTHING an object, even if it helps with organization.
You may want to clarify, meaning give alternatives or examples. Messy code can be just as bad as a slow program. If you are going to do calculations on a lot of data, then yes, put the data in an array so you can benefit from the data locality. Some programs don't have specific hot sections of processing, so OOP is just fine for those. For instance, I wrote a 3D Obj importer and didn't see much of a reason not to use classes and objects. When writing games, though, you can create an Enemy class that stores an array of positions, velocities, and the like, so you can quickly run through, modify the velocities, and add them to the positions of all the enemies. So really it kind of depends on the workload you are doing.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

cyboryxmen
Posts: 190
Joined: November 14th, 2014, 2:03 am

Re: Data Caching and Code Optimizations

Post by cyboryxmen » January 3rd, 2017, 10:07 am

Before you dive into this topic, there are certain aspects of computer architecture that you must understand first.

First of all, memory is SLOW. Insanely slow. It used to be on par with the CPU, but as the years go by, the CPU has advanced so far in speed that it has left memory in the dust. Despite the advancements we have made in CPU speeds, the CPU is constantly bottlenecked by memory. When you're programming, treat * and -> as your most expensive operations (right next to if/switch statements and virtual function calls).

Fetching data in one batch is going to be faster than doing it byte by byte. The CPU, however, can only do this if you align your data to its natural boundaries (ints at addresses that are a multiple of 4, shorts at addresses that are a multiple of 2, etc.). The compiler aligns the data members in structs/classes automatically for you by adding invisible bytes between your members (formally known as padding). As a result, you normally do not need to worry too much about this, but it is something you need to think about when manually managing data.
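
You can see the padding for yourself with a quick sketch like this (my own example; the exact numbers are implementation-defined, but these are what you'd typically get on x86/x64):

Code: Select all

#include <cstddef>
#include <iostream>

struct Example
{
	char c;  // 1 byte
	// 3 invisible bytes of padding go here so 'i' starts on a 4-byte boundary
	int i;   // 4 bytes
	short s; // 2 bytes
	// 2 bytes of tail padding so an array of Example keeps every 'i' aligned
};

int main ( )
{
	std::cout << "sizeof:   " << sizeof ( Example ) << std::endl;      // typically 12, not 7
	std::cout << "alignof:  " << alignof ( Example ) << std::endl;     // typically 4
	std::cout << "offset i: " << offsetof ( Example, i ) << std::endl; // typically 4
	std::cout << "offset s: " << offsetof ( Example, s ) << std::endl; // typically 8
	return 0;
}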

Also, when you request data that isn't in your caches, the CPU uses a prefetcher to fetch a huge batch of data from DRAM, both in front of and behind the requested address. As long as the very next request you make (if it also isn't in your caches) falls within this batch, the prefetcher can stay consistently ahead of (and behind) you for the next data request. And the one after that. And the one after that.


What that means is that if you process your data in a linear fashion (forwards or backwards), you basically have another level of cache on the DRAM that has UNLIMITED MEMORY.


The power this grants you is enough to make a linear search through 100,000 objects that takes 99,999 steps to find its target still faster than a search algorithm that hops around memory to find it in 10. Data fetching barely scales, if at all, with the size of your array because, as I've said: infinite memory. The key takeaway is to process your data linearly so that the prefetcher can always stay ahead of (and behind) you.
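
Here's a sketch of mine (sizes made up) that isolates the effect: it sums the same array twice, once in index order, where the prefetcher can stay ahead of the loop, and once in a shuffled order, where nearly every access is a surprise:

Code: Select all

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main ( )
{
	constexpr std::size_t n = 10000000;
	std::vector<int> data ( n, 1 );

	std::vector<std::size_t> linear ( n );
	std::iota ( linear.begin ( ), linear.end ( ), 0 );
	std::vector<std::size_t> shuffled = linear;
	std::shuffle ( shuffled.begin ( ), shuffled.end ( ), std::mt19937 { 42 } );

	auto time = [&] ( const std::vector<std::size_t>& order, const char* label )
	{
		auto t0 = std::chrono::steady_clock::now ( );
		long long sum = 0;
		for ( std::size_t i : order )
		{
			sum += data[ i ];
		}
		auto t1 = std::chrono::steady_clock::now ( );
		std::cout << label << ": "
			<< std::chrono::duration<double, std::milli> ( t1 - t0 ).count ( )
			<< " ms (sum " << sum << ")" << std::endl;
	};

	time ( linear, "linear  " );   // prefetcher stays ahead of the loop
	time ( shuffled, "shuffled" ); // mostly cache misses
	return 0;
}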

With this in mind, I can honestly conclude that OOP is actually the best way of taking advantage of the CPU. OOP promotes encapsulation of data within objects. This keeps your data local (assuming there isn't a pointer in there) and thus allows you to process objects linearly in an array. Any other data can simply be stored on the stack (don't be afraid of copying; copies are more efficient than you know!) and passed down as function arguments. Stack data is usually properly aligned automatically and thus would still be aligned in an array. As for polymorphism, you can try out this class that lets you store different polymorphic classes inside an array.
Zekilk

cyboryxmen
Posts: 190
Joined: November 14th, 2014, 2:03 am

Re: Data Caching and Code Optimizations

Post by cyboryxmen » January 3rd, 2017, 10:20 am

albinopapa wrote:If the array is large enough and you insert or remove something in the middle, then you will be hurting, whereas a linked list, or better yet a doubly linked list, would have no problem there. I think this is the only situation where the array would fail.
This would still be true if it weren't for the prefetcher. The prefetcher is really just that fast at fetching data compared to fetching it normally. During a linear insertion, the prefetcher paves the way for the CPU to zoom through the data and copy it all quickly along with the insertion. Even with multiple copies, it still beats linked lists, which don't get the benefit of using the prefetcher for UNLIMITED CACHED MEMORY (on DRAM). Like I said, memory is insanely slow. Watch this and be amazed! Not even binary search can beat linear search!
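
If you want to test that yourself, here's a stand-in sketch (my own code, not the one from the video): it inserts at random positions into a std::vector, which has to shift its tail linearly on every insert, and into a std::list, which does no shifting but has to walk node by node just to find the insertion point:

Code: Select all

#include <chrono>
#include <iostream>
#include <iterator>
#include <list>
#include <random>
#include <vector>

int main ( )
{
	constexpr int inserts = 20000;
	std::mt19937 rng { 42 };

	auto t0 = std::chrono::steady_clock::now ( );
	std::vector<int> vec;
	for ( int i = 0; i < inserts; ++i )
	{
		const auto pos = vec.empty ( ) ? 0 : rng ( ) % vec.size ( );
		vec.insert ( vec.begin ( ) + pos, i ); // shifts the tail: linear, prefetcher-friendly
	}
	auto t1 = std::chrono::steady_clock::now ( );

	std::list<int> lst;
	for ( int i = 0; i < inserts; ++i )
	{
		const auto pos = lst.empty ( ) ? 0 : rng ( ) % lst.size ( );
		auto it = lst.begin ( );
		std::advance ( it, pos ); // walks node to node: a potential cache miss per hop
		lst.insert ( it, i );
	}
	auto t2 = std::chrono::steady_clock::now ( );

	using ms = std::chrono::duration<double, std::milli>;
	std::cout << "vector: " << ms ( t1 - t0 ).count ( ) << " ms" << std::endl;
	std::cout << "list:   " << ms ( t2 - t1 ).count ( ) << " ms" << std::endl;
	return 0;
}
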
Zekilk

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Data Caching and Code Optimizations

Post by albinopapa » January 3rd, 2017, 11:43 am

cyboryxmen wrote:
albinopapa wrote:If the array is large enough and you insert or remove something in the middle, then you will be hurting, whereas a linked list, or better yet a doubly linked list, would have no problem there. I think this is the only situation where the array would fail.
This would still be true if it weren't for the prefetcher. The prefetcher is really just that fast at fetching data compared to fetching it normally. During a linear insertion, the prefetcher paves the way for the CPU to zoom through the data and copy it all quickly along with the insertion. Even with multiple copies, it still beats linked lists, which don't get the benefit of using the prefetcher for UNLIMITED CACHED MEMORY (on DRAM). Like I said, memory is insanely slow. Watch this and be amazed! Not even binary search can beat linear search!
Well, here's the deal, and I'd have to test it, buuuut: most classes store more than just a single int, and even Herb Sutter (the speaker) says it works for small data types. Try it with classes or structs that are a few hundred bytes a piece and you may get a different outcome. Again, I'd have to test it.

As for your comment about OOP:
cyboryxmen wrote:OOP promotes encapsulation of data within objects. This keeps your data local (assuming there isn't a pointer in there) and thus allows you to process objects linearly in an array.
As long as each object does enough with its own data before moving on to the next one, I agree. If you only use a small portion of the data before moving on to the next object in the array, I don't think the prefetcher will help you.

I've actually had experience with this. I wrote a "Game of Life" simulation in two versions: an OOP version where the cells were Cell objects in an array, and a second version that was just an array of ints. The second version was about ten times faster, going from 200 ms per frame with the Cell class to 20 ms per frame with the plain array of ints.

Also, Pindrought uploaded a test project where he wanted to test linked lists against vector. If I remember correctly, list did win in a few instances in his case. His linked lists were home grown and used new/delete for each element. The vector code didn't use pointers, but whole objects. When I made a few changes so that the vector stored unique_ptr<element>, the vector became way faster than the linked lists. This was adding and removing elements, not necessarily processing those elements.

If you do use OOP, putting your objects in arrays is definitely going to help, but if you want the most speed, use a structure of arrays (SoA) to lay out your data. Keep all your like data together in one structure, like:

Code: Select all

struct Enemy
{
     // Index i across all of these arrays describes the i-th enemy.
     std::vector<float> posx;
     std::vector<float> posy;
     std::vector<float> velx;
     std::vector<float> vely;
     std::vector<float> orix;
     std::vector<float> oriy;
     std::vector<std::shared_ptr<Sprite>> color;
};
Believe me, it's a huge pain to program like this. If you don't use iterators, it probably won't hurt as much; however, I wanted to use iterators, so in every function that iterated through this data I had to create and increment a separate iterator for each array. It's a lot more work, but I think it's about as fast as you can get without using SIMD instructions or the GPU.
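
To give a feel for that pain, here's a rough sketch (not my actual game code) of one such update function over the Enemy struct above, with one lock-step iterator per array:

Code: Select all

// Assumes the Enemy struct from above, with all vectors the same length.
void UpdatePositions ( Enemy& enemies )
{
	auto px = enemies.posx.begin ( );
	auto py = enemies.posy.begin ( );
	auto vx = enemies.velx.cbegin ( );
	auto vy = enemies.vely.cbegin ( );
	// One iterator per array, all incremented together on every pass.
	for ( ; px != enemies.posx.end ( ); ++px, ++py, ++vx, ++vy )
	{
		*px += *vx;
		*py += *vy;
	}
}
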
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

cyboryxmen
Posts: 190
Joined: November 14th, 2014, 2:03 am

Re: Data Caching and Code Optimizations

Post by cyboryxmen » January 3rd, 2017, 1:58 pm

I suppose that, as Chili said, benchmarking must be done before any theory can be proven right. It's a good thing I was literally experimenting with these concepts as this thread was posted. I'm going to try out some of my optimisations using different memory models for SIMD, both with C++ OOP and pure assembly code, and hybrids of the two.

At the end of the day, memory is key, and I'm planning to test out all of the memory models you guys have suggested to me on the forums. Though I have to say, I'm biased toward the ones that are more usable and readable. The reason I took so long experimenting with these models is how much they limited the code. I had to change programming paradigms several times over just to figure out which ones would be compatible with them.

Also,
albinopapa wrote: If you do use OOP, putting your objects in arrays is definitely going to help, but if you want the most speed, use a structure of arrays (SoA) to lay out your data. Keep all your like data together in one structure, like:

Code: Select all

struct Enemy
{
     // Index i across all of these arrays describes the i-th enemy.
     std::vector<float> posx;
     std::vector<float> posy;
     std::vector<float> velx;
     std::vector<float> vely;
     std::vector<float> orix;
     std::vector<float> oriy;
     std::vector<std::shared_ptr<Sprite>> color;
};
There are many cited reasons why this layout would be better, but really, the main reason this structure ever came up is because of 4 words:

Single
Instruction
Multiple
Data

SIMD is a game changer for programming, and it's worth going out of your way to make your code as SIMD friendly as possible. Computers are really just automated calculators, so the ability to make calculations faster is the key to computer optimisation. And what better way to do that than to do multiple calculations at once! If you are not using SIMD, you miss the main point of using structures of arrays, and you may even lose performance from constantly walking multiple arrays instead of one.
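
To make that concrete, here's a sketch of mine using SSE intrinsics (it assumes the x/y/z arrays are 16-byte aligned and that their length is a multiple of 4): it computes four vector lengths per loop iteration straight off a Struct of Arrays layout, something an Array of Structs would need extra shuffling to do:

Code: Select all

#include <cstddef>
#include <immintrin.h>

// Sums the lengths sqrt(x*x + y*y + z*z) of n vectors stored as SoA,
// processing four vectors per iteration.
float SumLengths ( const float* x, const float* y, const float* z, std::size_t n )
{
	__m128 total = _mm_setzero_ps ( );
	for ( std::size_t i = 0; i < n; i += 4 )
	{
		const __m128 xs = _mm_load_ps ( x + i ); // 4 x components in one load
		const __m128 ys = _mm_load_ps ( y + i );
		const __m128 zs = _mm_load_ps ( z + i );
		const __m128 sq = _mm_add_ps ( _mm_add_ps (
			_mm_mul_ps ( xs, xs ), _mm_mul_ps ( ys, ys ) ), _mm_mul_ps ( zs, zs ) );
		total = _mm_add_ps ( total, _mm_sqrt_ps ( sq ) ); // 4 square roots at once
	}
	// Horizontal add: fold the 4 partial sums in 'total' down to one float.
	__m128 t = _mm_add_ps ( total, _mm_movehl_ps ( total, total ) );
	t = _mm_add_ss ( t, _mm_shuffle_ps ( t, t, 1 ) );
	return _mm_cvtss_f32 ( t );
}
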
Zekilk

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Data Caching and Code Optimizations

Post by albinopapa » January 3rd, 2017, 10:27 pm

There are always going to be corner cases, I suppose; one size does not fit all.

SIMD is fun to mess around with, especially if you go from traditional memory layouts to a more optimized one and you get to see the performance improvement. I really do need to test the different methods more often, but I really hate having to write the same program more than once.

Overall, you may be right about the original reason for SoA, but in all honesty that layout is also exactly what the prefetcher was designed for, so you get quite a bit of a performance boost from it without having to know any SIMD instructions.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

LuisR14
Posts: 1248
Joined: May 23rd, 2013, 3:52 pm
Location: USA

Re: Data Caching and Code Optimizations

Post by LuisR14 » January 4th, 2017, 1:01 am

theDevCorner wrote:Just as a clarifying note, there isn't just one cache attached to the CPU; caches are found all over hardware in other places as well (including things like web caches, etc.).
I don't think a web cache is a hardware thing, but it does use memory (DRAM, cache, and HDD space)
always available, always on, about ~10 years c/c++, java[script], win32/directx api, [x]html/css/php/some asp/sql experience. (all self taught)
Knows English, Spanish and Japanese.
[url=irc://irc.freenode.net/#pchili]irc://irc.freenode.net/#pchili[/url] [url=irc://luisr14.no-ip.org/#pchili]alt[/url] -- join up if ever want real-time help or to just chat :mrgreen: --

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Data Caching and Code Optimizations

Post by albinopapa » January 4th, 2017, 2:42 am

No, but it does leave a trail: cached to HDD, then to DRAM, then to CPU cache.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

cyboryxmen
Posts: 190
Joined: November 14th, 2014, 2:03 am

Re: Data Caching and Code Optimizations

Post by cyboryxmen » January 13th, 2017, 6:23 am

So I was bored one day and decided to test out the benefits of Struct of Arrays vs Array of Structs (without the benefits of SIMD). I wasn't committed enough to make a bunch of graphs, though, so you'll have to test it yourselves to really collect the data.

Here are my specs:
Intel Core i7-4700MQ CPU @ 2.40 GHz
16GB DDR3 (15.8GB usable)
Windows 10 x64

I decided to use a simple calculation: finding the length of a math vector, a pretty standard calculation in a video game. I calculated the length of each vector in an array individually and added it to a total. I then recorded the duration of the procedure with std::chrono and output it along with the total on the console. The struct uses 3 floats representing each axis, and the struct of arrays uses 3 std::vector<float>s.

Code: Select all

struct Vector3
{
	float x;
	float y;
	float z;
};

struct Vector3s
{
	std::vector<float> x;
	std::vector<float> y;
	std::vector<float> z;
};
The results were so similar it doesn't seem to matter much; both performed about the same with slight variations. The fun doesn't stop there though. After that, I tried to deliberately pessimise the Array of Structs by adding more and more members that aren't even part of the calculation (color, uv values etc.). Then I optimised the Struct of Arrays further by using plain C++ arrays, so the three arrays sit right next to each other in memory, which should in theory help. Overall, it didn't do much to change the results. I barely noticed any performance difference.

I'll give you the final code I ended up with so you can benchmark it yourself and see if you can find some corner cases. In my opinion, you should stick to the good old Array of Structs. It's standard, more usable, and scales well. Seriously, typing out the for loop for the Struct of Arrays is such a pain. Even if you find corner cases (either on your system or with different code), I honestly don't think it is worth sacrificing years of OOP-style development. To me, SIMD remains the sole reason Struct of Arrays should ever even be considered.

Array of Structs

Code: Select all

#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
#include <chrono>
#include <random>
#include <cmath>   // std::sqrt
#include <cstdlib> // system

struct Vertex
{
	virtual ~Vertex ( ) = default; // adds a hidden vptr: more per-object bloat
	float x;
	float y;
	float z;
	float normal_x;
	float normal_y;
	float normal_z;
	float r;
	float g;
	float b;
	float u;
	float v;
};

int main ( )
{
	std::random_device seed;
	std::mt19937 rand ( seed ( ) );
	std::uniform_real_distribution<float> rand_float ( 0, 12 );

	std::vector<Vertex> vertices;
	vertices.resize ( 200000 );
	for ( auto& vert : vertices )
	{
		vert.x = rand_float ( rand );
		vert.y = rand_float ( rand );
		vert.z = rand_float ( rand );
		vert.normal_x = rand_float ( rand );
		vert.normal_y = rand_float ( rand );
		vert.normal_z = rand_float ( rand );
		vert.r = rand_float ( rand );
		vert.g = rand_float ( rand );
		vert.b = rand_float ( rand );
		vert.u = rand_float ( rand );
		vert.v = rand_float ( rand );
	}

	auto start = std::chrono::steady_clock::now ( );

	float total { };
	for ( const auto& vert : vertices ) // by reference: avoid copying each Vertex
	{
		total += std::sqrt ( vert.x * vert.x + vert.y * vert.y + vert.z * vert.z ); // actual vector length
	}

	auto end = std::chrono::steady_clock::now ( );
	std::chrono::nanoseconds duration = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start );
	std::cout << "Total: " << total << std::endl;
	std::cout << "Time taken: " << duration.count ( ) << std::endl;
	std::cout << "Average: " << duration.count ( ) / vertices.size ( ) << std::endl;

	system ( "pause" );
	return 0;
}
Struct of Arrays

Code: Select all

#include <iostream>
#include <string>
#include <cstdint>
#include <vector>
#include <chrono>
#include <random>
#include <memory>
#include <cmath>   // std::sqrt
#include <cstdlib> // system

template<std::size_t size>
struct Vector3s
{
	enum
	{
		SIZE = size
	};
	float x [ size ];
	float y [ size ];
	float z [ size ];
};

int main ( )
{
	std::random_device seed;
	std::mt19937 rand ( seed ( ) );
	std::uniform_real_distribution<float> rand_float ( 0, 12 );

	constexpr std::size_t num = 200000;
	std::unique_ptr<Vector3s<num>> pointer ( new Vector3s<num> ); // ~2.4 MB: too big for the stack
	auto& vector3s = *pointer;

	for ( float* x = vector3s.x, *y = vector3s.y, *z = vector3s.z, *x_end = vector3s.x + vector3s.SIZE; x != x_end; ++x, ++y, ++z )
	{
		*x = rand_float ( rand );
		*y = rand_float ( rand );
		*z = rand_float ( rand );
	}

	auto start = std::chrono::steady_clock::now ( );

	float total { };
	for ( float* x = vector3s.x, *y = vector3s.y, *z = vector3s.z, *x_end = vector3s.x + vector3s.SIZE; x != x_end; ++x, ++y, ++z )
	{
		total += std::sqrt ( *x * *x + *y * *y + *z * *z ); // actual vector length
	}

	auto end = std::chrono::steady_clock::now ( );
	std::chrono::nanoseconds duration = std::chrono::duration_cast< std::chrono::nanoseconds >( end - start );
	std::cout << "Total: " << total << std::endl;
	std::cout << "Time taken: " << duration.count ( ) << std::endl;
	std::cout << "Average: " << duration.count ( ) / vector3s.SIZE << std::endl;

	system ( "pause" );
	return 0;
}
Last edited by cyboryxmen on January 13th, 2017, 7:57 am, edited 5 times in total.
Zekilk
