caching shenanigans

cyboryxmen · Post by **cyboryxmen** » November 25th, 2017, 12:43 pm

Contiguous storage has become more in favour nowadays. It's better to allocate a huge chunk of memory to store all your objects in over allocating seperate chunks of memory for each object. The main reason comes from one aspect of modern CPUs: caching.

CPUs are so fast at calculations that retrieving memory from DRAM is so much slower in comparison. To alleviate this, CPUs have caches that the CPU has easier access to. Optimizations today will be less about optimizing algorithms and more centred around making it easier to cache memory for the CPU to use.

When the CPU reads or writes memory that is not already cached, it copies it from DRAM to its cache. It doesn't just read it byte by byte either as that would be non-optimal. Instead, the CPU reads 64 bytes at a time by reading memory in 64 byte boundaries. This means that instead of treating memory as a collection of single byte blocks, they are treated as 64 byte blocks instead. These blocks are more commonly known as words. The CPU goes to the address of the 64 byte word containing the requested memory and fetches the whole word at once. This shows how alignment is important for caching optimizations and you should make sure that your data is aligned nicely into a 64 byte word. If your requested memory is in between two 64 byte words, the CPU has no choice but to load both those 64 byte words to load the 4 byte int you wanted.

Now, the cache is split into different cache lines, each storing one 64 byte word. When the CPU wants to cache memory, it stores it in a cache line that has not been accessed for a long time out of all other cache lines. Because of this, it is good to make use of all the data in the cache line at once to prevent the CPU from preemptively evicting a cache line early.

Additionally, there is also the prefetcher. The prefetcher is responsible for prefetching the next consecutive words ahead of the last word that was fetched in memory. It was designed specifically for looping through a block of memory where you iterate through the objects linearly, from begin to end or vice versa. It can move both forwards and backwards and will move based on how it predicts you would loop.

Bringing this all together, having your objects stored contiguously in memory makes them more optimal for caching. The data is closer together and thus more likely to be packed in the same cache lines. You should also preferably access them sequentially to better make use of the prefetcher but as my other thread shows, sequential access isn't always possible.

However, if theoretically, your object fills up the whole cache line anyway, and you're not accessing your objects sequentially, wouldn't there be barely any performance difference between storing contiguously vs individually since you can't make use of these optimizations? I did a test specifically to prove this and I have concluded...that contiguous storage beats individual storage; no competition.

As it turns out, you cannot turn off the prefetcher. The prefetcher doesn't care if you're not storing your objects contiguously and will always be prefetching words ahead of the current word being processed. If you are storing objects contiguously, this will benefit you as it will fetch words that you will actually use later but if you store your objects individually, the prefetched data would most likely be complete garbage filling up your cache.

tf;dr: std::vector and other contiguous containers are most likely the better option than any other container.