We've all heard the advice to "prefer readable code over fast code". What is fast code, though?
Well, if you have been around the forum for about a year or so, you'll know there have been a few attempts to cover best practices for making your code fast. More than a few of them were by cyboryxmen I believe, covering topics from cache locality to SIMD. While cache locality ( sometimes referred to as data locality ) and SIMD go hand in hand, there is another area where data locality plays an important role: memory reads and writes. This has also been covered on the forum a few times over the past couple of years. For instance, which container do you choose? List, Deque, Map, Vector?
The interfaces for each of these containers are pretty similar, so from a usability point of view, any should be fine. From a read/write point of view, vector is probably going to be your best bet because all the data is contiguous. From an allocation point of view, perhaps deque would be best, because instead of vector's allocate/copy/deallocate cycle it allocates chunks of memory and keeps a linked list of those chunks.
Ok, my ramblings weren't supposed to be about containers per se, and that's about as far as my knowledge goes anyway.
What I have come to say about efficiency is that it's crazy to have to think about what kind of efficiency you want. For instance, I've spent the last two weeks or so reading over the Box2D API. IT'S A FUCKING NIGHTMARE. I've watched an enormous number of YouTube videos from C++ developers ( mostly from CppCon, Pacific++ and so on ). They have been discussing a pretty narrow field of topics over the past year or two, focused on parallelism and concurrency. Data parallelism is usually talked about in the same breath as SIMD or GPGPU programming, though concurrency is also brought up when discussing GPGPU programming. Another recurring topic has been dynamic polymorphism and how to avoid it using std::variant, and yet another has been how you should design your code for "best" performance with data locality at the core of the talk.
So, I thought I'd say a few words. From what I've experienced, data locality is king when discussing performance. If you can get all your data in a contiguous block of memory, and you are just iterating over it from lower index to upper index ( or even in reverse as long as it's sequential ), then you are going to get the most out of your CPU cache.
However, as the challenges chili has been posing lately have proved, the algorithm you choose is going to beat out brute force in cases where the data set is huge. This is why different algorithms were invented, such as binary search and quick sort. Quick sort is on average the quickest sorting algorithm, but there are others like heap sort and merge sort which can be fairly quick as well. Once the data is sorted, a binary search can be used to find an element ( or determine it isn't there ) in O(log N) time, but that O(log N) doesn't account for the CPU cache prefetcher, so for smaller vectors it might be just as fast or faster to do a linear search.
I bring this up because there are a lot of factors to consider. Going back to the Box2D API, there is heavy use of pointers. Not only is the interface riddled with pointer parameters, but a given pointer could be a single object or the head of a linked list. Some are even in/out parameters. These pointers are stored between various classes and data structures for "quick" access. The b2World class stores linked lists of b2Body and b2Joint pointers. b2Body stores linked lists of b2Fixture and b2ContactEdge pointers, plus its own prev/next pointers, so b2Body and b2Contact ( which is a polymorphic class ) each act as linked-list nodes themselves. The b2ContactEdge structure stores its own prev/next pointers as well as a b2Body* and a b2Contact*, which again is a linked list of polymorphic pointers. So, to say it's quite a nightmare to untangle is an understatement, and the fact it's as fast as it is, is nothing short of amazing to me. I think the only thing that keeps this thing afloat is that all the calculations are done on local variables. That's right: in each function where calculations are done, the values are first copied to local variables, the calculations are performed, then the results are stored back through the pointers.
Now granted, you have to have some way of updating the data for each object, but a linked list can hinder performance severely, and it seems the author ( or the original author ) decided node insertion/deletion was more important than data locality.
Another thing that probably helps is the b2BlockAllocator it implements. It's implemented so that any objects of equal size are allocated from the same pool, so there is a higher chance that multiple allocations, especially those done sequentially, will be contiguous, which helps data locality. Because this isn't a guarantee though, you still have to do the pointer chasing that plagues all linked lists.
So what would be efficient here? Increase data locality by:
using std::variant as a substitute for dynamic polymorphism, and using std::vector to guarantee elements are contiguous in memory.
This only goes so far though. To really pack the data, you might even consider using a Structure of Arrays, which means instead of an array of Vec2 structures, you'd have a Vec2 structure holding std::vector<float> x, y;
Code:
struct Vec2
{
	void AddNewElement( float _x, float _y )
	{
		x.push_back( _x );
		y.push_back( _y );
	}
	void RemoveElement( int _index )
	{
		x.erase( x.begin() + _index );
		y.erase( y.begin() + _index );
	}
	float& X( int _index ) // Read/Write access to X - kind of like vec[ i ].x = 3.f, but vec.X( i ) = 3.f
	{
		return x[_index];
	}
	float X( int _index )const // Read-only access to X
	{
		return x[_index];
	}
	float& Y( int _index ) // Read/Write access to Y
	{
		return y[_index];
	}
	float Y( int _index )const // Read-only access to Y
	{
		return y[_index];
	}
private:
	std::vector<float> x,y;
};
Now, as far as passing objects around by value or by reference/pointer? I'm still on the fence on this one, really. For instance, if you passed the Vec2 from above by value, you'd be making copies of the two vectors stored in the struct. Probably not what you want, so you'd probably pass by reference or const reference. There are also move semantics to think about now, too. If you moved this struct into the function, did the calculations, then moved it out as the return value, you'd probably get the biggest benefit of cache locality, and you get to actually modify the original container without passing a reference/pointer, which again could have negative performance costs.
Man, software designing is really a tough game. There's the hardware to consider, the interface in which you base your code on, the algorithms and the data structures.
So, off topic: if anyone is planning on writing a physics library, keep this in mind. Consider all this information ( no conclusions are drawn here ) before you start hacking. Box2D is a pretty decent API from a feature standpoint and is pretty amazing with how quick it can be. The people who worked on it are pretty smart with mathematics, but my personal belief is that it could be better both in terms of design and performance, especially with all the updates to C++ since its initial release back in 2007, if I recall correctly.