Efficiency (not a tutorial)

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Efficiency (not a tutorial)

Post by albinopapa » November 11th, 2018, 8:41 am

There are a few ways to discuss efficiency depending on what is important to you or your boss, your wife or your client. There's time efficiency where your boss wants you to complete the job by the deadline and time efficiency where your client wants the program to run faster to meet their deadlines.

The usual advice has been "prefer readable code over fast code". What is fast code, though?

Well, if you have been around the forum for about a year or so, you'll know there have been a few attempts to cover best practices on how to make your code fast. More than a few of them were done by cyboryxmen, I believe, covering topics from cache locality to SIMD. While cache locality ( sometimes referred to as data locality ) and SIMD go hand in hand, there is another area where data locality plays an important role: memory reads and writes. This has also been covered on the forum a few times over the past couple of years. For instance, which container do you choose? List, Deque, Map, Vector?

The interfaces for each of these containers are pretty similar, so from a schedule point of view, any should be fine. From a read/write point of view, vector is probably going to be your best bet because all the data is contiguous. From an allocation point of view, perhaps deque would be best, because it allocates chunks of memory and, instead of the allocate/copy/deallocate cycle vector goes through when it grows, just keeps a linked list of memory chunks.

Ok, my ramblings weren't supposed to be about containers per se, and that's about as far as my knowledge goes anyway.

What I have come to say about efficiency is that it's crazy to have to think about what kind of efficiency you want. For instance, I've spent the last 2 weeks or so reading over the Box2D API. IT'S A FUCKING NIGHTMARE. I watch an enormous amount of YouTube videos on C++ development ( mostly from CppCon, Pacific++ and so on ). They have been discussing a pretty narrow field of topics over the past year or two, focusing on parallelism and concurrency. Data parallelism is usually talked about alongside SIMD or GPGPU programming, though concurrency is brought up when discussing GPGPU programming as well. Another recurring topic has been dynamic polymorphism and how to avoid it using std::variant, and yet another has been how you should design your code for "best" performance with data locality as the core of the talks.

So, I thought I'd say a few words. From what I've experienced, data locality is king when discussing performance. If you can get all your data in a contiguous block of memory, and you are just iterating over it from lower index to upper index ( or even in reverse as long as it's sequential ), then you are going to get the most out of your CPU cache.

However, as the challenges chili has been posing lately have proved, the algorithm you choose is going to beat out brute force in cases where the data set is huge. This is why different algorithms were invented, such as binary search and quick sort. Quick sort is on average the quickest sorting algorithm, but there are others like heap sort and merge sort which can be fairly quick as well. Once the data is sorted, a binary search can be used to find an element ( or determine it isn't there ) in O(log N) time, but that O(log N) doesn't account for CPU cache prefetching, so for smaller vectors it might be just as fast or faster to do a linear search.

I bring this up because there are a lot of factors to consider. Going back to the Box2D API, there is heavy use of pointers. Not only is the interface riddled with pointer parameters, but these pointers could be single components or a whole linked list being passed as one pointer. Some are even in/out parameters. These pointers are stored between various classes and data structures for "quick" access. The b2World class stores linked lists of b2Body and b2Joint pointers. The b2Body stores linked lists of b2Fixture and b2ContactEdge pointers, plus prev/next b2Body pointers, so b2Body itself acts as a linked-list node, and so does b2Contact ( which is a polymorphic class ). The b2ContactEdge structure stores its own prev/next pointers as well as a b2Body* and a b2Contact*, which again amounts to a linked list of polymorphic pointers. So, to say it's quite a nightmare to untangle is an understatement, and the fact that it's as fast as it is, is nothing short of amazing to me. I think the only thing that keeps it afloat is that all the calculations are done on local variables. That's right: in each function where calculations are done, the data is first copied to local variables, the calculations are performed, then the results are stored back through the pointer.

Now granted, you have to have some way of getting the data updated for each object, but a linked list can hinder performance severely, and it seems the original author decided node insertion/deletion was more important than data locality.

Another thing that probably helps is the b2BlockAllocator that is implemented. It's implemented in a way that any objects of equal size are allocated from the same pool, so there is a higher chance that multiple allocations, especially ones done sequentially, will be contiguous, which helps data locality. Because this isn't guaranteed though, you still have to do the pointer chasing that plagues all linked lists.

So what would be efficient here? Increase data locality by:
  • using std::variant as a substitute for dynamic polymorphism
  • using std::vector to guarantee elements are contiguous in memory

This only goes so far though. To really pack the data, you might even consider using structures of arrays, which means instead of an array of Vec2 structures, you'd have a single Vec2 structure holding std::vector<float> x, y;

Code: Select all

#include <vector>

struct Vec2
{
    void AddNewElement( float _x, float _y )
    {
        x.push_back( _x );
        y.push_back( _y );
    }
    void RemoveElement( int _index )
    {
        x.erase( x.begin() + _index );
        y.erase( y.begin() + _index );
    }
    float& X( int _index ) // Read/Write access to X - kind of like vec[ i ].x = 3.f, but vec.X( i ) = 3.f
    {
        return x[ _index ];
    }
    float X( int _index ) const // Read-only access to X
    {
        return x[ _index ];
    }
    float& Y( int _index ) // Read/Write access to Y
    {
        return y[ _index ];
    }
    float Y( int _index ) const // Read-only access to Y
    {
        return y[ _index ];
    }
private:
    std::vector<float> x, y;
};

Now, as far as passing around objects by value or by reference/pointer? I'm still on the fence on this one, really. For instance, if you passed the Vec2 from above by value, you'd be making copies of the two vectors stored in the struct. Probably not what you want, so you'd want to pass by reference or const reference. There is also move semantics to think about now, too. So if you moved this struct into the function, did the calculations, then moved it out as the return value, you'd probably get the biggest benefit of cache locality, and you'd get to actually modify the original container without passing a reference/pointer, which again could have negative performance costs.

Man, software designing is really a tough game. There's the hardware to consider, the interface in which you base your code on, the algorithms and the data structures.

So, off topic: if anyone is planning on writing a physics library, keep this in mind. Consider all this information ( no conclusions made here ) before you start hacking. Box2D is a pretty decent API from a feature standpoint and is pretty amazing with how quick it can be. The people who worked on it are pretty smart with mathematics, but my personal belief is that it could be better both in terms of design and performance, especially with all the updates to C++ since its initial release back in 2007, if I recall correctly.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Efficiency (not a tutorial)

Post by albinopapa » November 16th, 2018, 10:36 pm

Dang, after skimming back through that, I feel like I had a vomiting of thoughts. No direction or goal.

The original goal was to point out that there are so many different ways to "optimize" your code, and just as many ways to define what is efficient.

There's
  • the algorithms you choose
  • the amount of data you are working with
  • the hardware it is being run on
  • the layout of your data structures
  • the length of your instruction dependency chain
  • the compiler you choose
  • the compiler flags you choose
  • the instructions you use
  • the amount of code that can be run concurrently or in parallel ( no synchronization points )
And I'm sure there is more.

In a project a while back, a ray tracer, going from non-const to const had almost no effect, but in the Box2D project it seemed to have a small effect. I think the major difference was that with the ray tracer there weren't a ton of user-defined types being constructed. So it's possible that the real gain was just going from a two-step declare-then-initialize to a single declare-and-initialize using the initializer list.

An example:

Code: Select all

// Two step 
Vec2f position;
position.x = 32.f;
position.y = 16.f;

// or
MyClass object;
object.Initialize( params );

// Which should be 
Vec2f position = { 32.f, 16.f };

// and 
MyClass object( params );

// Of course, your MyClass::MyClass needs to assign the params to the MyClass members after the : 
// in the initializer list instead of the MyClass constructor function body in order for this to 
// "be more efficient " than just using an Initialize() function.

MyClass::MyClass( const ParamType& params )
     :
member( params.member ),
member_etc( params.member_etc )
{}

I already talked about algorithms and data set size.

I already talked about data locality in terms of fetching from a container, but there is also the issue of data locality when dealing with functions. Say you have a function that does some calculations on just a small portion of data from some class or struct, but this function is called 10X more than most other functions. This function might take a few parameters by const reference to avoid copying the medium-to-large classes/structs you pass in. Now, the cache line is 64 bytes on most x86/x64 CPUs ( a cache line, as I understand it, is the amount of data fetched from main memory at a time ), so if it has been a while since your last access to this data ( through the parameters ), the CPU has to fetch it from main memory before it can do the calculations. This can be slow, and if the size of the class data is greater than 64 bytes, the CPU will have to make several trips to memory.

One way of reducing the number of stalls is to make local copies of just the data you need, thus paying for the memory access at the beginning of the function, doing the calculations on data that should sit entirely in the CPU's L1 or L2 cache, then assigning the results back into the class. This is seen mostly in member functions, because with a global function you'll more than likely be using the return value instead. So this brings up another issue: do you continue passing by const reference, or pass by value?

This brings us to data layout. Most suggestions I've run across start from the structure-of-arrays idea I mentioned previously, but some suggest breaking your data into smaller structs and composing your object that way. For physics data: a struct with mass, position and velocity, and then another struct that holds the state of the object. You'd pass just the physics data to the physics routines, and it might be small enough to pass by value, do the calculations, then return by value, reassigning the returned value to the original class's physics struct.

Speaking of parameters, if I recall, it's common for calling conventions to pass the first four 32-bit values in registers; after that, arguments are pushed onto the stack. So any more than four floats or ints ( or a single RectF/RectI or Vec4 ) would be pushed onto slower memory, though not the slower main memory. So it is usually advised, not only for clarity but for supposed efficiency, that you make structs to store parameters, then just fill the struct and pass by const reference ( or plain reference if mutability is needed ). Since the parameter struct is usually filled in at the call site, it's a good bet the data will still be in cache, thus quick to retrieve and use.

These last two approaches could possibly be combined for efficient SIMD calculations. If you don't have the data organized as a structure of arrays, you could copy just the data you need into a struct of arrays, then pass the struct to a function to calculate the results. Take the results and pass them back out to their destinations. You get data locality, the data is hot in CPU cache, and you get SIMD parallelism; sounds good. You just have to make sure there is enough work to do in the calculation function to offset the copying of the data, the loads into SIMD registers, the stores from the SIMD registers back to memory, and the copying of the data back to its owners.

That last technique can also be used for efficient concurrency. Say you copy out all your data into these SIMD data structs and use them as job packets. You queue up some packets, make a call to std::async, get the future and pack it away into an array. Once you've reached the thread limit of the system, retrieve the results and copy them to their destinations. Repeat the process until all the work is done.

As you can see, it's hard to nail down what is efficient. There is what is efficient for the hardware, what is efficient for the compiler, and then what is efficient for you, the programmer. Trying to optimize your code so that the CPU is happy and the compiler isn't disgusted takes a lot of thinking and planning, especially if you also want your code to be readable, reusable and/or maintainable. So it's not very efficient from a development standpoint, but it might be worth the effort depending on your program and whether the gods are happy.

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: Efficiency (not a tutorial)

Post by albinopapa » November 16th, 2018, 10:50 pm

Honestly, this whole thread was to point out that I'm so confused about what is efficient, because some things seem counterproductive while still sometimes being more productive, such as passing by value so you get "hot data" in cache even though it's making a copy. Or the functional programming approach of functions not changing state but creating new objects: instead of a class holding a first name and last name with SetFirstName()/SetLastName() functions, you'd just reassign:

Code: Select all

// Functional programming says, do this:
if( wife.GotMarried() )
{
     wife = Person( wife.GetFirstName(), husband.GetLastName() );
}

// Instead of this:
if( wife.GotMarried() )
{
    wife.SetLastName( husband.GetLastName() );
}
I suppose this is best suited to smaller structs like Vec2/3/4.

I suppose the better approach would be:

Code: Select all

if( wife.GotMarried() )
{
     wife = Person( std::move(wife), husband.GetLastName() );
}
Person::Person( Person&& _other, std::string _newLastName )
     :
gotMarried( false ), 
gotDivorced( false ),
firstname( std::move( _other.firstname ) ),
lastname( std::move( _newLastName  ) ),
age( _other.age )
{}
