Taking a break from std::variant and templates

Post by **albinopapa** » October 25th, 2018, 12:58 am

Well, the title is only slightly misleading.

I couldn't quite figure out how to make the ECS pattern using std::variant because of all the circular dependencies that would ensue or go the route of one massive declaration header...don't like either option. So I wondered if it could be used in another context where compile time knowledge of types would be nice and thought of chili's favorite 2D collision API...Box2D.

Now, not so fast, Box2D is riddled with double-linked lists, custom allocators and a tree structure, so untangling it isn't going to come easy. However, I needed some benchmark numbers of before and after, so I created a scene of about 900 boxes to get my framerate to around 30fps. This way any changes would be noticeable.

I've been doing some easy changes like:

changing the #defines to constexpr auto, no performance gain there obviously.
replacing .Set() math functions with constructor calls
using default in-class values and removing .SetZero() functions
const-ing everything I can, even if it means reorganizing the code.
using either constructors or braced initialization
replacing .Initialize functions with constructor calls
replace the b2Vec2::Normalize function so it returns a normalized b2Vec2 instead of the length
where needed, I had to create a temp delta variable to use the dot product of (delta, normal ) to get the length where b2Vec2::Normalize would have returned the length to avoid another sqrt call
all math functions have been declared as noexcept
all math functions that don't change the state of the object, are marked constexpr ( though, they never used in constexpr context )

Well, the last 3 were all kind of done together, so I'm not entirely sure which was the biggest gain, but I've gone from 30 fps, to about 35 fps after all that, while the const-ing everything which was helped by the removal of the .Set functions only added about 2 fps.

So, when chili talks about const correctness, it is not just safer, it is much more efficient.
You can const something if you have to initialize it later on in the function, so use constructors instead of an initialize function.

There are no exceptions in the library, so I could and probably will mark everything as noexcept to see what effect if any it has on the performance.

The next steps will be:

get rid of the in/out params and have the functions return by value
I have noticed a lot of code repeat, I might try consolidating even if it means using lambdas
obviously I'd like to eventually do away with the custom allocators and linked lists and runtime polymorphism and trade it all in for vectors of shape variants or fixture variants and so on, but one step at a time as I morph this thing slowly so I don't mess it up and have to start over.

Post by **chili** » October 27th, 2018, 9:02 am

Jeez, this sounds like a real project. I've personally always thought that bodies and such could benefit from some refcount resource management or something, but never went as far as to start modifying the actual box2d codebase

afaik b2d doesn't take advantage of much in the way of vectorization or MT. Lot of room for improvement there if it can be done right. Seen a project or two that attempted to MT the physics timestep, but don't know if they ever went anywhere.

Post by **albinopapa** » October 28th, 2018, 3:54 am

Yeah, there are a lot of notes both about design choices like about why they used a custom allocator, a dynamic tree and what not, some math notes and choices of algorithms, it's kind of a neat look into what went through the authors' minds.

I guess the project was something that was taken over by the current author to make it what it is now, so there are a few places where there is left over C style of code using pointers for in/out params where a reference would and should have been used.

I say that in the sense that these params weren't expected to be NULL and they were only used as single instances and not arrays of instances.

I'm finding one difficult thing about converting in/out params to return by value params is not being sure or clear whether the pointer being passed in was partially filled in by the calling function, so there is a lot of back-tracking, only made more difficult because of dynamic polymorphism. Using "View Call Hierarchy" was pretty useful for functions that aren't in derived classes.

There is no attempt on SIMD in the Box2D project iteself, nor MT aside from "locking" the time step using just an 'if' check. Instead, the authors have gone for avoiding cache misses by caching values such as position/rotation transforms and mass data ( mass, normal and inertia ).

It's crazy to see functions varying form mostly 200 to 600 lines of code. There are a lot of 2 to 20 lines functions, but they are mostly accessors or simple collision tests. Calculating the correction on some of these things seems pretty detailed. Though, some functions are just if/else or switch/case of the enum types ( circle,polygon,edge,chain,etc... ).

As for the reference counted resource, I've thought the same thing recently as I looked through the code base for the 40th or 50th time trying to follow all the pointers. std::shared_ptr would be cool, but they use two custom allocators, one is a small block allocator ( sizes range from 16 bytes to 640 bytes ) and the other is a stack allocator, which I haven't looked at too much to see how it's used or implemented. It uses malloc under the hood for allocating the pool and pretty sure the stack and anything larger than the 640 byte limit of the block allocator.

The small block allocator was kind of a neat little idea I suppose. Allocate a pool of memory, from that pool, divide the pool into one of the sizes ( say 16 byte chunks, giving you a total of 1024 chunks to allocate from that pool ) and do that for each block size, where each block size has their own pool of 16KB to allocate from. Though, I can see an issue of cache locality arising from say two separate objects of the same size being allocated from the same pool.

He keeps track of allocations using a doubly linked list, adding new allocations to the front of the list. Deallocations are stored in a doubly linked free list.

Needless to say, I'm not thrilled about the design choices here, especially when he says that new/delete were not an option because of the performance hit it would have.

I don't know how sound it is yet, but I have made a custom shared pointer resource thing lol. Basically, it uses three std::vectors. One for the objects ( you can call this my pool of memory ) one vector of pointers to each item in the resource vector, and the third is a reference counter for the resource pointers. So the objects are contiguous, the pointers are contiguous and the reference counters are contiguous, and you only interact with the shared resource, except when you use the custom iterators, which basically just the raw pointers to the resources directly.

I'll have to try using it on this project to really give it a test run.

Post by **albinopapa** » October 29th, 2018, 7:30 am

Dang man, almost all the core classes act as both object and linked-list.

b2Body for instance has previous and next b2Body pointers, almost sure these refer to the World::m_bodyList, but am not 100% yet.

I'm pretty sure he does this to avoid passing around the world object to access the body list, but damn it's confusing.

Sigh, why do I do this to myself?

You know, it would probably be easier to start from scratch than to rework the code as is. I would just need to copy all the math stuffs and have the same basic layout/structure of the project. Perhaps I'll try that route as I'm running myself in circles trying to figure out what change might break the original code.

Post by **albinopapa** » November 18th, 2018, 12:17 pm

I feel I'm making quite a bit of progress actually. I don't understand the math, but I do understand code ( mostly ). It's crazy how much you can remove once you separate some responsibilities.

Even though I know a linked list isn't the most efficient, I implemented a linked list class and am removing inline list handling such as node insertion and removal, forcing the classes to use RAII and swapping all dynamic polymorphism and with std::variant.

I'm trying to change the custom block allocator to a polymorphic memory resource, but VS is being a dick right now. I have #include'd <memory>, I have change the project properties to the latest SDK and set the language standard to C++17 ( also tried the latest draft ) and it's telling me that std namespace has no member memory_resource.

Earlier, I found that it was only allowing C++14 in debug mode ( found that out when I checked the _HAS_CXX17 macro, was set to 0 ). That's solved, switched back to C++14 manually, then save/exit go back to C++17 save/exit, but still no std::pmr::memory_resource access.

I think I've had this issue before when having the mainstream version and the preview versions installed side-by-side ( yes, I do have them both installed, again ) and so I may have to uninstall the preview and hopefully that will solve the issue. I may end up having to uninstall both and reinstall the release version, man I hope not.

There's a light at the end, it's faint, but it's there.

Post by **albinopapa** » November 18th, 2018, 7:05 pm

Hmm, seems the best advice of the day is R.T.F.M.

std::pmr::memory_resource is defined in <memory_resource>, imagine that right? Was thinking since it had to do with memory, you'd include memory, but no. It gets it's own header.

Post by **chili** » November 20th, 2018, 7:18 am

I haven't messed around too much with custom allocators. Not at all with pmr, but sounds interesting

Post by **albinopapa** » November 20th, 2018, 10:20 am

I'm still feeling my way around with allocators with the PMR stuff. I'm kind of looking through the STL as a reference, but every step is an unknown for me.

In Box2D, it uses a single class for allocations ( b2BlockAllocator ) and am not entirely sure if I might be able to use separate instances or am restricted. For instance, would I need to do as the OA did and create a derived PMR in b2World, then pass around the pointer to any class that needs it to instantiate their containers ( custom containers b2Buffer and b2List created by me )? Would it be better to make it a global/singleton? Or, could each custom container create and store it's own resource ( b2BlockMemoryResource ) and creating an allocator from that?

Currently, I have chosen the first route of creating a single instance and passing around a pointer to the resource. My reasoning is possibly having better data locality. Since the allocations are done by size, it's possible that all b2Body objects will allocate from the same pool, but in b2Body there are lists of these objects called b2Fixture. Since they are all the same size, all b2Fixture objects will be allocated from the same pool. Now, if b2Body and b2Fixture are the same size ( or actually near the same size as the blocks are predetermine in mostly multiples of 32 bytes: 200 bytes and 220 bytes would allocate from the same pool ) then there is going to be some issues with locality.

Not really sure this has much affect though since bodies and fixtures are stored at random based on their collision as well. If two bodies collide, then pointers to those bodies are stored in b2Contact, then those bodies fixtures are accessed through either accessor functions or directly through friendship. So locality can get pretty bad and is only reduced by copying all related data for a particular function to locals then stored back once all the calculations are done.

It's crazy to see all the tricks done to keep track of these objects. In most cases, a pointer is copied and an index is stored in that object of where it lives in some data structure.

Aside from the memory and layout issues, I'm struggling with a design decision, I guess redesign decision. Since I'm wanting to use std::variant, I a few options for implementing the interface.

1) Have the public interface incorporate ALL private interfaces and just have the functions do nothing where they do not apply.
2) Keep the public interface consistent with the original Box2D API, but copy all base class data to their derived classes.

Option 1 might be confusing as you might think all functions apply to all sub-types, when in fact they do not.
Option 2 would be less confusing as far as interface, but accessing the sub-types would require the user to also use the visitor pattern.

So as an example, b2Joint is the original base class and there are 11 derived joints.

Each derived joint had a set of overloaded functions, these would be no problem to emulate using simple overload resolution. The issue is these 11 other joints have their own functions, mostly get/set accessors. With option 1, some getters would return garbage ( b2Vec2( 0.f, 0.f ) ) and setters would basically be a no-op. With option 2, I could return the underlying joint variant and leave it up to the user to decide how to handle it. Either way, not very useful.

I'm kind of thinking of a third option though. What if, I also had a base class that allowed for dynamic polymorphism. I could use the std::variant to allow for storing in an std::vector so I'd get the data locality there, and the flexibility of allowing the user to be able to cast the base pointer to the derived type based on some enumerator.

This means a possible boost in overall performance depending on different circumstance, and still relying on current known behavior. Users already have to do this with the current implementation, and surely by now we all know that the biggest problem with dynamic polymorphism isn't the virtual functions but the lack of data locality due to randomly allocated spots in memory.

The only other thing I can think of is returning to the user a pointer to the underlying sub-type through a templated function during creation. This way, returning the underlying sub-type is guaranteed since it was just created. This would work, but it leaves it up to chance that if they destroy the b2Joint object, they also nullify the pointer to the underlying sub-type. I guess this wouldn't be an issue as long as they didn't store it and only used it to set parameters just after creation.

Ultimately, I think I'm leaning toward the combined static/dynamic polymorphism approach, performant, flexible and easy to use.

Now, the hard part is going to be untangling the other parts. Surely there is a way to consolidate all the needed data so that it's more accessible than having to copy around so many pointers and decouple a lot of the code. As I mentioned before, there's a lot of forward declarations and friendships going on.

Planet Chili

Taking a break from std::variant and templates

Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates

Re: Taking a break from std::variant and templates