SIMD library attempt inf

Post by **albinopapa** » September 18th, 2020, 2:47 am

I brought this up a long time ago, and again, I've gotten a lot further than I have in the past. I wanted to use the template system of C++ to make a sort of programming language. Well, I'm not sure if that is an accurate description, but "it is what it is" lol.

Code: Select all

using lerp_f1 = function<
	float, 
	param_list<parameter<0, float>, parameter<1, float>, parameter<2, float>>,
	function_body<subf<1, 0>, mulf<1, 2>, addf<0, 1>>
>;

int main( int argc, char* argv[] ) {
	// data here for now is a char array, so just a stack allocated buffer
	auto* dst = reinterpret_cast< float* >( data );

	// filling the buffer with some values
	dst[ 0 ] = 100.f;
	dst[ 1 ] = 100.f;
	dst[ 2 ] = 100.f;
	dst[ 3 ] = 100.f;

	dst[ 4 ] = 200.f;
	dst[ 5 ] = 200.f;
	dst[ 6 ] = 200.f;
	dst[ 7 ] = 200.f;

	dst[ 8 ] = .5f;
	dst[ 9 ] = .5f;
	dst[ 10 ] = .5f;
	dst[ 11 ] = .5f;

	// loading those values into pseudo registers ( actually calls _mm_loadu_si128 or _mm_loadu_ps )
	// The first param is a byte offset into the data buffer, 
	// The second param is which 'register' to load the data to
	execute_instructions<loadf<0, 0>, loadf<16, 1>, loadf<32,2>>();

	// a lerp test pseudo function defined above
	lerp_f1::exe();

	// storing values back into data buffer
	// first param is the register to store 
	// second param is offset into data buffer
	execute_instructions<storf<0, 0>>();

	// just to check/verify the result
	float lerped = dst[ 0 ];
	std::cout << lerped << '\n';

	return 0;
}

All the calling functions are parameterless and return void, and the data buffer and pseudo registers are global, so when compiled in Release mode, all function calls are inlined and I'm left with mostly just the intrinsics ( loadu, loadu, loadu, sub, mul, add, storeu ). There is very little overhead.
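A rough sketch of how that execution model could look. This is not the actual library code: the globals `g_data` and `g_fregs`, the instruction bodies, and `run_lerp_demo` are assumptions made up for illustration. Only the `_mm_*` calls are real SSE intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_storeu_ps, _mm_sub_ps, ...

// Assumed globals: one raw byte buffer and a small bank of pseudo registers.
alignas(16) static unsigned char g_data[64];
static __m128 g_fregs[4];

// loadf<ByteOffset, Reg>: buffer -> register.
template<std::size_t ByteOffset, int Reg>
struct loadf {
    static void exe() {
        g_fregs[Reg] = _mm_loadu_ps(reinterpret_cast<const float*>(g_data + ByteOffset));
    }
};

// storf<Reg, ByteOffset>: register -> buffer.
template<int Reg, std::size_t ByteOffset>
struct storf {
    static void exe() {
        _mm_storeu_ps(reinterpret_cast<float*>(g_data + ByteOffset), g_fregs[Reg]);
    }
};

// Binary ops write the result back into the left register.
template<int L, int R> struct subf { static void exe() { g_fregs[L] = _mm_sub_ps(g_fregs[L], g_fregs[R]); } };
template<int L, int R> struct mulf { static void exe() { g_fregs[L] = _mm_mul_ps(g_fregs[L], g_fregs[R]); } };
template<int L, int R> struct addf { static void exe() { g_fregs[L] = _mm_add_ps(g_fregs[L], g_fregs[R]); } };

// Runs each instruction in order (C++17 fold expression).
template<typename... Instructions>
struct function_body {
    static void exe() { (Instructions::exe(), ...); }
};

// Hypothetical demo: lerp( a, b, t ) on four lanes, returns lane 0.
inline float run_lerp_demo(float a, float b, float t) {
    const float in[12] = { a, a, a, a, b, b, b, b, t, t, t, t };
    std::memcpy(g_data, in, sizeof in);
    function_body<
        loadf<0, 0>, loadf<16, 1>, loadf<32, 2>,
        subf<1, 0>, mulf<1, 2>, addf<0, 1>,
        storf<0, 0>
    >::exe();
    float out[4];
    std::memcpy(out, g_data, sizeof out);
    assert(out[0] == out[3]);  // every lane got the same inputs
    return out[0];
}
```

Since every `exe()` is a tiny static function with compile time indices, an optimizer has everything it needs to inline the whole chain, which fits what I saw in the disassembly.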

I'm still trying to figure out a few things to make it less cumbersome, but for now, you have to set all the registers and offsets manually. Some of that will stay, but I think once I figure out how to make and pass structures, some of that will go away.

For instance, the lerp function body:

Code: Select all

function_body<subf<1, 0>, mulf<1, 2>, addf<0, 1>>

Here, I'm saying:

subtract reg0 from reg1 assign to reg1: param1 = param1-param0
multiply reg1 with reg2 assign to reg1: param1 = param1*param2;
add reg1 to reg0 assign to reg0: param0 = param0+param1;

or: param0 = param0 + ( param1 - param0) * param2;
This function is pretty simple and since each parameter in param_list has the register position stored with it, I could probably figure out a way to just use those and make the pseudo function more flexible.

I'm getting close to figuring out a way of simulating structures. The part I'm getting hung up on is automatically generating offsets for members.

Code: Select all

template<std::size_t, typename...Ts> struct member_list;

template<std::size_t offset_, typename T, typename...Rest>
struct member_list<offset_, T, Rest...> {
	using member_type = member<offset_, T>;
	using type = list<
		member_type,
		typename member_list<offset_ + T::size, Rest>::member_type...
	>;
};

template<typename...Members>
struct structure
{
	using member_list = member_list<0, Members...>::type;
};

I was hoping this line here:

Code: Select all

	using type = list<
		member_type,
		typename member_list<offset_ + T::size, Rest>::member_type...
	>;

would add current offset and current unpacked T::size and create the next one and so on for each parameter in the pack, however, I really only get:
list<member<0, float3>, member<12, float3>, member<12, float3>>

So I'm kind of confused on where to go from here. I need to recursively create members and each member needs to be current offset + T::size apart. All my types are going to be 4 byte aligned, so I won't have to worry about alignment.
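For what it's worth, the repeated 12 comes from the expansion itself: `offset_` and `T` are not part of the pack, so every expanded element is instantiated with the same `offset_ + T::size`. One way around it, sketched below, is to recurse and prepend one member at a time. The `prepend` helper, the `make_member_list` name, and the stand-in `float3`/`uint1` types are assumptions for illustration, not the real library types.

```cpp
#include <cstddef>
#include <type_traits>

// Stand-ins for the library's types; only `size` (in bytes) matters here.
struct float3 { static constexpr std::size_t size = 12; };
struct uint1  { static constexpr std::size_t size = 4; };

template<std::size_t Offset, typename T> struct member {};
template<typename... Ms> struct list {};

// Prepend one element to an existing list.
template<typename M, typename List> struct prepend;
template<typename M, typename... Ms>
struct prepend<M, list<Ms...>> { using type = list<M, Ms...>; };

// Primary template doubles as the terminating case: no types left.
template<std::size_t Offset, typename... Ts>
struct make_member_list { using type = list<>; };

// Recursive case: emit member<Offset, T>, then continue at Offset + T::size.
template<std::size_t Offset, typename T, typename... Rest>
struct make_member_list<Offset, T, Rest...> {
    using type = typename prepend<
        member<Offset, T>,
        typename make_member_list<Offset + T::size, Rest...>::type
    >::type;
};

template<typename... Members>
struct structure {
    using members = typename make_member_list<0, Members...>::type;
};

static_assert(std::is_same_v<
    structure<float3, float3, uint1>::members,
    list<member<0, float3>, member<12, float3>, member<24, uint1>>>);
```

The result is one flat list, and each offset is the running sum of the sizes before it.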

Any suggestions from the template pros?

Post by **albinopapa** » September 19th, 2020, 2:14 am

After realizing that pack expansion doesn't work as I originally thought, I tried inheritance.

Code: Select all

template<typename...Ts> 
struct member_list {
	using type = member_list<Ts...>;
};

template<std::size_t offset_, typename T, typename...Ts>
struct make_member_list : make_member_list<offset_ + T::size, Ts...> {
	using type = member_list<
		member<offset_, T>,
		typename make_member_list<offset_ + T::size, Ts...>::type
	>;
};

template<std::size_t offset_, typename...T>
struct make_member_list {
	using type = member_list<member<offset_, T>>;
};

A couple of issues come up though. One is now the type in make_member_list is
member_list<
member_list<member<0, float3>>,
member_list<member<12, float3>>,
member_list<member<24, uint1>>,
<error_constant>
>

I have the terminating case so I'm really not understanding where the issue is.

Post by **albinopapa** » September 19th, 2020, 7:06 pm

woohoo, thanks to Shirley Zekiela on chili's discord server, the issue has been resolved.

Post by **albinopapa** » September 21st, 2020, 10:23 am

Just a few more things to figure out and I think I'll have something usable.

As mentioned, thanks to Shirley from discord, structures are at least declarable.
Functions are getting close. I was able to create a small lerp function, only the three instructions, but it works.

The intrinsic types of this compile time language are:

float1
float2
float3
float4
int1
int2
int3
int4
uint1
uint2
uint3
uint4

Instead of introducing more integer types, like char and short, I'm probably going to just implement bit operations like &, |, <<, >>.

The functions must be declared with return types, but they are currently not used for anything. I'm sure they will eventually be used to get the register index where the result is stored.

Code: Select all

	lerp<float3, param_list<
		parameter<0, float3>, 
		parameter<1, float3>, 
		parameter<2, float3>
		>>::exe();

Currently, this is how to execute a function, pretty much the same as any other instruction, which is the goal actually.

To create a function at the moment:
using func = function<float, param_list<parameter<>, ...>, function_body<instruction<>,...>>;

param_list requires a list of parameter<> types.
The parameter type requires two pieces of information: the register index and the type of parameter. I'd like to make this part automated as well, and just have the types passed in, but that will have to wait a little while longer.

function_body requires a list of instructions.
The instructions are currently the four basic math operations add,sub,mul,div on either floating point or integer types.
The instruction types also need two parameters, the source registers for the operation. For instance, to add two floats that are being stored in registers 2 and 3: addf<2,3>. The results are always stored back into the left register index, so in this example the result is stored in register 2.

Literals are a little tricky since you can't pass floats as template parameters, so they have a special syntax, which I think I made backwards.

The instruction is liti<type, index> or litf<type, index> for integral and floating point values respectively. To create a literal, you have to make it a type:

struct pi_type{ static constexpr float value = 3.1415926f; };

Then pass it in to litf:
using pi_literal = litf<pi_type, 0>;
This calls _mm_set_ps1( pi_type::value ) and loads it into register.fslots[0].
I'll have to make it so you can use maybe an array for float2-float4 or int1-int4 types.
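A possible shape for the array idea, as a sketch only. `freg` is a plain std::array standing in for the SSE register, and `up_type`, `splat_literal`, and `lanes_literal` are hypothetical names; the only part taken from the library is the pattern of wrapping a value in a type.

```cpp
#include <array>
#include <cstddef>

using freg = std::array<float, 4>;   // stand-in for one 4-lane SIMD register

// Scalar literal, as above: a type wrapping one constexpr value.
struct pi_type { static constexpr float value = 3.1415926f; };

// Hypothetical vector literal: a type wrapping a constexpr array of lanes.
struct up_type { static constexpr std::array<float, 3> value{ 0.f, 1.f, 0.f }; };

// Broadcast a scalar literal to every lane.
template<typename Literal>
constexpr freg splat_literal() {
    freg r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = Literal::value;
    return r;
}

// Copy an array literal lane by lane; unused lanes stay zero.
template<typename Literal>
constexpr freg lanes_literal() {
    static_assert(Literal::value.size() <= 4, "literal has too many lanes");
    freg r{};
    for (std::size_t i = 0; i < Literal::value.size(); ++i) r[i] = Literal::value[i];
    return r;
}

static_assert(splat_literal<pi_type>()[2] == pi_type::value, "all lanes equal");
static_assert(lanes_literal<up_type>()[1] == 1.f, "lane 1 comes from the array");
```

The same two helpers could back a float2 through float4 literal without changing the litf<type, index> calling syntax.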

To summarize:
intrinsic types: float1 - float4, int1 - int4, uint1 - uint4
literals: currently only supports single value literals
functions: currently a lot of manual setup and only with intrinsic types
structures: untested, I'll have to iterate through the member_list and load each member

There are a few things I haven't attempted yet, conditional blocks and loops are two that I'm putting off for now.

Another design decision I need to think about is maybe a stack. I have something that might work as a compile time stack, but I haven't fully thought about it yet. The reason for the stack would be to store stuff like register data before entering a function and restoring it afterward.
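To make the save and restore idea concrete, here is a minimal sketch. Everything in it is an assumption: `freg` is a scalar stand-in for a SIMD register, and `call<ResultReg, Body>` is a hypothetical wrapper, not an existing instruction in the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using freg = std::array<float, 4>;       // stand-in for one SIMD register
static std::array<freg, 4> g_regs{};     // assumed global register bank

// Left register receives the lane-wise sum.
template<std::size_t L, std::size_t R>
struct addf {
    static void exe() { for (std::size_t i = 0; i < 4; ++i) g_regs[L][i] += g_regs[R][i]; }
};

template<typename... Instructions>
struct function_body {
    static void exe() { (Instructions::exe(), ...); }
};

// call<ResultReg, Body>: snapshot the bank, run the body, then restore every
// register except the one holding the result.
template<std::size_t ResultReg, typename Body>
struct call {
    static void exe() {
        const auto saved = g_regs;               // copy lives on the real stack
        Body::exe();
        const freg result = g_regs[ResultReg];
        g_regs = saved;
        g_regs[ResultReg] = result;
    }
};

// Hypothetical demo: the body clobbers register 1 while computing into register 0.
inline bool demo_call_preserves_registers() {
    g_regs[0] = { 1.f, 1.f, 1.f, 1.f };
    g_regs[1] = { 2.f, 2.f, 2.f, 2.f };
    using body = function_body<addf<1, 1>, addf<0, 1>>;  // r1 = 4, r0 = 5
    call<0, body>::exe();
    assert(g_regs[0][0] == 5.f);                 // result survives
    return g_regs[1][0] == 2.f;                  // clobbered register restored
}
```

Copying the whole bank is the blunt version; a smarter one would only save the registers the body actually writes.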

As long as I'm making progress, and don't run into too many hurdles, I shouldn't have a problem actually finishing this crazy idea of mine.

Post by **albinopapa** » September 21st, 2020, 11:50 pm

The reason for the stack would be to store stuff like register data before entering a function and restoring it afterward.

After running a test, doing a form of alpha blending ( not the optimized version, but more similar to the x86 version with shifts and ands ), I've come to realize that VS and by extension the VS team is pretty damned smart. Looking through the disassembly, a lot of the code is inlined. The point that I want to bring up is it seems that VS does the same as I was wanting to do with the stack, before calling a function, copy the registers to the stack, do the function, copy the stack back to the registers with the exception of the register where the function result will be.

Code: Select all

		using type = function_body<
			liti<uchar_max, V::src0 + 1>,
			cpyi<V::src0 + 2, U::src0>,		// -> left_blue
			cpyi<V::src0 + 3, U::src0>,		// -> left_green
			cpyi<V::src0 + 4, U::src0>,		// -> left_red
			cpyi<V::src0 + 5, U::src0>,		// -> left_alpha
			cpyi<V::src0 + 6, V::src0>,		// -> right_blue
			cpyi<V::src0 + 7, V::src0>,		// -> right_green
			cpyi<V::src0 + 8, V::src0>,		// -> right_red
			rshi<V::src0 + 5, 24>,			// -> left_alpha
			rshi<V::src0 + 4, 16>,			// -> left_red
			rshi<V::src0 + 3, 8>,			// -> left_green
			rshi<V::src0 + 8, 16>,			// -> right_red
			rshi<V::src0 + 7, 8>,			// -> right_green
			andi<V::src0 + 2, V::src0 + 1>,	// -> left_blue
			andi<V::src0 + 3, V::src0 + 1>, // -> left_green
			andi<V::src0 + 4, V::src0 + 1>, // -> left_red
			andi<V::src0 + 5, V::src0 + 1>, // -> left_alpha
			andi<V::src0 + 6, V::src0 + 1>, // -> right_blue
			andi<V::src0 + 7, V::src0 + 1>, // -> right_green
			andi<V::src0 + 8, V::src0 + 1>, // -> right_red
			subi<V::src0 + 1, V::src0 + 5>, // -> right_alpha
			muli<V::src0 + 2, V::src0 + 5>, // -> left_blue
			muli<V::src0 + 3, V::src0 + 5>, // -> left_green
			muli<V::src0 + 4, V::src0 + 5>, // -> left_red
			muli<V::src0 + 6, V::src0 + 1>, // -> right_blue
			muli<V::src0 + 7, V::src0 + 1>, // -> right_green
			muli<V::src0 + 8, V::src0 + 1>, // -> right_red
			addi<V::src0 + 1, V::src0 + 5>, // -> result_alpha
			addi<V::src0 + 2, V::src0 + 6>, // -> result_blue
			addi<V::src0 + 3, V::src0 + 7>, // -> result_green
			addi<V::src0 + 4, V::src0 + 8>, // -> result_red
			rshi<V::src0 + 1, 8>,			// -> result_alpha
			rshi<V::src0 + 2, 8>,			// -> result_blue
			rshi<V::src0 + 3, 8>,			// -> result_green
			rshi<V::src0 + 4, 8>,			// -> result_red
			lshi<V::src0 + 1, 24>,			// -> result_alpha
			lshi<V::src0 + 4, 16>,			// -> result_red
			lshi<V::src0 + 3, 8>,			// -> result_green
			ori<V::src0 + 1, V::src0 + 4>,	// -> result ar
			ori<V::src0 + 3, V::src0 + 2>,	// -> result gb
			ori<V::src0 + 1, V::src0 + 3>,	// -> result argb
			cpyi<0, V::src0 + 1>			// -> copy to register 0
		>;

36 pseudo instructions -> 1 real instruction each,
plus 6 pseudo instructions -> 12 real instructions each ( the custom muli function )
total of 42 pseudo instructions -> 108 real instructions

That's the list of instructions used and a lot of them are inlined. The muli<> instruction uses a custom multiplication function since the SSE2 version multiplies only two of the elements at a time. I just do the two multiplies, shuffle the elements, then pack them back into a single SIMD register. This is also inlined. There is a call to this function_body<>::exe() on occasion, but when I try to step into it, I just go back to the beginning of the loop, so I'm not sure if it's really a call or just a jump to the beginning of the loop, since debugging in Release isn't really reliable.
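For reference, one common way to build a four lane 32 bit multiply out of SSE2 is sketched below. It is not the muli<> implementation from this project (which comes out at 12 instructions); `mul_epi32_sse2` and `mul_lane` are names chosen for the example. The intrinsics themselves are standard SSE2.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Lane-wise 32-bit multiply, keeping the low 32 bits of each product.
inline __m128i mul_epi32_sse2(__m128i a, __m128i b) {
    // _mm_mul_epu32 multiplies lanes 0 and 2 only, giving two 64-bit products.
    const __m128i even = _mm_mul_epu32(a, b);
    // Shift lanes 1 and 3 down into positions 0 and 2, then multiply those.
    const __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Gather the low halves of each product and interleave back into order.
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}

// Hypothetical helper for checking one lane of the result.
inline std::int32_t mul_lane(const std::int32_t (&a)[4], const std::int32_t (&b)[4], int lane) {
    assert(lane >= 0 && lane < 4);
    std::int32_t out[4];
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), mul_epi32_sse2(va, vb));
    return out[lane];
}
```

The unsigned multiply is fine for signed inputs here because only the low 32 bits of each product are kept.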

I don't have any comparisons with x86 performance, I need something a little more complex.
I don't have any comparisons with a hand rolled function using straight intrinsics, but from what I can remember from previous experiences, a hand rolled function seems to have a lot of copy from mem to register, back to mem then back to register. I'll have to test it out and see if that happens still.

All in all, I'm getting excited. Mostly because it's working as expected. I've tried this in the past, but didn't get very far because of lack of knowledge and experience, so it's nice to see things coming together.

NOTE: The corresponding x86 code

Code: Select all

// get left alpha from color
auto left_alpha = ( color1 >> 24 ) & 0xFF;  // 2 instructions
auto right_alpha = 255 - left_alpha;  // 1 instruction

auto left_red = ( color1 >> 16 ) & 0xFF;  // 2 instructions
auto left_green = ( color1 >> 8 ) & 0xFF;  // 2 instructions
auto left_blue = ( color1 & 0xFF );  // 1 instruction

auto right_red = ( color2 >> 16 ) & 0xFF;  // 2 instructions
auto right_green = ( color2 >> 8 ) & 0xFF;  // 2 instructions
auto right_blue = ( color2 & 0xFF );  // 1 instruction

left_red *= left_alpha;  // 1 instruction
left_green *= left_alpha;  // 1 instruction
left_blue *= left_alpha;  // 1 instruction
right_red *= right_alpha;  // 1 instruction
right_green *= right_alpha;  // 1 instruction
right_blue *= right_alpha;  // 1 instruction

auto result_red = left_red + right_red;  // 1 instruction
auto result_green = left_green + right_green;  // 1 instruction
auto result_blue = left_blue + right_blue;  // 1 instruction

result_red >>= 8;  // 1 instruction
result_green >>= 8;  // 1 instruction
result_blue >>= 8;  // 1 instruction

auto result_alpha = left_alpha + right_alpha;   // 1 instruction

result_alpha <<= 24;  // 1 instruction
result_red <<= 16;  // 1 instruction
result_green <<= 8;  // 1 instruction

auto result = result_alpha | result_red | result_green | result_blue;  // 4 instructions

33 instructions for 1 alpha blended color.
To match SIMD, 33x4 = 132 instructions.

Not sure how many assembly instructions the x86 version would use though. It can cheat though using aliasing.

The reason my version has so many is the limitations I'm putting on the project. I don't want any ops requiring special knowledge of SSE instructions, so packing and unpacking is out. I kind of figure I'll mostly be using this for graphics filters. I kind of want to also try using it for some collision stuff, but not looking forward to writing all those instructions.

Post by **albinopapa** » September 23rd, 2020, 6:52 am

Decided to try and implement something neat before getting too much further since it might change a few underlying mechanisms.

While working on a 'call' instruction I thought through what it might do. My initial thoughts came up with handling of the copying of the register bank and the probable stack. After a while, I realized that at this point I might be able to run a check on CPU capabilities and run the highest instruction set a person's CPU can handle. For instance, my old Phenom II only had up to SSE3. My A10 7870 APU could handle up to AVX. My current R5 2400G APU can do AVX2 ( even if it is emulated ). So far, things are looking promising, with the exception that templates are a nightmare sometimes.

My goal is to have an ambiguous set of instructions which are basically just tags with some constexpr data for register indices. Those tags will be transformed into another tagged type depending on instruction set support. If your CPU only supports SSE2, then everything will be forwarded to instructions with the SSE2 template parameter. The check would only be done once before the instructions are executed and not for every instruction, which is pretty neat IMO.
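Roughly what I have in mind, as a sketch. The `isa` enum, the `lower` template, `program`, and `trace_for` are hypothetical names, and the specializations only append a mnemonic to a string so the selected path is visible. Real ones would call the matching intrinsics, and `detected` would come from a CPUID-style check.

```cpp
#include <cassert>
#include <string>

enum class isa { sse2, avx };            // hypothetical instruction-set tags

// ISA-agnostic instruction tags: just register indices, no behaviour.
template<int L, int R> struct add_f {};
template<int L, int R> struct mul_f {};

// Records which lowered instruction ran.
static std::string g_trace;

template<isa I, typename Tag> struct lower;  // generic tag -> ISA-specific op
template<int L, int R> struct lower<isa::sse2, add_f<L, R>> { static void exe() { g_trace += "addps "; } };
template<int L, int R> struct lower<isa::sse2, mul_f<L, R>> { static void exe() { g_trace += "mulps "; } };
template<int L, int R> struct lower<isa::avx,  add_f<L, R>> { static void exe() { g_trace += "vaddps "; } };
template<int L, int R> struct lower<isa::avx,  mul_f<L, R>> { static void exe() { g_trace += "vmulps "; } };

template<typename... Tags>
struct program {
    template<isa I>
    static void run() { (lower<I, Tags>::exe(), ...); }

    // One branch per program, not one per instruction.
    static void dispatch(isa detected) {
        if (detected == isa::avx) run<isa::avx>();
        else                      run<isa::sse2>();
    }
};

// Hypothetical helper: returns the trace produced for a given ISA.
inline std::string trace_for(isa detected) {
    g_trace.clear();
    program<mul_f<1, 2>, add_f<0, 1>>::dispatch(detected);
    assert(!g_trace.empty());
    return g_trace;
}
```

Both `run<isa::sse2>` and `run<isa::avx>` get compiled into the binary; the runtime check just picks which fully inlined version to enter.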

Having a little trouble unpacking another parameter pack, but either I'll figure it out or maybe Shirley will be nice to me again.

Post by **albinopapa** » September 24th, 2020, 9:00 pm

Figured out the transformation of instructions, yay!

Now I am a little frustrated with myself for not implementing a few things ahead of time, or maybe it won't be such a big deal. We'll see.

After spending a few hours both trying to figure out what I was doing wrong with the type transformations and converting everything over to handle the different SIMD types, I begrudgingly went to do a test. I wanted to see so far how easy it would be to realistically use this library. Currently, not easy. I'm going to have to come up with a way to tell some of the instructions where to pull the data from. For instance, all the math ops are hardcoded to use the register bank, which is fine and intended. However, loading data into registers is done by loadi and loadf using offsets into a stack allocated buffer. What I'm going to need though is to have a user supplied pointer to a buffer. Also, constant data which should only be loaded once before each shader is run would be a nicety as well. This way it doesn't have to be loaded each iteration.

It seems a little more thought is needed.

While I'd love for this to be mostly compile time constants, it's becoming increasingly difficult to accomplish some things. With small tests, the situation is simple: copy data to the stack allocated buffer, write a small shader that loads that data using the size of each data type as offsets into the buffer, run the shader, then store the result into some offset into the buffer, then cast it to the type desired. The reason it is not a problem for smaller tests is because you know where the data lies, so offsets can be done manually. However, for a large data set, this is not going to be possible. What I have done previously though is use the manual offsets, then just memcpy from one buffer to the stack buffer, do the operations, then memcpy the result back into the original buffer. This works, but I had to do that for each iteration, which I'd imagine isn't the most efficient.

It feels like the finish line is so far away. I can tell by the end this project is going to look nothing like what I originally planned. I've thought about using std::variant for a little mix of compile time and runtime, but at the moment I'm trying to keep from doing that. I don't think I'd get the results I'm looking for, but it may be what is needed. Another thought, which I'm totally not ready to explore is to have a separate program to take in a string and convert it into the list of these instructions, basically a shader compiler. You write out a shader like you would HLSL, and the result is a text file that you can copy and paste into your project.

Now that I think about it, that would be so much easier for the end user. I'd have to figure out the compilation stuff, but it seems it would be worth the effort.

To be continued...

Post by **albinopapa** » September 25th, 2020, 8:25 pm

Yep, didn't think far enough ahead on this project.

While the instructions work and transforming them to a supported architecture works, I haven't thought about how I want the actual shaders to work. Surely I didn't think just shoving a bunch of data into a buffer and everything would be fine right? Yeah, that's exactly what I thought lol.

Well, I don't think that's gonna fly. So I'll need to think about how in the hell I can describe the types in the shader, load data from some buffer or buffers into registers and refer back to the aforementioned types as these will be offsets into the buffer or buffers given some memory location.

I suppose some responsibility can be offloaded to the user, but I feel that kind of defeats the purpose of a library. However, I suppose the biggest benefit here would be automatic calling of highest supported SIMD instructions from a higher level and not having to write multiple versions, so there's that.

I'll think on it and of course post again.

Post by **albinopapa** » September 28th, 2020, 9:02 pm

The closer I get to coming up with an interface, the more I'm thinking it's not really possible without defining some limitations IF I want to keep things as templates, which would be great because I've seen the disassembly using this code and it's pretty clean. The biggest issue I'm facing now is where to draw the line between what my library provides and what the user will need to provide.

I'm not sure if I outlined what I want, but here it is anyway:

User provides input and output buffers
User provides the list of instructions to execute on the buffers' data
User calls some yet to be defined function to execute the list of instructions by passing the list to this function
the yet to be defined function will determine highest supported SIMD instruction set and transform the list of instructions to said SIMD instruction set
data in buffer will be loaded into pseudo registers that the instructions will execute on
instructions are executed
user can retrieve results from user supplied output buffer

I'm sure there are steps missing that I don't know I need yet, but right now I'm hung up on 'data in input buffer will be loaded into pseudo registers...'
A couple reasons for this as I see it. Since I'm creating a library that is intended to operate on any instruction set ( SSE or AVX, 128 bit vs 256 bit ), I'm not quite sure how to handle loading of types. Say you want to load a structure with 2 floats. Well, okay, hopefully you'll be using a buffer that has either 2 of them if using SSE or 4 of them if using AVX; otherwise, the register is going to be oversized and could cause issues during loading and storing. Either way, if I check and see you only have two float1 types or just a float2 type, then do I just repeat those two for the rest of the register, or load them in and zero out the rest of the elements?

Right now I either don't have enough information during compile time to make that decision or I simply make a decision and document the behavior.
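The two options can at least be written down side by side. This is only a sketch: `fill_policy` and `widen` are hypothetical names, and a std::array stands in for the SSE ( 4 lane ) or AVX ( 8 lane ) register.

```cpp
#include <array>
#include <cstddef>

enum class fill_policy { zero, repeat };   // hypothetical policy names

// Spread N source floats across a W-lane register (W = 4 for SSE, 8 for AVX).
template<fill_policy P, std::size_t W, std::size_t N>
constexpr std::array<float, W> widen(const std::array<float, N>& src) {
    static_assert(N > 0 && N <= W, "source must fit in one register");
    std::array<float, W> reg{};                          // all lanes start at zero
    for (std::size_t i = 0; i < W; ++i) {
        if (i < N)                         reg[i] = src[i];      // real data
        else if (P == fill_policy::repeat) reg[i] = src[i % N];  // wrap around
        // fill_policy::zero leaves the remaining lanes at zero
    }
    return reg;
}

// A float2 loaded into a 4-lane register under each policy.
static_assert(widen<fill_policy::zero, 4>(std::array<float, 2>{ 1.f, 2.f })[2] == 0.f, "zero fill");
static_assert(widen<fill_policy::repeat, 4>(std::array<float, 2>{ 1.f, 2.f })[2] == 1.f, "repeat fill");
```

Making the policy a template parameter would let the library pick a documented default while still letting the user override it per load.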

Let's say you wanted to use this for rect on rect collision detection ( just detection, no correction ):

First, some C++ code:

Code: Select all

struct RectF{ float left, top, right, bottom; };

bool is_overlapping( RectF a_, RectF b_ ){
    return
        a_.left < b_.right && a_.right >= b_.left &&
        a_.top < b_.bottom && a_.bottom >= b_.top;
}

RectF a, b;
const auto overlapped = is_overlapping( a, b );

Okay, let's see what that would look like using this library...maybe

Code: Select all

// describe the struct to the library
using mmRectF = structure<float1,float1,float1,float1>;
// The 0 in first parameter means load into register 0.  Assuming each member gets its own
// register, the second parameter would start at register 4
using mmIsOverlappingParam = param_list<parameter<0, mmRectF>, parameter<4, mmRectF>>;
using mmIsOverlappingInstructions = function_body<
    ltf<0, 6>,  // a.left < b.right -> result stored at register[0]
    ltf<1, 7>,  // a.top < b.bottom -> result stored at register[1]
    gef<2, 4>,  // a.right >= b.left -> result stored at register[2]
    gef<3, 5>,  // a.bottom >= b.top -> result stored at register[3]
    andf<0, 1>,  // and first two results -> result stored at register[0]
    andf<2, 3>,  // and second pair of results -> result stored at register[2]
    andf<0, 2>  // and previous two results -> result stored at register[0]
>;

// Currently, the result isn't used, so I'm just gonna put float1 here
using mmIsOverlapping = function<float1, mmIsOverlappingParam, mmIsOverlappingInstructions>;

// Not implemented yet, but this is where you'd load a buffer with rectangles
auto input_buffer = simd_device.create_buffer( num_rectangles * sizeof( RectF ) );
auto output_buffer = simd_device.create_buffer( num_rectangles * sizeof( float ) );

// fill input buffer with num_rectangles...hopefully not just 1 or 2

// Then set the buffers
simd_device.set_buffer( input_buffer, output_buffer );

// Then call the shader
call<mmIsOverlapping>::exe();

Yes, you'd need to define both types of RectF structures. One for C++ that actually holds data, and one that describes that structure to the API.
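For comparison, this is roughly what the instruction list above would boil down to if written by hand with SSE, testing four pairs of rectangles per call. The structure-of-arrays layout ( `RectF4` ) and the `overlap_mask` name are assumptions for the example; the library itself doesn't define them.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Structure-of-arrays view: each pointer refers to four consecutive floats,
// one per rectangle.
struct RectF4 {
    const float* left;
    const float* top;
    const float* right;
    const float* bottom;
};

// Same comparisons as is_overlapping, on four (a, b) pairs at once.
// Bit i of the result is set when pair i overlaps.
inline int overlap_mask(const RectF4& a, const RectF4& b) {
    assert(a.left && a.top && a.right && a.bottom);
    assert(b.left && b.top && b.right && b.bottom);
    const __m128 lt0 = _mm_cmplt_ps(_mm_loadu_ps(a.left),   _mm_loadu_ps(b.right));   // ltf<0, 6>
    const __m128 lt1 = _mm_cmplt_ps(_mm_loadu_ps(a.top),    _mm_loadu_ps(b.bottom));  // ltf<1, 7>
    const __m128 ge0 = _mm_cmpge_ps(_mm_loadu_ps(a.right),  _mm_loadu_ps(b.left));    // gef<2, 4>
    const __m128 ge1 = _mm_cmpge_ps(_mm_loadu_ps(a.bottom), _mm_loadu_ps(b.top));     // gef<3, 5>
    const __m128 all = _mm_and_ps(_mm_and_ps(lt0, lt1), _mm_and_ps(ge0, ge1));
    return _mm_movemask_ps(all);  // one bit per lane
}
```

The comparisons produce all-ones or all-zeros per lane, which is why a plain bitwise and works for combining them.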

If you look at the parameter list, param_list<parameter<0, mmRectF>, parameter<4,mmRectF>>, here's where part of the issue is. I'm relying on an SSE type being chosen, whereas I need it to be ambiguous so as to let the API choose the registers. The reason I chose to originally include the base register as a parameter argument was so instructions could know where the parameters are to use them...like named variables.

Even if the registers are assigned by the api, the user wouldn't know which registers to use for the instructions. There'd have to be a deterministic way of telling the API that you are wanting parameter 0 or parameter 1, like named variables for other programming languages. Kind of hard to do with limited knowledge of the entire code like a compiler would have.

I might have to redesign the api in such a way that instead of calling instructions like: addf<0,1> it'd be something like: addf<source1_type, source2_type> where source1_type and source2_type have more information on where to get the data from...but I'm not sure if this approach would be any better.
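One direction for that, sketched with hypothetical names ( `base_register`, `operand`, and a `registers` count on each type ): let the parameter list compute where each parameter starts, and let instructions refer to "parameter N, member M" instead of a raw register index.

```cpp
#include <cstddef>

// Stand-ins: each type reports how many registers it occupies
// (one register per member, as assumed for mmRectF above).
struct float1  { static constexpr std::size_t registers = 1; };
struct mmRectF { static constexpr std::size_t registers = 4; };

template<typename... Ts> struct param_list {};

// base_register<N, List>: first register of parameter N, i.e. the sum of the
// register counts of all parameters before it.
template<std::size_t N, typename List> struct base_register;

template<typename T, typename... Rest>
struct base_register<0, param_list<T, Rest...>> {
    static constexpr std::size_t value = 0;
};

template<std::size_t N, typename T, typename... Rest>
struct base_register<N, param_list<T, Rest...>> {
    static constexpr std::size_t value =
        T::registers + base_register<N - 1, param_list<Rest...>>::value;
};

// operand<Param, Member>: a "named" source that resolves to a register index
// once it is combined with a concrete parameter list.
template<std::size_t Param, std::size_t Member = 0>
struct operand {
    template<typename Params>
    static constexpr std::size_t reg = base_register<Param, Params>::value + Member;
};

using overlap_params = param_list<mmRectF, mmRectF>;
static_assert(operand<0, 0>::reg<overlap_params> == 0, "a.left");
static_assert(operand<1, 2>::reg<overlap_params> == 6, "b.right");
```

With something like this, ltf<0, 6> could be spelled in terms of operand<0, 0> and operand<1, 2>, and the register numbers would follow automatically if the parameter types change.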

I'm really strongly leaning toward learning how to make a simple compiler at this point. This way, the user programs in something similar to C then the api would just compile out the necessary C++ code.

I've also thought about saying fuck it and trying std::variant. Basically, I'd take one of two paths. One would be a separate program that would create a text document with the required API template calls. The other would be just a function that took in a string or file with the C like shader code and created a vector of std::variant, where each object in the vector would be a variant of the instructions. The first has the benefit of program efficiency ( the separate compiler ); the second has the benefit of programmer efficiency ( the one using std::variant ).
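A minimal sketch of the std::variant path, to show what the runtime side might look like. The op structs, `executor`, `run`, and `lerp_demo` are made up for the example, and the registers are plain floats rather than SIMD registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <variant>
#include <vector>

// Runtime instruction records: indices name pseudo registers and the result
// goes back into `dst`, matching the left-register rule of the template ops.
struct add_op { std::size_t dst, src; };
struct sub_op { std::size_t dst, src; };
struct mul_op { std::size_t dst, src; };
using instruction = std::variant<add_op, sub_op, mul_op>;

using register_bank = std::array<float, 8>;   // scalar stand-in for SIMD registers

// Overload set used with std::visit to execute one instruction.
struct executor {
    register_bank& r;
    void operator()(const add_op& i) const { r[i.dst] += r[i.src]; }
    void operator()(const sub_op& i) const { r[i.dst] -= r[i.src]; }
    void operator()(const mul_op& i) const { r[i.dst] *= r[i.src]; }
};

inline void run(const std::vector<instruction>& program, register_bank& regs) {
    for (const auto& ins : program) std::visit(executor{ regs }, ins);
}

// Hypothetical demo: the lerp from the first post, built at runtime.
inline float lerp_demo(float a, float b, float t) {
    assert(t >= 0.f && t <= 1.f);
    register_bank regs{};
    regs[0] = a; regs[1] = b; regs[2] = t;
    const std::vector<instruction> program{ sub_op{ 1, 0 }, mul_op{ 1, 2 }, add_op{ 0, 1 } };
    run(program, regs);
    return regs[0];
}
```

The cost compared to the template version is a dispatch per instruction at runtime, which is the program efficiency versus programmer efficiency trade mentioned above.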

Like I mentioned, either way I'm going to have to figure out making a simple compiler first.

Post by **albinopapa** » October 2nd, 2020, 12:08 am

Found some resources for helping learn how to compile my own language, but does anyone else have trouble translating example code to your own code?
