SimdCompute project

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

SimdCompute project

Post by albinopapa » January 20th, 2018, 5:03 am

Project SimdCompute: Jan 19, 2018
Hello Planet Chili, how is everyone?

Wanted to share a bit about a project I'm working on. It's going to be a compute shader library built on Single Instruction, Multiple Data (SIMD) instructions. You might have read about it in the "crickets" post started by MrGodin. I've got about 1,700 lines of code done so far. It took me a while to piece together how to tell where data was coming from and where to store it, but I've made progress.

Currently, it can:
- add, subtract, multiply and divide; gotta support the basics
- shuffle - useful for multiplying vectors and matrices together
- blend and insert - useful for mixing values from different registers

There are five data sources:
- user supplied multi-element buffer
- single element cache
- user supplied single element constant buffer
- literals ( stored in an SSE register, either as a single element or a full vector depending on the operation, and kept in a single element cache )
- system supplied intermediate multi-element buffer

There are three data destinations:
- user supplied multi-element buffer
- single element cache
- system supplied intermediate multi-element buffer

The number of input structs and output structs must match, but the number of elements in each struct does not need to match.
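For example, something like this would be fine - hypothetical structs, just to illustrate the rule: one input struct paired with one output struct, each with a different element count.

Code: Select all

struct VSInput      // 2 elements
{
    float position[ 4 ];
    float normal[ 4 ];
};

struct VSOutput     // 3 elements, but still one struct, so it pairs with VSInput
{
    float position[ 4 ];
    float normal[ 4 ];
    float color[ 4 ];
};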

I don't have a compiler set up yet, so I'm having to manually create parameters for every pseudo instruction:

Code: Select all

// This is all the instructions for multiplying a __m128 register by a 4x__m128 matrix of registers.

// swizzling with (0,0,0,0) copies the first element to all four elements in the register
// swizzling with (1,1,1,1) does the same for the second element and so on.

// pseudo code:
// buffer_slot_0 = {x,y,z,w}
// cbuffer_slot_0 = {matrix4x4.r0}
// cbuffer_slot_1 = {matrix4x4.r1}
// cbuffer_slot_2 = {matrix4x4.r2}
// cbuffer_slot_3 = {matrix4x4.r3}

	_dispatch_param()
		.set_operation<swizzleps>( make_shuf_mask( 0, 0, 0, 0 ) )
		.set_op_category( _op_cat::buffer_null_temp )
		.set_lhs_slot_num( 0 )
		.set_out_slot_num( 0 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<mulps>()
		.set_op_category( _op_cat::temp_const_temp )
		.set_lhs_slot_num( 0 )
		.set_rhs_slot_num( 0 )
		.set_out_slot_num( 0 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<swizzleps>( make_shuf_mask( 1, 1, 1, 1 ) )
		.set_op_category( _op_cat::buffer_null_temp )
		.set_lhs_slot_num( 0 )
		.set_out_slot_num( 1 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<mulps>()
		.set_op_category( _op_cat::temp_const_temp )
		.set_lhs_slot_num( 1 )
		.set_rhs_slot_num( 1 )
		.set_out_slot_num( 1 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<addps>()
		.set_op_category( _op_cat::temp_temp_temp )
		.set_lhs_slot_num( 0 )
		.set_rhs_slot_num( 1 )
		.set_out_slot_num( 0 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<swizzleps>( make_shuf_mask( 2, 2, 2, 2 ) )
		.set_op_category( _op_cat::buffer_null_temp )
		.set_lhs_slot_num( 0 )
		.set_out_slot_num( 2 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<mulps>()
		.set_op_category( _op_cat::temp_const_temp )
		.set_lhs_slot_num( 2 )
		.set_rhs_slot_num( 2 )
		.set_out_slot_num( 2 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<swizzleps>( make_shuf_mask( 3, 3, 3, 3 ) )
		.set_op_category( _op_cat::buffer_null_temp )
		.set_lhs_slot_num( 0 )
		.set_out_slot_num( 3 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<mulps>()
		.set_op_category( _op_cat::temp_const_temp )
		.set_lhs_slot_num( 3 )
		.set_rhs_slot_num( 3 )
		.set_out_slot_num( 3 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<addps>()
		.set_op_category( _op_cat::temp_temp_temp )
		.set_lhs_slot_num( 2 )
		.set_rhs_slot_num( 3 )
		.set_out_slot_num( 1 )
		.register_param( dispatch );

	_dispatch_param()
		.set_operation<addps>()
		.set_op_category( _op_cat::temp_temp_out )
		.set_lhs_slot_num( 0 )
		.set_rhs_slot_num( 1 )
		.set_out_slot_num( 0 )
		.register_param( dispatch );

// This is what the code is doing:
// temp_slot_0 = {x,x,x,x}
// temp_slot_0  = temp_slot_0 * cbuffer_slot_0
// temp_slot_1 = {y,y,y,y}
// temp_slot_1 = temp_slot_1 * cbuffer_slot_1
// temp_slot_0 = temp_slot_0 + temp_slot_1
// temp_slot_2 = {z,z,z,z}
// temp_slot_2 = temp_slot_2 * cbuffer_slot_2
// temp_slot_3 = {w,w,w,w}
// temp_slot_3 = temp_slot_3 * cbuffer_slot_3
// temp_slot_1 = temp_slot_2 + temp_slot_3
// out_slot_0 = temp_slot_0 + temp_slot_1
* buffer = user supplied multi-element buffer
* const = user supplied single element constant buffer
* out = user supplied multi-element buffer for results
* temp = intermediate system supplied multi-element buffer
* cache = system supplied single element buffer
It's doing each operation over an entire array before moving on to the next instruction.
My initial test isn't as exciting as I'd hoped, and I think I have an idea why.

Yes, I get vectorized operations: when adding two registers, it adds four source elements to four destination elements in a single instruction, typically in a single CPU cycle. The issue here is probably going to be moving data in from memory, doing one operation, then moving it back to memory. That means more time is spent waiting on loads and stores than on actual calculations. Still, I'm probably going to keep going down this path for now so I have a baseline to compare against for optimizations later on.

One thing I am considering is unrolling the loops; this might hint to the compiler/CPU to prefetch some data and give the CPU a chance to perform a few tasks while waiting for more data to either load or store.
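Something like this is what I have in mind for an unrolled pass ( a rough sketch using plain SSE intrinsics, not the actual library code ):

Code: Select all

#include <cstddef>
#include <xmmintrin.h>

// Rough sketch of a 4x unrolled add pass over two buffers ( not the actual
// library code ). Assumes count is a multiple of 4 and the buffers are
// 16-byte aligned.
void add_buffers_unrolled4( const __m128* lhs, const __m128* rhs,
                            __m128* out, std::size_t count )
{
    for( std::size_t i = 0; i < count; i += 4 )
    {
        // Four independent load/add/store chains give the CPU a chance to
        // overlap the loads with adds that are already in flight.
        out[ i + 0 ] = _mm_add_ps( lhs[ i + 0 ], rhs[ i + 0 ] );
        out[ i + 1 ] = _mm_add_ps( lhs[ i + 1 ], rhs[ i + 1 ] );
        out[ i + 2 ] = _mm_add_ps( lhs[ i + 2 ], rhs[ i + 2 ] );
        out[ i + 3 ] = _mm_add_ps( lhs[ i + 3 ], rhs[ i + 3 ] );
    }
}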
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » January 20th, 2018, 5:04 am

I do plan on creating a compiler for this; I'd hate to have to create a whole shader this way lol.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

MrGodin
Posts: 721
Joined: November 30th, 2013, 7:40 pm
Location: Merville, British Columbia Canada

Re: SimdCompute project

Post by MrGodin » January 20th, 2018, 8:07 pm

Doing good here papa, this looks cool and interesting. I'd be interested in seeing how your compiler works :)
Curiosity killed the cat, satisfaction brought him back

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » January 20th, 2018, 10:10 pm

MrGodin wrote:Doing good here papa, this looks cool and interesting. I'd be interested in seeing how your compiler works :)
Lol, me too, having troubles figuring out where to start.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

MrGodin
Posts: 721
Joined: November 30th, 2013, 7:40 pm
Location: Merville, British Columbia Canada

Re: SimdCompute project

Post by MrGodin » January 20th, 2018, 11:38 pm

[ posted a link to a Bisqwit video ]
Curiosity killed the cat, satisfaction brought him back

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » January 21st, 2018, 9:29 am

Still at ground zero on that compiler; I spent a few minutes thinking about it, but got distracted when I came up empty. Instead, I decided to test a few things.

Code: Select all

100 test loops | 48,000 elements | 26 operations
------------------------------
loop unrolling = 1
0.643484, 0.646866, 0.660184 = 1.950534  
loop unrolling = 4
0.641434, 0.652234, 0.650237 = 1.943905
loop unrolling = 8
0.644900, 0.636254, 0.654657 = 1.935811
loop unrolling = 16
0.674694, 0.690410, 0.691741 = 2.056845
The times are in seconds: three separate runs of the program, with the last number being the sum of the three times. Looks like unrolling offers a small increase in speed, except for the 16 version. Cache lines are 64 bytes, so the 4 version should be the best, but with x64 compilation you get 16 SSE registers, which is probably why I am getting better results with the 8 version.

I'll have to compare this with straight SSE code, without going through this "interpreter" to see what kind of performance loss there is. I'll do that Sunday night, or Monday.

Thanks MrGodin for the vid link, Bisqwit has some pretty interesting vids. I've seen a couple of them probably about a year ago. It'll take me a while to wrap my head around what he's talking about.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » January 22nd, 2018, 8:50 am

Project SimdCompute: Jan 22, 2018
Didn't run the comparison with straight SSE; I worked on my compiler instead.

Man, when you don't have a clear idea of what you're doing, things get messy quick. I have random half-written code all over this file. Progress is being made, though.

I wrote a small vertex shader similar to what chili does in his latest 3D fundamentals video ( transform position and normals ). I can read in cbuffers and structs. Now, you may be wondering, "What's so hard about reading in some lines of text?". Well, when compiling the shader, I want to be able to look up a variable's location by name. This way, when I run across a variable with the name world, I can look up world and find that it belongs to a cbuffer, its index is 0, it's at slot 0 ( first in the list ), it occupies four slots, and it's a matrix ( if I do multiplication with it, I need to know its type ).
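Roughly what I mean by looking up a variable by name ( a hypothetical symbol table entry, just a sketch of the idea - names and fields will probably change ):

Code: Select all

#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical symbol table entry - not the final layout.
struct VarInfo
{
    enum class Storage { cbuffer, input, output } storage;
    std::uint32_t buffer_index;  // which cbuffer/struct it belongs to
    std::uint32_t slot;          // first slot it occupies
    std::uint32_t slot_count;    // a matrix occupies four slots
    enum class Type { float1, float4, matrix4x4 } type;
};

std::unordered_map<std::string, VarInfo> symbols;

// Looking up "world" would then tell me: it's in cbuffer 0, at slot 0,
// it occupies four slots, and it's a matrix.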

Well, come to think of it, I still haven't associated a variable with its storage type yet, e.g. world being stored in cbuffer Transforms, or cbuffer[0]. Guess I'll work on that tomorrow.

As far as the structs go, which are the inputs/outputs, I haven't marked them yet; I'm thinking I'll need to be able to read in functions before I can do that. The shader will need an entry point ( a main function ). Anyway, just a quick update... still nothing to show.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » January 22nd, 2018, 4:14 pm

Just ran a few benchmarks on straight SSE code. The shader version is about 20 times slower than just writing SSE code directly. I'll have to keep this in mind as I write the compiler. I'll probably want to look for patterns and group those operations together so they can be done with a single shader operation instead of several. For instance, multiplying a matrix by a vector, or a*b+c ( fused multiply-accumulate ).
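For reference, the "straight SSE" version of the vector-by-matrix multiply from the first post looks roughly like this ( a sketch, assuming a row vector times a matrix stored as four __m128 rows ):

Code: Select all

#include <xmmintrin.h>

// Direct SSE version of v * M used as the comparison baseline,
// where v = { x, y, z, w } and m[ 0..3 ] are the matrix rows.
__m128 mul_vec_mat( __m128 v, const __m128 m[ 4 ] )
{
    const __m128 x = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 0, 0, 0, 0 ) );
    const __m128 y = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 1, 1, 1, 1 ) );
    const __m128 z = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 2, 2, 2, 2 ) );
    const __m128 w = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 3, 3, 3, 3 ) );

    return _mm_add_ps(
        _mm_add_ps( _mm_mul_ps( x, m[ 0 ] ), _mm_mul_ps( y, m[ 1 ] ) ),
        _mm_add_ps( _mm_mul_ps( z, m[ 2 ] ), _mm_mul_ps( w, m[ 3 ] ) ) );
}
All the intermediate values stay in registers here, while the shader version has to load and store each temp slot through memory, which is most likely where the 20x gap comes from.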
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » January 25th, 2018, 7:51 am

Project SimdCompute: Jan 25, 2018
Well, I made a small amount of progress toward compiling the code, and I mean pointing-and-laughing small. After some digging around and watching a few vids, it looks like compilation is broken up into three or more steps.

Step 1) Lexing:
I have come across a few different ideas on what this step involves, but I like the simplest approach: lexing is just turning all your code into individual tokens.

The expression "int d = a + ( b * c );" yields
"int", "d", "=", "a", "+", "(", "b", "*", "c", ")", ";"
or a vector<string>. I've got this part done. After a couple of iterations of trying to load in names, symbols, and numbers, I ultimately made something that actually made things easier to lex. Before, I was going through each char in the string and doing "if this or if that" to determine whether it was a space, one of the symbols, etc. I really didn't like coming up with ways to lex hex and float literals inside the main for loop. I really wanted to use std::stringstream and let someone else figure it out. Well, stringstream stops at white space ( ' ' spaces, '\n' new lines, '\t' tabs ), so if you wrote
"int d=a+(b*c);" stringstream would extract the entire string because there's no white space. Well, the place I do the most thinking is on the crapper, and it hit me... insert white space and stringstream will do the job for you.

Step 2) Parsing:
This is where I'm stuck a bit. This step is supposed to create a tree of left and right child nodes from the tokens produced by lexing. Unfortunately, I'm having trouble understanding how it's supposed to be set up. There are diagrams and stuff on the web, but they mostly refer to math operations or spoken language syntax. Programs do have a lot of math operations, but what about declarations? Where do those go? How are they structured in this parse tree?
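For what it's worth, the expression side of it seems simple enough - a guess at what a node might look like; where declarations fit is still the open question:

Code: Select all

#include <memory>
#include <string>

// A guess at a parse tree node: each node holds a token and has left and
// right children, so "a + ( b * c )" would become
//        +
//       / \
//      a   *
//         / \
//        b   c
struct ParseNode
{
    std::string token;
    std::unique_ptr<ParseNode> left;
    std::unique_ptr<ParseNode> right;
};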

Step 3) Compiling/Interpreting
Unless I want to go the extra distance and learn assembly, I guess I'll stick with an interpreter. In this step, you convert your parse tree into a form you want your program to read and execute. So, for an interpreter, you could have enums for the operations like "call", "assign", "add", "sub" and determine where the operands are coming from. In my case, I'll be working with buffers, so I'd want an identifier and an offset from the base address. The identifier will tell me where the source is, such as a user buffer, a temp buffer, the cache buffer, a literal, and so on. The offset is going to be a slot number. A slot in this case works like this: a struct has different members, and each member takes up a slot. I'm going to cap the number of slots at 16. The structs in the shader represent the data, but the data will actually be laid out differently.
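Before getting to how the data is laid out, roughly the kind of instruction format I have in mind for the interpreter ( hypothetical names, still subject to change ):

Code: Select all

#include <cstdint>

// Hypothetical interpreter instruction - names are placeholders.
enum class Op     { call, assign, add, sub, mul, div };
enum class Source { user_buffer, temp_buffer, cache, constant, literal };

struct Instruction
{
    Op     op;
    Source lhs_source, rhs_source, out_dest;
    std::uint32_t lhs_slot, rhs_slot, out_slot;  // slot offsets, capped at 16
};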

Consider the following:

Code: Select all

struct Foo
{
    Vec3 position;
    Vec3 orientation;
};
The Foo struct uses two slots, but the user buffer won't be set up as array( Foo, Foo, Foo, Foo, ... ); it will be
Foo_0[slot0], Foo_1[slot0], Foo_2[slot0], Foo_3[slot0], ..., Foo_n[slot0]
Foo_0[slot1], Foo_1[slot1], Foo_2[slot1], Foo_3[slot1], ..., Foo_n[slot1]
Because SSE registers are 16 bytes and aligned loads are faster, padding will be inserted between each element in the arrays to make each element 16 bytes. Wasted memory, but easier to handle.
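In other words, the buffer ends up laid out per slot rather than per struct, something like this ( a sketch of the layout, not the actual container; it assumes C++17's aligned allocation or an aligned allocator for the __m128 vectors ):

Code: Select all

#include <vector>
#include <xmmintrin.h>

// Per-slot layout: one contiguous array of 16-byte elements per slot, so
// slot 0 holds every Foo's position and slot 1 holds every Foo's orientation.
// Each Vec3 is padded out to a full __m128 ( 16 bytes ).
struct FooBuffer
{
    std::vector<__m128> position;     // Foo_0[ slot0 ], Foo_1[ slot0 ], ...
    std::vector<__m128> orientation;  // Foo_0[ slot1 ], Foo_1[ slot1 ], ...
};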

As stated earlier, there are more steps that could be, and probably should be, taken; one being what's called an AST ( abstract syntax tree ) step. I'm going to pray that I can get away with just the parse tree, but the AST is used to enforce programming language grammar and syntax and to provide debug info like "hey, you fucked up on line 32 of go_fuck_yourself.h". Perhaps when/if I reach that point, I'll consider doing something like that. Right now, having this much trouble understanding how to parse the code, I don't think I'll be able to handle making an AST and enforcing all the rules.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com

albinopapa
Posts: 4373
Joined: February 28th, 2013, 3:23 am
Location: Oklahoma, United States

Re: SimdCompute project

Post by albinopapa » February 6th, 2018, 3:03 am

Project SimdCompute: Feb 5, 2018

After doing a lot more searching, I came across a repo on github.com that does something similar to what I'm wanting to accomplish. I've spent a few days looking it over, and I think I may be able to move forward a few steps. It will take me a while to nail things down, as my understanding of the process is still weak at best.

Previously, I wasn't sure how to structure the parse tree and that is one of the things the project I came across ( HLSLParser ) actually does. I'll keep digging and posting as things become clearer.
If you think paging some data from disk into RAM is slow, try paging it into a simian cerebrum over a pair of optical nerves. - gameprogrammingpatterns.com
