Hello planet chili, how is everyone?
Wanted to share a bit about a project I'm working on. It's going to be a compute shader library built on Single Instruction, Multiple Data (SIMD) instructions. You might have read about it in the "crickets" post started by MrGodin. I've got about 1,700 lines of code done so far. It took me a while to piece together how to tell where data was coming from and where to store it, but I've made progress.
Currently, it can:
- add, subtract, multiply and divide; gotta support the basics.
- shuffle - useful for multiplying vectors and matrices together.
- blend and insert - useful for mixing values from different registers.
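To make shuffle and blend a bit more concrete, here is a tiny scalar model of what they do to a four-float register. It's only a sketch: vec4, shuffle4 and blend4 are made-up names, not the library's API, and it assumes the library's operations follow the usual SSE meaning (a shuffle picks lanes by index, a blend picks each lane from one of two registers according to a bit mask).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a four-float SSE register.
using vec4 = std::array<float, 4>;

// Scalar model of a shuffle: lane k of the result is v[ik].
// shuffle4( v, 0, 0, 0, 0 ) copies the first element into all four lanes.
inline vec4 shuffle4( const vec4& v, std::size_t i0, std::size_t i1,
                      std::size_t i2, std::size_t i3 )
{
    assert( i0 < 4 && i1 < 4 && i2 < 4 && i3 < 4 );
    return { v[ i0 ], v[ i1 ], v[ i2 ], v[ i3 ] };
}

// Scalar model of a blend: if bit k of mask is set, lane k comes from b,
// otherwise it comes from a.
inline vec4 blend4( const vec4& a, const vec4& b, unsigned mask )
{
    assert( mask < 16u );
    vec4 result{};
    for( std::size_t k = 0; k < 4; ++k )
    {
        result[ k ] = ( ( mask >> k ) & 1u ) ? b[ k ] : a[ k ];
    }
    return result;
}
```

So shuffle4( v, 2, 2, 2, 2 ) gives {z,z,z,z}, and blend4( a, b, 5 ) takes lanes 0 and 2 from b and lanes 1 and 3 from a.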
There are five data sources:
- user supplied multi-element buffer
- single element cache
- user supplied single element constant buffer
- literals (each one gets stored in an SSE vector, either as a single element or as a full vector depending on the operation, and is kept in a single element cache)
- system supplied intermediate multi-element buffer
There are three data destinations:
- user supplied multi-element buffer
- single element cache
- system supplied intermediate multi-element buffer
The number of structs for input and output must match, but the number of elements in each struct does not need to match.
I don't have a compiler set up yet, so I'm having to manually create the parameters for every pseudo instruction:
// These are all the instructions for multiplying a __m128 register by a matrix made of four __m128 registers.
// swizzling with (0,0,0,0) copies the first element to all four elements in the register
// swizzling with (1,1,1,1) does the same for the second element and so on.
// pseudo code:
// buffer_slot_0 = {x,y,z,w}
// cbuffer_slot_0 = {matrix4x4.r0}
// cbuffer_slot_1 = {matrix4x4.r1}
// cbuffer_slot_2 = {matrix4x4.r2}
// cbuffer_slot_3 = {matrix4x4.r3}
_dispatch_param()
.set_operation<swizzleps>( make_shuf_mask( 0, 0, 0, 0 ) )
.set_op_category( _op_cat::buffer_null_temp )
.set_lhs_slot_num( 0 )
.set_out_slot_num( 0 )
.register_param( dispatch );
_dispatch_param()
.set_operation<mulps>()
.set_op_category( _op_cat::temp_const_temp )
.set_lhs_slot_num( 0 )
.set_rhs_slot_num( 0 )
.set_out_slot_num( 0 )
.register_param( dispatch );
_dispatch_param()
.set_operation<swizzleps>( make_shuf_mask( 1, 1, 1, 1 ) )
.set_op_category( _op_cat::buffer_null_temp )
.set_lhs_slot_num( 0 )
.set_out_slot_num( 1 )
.register_param( dispatch );
_dispatch_param()
.set_operation<mulps>()
.set_op_category( _op_cat::temp_const_temp )
.set_lhs_slot_num( 1 )
.set_rhs_slot_num( 1 )
.set_out_slot_num( 1 )
.register_param( dispatch );
_dispatch_param()
.set_operation<addps>()
.set_op_category( _op_cat::temp_temp_temp )
.set_lhs_slot_num( 0 )
.set_rhs_slot_num( 1 )
.set_out_slot_num( 0 )
.register_param( dispatch );
_dispatch_param()
.set_operation<swizzleps>( make_shuf_mask( 2, 2, 2, 2 ) )
.set_op_category( _op_cat::buffer_null_temp )
.set_lhs_slot_num( 0 )
.set_out_slot_num( 2 )
.register_param( dispatch );
_dispatch_param()
.set_operation<mulps>()
.set_op_category( _op_cat::temp_const_temp )
.set_lhs_slot_num( 2 )
.set_rhs_slot_num( 2 )
.set_out_slot_num( 2 )
.register_param( dispatch );
_dispatch_param()
.set_operation<swizzleps>( make_shuf_mask( 3, 3, 3, 3 ) )
.set_op_category( _op_cat::buffer_null_temp )
.set_lhs_slot_num( 0 )
.set_out_slot_num( 3 )
.register_param( dispatch );
_dispatch_param()
.set_operation<mulps>()
.set_op_category( _op_cat::temp_const_temp )
.set_lhs_slot_num( 3 )
.set_rhs_slot_num( 3 )
.set_out_slot_num( 3 )
.register_param( dispatch );
_dispatch_param()
.set_operation<addps>()
.set_op_category( _op_cat::temp_temp_temp )
.set_lhs_slot_num( 2 )
.set_rhs_slot_num( 3 )
.set_out_slot_num( 1 )
.register_param( dispatch );
_dispatch_param()
.set_operation<addps>()
.set_op_category( _op_cat::temp_temp_out )
.set_lhs_slot_num( 0 )
.set_rhs_slot_num( 1 )
.set_out_slot_num( 0 )
.register_param( dispatch );
// This is what the code is doing:
// temp_slot_0 = {x,x,x,x}
// temp_slot_0 = temp_slot_0 * cbuffer_slot_0
// temp_slot_1 = {y,y,y,y}
// temp_slot_1 = temp_slot_1 * cbuffer_slot_1
// temp_slot_0 = temp_slot_0 + temp_slot_1
// temp_slot_2 = {z,z,z,z}
// temp_slot_2 = temp_slot_2 * cbuffer_slot_2
// temp_slot_3 = {w,w,w,w}
// temp_slot_3 = temp_slot_3 * cbuffer_slot_3
// temp_slot_1 = temp_slot_2 + temp_slot_3
// out_slot_0 = temp_slot_0 + temp_slot_1
* const = user supplied single element buffer
* out = user supplied multi-element buffer for results
* temp = intermediate system supplied multi-element buffer
* cache = system supplied single element buffer
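For comparison, the same sequence written directly with SSE intrinsics might look roughly like this. It's just a sketch, not the library's code: mul_vec4_mat4_ps and mul_vec4_mat4 are made-up names, and it assumes the matrix is stored as four rows of four floats, matching cbuffer_slot_0 through cbuffer_slot_3 in the listing.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: __m128, _mm_shuffle_ps, _mm_mul_ps, _mm_add_ps

// Hypothetical helper: multiplies v = {x,y,z,w} by a matrix given as four
// row registers r0..r3, using the same order of operations as the listing.
inline __m128 mul_vec4_mat4_ps( __m128 v, __m128 r0, __m128 r1, __m128 r2, __m128 r3 )
{
    // Broadcast one component into all four lanes, like the
    // (0,0,0,0) .. (3,3,3,3) swizzles.
    const __m128 xxxx = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 0, 0, 0, 0 ) );
    const __m128 yyyy = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 1, 1, 1, 1 ) );
    const __m128 zzzz = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 2, 2, 2, 2 ) );
    const __m128 wwww = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 3, 3, 3, 3 ) );

    // temp_slot_0 = x * r0 + y * r1
    const __m128 t0 = _mm_add_ps( _mm_mul_ps( xxxx, r0 ), _mm_mul_ps( yyyy, r1 ) );
    // temp_slot_1 = z * r2 + w * r3
    const __m128 t1 = _mm_add_ps( _mm_mul_ps( zzzz, r2 ), _mm_mul_ps( wwww, r3 ) );
    // out_slot_0 = temp_slot_0 + temp_slot_1
    return _mm_add_ps( t0, t1 );
}

// Convenience wrapper over plain float arrays (unaligned loads and stores).
// m holds 16 floats in row order: r0, r1, r2, r3.
inline void mul_vec4_mat4( const float* v, const float* m, float* out )
{
    assert( v != nullptr && m != nullptr && out != nullptr );
    const __m128 result = mul_vec4_mat4_ps(
        _mm_loadu_ps( v ),
        _mm_loadu_ps( m ), _mm_loadu_ps( m + 4 ),
        _mm_loadu_ps( m + 8 ), _mm_loadu_ps( m + 12 ) );
    _mm_storeu_ps( out, result );
}
```

Here everything for one vector stays in registers between the first load and the final store, whereas the dispatch version goes through the temp buffers after every step.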
It's doing each operation over an entire array before moving on to the next instruction.
My initial test isn't as exciting as I'd hoped, and I think I have an idea why.
Yes, I get vectorized operations: when adding two registers, it adds four floats from one register to four floats from the other in a single instruction. An add like that has a latency of a few cycles, though the CPU can usually start a new one every cycle. The issue here is probably going to be moving data from memory, doing one operation, then moving it back to memory. This means more time is spent waiting on loads and stores than on actual calculations. Still, I'm probably going to keep going down this path for now so I have a baseline to compare against when optimizing later on.
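To show the load/store concern in miniature, here is a plain C++ sketch of the two shapes a "multiply then add" could take. The function names and the a * b + c example are made up for illustration and aren't the library's API; the first version mirrors the one-operation-per-pass approach, the second does both operations while the data is still in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass per operation: every element is written to and read back from an
// intermediate buffer between the multiply and the add.
inline std::vector<float> mul_add_two_passes( const std::vector<float>& a,
                                              const std::vector<float>& b,
                                              const std::vector<float>& c )
{
    assert( a.size() == b.size() && b.size() == c.size() );
    std::vector<float> temp( a.size() ); // plays the role of the temp buffer
    std::vector<float> out( a.size() );
    for( std::size_t i = 0; i < a.size(); ++i )
    {
        temp[ i ] = a[ i ] * b[ i ];
    }
    for( std::size_t i = 0; i < a.size(); ++i )
    {
        out[ i ] = temp[ i ] + c[ i ];
    }
    return out;
}

// Single pass: the product never leaves a register before the add.
inline std::vector<float> mul_add_fused( const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         const std::vector<float>& c )
{
    assert( a.size() == b.size() && b.size() == c.size() );
    std::vector<float> out( a.size() );
    for( std::size_t i = 0; i < a.size(); ++i )
    {
        out[ i ] = a[ i ] * b[ i ] + c[ i ];
    }
    return out;
}
```

Both compute a * b + c per element; the difference is only in how many times each value travels between memory and registers. Whether the single pass is actually faster for a given array size would need measuring.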
One thing I am considering is unrolling the loops. This might hint to the compiler/CPU to prefetch some data and give the CPU a chance to perform a few tasks while waiting for more data to either load or store.
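A rough sketch of what an unrolled loop could look like, using plain floats to keep it short (in the library each element would presumably be a whole register). add_unrolled4 is a made-up name and the unroll factor of four is an arbitrary choice for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: out[i] = a[i] + b[i], with the loop body unrolled
// four times and a cleanup loop for the leftover elements.
inline void add_unrolled4( const float* a, const float* b, float* out, std::size_t count )
{
    assert( count == 0 || ( a != nullptr && b != nullptr && out != nullptr ) );

    std::size_t i = 0;
    for( ; i + 4 <= count; i += 4 )
    {
        // Four independent adds: none depends on another's result, so the
        // CPU is free to overlap their loads and stores.
        const float r0 = a[ i + 0 ] + b[ i + 0 ];
        const float r1 = a[ i + 1 ] + b[ i + 1 ];
        const float r2 = a[ i + 2 ] + b[ i + 2 ];
        const float r3 = a[ i + 3 ] + b[ i + 3 ];
        out[ i + 0 ] = r0;
        out[ i + 1 ] = r1;
        out[ i + 2 ] = r2;
        out[ i + 3 ] = r3;
    }

    // Cleanup: handles the last count % 4 elements.
    for( ; i < count; ++i )
    {
        out[ i ] = a[ i ] + b[ i ];
    }
}
```

Compilers may already unroll simple loops like this on their own at higher optimization levels, so it's worth timing both versions before committing to the manual one.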