I think I have a plan for my SIMD library: a SimdCompute shader.
I've always wondered whether processing an algorithm one element at a time is faster or slower than breaking it up into individual operations and running each one over all the elements...reminds me of something I think cyberyxmen once brought up. Now, in your main application code, it would be difficult to break up your operations like this and keep track of the intermediate results. It would also look pretty messy, I think.
So my idea is to have users create buffers and write a shader that takes care of the processing for them. The library would read in the shader file or string, interpret the operations, and record them to a vector, hence the need for a common base class for the ops. When the user needs to process the buffers, each operation is run over the entire buffer before moving on to the next operation. So, if you had a shader that computed the dot product of two Vec3s, it would look something like:
Code:
// First, the compilation
// A Vec3 dot product is three multiplies and two adds
ops.emplace_back(std::make_unique<mulps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<mulps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<mulps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<addps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<addps>(lhs_buffer_rhs_buffer));
// Then, when the user calls some compute function
void dispatcher( float4* out, float4* lhs, float4* rhs, size_t Count )
{
    for( const auto& op : ops )
    {
        // determine if op is unary, binary or generating
        // ...is binary
        // determine if parameters are both buffers, left buffer right single, left single right buffer
        // ...both buffers
        // argument order matches process( lhs, rhs, out, Count ) below
        op->process( lhs, rhs, out, Count );
    }
}
// Example of multiplication op
struct mulps : binary_op_base
{
    mulps( binary_param_type _par_type )
    {
        par_type = _par_type;
    }
    // Both parameters are buffers: multiply element by element
    void process( float4* lhs, float4* rhs, float4* out, size_t Count ) const override
    {
        for( size_t i = 0; i < Count; ++i )
        {
            out[ i ] = lhs[ i ] * rhs[ i ];
        }
    }
    // Left single value, right buffer
    void process( float4 lhs, float4* rhs, float4* out, size_t Count ) const override
    {
        // Not implemented
    }
    // Left buffer, right single value
    void process( float4* lhs, float4 rhs, float4* out, size_t Count ) const override
    {
        // Not implemented
    }
};
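To make the record-then-dispatch idea concrete, here is a minimal, compilable sketch. It is not the library's actual design: it uses plain float buffers instead of float4, and the names BinaryOp, Mul, Add, Buffers, dispatch and record_dot3 are all made up for illustration. Each recorded op carries the indices of the buffers it reads and writes, and the dispatcher runs every op over the whole buffer before moving on to the next one.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical op interface: one virtual call per whole buffer, not per element.
struct BinaryOp {
    std::size_t lhs, rhs, out;  // indices into the buffer table
    BinaryOp(std::size_t l, std::size_t r, std::size_t o) : lhs(l), rhs(r), out(o) {}
    virtual ~BinaryOp() = default;
    virtual void process(const float* l, const float* r, float* o,
                         std::size_t count) const = 0;
};

struct Mul : BinaryOp {
    using BinaryOp::BinaryOp;
    void process(const float* l, const float* r, float* o,
                 std::size_t count) const override {
        for (std::size_t i = 0; i < count; ++i) o[i] = l[i] * r[i];
    }
};

struct Add : BinaryOp {
    using BinaryOp::BinaryOp;
    void process(const float* l, const float* r, float* o,
                 std::size_t count) const override {
        for (std::size_t i = 0; i < count; ++i) o[i] = l[i] + r[i];
    }
};

using Buffers = std::vector<std::vector<float>>;

// Runs each recorded op over the first `count` elements of its buffers.
void dispatch(const std::vector<std::unique_ptr<BinaryOp>>& ops,
              Buffers& bufs, std::size_t count) {
    for (const auto& op : ops) {
        assert(bufs[op->lhs].size() >= count);
        assert(bufs[op->rhs].size() >= count);
        assert(bufs[op->out].size() >= count);
        op->process(bufs[op->lhs].data(), bufs[op->rhs].data(),
                    bufs[op->out].data(), count);
    }
}

// Records the five ops of a Vec3 dot product.
// Assumed slots: 0..2 = lhs x,y,z; 3..5 = rhs x,y,z; 6..8 = temporaries.
// The result ends up in slot 6.
std::vector<std::unique_ptr<BinaryOp>> record_dot3() {
    std::vector<std::unique_ptr<BinaryOp>> ops;
    ops.emplace_back(std::make_unique<Mul>(0, 3, 6));  // t0 = ax * bx
    ops.emplace_back(std::make_unique<Mul>(1, 4, 7));  // t1 = ay * by
    ops.emplace_back(std::make_unique<Mul>(2, 5, 8));  // t2 = az * bz
    ops.emplace_back(std::make_unique<Add>(6, 7, 6));  // t0 = t0 + t1
    ops.emplace_back(std::make_unique<Add>(6, 8, 6));  // t0 = t0 + t2
    return ops;
}
```

In this sketch the intermediate results get written out to temporary buffers and read back in by later ops, which is exactly the extra loading mentioned below that some kind of cache would have to deal with.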
A few things I'll need to work out are using buffers together with a constant buffer, and using a cache of commonly loaded values so they don't have to be loaded repeatedly for the same operation. I'll also have to work out some things regarding data layout. Ideally, I want the data set up in such a way that, for the dot product example, I can multiply and add without having to do any shuffling, which means a structure-of-arrays layout.
Still in very early stages, so we'll see how things progress.
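For the layout point, a rough sketch of what structure of arrays buys you. The Vec3Buffer and dot3_soa names are hypothetical, and unaligned loads are assumed for simplicity. With all the x values contiguous, then all the y, then all the z, four dot products come out of three mulps and two addps with no shuffles at all. The SSE path is only compiled where the intrinsics are available; otherwise the scalar loop does all the work.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical structure-of-arrays container: one contiguous array per component.
struct Vec3Buffer {
    std::vector<float> x, y, z;
};

// Writes dot(a[i], b[i]) to out[i] for i in [0, count).
void dot3_soa(const Vec3Buffer& a, const Vec3Buffer& b,
              float* out, std::size_t count) {
    std::size_t i = 0;
#if SKETCH_HAS_SSE
    // Four dot products per iteration: three multiplies, two adds, no shuffles.
    for (; i + 4 <= count; i += 4) {
        __m128 xx = _mm_mul_ps(_mm_loadu_ps(&a.x[i]), _mm_loadu_ps(&b.x[i]));
        __m128 yy = _mm_mul_ps(_mm_loadu_ps(&a.y[i]), _mm_loadu_ps(&b.y[i]));
        __m128 zz = _mm_mul_ps(_mm_loadu_ps(&a.z[i]), _mm_loadu_ps(&b.z[i]));
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_add_ps(xx, yy), zz));
    }
#endif
    // Scalar loop for the leftover elements (or everything, without SSE).
    for (; i < count; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    }
}
```

With an array-of-structures layout (x, y, z interleaved per vector) the same computation would need shuffles or horizontal adds to line the components up first.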