crickets..

Post by **SlevinKelevra** » January 15th, 2018, 9:47 pm

albinopapa wrote: I know. Well, so goes the chili community. Everyone uses Discord and drops comments on the vids, so there's not much left to say here. Kind of sucks for me though, as I don't want to keep up with the Discord chat server. I like being able to have a record of conversations here on the forum, as well as being able to actually see when someone needs help. I don't have to constantly watch; I can pop in and out and help, participate or whatever. On Discord the chats happen so fast, I don't really feel useful hehe.

Same here! This forum is dead and I don't like it. Just missing the good old times without Discord and shit.

There are only 1-2 posts per week and Chili himself is also pretty absent.

Post by **chili** » January 16th, 2018, 1:23 am

SlevKel: I am always watching you.

Papa: N-Body simulation is a fun one to do with SIMD. Another easy one is Mandelbrot. BTW, you see the 'new' Skylake-X processors? They got AVX-512 (32 x 64-byte registers, that's 2kB of register space!).

Post by **albinopapa** » January 16th, 2018, 4:37 am

Yeah, I messed with the N-Body code you posted a while ago. At the time I couldn't wrap my head around how to calculate all the interactions, but I might give it another go. The thing that threw me is how the inner and outer loops interact with regard to the SIMD register width. Your use of function-like macros made it hard to follow how you handled that, since you can't step through macros in the debugger. I'll probably move that code to a function just to be able to step through it.
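As a rough illustration of how the two loops can line up with the register width, here is a minimal sketch. It is not chili's macro code: `Bodies`, `kLanes`, `kSoftening` and `accelerations` are hypothetical names, the simulation is 2D with G = 1, and the lanes are written as a plain loop where a real version would use one SSE instruction per line.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays body storage (names are illustrative only).
struct Bodies
{
	std::vector<float> x, y, mass;
};

constexpr std::size_t kLanes = 4;    // floats per 128-bit SSE register
constexpr float kSoftening = 1e-3f;  // keeps the i == j term finite (it contributes 0)

// Acceleration on every body. The body count must be a multiple of kLanes.
inline void accelerations( const Bodies& b, std::vector<float>& ax, std::vector<float>& ay )
{
	const std::size_t count = b.x.size();
	assert( count % kLanes == 0 );
	ax.assign( count, 0.0f );
	ay.assign( count, 0.0f );

	// Outer loop: one body i at a time; its position is broadcast to every lane.
	for( std::size_t i = 0; i < count; ++i )
	{
		float sum_x[ kLanes ] = {};
		float sum_y[ kLanes ] = {};

		// Inner loop: step by the register width, so each pass covers kLanes bodies.
		for( std::size_t j = 0; j < count; j += kLanes )
		{
			// Stand-in for the SIMD instructions: lane k handles body j + k.
			for( std::size_t lane = 0; lane < kLanes; ++lane )
			{
				const float dx = b.x[ j + lane ] - b.x[ i ];
				const float dy = b.y[ j + lane ] - b.y[ i ];
				const float r2 = dx * dx + dy * dy + kSoftening;
				const float m_over_r3 = b.mass[ j + lane ] / ( r2 * std::sqrt( r2 ) );
				sum_x[ lane ] += dx * m_over_r3;
				sum_y[ lane ] += dy * m_over_r3;
			}
		}

		// Horizontal add: collapse the per-lane partial sums once per outer iteration.
		for( std::size_t lane = 0; lane < kLanes; ++lane )
		{
			ax[ i ] += sum_x[ lane ];
			ay[ i ] += sum_y[ lane ];
		}
	}
}
```

The point of the layout is that the outer loop never cares about the register width; only the inner loop's stride and the final horizontal add do.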

I'm kind of sad that AMD hasn't joined the AVX512 bandwagon yet. With that much space and throughput, it's on par with GPU SIMD. Unfortunately, GPUs still have more "cores/threads/registers". I wish AMD had come up with intrinsics for their APU chips. That would have made programming for the GPU side pretty easy. Instead we have to go through AMP/DirectCompute/OpenCL. AMP is pretty easy I suppose, considering it's mostly C++ lol.

Post by **albinopapa** » January 17th, 2018, 9:37 am

I think I have a plan for my SIMD library. SimdCompute shader.

I've always wondered whether processing an algorithm one element at a time is faster or slower than breaking it up into operations and running each operation over all the elements... reminds me of something I think cyberyxmen once brought up. Now, in your main application code, it would be difficult to break up your operations like this and keep track of the results. It would also look pretty messy I think.

So my idea is to have users create buffers and write a shader that takes care of the processing for them. The library reads in the shader file or string, interprets the operations and records them to a vector, thus the need for a common base class. When the user needs to process the buffers, the operations are run one by one, each over the entire buffer. So, if you had a shader that computed the dot product of two Vec3s, it would look something like:


// First, the compilation
// A Vec3 dot product is three multiplies and two adds
// lhs_buffer_rhs_buffer: the binary_param_type value meaning both operands are buffers
ops.emplace_back(std::make_unique<mulps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<mulps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<mulps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<addps>(lhs_buffer_rhs_buffer));
ops.emplace_back(std::make_unique<addps>(lhs_buffer_rhs_buffer));

// Then, when the user calls some compute function
void dispatcher( float4* out, float4* lhs, float4* rhs, size_t Count )
{
	for( const auto& op : ops )
	{
		// determine if op is unary, binary or generating
		// ...is binary
		// determine if parameters are both buffers, left buffer right single, left single right buffer
		// ...both buffers
		// argument order matches process(): inputs first, then output, then element count
		op->process( lhs, rhs, out, Count );
	}
}

// Example of multiplication op
struct mulps : binary_op_base
{
	mulps( binary_param_type _par_type )
	{
		par_type = _par_type;
	}
	// buffer * buffer, element by element
	void process( float4* lhs, float4* rhs, float4* out, size_t Count ) const override
	{
		for( size_t i = 0; i < Count; ++i )
		{
			out[ i ] = lhs[ i ] * rhs[ i ];
		}
	}
	// single value * buffer
	void process( float4 lhs, float4* rhs, float4* out, size_t Count ) const override
	{
		// Not implemented
	}
	// buffer * single value
	void process( float4* lhs, float4 rhs, float4* out, size_t Count ) const override
	{
		// Not implemented
	}
};
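The op list above leaves open how each op knows which buffers to read and where intermediate results go. One possible way to fill that in, purely as an assumption, is to give each recorded op the indices of its operand and output buffers. In the sketch below `binary_op`, `mul_op`, `add_op`, `Program`, `make_dot3` and `dispatch` are hypothetical names, and plain `float` buffers stand in for `float4`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Buffer = std::vector<float>;

// Hypothetical common base: an op records which buffer slots it reads and writes.
struct binary_op
{
	std::size_t lhs, rhs, out;
	binary_op( std::size_t l, std::size_t r, std::size_t o ) : lhs( l ), rhs( r ), out( o ) {}
	virtual ~binary_op() = default;
	virtual void process( const float* l, const float* r, float* o, std::size_t count ) const = 0;
};

struct mul_op : binary_op
{
	using binary_op::binary_op;
	void process( const float* l, const float* r, float* o, std::size_t count ) const override
	{
		for( std::size_t i = 0; i < count; ++i ) o[ i ] = l[ i ] * r[ i ];
	}
};

struct add_op : binary_op
{
	using binary_op::binary_op;
	void process( const float* l, const float* r, float* o, std::size_t count ) const override
	{
		for( std::size_t i = 0; i < count; ++i ) o[ i ] = l[ i ] + r[ i ];
	}
};

using Program = std::vector<std::unique_ptr<binary_op>>;

// "Compile" a Vec3 dot product.
// Slots 0-2 hold lhs x/y/z, slots 3-5 hold rhs x/y/z, slots 6-8 are scratch.
// The result ends up in slot 6.
inline Program make_dot3()
{
	Program ops;
	ops.emplace_back( std::make_unique<mul_op>( 0, 3, 6 ) ); // t0 = x1 * x2
	ops.emplace_back( std::make_unique<mul_op>( 1, 4, 7 ) ); // t1 = y1 * y2
	ops.emplace_back( std::make_unique<mul_op>( 2, 5, 8 ) ); // t2 = z1 * z2
	ops.emplace_back( std::make_unique<add_op>( 6, 7, 6 ) ); // t0 = t0 + t1
	ops.emplace_back( std::make_unique<add_op>( 6, 8, 6 ) ); // t0 = t0 + t2
	return ops;
}

// Run every op over the whole buffer before moving on to the next op.
inline void dispatch( const Program& ops, std::vector<Buffer>& slots, std::size_t count )
{
	for( const auto& op : ops )
	{
		assert( slots[ op->lhs ].size() >= count );
		assert( slots[ op->rhs ].size() >= count );
		assert( slots[ op->out ].size() >= count );
		op->process( slots[ op->lhs ].data(), slots[ op->rhs ].data(), slots[ op->out ].data(), count );
	}
}
```

A caller would fill slots 0 to 5, call `dispatch( make_dot3(), slots, count )`, and read the dot products from slot 6. The scratch slots are the cost of going op by op: every intermediate result makes a round trip through memory instead of staying in a register.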

A few things I'll need to work out are how to use buffers alongside a constant buffer, and how to cache commonly loaded values so they don't have to be loaded repeatedly for the same operation. I'll also have to work out some things regarding data layout. Ideally, I want the data set up in such a way that, for the dot product example, I can multiply and add without having to do any shuffling, which means a structure of arrays layout.
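To make the layout point concrete, here is a small sketch of going from an array of structures to a structure of arrays and taking the dot product from there. `Vec3`, `Vec3SoA`, `to_soa` and `dot3` are hypothetical names for illustration, not part of the library described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element: x, y and z of one vector sit next to each other.
struct Vec3
{
	float x, y, z;
};

// Hypothetical structure-of-arrays container: all x values are contiguous,
// then all y values, then all z values.
struct Vec3SoA
{
	std::vector<float> x, y, z;
};

inline Vec3SoA to_soa( const std::vector<Vec3>& aos )
{
	Vec3SoA soa;
	soa.x.reserve( aos.size() );
	soa.y.reserve( aos.size() );
	soa.z.reserve( aos.size() );
	for( const Vec3& v : aos )
	{
		soa.x.push_back( v.x );
		soa.y.push_back( v.y );
		soa.z.push_back( v.z );
	}
	return soa;
}

// out[i] = dot( a[i], b[i] ) for every pair of vectors.
inline std::vector<float> dot3( const Vec3SoA& a, const Vec3SoA& b )
{
	assert( a.x.size() == b.x.size() );
	const std::size_t count = a.x.size();
	std::vector<float> out( count );

	// Each pass walks contiguous floats, so four neighbouring elements can be
	// loaded into one register and multiplied/added lane by lane, no shuffles.
	for( std::size_t i = 0; i < count; ++i ) out[ i ] = a.x[ i ] * b.x[ i ];
	for( std::size_t i = 0; i < count; ++i ) out[ i ] += a.y[ i ] * b.y[ i ];
	for( std::size_t i = 0; i < count; ++i ) out[ i ] += a.z[ i ] * b.z[ i ];
	return out;
}
```

With the array-of-structures form, one 128-bit load would pick up `x, y, z` of one vector plus `x` of the next, and the lanes would have to be rearranged before they could be summed. The structure-of-arrays form avoids that at the price of a conversion step or of storing the data this way from the start.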

Still in very early stages, so we'll see how things progress.

Post by **MrGodin** » January 17th, 2018, 5:09 pm

That looks interesting albinopapa
