DX11 XMVECTOR & XMFLOAT3, etc.

Post by **albinopapa** » December 23rd, 2015, 2:15 am

I found chili's NBody simulation and did some playing around. Here are my findings.

The tests were run with the 2000 particles that chili uses by default in his original version.
NBody without SSE (chili's code)
Average is about 30 ms per frame

NBody with SSE SoA (chili's code)
Average is about 3 ms per frame

NBody with SSE AoS (my code)
Average is about 8 ms per frame

NBody with SSE SoAoS (my code)
Average is about 4 ms per frame
*SoAoS: a structure of Vec2 arrays, where each Vec2 is itself a structure. The difference is that position, velocity and acceleration are each stored in their own array, and therefore two consecutive Vec2s can be loaded into a register at once. In the AoS version the elements of the register have to be set individually, but I was still able to process them two at a time, since a Vec2 is two floats and two of them fit in a register.
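The three layouts being compared could be sketched roughly as follows. This is only an illustration: `BodyAoS`, `BodiesSoAoS`, `BodiesSoA` and the stride constants are names assumed here, not taken from either project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// AoS: one struct per body, so body[i].pos and body[i + 1].pos are
// separated by the vel and accel of body i.
struct BodyAoS { Vec2 pos, vel, accel; };

// SoAoS: one array per attribute; each element is still a Vec2, so
// pos[i] and pos[i + 1] are four consecutive floats.
struct BodiesSoAoS {
    std::vector<Vec2> pos, vel, accel;
    explicit BodiesSoAoS(std::size_t n) : pos(n), vel(n), accel(n) {
        assert(n % 2 == 0);  // two bodies are processed per 128-bit register
    }
};

// SoA: one array per scalar component, so posX[i..i+3] are four
// consecutive floats belonging to four different bodies.
struct BodiesSoA {
    std::vector<float> posX, posY, velX, velY, accelX, accelY;
    explicit BodiesSoA(std::size_t n)
        : posX(n), posY(n), velX(n), velY(n), accelX(n), accelY(n) {}
};

// Number of floats between the start of one position and the next.
constexpr std::size_t posStrideAoS   = sizeof(BodyAoS) / sizeof(float);
constexpr std::size_t posStrideSoAoS = sizeof(Vec2) / sizeof(float);
```

With a stride of 2 floats, two positions fill one 4-float register with a single load. With a stride of 6 floats (AoS) the two positions are not adjacent, which is why that version has to assemble the register element by element.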

I'm just going to copy paste the functions instead of uploading the project just to show what I did. You can download chili's nBody here.

NBodyAoS


	void Step()
	{
		__m128 grav = _mm_set1_ps(constGrav * mass);
		__m128 mAccel = _mm_set1_ps(maxAccel);
		__m128 temp0;
		alignas(16) Vec2 temp[2];	// 16-byte aligned so the _mm_store_ps calls below are valid

		for (int i = 0; i < nBodies; i += 2)
		{
			UINT ik = i + 1;
			// Load position vectors for outer loop
			__m128 iPos = _mm_setr_ps(body[i].pos.x, body[i].pos.y, body[ik].pos.x, body[ik].pos.y);
			// Load velocity vectors for outer loop
			__m128 iVel = _mm_setr_ps(body[i].vel.x, body[i].vel.y, body[ik].vel.x, body[ik].vel.y);
			// Load acceleration vectors for the outer loop
			__m128 iAccel = _mm_setr_ps(body[i].accel.x, body[i].accel.y, body[ik].accel.x, body[ik].accel.y);

			for (UINT j = 1; j < nBodies; j += 2)
			{
				UINT jk = j + 1;
				// Load position vectors for inner loop
				__m128 jPos = _mm_setr_ps(body[j].pos.x, body[j].pos.y, body[jk].pos.x, body[jk].pos.y);
				// Load acceleration vectors for the inner loop
				__m128 jAccel = _mm_setr_ps(body[j].accel.x, body[j].accel.y, body[jk].accel.x, body[jk].accel.y);

				// Calculate the displacement (delta) between the two points
				__m128 delta = _mm_sub_ps(jPos, iPos);
				// Calculate the actual square distance using dot product
				temp0 = _mm_mul_ps(delta, delta);
				// X0, Y0, X1, Y1 -> Y0, X0, Y1, X1
				__m128 temp1 = _mm_shuffle_ps(temp0, temp0, _MM_SHUFFLE(2, 3, 0, 1));
				__m128 distSqr = _mm_add_ps(temp0, temp1);
				// Calculate normal by multiplying delta by the reciprocal square root 
				// of the actual square distance
				__m128 normal = _mm_mul_ps(delta, _mm_rsqrt_ps(distSqr));
				// Limit the force to maxAccel
				__m128 force = _mm_min_ps(_mm_mul_ps(grav, _mm_rcp_ps(distSqr)), mAccel);
				// Calculate the new acceleration vector
				temp0 = _mm_mul_ps(normal, force);

				// Subtract new acceleration from inner loop acceleration
				jAccel = _mm_sub_ps(jAccel, temp0);

				// Store updated acceleration
				_mm_storeu_ps((float*)&temp, jAccel);
				body[j].accel = temp[0];
				body[jk].accel = temp[1];
			}
			// Add new acceleration to outer loop
			iAccel = _mm_add_ps(iAccel, temp0);
			// Add acceleration to velocity
			iVel = _mm_add_ps(iVel, iAccel);
			// Add velocity to position
			iPos = _mm_add_ps(iPos, iVel);

			// Store updated velocity and position and zero acceleration
			_mm_store_ps((float*)&temp, iVel);
			body[i].vel = temp[0];
			body[ik].vel = temp[1];
			_mm_store_ps((float*)&temp, iPos);
			body[i].pos = temp[0];
			body[ik].pos = temp[1];
			_mm_store_ps((float*)&temp, _mm_setzero_ps());
			body[i].accel = temp[0];
			body[ik].accel = temp[1];
		}
	}

NBodySoAoS


	void Step()
	{		
		__m128 grav = _mm_set1_ps(constGrav * mass);
		__m128 mAccel = _mm_set1_ps(maxAccel);
		__m128 temp0;
		
		for (int i = 0; i < nBodies; i += 2)
		{
			__m128 iPos = _mm_load_ps((float*)&body.pos[i]);

			for (int j = 1; j < nBodies; j += 2)
			{
				// Load position vectors for inner loop
				__m128 jPos = _mm_loadu_ps((float*)&body.pos[j]);
				// Load acceleration vectors for the inner loop
				__m128 jAccel = _mm_loadu_ps((float*)&body.accel[j]);

				// Calculate the displacement (delta) between the two points
				__m128 delta = _mm_sub_ps(jPos, iPos);
				// Calculate the actual square distance using dot product
				temp0 = _mm_mul_ps(delta, delta);
				// X0, Y0, X1, Y1 -> Y0, X0, Y1, X1
				__m128 temp1 = _mm_shuffle_ps(temp0, temp0, _MM_SHUFFLE(2, 3, 0, 1));
				__m128 distSqr = _mm_add_ps(temp0, temp1);
				// Calculate normal by multiplying delta by the reciprocal square root 
				// of the actual square distance
				__m128 normal = _mm_mul_ps(delta, _mm_rsqrt_ps(distSqr));
				// Limit the force to maxAccel
				__m128 force = _mm_min_ps(_mm_mul_ps(grav, _mm_rcp_ps(distSqr)), mAccel);
				// Calculate the new acceleration vector
				temp0 = _mm_mul_ps(normal, force);

				// Subtract new acceleration from inner loop acceleration
				jAccel = _mm_sub_ps(jAccel, temp0);

				// Store updated acceleration
				_mm_storeu_ps((float*)&body.accel[j], jAccel);
			}
			// Load acceleration vectors for the outer loop
			__m128 iAccel = _mm_load_ps((float*)&body.accel[i]);
			// Add new acceleration to outer loop
			iAccel = _mm_add_ps(iAccel, temp0);
			// Load velocity vectors for outer loop
			__m128 iVel = _mm_load_ps((float*)&body.vel[i]);
			// Add acceleration to velocity
			iVel = _mm_add_ps(iVel, iAccel);
			// Add velocity to position
			iPos = _mm_add_ps(iPos, iVel);

			// Store updated velocity and position and zero acceleration
			_mm_store_ps((float*)&body.vel[i], iVel);
			_mm_store_ps((float*)&body.pos[i], iPos);
			_mm_store_ps((float*)&body.accel[i], _mm_setzero_ps());
		}
	}
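The squared-distance step shared by both versions can be checked in isolation. The sketch below pulls out just the multiply, shuffle and add; the helper names `lengthSqr2x2` and `lengthSqr2x2ToArray` are made up for this example and it needs an x86/x64 compiler with SSE.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Squared lengths of two 2D vectors packed as (x0, y0, x1, y1).
// Result lanes are (d0, d0, d1, d1), where dk = xk*xk + yk*yk.
inline __m128 lengthSqr2x2(__m128 v)
{
    // x0*x0, y0*y0, x1*x1, y1*y1
    __m128 sq = _mm_mul_ps(v, v);
    // Swap neighbours inside each pair: y0*y0, x0*x0, y1*y1, x1*x1
    __m128 swapped = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1));
    // Each lane now holds the sum of its own pair
    return _mm_add_ps(sq, swapped);
}

// Convenience wrapper (hypothetical) for inspecting the lanes as floats.
inline void lengthSqr2x2ToArray(const float* in, float* out)
{
    assert(in != nullptr && out != nullptr);
    _mm_storeu_ps(out, lengthSqr2x2(_mm_loadu_ps(in)));
}
```

Feeding in (3, 4, 1, 2) gives (25, 25, 5, 5), so each Vec2 ends up with its squared length duplicated across its x and y lanes. Note that `_mm_rsqrt_ps` and `_mm_rcp_ps`, used right after this step, are approximations, so small differences from a scalar `1.0f / sqrt(x)` version are expected.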

Post by **albinopapa** » December 23rd, 2015, 2:22 am

Notice how in the first one the index is beside body, as in body[i], and in the second one the index is by the elements of body, as in body.pos[i]. The Body struct isn't really necessary; it was just a way of containing the 3 arrays. In chili's code he just has arrays for posX, posY, velX, velY, accelX, accelY.

Even an array of Vec2s is still an array of structures, so I don't feel like it's cheating too much. That is also why I did the worst-case scenario, the more typical array of structures: a struct with 3 Vec2s, and an array of those, just to see what the effect would be. As you can see, you can still get a boost of more than 3x (about 30 ms down to about 8 ms per frame).

Post by **albinopapa** » December 23rd, 2015, 2:43 am

Just realized a flaw in my code. The particles only need to affect all other particles and not themselves. Will have to figure something out.

Guess I'm not sure how it's supposed to work. The code posted seems to work, but with a slight deviation where too much gravity is being applied; in other words, the masses collide and combine pretty quickly. If I change it so that j = i + 1, so there is no overlap, then there isn't enough gravity and all the particles just stray off into oblivion.

What am I doing wrong?
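For comparison, a plain scalar version of the same step may help pin down what the SSE code should compute. This is a minimal sketch, not chili's code: it reuses the formula from the posted functions (normalised delta times a force clamped to maxAccel), and it assumes `gravTimesMass` stands for `constGrav * mass`. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };
struct Body { Vec2 pos, vel, accel; };

// Visit every unordered pair exactly once and apply equal and opposite
// accelerations to both bodies of the pair.
void accumulateAccel(std::vector<Body>& body, float gravTimesMass, float maxAccel)
{
    assert(maxAccel > 0.0f);
    for (std::size_t i = 0; i < body.size(); ++i) {
        // j starts at i + 1: no body is paired with itself, no pair is repeated
        for (std::size_t j = i + 1; j < body.size(); ++j) {
            const float dx = body[j].pos.x - body[i].pos.x;
            const float dy = body[j].pos.y - body[i].pos.y;
            const float distSqr = dx * dx + dy * dy;
            if (distSqr == 0.0f) continue;  // coincident bodies: direction undefined
            const float invDist = 1.0f / std::sqrt(distSqr);
            const float force = std::min(gravTimesMass / distSqr, maxAccel);
            const float ax = dx * invDist * force;
            const float ay = dy * invDist * force;
            body[i].accel.x += ax;  body[i].accel.y += ay;  // i is pulled toward j
            body[j].accel.x -= ax;  body[j].accel.y -= ay;  // j is pulled toward i
        }
    }
}

// Run only after ALL pairs have been accumulated, so no body moves
// while other bodies are still measuring distances to it.
void integrate(std::vector<Body>& body)
{
    for (Body& b : body) {
        b.vel.x += b.accel.x;  b.vel.y += b.accel.y;
        b.pos.x += b.vel.x;    b.pos.y += b.vel.y;
        b.accel = Vec2{0.0f, 0.0f};
    }
}
```

Two things differ from the posted Step(): every pair's contribution is added to the outer body inside the inner loop, rather than only the last `temp0` after it, and positions are updated in a separate pass once all accelerations are known.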

Post by **chili** » December 23rd, 2015, 6:22 am

I'd have to take a deeper look at it, but with j = i + 1 I guess there is no calculation of the forces between the two elements that fit into an xmm register.

I'm planning on using the NBody example for an SSE lesson after I finish the current arc on integer SSE. It gets a nice boost with AVX as well (a little under 2x SSE).
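One way to see which pairs a two-wide loop arrangement actually visits is to count them without doing any physics. The sketch below only emulates the index pattern of the posted loops, assuming lane 0 pairs body i with body j and lane 1 pairs body i + 1 with body j + 1; `Coverage` and `countPairs` are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

struct Coverage {
    std::size_t visits = 0;       // lane-level interactions evaluated
    std::size_t uniquePairs = 0;  // distinct unordered {a, b} pairs among them
    std::size_t selfPairs = 0;    // visits where a body is paired with itself
};

// startAtIPlusOne == false emulates "j = 1", true emulates "j = i + 1".
Coverage countPairs(std::size_t n, bool startAtIPlusOne)
{
    assert(n >= 2);
    Coverage c;
    std::set<std::pair<std::size_t, std::size_t>> seen;
    for (std::size_t i = 0; i < n; i += 2) {
        for (std::size_t j = startAtIPlusOne ? i + 1 : 1; j < n; j += 2) {
            for (std::size_t lane = 0; lane < 2; ++lane) {
                const std::size_t a = i + lane;
                const std::size_t b = j + lane;
                if (a >= n || b >= n) continue;  // index past the end: skipped here
                ++c.visits;
                if (a == b) { ++c.selfPairs; continue; }
                seen.insert({std::min(a, b), std::max(a, b)});
            }
        }
    }
    c.uniquePairs = seen.size();
    return c;
}
```

Under that assumption, for n = 8 both arrangements reach only 16 of the 28 possible pairs (always one even-indexed and one odd-indexed body), and the j = 1 arrangement evaluates 12 of them twice. A full N-body step needs n * (n - 1) / 2 unique pairs, so comparing `uniquePairs` against that number is a quick sanity check for any rearrangement of the loops.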

Post by **MrGodin** » December 23rd, 2015, 6:31 pm

SSE isn't that important to you right now I would guess. Transformations for static meshes and the like should all be done on the GPU. SSE won't help you there. SSE comes more into play in skeletal animation (I think bone blending is done on the CPU and then vertex skinning is done on GPU) and physics stuff.

So SSE is best suited for large, complex math equations. I see, I'll keep that in mind, thanks.

So far I am just rendering terrain, with the concentration on shading. Now I am going to try LOD, which I think I understand. I think I'll try measuring the distance from the quad tree nodes to be rendered to the node the camera is in, and then do this:
LOD level 1


deviceContext->PSSetShaderResources(0, 1, &colorTex1);
deviceContext->PSSetShaderResources(1, 1, &colorTex2);
deviceContext->PSSetShaderResources(2, 1, &colorTex3);
deviceContext->PSSetShaderResources(3, 1, &colorTex4);
deviceContext->PSSetShaderResources(4, 1, &alphaTex1);
deviceContext->PSSetShaderResources(5, 1, &normalMap1);
deviceContext->PSSetShaderResources(6, 1, &normalMap2);

LOD level 2


deviceContext->PSSetShaderResources(0, 1, &colorTex1);
deviceContext->PSSetShaderResources(1, 1, &colorTex2);
deviceContext->PSSetShaderResources(2, 1, &colorTex3);
deviceContext->PSSetShaderResources(3, 1, &colorTex4);
deviceContext->PSSetShaderResources(4, 1, &alphaTex1);

LOD level 3


deviceContext->PSSetShaderResources(0, 1, &colorTex1);

Or something to this effect, I think, haha. I'll give it a try.
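The distance test itself might look something like the sketch below. The thresholds, the `Vec3` type and the function names are all assumptions made for illustration; only the texture counts (7, 5 and 1) come from the three lists above.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Straight-line distance between two points, e.g. node centres.
inline float distanceBetween(const Vec3& a, const Vec3& b)
{
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical cut-off distances; real values would be tuned per terrain.
struct LodThresholds { float nearDist; float midDist; };

// 1 = full detail, 2 = no normal maps, 3 = single colour texture.
inline unsigned lodLevel(float dist, const LodThresholds& t)
{
    assert(t.nearDist <= t.midDist);
    if (dist < t.nearDist) return 1;
    if (dist < t.midDist)  return 2;
    return 3;
}

// Number of textures bound at each level, matching the lists above.
inline unsigned texturesForLod(unsigned level)
{
    switch (level) {
    case 1:  return 7;  // 4 colour + 1 alpha + 2 normal maps
    case 2:  return 5;  // 4 colour + 1 alpha
    default: return 1;  // 1 colour
    }
}
```

Since PSSetShaderResources takes a start slot, a view count and a pointer to an array of views, the count returned here could be passed as the view count in a single call, assuming the views are kept in one array in slot order.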

Post by **albinopapa** » December 23rd, 2015, 9:02 pm

So the multiple colorTex resources, are those for texture blending? And dang, so that's how you use multiple textures. I figured having an array of textures and telling the system you were using N textures would do it, but it wasn't working for me.

Post by **MrGodin** » December 23rd, 2015, 9:38 pm

Yes, I have multiple textures, and the alpha map is an image with black, red, green and blue drawn on it in the areas you want blended. The shader reads the r, g, b values and uses them to blend. Rastertek is a wealth of resources to learn from. I studied what they did there and have implemented a whole tonne of his/her/their examples in my project. I am learning a lot, especially about shaders, which I had little knowledge of.
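The actual shader isn't shown in the thread, so the following is only a guess at the kind of blend described, written as CPU-side C++ rather than HLSL. It assumes black areas of the alpha map keep the base texture and that each of the r, g, b channels fades in one extra texture in turn; `Color`, `mix` and `blendTerrain` are hypothetical names.

```cpp
#include <cassert>

struct Color { float r, g, b; };

// Linear interpolation: t = 0 gives a, t = 1 gives b.
inline Color mix(const Color& a, const Color& b, float t)
{
    assert(t >= 0.0f && t <= 1.0f);
    return Color{ a.r + (b.r - a.r) * t,
                  a.g + (b.g - a.g) * t,
                  a.b + (b.b - a.b) * t };
}

// base shows where the alpha map is black; the red, green and blue
// channels of the alpha map weight texR, texG and texB respectively.
inline Color blendTerrain(const Color& base, const Color& texR,
                          const Color& texG, const Color& texB,
                          const Color& alpha)
{
    Color c = mix(base, texR, alpha.r);
    c = mix(c, texG, alpha.g);
    c = mix(c, texB, alpha.b);
    return c;
}
```

With four colour textures and one alpha map this matches the five-texture setup of LOD level 2, but the order of the blends and what the black areas map to are assumptions.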

Post by **clynch** » January 24th, 2017, 10:32 pm

XMVECTOR eyePosition = XMVectorSet(0.0f, 0.0f, 0.0f, 0.0f);
XMVECTOR focusPosition = XMVectorSet(0.0f, 0.0f, 1.0f, 0.0f);
XMVECTOR upDirection = XMVectorSet(1.0f, 0.0f, 0.0f, 0.0f);

Post by **albinopapa** » January 25th, 2017, 5:21 am


// This is all 0's:
// XMVECTOR eyePosition = XMVectorSet(0.0f, 0.0f, 0.0f, 0.0f);
// This is more efficient
XMVECTOR eyePosition = XMVectorZero();

// For the next two, you might be better off keeping an XMFLOAT4A with these values lying around instead of using XMVectorSet.
// In the Camera class for instance, store the two XMFLOAT4A members
XMFLOAT4A focusPosition, upDirection;	// declared as members of Camera

// These get created and used in the Camera::Update or Camera::Render function
XMVECTOR xmFocusPosition = XMLoadFloat4A(&focusPosition);
XMVECTOR xmUpDirection   = XMLoadFloat4A(&upDirection);

XMVectorSet first has to copy the 4 values into a 128-bit memory location, then copy that to the SSE register. If you have an XMFLOAT4A with these values lying around, the values are already in memory and aligned, so the copy will be more efficient.
