SIMD library attempt inf

Posted: September 7th, 2020, 5:03 pm
by albinopapa

Code: Select all

template<typename T, typename U>
struct common_type {
	using intermediate_type = typename get_higher_value_<T, U>::type;
	static constexpr auto use_float = is_simd_float<intermediate_type>::value;

	using integral_type = std::conditional_t<
		( !use_float ) && ( is_unsigned_v<T> || is_unsigned_v<U> ),
		make_simd_unsigned_t<intermediate_type>,
		intermediate_type
	>;

	using type = std::conditional_t<
		use_float,		// ?
		mm_float_t,		// :
		integral_type
	>;
};
Kind of lame, but I thought this was kind of funny. If you look at this struct, it really feels like it should be a function.
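
As an aside, a hypothetical usage would read like a type-level function call ( the wrapper type names here are assumptions, borrowed from later in this thread ):

Code: Select all

// illustrative only; assumes mm_uint16_t and mm_float_t are the
// library's unsigned 16-bit and float wrapper types
using blended_t = common_type<mm_uint16_t, mm_float_t>::type;
static_assert( std::is_same_v<blended_t, mm_float_t> );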

I'm making an SSE library. SIMD ( SSE/AVX ) doesn't distinguish between the different integral types, so it's nearly impossible to create math operator overloads for them directly. SSE has only one integral type, __m128i. There are add and subtract intrinsics that treat this one type as four distinct element widths, a few multiplication intrinsics, and no divide intrinsics at all. So this library will attempt to separate out and provide overloads for the integer types, as well as interaction between integral types and floating point types.
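
To make that concrete, the add intrinsics alone come in four element widths, all over that one register type ( a minimal sketch ):

Code: Select all

#include <emmintrin.h> // SSE2

__m128i a = _mm_set1_epi8( 1 );
__m128i b = _mm_set1_epi8( 2 );
auto r8  = _mm_add_epi8 ( a, b );  // 16 lanes of 8-bit adds
auto r16 = _mm_add_epi16( a, b );  // 8 lanes of 16-bit adds
auto r32 = _mm_add_epi32( a, b );  // 4 lanes of 32-bit adds
auto r64 = _mm_add_epi64( a, b );  // 2 lanes of 64-bit adds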

For instance, when blending two colors the most straightforward approach is to have the alpha in the range of 0.f to 1.f. If you aren't comfortable or familiar with bit manipulation, it's just easier to convert the four color channels to floats, do the math, then convert back to chars. While this is less efficient than manipulating bits and approximating, it is more readable and straightforward.
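
A scalar sketch of that float round trip, assuming 8-bit channels and an alpha already mapped to 0.f - 1.f:

Code: Select all

unsigned char blend_channel( unsigned char dst, unsigned char src, float alpha ) {
	// convert to float, interpolate, convert back
	float result = static_cast<float>( src ) * alpha
		+ static_cast<float>( dst ) * ( 1.f - alpha );
	return static_cast<unsigned char>( result );
}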

With SSE, it becomes even more convoluted. You have to convert the chars to shorts manually, then do the normal stuff: extract the alpha, multiply alpha with the three color channels, add those results together, shift right by 8 ( like dividing by 256 ), then convert the shorts back to chars. The steps are simple and straightforward, but figuring out which intrinsics to use and how to use them was a pain to learn, and they are still a pain to write and an eyesore in the code.

As the title jokingly states, this isn't my first attempt at making an SSE library, but I have a better understanding of templates now, and with features like constexpr-if it is a bit easier, or at least less tedious.

Re: SIMD library attempt inf

Posted: September 9th, 2020, 6:10 am
by chili
avx-512 poggy
when rocket lake comes out, chili gonna make a new build baby

Re: SIMD library attempt inf

Posted: September 10th, 2020, 3:09 am
by albinopapa
Yeah, I'm still a little unhappy AMD hasn't jumped on that train yet; they only got AVX2 as of Ryzen 1, and even then it was a hack. I'll be interested to see what kind of performance gains AVX-512 instructions give over, say, AVX/AVX2.

Re: SIMD library attempt inf

Posted: September 10th, 2020, 6:35 am
by chili
It was funny, in Skylake-X there was a whole debacle with it, but it seems that won't be the case with Rocket Lake.

Re: SIMD library attempt inf

Posted: September 11th, 2020, 5:52 pm
by albinopapa
While writing this library, I decided to create some concepts to constrain the types that template functions can accept. While reading through some research material, and kind of remembering chili's video on concepts, I wanted to try the shorthand version.

Code: Select all

auto operator+(simd_type auto lhs, simd_type auto rhs)noexcept{
	using T = decltype( lhs );
	using U = decltype( rhs );

	if constexpr( std::is_same_v<T, common_simd_type_t<T, U>> ) {
		return lhs + T{ rhs };
	}
	else {
		return U{ lhs } + rhs;
	}
}
If I remember correctly about chili's attempt at this, IntelliSense didn't like this format. However, when I try it, it has no problems...until I compile. When doing so, I was greeted with a few errors, all centering on 'auto'.
error C3533: a parameter cannot have a type that contains 'auto'
The rest of the errors cascaded from this one: since the parameter declarations are rejected, the types are never deduced, so everything in the body that relies on those types is invalid as well.

I've not seen it before where IntelliSense decides something is valid when it isn't. Most of the time it's the opposite: it says something is invalid that is in fact valid.

I like Bjarne's suggestion of being able to write: auto operator+( simd_type lhs, simd_type rhs );
Unfortunately, it seems there has been pushback on whether it is clear that this is a templated function or not. I find that ridiculous considering all the hubbub about how the user shouldn't have to worry about implementation details. I understand somewhat, since a concept isn't a type, so even something like: auto operator+( simd_type T lhs, simd_type U rhs ); would be acceptable to me.
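
For reference, the equivalent long-form declaration, which is the spelling compilers did accept at the time, would be something like:

Code: Select all

// same function as above, with explicit template parameters
template<simd_type T, simd_type U>
auto operator+( T lhs, U rhs ) noexcept {
	if constexpr( std::is_same_v<T, common_simd_type_t<T, U>> ) {
		return lhs + T{ rhs };
	}
	else {
		return U{ lhs } + rhs;
	}
}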

Re: SIMD library attempt inf

Posted: September 11th, 2020, 7:29 pm
by albinopapa
Anyway, so far I have support for 8 SSE and SSE2 types: I separated the integer __m128i type into signed and unsigned 8, 16 and 32 bit types, plus the float and double types.

I haven't gotten around to division of integer types yet; well, not completely. Currently I just add the divisor to itself until it's greater than the dividend. This would be VERY inefficient for something like:
( INT_MAX, 32, 69, 42 ) / ( 1, 16, 31, 2 )

I know I can cut this down to a max of 32 iterations for the 32 bit representations, which is a far better worst case than INT_MAX / 1 additions. I just haven't gotten around to doing the mental gymnastics yet...still stretching.

I did find a pretty straightforward formula for approximating sine and cosine. I haven't tested it thoroughly yet, so I don't know how accurate it will be overall; initial tests show in some cases it's accurate up to 5 decimal places, in others only 2. I don't expect this to be used in anything more than just games ( or by anyone else lol ), so I think it will be an acceptable approximation. It also doesn't use any of the <cmath> library, and a C++ version can be constexpr, which could be better than, say, a lookup table depending on how it's used.
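
Just to illustrate the constexpr point, something along these lines ( a plain Taylor-series sketch, not necessarily the formula I found ) can be evaluated at compile time:

Code: Select all

// illustrative only -- a 7th-order Taylor approximation, assuming x
// has already been range-reduced to [-pi, pi]
constexpr float sin_approx( float x ) noexcept {
	const float x2 = x * x;
	// x - x^3/3! + x^5/5! - x^7/7!
	return x * ( 1.f + x2 * ( -1.f / 6.f
		+ x2 * ( 1.f / 120.f + x2 * ( -1.f / 5040.f ) ) ) );
}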

Here's a list of math functions implemented:
  • template<simd_type T> T abs( T lhs )noexcept
  • template<simd_floating_point T> T ceil( T lhs )noexcept
  • template<simd_floating_point T> T floor( T lhs )noexcept
  • template<simd_floating_point T> T fmod( T lhs, T rhs )noexcept
  • mm_float_t __vectorcall rcp( mm_float_t lhs ) noexcept
  • mm_float_t __vectorcall rsqrt( mm_float_t lhs ) noexcept
  • template<simd_floating_point T> T __vectorcall sqrt( T lhs )noexcept
  • template<simd_floating_point T> T __vectorcall is_nan( T lhs, T rhs )noexcept
  • template<simd_floating_point T> T __vectorcall not_is_nan( T lhs, T rhs )noexcept
  • template<simd_type T> T __vectorcall max( T lhs, T rhs )noexcept
  • template<simd_type T> T __vectorcall min( T lhs, T rhs )noexcept
  • template<simd_floating_point T> std::pair<T, T> sincos( T x )noexcept
  • template<simd_floating_point T> T __vectorcall sin( T lhs )noexcept
  • template<simd_floating_point T> T __vectorcall cos( T lhs )noexcept
Aside from a "math library", each type has its own ( +, -, *, /, +=, -=, *=, /= ) operator overloads.
The integer types have bit operations ( <<, >>, ~, ^, &, |, <<=, >>=, &=, |=, ^= )
All types have comparison operators ( ==, != )
As well as ordering operators ( <, >, <=, >= )

A few caveats:
  • As mentioned, I haven't finished with the division operator for the integer types.
  • The member operator~() for 'bitwise not' cannot take any parameters, so there really isn't a good operator alternative for SSE's _mm_andnot_si128() intrinsic. If you want its behavior, you'll have to call the intrinsic directly.
  • The comparison operators and ordering operators would normally return bool for scalar types; however, since SIMD types contain multiple values within their 128 bit registers, the best I can do is return a simd bool type, which can be converted to bool if ALL elements satisfy the operation.
  • Because there are multiple ways of representing true, in terms of the 16-bit mask _mm_movemask_epi8 produces:
    for 8-bit elements, each true element sets 1 bit, so bits 0-15 can be set
    for 16-bit elements, each true element sets 2 bits ( 11 ), so pairs of bits from 0-15 are set
    for 32-bit elements, each true element sets 4 bits ( 1111 ), so groups of 4 bits from 0-15 are set
    for float ( via _mm_movemask_ps ), each true element sets 1 bit, so only the first 4 bits are used
    and for double ( via _mm_movemask_pd ), each true element sets 1 bit, so only the first 2 bits are used.
    I don't see a way of having a single mm_bool_t type without some extra work in the constructor...which could be a possibility.
There are functions like: bool all(), bool any(), and bool none() in the bool types.
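
As a sketch of how those could work for one of the bool types ( assuming a 16-bit element bool wrapper; not necessarily the exact code ):

Code: Select all

#include <emmintrin.h>

struct mm_bool16_t {
	__m128i value; // each 16-bit element is 0x0000 or 0xFFFF

	// movemask gathers the sign bit of each byte into a 16-bit scalar mask
	bool all() const noexcept { return _mm_movemask_epi8( value ) == 0xFFFF; }
	bool any() const noexcept { return _mm_movemask_epi8( value ) != 0; }
	bool none() const noexcept { return _mm_movemask_epi8( value ) == 0; }
};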


Once I get the integer division done, I'll be able to create a testbed/demo/benchmark and see how useful the lib is.

Re: SIMD library attempt inf

Posted: September 12th, 2020, 5:01 am
by albinopapa
Well, I kind of saw this coming, but ignored it. All the comparison intrinsics for integral types are signed, so in order to make comparisons work for unsigned, the values have to be promoted, compared, then the comparison results packed back into the original integer width.

Example:
unsigned char a = 0;
unsigned char b = 255;
bool c = a<b; // <- this won't work as expected with simd instructions

under the hood, the comparison ends up being
char a = 0;
char b = -1;
bool c = a<b; // which is of course not what is expected

so to counteract this
unsigned char a = 0;
unsigned char b = 255;
short sa = static_cast<short>( a ); // 0 extends a
short sb = static_cast<short>( b ); // 0 extends b
bool c = sa < sb; // Now we get the correct answer

Though this may be trivial in C++, simd would look like:
__m128i a = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
__m128i b = _mm_setr_epi8( 255, 254, 253, 252, 251, 250, 249, 248, 247, 246, 245, 244, 243, 242, 241, 240 );

// promote to 16 bit by unpacking the lower half and upper half of 'a' and 'b'
__m128i alo = _mm_unpacklo_epi8( a, _mm_setzero_si128() );
__m128i ahi = _mm_unpackhi_epi8( a, _mm_setzero_si128() );
__m128i blo = _mm_unpacklo_epi8( b, _mm_setzero_si128() );
__m128i bhi = _mm_unpackhi_epi8( b, _mm_setzero_si128() );

// do the comparisons with the two lower and two upper halves
__m128i reslo = _mm_cmplt_epi16( alo, blo );
__m128i reshi = _mm_cmplt_epi16( ahi, bhi );

// pack results back into a single register with signed saturation
// the compare results are 0 or -1 ( all ones ) per element; signed saturation
// keeps -1 as 0xFF, whereas unsigned saturation would clamp it to 0
__m128i result = _mm_packs_epi16( reslo, reshi );
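
As an aside, there's a shorter trick for unsigned comparisons: flip the sign bit of both operands, and a signed compare then gives the unsigned ordering ( a sketch, reusing 'a' and 'b' from above ):

Code: Select all

// bias both operands by 0x80; 0 becomes -128 and 255 becomes 127, so the
// signed compare produces the unsigned ordering without any promotion
const __m128i bias = _mm_set1_epi8( (char)0x80 );
__m128i result2 = _mm_cmplt_epi8( _mm_xor_si128( a, bias ),
                                  _mm_xor_si128( b, bias ) );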

The integer simd data type is __m128i which is a union of arrays of the integer types equaling 16 bytes:
16 char/unsigned char
8 short/unsigned short
4 int/unsigned int
2 long long/unsigned long long
So I've always wondered how the CPU knows how to handle signedness. Well, it's encoded per intrinsic: the epi# intrinsics are the signed instructions and the epu# intrinsics are the unsigned ones. To treat the data as unsigned where only a signed instruction exists, you have to go up a type, where all unsigned values of the smaller type fit into the signed range of the bigger type.
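
For example, with the saturating adds the same bits give different answers depending on which flavor you call ( a minimal sketch ):

Code: Select all

__m128i x = _mm_set1_epi8( (char)200 ); // 200 unsigned, -56 signed
__m128i y = _mm_set1_epi8( 100 );

auto s = _mm_adds_epi8( x, y ); // signed saturating add:   -56 + 100 = 44
auto u = _mm_adds_epu8( x, y ); // unsigned saturating add: 200 + 100 clamps to 255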

Hope this information is useful to someone. When I first started working with SSE, I had no clue how any of this stuff worked...duh. I fumbled my way through for some time and stuck with the floats for a while, because floats have more intrinsics and, for the most part, more useful ones. This is why it has taken so long to create a simple SSE library. I'd do the same for AVX, but currently one of the peeps I'm working with has a laptop that only supports up to SSE2, so it's not worth my time atm.

Re: SIMD library attempt inf

Posted: September 13th, 2020, 4:53 am
by albinopapa
Okay, now I remember why I quit writing my past libs: they don't translate very well when you try to use them. I don't keep usage in mind when creating these libs.

For instance, the only algorithm I know ( I'm lame, I know ) is alpha blending with the integer type.

Code: Select all

__m128i color1;
__m128i color2;

// unpack the colors into two registers of eight 16-bit elements each, zero extended
auto left_lo = _mm_unpacklo_epi8( color1, _mm_setzero_si128() );
auto left_hi = _mm_unpackhi_epi8( color1, _mm_setzero_si128() );
auto right_lo = _mm_unpacklo_epi8( color2, _mm_setzero_si128() );
auto right_hi = _mm_unpackhi_epi8( color2, _mm_setzero_si128() );

// extract alpha channel from first four and second four elements of color1
auto temp = _mm_shufflelo_epi16( left_lo, _MM_SHUFFLE( 3, 3, 3, 3 ) );
auto left_alpha_lo = _mm_shufflehi_epi16( temp, _MM_SHUFFLE( 3, 3, 3, 3 ) );
temp = _mm_shufflelo_epi16( left_hi, _MM_SHUFFLE( 3, 3, 3, 3 ) );
auto left_alpha_hi = _mm_shufflehi_epi16( temp, _MM_SHUFFLE( 3, 3, 3, 3 ) );

// get inverse alpha by subtracting left_alpha from 255
const auto schar_max = _mm_set1_epi16( 255 );
auto right_alpha_lo = _mm_sub_epi16( schar_max, left_alpha_lo );
auto right_alpha_hi = _mm_sub_epi16( schar_max, left_alpha_hi );

// multiply left color channels by left_alpha
auto premultiplied_left_color_lo = _mm_mullo_epi16( left_lo, left_alpha_lo );
auto premultiplied_left_color_hi = _mm_mullo_epi16( left_hi, left_alpha_hi );

// multiply right color channels by right_alpha
auto premultiplied_right_color_lo = _mm_mullo_epi16( right_lo, right_alpha_lo );
auto premultiplied_right_color_hi = _mm_mullo_epi16( right_hi, right_alpha_hi );

// add left and right premultiplied lows and highs
auto sum_lo = _mm_add_epi16( premultiplied_left_color_lo, premultiplied_right_color_lo );
auto sum_hi = _mm_add_epi16( premultiplied_left_color_hi, premultiplied_right_color_hi );

// shift right by 8
auto color_lo = _mm_srli_epi16( sum_lo, 8 );
auto color_hi = _mm_srli_epi16( sum_hi, 8 );

// pack the two registers of eight 16-bit elements into sixteen 8-bit elements
auto color = _mm_packus_epi16( color_lo, color_hi );
So those are all the steps needed to blend two registers of 4 colors each using SSE2 intrinsics. Despite the number of instructions, it's actually pretty fast.

Now, for how it looks in my libraries current state:

Code: Select all

	// helper function ( not part of library )
	auto broadcast_alpha = []( mm_uint16_t val ) {
		auto lo = _mm_shufflelo_epi16( val.value, _MM_SHUFFLE( 3, 3, 3, 3 ) );
		return mm_uint16_t{ _mm_shufflehi_epi16( lo, _MM_SHUFFLE( 3, 3, 3, 3 ) ) };
	};

	// constructors taking a single argument are the same as calling
	// _mm_set1_epi# where # is 8, 16 or 32
	const auto schar_max = mm_uint16_t{ 255 };

	// conversion from mm_uint8_t to mm_uint16_t unpacks the lower half,
	// so this shuffle of 32 bit elements moves the upper half down to the
	// lower half so we can unpack it as well
	auto left_lo = mm_uint16_t{ lhs };
	auto left_hi = mm_uint16_t{ _mm_shuffle_epi32( lhs.value, _MM_SHUFFLE( 3, 2, 3, 2 ) ) };
	auto right_lo = mm_uint16_t{ rhs };
	auto right_hi = mm_uint16_t{ _mm_shuffle_epi32( rhs.value, _MM_SHUFFLE( 3, 2, 3, 2 ) ) };

	// calls the helper function that still uses intrinsics
	auto left_alpha_lo = broadcast_alpha( left_lo );
	auto left_alpha_hi = broadcast_alpha( left_hi );
	
	// subtract left alpha from 255 to get inverse alpha
	auto right_alpha_lo = schar_max - left_alpha_lo;
	auto right_alpha_hi = schar_max - left_alpha_hi;

	// sum the products of left and right color channels and alpha channels
	auto premultiplied_color_lo = ( left_alpha_lo * left_lo ) + ( right_alpha_lo * right_lo );
	auto premultiplied_color_hi = ( left_alpha_hi * left_hi ) + ( right_alpha_hi * right_hi );

	// shift right by 8 
	auto color_lo = premultiplied_color_lo >> 8;
	auto color_hi = premultiplied_color_hi >> 8;

	// pack the two registers of eight 16-bit elements into sixteen 8-bit elements,
	// still using intrinsics
	auto color = mm_uint8_t{ _mm_packus_epi16( color_lo.value, color_hi.value ) };
So, there are some niceties with the library, for math and bit shifting, but the promotions require extra steps and the demotions still require intrinsic calls. As a matter of fact, a lot of the library unpacks and packs registers to perform operations, which wastes a lot of CPU cycles.

One thing I've thought about doing to reduce this would be to waste memory instead. Basically, I'd promote the lower types to the next type up upon construction. That way the entirety of the values would be maintained and there wouldn't be a need for expansion and compaction at every operation.

There are 16 8-bit elements in the __m128i type. So to maintain that element count rather than the size, I would need to store two __m128i members in mm_int16_t and mm_uint16_t, and four __m128i members in mm_int32_t and mm_uint32_t ( or two, and just treat them as mm_int16_t or mm_uint16_t types ).
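
A sketch of that storage idea ( hypothetical type, not part of the library ):

Code: Select all

// 16 unsigned 8-bit elements kept permanently promoted as two
// registers of eight 16-bit elements each
struct mm_wide_uint8_t {
	__m128i lo; // elements 0-7, zero extended to 16 bits
	__m128i hi; // elements 8-15, zero extended to 16 bits

	explicit mm_wide_uint8_t( __m128i packed ) noexcept
		:
		lo{ _mm_unpacklo_epi8( packed, _mm_setzero_si128() ) },
		hi{ _mm_unpackhi_epi8( packed, _mm_setzero_si128() ) }
	{}
};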

This would save time when dealing with char sized data or maintaining sign. However, I don't have a way after that to distinguish whether it's a packed simd type or if it's actually going to be used as the type desired. For instance, if I were to hold 4 __m128i members in mm_int32_t, and I call the constructor: mm_int32_t{ 1, 2, 3, 4 }, would I really just set each member to a single digit? Obviously, this defeats the purpose of SIMD and I need to rethink what I truly want/need from a SIMD library.

I know there are some recurring bits of code; one of the most common I've run into is combining elements based on some condition. An example would be min or max type functions.

A min function in C++: auto result = a < b ? a : b;
A min function in SSE:
auto lt_mask = _mm_cmplt_epi16( a, b );
auto a_mins = _mm_and_si128( lt_mask, a );
auto b_mins = _mm_andnot_si128( lt_mask, b );
auto result = _mm_or_si128( a_mins, b_mins );

Since this is a pretty common set of instructions, combining two registers using some conditional mask would be a prime candidate for a helper ( granted, SSE2 does have min/max instructions for 16-bit signed and 8-bit unsigned elements, but the other widths and signedness combinations didn't arrive until SSE4.1 ).

The function would be something like:
__m128i mm_conditional_si128( __m128i a, __m128i b, __m128i mask ){
	return _mm_or_si128(
		_mm_and_si128( mask, a ),
		_mm_andnot_si128( mask, b )
	);
}

Now, my min function would look like:
__m128i mm_min_epi8( __m128i a, __m128i b ){
	return mm_conditional_si128( a, b, _mm_cmplt_epi8( a, b ) );
}

and max:
__m128i mm_max_epi8( __m128i a, __m128i b ){
	return mm_conditional_si128( a, b, _mm_cmpgt_epi8( a, b ) );
}

Damn, this sucks. I hate spending days on something and not having it turn out the way I wanted.

Re: SIMD library attempt inf

Posted: September 13th, 2020, 9:03 pm
by albinopapa
How about a shuffle function that can handle all eight 16-bit elements:

Code: Select all

template<
	std::size_t _0, std::size_t _1, std::size_t _2, std::size_t _3,
	std::size_t _4, std::size_t _5, std::size_t _6, std::size_t _7>
	__m128i mm_shuffle_epi16( __m128i val ) {
	constexpr auto mask = []() {
		constexpr auto keep = std::uint16_t( -1 );
		alignas( 16 ) std::array<std::uint16_t, 8> mask = {};
		if constexpr( _0 < 4 ) mask[ 0 ] = keep;
		if constexpr( _1 < 4 ) mask[ 1 ] = keep;
		if constexpr( _2 < 4 ) mask[ 2 ] = keep;
		if constexpr( _3 < 4 ) mask[ 3 ] = keep;
		if constexpr( _4 < 4 ) mask[ 4 ] = keep;
		if constexpr( _5 < 4 ) mask[ 5 ] = keep;
		if constexpr( _6 < 4 ) mask[ 6 ] = keep;
		if constexpr( _7 < 4 ) mask[ 7 ] = keep;
		return mask;
	}( );
	constexpr auto shlo_mask = _MM_SHUFFLE( _3 % 4, _2 % 4, _1 % 4, _0 % 4 );
	constexpr auto shhi_mask = _MM_SHUFFLE( _7 % 4, _6 % 4, _5 % 4, _4 % 4 );
	constexpr auto copy_lo = _MM_SHUFFLE( 1, 0, 1, 0 );
	constexpr auto copy_hi = _MM_SHUFFLE( 3, 2, 3, 2 );

	auto mMask = _mm_load_si128( reinterpret_cast< __m128i const* >( mask.data() ) );
	
	auto lo = _mm_shuffle_epi32( val, copy_lo );
	auto hi = _mm_shuffle_epi32( val, copy_hi );

	auto lolo = _mm_shufflelo_epi16( lo, shlo_mask );
	auto hilo = _mm_shufflelo_epi16( hi, shlo_mask );
	
	auto result_lo = _mm_shufflehi_epi16( lolo, shhi_mask );
	auto result_hi = _mm_shufflehi_epi16( hilo, shhi_mask );
	
	return mm_conditional_si128( result_lo, result_hi, mMask );
}
I'm just going to say, wow! Constexpr can be awesome.

Considering the mask array is created during compilation and the modulos are done during compilation, the intrinsics are really the only things being executed at runtime. This function allows you to shuffle between the upper four and lower four 16 bit elements.

Using the alpha blending example ( which I think has a bug in it somewhere ):
Normally, you'd have four shuffles
auto temp = _mm_shufflelo_epi16( left_lo, _MM_SHUFFLE( 3, 3, 3, 3 ) );
auto left_alpha_lo = _mm_shufflehi_epi16( temp, _MM_SHUFFLE( 3, 3, 3, 3 ) );
temp = _mm_shufflelo_epi16( left_hi, _MM_SHUFFLE( 3, 3, 3, 3 ) );
auto left_alpha_hi = _mm_shufflehi_epi16( temp, _MM_SHUFFLE( 3, 3, 3, 3 ) );

With this function, you'd have 2
auto left_alpha_lo = mm_shuffle_epi16<3, 3, 3, 3, 7, 7, 7, 7>( left_lo );
auto left_alpha_hi = mm_shuffle_epi16<3, 3, 3, 3, 7, 7, 7, 7>( left_hi );

Not a great use case for such a helper function, but it still works as expected.

A better use case would be if you wanted to swap the upper and lower halves for some reason:
auto a = _mm_set_epi16( 0, 1, 2, 3, 4, 5, 6, 7 ); // in memory ( 7, 6, 5, 4, 3, 2, 1, 0 )
auto b = mm_shuffle_epi16<4, 5, 6, 7, 0, 1, 2, 3>( a );
// result is ( 3, 2, 1, 0, 7, 6, 5, 4 )
or maybe reverse the order
auto c = mm_shuffle_epi16<7, 6, 5, 4, 3, 2, 1, 0>( a );
// result is ( 0, 1, 2, 3, 4, 5, 6, 7 )


At this point I'm just wasting time lol.

Re: SIMD library attempt inf

Posted: September 15th, 2020, 9:55 pm
by albinopapa
Well, if anyone is wondering how to do integer division without using /, here's my solution, along with the helper functions required:

Code: Select all

__m128i __vectorcall mm_div_epi32( __m128i dividend, __m128i divisor )
{
	const auto posone = _mm_set1_epi32( 1 );

	auto quotient = _mm_setzero_si128();

	// mask for getting correct sign of result
	const auto neg_mask = _mm_xor_si128(
		_mm_cmplt_epi32( dividend, _mm_setzero_si128() ),
		_mm_cmplt_epi32( divisor, _mm_setzero_si128() )
	);

	// convert to absolutes for easy division
	dividend = mm_abs_epi32( dividend );
	divisor = mm_abs_epi32( divisor );

	// the remainder must start from the absolute dividend
	auto remainder = dividend;

	// lanes with a zero divisor are kept out of the outer loop,
	// so their quotient just stays 0
	const auto zero_div = _mm_cmpeq_epi32( divisor, _mm_setzero_si128() );

	for( auto mask = _mm_andnot_si128( zero_div, mm_cmpge_epi32( dividend, divisor ) );
		_mm_movemask_epi8( mask ) != 0;
		mask = _mm_andnot_si128( zero_div, mm_cmpge_epi32( remainder, divisor ) ) )
	{
		auto cur_quotient = _mm_setzero_si128();
		auto cur_mask = mm_cmpge_epi32( remainder, divisor );
		auto sum = _mm_setzero_si128();
		auto overflow = _mm_setzero_si128();
		for( auto shift = 0; _mm_movemask_epi8( cur_mask ) != 0; ++shift ) {
			// multiply divisor by 2 while less than or equal to the remainder
			auto temp_div = _mm_slli_epi32( divisor, shift );

			// once a lane's shifted divisor goes negative or to 0 it has
			// overflowed; keep that sticky so the lane stays excluded
			overflow = _mm_or_si128( overflow, _mm_or_si128(
				_mm_cmplt_epi32( temp_div, _mm_setzero_si128() ),
				_mm_cmpeq_epi32( temp_div, _mm_setzero_si128() ) ) );
			cur_mask = _mm_andnot_si128( overflow, mm_cmple_epi32( temp_div, remainder ) );

			// multiply the quotient step by 2 as well
			auto cur_count = _mm_slli_epi32( posone, shift );

			// keep the largest step and scaled divisor per lane
			cur_quotient = mm_ternary( cur_mask, cur_count, cur_quotient );
			sum = mm_ternary( cur_mask, temp_div, sum );
		}

		quotient = _mm_add_epi32( quotient, cur_quotient );
		remainder = _mm_sub_epi32( remainder, sum );
	}

	// collect elements that need inverting
	auto negatives = _mm_and_si128( neg_mask, quotient );
	// collect elements that don't need inverting
	auto positives = _mm_andnot_si128( neg_mask, quotient );

	auto inverted = mm_negate_epi32( negatives );
	return _mm_or_si128( inverted, positives );
}
Worst case scenario ( INT_MAX / 1 ) this does about 31 iterations of the outer loop, and only around 527 iterations of the inner loop in total.
There is no real handling of divide by 0; lanes with a zero divisor are just masked out of the loops and their quotient comes back as 0.

It's literally long division ( a scalar sketch follows below ):
  • double the divisor while it's less than or equal to the remainder
  • doubling a 1 alongside it gives the quotient contribution for that step
  • subtract the scaled divisor from the remainder
  • repeat until the remainder is less than the divisor
The presteps are getting the signs of the divisor and dividend and determining the sign of the result,
then making both dividend and divisor positive ( which will have consequences if either is INT_MIN; we'll just call that undefined behavior ).

The post steps are changing the signs using the mask created in the presteps.
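
Here's the same shift-and-subtract idea as a scalar sketch for one lane ( unsigned, no sign handling, illustrative only ):

Code: Select all

unsigned div_u32( unsigned dividend, unsigned divisor ) {
	unsigned quotient = 0;
	unsigned remainder = dividend;
	while( divisor != 0 && remainder >= divisor ) {
		unsigned scaled = divisor;
		unsigned step = 1;
		// double until the next doubling would overshoot ( or overflow )
		while( ( scaled << 1 ) <= remainder && ( scaled << 1 ) > scaled ) {
			scaled <<= 1;
			step <<= 1;
		}
		quotient += step;     // quotient contribution for this pass
		remainder -= scaled;  // the long-division subtraction
	}
	return quotient;
}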

helper functions:
compare less or equal

Code: Select all

__m128i __vectorcall mm_cmple_epi32( __m128i a, __m128i b ) {
	auto lt = _mm_cmplt_epi32( a, b );
	auto eq = _mm_cmpeq_epi32( a, b );
	return _mm_or_si128( lt, eq );
};
compare greater or equal

Code: Select all

__m128i __vectorcall mm_cmpge_epi32( __m128i a, __m128i b ) {
	auto gt = _mm_cmpgt_epi32( a, b );
	auto eq = _mm_cmpeq_epi32( a, b );
	return _mm_or_si128( gt, eq );
};
negate ( positive -> negative, negative -> positive )

Code: Select all

__m128i __vectorcall mm_negate_epi32( __m128i lhs )noexcept {
	auto negone = _mm_cmpeq_epi32( _mm_setzero_si128(), _mm_setzero_si128() );
	auto posone = _mm_sub_epi32( _mm_setzero_si128(), negone );

	auto xresult = _mm_xor_si128( lhs, negone );
	return _mm_add_epi32( xresult, posone );
}
absolute

Code: Select all

__m128i __vectorcall mm_abs_epi32( __m128i lhs )noexcept {
	// get mask of negatives
	auto cmp = _mm_cmplt_epi32( lhs, _mm_setzero_si128() );
	
	return _mm_or_si128( 
		// flip negatives to positives
		mm_negate_epi32( _mm_and_si128( cmp, lhs ) ),
		// collect elements already positive
		_mm_andnot_si128( cmp, lhs )
	);
}
ternary ( mask ? a : b )

Code: Select all

template<sse_intrinsic T>
T __vectorcall mm_ternary( T mask, T a, T b )noexcept {
	if constexpr( std::is_same_v<T, __m128i> ) {
		return _mm_or_si128( 
			_mm_and_si128( mask, a ),
			_mm_andnot_si128( mask, b )
		);
	}
	else if constexpr( std::is_same_v<T, __m128> ) {
		return _mm_or_ps(
			_mm_and_ps( mask, a ),
			_mm_andnot_ps( mask, b )
		);
	}
	else {
		return _mm_or_pd(
			_mm_and_pd( mask, a ),
			_mm_andnot_pd( mask, b )
		);
	}
}