SSE and SSE2 are available in every single x86-family CPU with 64-bit support. You too can play around with SIMD, which is great fun! Unfortunately, SSE2 level in particular also happens to be what is probably the most maddeningly non-orthogonal SIMD instruction set in the world, where operations are either available or not available for particular data types with little rhyme or reason, especially where integers are involved. Later revisions (especially starting around SSE4.1) fill in some of the more annoying gaps, but plenty of us are stuck with supporting the older CPUs for at least a few more years, and besides – not to mess with the authentic SSE experience – even on AVX2-supporting CPUs, there are still a few of the classic gaps remaining.
So, here’s a list of tricks to get you around some of the more common, eh, “idiosyncrasies” of SSE and its descendants. This happens to be mostly focused on the integer side; the floating-point side is generally less, well, weird. I’ll keep the individual descriptions relatively brief since the whole point of this post is to collect lots of tricks. The assumption here is that you’re already somewhat familiar with the instructions, so I’ll not explain the basics (maybe another time). I’ll use the official Intel intrinsics (as exposed in C/C++) since that’s probably the most common way people interact with these instructions intentionally (awkward glance in the direction of autovectorization here. No making eye contact. Moving on.)
Branchless “select” (cond ? a : b)
The natural mode of operation in SIMD computations is to do things branchlessly. If some part of a computation is conditional, rather than doing the equivalent of an if, it’s more typical to do both the computation for the “if” and “else” forks, and then merge the results based on the condition. The “select” I mean is the operation which takes the condition and both results and performs the rough equivalent of C’s ternary operator cond ? a : b. You first evaluate both sides, giving a and b. You then evaluate the condition using a SIMD compare, which returns a vector containing a bit mask that has all bits set for lanes that meet cond, and all bits clear for lanes that don’t.
This select operation can always be done using a few bitwise operations (which is well known), but starting in SSE 4.1 we get slightly more efficient variants too (less well known, and the reason I mention this):
- Integer (all versions): _mm_or_si128(_mm_and_si128(a, cond), _mm_andnot_si128(cond, b)).
- 32-bit float (all versions): _mm_or_ps(_mm_and_ps(a, cond), _mm_andnot_ps(cond, b)).
- 64-bit float (all versions): _mm_or_pd(_mm_and_pd(a, cond), _mm_andnot_pd(cond, b)).
- Integer (SSE4.1+): _mm_blendv_epi8(b, a, cond).
- 32-bit float (SSE4.1+): _mm_blendv_ps(b, a, cond).
- 64-bit float (SSE4.1+): _mm_blendv_pd(b, a, cond).

Note the operand order in the blendv variants: they pick their second argument where the mask is set, so a and b appear swapped relative to the bitwise version.
The andnot operations don’t come in handy very often, but they’re the best choice here (pre-SSE4.1).
If you don’t want to use cond but its logical negation, just switch the positions of a and b, since (!cond) ? a : b is the same as cond ? b : a.
Unsigned integer compares
SSE, in all incarnations, offers precisely two types of integer comparisons: test for equality (PCMPEQt, _mm_cmpeq_T, where t and T stand for various type suffixes) and test for signed greater-than (PCMPGTt, _mm_cmpgt_T). Most other comparison types can be produced using nothing but logical negation and standard identities:

- a == b is supported directly.
- a != b is !(a == b).
- a > b (signed) is supported directly.
- a < b (signed) is the same as b > a (swap a and b).
- a >= b (signed) is !(a < b) (which in turn is !(b > a)).
- a <= b (signed) is !(a > b).
See previous note on selection operations on how to get rid of the NOT in the most common use case. Conspicuously absent from that list is any type of unsigned ordered comparison. However, a trick that works is to bias both integers so that signed comparison does the right thing:

- a > b (unsigned, 8-bit) is the same as (a - 0x80) > (b - 0x80) (signed, 8-bit).
- a > b (unsigned, 16-bit) is the same as (a - 0x8000) > (b - 0x8000) (signed, 16-bit).
- a > b (unsigned, 32-bit) is the same as (a - 0x80000000) > (b - 0x80000000) (signed, 32-bit).
The same argument-swapping and NOTing tricks as above still apply to give you the other compare types. In general, the trick is to add (or subtract, or XOR – they all do the same thing in this particular case) the INT_MIN for the respective type to both operands before doing the compare. This turns the smallest possible unsigned integer, 0, into the smallest possible signed integer for the given type; after that, the ordering works out. In particular, when comparing against a constant, this addition (or subtraction, or XOR) can be baked into the constant operand, so the unsigned compare “only” ends up doing one more operation than a signed compare (instead of two).
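A minimal sketch of the bias trick for 16-bit lanes (the function names are mine; there is no _mm_cmpgt_epu16 intrinsic before AVX-512, which is the whole point):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* unsigned a > b for 16-bit lanes on plain SSE2: XOR both sides with
   0x8000 (flipping the sign bit), then do a signed compare. */
static inline __m128i cmpgt_epu16_sse2(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi16((short)0x8000);
    return _mm_cmpgt_epi16(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

static int demo_cmpgt_epu16(void)
{
    /* 0xffff is 65535 unsigned but -1 signed: a plain PCMPGTW
       would get lane 0 wrong. */
    __m128i a = _mm_setr_epi16((short)0xffff, 100, 0, 3, 0, 0, 0, 0);
    __m128i b = _mm_setr_epi16(1, 200, 0, 2, 0, 0, 0, 0);
    __m128i gt = cmpgt_epu16_sse2(a, b);
    uint16_t out[8];
    memcpy(out, &gt, sizeof(out));
    return out[0] == 0xffff && out[1] == 0 && out[2] == 0 && out[3] == 0xffff;
}
```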
A completely different approach is to use the unsigned integer min/max instructions (more about those in a second) to build lessorequal or greaterorequal comparisons:

- a <= b if and only if max(a, b) == b.
- a >= b if and only if min(a, b) == b.
The good news is that this reduces unsigned comparisons to either an unsigned min or a max, followed by an equality comparison, which is only 2 instead of 3 operations. The bad news is that the requisite unsigned min/max operations only exist for uint8s in SSE2. The uint16/uint32 variants were finally added with SSE4.1; if your minimum target is earlier, you’re stuck with the bias-then-compare variants above.
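For the uint8 case that SSE2 does cover, the min/max route looks like this (helper names are my own):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* unsigned a <= b for 8-bit lanes: a <= b iff max(a, b) == b.
   _mm_max_epu8 is plain SSE2, so no SSE4.1 required. */
static inline __m128i cmple_epu8(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi8(_mm_max_epu8(a, b), b);
}

static int demo_cmple_epu8(void)
{
    __m128i a = _mm_setr_epi8((char)200, 10, 7, 0, 0,0,0,0, 0,0,0,0, 0,0,0,0);
    __m128i b = _mm_setr_epi8((char)199, 10, (char)255, 0, 0,0,0,0, 0,0,0,0, 0,0,0,0);
    __m128i le = cmple_epu8(a, b);
    uint8_t out[16];
    memcpy(out, &le, sizeof(out));
    /* 200 <= 199? no. 10 <= 10? yes. 7 <= 255? yes. */
    return out[0] == 0 && out[1] == 0xff && out[2] == 0xff;
}
```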
Integer min and max
SSE4.1 has the full set of integer min/max for 8-, 16- and 32-bit types, both signed and unsigned. So if you’re targeting SSE4.1 or later, good for you!
If you’re stuck with anything earlier, you’re decidedly more limited. In SSE2, you get integer min/max for uint8 and int16. If you need min/max for int8, uint16, or anything 32-bit, you’re on your own.
Luckily, we can just combine some of the techniques above to derive a solution. The general patterns here are:
max(a, b) == (a > b) ? a : b;
min(a, b) == (a > b) ? b : a;
So this is just a combination of a compare and a “select” operation. When the compare is signed (the int8 and int32 cases), the comparison maps to a single SSE intrinsic. The unsigned compares (uint16 and uint32) can be solved using the bias-then-signed-compare trick, which in turn gives us an unsigned min/max.
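Putting the pattern together for the int8 case (the function name is mine, standing in for the missing pre-SSE4.1 _mm_max_epi8):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* max of signed 8-bit lanes on plain SSE2: compare, then select,
   per max(a, b) == (a > b) ? a : b. */
static inline __m128i max_epi8_sse2(__m128i a, __m128i b)
{
    __m128i gt = _mm_cmpgt_epi8(a, b);
    return _mm_or_si128(_mm_and_si128(a, gt), _mm_andnot_si128(gt, b));
}

static int demo_max_epi8(void)
{
    __m128i a = _mm_setr_epi8(-5, 3, 100, -128, 0,0,0,0, 0,0,0,0, 0,0,0,0);
    __m128i b = _mm_setr_epi8(-4, 2, -100, 127, 0,0,0,0, 0,0,0,0, 0,0,0,0);
    __m128i m = max_epi8_sse2(a, b);
    int8_t out[16];
    memcpy(out, &m, sizeof(out));
    return out[0] == -4 && out[1] == 3 && out[2] == 100 && out[3] == 127;
}
```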
32-bit and 64-bit loads/stores
This one has nothing to do with the actual instruction set and everything to do with the intrinsics: yes, SSE2 has 32-bit (MOVD) and 64-bit (MOVQ) loads and stores, the standard intrinsics just do their best to confuse you about it:

- 64-bit loads are _mm_loadl_epi64. This intrinsic takes a __m128i * as an argument. Don’t take that seriously. The actual load is 64-bit sized, not 128-bit sized, and there is no alignment requirement.
- 64-bit stores are _mm_storel_epi64. Again, the __m128i * is confusing and does not mean that the actual store is 128-bit or that there are alignment requirements. It isn’t and there are not.
- 32-bit loads are even more hidden! Namely, you write _mm_cvtsi32_si128(*x) where x is a pointer to a 32-bit integer. No direct load intrinsic, but compilers will turn this into a MOVD with memory operand where applicable.
- 32-bit stores, likewise: *x = _mm_cvtsi128_si32(value). Now you know.
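A round-trip sketch of all four, just to show the casts in context (the demo function is mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

static int demo_small_loads_stores(void)
{
    uint64_t src64 = 0x1122334455667788ull, dst64 = 0;
    int32_t  src32 = 0x0badf00d, dst32 = 0;

    /* 64-bit load/store: the __m128i* casts are lies -- only 8 bytes
       move, and there is no alignment requirement. */
    __m128i v = _mm_loadl_epi64((const __m128i *)&src64);
    _mm_storel_epi64((__m128i *)&dst64, v);

    /* 32-bit "load" and "store" via the cvtsi intrinsics */
    __m128i w = _mm_cvtsi32_si128(src32);
    dst32 = _mm_cvtsi128_si32(w);

    return dst64 == src64 && dst32 == src32;
}
```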
Multiplies
There’s lots of different ways to provide multiplies in a SIMD instruction set, and by now SSE has tried most of them in one form or another.
Let’s start with the (historically) first variant: multiplying 16-bit numbers. The relevant instructions originated in the Pentium MMX and compute the low and high halves (bottom and top 16 bits) of a signed 16-bit×16-bit product. MMX only has signed multiplies, but SSE also added a “high half of unsigned 16-bit times 16-bit product” instruction (the low halves of signed and unsigned products are identical), so we’re not gonna have to worry about that particular problem, not yet anyway.
These instructions are fine if you want the low or high halves of the product. What if you want the full 32-bit product of vectors of 16-bit values? You compute the low and high halves and then merge them using the “unpack” instructions. This is the standard approach, but not very obvious if you haven’t dealt with this kind of thing before. So for a full 16×16→32-bit product (note this produces two vectors worth of results), we get:
// EITHER: a*b (16-bit lanes), signed
__m128i lo16 = _mm_mullo_epi16(a, b);
__m128i hi16 = _mm_mulhi_epi16(a, b);

// OR: a*b (16-bit lanes), unsigned
__m128i lo16 = _mm_mullo_epi16(a, b);
__m128i hi16 = _mm_mulhi_epu16(a, b);

// THEN: merge results
__m128i res0 = _mm_unpacklo_epi16(lo16, hi16); // result lanes 0..3
__m128i res1 = _mm_unpackhi_epi16(lo16, hi16); // result lanes 4..7
But what if you’re working with 32-bit values? There is a 32×32→32-bit product (PMULLD / _mm_mullo_epi32), but it was only added with SSE4.1, and it’s significantly slower than the other SSE2 multiplies in many implementations. So you might either not want to set your minimum target that high, or you might be looking for something quicker.
There are full 32×32→64-bit products: PMULUDQ / _mm_mul_epu32 (unsigned) has been around since SSE2, and PMULDQ / _mm_mul_epi32 (signed) was added in SSE4.1. These ones only compute two products (between the even lanes of the two sources) and place them in a 128-bit result. The odd 32-bit lanes are ignored completely, so if you want four 32×32→32-bit products, you need at least two of these multiplies and a lot of wrangling:
// res = _mm_mullo_epi32(a, b) equivalent using SSE2, via PMULUDQ
// (the low 32 bits of a product don't care about signedness).

// even and odd lane products
__m128i evnp = _mm_mul_epu32(a, b);
__m128i odda = _mm_srli_epi64(a, 32);
__m128i oddb = _mm_srli_epi64(b, 32);
__m128i oddp = _mm_mul_epu32(odda, oddb);

// merge results
__m128i evn_mask = _mm_setr_epi32(-1, 0, -1, 0);
__m128i evn_result = _mm_and_si128(evnp, evn_mask);
__m128i odd_result = _mm_slli_epi64(oddp, 32);
__m128i res = _mm_or_si128(evn_result, odd_result);
It works, but it’s a mouthful.
But what if you’re using 32-bit vector lanes, but happen to know that the numbers we’re trying to multiply are in fact in the range [-32768,32767] (i.e. representable as signed 16-bit integers)? We could try narrowing the 32-bit lanes into 16 bits then using the 16×16→32 sequences above, but is that really the best we can do?
It is not: PMADDWD (_mm_madd_epi16), MMX/SSE2’s amazing and strange (but mostly amazing) dot product operation, has our back, for we can do this:
// a and b have 32-bit lanes with values that fit in int16s.
// produces the 32-bit result
//   res[i] = a[i] * b[i]

// clears high 16 bits of every 32-bit lane
__m128i bm = _mm_and_si128(b, _mm_set1_epi32(0xffff));
// after this, madd_epi16 does what we want!
__m128i res = _mm_madd_epi16(a, bm);
// can swap role of a and b above too, when convenient.
That’s a lot shorter than narrowing to 16-bit first would be! Alas, it only works for int16 (signed). What if we’re working in 32-bit lanes with values that fit inside a uint16 (unsigned)? It’s not quite as slick, but still, better than narrowing to 16-bit first or dealing with the logistics when synthesizing 32×32→32-bit muls from PMULDQ / PMULUDQ:
// a and b have 32-bit lanes with values that fit in uint16s,
// i.e. a[i] == (uint16)a[i] and same for b[i].
//
// produces the 32-bit result
//   res[i] = a[i] * b[i]

// compute low and high 16-bit products
__m128i lop = _mm_mullo_epi16(a, b);
__m128i hip = _mm_mulhi_epu16(a, b);
// merge results
__m128i res = _mm_or_si128(lop, _mm_slli_epi32(hip, 16));
Horizontal adds, dot products etc. (float)
SSE3 adds the horizontal adds HADDPS (_mm_hadd_ps) and HADDPD (_mm_hadd_pd), and SSE4.1 throws in the dot-product instructions DPPS (_mm_dp_ps) and DPPD (_mm_dp_pd).
Generally, don’t expect these operations to be magic. They exist in the instruction set but are fast precisely nowhere; in all x86 implementations I’m familiar with, they just turn into a canned sequence of more basic (SSE2-level) operations. So more often than not, you will end up requiring a higher minimum CPU target for little to no speed gain. Caveat: these instructions are smaller than their replacement instruction sequence, so using them can reduce code size slightly. But still, don’t expect this to be fast.
If you want good SIMD performance, don’t lean on horizontal and dot-product style operations; process data in batches (not just one vec4 at a time) and transpose on input, or use a SoA layout to begin with.
The other kind of horizontal adds, dot products etc. (integer)
SSE does have a bunch of horizontal add and dot-product-style operations that don’t suck, but they’re on the integer pipe, and not what you’d expect.
Nope, not PHADDW / PHADDD (_mm_hadd_epi16 / _mm_hadd_epi32). These are SSSE3 and later only, and OK but not particularly great (similar disclaimer as for the floating-point horizontal adds applies).
No, I’m talking about PMADDWD (_mm_madd_epi16, SSE2 with its ancestor around since the original MMX), PSADBW (_mm_sad_epu8, SSE2) and PMADDUBSW (_mm_maddubs_epi16, SSSE3). The official manuals describe what these instructions do, so I won’t bother going into too much detail, but here’s the basic idea: PMADDWD and PMADDUBSW are 2-element dot-product instructions between pairs of adjacent SIMD lanes. PMADDWD computes two int16 by int16 products for each pair of 16-bit lanes and sums the 32-bit integer products to yield the 32-bit result lanes. PMADDUBSW computes two uint8 by int8 products for each pair of 8-bit lanes and sums the 16-bit integer products to yield the 16-bit result lanes. These can be used to compute dot products; but they also have “degenerate” configurations that are very useful:

- _mm_madd_epi16(x, _mm_set1_epi16(1)) sums the 16-bit even and odd lanes of x in pairs to yield 32-bit results.
- _mm_maddubs_epi16(_mm_unpacklo_epi8(a, b), _mm_setr_epi8(1, -1, 1, -1, ..., 1, -1)) happens to be the fastest way to compute the 16-bit signed differences between 8-bit unsigned vectors a and b on processors that support SSSE3.
- The 16-bit multiply example above shows another special configuration.
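The pairwise-sum configuration is worth seeing concretely; a small sketch (demo function name is mine):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* pairwise horizontal add of 16-bit lanes into 32-bit sums via PMADDWD:
   multiplying by 1 makes the "dot product" a plain pair sum. */
static int demo_madd_pairsum(void)
{
    __m128i x = _mm_setr_epi16(1, 2, 3, 4, -5, 6, 30000, 30000);
    __m128i s = _mm_madd_epi16(x, _mm_set1_epi16(1));
    int32_t out[4];
    memcpy(out, &s, sizeof(out));
    /* sums accumulate at 32 bits, so 30000 + 30000 = 60000 doesn't wrap */
    return out[0] == 3 && out[1] == 7 && out[2] == 1 && out[3] == 60000;
}
```

Note that the pair sums land in 32-bit lanes, so unlike a plain 16-bit add, they can’t overflow.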
Long story short, these dot product instructions are surprisingly versatile in decidedly nonobvious ways.
Finally, PSADBW (_mm_sad_epu8, SSE2). This one is intended for motion estimation in video codecs, but it also happens to be the one actually really fast horizontal add you get on x86. In particular, _mm_sad_epu8(x, _mm_setzero_si128()) computes two 16-bit horizontal sums of groups of 8 uint8 lanes in a single, and quite fast, operation. We can do the same trick we did for compares in reverse to compute the sum of 8 int8s instead: add (or subtract, or XOR) _mm_set1_epi8(-128) to x (before the PSADBW), then subtract 128×8 = 1024 from each of the resulting 16-bit sums.
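Here’s the zero-reference form in action, summing all 16 bytes of a vector (the demo function is my own wrapping):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <string.h>

/* sum of all 16 uint8 lanes via PSADBW: one partial sum lands in the
   low 16 bits of each 64-bit half of the result. */
static int demo_sad_sum(void)
{
    uint8_t bytes[16];
    unsigned i, expected = 0;
    for (i = 0; i < 16; i++) {
        bytes[i] = (uint8_t)(i * 7 + 3);
        expected += bytes[i];
    }

    __m128i x, sad;
    memcpy(&x, bytes, sizeof(x));
    sad = _mm_sad_epu8(x, _mm_setzero_si128());

    uint64_t halves[2];
    memcpy(halves, &sad, sizeof(halves));
    /* add the two 8-lane partial sums for the full 16-lane total */
    return (unsigned)(halves[0] + halves[1]) == expected;
}
```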
To be continued!
There’s a lot more of these, but this feels like enough to chew on for a single blog post. So there will be a sequel covering, at least, integer widening/narrowing and variable shifts. Until then!