With help of PSHUFB we can get a vector that contains population count for 16 nibbles. To get a vector of population count for each 16 byte, instruction PSHUFB have to be called twice on vectors of lower and higher nibbles, and finally added together.
Following code shows the idea:
; xmm0 - input (16 bytes) ; xmm7 - POPCOUNT_4bit -- lookup table ; xmm6 - MASK_bits03 = packed_byte(0x0f) -- mask 4 lower bits movdqa %%xmm0, %%xmm1 psrlw $4, %%xmm1 pand %%xmm6, %%xmm0 ; xmm0 - lower nibbles pand %%xmm6, %%xmm1 ; xmm1 - higher nibbles movdqa %%xmm7, %%xmm2 ; since instruction pshufb modifies LUT movdqa %%xmm7, %%xmm3 ; it must be saved for further use pshufb %%xmm0, %%xmm2 ; xmm2 = vector of popcount for lower nibbles pshufb %%xmm1, %%xmm3 ; xmm3 = vector of popcount for higher nibbles paddb %%xmm3, %%xmm2 ; xmm2 += xmm3 -- vector of popcount for bytes
The last step is adding all bytes from vector.
Instruction PSADBW calculate sum of absolute differences of unsigned bytes — if the first arguments is full of zeros, then result is a sum of bytes from second argument. Unfortunately PSADBW invoked with 128-bits arguments calculate separate sums for bytes 0..7 and 8..15, and finally stores them in the lower and the higher quad words. Because of that few additional instructions are needed:
pxor %%xmm0, %%xmm0 ; xmm0 = packed_byte(0x00) psadbw %%xmm0, %%xmm3 ; xmm3 = [popcount of bytes 0..7 | popcount of bytes 8..15] movhlps %%xmm3, %%xmm0 ; xmm0 = [ 0 | popcount of bytes 0..7 ] paddd %%xmm3, %%xmm0 ; xmm0 = [ not needed | popcount of bytes 0..15]
转载本站任何文章请注明：转载至神刀安全网，谢谢神刀安全网 » AVX2 faster than native popcnt instruction on Haswell/Skylake