AVX2 faster than native popcnt instruction on Haswell/Skylake

Vector algorithm

With help of PSHUFB we can get a vector that contains population count for 16 nibbles. To get a vector of population count for each 16 byte, instruction PSHUFB have to be called twice on vectors of lower and higher nibbles, and finally added together.

Following code shows the idea:

`; xmm0 - input (16 bytes) ; xmm7 - POPCOUNT_4bit  -- lookup table ; xmm6 - MASK_bits03 = packed_byte(0x0f) -- mask 4 lower bits  movdqa  %%xmm0, %%xmm1 psrlw       \$4, %%xmm1  pand    %%xmm6, %%xmm0  ; xmm0 - lower nibbles pand    %%xmm6, %%xmm1  ; xmm1 - higher nibbles  movdqa  %%xmm7, %%xmm2  ; since instruction pshufb modifies LUT movdqa  %%xmm7, %%xmm3  ; it must be saved for further use  pshufb  %%xmm0, %%xmm2  ; xmm2 = vector of popcount for lower nibbles pshufb  %%xmm1, %%xmm3  ; xmm3 = vector of popcount for higher nibbles  paddb   %%xmm3, %%xmm2  ; xmm2 += xmm3 -- vector of popcount for bytes`

The last step is adding all bytes from vector.

Instruction PSADBW calculate sum of absolute differences of unsigned bytes — if the first arguments is full of zeros, then result is a sum of bytes from second argument. Unfortunately PSADBW invoked with 128-bits arguments calculate separate sums for bytes 0..7 and 8..15, and finally stores them in the lower and the higher quad words. Because of that few additional instructions are needed:

`pxor    %%xmm0, %%xmm0  ; xmm0 = packed_byte(0x00) psadbw  %%xmm0, %%xmm3  ; xmm3 = [popcount of bytes 0..7 | popcount of bytes 8..15] movhlps %%xmm3, %%xmm0  ; xmm0 = [         0             | popcount of bytes 0..7 ] paddd   %%xmm3, %%xmm0  ; xmm0 = [     not needed        | popcount of bytes 0..15]`