Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
378 views
in Technique by (71.8m points)

c - Keep only the 10 useful bits in 16-bit words

c - Keep only the 10 useful bits in 16-bit words

I have __m256i vectors that contain 10-bit words inside 16-bit integers (so 16 × 16-bit lanes holding only 16 × 10 useful bits). What is the best/fastest way to extract only those 10 bits and pack them to produce an output bitstream of 10-bit values?
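For reference, here is a plain scalar version of the packing being asked for — a sketch with a hypothetical helper name, assuming the output bitstream is little-endian (the first value occupies the lowest bits of the first bytes):

```c
#include <assert.h>
#include <stdint.h>

// Scalar reference: pack 16 values of 10 bits each (stored in uint16_t lanes)
// into a dense 20-byte little-endian bitstream.
static void pack_10x16_scalar( const uint16_t* src, uint8_t* dst )
{
    uint64_t acc = 0;   // bit accumulator
    int bits = 0;       // number of valid bits currently in the accumulator
    for( int i = 0; i < 16; i++ )
    {
        acc |= (uint64_t)( src[ i ] & 0x3FF ) << bits;
        bits += 10;
        while( bits >= 8 )
        {
            *dst++ = (uint8_t)acc;
            acc >>= 8;
            bits -= 8;
        }
    }
}
```

16 × 10 bits is exactly 160 bits, so the loop ends with an empty accumulator and exactly 20 bytes written; a vectorized routine only needs to match this byte sequence.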



1 Answer

0 votes
by (71.8m points)

Here’s my attempt.

I have not benchmarked it, but it should be fairly fast overall: not too many instructions, and all of them have 1-cycle latency on modern processors. The stores are also efficient: two store instructions for 20 bytes of data.

The code uses only 3 constants. If you call this function in a loop, a good compiler will load all three outside the loop and keep them in registers.

// bitwise blend according to a mask
inline void combineHigh( __m256i& vec, __m256i high, const __m256i lowMask )
{
    vec = _mm256_and_si256( vec, lowMask );
    high = _mm256_andnot_si256( lowMask, high );
    vec = _mm256_or_si256( vec, high );
}
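The helper above is the classic bitwise-select idiom. Its scalar equivalent (hypothetical name, for illustration) makes the semantics obvious: take bits of `vec` where the mask is set, bits of `high` where it is clear.

```c
#include <assert.h>
#include <stdint.h>

// Scalar equivalent of combineHigh: select bits of `vec` where the mask
// is set, and bits of `high` where it is clear.
static uint32_t combine_high_scalar( uint32_t vec, uint32_t high, uint32_t lowMask )
{
    return ( vec & lowMask ) | ( high & ~lowMask );
}
```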

// Store 10-bit pieces from each of the 16-bit lanes of the AVX2 vector.
// The function writes 20 bytes to the pointer.
inline void store_10x16_avx2( __m256i v, uint8_t* rdi )
{
    // Pack pairs of 10 bits into 20, into 32-bit lanes
    __m256i high = _mm256_srli_epi32( v, 16 - 10 );
    const __m256i low10 = _mm256_set1_epi32( ( 1 << 10 ) - 1 ); // Bitmask of 10 lowest bits in 32-bit lanes
    combineHigh( v, high, low10 );

    // Now the vector contains 32-bit lanes with 20 payload bits / each
    // Pack pairs of 20 bits into 40, into 64-bit lanes
    high = _mm256_srli_epi64( v, 32 - 20 );
    const __m256i low20 = _mm256_set1_epi64x( ( 1 << 20 ) - 1 ); // Bitmask of 20 lowest bits in 64-bit lanes
    combineHigh( v, high, low20 );

    // Now the vector contains 64-bit lanes with 40 payload bits / each
    // 40 bits = 5 bytes, store initial 4 bytes of the result
    _mm_storeu_si32( rdi, _mm256_castsi256_si128( v ) );

    // Shuffle the remaining 16 bytes of payload into correct positions.
    // The indices of the payload bytes are [ 0 .. 4 ] and [ 8 .. 12 ]
    // _mm256_shuffle_epi8 can only move data within 16-byte lanes
    const __m256i shuffleIndices = _mm256_setr_epi8(
        // 6 remaining payload bytes from the lower half of the vector
        4, 8, 9, 10, 11, 12,
        // 10 bytes gap, will be zeros
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        // 6 bytes gap, will be zeros
        -1, -1, -1, -1, -1, -1,
        // 10 payload bytes from the higher half of the vector
        0, 1, 2, 3, 4,
        8, 9, 10, 11, 12
    );
    v = _mm256_shuffle_epi8( v, shuffleIndices );

    // Combine and store the final 16 bytes of payload
    const __m128i low16 = _mm256_castsi256_si128( v );
    const __m128i high16 = _mm256_extracti128_si256( v, 1 );
    const __m128i result = _mm_or_si128( low16, high16 );
    _mm_storeu_si128( ( __m128i* )( rdi + 4 ), result );
}
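The first packing round above can be checked with a small scalar model (hypothetical name): within each 32-bit lane, shifting right by 6 moves the upper 10-bit value down so that the mask/blend leaves `a | ( b << 10 )` in the low 20 bits.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of the first packing round: a 32-bit lane holds two 16-bit
// values with 10 payload bits each; merge them into the low 20 bits.
static uint32_t pack_pair_10( uint32_t lane )
{
    uint32_t high = lane >> ( 16 - 10 );            // upper value moved down by 6 bits
    uint32_t lowMask = ( 1u << 10 ) - 1;
    return ( lane & lowMask ) | ( high & ~lowMask ); // same blend as combineHigh
}
```

The stray bits of the lower value that the shift drags along land inside the low 10 bits, where `~lowMask` discards them, so the result is exact. The second round (20 bits into 40 within 64-bit lanes) works identically with a 20-bit mask and a shift by 12.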

This code truncates the unused higher 6 bits of the values.


If you want to saturate instead, you’ll need one more instruction, _mm256_min_epu16.

Also, if you do that, the first step of the function can use pmaddwd. Here’s the complete function, which saturates the source numbers, with a couple of extra adjustments.

// Store 10-bit pieces from 16-bit lanes of the AVX2 vector, with saturation.
// The function writes 20 bytes to the pointer.
inline void store_10x16_avx2( __m256i v, uint8_t* rdi )
{
    const __m256i low10 = _mm256_set1_epi16( ( 1 << 10 ) - 1 );
#if 0
    // Truncate higher 6 bits; pmaddwd won't truncate, it needs zeroes in the unused higher bits.
    v = _mm256_and_si256( v, low10 );
#else
    // Saturate numbers into the range instead of truncating
    v = _mm256_min_epu16( v, low10 );
#endif

    // Pack pairs of 10 bits into 20, into 32-bit lanes
    // pmaddwd computes a[ 0 ] * b[ 0 ] + a[ 1 ] * b[ 1 ] for pairs of 16-bit lanes, making a single 32-bit number out of two pairs.
    // Initializing multiplier with pairs of [ 1, 2^10 ] to implement bit shifts + packing
    const __m256i multiplier = _mm256_set1_epi32( 1 | ( 1 << ( 10 + 16 ) ) );
    v = _mm256_madd_epi16( v, multiplier );

    // Now the vector contains 32-bit lanes with 20 payload bits / each
    // Pack pairs of 20 bits into 40 in 64-bit lanes
    __m256i low = _mm256_slli_epi32( v, 12 );
    v = _mm256_blend_epi32( v, low, 0b01010101 );
    v = _mm256_srli_epi64( v, 12 );

    // Now the vector contains 64-bit lanes with 40 payload bits / each
    // 40 bits = 5 bytes, store initial 4 bytes of the result
    _mm_storeu_si32( rdi, _mm256_castsi256_si128( v ) );

    // Shuffle the remaining 16 bytes of payload into correct positions.
    const __m256i shuffleIndices = _mm256_setr_epi8(
        // Lower half
        4, 8, 9, 10, 11, 12,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        // Higher half
        -1, -1, -1, -1, -1, -1,
        0, 1, 2, 3, 4,
        8, 9, 10, 11, 12
    );
    v = _mm256_shuffle_epi8( v, shuffleIndices );

    // Combine and store the final 16 bytes of payload
    const __m128i low16 = _mm256_castsi256_si128( v );
    const __m128i high16 = _mm256_extracti128_si256( v, 1 );
    const __m128i result = _mm_or_si128( low16, high16 );
    _mm_storeu_si128( ( __m128i* )( rdi + 4 ), result );
}
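The pmaddwd trick is easy to verify in scalar code. With a multiplier pair of [ 1, 2^10 ] and inputs limited to 10 bits, the two products occupy disjoint bit ranges, so the multiply-add is exactly a shift-and-OR (hypothetical helper name, modeling one 32-bit lane):

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of one pmaddwd lane with multiplier pair [ 1, 1 << 10 ]:
// for inputs that fit in 10 bits, a0 * 1 + a1 * 1024 == a0 | ( a1 << 10 ),
// because the two products never have overlapping set bits.
static uint32_t pmaddwd_pack_lane( uint16_t a0, uint16_t a1 )
{
    return (uint32_t)a0 * 1u + (uint32_t)a1 * ( 1u << 10 );
}
```

This is why the higher 6 bits must be zeroed (or saturated away) first: a value above 0x3FF would carry into the neighbor’s bit range and corrupt the pack.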

This may be slightly faster or slower overall depending on the processor, compiler, and the code calling the function, but it definitely helps with code size. Few people care about binary size anymore, but CPUs have limited L1 instruction and μop caches.


For completeness, here’s another version that uses SSE2 and optionally SSSE3 instead of AVX2; it is only slightly slower in practice.

// Compute v = ( v & lowMask ) | ( high & ( ~lowMask ) ), for 256 bits of data in two registers
inline void combineHigh( __m128i& v1, __m128i& v2, __m128i h1, __m128i h2, const __m128i lowMask )
{
    v1 = _mm_and_si128( v1, lowMask );
    v2 = _mm_and_si128( v2, lowMask );
    h1 = _mm_andnot_si128( lowMask, h1 );
    h2 = _mm_andnot_si128( lowMask, h2 );
    v1 = _mm_or_si128( v1, h1 );
    v2 = _mm_or_si128( v2, h2 );
}

inline void store_10x16_sse( __m128i v1, __m128i v2, uint8_t* rdi )
{
    // Pack pairs of 10 bits into 20, in 32-bit lanes
    __m128i h1 = _mm_srli_epi32( v1, 16 - 10 );
    __m128i h2 = _mm_srli_epi32( v2, 16 - 10 );
    const __m128i low10 = _mm_set1_epi32( ( 1 << 10 ) - 1 );
    combineHigh( v1, v2, h1, h2, low10 );

    // Pack pairs of 20 bits into 40, in 64-bit lanes
    h1 = _mm_srli_epi64( v1, 32 - 20 );
    h2 = _mm_srli_epi64( v2, 32 - 20 );
    const __m128i low20 = _mm_set1_epi64x( ( 1 << 20 ) - 1 );
    combineHigh( v1, v2, h1, h2, low20 );

#if 1
    // 40 bits is 5 bytes; for the final shuffle we use the pshufb instruction from the SSSE3 set.
    // If you don't have SSSE3, the `#else` branch below has an SSE2-only workaround.
    const __m128i shuffleIndices = _mm_setr_epi8(
        0, 1, 2, 3, 4,
        8, 9, 10, 11, 12,
        -1, -1, -1, -1, -1, -1 );
    v1 = _mm_shuffle_epi8( v1, shuffleIndices );
    v2 = _mm_shuffle_epi8( v2, shuffleIndices );
#else
    // SSE2-only version of the above, uses 8 instructions + 2 constants to emulate 2 instructions + 1 constant
    // Need two constants because after this step we want zeros in the unused higher 6 bytes.
    h1 = _mm_srli_si128( v1, 3 );
    h2 = _mm_srli_si128( v2, 3 );
    const __m128i low40 = _mm_setr_epi8( -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 );
    const __m128i high40 = _mm_setr_epi8( 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0 );
    const __m128i l1 = _mm_and_si128( v1, low40 );
    const __m128i l2 = _mm_and_si128( v2, low40 );
    h1 = _mm_and_si128( h1, high40 );
    h2 = _mm_and_si128( h2, high40 );
    v1 = _mm_or_si128( h1, l1 );
    v2 = _mm_or_si128( h2, l2 );
#endif

    // Now the v1 and v2 vectors each contain 10 densely packed payload bytes.
    // Produce the final result: 16 bytes in the low part, 4 bytes in the high part
    __m128i low16 = _mm_or_si128( v1, _mm_slli_si128( v2, 10 ) );
    __m128i high16 = _mm_srli_si128( v2, 6 );
    // Store these 20 bytes with 2 instructions
    _mm_storeu_si128( ( __m128i* )rdi, low16 );
    _mm_storeu_si32( rdi + 16, high16 );
}
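The SSE2-only branch can be sanity-checked with a scalar byte-array model (hypothetical name): shifting the register right by 3 bytes moves the payload at byte offsets [ 8 .. 12 ] to [ 5 .. 9 ], and the two masks then blend it with the original low 5 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Scalar model of the SSE2-only compaction step: the register holds two
// 5-byte payload groups at byte offsets [ 0 .. 4 ] and [ 8 .. 12 ];
// produce 10 dense payload bytes at [ 0 .. 9 ], zeros above.
static void compact_40x2_scalar( const uint8_t reg[ 16 ], uint8_t out[ 16 ] )
{
    uint8_t shifted[ 16 ] = { 0 };
    memcpy( shifted, reg + 3, 13 );     // model of _mm_srli_si128( v, 3 )
    memset( out, 0, 16 );
    memcpy( out, reg, 5 );              // low 5 payload bytes  (low40 mask)
    memcpy( out + 5, shifted + 5, 5 );  // next 5 payload bytes (high40 mask)
}
```

The two masks are needed (rather than one) precisely because the final 16 + 4 byte stores assume the unused upper 6 bytes of each register are zero.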
