c++ - Emulating shifts on 32 bytes with AVX

Question

Welcome To Ask or Share your Answers For Others

c++ - Emulating shifts on 32 bytes with AVX

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics.

Much to my disappointment, I discover that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately and zeroes are introduced in between. (This is by contrast with _mm_slli_si128 and _mm_srli_si128 that handle whole SSE registers.)

Can you recommend me a short substitute ?

UPDATE:

_mm256_slli_si256 is efficiently achieved with

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N)

or

_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N)

for shifts larger than 16 bytes.

But the question remains for _mm256_srli_si256.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:08:33+0000

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

_mm256_slli_si256(A, N)

0 < N < 16

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)

N = 16

_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))

16 < N < 32

_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)

_mm256_srli_si256(A, N)

0 < N < 16

_mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)

N = 16

_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))

16 < N < 32

_mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)

Categories

c++ - Emulating shifts on 32 bytes with AVX

c++ - Emulating shifts on 32 bytes with AVX

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

_mm256_slli_si256(A, N)

_mm256_srli_si256(A, N)

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags