I think you have an endian issue with how you're printing your input and output.
The left-most bytes within each 64-bit half are the least-significant bytes in your actual output, so `0xfe << 4` becomes `0xe0`, with the `f` shifting into a higher byte.
See "Convention for displaying vector registers" for more discussion of that.
Your "expected" output matches what you'd get if you were printing values high element first (highest address when stored).
But that's not what you're doing; you're printing each byte separately in ascending memory order.
x86 is little-endian. This conflicts with the numeral system we use in English, where we read Arabic numerals from left to right with the highest place-value on the left: effectively "human big-endian".
Fun fact: Arabic is read from right to left, so for Arabic readers written numbers are "human little-endian".
(And across elements, higher elements are at higher addresses; printing high elements first makes whole-vector shifts like `_mm_bslli_si128` (aka `pslldq`) make sense in the way they shift bytes left between elements.)
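To make the two print orders concrete, here is a minimal sketch, assuming the input has `0xfe` in the low byte of each 64-bit half (a guess at your data, not taken from your code). The helper names `hex_memory_order`, `hex_high_first` and `shifted_example` are made up for illustration.

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>
#include <stdio.h>

// Format the 16 bytes of v in ascending memory order (lowest address first).
// out must have room for 32 hex digits plus the terminating NUL.
static void hex_memory_order(__m128i v, char out[33]) {
    uint8_t b[16];
    _mm_storeu_si128((__m128i *)b, v);
    for (int i = 0; i < 16; i++)
        snprintf(out + 2 * i, 3, "%02x", b[i]);
}

// Format v with the highest-address byte first, so each element reads
// like a normal hex number with the most-significant digit on the left.
static void hex_high_first(__m128i v, char out[33]) {
    uint8_t b[16];
    _mm_storeu_si128((__m128i *)b, v);
    for (int i = 0; i < 16; i++)
        snprintf(out + 2 * i, 3, "%02x", b[15 - i]);
}

// Assumed example input: 0xfe in each 64-bit element, shifted left by 4 bits.
static __m128i shifted_example(void) {
    __m128i x = _mm_set1_epi64x(0xfe);
    return _mm_slli_epi64(x, 4);  // each element becomes 0x0000000000000fe0
}
```

For `shifted_example()`, `hex_memory_order` produces `e00f000000000000e00f000000000000` while `hex_high_first` produces `0000000000000fe00000000000000fe0`. Same vector, two displays; only the second one looks like a "left" shift.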
If you're using a debugger, you're probably printing within that.
If you're using debug-prints, see "print a __m128i variable".
BTW, you can use `_mm_set1_epi64x(4)` to put the same value in both elements of a vector, instead of using separate `l` and `r` variables with the same value.
In `_mm_set` intrinsics, the high elements come first, matching the diagrams in Intel's asm manuals, and matching the semantic meaning of "left" shift moving bits/bytes to the left.
(e.g. see Intel's diagrams and element-numbering for `pshufd` / `_mm_shuffle_epi32`.)
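A small sketch of that argument order, in case it helps: the values `0x1111` and `0x2222` and the helper names `element64`, `high_then_low` and `both_four` are arbitrary choices for illustration, not anything from your code.

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Store v to memory and return 64-bit element idx (0 = lowest address).
static int64_t element64(__m128i v, int idx) {
    int64_t e[2];
    _mm_storeu_si128((__m128i *)e, v);
    return e[idx];
}

// _mm_set_epi64x takes the HIGH element first, then the low element.
static __m128i high_then_low(void) {
    return _mm_set_epi64x(0x1111, 0x2222);
}

// _mm_set1_epi64x broadcasts one value to both elements,
// so there is no need for separate variables holding the same count.
static __m128i both_four(void) {
    return _mm_set1_epi64x(4);
}
```

Here `element64(high_then_low(), 0)` is `0x2222` (the second argument lands at the lowest address) and `element64(high_then_low(), 1)` is `0x1111`. Both elements of `both_four()` are 4.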
BTW, AVX512 has `vprolvq` rotates.
But yes, to emulate rotates you want a SIMD version of `(x << n) | x >> (64-n)`.
Note that x86 SIMD shifts saturate the shift count, unlike scalar shifts which mask the count. So `x >> 64` will shift out all the bits. If you want to support rotate counts above 63, you probably need to mask the count yourself.
(See "Best practices for circular shift (rotate) operations in C++", but you're using intrinsics so you don't have to worry about C shift-count UB, just the actual known hardware behaviour.)
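Putting the rotate pieces together, one possible SSE2 sketch looks like this. The names `rotl_epi64` and `lane_u64` are hypothetical, and it assumes the same rotate count for both elements, with the count taken modulo 64.

```c
#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

// Rotate each 64-bit element of x left by n bits (n is reduced mod 64).
static __m128i rotl_epi64(__m128i x, unsigned n) {
    n &= 63;  // SIMD shifts saturate instead of wrapping, so mask explicitly
    __m128i left  = _mm_cvtsi32_si128((int)n);
    __m128i right = _mm_cvtsi32_si128((int)(64 - n));
    // When n == 0 the right-shift count is 64, which shifts out every bit,
    // so the OR leaves x unchanged. No special case is needed.
    return _mm_or_si128(_mm_sll_epi64(x, left), _mm_srl_epi64(x, right));
}

// Store v to memory and return 64-bit element idx (0 = lowest address).
static uint64_t lane_u64(__m128i v, int idx) {
    uint64_t e[2];
    _mm_storeu_si128((__m128i *)e, v);
    return e[idx];
}
```

With every element set to `0x8000000000000001`, `rotl_epi64(x, 4)` gives `0x18` in each element, and a count of 68 gives the same result because of the mask. The saturating behaviour is what makes the `n == 0` case safe here, whereas the scalar C expression would need care for a shift by 64.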