TL;DR: Loads from `_mm_load_*` intrinsics can be folded (at compile time) into memory operands for other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for specifically-aligned load/store instructions like `vmovdqa`.
In the legacy SSE encoding of vector instructions (like `pxor xmm0, [src1]`), unaligned 128-bit memory operands will fault, except with the special unaligned load/store instructions (like `movdqu` / `movups`).
The VEX encoding of vector instructions (like `vpxor xmm1, xmm0, [src1]`) doesn't fault on unaligned memory, except with the alignment-required load/store instructions (like `vmovdqa` or `vmovntdq`).
The `_mm_loadu_si128` vs. `_mm_load_si128` (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction (or anything at all, if it already has the data in a register, just like dereferencing a scalar pointer).
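For example, here is a minimal sketch (the function name is made up) where an optimized build typically emits no load instruction at all for the `_mm_load_si128`, because the compiler already has the value in a register:

```c
#include <immintrin.h>

// Hypothetical example: the compiler can see that *p still holds v,
// so the reload usually compiles to nothing; it just reuses v's register.
__m128i store_then_reload(__m128i *p, __m128i v) {
    _mm_store_si128(p, v);                 // aligned 16-byte store
    __m128i reloaded = _mm_load_si128(p);  // typically no movdqa in an optimized build
    return _mm_add_epi32(reloaded, reloaded);
}
```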
The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code density, and also means fewer uops for parts of the CPU to track, thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at `-O0`, so an unoptimized build of your code probably would have faulted with unaligned `src1`.
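As a concrete sketch (the function name is made up): when compiling for AVX, the alignment check implied by `_mm_load_si128` can disappear once the load is folded.

```c
#include <immintrin.h>

// With gcc -O0 -mavx, both loads typically become stand-alone vmovdqa
// instructions, which fault if either pointer is misaligned.  With -O2 -mavx,
// one load is usually folded into the vpxor memory operand, so only the other
// pointer's alignment is actually enforced at run time.
__m128i xor_block(const __m128i *a, const __m128i *b) {
    __m128i va = _mm_load_si128(a);   // promises a is 16-byte aligned
    __m128i vb = _mm_load_si128(b);   // promises b is 16-byte aligned
    return _mm_xor_si128(va, vb);
}
```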
(Conversely, this means `_mm_loadu_*` can only fold into a memory operand with AVX, but not with SSE. So even on CPUs where `movdqu` is as fast as `movdqa` when the pointer does happen to be aligned, `_mm_loadu` can hurt performance, because `movdqu xmm1, [rsi]` / `pxor xmm0, xmm1` is 2 fused-domain uops for the front-end to issue, while `pxor xmm0, [rsi]` is only 1, and doesn't need a scratch register. See also Micro fusion and addressing modes.)
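A sketch of that difference (function name made up), with the asm gcc typically emits at `-O2` shown in comments:

```c
#include <immintrin.h>

__m128i xor_unaligned(const __m128i *a, __m128i x) {
    __m128i va = _mm_loadu_si128(a);   // no alignment promise
    return _mm_xor_si128(va, x);
}
// SSE target:  movdqu xmm1, [rdi]  /  pxor xmm0, xmm1   (2 fused-domain uops, needs a scratch reg)
// AVX target:  vpxor  xmm0, xmm0, [rdi]                 (1 fused-domain uop, load folded)
```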
The interpretation of the as-if rule in this case is that it's ok for the program to not fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build).
This is opposite from the rules for floating-point exceptions, where the compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.
Note that since stores can't fold into memory operands for ALU instructions, `store` (not `storeu`) intrinsics will compile into code that faults with unaligned pointers, even when compiling for an AVX target.
To be specific: consider this code fragment:
// aligned version:
y = ...; // assume it's in xmm1
x = _mm_load_si128(Aptr); // Aligned pointer
res = _mm_or_si128(y, x);
// unaligned version: the same thing with _mm_loadu_si128(Uptr)
When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into `por xmm1, [Aptr]`, but the unaligned version has to use `movdqu xmm0, [Uptr]` / `por xmm0, xmm1`. The aligned version might do that too, if the old value of `y` is still needed after the OR.
When targeting AVX (gcc `-mavx`, or gcc `-march=sandybridge` or later), all vector instructions emitted (including 128-bit ones) will use the VEX encoding, so you get different asm from the same `_mm_...` intrinsics. Both versions can compile into `vpor xmm0, xmm1, [ptr]`. (And the 3-operand non-destructive feature means this actually happens, except when the original value loaded is used multiple times.)
Only one operand to an ALU instruction can be a memory operand, so in your case one value has to be loaded separately. Your code faults when the first pointer isn't aligned but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with `vmovdqa` and fold the second, rather than vice versa.
You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately, gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in `al` and then tests it, instead of just branching on the flags from `vptest`. :( clang-3.8 does a significantly better job.
.L36:
add rdi, 32
add rsi, 32
cmp rdi, rcx
je .L9
.L10:
vmovdqa xmm0, XMMWORD PTR [rdi]    # first arg: stand-alone loads that will fault on unaligned
xor eax, eax
vpxor xmm1, xmm0, XMMWORD PTR [rsi]    # second arg: folded loads that don't care about alignment
vmovdqa xmm0, XMMWORD PTR [rdi+16] # first arg
vpxor xmm0, xmm0, XMMWORD PTR [rsi+16] # second arg
vpor xmm0, xmm1, xmm0
vptest xmm0, xmm0
sete al # generate a boolean in a reg
test eax, eax
jne .L36 # then test&branch on it. /facepalm
Note that your `is_equal` is `memcmp`. I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and others, which handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.
The compiler-generated byte loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter if you re-compare some bytes you've already checked.
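A minimal sketch of that tail handling, assuming `size >= 16` and SSE4.1 for `ptest` (names made up):

```c
#include <immintrin.h>   // compile with -msse4.1 or -mavx
#include <stdbool.h>
#include <stddef.h>

// Compare the final 16 bytes with one unaligned vector that ends exactly at
// the last byte, instead of a scalar byte loop.  Bytes that were already
// compared by the main loop get compared again, which is harmless.
static bool tail_equal(const char *a, const char *b, size_t size) {
    __m128i va = _mm_loadu_si128((const __m128i *)(a + size - 16));
    __m128i vb = _mm_loadu_si128((const __m128i *)(b + size - 16));
    __m128i diff = _mm_xor_si128(va, vb);
    return _mm_testz_si128(diff, diff);   // ptest: returns 1 if diff is all zeros
}
```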
Anyway, definitely benchmark your code against the system `memcmp`. Besides the library implementation, gcc knows what `memcmp` does and has its own builtin definition that it can inline code for.