Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const
object. A not-taken if()
is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false)
can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.
Compilers aren't in general allowed to invent writes with if-conversion to branchless code.
Casting away const
is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const
pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.
ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.
We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.
Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.
Godbolt: ICC19.0.1 -O2 -march=skylake
(Older ICC only understood options like -xcore-avx2
, but modern ICC understands the same -march
as GCC/clang.)
#include <stddef.h>
void replace(const char *str1, char *str2, size_t len) {
for (size_t i = 0; i < len; i++) {
if (str1[i] == '/') {
str2[i] = '_';
}
}
}
It checks for overlap between str1[0..len-1]
and str2[0..len-1]
, but for large enough len
and no overlap it will use this inner loop:
..B1.15: # Preds ..B1.15 ..B1.14 //do{
vmovdqu ymm2, YMMWORD PTR [rsi+r8] #6.13 // load from str2
vpcmpeqb ymm3, ymm0, YMMWORD PTR [rdi+r8] #5.24 // compare vs. str1
vpblendvb ymm4, ymm2, ymm1, ymm3 #6.13 // blend
vmovdqu YMMWORD PTR [r8+rsi], ymm4 #6.13 // store to str2
add r8, 32 #4.5 // i+=32
cmp r8, rax #4.5
jb ..B1.15 # Prob 82% #4.5 // }while(i<len);
For thread-safety, it's well known that inventing write via non-atomic read/rewrite is unsafe.
The C++ abstract machine never touches str2
at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str
at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref
doesn't change that, because we're reading through a non-atomic pointer.
But even worse than that, str2
can be nullptr
. Or pointing to close to the end of an object (which happens to be stored near the end of a page), with str1
containing chars such that no writes past the end of str2
/ the page will happen. We could even arrange for only the very last byte (str2[len-1]
) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref). But it would be legal to pass str2=nullptr
; code behind an if()
that doesn't run doesn't cause UB.
Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2
. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char
must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for a byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)
(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)
As Herb Sutter mentioned in atomic<>
Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.
(However, I'm pretty confident that Intel compiler devs would consider this a bug.)
Some less-plausible (to see in a real program) examples that this would also break:
Besides nullptr
, you could pass a pointer to (an array of) std::atomic<T>
or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char*
can alias anything).
Or str2
points to a buffer that you've carved up for dynamic allocation, and the early part of str1
will have some matches, but later parts of str1
won't have any matches, and that part of str2
is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).
For future readers: If you want to let compilers auto-vectorize this way:
You can write source like str2[i] = x ? replacement : str2[i];
that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.
In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov
, PowerPC isel
, or AArch64 csel
. ARM32 predicated instructions are architecturally a NOP if the predicate is false).
Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).
vpcmpeqb k1, zmm0, [rdi] ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1 ; masked store that only writes elements where the mask is true
ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512
. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.
So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq
and kshift
/ kor
, a zero-masked load, and a masked compare into another mask reg.
AVX1 has masked stores (vmaskmovps/pd
) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q
.