c++ - Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Question

Welcome To Ask or Share your Answers For Others

c++ - Crash with icc: can the compiler invent writes where none existed in the abstract machine?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Consider the following simple program:

#include <cstring>
#include <cstdio>
#include <cstdlib>

void replace(char *str, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str[i] == '/') {
            str[i] = '_';
        }
    }
}

const char *global_str = "the quick brown fox jumps over the lazy dog";

int main(int argc, char **argv) {
  const char *str = argc > 1 ? argv[1] : global_str;
  replace(const_cast<char *>(str), std::strlen(str));
  puts(str);
  return EXIT_SUCCESS;
}

It takes an (optional) string on the command line and prints it, with / characters replaced by _. This replacement functionality is implemented by the c_repl function¹. For example, a.out foo/bar prints:

foo_bar

Elementary stuff so far, right?

If you don't specify a string, it conveniently uses the global string the quick brown fox jumps over the lazy dog, which doesn't contain any / characters, and so doesn't undergo any replacement.

Of course, string constants are const char[], so I need to cast away the constness first - that's the const_cast you see. Since the string is never actually modified, I am under the impression this is legal.

gcc and clang compile a binary that has the expected behavior, with or without passing a string on the command line. icc crashes, when you don't provide a string, however:

icc -xcore-avx2 char_replace.cpp && ./a.out
Segmentation fault (core dumped)

The underlying cause is the main loop for c_repl which looks like this:

  400c0c:       vmovdqu ymm2,YMMWORD PTR [rsi]
  400c10:       add    rbx,0x20
  400c14:       vpcmpeqb ymm3,ymm0,ymm2
  400c18:       vpblendvb ymm4,ymm2,ymm1,ymm3
  400c1e:       vmovdqu YMMWORD PTR [rsi],ymm4
  400c22:       add    rsi,0x20
  400c26:       cmp    rbx,rcx
  400c29:       jb     400c0c <main+0xfc>

It is a vectorized loop. The basic idea is that 32 bytes are loaded, and then compared against the / character, forming a mask value with a byte set for each byte that matched, and then the existing string is blended against a vector containing 32 _ characters, effectively replacing only the / characters. Finally, the updated register is written back to the string, with the vmovdqu YMMWORD PTR [rsi],ymm4 instruction.

This final store crashes, because the string is read-only and allocated in the .rodata section of the binary, which is loaded using read-only pages. Of course, the store was a logical "no op", writing back the same characters it read, but the CPU doesn't care!

Is my code legal C++ and therefore I should blame icc for miscompiling this, or I am wading into the UB swamp somewhere?

¹ The same crash from the same issue occurs with std::replace on a std::string rather than my "C-like" code, but I wanted to simplify the analysis as much as possible and make it entirely self-contained.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:14:52+0000

Your program is well-formed and free of undefined behaviour, as far as I can tell. The C++ abstract machine never actually assigns to a const object. A not-taken if() is sufficient to "hide" / "protect" things that would be UB if they executed. The only thing an if(false) can't save you from is an ill-formed program, e.g. syntax errors or trying to use extensions that don't exist on this compiler or target arch.

Compilers aren't in general allowed to invent writes with if-conversion to branchless code.

Casting away const is legal, as long as you don't actually assign through it. e.g. for passing a pointer to a function that isn't const-correct, and takes a read-only input with a non-const pointer. The answer you linked on Is it allowed to cast away const on a const-defined object as long as it is not actually modified? is correct.

ICC's behaviour here is not evidence for UB in ISO C++ or C. I think your reasoning is sound, and this is well-defined. You've found an ICC bug. If anyone cares, report it on their forums: https://software.intel.com/en-us/forums/intel-c-compiler. Existing bug reports in that section of their forum have been accepted by developers, e.g. this one.

We can construct an example where it auto-vectorizes the same way (with unconditional and non-atomic read/maybe-modify/rewrite) where it's clearly illegal, because the read / rewrite is happening on a 2nd string that the C abstract machine doesn't even read.

Thus, we can't trust ICC's code-gen to tell us anything about when we've caused UB, because it will make crashing code even in clearly legal cases.

Godbolt: ICC19.0.1 -O2 -march=skylake (Older ICC only understood options like -xcore-avx2, but modern ICC understands the same -march as GCC/clang.)

#include <stddef.h>

void replace(const char *str1, char *str2, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (str1[i] == '/') {
            str2[i] = '_';
        }
    }
}

It checks for overlap between str1[0..len-1] and str2[0..len-1], but for large enough len and no overlap it will use this inner loop:

..B1.15:                        # Preds ..B1.15 ..B1.14                //do{
        vmovdqu   ymm2, YMMWORD PTR [rsi+r8]                    #6.13   // load from str2
        vpcmpeqb  ymm3, ymm0, YMMWORD PTR [rdi+r8]              #5.24   // compare vs. str1
        vpblendvb ymm4, ymm2, ymm1, ymm3                        #6.13   // blend
        vmovdqu   YMMWORD PTR [r8+rsi], ymm4                    #6.13   // store to str2
        add       r8, 32                                        #4.5    // i+=32
        cmp       r8, rax                                       #4.5
        jb        ..B1.15       # Prob 82%                      #4.5   // }while(i<len);

For thread-safety, it's well known that inventing write via non-atomic read/rewrite is unsafe.

The C++ abstract machine never touches str2 at all, so that invalidates any arguments for the one-string version about data-race UB being impossible because reading str at the same time another thread is writing it was already UB. Even C++20 std::atomic_ref doesn't change that, because we're reading through a non-atomic pointer.

But even worse than that, str2 can be nullptr. Or pointing to close to the end of an object (which happens to be stored near the end of a page), with str1 containing chars such that no writes past the end of str2 / the page will happen. We could even arrange for only the very last byte (str2[len-1]) to be in a new page, so that it's one-past-the-end of a valid object. It's even legal to construct such a pointer (as long as you don't deref). But it would be legal to pass str2=nullptr; code behind an if() that doesn't run doesn't cause UB.

Or another thread is running the same search/replace function in parallel, with a different key/replacement that will only write different elements of str2. The non-atomic load/store of unmodified values will step on modified values from the other thread. It's definitely allowed, according to the C++11 memory model, for different threads to simultaneously touch different elements of the same array. C++ memory model and race conditions on char arrays. (This is why char must be as large as the smallest unit of memory the target machine can write without a non-atomic RMW. An internal atomic RMW for a byte stores into cache is fine, though, and doesn't stop byte-store instructions from being useful.)

(This example is only legal with the separate str1/str2 version, because reading every element means the threads would be reading array elements the other thread could be in the middle of writing, which is data-race UB.)

As Herb Sutter mentioned in atomic<> Weapons: The C++ Memory Model and Modern Hardware Part 2: Restrictions on compilers and hardware (incl. common bugs); code generation and performance on x86/x64, IA64, POWER, ARM, and more; relaxed atomics; volatile: weeding out non-atomic RMW code-gen has been an ongoing issue for compilers after C++11 was standardized. We're most of the way there, but highly-aggressive and less-mainstream compilers like ICC clearly still have bugs.

(However, I'm pretty confident that Intel compiler devs would consider this a bug.)

Some less-plausible (to see in a real program) examples that this would also break:

Besides nullptr, you could pass a pointer to (an array of) std::atomic<T> or a mutex where a non-atomic read/rewrite breaks things by inventing writes. (char* can alias anything).

Or str2 points to a buffer that you've carved up for dynamic allocation, and the early part of str1 will have some matches, but later parts of str1 won't have any matches, and that part of str2 is being used by other threads. (And for some reason you can't easily calculate a length that stops the loop short).

For future readers: If you want to let compilers auto-vectorize this way:

You can write source like str2[i] = x ? replacement : str2[i]; that always writes the string in the C++ abstract machine. IIRC, that lets gcc/clang vectorize the way ICC does after doing its unsafe if-conversion to blend.

In theory an optimizing compiler can turn it back into a conditional branch in the scalar cleanup or whatever to avoid dirtying memory unnecessarily. (Or if targeting an ISA like ARM32 where a predicated store is possible, instead of only ALU select operations like x86 cmov, PowerPC isel, or AArch64 csel. ARM32 predicated instructions are architecturally a NOP if the predicate is false).

Or if an x86 compiler chose to use AVX512 masked stores, that would also make it safe to vectorize the way ICC does: masked stores do fault suppression, and never actually store to elements where the mask is false. (When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?).

vpcmpeqb k1, zmm0, [rdi]   ; compare from memory into mask
vmovdqu8 [rsi]{k1}, zmm1   ; masked store that only writes elements where the mask is true

ICC19 actually does basically this (but with indexed addressing modes) with -march=skylake-avx512. But with ymm vectors because 512-bit lowers max turbo too much to be worth it unless your whole program is heavily using AVX512, on Skylake Xeons anyway.

So I think ICC19 is safe when vectorizing this with AVX512, but not AVX2. Unless there are problems in its cleanup code where it does something more complicated with vpcmpuq and kshift / kor, a zero-masked load, and a masked compare into another mask reg.

AVX1 has masked stores (vmaskmovps/pd) with fault-suppression and everything, but until AVX512BW there's no granularity narrower than 32 bits. The AVX2 integer versions are only available in dword/qword granularity, vpmaskmovd/q.

Categories

c++ - Crash with icc: can the compiler invent writes where none existed in the abstract machine?

c++ - Crash with icc: can the compiler invent writes where none existed in the abstract machine?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags