For the code below, clang 6.0 and 11.0 have a subtle difference in their compiled assembly.
#include <stdint.h>
#define SIZE (1L << 16)
void test(uint8_t * restrict a, uint8_t * restrict b) {
    uint64_t i;
    for (i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}
When I compile with the argument -O1 in clang 6.0, I get the following output:
test:                                   # @test
        mov     rax, -65536
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movzx   ecx, byte ptr [rsi + rax + 65536]
        add     byte ptr [rdi + rax + 65536], cl
        add     rax, 1
        jne     .LBB0_1
        ret
Notice that the compiler changes the loop from a '0 to 65536' index to '-65536 to 0'. I thought this was very clever, because it makes use of the fact that add in assembly will set the ZF flag if the result is zero, saving an instruction. Unfortunately, when I compile the same code with the same arguments in clang 11.0, I get the following code:
test:                                   # @test
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movzx   ecx, byte ptr [rsi + rax]
        add     byte ptr [rdi + rax], cl
        add     rax, 1
        cmp     rax, 65536
        jne     .LBB0_1
        ret
Notice that this time it keeps the '0 to 65536' index and adds a cmp instruction at the end of each loop iteration. Also, while this is a specific example, the behavior is not unique to the code I wrote. It persists with -O3 and vectorization enabled as well.
What gives? Was the original optimization not actually effective? Did processors change to obviate the trick?
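To make the two loop shapes concrete, here is a rough source-level sketch of what I believe each listing corresponds to. The function names test_count_up and test_count_to_zero are my own, and the const qualifier on b is my addition. This only illustrates the index transformation; a given compiler is not guaranteed to emit either assembly listing from either function.

```c
#include <stddef.h>
#include <stdint.h>

#define SIZE (1L << 16)

/* Shape of the clang 11.0 output: the index counts from 0 up to SIZE,
   so every iteration needs an explicit compare against SIZE. */
void test_count_up(uint8_t *restrict a, const uint8_t *restrict b) {
    for (uint64_t i = 0; i < SIZE; i++) {
        a[i] += b[i];
    }
}

/* Shape of the clang 6.0 output: both pointers are biased to one past
   the end of the arrays and a negative index counts up to zero, so the
   add that increments the index also yields the exit condition. */
void test_count_to_zero(uint8_t *restrict a, const uint8_t *restrict b) {
    uint8_t *a_end = a + SIZE;        /* one-past-the-end pointer */
    const uint8_t *b_end = b + SIZE;
    for (int64_t i = -SIZE; i != 0; i++) {
        a_end[i] += b_end[i];         /* same element as a[i + SIZE] */
    }
}
```

Both functions should leave a with identical contents for the same inputs; the only difference is how the loop counter is expressed.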
question from:
https://stackoverflow.com/questions/65891676/why-does-the-clang-6-0-compiler-optimize-by-starting-indexes-at-n-and-counting