cvtsi2ss %edi, %xmm0 merges the converted float into the low element of XMM0, so it has a false dependency on the old value of XMM0. (Across repeated calls to the same function, that creates one long loop-carried dependency chain.)
xor-zeroing breaks the dep chain, allowing out-of-order exec to work its magic. So you bottleneck on addss throughput (0.5 cycles) instead of latency (4 cycles).
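For concreteness, the two patterns look like this (a minimal sketch in AT&T syntax, not your actual compiler output):

    # with dep-breaking: xmm0's old value is irrelevant
    xorps    %xmm0, %xmm0        # recognized zeroing idiom: no input dependency
    cvtsi2ss %edi, %xmm0         # int -> float written into the low element of xmm0

    # without it: cvtsi2ss has to wait for whatever last wrote xmm0
    cvtsi2ss %edi, %xmm0         # merge into xmm0 => false dependency on its old value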
Your CPU is a Skylake derivative so those are the numbers; earlier Intel CPUs have 3-cycle latency and 1/clock throughput, using a dedicated FP-add execution unit instead of running addss on the FMA units. https://agner.org/optimize/. Probably function call/ret overhead prevents you from seeing the full 8x expected speedup from the latency * bandwidth product of 8 in-flight addss uops in the pipelined FMA units; you should see close to that 8x if you compare a loop within a single function with and without the xorps dep-breaking, as sketched below.
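Something like this hypothetical test loop (register choices and loop structure are illustrative, not taken from your code):

    # %ecx = iteration count, %xmm2 = some loop-invariant float
    .Lloop:
        xorps    %xmm0, %xmm0        # delete this line to re-create the false dependency
        cvtsi2ss %ecx, %xmm0         # without the xorps, this waits for last iteration's xmm0
        addss    %xmm2, %xmm0        # the same kind of FP add as in the function body
        dec      %ecx
        jnz      .Lloop

With the xorps, every iteration's convert+add chain is independent and you run into throughput limits; without it, xmm0 carries the cvtsi2ss + addss latency from one iteration to the next.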
GCC tends to be very "careful" about false dependencies, spending extra instructions (front-end bandwidth) to break them just in case. In code that bottlenecks on the front-end (or where total code size / uop-cache footprint is a factor) this costs performance if the register was actually ready in time anyway.
Clang/LLVM is reckless and cavalier about it, typically not bothering to avoid false dependencies on registers not written in the current function (i.e. assuming / pretending that registers are "cold" on function entry). As you show in comments, clang does avoid creating a loop-carried dep chain by xor-zeroing when it's looping inside one function, as opposed to the chain being created via multiple calls to the same function.
Clang even uses 8-bit GP-integer partial registers for no reason in some cases where that doesn't save any code-size or instructions vs. 32-bit regs. Usually it's probably fine, but there's a risk of coupling into a long dep chain or creating a loop-carried dependency chain if the caller (or a sibling function call) still has a cache-miss load in flight to that reg when we're called, for example.
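A hypothetical instance of that pattern (not taken from your code):

    add     %al, %cl        # 8-bit op: the result merges into the low byte of RCX
    add     %eax, %ecx      # 32-bit op: same 2-byte encoding, but writes all of ECX

The 8-bit version is no smaller, yet the write to CL has to merge with (and thus wait for) the old value of RCX on CPUs that don't rename the partial register separately.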
See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for more about how OoO exec can overlap short to medium length independent dep chains. Also related: "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" is about unrolling a dot-product with multiple accumulators to hide FMA latency.
https://www.uops.info/html-instr/CVTSI2SS_XMM_R32.html has performance details for this instruction across various uarches.
You can avoid this if you can use AVX, with vcvtsi2ss %edi, %xmm7, %xmm0 (where xmm7 is any register you haven't written recently, or which is earlier in a dep chain that leads to the current value of EDI).
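i.e. the AVX form lets you pick the merge source explicitly instead of needing a separate zeroing instruction (a sketch; the choice of %xmm7 is arbitrary):

    # SSE: separate dep-breaking instruction needed
    xorps     %xmm0, %xmm0
    cvtsi2ss  %edi, %xmm0
    # AVX: the merge source is an explicit operand, so point it at a "cold" register
    vcvtsi2ss %edi, %xmm7, %xmm0   # low element = convert(edi), upper elements from xmm7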
As I mentioned in "Why does the latency of the sqrtsd instruction change based on the input? Intel processors", this ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves, and leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency.) AVX finally lets us avoid this with vsqrtsd %src,%src, %dst, at least for register sources if not memory, and likewise with vcvtsi2sd %eax, %cold_reg, %dst for the similarly near-sightedly designed scalar int->fp conversion instructions.
(GCC missed-optimization reports: 80586, 89071, 80571.)
If cvtsi2ss/sd had zeroed the upper elements of the destination register, we wouldn't have this stupid problem / wouldn't need to sprinkle xor-zeroing instructions around; thanks Intel. (Another strategy is to use SSE2 movd %eax, %xmm0, which does zero-extend, then a packed int->fp conversion which operates on the whole 128-bit vector. This can break even for float, where the scalar int->fp conversion is 2 uops and the vector strategy is 1+1. But not for double, where the packed int->fp conversion costs a shuffle + an FP uop.)
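A sketch of that alternative (after the movd, the upper three int32 lanes are zero, so they just convert to 0.0):

    # zero-extending integer move + packed conversion: 1 + 1 uops, no false dependency
    movd      %edi, %xmm0          # zero-extends edi into the full 128-bit register
    cvtdq2ps  %xmm0, %xmm0         # packed int32 -> float; the low element is the result
    # vs. the scalar route: cvtsi2ss is itself 2 uops and still wants dep-breaking
    xorps     %xmm0, %xmm0
    cvtsi2ss  %edi, %xmm0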
This is exactly the problem that AMD64 avoided by making writes to 32-bit integer registers implicitly zero-extend into the full 64-bit register instead of leaving it unmodified (aka merging); see "Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?". (Writing 8 and 16-bit registers does cause false dependencies, on AMD CPUs and on Intel since Haswell.)
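For example (GP-integer analogue, purely illustrative):

    mov     $1, %eax       # 32-bit write: implicitly zeros the upper 32 bits of RAX
    mov     $1, %ax        # 16-bit write: merges into RAX, so it depends on the old value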