cvtsi2ss %edi, %xmm0 merges the converted float into the low element of XMM0, so it has a false dependency on the old value of XMM0. (Across repeated calls to the same function, that creates one long loop-carried dependency chain.)
xor-zeroing breaks the dep chain, allowing out-of-order exec to work its magic. So you bottleneck on addss throughput (0.5 cycles) instead of latency (4 cycles).
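For concreteness, the two patterns look like this (a minimal sketch in AT&T syntax, not your actual compiler output):

    # with dep-breaking: xmm0's old value is irrelevant
    xorps    %xmm0, %xmm0        # recognized zeroing idiom: no input dependency
    cvtsi2ss %edi, %xmm0         # int -> float written into the low element of xmm0

    # without it: cvtsi2ss has to wait for whatever last wrote xmm0
    cvtsi2ss %edi, %xmm0         # merge into xmm0 => false dependency on its old value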
Your CPU is a Skylake derivative so those are the numbers; earlier Intel CPUs have 3-cycle latency and 1/clock throughput, using a dedicated FP-add execution unit instead of running addss on the FMA units. https://agner.org/optimize/. Probably function call/ret overhead prevents you from seeing the full 8x expected speedup from the latency * bandwidth product of 8 in-flight addss uops in the pipelined FMA units; you should see close to that 8x if you compare a loop within a single function with and without the xorps dep-breaking, as sketched below.
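Something like this hypothetical test loop (register choices and loop structure are illustrative, not taken from your code):

    # %ecx = iteration count, %xmm2 = some loop-invariant float
    .Lloop:
        xorps    %xmm0, %xmm0        # delete this line to re-create the false dependency
        cvtsi2ss %ecx, %xmm0         # without the xorps, this waits for last iteration's xmm0
        addss    %xmm2, %xmm0        # the same kind of FP add as in the function body
        dec      %ecx
        jnz      .Lloop

With the xorps, every iteration's convert+add chain is independent and you run into throughput limits; without it, xmm0 carries the cvtsi2ss + addss latency from one iteration to the next.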
GCC tends to be very "careful" about false dependencies, spending extra instructions (front-end bandwidth) to break them just in case. In code that bottlenecks on the front-end (or where total code size / uop-cache footprint is a factor) this costs performance if the register was actually ready in time anyway.
Clang/LLVM is reckless and cavalier about it, typically not bothering to avoid false dependencies on registers not written in the current function (i.e. assuming / pretending that registers are "cold" on function entry). As you show in comments, clang does avoid creating a loop-carried dep chain by xor-zeroing when it's looping inside one function, as opposed to the chain being created via multiple calls to the same function.
Clang even uses 8-bit GP-integer partial registers for no reason in some cases where that doesn't save any code-size or instructions vs. 32-bit regs. Usually it's probably fine, but there's a risk of coupling into a long dep chain or creating a loop-carried dependency chain if the caller (or a sibling function call) still has a cache-miss load in flight to that reg when we're called, for example.
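A hypothetical instance of that pattern (not taken from your code):

    add     %al, %cl        # 8-bit op: the result merges into the low byte of RCX
    add     %eax, %ecx      # 32-bit op: same 2-byte encoding, but writes all of ECX

The 8-bit version is no smaller, yet the write to CL has to merge with (and thus wait for) the old value of RCX on CPUs that don't rename the partial register separately.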
See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for more about how OoO exec can overlap short to medium length independent dep chains. Also related: "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" is about unrolling a dot-product with multiple accumulators to hide FMA latency.
https://www.uops.info/html-instr/CVTSI2SS_XMM_R32.html has performance details for this instruction across various uarches.
You can avoid this if you can use AVX, with vcvtsi2ss %edi, %xmm7, %xmm0 (where xmm7 is any register you haven't written recently, or which is earlier in a dep chain that leads to the current value of EDI).
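i.e. the AVX form lets you pick the merge source explicitly instead of needing a separate zeroing instruction (a sketch; the choice of %xmm7 is arbitrary):

    # SSE: separate dep-breaking instruction needed
    xorps     %xmm0, %xmm0
    cvtsi2ss  %edi, %xmm0
    # AVX: the merge source is an explicit operand, so point it at a "cold" register
    vcvtsi2ss %edi, %xmm7, %xmm0   # low element = convert(edi), upper elements from xmm7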
As I mentioned in "Why does the latency of the sqrtsd instruction change based on the input? Intel processors", this ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves, and leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency.) AVX finally lets us avoid this with vsqrtsd %src,%src, %dst, at least for register sources if not memory, and likewise with vcvtsi2sd %eax, %cold_reg, %dst for the similarly near-sightedly designed scalar int->fp conversion instructions.
(GCC missed-optimization reports: 80586, 89071, 80571.)
If cvtsi2ss/sd had zeroed the upper elements of the destination register, we wouldn't have this stupid problem / wouldn't need to sprinkle xor-zeroing instructions around; thanks Intel. (Another strategy is to use SSE2 movd %eax, %xmm0, which does zero-extend, then a packed int->fp conversion which operates on the whole 128-bit vector. This can break even for float, where the scalar int->fp conversion is 2 uops and the vector strategy is 1+1. But not for double, where the packed int->fp conversion costs a shuffle + an FP uop.)
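A sketch of that alternative (after the movd, the upper three int32 lanes are zero, so they just convert to 0.0):

    # zero-extending integer move + packed conversion: 1 + 1 uops, no false dependency
    movd      %edi, %xmm0          # zero-extends edi into the full 128-bit register
    cvtdq2ps  %xmm0, %xmm0         # packed int32 -> float; the low element is the result
    # vs. the scalar route: cvtsi2ss is itself 2 uops and still wants dep-breaking
    xorps     %xmm0, %xmm0
    cvtsi2ss  %edi, %xmm0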
This is exactly the problem that AMD64 avoided by making writes to 32-bit integer registers implicitly zero-extend into the full 64-bit register instead of leaving it unmodified (aka merging); see "Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?". (Writing 8 and 16-bit registers does cause false dependencies, on AMD CPUs and on Intel since Haswell.)
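For example (GP-integer analogue, purely illustrative):

    mov     $1, %eax       # 32-bit write: implicitly zeros the upper 32 bits of RAX
    mov     $1, %ax        # 16-bit write: merges into RAX, so it depends on the old value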