I'm guessing that clang devs tested which uarches it was good on, and found it was only SnB-family.
That sounds right, because of a funky stall on P6-family, and AMD's different dividers.
Using the flag result from a shift imm8 (not a shift-by-implicit-1) on P6-family causes the front-end to stall before issuing the flag-reading instruction until the shift is retired. (Because the P6 decoders don't check for the imm8=0 case for leaving flags unmodified, while SnB does). INC instruction vs ADD 1: Does it matter?. That might be why clang doesn't use it for P6-family.
Probably a different way of checking the relevant condition that didn't cause this stall (like a test rcx,rcx
before the je
, would make it worth it on Core2/Nehalem). But if clang devs didn't realize the reason it was slow on P6-family, they wouldn't have thought to fix it, and just left it not done for pre-SnB targets. (Nobody added me to a patch review or bug CC list about this one, unfortunately; this is the first I've seen of clang doing this optimization. Although I think I might have mentioned shift flag stalls in comments on some other LLVM review or bug. Anyway, might be fun to try adding a test
and see if that makes it worthwhile on Nehalem.)
AMD's dividers have the same best-case div performance regardless of operand-size, presumably depending only on the actual magnitude of the inputs, according to Agner Fog. Only the worst-case grows with operand-size. So I think it's harmless to run idiv r64
with small inputs sign-extended to 128 / 64-bit on AMD. (div/idiv on AMD is 2 uops for all operand sizes (except 8-bit where it's one because it only has to write one output register: AH and AL = AX. Unlike Intel's microcoded integer division.)
Intel is very different: idiv r32
is 9 uops, vs. idiv r64
being 59 uops, with a best-case throughput that's 3x worse, on Haswell. Other members of SnB-family are similar.
Why don't other compilers like GCC or ICC do it?
Probably because clang developers thought of it, and gcc/icc haven't copied them yet. If you've watched Chandler Carruth's talks about perf
, one example he used was playing around with a branch to skip a div
. I'm guessing this optimization was his idea. Looks nifty. :)
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…