That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
imul r, r/m
and imul r, r/m, imm
are both 3 cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64bit operand size.
On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower times for 64-bit operand-size).
Data from Agner Fog's instruction tables. See also other stuff in the x86 tag wiki.
The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier. How modern X86 processors actually compute multiplications?).
Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). e.g. you can't saturate the pext
/pdep
unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.
Fun fact: imul r64
is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32
is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64bit result from the regular 64bit multiplier into two 32bit halves.
Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.
gcc and clang3.8 and later use up to two LEA
instructions instead of imul r, r/m, imm
. I think gcc will use imul
if the alternative is 3 or more instructions (not including mov
), though.
That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul
on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.
clang3.7 and earlier tends to favour imul
except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)
e.g. this code on the Godbolt compiler explorer:
int foo (int a) { return a * 63; }
# gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
mov eax, edi # tmp91, a
sal eax, 6 # tmp91,
sub eax, edi # tmp92, a
ret
clang3.8 and later makes the same code.