Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
967 views
in Technique[技术] by (71.8m points)

assembly - Multiplication with constant - imul or shl-add-combination

This question is about how we multiply an integer with a constant. So let's look at a simple function:

int f(int x) {
    return 10*x;
}

How can that function be optimized best, especially when inlined into a caller?

Approach 1 (produced by most optimizing compilers (e.g. on Godbolt))

    lea    (%rdi,%rdi,4), %eax
    add    %eax, %eax

Approach 2 (produced with clang3.6 and earlier, with -O3)

    imul   $10, %edi, %eax

Approach 3 (produced with g++6.2 without optimization, removing stores/reloads)

    mov    %edi, %eax
    sal    $2, %eax
    add    %edi, %eax
    add    %eax, %eax

Which version is fastest, and why? Primarily interested in Intel Haswell.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

According to Agner Fog's testing (and other stuff like AIDA64) Intel CPUs since Core2 have had imul r32,r32, imm latency of 3c, throughput one per 1c. Since Nehalem, 64-bit multiplies are also that fast. (Agner says Nehalem's imul r64,r64,imm slower (2c throughput) than imul r64,r64, but that doesn't match other results. Instlatx64 says 1c.)

AMD CPUs before Ryzen are slower, e.g. Steamroller has lat=4c tput=one per 2c for 32-bit multiply. For 64-bit multiply, lat=6c tput=one per 4c. AMD Ryzen has the same excellent multiply performance as Intel.


LEA with 2 components in the addressing mode (base + index, but no constant displacement) runs in 1c latency on all Intel CPUs1, except maybe for Atom where LEA runs in a different stage of the pipeline (in the actual AGU, not the ALU) and needs its input ready 4c earlier than a "normal" ALU instruction. Conversely, its input is ready sooner so the ADD can use the result the same cycle, I think. (I haven't tested this, and don't have any Atom HW.)

On Intel SnB-family, simple-LEA can run on ports 1 or 5, so it has twice the throughput of IMUL.

ADD can run on any ALU port on any CPU. HSW introduced a 4th ALU port (vs. IvyBridge), so it can sustain 4 ALU uops per clock (in theory).

So the LEA+ADD version has 2c latency on most x86 CPUs, and on Haswell can run two multiplies per clock.

Footnote 1: On AMD (including Zen / Zen2), a scaled-index makes an LEA "slow" (2 cycle latency and runs on fewer ports). e.g. lea r32, [r64+r64*2] measured at 2 cycle latency on Zen2 vs. 1 cycle on Skylake. (Agner Fog also mentions that lea r32, [r64...] is slower on AMD, but that might only have been a Bulldozer effect; it's not apparent in https://uops.info/'s results for Zen / Zen2.)


But if the multiply is only one small part of a bigger surrounding loop that bottlenecks on total uop throughput, not multiply latency or throughput, the IMUL version is better.


If your multiply constant is too big for two LEAs, or a SHL+LEA, then you're probably better off with IMUL, especially when tuning primarily for Intel CPUs with their extremely high performance integer multipliers.

SHL+LEA or SHL+SUB might be useful e.g. to multiply by 63. (from Godbolt: gcc6.2 -O3 -march=haswell)

    movl    %edi, %eax
    sall    $6, %eax
    subl    %edi, %eax

On Haswell, where MOV is zero-latency, this has only 2c latency. But it's 3 fused-domain uops vs. 1 for imull $63, %edi, %eax. So it's more uops in the pipeline, reducing how far ahead the CPU can "see" to do out-of-order execution. It also increases pressure on the uop cache, and L1 I-cache, for a compiler to consistently pick this strategy, because it's more instruction bytes.

On CPUs before IvyBridge, this is strictly worse than IMUL unless something else is competing for port1, because it's 3c latency (the MOV is on the critical path dependency chain, and has 1c latency).

As usual, none of the asm fragments can be said to be optimal for all situations. It depends on what the bottleneck is in the surrounding code: latency, throughput, or uops.

The answer will be different for the same surrounding code on different microarchitectures, too.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...