Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
142 views
in Technique[技术] by (71.8m points)

c++ - Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:

unsigned div3(unsigned x) {
    return x / 3;
}

When compiled using clang 10 at -O3, this results in:

div3(unsigned int):
        mov     ecx, edi         # tmp = x
        mov     eax, 2863311531  # result = 3^-1
        imul    rax, rcx         # result *= tmp
        shr     rax, 33          # result >>= 33
        ret

What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3-1 mod 232 which is 2863311531.

There are some things that I don't understand though:

  1. Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
  2. Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
  3. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
  4. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

Edit 1

For those who don't understand what I mean by 3-1 mod 232, I am talking about the multiplicative inverse here. For example:

// multiplying with inverse of 3:
15 * 2863311531      = 42949672965
42949672965 mod 2^32 = 5

// using fixed-point multiplication
15 * 2863311531      = 42949672965
42949672965 >> 33    = 5

// simply dividing by 3
15 / 3               = 5

So multiplying with 42949672965 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.

Edit 2

I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3-1 is equal to 3-1, not zero. Only fixed point arithmetic has correct rounding.

Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.

unsigned div3(unsigned x) {
    __builtin_assume(x % 3 == 0);
    return x / 3;
}

(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
  1. Can't we multiply rax with edi directly?

We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.

(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).

Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.

  1. Why do we multiply in 64-bit mode?

(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)

You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF) (Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)

Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)

GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/

# gcc8.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                 # 1 uop, stupid wasted instruction
        mov     edx, -1431655765         # 1 uop  (same 32-bit constant, just printed differently)
        mul     edx                      # 3 uops on Sandybridge-family
        mov     eax, edx                 # 1 uop
        shr     eax                      # 1 uop
        ret
                                  # total of 7 uops on SnB-family

Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:

# gcc9.3 -O3  (default -mtune=generic)
div3(unsigned int):
        mov     eax, edi                # 1 uop
        mov     edi, 2863311531         # 1 uop
        imul    rax, rdi                # 1 uop
        shr     rax, 33                 # 1 uop
        ret
                      # total 4 uops, not counting ret

Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.

  1. Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.

It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.

Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.

Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.

  1. What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.

No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):

See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.

For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed

// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}

Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)


From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.

As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.

unsigned div3_exact_only(unsigned x) {
    __builtin_assume(x % 3 == 0);  // or an equivalent with if() __builtin_unreachable()
    return x / 3;
}

Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.

div3_exact_only:
    imul  eax, edi, 0xAAAAAAAB        # 1 uop, 3c latency
    ret

However, it is something you can do yourself in source code, at least for known type widths like uint32_t:

uint32_t div3_exact_only(uint32_t x) {
    return x * 0xaaaaaaabU;
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...