Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
118 views
in Technique[技术] by (71.8m points)

c++ - Why does MSVC emit a useless MOVSX before performing this Bit Test?

Compiling the following code in MSVC 2013, 64-bit release build, /O2 optimization:

while (*s == ' ' || *s == ',' || *s == '
' || *s == '
') {
    ++s;
}

I get the following code - which has a really cool optimization using a 64-bit register as a lookup table with the bt (bit test) instruction.

    mov     rcx, 17596481020928             ; 0000100100002400H
    npad    5
$LL82@myFunc:
    movzx   eax, BYTE PTR [rsi]
    cmp     al, 44                          ; 0000002cH
    ja      SHORT $LN81@myFunc
    movsx   rax, al
    bt      rcx, rax
    jae     SHORT $LN81@myFunc
    inc     rsi
    jmp     SHORT $LL82@myFunc
$LN81@myFunc:
    ; code after loop...

But my question is: what is the purpose of the movsx rax, al after the first branch?

First we load a byte from the string into rax and zero-extend it:

movzx eax, BYTE PTR [rsi]

Then the cmp/ja pair performs an unsigned comparison between al and 44, and branches forwards if al is greater.

So now, we know 0 <= al <= 44 in unsigned numbers. Therefore, the highest bit of al could not possibly be set!

Nonetheless, the next instruction is movsx rax, al. This is a sign-extended move. But since:

  • al is the lowest byte of rax
  • we already know the other 7 bytes of rax are zeroed
  • we just proved that al's highest bit could not possibly be set

this movsx must be a no-op.

Why does MSVC do it? I'm assuming it's not for padding, since in that case another npad would make the meaning clearer. Is it flushing data dependencies or something?

(By the way, this bt optimization really makes me happy. Some interesting facts: it runs in 0.6x the time of the 4 cmp/je pairs you might expect, it's way faster than strspn or std::string::find_first_not_of, and it only happens in 64-bit builds even if the characters of interest have values under 32.)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.

There are two basic codegen cases here. First one is the more universal one, where (charmax - charmin <= 64) but charmax >= 64. The optimizer needs to generate different code from what you saw, it needs to subtract charmin. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' by *s == 'A'.

Then there's the special case that you tested, all character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code, he made sure not to generate a silly SUB EAX,0 instruction. But overlooked that generating the MOVSX wasn't necessary. Surely missed by only checking for optimal code in the general case. And a general function call in the code, so easy to overlook, note how the instruction changes to MOVZX when you compile with /J. Otherwise easily deemed necessary, there is no BT instruction that takes an 8-bit register as the 2nd operand so the AL register load isn't enough by itself.

There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer. And decided to keep MOVSX to improve superscalar execution. I seriously doubt it exists.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...