assembly - Hard to debug SEGV due to skipped cmov from out-of-bounds memory

I'm trying to code a few high-performance assembly functions as an exercise, and have encountered a weird segfault that happens when running the program, but not in valgrind or nemiver.

Basically, a cmov that should never execute, because its memory operand uses an out-of-bounds address, makes the program segfault even though the condition is always false.

I have a fast and a slow version. The slow one works all the time. The fast one works, unless it receives a non-ASCII char, at which point it crashes horribly, unless I'm running it under gdb or nemiver.

ascii_flags is simply a 128-byte array (with a bit of room at the end) containing flags for every ASCII character (alpha, numeric, printable, etc.)
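
For context, a hypothetical layout of such a table (an assumption for illustration only; the real table presumably sets more flag bits than just flag_print):

flag_print  equ 0b00001000                  ; the "printable" bit, matching the code below

section .rodata
ascii_flags:
    times 32 db 0                           ; 0x00..0x1F control chars: not printable
    times 95 db flag_print                  ; 0x20..0x7E: printable
    db 0                                    ; 0x7F (DEL): not printable
    times 8 db 0                            ; a bit of room at the end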

this works:

ft_isprint:
    xor EAX, EAX                ; empty EAX
    test EDI, ~127              ; check for non-ascii (>127) input
    jnz .error
    mov EAX, [rel ascii_flags + EDI]    ; load ascii table if input fits
    and EAX, 0b00001000         ; get specific bit
.error:
    ret

but this doesn't:

ft_isprint:
    xor EAX, EAX                ; empty EAX
    test EDI, ~127              ; check for non-ascii (>127) input
    cmovz EAX, [rel ascii_flags + EDI]  ; load ascii table if input fits
    and EAX, flag_print         ; get specific bit
    ret

Valgrind does actually crash, but it reports nothing beyond memory addresses, since I haven't managed to get more debugging information out of it.

Edit:

I've written three versions of the function to take into account the wonderful answers:

ft_isprint:
    mov RAX, 128                            ; load default index
    test RDI, ~127                          ; check for non-ascii (>127) input
    cmovz RAX, RDI                          ; if none are found, load correct index
    mov AL, byte [ascii_flags + RAX]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit (and zeros rest of RAX)
    ret

ft_isprint_branch:
    test RDI, ~127                          ; check for non-ascii (>127) input
    jnz .out_of_bounds                      ; if non-ascii, jump to error handling
    mov AL, byte [ascii_flags + RDI]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit (and zeros rest of RAX)
    ret
.out_of_bounds:
    xor RAX, RAX                            ; zeros return value
    ret

ft_isprint_compact:
    xor RAX, RAX                            ; zeros return value preemptively
    test RDI, ~127                          ; check for non-ascii (>127) input
    jnz .out_of_bounds                      ; if non-ascii was found, skip the dereference
    mov AL, byte [ascii_flags + RDI]        ; dereference index into least sig. byte
    and RAX, flag_print                     ; get specific bit
.out_of_bounds:
    ret

After extensive testing, the branching functions are definitely faster than the cmov function, by about 5-15% on all types of data. The difference between the compact and non-compact versions is, as expected, minimal: the compact one is ever so slightly faster on a predictable data set, while the non-compact one is just as slightly faster on unpredictable data.

I tried various ways to skip the 'xor EAX, EAX' instruction, but couldn't find one that works.

Edit: After more testing, I've updated the code to three new versions:

ft_isprint_compact:
    sub EDI, 32                             ; subtract 32 from the input, so any value < ' ' wraps around (unsigned)
    xor EAX, EAX                            ; set return value to 0
    cmp EDI, 94                             ; check if input <= '~' - 32
    setbe AL                                ; if so, set return value to 1
    ret

ft_isprint_branch:
    xor EAX, EAX                            ; set return value to 0
    cmp EDI, 127                            ; check for non-ascii (>127) input
    ja .out_of_bounds                       ; if non-ascii was found, skip the dereference
    mov AL, byte [rel ascii_flags + EDI]    ; dereference index into least sig. byte
.out_of_bounds:
    ret

ft_isprint:
    mov EAX, 128                            ; load default index
    cmp EDI, EAX                            ; check if ascii
    cmovae EDI, EAX                         ; replace with 128 if outside 0..127
                                            ; cmov also zero-extends EDI into RDI
;   movzx EAX, byte [ascii_flags + RDI]     ; alternative to the two following instructions if masking is removed
    mov AL, byte [ascii_flags + RDI]        ; load table entry
    and EAX, flag_print                     ; apply mask to get correct bit and zero rest of EAX
    ret

The timings are as follows, in microseconds. The 1-2-3 prefixes show the order of execution, to rule out a caching advantage:

-O3 a.out
1 cond 153185, 2 branch 238341 3 no_table 145436
1 cond 148928, 3 branch 248954 2 no_table 116629
2 cond 149599, 1 branch 226222 3 no_table 117428
2 cond 117258, 3 branch 241118 1 no_table 147053
3 cond 117635, 1 branch 228209 2 no_table 147263
3 cond 146212, 2 branch 220900 1 no_table 147377
-O3 main.c
1 cond 132964, 2 branch 157963 3 no_table 131826
1 cond 133697, 3 branch 159629 2 no_table 105961
2 cond 133825, 1 branch 139360 3 no_table 108185
2 cond 113039, 3 branch 162261 1 no_table 142454
3 cond 106407, 1 branch 133979 2 no_table 137602
3 cond 134306, 2 branch 148205 1 no_table 141934
-O0 a.out
1 cond 255904, 2 branch 320505 3 no_table 257241
1 cond 262288, 3 branch 325310 2 no_table 249576
2 cond 247948, 1 branch 340220 3 no_table 250163
2 cond 256020, 3 branch 415632 1 no_table 256492
3 cond 250690, 1 branch 316983 2 no_table 257726
3 cond 249331, 2 branch 325226 1 no_table 250227
-O0 main.c
1 cond 225019, 2 branch 224297 3 no_table 229554
1 cond 235607, 3 branch 199806 2 no_table 226286
2 cond 226739, 1 branch 210179 3 no_table 238690
2 cond 237532, 3 branch 223877 1 no_table 234103
3 cond 225485, 1 branch 201246 2 no_table 230591
3 cond 228824, 2 branch 202015 1 no_table 226788

The no-table version is about as fast as the cmov version, but doesn't lend itself to easily implementing locales. The branching algorithm is worse except on predictable data at zero optimization? I've got no explanation for that.

I'll keep the cmov version, which is both the most elegant and easily updatable. Thanks for all the help.

1 Answer


cmov is an ALU select operation that always reads both sources before checking the condition. Using a memory source doesn't change this. It's not like an ARM predicated instruction that acts like a NOP if the condition was false. cmovz eax, [mem] also unconditionally writes EAX, zero-extending into RAX regardless of the condition.

As far as most of the CPU is concerned (the out-of-order scheduler and so on), cmovcc reg, [mem] is handled exactly like adc reg, [mem]: a 3-input 1-output ALU instruction. (adc writes flags, unlike cmov, but nevermind that.) The micro-fused memory source operand is a separate uop that just happens to be part of the same x86 instruction. This is how the ISA rules for it work, too.

So really, a more appropriate mnemonic for cmovz would be selectz.
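
That's exactly why the question's fast version faults. A minimal sketch (assuming RDI holds an unmapped or out-of-bounds address):

    xor    eax, eax
    test   eax, eax                         ; sets ZF=1, so the NZ condition is false
    cmovnz eax, [rdi]                       ; condition is false, but the load from
                                            ; [rdi] still executes and still faults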


x86's only conditional loads (that don't fault on bad addresses, just potentially run slowly) are:

  • Normal loads protected by conditional branches. Branch mis-prediction or other mis-speculations leading to running a faulting load are handled fairly efficiently (maybe starting a page walk, but once the mis-speculation is identified, execution of the correct flow of instructions doesn't have to wait for any memory operation started by speculative execution).

    If there was a TLB hit on a page you can't read, then not much more happens until a faulting load reaches retirement (known to be non-speculative and thus actually taking a #PF page-fault exception which is unavoidably going to be slow). On some CPUs, this fast handling leads to the Meltdown attack. >.< See http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/.

  • rep lodsd with RCX=0 or 1. (not fast or efficient, but microcode branches are special and can't benefit from branch prediction, on Intel CPUs. See What setup does REP do?. Andy Glew mentions microcode branch mispredictions, but I think those are different from normal branch misses because there seems to be a fixed cost.)

  • AVX2 vpmaskmovd/q / AVX1 vmaskmovps/pd. Faults are suppressed for elements where the mask is 0. A mask-load with an all-0 mask, even from a legal address, requires a ~200 cycle microcode assist with a base+index addressing mode. See section 12.9 CONDITIONAL SIMD PACKED LOADS AND STORES and Table C-8 in Intel's optimization manual. (On Skylake, stores to an illegal address with an all-zero mask also need an assist.) A minimal sketch follows this list.

    The earlier MMX/SSE2 maskmovdqu is store-only (and has an NT hint). Only the similar AVX instruction with dword/qword (instead of byte) elements has a load form.

  • AVX512 masked loads

  • AVX2 gathers with some / all mask elements cleared.

... and maybe others I'm forgetting. Normal loads inside TSX / RTM transactions: a fault aborts the transaction instead of raising a #PF. But you can't count on a bad index faulting instead of just reading bogus data from somewhere nearby, so it's not really a conditional load. It's also not super fast.
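
The masked-load sketch mentioned above (illustrative only; the mask value and [rdi] are placeholders):

    vpxor      ymm1, ymm1, ymm1             ; all-zero mask: no element is loaded
    vpmaskmovd ymm0, ymm1, [rdi]            ; doesn't fault even if [rdi] is unmapped,
                                            ; but may need a slow microcode assist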


An alternative might be to cmov an address that you use unconditionally, selecting which address to load from. e.g. if you had a 0 you could load from somewhere else, that would work. But then you'd have to calculate the table-indexing address in a register, not as part of an addressing mode, so that you could cmov the final address.
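
A minimal sketch of that idea (dummy_zero is a hypothetical readable zero byte defined somewhere; flag_print is the existing constant; the function name is made up):

ft_isprint_cmovaddr:
    mov     eax, edi                        ; zero-extend the arg into RAX
    lea     rcx, [rel ascii_flags]          ; table base, RIP-relative
    add     rax, rcx                        ; address of the real table entry
    lea     rcx, [rel dummy_zero]           ; address of a safe dummy entry
    cmp     edi, 127
    cmova   rax, rcx                        ; out of range? point at the dummy instead
    movzx   eax, byte [rax]                 ; the load itself is unconditional
    and     eax, flag_print
    ret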

Or just CMOV the index and pad the table with some zero bytes at the end so you can load from table + 128.

Or use a branch; it will probably predict well for a lot of cases. But maybe not for languages like French, where you'll find a mix of low-128 and higher Unicode code-points in common text.


Code Review

Note that [rel] only works when there's no register (other than RIP) involved in the addressing mode. RIP-relative addressing replaces one of the 2 redundant ways (in 32-bit code) to encode a [disp32]. It uses the shorter non-SIB encoding, while a ModRM+SIB can still encode an absolute [disp32] with no registers. (Useful for addresses like [fs: 16] for small offsets relative to thread-local storage with segment bases.)

If you just want to use RIP-relative addressing when possible, use default rel at the top of your file. [symbol] will be RIP-relative, but [symbol + rax] won't. Unfortunately, NASM and YASM default to default abs.

[reg + disp32] is a very efficient way to index static data in position-dependent code, just don't fool yourself into thinking that it can be RIP-relative. See 32-bit absolute addresses no longer allowed in x86-64 Linux?.
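
To make the distinction concrete, a sketch using the same [symbol] placeholder as above (the absolute form is only ok in position-dependent code):

default rel                     ; bare [symbol] becomes RIP-relative

    mov   eax, [symbol]         ; RIP-relative, same as [rel symbol]
    mov   eax, [symbol + rax]   ; NOT RIP-relative: absolute [disp32 + reg],
                                ;   only ok in position-dependent code
    lea   rsi, [symbol]         ; position-independent alternative: take the
    mov   eax, [rsi + rax]      ;   address with a RIP-relative LEA, then index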

[rel ascii_flags + EDI] is also weird because you're using a 32-bit register in an addressing mode in x86-64 code. There's usually no reason to spend an address-size prefix to truncate addresses to 32-bit.

However, in this case if your table is in the low 32-bits of virtual address space, and your function arg is only specified as 32 bits (so the caller is allowed to leave garbage in the upper 32 of RDI), it is actually a win to use [disp32 + edi] instead of a mov esi,edi or something to zero-extend. If you're doing that on purpose, definitely comment why you're using a 32-bit addressing mode.

But in this case, using a cmov on the index will zero-extend to 64-bit for you.

It's also weird to use a DWORD load from a table of bytes. You'll occasionally cross a cache-line boundary and suffer extra latency.
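
That is, prefer a byte load (sketch):

    movzx eax, byte  [ascii_flags + rdi]    ; reads 1 byte: never splits an entry
                                            ;   across a cache line
    mov   eax, dword [ascii_flags + rdi]    ; reads 4 bytes from a byte table: can
                                            ;   cross a cache-line boundary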


@fuz showed a version using a RIP-relative LEA and a CMOV on the index.

In position-dependent code where 32-bit absolute addresses are ok, by all means use it to save instructions. [disp32] addressing modes are worse than RIP-relative (1 byte longer), but [reg + disp32] addressing modes are perfectly fine where position-dependent code and 32-bit absolute addresses are ok (e.g. x86-64 Linux, but not OS X, where executables are always mapped outside the low 32 bits). Just be aware that it's not rel.

; position-dependent version taking advantage of 32-bit absolute [reg + disp32] addressing
; not usable in shared libraries, only non-PIE executables.
ft_isprint:
    mov     eax, 128               ; offset of dummy entry for "not ASCII"
    cmp     edi, eax               ; check if ascii
    cmovae  edi, eax               ; replace with 128 if outside 0..127
              ; cmov also zero-extends EDI into RDI
    movzx   eax, byte [ascii_flags + rdi] ; load table entry
    and     al, flag_print         ; mask the desired flag
      ; if the caller is only going to read / test AL anyway, might as well save bytes here
    ret

If any existing entry in your table has the same flags you want for high inputs, e.g. maybe entry 0 which you'll never see in implicit-length strings, you could still xor-zero EAX and keep your tables at 128 bytes, not 129.

test r32, imm32 takes more code bytes than you need. ~127 = 0xFFFFFF80 would fit in a sign-extended byte, but there is no TEST r/m32, sign-extended-imm8 encoding. There is such an encoding for cmp, though, like essentially all other immediate instructions.

You could instead check for unsigned above 127, with cmp edi, 127 / cmovbe eax, edi or cmova edi, eax. This saves 3 bytes of code-size. Or we can save 4 bytes with cmp reg,reg, using the 128 we already loaded for the table index.

A range-check before array indexing is also more intuitive for most humans than checking high bits anyway.
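
Putting the range check and the xor-zero idea together (a sketch; it assumes entry 0 of ascii_flags has flag_print clear, and the function name is made up):

ft_isprint_xorzero:
    xor     eax, eax                        ; index 0 doubles as the "not ASCII" entry
    cmp     edi, 127
    cmovbe  eax, edi                        ; keep the real index only when in range
    movzx   eax, byte [ascii_flags + rax]   ; position-dependent, like the version above
    and     eax, flag_print
    ret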

and al, imm8 is only 2 bytes, vs. 3 bytes for and r/m32, sign-extended-imm8. It's not slower on any CPUs, as long as the caller only reads AL. On Intel CPUs before Sandybridge, reading EAX after ANDing into AL could cause a partial-register stall / slowdown. Sandybridge doesn't rename partial registers for read-modify-write operations, if I recall correctly, and IvB and later don't rename low8 partial regs at all.

You might also use mov al, [table] instead of movzx to save another code byte. An earlier mov eax, 128 already broke any false dependency on the old value of EAX so it shouldn't have a performance downside. But movzx is not a bad idea.
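
Putting those byte-saving tweaks together (a sketch; only valid if the caller reads just AL; the function name is made up):

ft_isprint_small:
    mov     eax, 128                        ; dummy index; also breaks any false dep on EAX
    cmp     edi, eax
    cmovae  edi, eax                        ; clamp out-of-range input to index 128
                                            ; (cmov also zero-extends EDI into RDI)
    mov     al, byte [ascii_flags + rdi]    ; 1 byte shorter than movzx
    and     al, flag_print                  ; 2-byte encoding
    ret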

When all else is equal, smaller code-size is almost always better (for instruction-cache footprint, and sometimes for packing into the uop cache). If it cost any extra uops or introduced any false dependencies, it wouldn't be worth it when optimizing for speed, though.

