Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

performance - Packing BCD to DPD: How to improve this amd64 assembly routine?

I'm writing a routine to convert between BCD (4 bits per decimal digit) and Densely Packed Decimal (DPD) (10 bits per 3 decimal digits). DPD is further documented (with the suggestion for software to use lookup-tables) on Mike Cowlishaw's web site.
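
For reference, the encoding itself can be sketched in plain C. This is a minimal, unoptimized model of the standard DPD encoding table, using the bit names abcd efgh iklm from the listings in this question. The function name `bcd_to_dpd_ref` is made up for illustration, and the sketch assumes the input holds three valid BCD digits.

```c
#include <stdint.h>

/* Hypothetical reference encoder: 3 BCD digits (12 bits) -> 10-bit DPD.
 * Bit names follow the question: abcd efgh iklm, where a, e and i are the
 * top bits of the three digits (set only for the digits 8 and 9). */
uint16_t bcd_to_dpd_ref(uint16_t bcd)
{
    unsigned hi = (bcd >> 8) & 0xf;            /* abcd */
    unsigned mid = (bcd >> 4) & 0xf;           /* efgh */
    unsigned lo = bcd & 0xf;                   /* iklm */
    unsigned bcd_ = hi & 7, fgh = mid & 7, klm = lo & 7;
    unsigned d = hi & 1, h = mid & 1, m = lo & 1;
    unsigned fg = (mid >> 1) & 3;              /* bits f,g */
    unsigned kl = (lo >> 1) & 3;               /* bits k,l */
    unsigned aei = ((hi >> 3) << 2) | ((mid >> 3) << 1) | (lo >> 3);
    unsigned r;

    switch (aei) {
    case 0:  r = (bcd_ << 7) | (fgh << 4) | klm;                   break; /* bcd fgh 0 klm */
    case 1:  r = (bcd_ << 7) | (fgh << 4) | 0x8 | m;               break; /* bcd fgh 1 00m */
    case 2:  r = (bcd_ << 7) | (kl << 5) | (h << 4) | 0xa | m;     break; /* bcd klh 1 01m */
    case 3:  r = (bcd_ << 7) | 0x40 | (h << 4) | 0xe | m;          break; /* bcd 10h 1 11m */
    case 4:  r = (kl << 8) | (d << 7) | (fgh << 4) | 0xc | m;      break; /* kld fgh 1 10m */
    case 5:  r = (fg << 8) | (d << 7) | 0x20 | (h << 4) | 0xe | m; break; /* fgd 01h 1 11m */
    case 6:  r = (kl << 8) | (d << 7) | (h << 4) | 0xe | m;        break; /* kld 00h 1 11m */
    default: r = (d << 7) | 0x60 | (h << 4) | 0xe | m;             break; /* 00d 11h 1 11m */
    }
    return (uint16_t)r;
}
```

A model like this is mainly useful as an oracle: run it over all 1000 valid inputs and compare against whatever optimized routine is under test.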


This routine only ever requires the low 16 bits of the registers it uses, yet for shorter instruction encoding I have used 32-bit instructions wherever possible. Is there a speed penalty associated with code like:

mov data,%eax # high 16 bit of data are cleared
...
shl %al
shr %eax

or

and $0x888,%edi         #   = 0000 a000 e000 i000
imul $0x0490,%di        #   = aei0 ei00 i000 0000

where the alternative to a 16-bit imul would be either a 32-bit imul followed by an and, or a series of lea instructions and a final and.

The whole code in my routine can be found below. Is there anything in it where performance is worse than it could be due to me mixing word and dword instructions?

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jz 1f

        and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x0490,%di        #   = aei0 ei00 i000 0000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $13,%edi            # u = 0000 0000 0000 0aei
        imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or tab-6(,%rdi,4),%ax   # x = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
1:      ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
tab:
        .short 0x0011 ; .short 0x000a
        .short 0x0000 ; .short 0x004e
        .short 0x0081 ; .short 0x000c
        .short 0x0008 ; .short 0x002e
        .short 0x0081 ; .short 0x000e
        .short 0x0000 ; .short 0x006e
        .size tab,.-tab
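
To make the bit manipulation easier to follow, here is a rough C model of the routine above. It is only a sketch for checking the logic, not a replacement for the assembly. `bcd2dpd_mul_model` and `tab_model` are names invented here, and the model assumes the input holds three valid BCD digits.

```c
#include <stdint.h>

/* Multiplier / OR-mask pairs copied from `tab`, indexed by aei - 2. */
static const uint16_t tab_model[6][2] = {
    {0x0011, 0x000a},   /* aei = 010 */
    {0x0000, 0x004e},   /* aei = 011 */
    {0x0081, 0x000c},   /* aei = 100 */
    {0x0008, 0x002e},   /* aei = 101 */
    {0x0081, 0x000e},   /* aei = 110 */
    {0x0000, 0x006e},   /* aei = 111 */
};

uint16_t bcd2dpd_mul_model(uint16_t bcd)
{
    uint32_t x = bcd & 0xfffu;                     /* 0000 abcd efgh iklm */

    /* shl %al ; shr %eax: squeeze out bit e, giving 0000 0abc dfgh iklm */
    uint32_t t = ((x & 0xf00u) | ((x << 1) & 0xffu)) >> 1;

    if ((x & 0x880u) == 0)                         /* fast path: a = e = 0 */
        return (uint16_t)t;

    /* and $0x888 ; imul $0x0490,%di ; shr $13: gather a, e, i into 3 bits */
    uint32_t u = (uint32_t)(uint16_t)((x & 0x888u) * 0x0490u) >> 13;

    uint32_t q = t & 0x66u;                        /* 0fg0 0kl0: bits to move */
    uint32_t r = t & 0x397u;                       /* bits that may stay put  */
    uint32_t v = q * tab_model[u - 2][0];          /* copy f,g,k,l elsewhere  */

    /* xor clears the moved bits at their old position, or sets the markers */
    return (uint16_t)(((r ^ v) | tab_model[u - 2][1]) & 0x3ffu);
}
```

The multiply works as a conditional shift: each table multiplier either keeps q in place, adds a shifted copy of it, or zeroes it, so a single imul replaces a per-case branch.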

Improved Code

After applying some suggestions from the answer and comments and some other trickery, here is my improved code.

        .section .text
        .type bcd2dpd_mul,@function
        .globl bcd2dpd_mul

        # convert BCD to DPD with multiplication tricks
        # input abcd efgh iklm in edi
        .align 8
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jnz 1f
        ret

        .align 8
1:      and $0x888,%edi         #   = 0000 a000 e000 i000
        imul $0x49,%edi         #   = 0ae0 aei0 ei00 i000
        mov %eax,%esi
        and $0x66,%esi          # q = 0000 0000 0fg0 0kl0
        shr $8,%edi             #   = 0000 0000 0ae0 aei0
        and $0xe,%edi           #   = 0000 0000 0000 aei0
        movzwl lookup-4(%rdi),%edx
        movzbl %dl,%edi
        imul %edi,%esi          # v = q * tab[u-2][0]
        and $0x397,%eax         # r = 0000 00bc d00h 0klm
        xor %esi,%eax           # w = r ^ v
        or %dh,%al              #   = w | tab[u-2][1]
        and $0x3ff,%eax         #   = 0000 00xx xxxx xxxx
        ret

        .size bcd2dpd_mul,.-bcd2dpd_mul

        .section .rodata
        .align 4
lookup:
        .byte 0x11
        .byte 0x0a
        .byte 0x00
        .byte 0x4e
        .byte 0x81
        .byte 0x0c
        .byte 0x08
        .byte 0x2e
        .byte 0x81
        .byte 0x0e
        .byte 0x00
        .byte 0x6e
        .size lookup,.-lookup


1 Answer


TYVM for commenting the code clearly and well, BTW. It made it super easy to figure out what was going on, and where the bits were going. I'd never heard of DPD before, so puzzling it out from uncommented code and the Wikipedia article would have sucked.


The relevant gotchas are:

  • Avoid 16bit operand size for instructions with immediate constants on Intel CPUs (LCP stalls).
  • Avoid reading the full 32 or 64bit register after writing only the low 8 or 16 bits on Intel pre-IvyBridge (partial-register extra uop). IvB still has that slowdown if you modify an upper8 reg like AH, but Haswell removes that too. It's not just an extra uop: the penalty on Core2 is 2 to 3 cycles, according to Agner Fog. I might be measuring it wrong, but it seems a lot less bad on SnB.

See http://agner.org/optimize/ for full details.

Other than that, there's no general problem with mixing in some instructions using the operand-size prefix to make them 16-bit.


You should maybe write this as inline asm, rather than as a called function. You only use a couple registers, and the fast-path case is very few instructions.


I had a look at the code. I didn't look into achieving the same result with significantly different logic, just at optimizing the logic you do have.


Possible code suggestions: Switch the branching so the fast path has the not-taken branch. Actually, it might make no difference either way in this case, or it might improve the alignment of the slow-path code.

.p2align 4,,10   # align to 16, unless we're already in the first 6 bytes of a block of 16
bcd2dpd_mul:
        mov %edi,%eax           #   = 0000 abcd efgh iklm
        shl %al                 #   = 0000 abcd fghi klm0
        shr %eax                #   = 0000 0abc dfgh iklm
        test $0x880,%edi        # fast path for a = e = 0
        jnz .Lslow_path
        ret

.p2align 4    # Maybe fine-tune this alignment based on how the rest of the code assembles.    
.Lslow_path:

        ...
        ret

It's sometimes better to duplicate return instructions than to absolutely minimize code-size. The compare-and-branch in this case is the 4th uop of the function, though, so a taken branch wouldn't have prevented 4 uops from issuing in the first clock cycle, and a correctly-predicted branch would still issue the return on the 2nd clock cycle.


You should use a 32bit imul for the one with the table source. (see next section about aligning the table so reading an extra 2B is ok). 32bit imul is one uop instead of two on Intel SnB-family microarches. The result in the low16 should be the same, since the sign bit can't be set. The upper16 gets zeroed by the final and before ret, and doesn't get used in any way where garbage in the upper16 matters while it's there.
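
A small C sketch of why the wider multiply is safe here: the low 16 bits of a product depend only on the low 16 bits of the operands, so whatever the 4-byte load drags into the upper half of the multiplier cannot disturb the bits that are kept. The function names are hypothetical, and the little-endian layout of a multiplier followed by its OR mask is taken from the question's `tab`.

```c
#include <stdint.h>

/* 16-bit multiply, as a 16-bit imul with a memory operand computes it. */
uint16_t mul_low16(uint16_t q, uint16_t m)
{
    return (uint16_t)((uint32_t)q * m);
}

/* 32-bit multiply by a 4-byte table load: the multiplier sits in the low
 * half and the neighbouring table entry ends up in the high half. */
uint16_t mul32_then_truncate(uint16_t q, uint16_t m, uint16_t next_entry)
{
    uint32_t loaded = ((uint32_t)next_entry << 16) | m;  /* little-endian dword */
    uint32_t product = (uint32_t)q * loaded;             /* wraps modulo 2^32   */
    return (uint16_t)product;                            /* keep the low 16 bits */
}
```

For example, with q = 0x66 and the table pair 0x0081 / 0x000c, both functions return 0x3366. The garbage only shows up above bit 15, where the final and $0x3ff discards it.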

Your imul with an immediate operand is problematic, though.

It causes an LCP stall when decoding on Intel, and it writes the low16 of a register that is later read at full width. Its upper16 would be a problem if not masked off (since it's used as a table index). Its operands are large enough that they will put garbage into the upper16, so it does need to be discarded.

I thought your way of doing it would be optimal for some architectures, but it turns out imul r16,r16,imm16 itself is slower than imul r32,r32,imm32 on every architecture except VIA Nano, AMD K7 (where it's faster than imul32), and Intel P6 (where using it from 32bit / 64bit mode will LCP-stall, and where partial-reg slowdowns are a problem).

On Intel SnB-family CPUs, where imul r16,r16,imm16 is two uops, imul32/movzx would be strictly better, with no downside except code size. On P6-family CPUs (i.e. PPro to Nehalem), imul r16,r16,imm16 is one uop, but those CPUs don't have a uop cache, so the LCP stall is probably critical (except maybe Nehalem calling this in a tight loop, fitting in the 28 uop loop buffer). And for those CPUs, the explicit movzx is probably better from the perspective of the partial-reg stall. Agner Fog says something about there being an extra cycle while the CPU inserts the merging uop, which might mean a cycle where that extra uop is issued alone.

On AMD K8-Steamroller, imul imm16 is 2 m-ops instead of 1 for imul imm32, so imul32/movzx is about equal to imul16 there. They don't suffer from LCP stalls, or from partial-reg problems.

On Intel Silvermont, imul imm16 is 2 uops (with one per 4 clocks throughput), vs. imul imm32 being 1 uop (with one per 1 clock throughput). Same thing on Atom (the in-order predecessor to Silvermont): imul16 is an extra uop and much slower. On most other microarchitectures, throughput isn't worse, just latency.

So if you're willing to increase the code-size in bytes where it will give a speedup, you should use a 32bit imul and a movzwl %di, %edi. On some architectures, this will be about the same speed as the imul imm16, while on others it will be much faster. It might be slightly worse on AMD Bulldozer-family, which apparently isn't very good at using both integer execution units at once, so a 2 m-op instruction for EX1 might be better than two 1 m-op instructions where one of them is still an EX1-only instruction. Benchmark this if you care.
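
The index computation can be modelled in C to show what the movzwl is for. This is a sketch with invented function names: the first two functions mirror the 16-bit imul and the 32-bit imul plus zero-extension, the third shows the garbage that reaches the table index if the zero-extension is left out, and the last one models the imul $0x49 variant from the question's improved code.

```c
#include <stdint.h>

/* imul $0x0490,%di ; shr $13,%edi  (16-bit multiply truncates for free) */
unsigned aei_imul16(uint32_t bcd)
{
    uint16_t di = (uint16_t)((bcd & 0x888u) * 0x0490u);
    return (unsigned)(di >> 13);
}

/* imul $0x0490,%edi,%edi ; movzwl %di,%edi ; shr $13,%edi */
unsigned aei_imul32_movzx(uint32_t bcd)
{
    uint32_t edi = (bcd & 0x888u) * 0x0490u;
    edi &= 0xffffu;                  /* movzwl: drop the bits above bit 15 */
    return edi >> 13;
}

/* Same as above but without the movzwl: copies of a and e leak into the index. */
unsigned aei_imul32_unmasked(uint32_t bcd)
{
    return ((bcd & 0x888u) * 0x0490u) >> 13;
}

/* imul $0x49,%edi ; shr $8,%edi ; and $0xe,%edi  (yields 2 * aei) */
unsigned aei_times2_imul49(uint32_t bcd)
{
    return (((bcd & 0x888u) * 0x49u) >> 8) & 0xeu;
}
```

With all three high bits set (0x888), the first two functions return 7 while the unmasked one returns 0x137, which would index far outside the table. The imul $0x49 variant sidesteps the issue because its final and does the masking anyway.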


Align tab to at least a 32B boundary, so your 32bit imul and or can do a 4B load from any 2B-aligned entry in it without crossing a cache-line boundary. Unaligned accesses have no penalty on all recent CPUs (Nehalem and later, and recent AMD), as long as they don't span two cache lines.
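
The alignment argument is just address arithmetic, sketched below in C. The 64-byte line size is an assumption (it is the usual value on current x86 CPUs), the helper names are made up, and the offsets model a 4-byte load starting at any 2-byte-aligned entry of the 24-byte table.

```c
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE = 64 };   /* assumed cache-line size in bytes */

/* Nonzero if an access of `size` bytes starting at `addr` spans two lines. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Nonzero if some 4-byte load from a 2-byte-aligned entry of a 24-byte
 * table at `tab_addr` would be split across two cache lines. The load at
 * offset 22 reads 2 bytes past the end of the table. */
int table_load_can_split(uintptr_t tab_addr)
{
    for (unsigned off = 0; off <= 22; off += 2)
        if (crosses_cache_line(tab_addr + off, 4))
            return 1;
    return 0;
}
```

With 32-byte alignment the table and the 2 extra bytes always fit inside one half of a line, so no load splits. With only 4-byte alignment the table can land at, say, offset 44 within a line, and the load at table offset 18 then straddles the boundary.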

Making the operations that read from the table 32bit avoids the partial-register penalty that Intel CPUs have. AMD CPUs, and Silvermont, don't track partial-registers separately, so even instructions that write only the low16 have to wait for the result in the rest of the reg. This stops 16bit insns from breaking dependency chains. Intel P6 and SnB microarch families track partial regs. Haswell does full dual bookkeeping or something, because there's no penalty when merging is needed, like after you shift al, then shift eax. SnB will insert an extra uop there, and there may be a penalty of a cycle or two while it does this. I'm not sure, and haven't tested. However, I don't see a nice way to avoid this.

The shl %al could be replaced with an add %al, %al. That can run on more ports. Probably no difference, since port0/5 (or port0/6 on Haswell and later) probably aren't saturated. They have the same effect on the bits, but set flags differently. Otherwise they could be decoded to the same uop.


changes: split the pext/pdep / vectorize version into a separate answer, partly so it can have its own comment thread.

