
assembly - Increasing Efficiency of binary -> gray code for 8086

I am a beginner in assembly. This is code I wrote to convert a byte from binary to Gray code and print the resulting bit-pattern in hex.

      mov      al, a               ; load the input byte
      mov      bl, al              ; copy it
      shr      bl, 1               ; BL = a >> 1
      xor      al, bl              ; AL = a ^ (a >> 1) = Gray code

Although the program works, I want to learn about other, simpler methods to increase its efficiency. I have tried many other ways, but they affect the output.



1 Answer


(This answer was written based on the first version of the question, where there was some not-already-optimal hex-print code, and it was complete source for a .exe program. The updates to the question have removed the only parts that had room to be optimized except for ILP which is irrelevant on 8086, so I don't intend to remove those parts of the answer.)

Code size optimizations (which correlate with speed on 8086 and especially 8088, see this retrocomputing answer):

  • bin2Gray part: no change, unless you count reloading from mem or making a constant. Just re-ordering instructions for ILP on Pentium and later. Or maybe a table for xlat.
  • byte->hex digits: 21 bytes, down from 32 (and fewer bytes of code fetch)
  • exit: 1 byte (ret) down from 4 (mov ah / int). Works for at least .com executables, which are also smaller.

I probably should have counted code size in total bytes that need to get fetched (i.e. bytes of instructions executed), rather than static code size, although that's also useful to optimize for.

Dropping the looping from the 21-byte bin2hex part would cost a couple bytes of static code size, but reduce the dynamic byte count by about 5, IIRC. It also avoids any prefetch-buffer discard on a taken branch, except for the int 10h of course.


int 20h can also exit (without needing any AH setup), but only from a .com executable. The interesting part to me is computing the ASCII digits in the desired register with compact code, but if you want a tiny overall program, a .com is the way to go. That also avoids setup of DS. (Although you wouldn't need to set up DS if you make a an EQU or = constant.)
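For reference, the three exit sequences compared above look like this side by side (a sketch of alternatives, not a sequence to run in order; the byte counts are the encodings I'd expect, worth checking against your assembler's listing):

    ret                  ; 1 byte: in a .com, DOS pushed a zero word, and PSP:0 holds int 20h
    int   20h            ; 2 bytes: terminate, no AH setup (CS must point at the PSP, as in a .com)
    mov   ah, 4Ch        ; 2 bytes
    int   21h            ; + 2 bytes: standard DOS exit, return code in AL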

Not attempted: taking advantage of initial register values, which apparently are reliable across some DOS versions. If you're pretending like you're writing a block that might be useful as part of a larger program, that's not viable.


Your program basically has two separate parts: calculating a Gray code, and bin->hex for a 1-byte value in a register. Extracting the nibbles separately doesn't usefully optimize backwards into the Gray-code calculation, so I think we can just keep those fully separated.


There are multiple different ways to make a Gray code (only one bit flips between consecutive values). AFAIK, x ^ (x>>1) is the cheapest to compute from binary, but it's not impossible there could be something you could do with only 2 instructions, given input in a register.
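For example, applying x ^ (x >> 1) to the first eight values gives a sequence in which each step flips exactly one bit:

    binary:  000 001 010 011 100 101 110 111
    gray:    000 001 011 010 110 111 101 100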

Also related: Gray code algorithm (32 bit or less) for gray->binary points out that the standard x ^ (x>>1) is multiplication in GF(2^k). So on very recent CPUs with Galois-Field instructions, you can do 16 bytes at a time with gf2p8affineqb, I think. (gf2p8mulb uses a fixed polynomial which I think isn't the one we want for this.)


8088 performance is mostly about memory access (including code fetch)

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html shows instruction timings, but those timings are only execution, not code-fetch. 8088 has a 4-byte prefetch buffer (and only an 8-bit data bus), 6-byte on 8086 with its 16-bit bus. Supercat suggested in an answer there:

On the original 8088 processor, the easiest way to estimate execution speed is generally to ignore cycle counts and instead count memory accesses, including instruction fetches, and multiply by four.

I think it's about the same for 8086, except each access can be a whole 2-byte word. So straight-line code (no branches) can get fetched 2 bytes at a time.
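As a quick sanity check of that rule of thumb on the sequence below: mov/shr/xor is 6 bytes of code, so roughly 6 x 4 = 24 cycles of fetch on 8088 against about 7 cycles of execution, or 3 word fetches and roughly 12 cycles on 8086.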

For simplicity, I just annotated asm source with instruction size and cycle counts from the table without trying to model how the prefetch buffer would behave.

xlat (like al = ds:[bx+al]) is only 1 byte, and can be worth using if you don't mind having a 256-byte table. It takes 11 cycles to execute, but that includes the data access it makes. Not counting code-fetch, mov bl,al / shr al,1 / xor al,bl is 2+2+3 cycles, but 3 words of code size will cost 12 cycles to fetch. xlat takes almost that long, but when it finishes the prefetch buffer will have had some time to fetch later instructions, so I think it's even more of a win.

Still, it does require that table to come from somewhere, either from disk when your executable was loaded, or you have to pre-compute it. And you need to get a pointer into BX, so it may only be a win if you can do this in a loop.
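A minimal sketch of the xlat approach, in MASM/TASM-style syntax (the REPT block builds the 256-byte table at assembly time; exact directive spelling may vary by assembler):

    gray_table:
    i = 0
        REPT 256
        db   i XOR (i SHR 1)       ; Gray code of i
    i = i + 1
        ENDM

        ; ... with DS pointing at the table's segment:
        mov  bx, offset gray_table
        mov  al, a                 ; binary input byte
        xlat                       ; AL = [BX + AL] = Gray code of AL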

But if you're using a table, you can combine both parts of the problem and look up both ASCII hex-digit characters of the Gray code for a given binary byte, e.g. with mov dx, [bx + si] with the table pointer in SI and the binary byte in BX (doubled, since each entry is a word). (DX sets you up for outputting DL with a DOS call.) This of course would require your table to be 256 words (512 bytes). Having a tiny executable may be more valuable than saving a few cycles here; the actual I/O to the screen or a file is probably slow enough for it not to matter much. If you're doing this for multiple bytes, though, copying pairs of ASCII bytes into a buffer can be good.
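A sketch of that combined lookup (the register choices and the table name hex_gray_table are mine; the table would be generated at assembly time like the one above, just with dw entries whose low byte holds the high hex digit so it comes out first in DL):

    mov   si, offset hex_gray_table   ; 256 dw entries: ASCII digits of gray(i)
    mov   bl, al                      ; AL = binary input byte
    xor   bh, bh
    shl   bx, 1                       ; scale the index: each entry is 2 bytes
    mov   dx, [bx + si]               ; DL = high hex digit, DH = low hex digit
    mov   ah, 2                       ; DOS int 21h / AH=02h: print character in DL
    int   21h
    mov   dl, dh
    int   21h                         ; print the second digit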

There's one optimization that will help on more modern CPUs (starting with Pentium) that can run more than 1 instruction in parallel: copy the register, then shift the original so that can happen in the same cycle as the copy.

; optimized for Instruction-level Parallelism
;; input: AL    output: AL = bin_to_gray(AL)
;; clobbers: DL
   mov  dl, al         ; 2B  2 cycles (not counting code-fetch bottlenecks)
   shr  al, 1          ; 2B  2c
   xor  al, dl         ; 2B  3c

(For more about modern CPUs, see https://agner.org/optimize/. And also Can x86's MOV really be "free"? Why can't I reproduce this at all? - mov-elimination doesn't work on byte or word registers because that's merging into the low part of EDX. So even on CPUs with mov-elimination in general, it can't apply here so this optimization saves latency.)

I'm pretty sure there's no further room for improvement in bin -> gray. Even modern x86 doesn't have a copy-and-right-shift (except with the count in another register, BMI2 shrx, or for SIMD registers with AVX, but only for word/dword/qword element sizes). Also no right-shift-and-xor, so there's no avoiding a mov, and the shr and xor are obviously also necessary. XOR is add-without-carry, but I don't think that helps. Unless you had carryless multiply (pclmulqdq) and a multiplier constant to get two copies of the input at the right offsets from each other into the high half of a multiply result, you're going to need to do those operations separately. Or with Galois-Field New Instructions (GFNI): What are the AVX-512 Galois-field-related instructions for?

Still, if you want to exhaustively check, https://en.wikipedia.org/wiki/Superoptimization - ask a superoptimizer to look for sequences that produce the same AL result as the mov/shr/xor sequence.


In real use-cases, you'd normally want code that took data in a register, because that's how you should pass data to functions. After mov al, a, that's what your code is doing.

But if it was a global in memory, you could save a byte of code size by loading it twice instead of copying a register with mov, at the expense of speed. Or even better, make it an assemble-time constant. (Although if you do that, the next step is mov al, a ^ (a>>1) to do the calculation at assemble time.)

; a equ 0ACh         ; makes the following instructions 2 bytes each
;;; otherwise, with a being static storage, loading from memory twice sucks
    mov  al, a
    shr  al, 1    ; 2B, 2 cycles
    xor  al, a    ; reg,imm: 2B, 4 cycles on 8088.    reg,mem: 3B, 13+6 cycles

Byte -> 2 ASCII hex digits

This is the more interesting part.

Looping for only 2 iterations is sometimes not worth it, especially when you can save some work if you do separate things to each half. (e.g. low nibble is x & 0xf, high is x >> 4. Using rol/mov/and is not optimal.)

Tricks:

  • Prefer insn al, imm - x86 has short-form special cases for immediate operands with AL. (Also AX,imm16).

  • Wanting to do stuff in AL means it's more efficient to print with BIOS int 10h / AH=0Eh teletype output which takes its input in AL, and doesn't destroy any other registers. I think BIOS output will ignore DOS I/O redirection like foo > outfile.txt and always print to the screen.

  • There's an evil hack that abuses DAS to turn a 0..15 integer into an ASCII hex digit '0'..'9' or 'A'..'F' without branching (used in the sketch after this list). On 8086 (unlike modern x86) DAS is just as fast as a typical integer instruction. See this codegolf.SE answer for a breakdown of exactly why it works; it's highly non-obvious, but it avoids branching, so it's actually a big speedup on 8086.

  • BIOS / DOS calls generally don't modify AH, so setting it can be done outside the loop.

  • Then for code-size, instead of just unrolling, reuse the CL=4 shift count as a loop counter to loop back and re-run some of the earlier code once (not including the shift). sub cl, 2 / jnz would work, but the parity flag lets you use dec cx (1B) / jpe to jump backwards once, then fall through the next time.

  • DOS programs (or at least .com programs) are entered with a zero word on the stack and an int 20h instruction at offset 0 of the PSP, which is why the 1-byte ret mentioned earlier is enough to exit.
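Putting the nibble split, the DAS hack, and BIOS teletype output together, here is a minimal unrolled sketch (register choices are mine, not from the original program; it assumes DL is free as scratch and BH holds the active video page, 0):

    ; input: AL = byte to print as two hex digits
        mov   dl, al           ; save a copy
        mov   ah, 0Eh          ; BIOS teletype: prints AL; AH survives the int 10h calls
        mov   cl, 4
        shr   al, cl           ; high nibble first (8086 needs a shift count > 1 in CL)
        call  print_nibble
        mov   al, dl
        and   al, 0Fh          ; low nibble
        call  print_nibble
        ; ...

    print_nibble:              ; AL = 0..15 -> print its ASCII hex digit
        cmp   al, 10           ; CF = (AL < 10)
        sbb   al, 69h          ; the DAS hack from the linked codegolf answer
        das                    ; AL = '0'..'9' or 'A'..'F'
        int   10h              ; AH = 0Eh teletype output, BH = page
        ret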



...