Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
886 views
in Technique[技术] by (71.8m points)

algorithm of addressing a triangle matrices memory using assembly

I was doing a project in ASM about pascal triangle using NASM

so in the project you need to calculate pascal triangle from line 0 to line 63

my first problem is where to store the results of calculation -> memory

second problem what type of storage I use in memory, to understand what I mean I have 3 way first declare a full matrices so will be like this way

memoryLabl:  resd 63*64     ; 63 rows of 64 columns each

but the problem in this way that half of matrices is not used that make my program not efficient so let's go the second method is available

which is declare for every line a label for memory for example :

line0: dd   1
line1: dd  1,1
line2: dd 1,2,1      ; with pre-filled data for example purposes
...
line63: resd 64      ; reserve space for 64 dword entries

this way of doing it is like do it by hand,

some other from the class try to use macro as you can see here but i don't get it

so far so good

let's go to the last one that i have used which is like the first one but i use a triangle matrices , how is that, by declaring only the amount of memory that i need

so to store line 0 to line 63 line of pascal triangle, it's give me a triangle matrices because every new line I add a cell

I have allocate 2080 dword for the triangle matrices how is that ?? explain by 2080 dword:

                okey we have line0 have 1 dword /* 1 number in first line */
                             line1 have 2 dword /* 2 numbers in second line */
                             line2 have 3 dword /* 3 numbers in third line */
                             ...
                             line63 have 64 dword /* 64 numbers in final line*/
                  so in the end we have 2080   as the sum of them 

I have give every number 1 dword

okey now we have create the memory to store results let's start calculation

first# in pascal triangle you have all the cells in row 0 have value 1

I will do it in pseudo code so you understand how I put one in all cells of row 0:

s=0
for(i=0;i<64;i++):
     s = s+i
     mov dword[x+s*4],1   /* x is addresses of  triangle matrices */

second part in pascal triangle is to have the last row of each line equal to 1

I will use pseudo code to make it simple

s=0
for(i=2;i<64;i++):
     s = s+i
     mov dword[x+s*4],1

I start from i equal to 2 because i = 0 (i=1) is line0 (line1) and line0 (line1)is full because is hold only one (tow) value as I say in above explanation

so the tow pseudo code will make my rectangle look like in memory :

1
1  1
1      1
1          1
1             1
1                 1
1                     1
1                         1
1                              1
1                                 1
...
1                                        1

now come the hard part is the calculation using this value in triangle to fill all the triangle cells

let's start with the idea here

let's take cell[line][row]

we have cell[2][1] = cell[1][0]+cell[1][1]
and     cell[3][1]= cell[2][0]+cell[2][1]
        cell[3][2]= cell[2][1]+cell[2][2]

in **general** we have 
        cell[line][row]= cell[line-1][row-1]+cell[line-1][row]

my problem I could not break this relation using ASM instruction because i have a

triangle matrices which weird to work with can any one help me to break it using a relation or very basic pseudo code or asm code ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

TL:DR: you just need to traverse the array sequentially, so you don't have to work out the indexing. See the 2nd section.


To random access index into a (lower) triangular matrix, row r starts after a triangle of size r-1. A triangle of size n has n*(n+1)/2 total elements, using Gauss's formula for the sum of numbers from 1 to n-1. So a triangle of size r-1 has (r-1)*r/2 elements. Indexing a column within a row is of course trivial, once we know the address of the start of a row.

Each DWORD element is 4 bytes wide, and we can take care of that scaling as part of the multiply, because lea lets us shift and add as well as put the result in a different register. We simplify n*(n-1)/2 elements * 4 bytes / elem to n*(n-1) * 2 bytes.

The above reasoning works for 1-based indexing, where row 1 has 1 element. We have to adjust for that if we want zero-based indexing by adding 1 to row indices before the calculation, so we want the size of a triangle with r+1 - 1 rows, thus r*(r+1)/2 * 4 bytes. It helps to put the linear array index into a triangle to quickly double-check the formula

 0
 4   8
12  16  20
24  28  32  36
40  44  48  52  56
60  64  68  72  76  80
84  88  92  96 100  104  108

The 4th row, which we're calling "row 3", starts 24 bytes from the start of the whole array. That's (3+1)*(3+1-1) * 2 = (3+1)*3 * 2; yes the r*(r+1)/2 formula works.

;; given a row number in EDI, and column in ESI (zero-extended into RSI)
;; load triangle[row][col] into eax

lea    ecx, [2*rdi + 2]
imul   ecx, edi                        ; ecx = r*(r+1) * 2 bytes

mov    eax, [triangle + rcx + rsi*4]

This assuming 32-bit absolute addressing is ok (32-bit absolute addresses no longer allowed in x86-64 Linux?). If not, use a RIP-relative LEA to get the triangle base address in a register, and add that to rsi*4. x86 addressing modes can only have 3 components when one of them is a constant. But that is the case here for your static triangle, so we can take full advantage by using a scaled index for the column, and base as our calculated row offset, and the actual array address as the displacement.


Calculating the triangle

The trick here is that you only need to loop over it sequentially; you don't need random access to a given row/column.

You read one row while writing the one below. When you get to the end of a row, the next element is the start of the next row. The source and destination pointers will get farther and farther from each other as you go down the rows, because the destination is always 1 whole row ahead. And you know the length of a row = row number, so you can actually use the row counter as the offset.

global _start
_start:
    mov  esi, triangle         ; src = address of triangle[0,0]
    lea  rdi, [rsi+4]          ; dst = address of triangle[1,0]

    mov  dword [rsi], 1      ; triangle[0,0] = 1  special case: there is no source
.pascal_row:                   ; do {
    mov  rcx, rdi               ; RCX = one-past-end of src row = start of dst row
    xor  eax, eax               ; EAX = triangle[row-1][col-1] = 0 for first iteration
    ;; RSI points to start of src row: triangle[row-1][0]
    ;; RDI points to start of dst row: triangle[row  ][0]
  .column:
     mov   edx, [rsi]           ; tri[r-1, c]           ; will load 1 on the first iteration
     add   eax, edx             ; eax = tri[r-1, c-1] + tri[r-1, c]
     mov  [rdi], eax            ; store to triangle[row, col]

     add  rdi, 4                ; ++dst
     add  rsi, 4                ; ++src
     mov  eax, edx              ; becomes col-1 src value for next iteration

     cmp  rsi, rcx
     jb   .column              ; }while(src < end_src)

    ;; RSI points to one-past-end of src row, i.e. start of next row = src for next iteration
    ;; RDI points to last element of dst row  (because dst row is 1 element longer than src row)

    mov  dword [rdi], 1        ;  [r,r] = 1   end of a row
    add  rdi, 4                ;  this is where dst-src distance grows each iteration

    cmp  rdi, end_triangle
    jb  .pascal_row

       ;;; triangle is constructed.  Set a breakpoint here to look at it with a debugger

    xor   edi,edi
    mov   eax, 231
    syscall               ; Linux sys_exit_group(0), 64-bit ABI



section .bss

; you could just as well use  resd 64*65/2
; but put a label on each row for debugging convenience.

ALIGN 16
triangle:
%assign i 0
%rep    64
    row %+ i:  resd  i + 1
%assign i i+1
%endrep
end_triangle:

I tested this and it works: correct values in memory, and it stops at the right place. But note that integer overflow happens before you get down to the last row. This would be avoided if you used 64-bit integers (simple change to register names and offsets, and don't forget resd to resq). 64 choose 32 is 1832624140942590534 = 2^60.66.

The %rep block to reserve space and label each row as row0, row1, etc. is from my answer to the question you linked about macros, much more sane than the other answer IMO.

You tagged this NASM, so that's what I used because I'm familiar with it. The syntax you used in your question was MASM (until the last edit). The main logic is the same in MASM, but remember that you need OFFSET triangle to get the address as an immediate, instead of loading from it.

I used x86-64 because 32-bit is obsolete, but I avoided too many registers, so you can easily port this to 32-bit if needed. Don't forget to save/restore call-preserved registers if you put this in a function instead of a stand-alone program.

Unrolling the inner loop could save some instructions copying registers around, as well as the loop overhead. This is a somewhat optimized implementation, but I mostly limited it to optimizations that make the code simpler as well as smaller / faster. (Except maybe for using pointer increments instead of indexing.) It took a while to make it this clean and simple. :P

Different ways of doing the array indexing would be faster on different CPUs. e.g. perhaps use an indexed addressing mode (relative to dst) for the loads in the inner loop, so only one pointer increment is needed. But if you want it to run fast, SSE2 or AVX2 vpaddd could be good. Shuffling with palignr might be useful, but probably also unaligned loads instead of some of the shuffling, especially with AVX2 or AVX512.

But anyway, this is my version; I'm not trying to write it the way you would, you need to write your own for your assignment. I'm writing for future readers who might learn something about what's efficient on x86. (See also the performance section in the x86 tag wiki.)


How I wrote that:

I started writing the code from the top, but quickly realized that off-by-one errors were going to be tricky, and I didn't want to just write it the stupid way with branches inside the loops for special cases.

What ended up helping was writing the comments for the pre and post conditions on the pointers for the inner loop. That made it clear I needed to enter the loop with eax=0, instead of with eax=1 and storing eax as the first operation inside the loop, or something like that.

Obviously each source value only needs to be read once, so I didn't want to write an inner loop that reads [rsi] and [rsi+4] or something. Besides, that would have made it harder to get the boundary condition right (where a non-existant value has to read as 0).

It took some time to decide whether I was going to have an actual counter in a register for row length or row number, before I ended up just using an end-pointer for the whole triangle. It wasn't obvious before I finished that using pure pointer increments / compares was going to save so many instructions (and registers when the upper bound is a build-time constant like end_triangle), but it worked out nicely.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...