c - 64bit/32bit division faster algorithm for ARM / NEON?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken in 32 bits. Together, these two places account for more than 20% of my total run time, so removing the 64-bit division should give a worthwhile speedup. NEON has some 64-bit instructions. Can anyone suggest a faster routine to resolve this bottleneck?

Alternatively, expressing the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C would also be fine.

If anyone has an idea, could you please help me out?

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:22:55+0000

I have done a lot of fixed-point arithmetic in the past and spent a lot of time researching fast 64/32-bit divisions myself. If you search for 'ARM division' you will find plenty of good links and discussion about this issue.

The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:

http://www.peter-teichmann.de/adiv2e.html

This assembly code is very old, and your assembler may not understand its syntax. It is, however, worth porting to your toolchain. It is the fastest division code for your special case that I have seen so far, and trust me: I've benchmarked them all :-)

The last time I did that (about 5 years ago, for the Cortex-A8), this code was about 10 times faster than what the compiler generated.

This code doesn't use NEON. A NEON port would be interesting, though I am not sure it would improve performance much.

Edit:

I found the code, with the assembler ported to GAS (GNU toolchain). This code is working and tested:

Divide.S

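@ uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor)
@ Divides the 64 bit value hi:lo (r1:r0) by r2 and returns the quotient in r0.
@ The quotient must fit in 32 bits (hi < divisor, divisor <= 0x80000000).

.section ".text"

.global udiv64

udiv64:
    adds      r0,r0,r0        @ shift the dividend left by one bit;
    adc       r1,r1,r1        @ the top bit of r0 moves into r1

    .rept 31
        cmp     r1,r2         @ carry = (r1 >= r2), unsigned
        subcs   r1,r1,r2      @ subtract the divisor if it fits
        adcs    r0,r0,r0      @ shift the quotient bit in, next dividend bit out
        adc     r1,r1,r1      @ and move that bit into the running remainder
    .endr

    cmp     r1,r2             @ last of the 32 quotient bits
    subcs   r1,r1,r2
    adcs    r0,r0,r0

    bx      lr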
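
For anyone porting the routine, it may help to have a plain C model of the same shift-and-subtract loop to compare against. The following is only a sketch of how I read the assembly above; the name udiv64_model is made up for illustration, and the preconditions in the asserts (hi < divisor, divisor no larger than 0x80000000) are my assumption about what the routine needs for the quotient and the shifted remainder to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of udiv64: divides the 64 bit value hi:lo by d,
   producing one quotient bit per iteration (restoring division). */
uint32_t udiv64_model(uint32_t lo, uint32_t hi, uint32_t d)
{
    /* Assumed preconditions: quotient fits in 32 bits and the
       remainder can be shifted left without losing its top bit. */
    assert(hi < d);
    assert(d <= 0x80000000u);

    uint32_t rem = hi;   /* running remainder, plays the role of r1 */
    uint32_t q = 0;      /* quotient, built up bit by bit */

    for (int i = 0; i < 32; i++) {
        rem = (rem << 1) | (lo >> 31);  /* bring in next dividend bit */
        lo <<= 1;
        q <<= 1;
        if (rem >= d) {                 /* divisor fits: subtract it */
            rem -= d;
            q |= 1u;
        }
    }
    return q;
}
```

This will be much slower than the unrolled assembly; it is only meant as something to check a port against on a desktop machine.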

C-Code:

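#include <stdint.h>

#ifdef __cplusplus
extern "C"   /* only needed when this is compiled as C++ */
#endif
uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor);

int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with 64 bit intermediate result;
   b must not be 0 and the result must fit in 32 bits */
{
  int sign = (a^b) < 0; /* different signs */
  /* take absolute values in unsigned arithmetic to avoid signed overflow */
  uint32_t ua = a<0 ? 0u-(uint32_t)a : (uint32_t)a;
  uint32_t ub = b<0 ? 0u-(uint32_t)b : (uint32_t)b;
  uint32_t l = (ua << 24); /* low word of |a| << 24 */
  uint32_t h = (ua >> 8);  /* high word of |a| << 24 */
  uint32_t q = udiv64 (l,h,ub);
  if (sign) q = 0u-q;
  return (int32_t)q;
}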
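
To check a port of fixdiv24, one option is to compare it against the straightforward version that lets the compiler do the 64-bit division. This is a minimal sketch, not part of the original code: fixdiv24_ref is a hypothetical name, and it assumes b is non-zero and that the result fits in 32 bits. It is the slow path the question wants to avoid, so it is only useful as a reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: (a << 24) / b using the compiler's own
   64 bit division. Slow on ARM cores without hardware divide, but
   handy for testing a faster routine. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    assert(b != 0);  /* assumed precondition */

    /* Multiply instead of shifting so that negative a is well defined. */
    int64_t num = (int64_t)a * ((int64_t)1 << 24);

    /* C division truncates toward zero, which matches dividing the
       absolute values and restoring the sign afterwards. */
    return (int32_t)(num / b);
}
```

For inputs that satisfy the preconditions, fixdiv24 and fixdiv24_ref should return the same value, so a loop over random operands is a cheap way to test the assembly on the target.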
