floating point - correctly-rounded double-precision division

Question

Welcome To Ask or Share your Answers For Others

floating point - correctly-rounded double-precision division

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

floating point - correctly-rounded double-precision division

I am using the following algorithm for double-precision division and trying to make it correctly rounded in software emulation of floating-point. Let a be the dividend and b is the divisor.

All operations are performed in Q2.62.

Initial approximation to the reciprocal is .

b/2 is the significand of b with its implicit bit added, and shifted one right. For what follows, when written a or b it is meant by the significand of a or b with its implicit bit added.

The is approximated with 0x17504f333f9de6 (0x5D413CCCFE779800 in Q2.62).

After, the reciprocal is approximated with Newton-Raphson iterations:

There are 6 such iterations for the reciprocal, r. The quotient, q, is computed by multiplying r by (the significand of) a.

Additional adjustment step for the quotient:

The final rounding is:

if a <= (a - q * b/2):
      result = final_biased_exponent | q
else
      result = final_biased_exponent | adjusted_q

This works correctly except for two cases: a) the result is subnormal or b) both a and b are subnormals. In those cases, it is not correctly-rounded and the result is off by 1 bit (comparing to x86 results). (The numbers a and b are normalized and the exponent is scaled accordingly when either of a or b is normalized.)

What can I do to correctly round it for those cases too?

What it seems to me is that the precision is lost at step x5. Since at step x4 the reciprocal is approximated with ~54 bits and it fits 64-bits variable. While in step x5, the reciprocal is approximated by ~108 bits. So at step x5 the full width of the reciprocal does not fit 64 bits. I get the feeling that I do not take this into account properly when I truncate the 128 bits after multiplication to 64 bits.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:39:33+0000

To examine the rounding issue (in round-to-nearest-or-even mode only), I built the emulation code for IEEE-754 binary32 division from scratch for ease of exposition. Once that was working, I mechanically transformed the code into emulation code for IEEE-754 binary64 division. The ISO-C99 code for both, including my test frame work, is shown below. The approach differs slightly from asker's algorithm in that it performs intermediate computations in Q1.63 arithmetic for maximum accuracy and uses a table of either 8-bit or 16-bit entries for the starting approximation of the reciprocal.

The rounding step basically subtracts the product of the raw quotient and the divisor from the dividend to form a remainder rem_raw. It also forms the remainder rem_inc that would result from incrementing the quotient by 1 ulp. By construction, we know that the raw quotient is sufficiently accurate that either it or its incremented value is the correctly rounded result. The remainders can be both positive, both negative, or mixed negative / positive. The remainder smaller in magnitude corresponds to the correctly rounded quotient.

The only difference that exists between rounding normals and subnormals (other than the denormalization step inherent in the latter) is that tie cases cannot occur for normal results, while they can occur for subnormal results. See, for example,

Milo? D. Ercegovac and Tomás Lang, "Digital Arithmetic", Morgan Kaufman, 2004, p. 452

When computing in fixed-point arithmetic, the product of raw quotient and divisor is a double-length product. To compute the remainder precisely without any bits lost, we therefore change the fixed-point representation on the fly to provide additional fraction bits. For that the dividend is left shifted by the appropriate number of bits. But because we know from construction of the algorithm that the preliminary quotient is very close to the true result, we know that during subtraction from the dividend all the high-order bits will cancel. So we only need to compute and subtract the low-order product bits to compute the two remainders.

Because the division of two values each in [1,2) results in a quotient in (0.5, 2), the computation of the quotient may involve a normalization step to get back into the interval [1,2), accompanied by an exponent correction. We need to account for this when lining up the dividend and the product of quotient and divisor for subtraction, see normalization_shift in the code below.

Since the code below is of an exploratory nature, it was not written with extreme optimization in mind. Various tweaks are possible, as are replacements of portable code with platform-specific intrinsics or inline assembly. Likewise, the basic test framework below could be strengthened by incorporating various techniques for generating hard-to-round cases from the literature. For example, I have used division test vectors accompanying the following paper in the past:

Brigitte Verdonk, Annie Cuyt, and Dennis Verschaeren. "A precision- and range-independent tool for testing floating-point arithmetic I: basic operations, square root, and remainder." ACM Transactions on Mathematical Software, Vol. 27, No. 1, March 2001, pp. 92-118.

The pattern-based test vectors of my test framework were motivated by the following publication:

N. L. Schryer, "A Test of a Computer's Floating-Point Unit." Computer Science Technical Report no. 89, AT&T Bell Laboratories, Murray Hill, N.J. (1981).

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>

#define TEST_FP32_DIV     (0) /* 0: binary64 division; 1: binary32 division */
#define PURELY_RANDOM     (1)
#define PATTERN_BASED     (2)
#define TEST_MODE         (PATTERN_BASED)
#define ITO_TAKAGI_YAJIMA (1) /* more accurate recip. starting approximation */

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t umul32hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a*b) >> 32);
}

int clz32 (uint32_t a)
{
    uint32_t r = 32;
    if (a >= 0x00010000) { a >>= 16; r -= 16; }
    if (a >= 0x00000100) { a >>=  8; r -=  8; }
    if (a >= 0x00000010) { a >>=  4; r -=  4; }
    if (a >= 0x00000004) { a >>=  2; r -=  2; }
    r -= a - (a & (a >> 1));
    return r;
}

#if ITO_TAKAGI_YAJIMA
/* Masayuki Ito, Naofumi Takagi, Shuzo Yajima, "Efficient Initial Approximation
   for Multiplicative Division and Square Root by a Multiplication with Operand
   Modification". IEEE Transactions on Computers, Vol. 46, No. 4, April 1997,
   pp. 495-498.
*/
#define LOG2_TAB_ENTRIES (6)
#define TAB_ENTRIES      (1 << LOG2_TAB_ENTRIES)
#define TAB_ENTRY_BITS   (16)
/* approx. err. ~= 5.146e-5 */
const uint16_t b1tab [64] =
{
    0xfc0f, 0xf46b, 0xed20, 0xe626, 0xdf7a, 0xd918, 0xd2fa, 0xcd1e,
    0xc77f, 0xc21b, 0xbcee, 0xb7f5, 0xb32e, 0xae96, 0xaa2a, 0xa5e9,
    0xa1d1, 0x9dde, 0x9a11, 0x9665, 0x92dc, 0x8f71, 0x8c25, 0x88f6,
    0x85e2, 0x82e8, 0x8008, 0x7d3f, 0x7a8e, 0x77f2, 0x756c, 0x72f9,
    0x709b, 0x6e4e, 0x6c14, 0x69eb, 0x67d2, 0x65c8, 0x63cf, 0x61e3,
    0x6006, 0x5e36, 0x5c73, 0x5abd, 0x5913, 0x5774, 0x55e1, 0x5458,
    0x52da, 0x5166, 0x4ffc, 0x4e9b, 0x4d43, 0x4bf3, 0x4aad, 0x496e,
    0x4837, 0x4708, 0x45e0, 0x44c0, 0x43a6, 0x4293, 0x4187, 0x4081
};
#else // ITO_TAKAGI_YAJIMA
#define LOG2_TAB_ENTRIES (7)
#define TAB_ENTRIES      (1 << LOG2_TAB_ENTRIES)
#define TAB_ENTRY_BITS   (8)
/* approx. err. ~= 5.585e-3 */
const uint8_t rcp_tab [TAB_ENTRIES] =
{
    0xff, 0xfd, 0xfb, 0xf9, 0xf7, 0xf5, 0xf4, 0xf2,
    0xf0, 0xee, 0xed, 0xeb, 0xe9, 0xe8, 0xe6, 0xe4,
    0xe3, 0xe1, 0xe0, 0xde, 0xdd, 0xdb, 0xda, 0xd8,
    0xd7, 0xd5, 0xd4, 0xd3, 0xd1, 0xd0, 0xcf, 0xcd,
    0xcc, 0xcb, 0xca, 0xc8, 0xc7, 0xc6, 0xc5, 0xc4,
    0xc2, 0xc1, 0xc0, 0xbf, 0xbe, 0xbd, 0xbc, 0xbb,
    0xba, 0xb9, 0xb8, 0xb7, 0xb6, 0xb5, 0xb4, 0xb3,
    0xb2, 0xb1, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab,
    0xaa, 0xa9, 0xa8, 0xa8, 0xa7, 0xa6, 0xa5, 0xa4,
    0xa3, 0xa3, 0xa2, 0xa1, 0xa0, 0x9f, 0x9f, 0x9e,
    0x9d, 0x9c, 0x9c, 0x9b, 0x9a, 0x99, 0x99, 0x98,
    0x97, 0x97, 0x96, 0x95, 0x95, 0x94, 0x93, 0x93,
    0x92, 0x91, 0x91, 0x90, 0x8f, 0x8f, 0x8e, 0x8e,
    0x8d, 0x8c, 0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x89,
    0x88, 0x88, 0x87, 0x87, 0x86, 0x85, 0x85, 0x84,
    0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80
};
#endif // ITO_TAKAGI_YAJIMA

#define FP32_MANT_BITS       (23)
#define FP32_EXPO_BITS       (8)
#define FP32_SIGN_MASK       (0x80000000u)
#define FP32_MANT_MASK       (0x007fffffu)
#define FP32_EXPO_MASK       (0x7f800000u)
#define FP32_MAX_EXPO        (0xff)
#define FP32_EXPO_BIAS       (127)
#define FP32_INFTY           (0x7f800000u)
#define FP32_QNAN_BIT        (0x00400000u)
#define FP32_QNAN_INDEFINITE (0xffc00000u)
#define FP32_MANT_INT_BIT    (0x00800000u)

uint32_t fp32_div_core (uint32_t x, uint32_t y)
{
    uint32_t expo_x, expo_y, expo_r, sign_r;
    uint32_t abs_x, abs_y, f, l, p, r, z, s;
    int x_is_zero, y_is_zero, normalization_shift;

    expo_x = (x & FP32_EXPO_MASK) >> FP32_MANT_BITS;
    expo_y = (y & FP32_EXPO_MASK) >> FP32_MANT_BITS;
    sign_r = (x ^ y) & FP32_SIGN_MASK;

    abs_x = x & ~FP32_SIGN_MASK;
    abs_y = y & ~FP32_SIGN_MASK;
    x_is_zero = (abs_x == 0);
    y_is_zero = (abs_y == 0);

    if ((expo_x == FP32_MAX_EXPO) || (expo_y == FP32_MAX_EXPO) || 
        x_is_zero || y_is_zero) {
        int x_is_nan = (abs_x >  FP32_INFTY);
        int x_is_inf = (abs_x == FP32_INFTY);
        int y_is_nan = (abs_y >  FP32_INFTY);
        int y_is_inf = (abs_y == FP32_INFTY);
        if (x_is_nan) {
            r = x | FP32_QNAN_BIT;
        } else if (y_is_nan) {
            r = y | FP32_QNAN_BIT;
        } else if ((x_is_zero && y_is_zero) || (x_is_inf && y_is_inf)) {
            r = FP32_QNAN_INDEFINITE;
        } else if (x_is_zero || y_is_inf) {
            r = sign_r;
        } else if (y_is_zero || x_is_inf) {
            r = sign_r | FP32_INFTY;
        }
    } else {
        /* normalize any subnormals */
        if (expo_x == 0) {
            s = clz32 (abs_x) - FP32_EXPO_BITS;
            x = x << s;
            expo_x = expo_x - (s - 1);
        }
        if (expo_y == 0) {
            s = clz32 (abs_y) - FP32_EXPO_BITS;
            y = y << s;
            expo_y = expo_y - (s - 1);
        }
        //
        expo_r = expo_x - expo_y + FP32_EXPO_BIAS;
        /* extract mantissas */
        x = x & FP32_MANT_MASK;
        y = y & FP32_MANT_MASK;
#if ITO_TAKAGI_YAJIMA
        /* initial approx based on 6 most significant stored mantissa bits */
        r = b1tab [y >> (FP32_MANT_BITS - LOG2_TAB_ENTRIES)];
        /* make implicit integer bit of mantissa explicit */
        x = x | FP32_MANT_INT_BIT;
        y = y | FP32_MANT_INT_BIT;
        r = r * ((y ^ ((1u << (FP32_MANT_BITS - LOG2_TAB_ENTRIES)) - 1)) >> 
                 (FP32_MANT_BITS + 1 + TAB_ENTRY_BITS - 32));
        /* pre-scale y for more efficient fixed-point computation */
        z = y << FP32_EXPO_BITS;
#else // ITO_TAKAGI_YAJIMA
        /* initial approx based on 7 most significant stored mantissa bits */
        r = rcp_tab [y >> (FP32_MANT_BITS - LOG2_TAB_ENTRIES)];
        /* make implicit integer bit of mantissa explicit */
        x = x | FP32_MANT_INT_BIT;
        y = y | FP32_MANT_INT_BIT;
        /* pre-scale y for more efficient fixed-point computation */
        z = y << FP32_EXPO_BITS;
        /* first NR iteration r1 = 2*r0-y*r0*r0 */
        f = r * r;
        p = umul32hi (z, f << (32 - 2 * TAB_ENTRY_BITS));
        r = (r << (32 - TAB_ENTRY_BITS)) - p;
#endif // ITO_TAKAGI_YAJIMA
        /* second NR iteration: r2 = r1*(2-y*r1) */
        p = umul32hi (z, r << 1);
        f = 0u - p;
        r = umul32hi (f, r << 1);
        /* compute quotient as wide product x*(1/y) = x*r */
        l = x * (r << 1);
        r = umul32hi (x, r << 1);
        /* normalize mantissa to be in [1,2) */
        normalization_shift = (r & FP32_MANT_INT_BIT) == 0;
        expo_r -= normalization_shift;
        r = (r << normalization_shift) | ((l >> 1) >> (32 - 1 - normalization_shift));
        if ((expo_r > 0) && (expo_r < FP32_MAX_EXPO)) { /* result is normal */
            int32_t rem_raw, rem_inc, inc;
            /* align x with product y*quotient */
            x = x << (FP32_MANT_BITS + normalization_shift + 1);
            /* compute product y*quotient */
            y = y << 1;
            p = y * r;
            /* compute x - y*quotient, for both raw and incremented quotient */
            rem_raw = x - p;
            rem_inc = rem_raw - y;
            /* round to nearest: tie case _cannot_ occur */
            inc = abs (rem_inc) < abs (rem_raw);
            /* build final results from sign, exponent, mantissa */
            r = sign_r | (((expo_r - 1) << FP32_MANT_BITS) + r + inc);
        } else if ((int)expo_r >= FP32_MAX_EXPO) { /* result overflowed */
            r = sign_r | FP32_INFTY;
        } else { /* result underflowed */
            int denorm_shift = 1 - expo_r;
            if (denorm_shift > (FP32_MANT_BITS + 1)) { /* massive underflow */
                r = sign_r;
            } else {

Categories

floating point - correctly-rounded double-precision division

floating point - correctly-rounded double-precision division

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags