To examine the rounding issue (in round-to-nearest-or-even mode only), I built the emulation code for IEEE-754 binary32
division from scratch for ease of exposition. Once that was working, I mechanically transformed the code into emulation code for IEEE-754 binary64
division. The ISO-C99 code for both, including my test frame work, is shown below. The approach differs slightly from asker's algorithm in that it performs intermediate computations in Q1.63 arithmetic for maximum accuracy and uses a table of either 8-bit or 16-bit entries for the starting approximation of the reciprocal.
The rounding step basically subtracts the product of the raw quotient and the divisor from the dividend to form a remainder rem_raw
. It also forms the remainder rem_inc
that would result from incrementing the quotient by 1 ulp. By construction, we know that the raw quotient is sufficiently accurate that either it or its incremented value is the correctly rounded result. The remainders can be both positive, both negative, or mixed negative / positive. The remainder smaller in magnitude corresponds to the correctly rounded quotient.
The only difference that exists between rounding normals and subnormals (other than the denormalization step inherent in the latter) is that tie cases cannot occur for normal results, while they can occur for subnormal results. See, for example,
Milo? D. Ercegovac and Tomás Lang, "Digital Arithmetic", Morgan Kaufman, 2004, p. 452
When computing in fixed-point arithmetic, the product of raw quotient and divisor is a double-length product. To compute the remainder precisely without any bits lost, we therefore change the fixed-point representation on the fly to provide additional fraction bits. For that the dividend is left shifted by the appropriate number of bits. But because we know from construction of the algorithm that the preliminary quotient is very close to the true result, we know that during subtraction from the dividend all the high-order bits will cancel. So we only need to compute and subtract the low-order product bits to compute the two remainders.
Because the division of two values each in [1,2) results in a quotient in (0.5, 2), the computation of the quotient may involve a normalization step to get back into the interval [1,2), accompanied by an exponent correction. We need to account for this when lining up the dividend and the product of quotient and divisor for subtraction, see normalization_shift
in the code below.
Since the code below is of an exploratory nature, it was not written with extreme optimization in mind. Various tweaks are possible, as are replacements of portable code with platform-specific intrinsics or inline assembly. Likewise, the basic test framework below could be strengthened by incorporating various techniques for generating hard-to-round cases from the literature. For example, I have used division test vectors accompanying the following paper in the past:
Brigitte Verdonk, Annie Cuyt, and Dennis Verschaeren. "A precision- and range-independent tool for testing floating-point arithmetic I: basic operations, square root, and remainder." ACM Transactions on Mathematical Software, Vol. 27, No. 1, March 2001, pp. 92-118.
The pattern-based test vectors of my test framework were motivated by the following publication:
N. L. Schryer, "A Test of a Computer's Floating-Point Unit." Computer Science Technical Report no. 89, AT&T Bell Laboratories, Murray Hill, N.J. (1981).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>
#define TEST_FP32_DIV (0) /* 0: binary64 division; 1: binary32 division */
#define PURELY_RANDOM (1)
#define PATTERN_BASED (2)
#define TEST_MODE (PATTERN_BASED)
#define ITO_TAKAGI_YAJIMA (1) /* more accurate recip. starting approximation */
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t umul32hi (uint32_t a, uint32_t b)
{
return (uint32_t)(((uint64_t)a*b) >> 32);
}
int clz32 (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
#if ITO_TAKAGI_YAJIMA
/* Masayuki Ito, Naofumi Takagi, Shuzo Yajima, "Efficient Initial Approximation
for Multiplicative Division and Square Root by a Multiplication with Operand
Modification". IEEE Transactions on Computers, Vol. 46, No. 4, April 1997,
pp. 495-498.
*/
#define LOG2_TAB_ENTRIES (6)
#define TAB_ENTRIES (1 << LOG2_TAB_ENTRIES)
#define TAB_ENTRY_BITS (16)
/* approx. err. ~= 5.146e-5 */
const uint16_t b1tab [64] =
{
0xfc0f, 0xf46b, 0xed20, 0xe626, 0xdf7a, 0xd918, 0xd2fa, 0xcd1e,
0xc77f, 0xc21b, 0xbcee, 0xb7f5, 0xb32e, 0xae96, 0xaa2a, 0xa5e9,
0xa1d1, 0x9dde, 0x9a11, 0x9665, 0x92dc, 0x8f71, 0x8c25, 0x88f6,
0x85e2, 0x82e8, 0x8008, 0x7d3f, 0x7a8e, 0x77f2, 0x756c, 0x72f9,
0x709b, 0x6e4e, 0x6c14, 0x69eb, 0x67d2, 0x65c8, 0x63cf, 0x61e3,
0x6006, 0x5e36, 0x5c73, 0x5abd, 0x5913, 0x5774, 0x55e1, 0x5458,
0x52da, 0x5166, 0x4ffc, 0x4e9b, 0x4d43, 0x4bf3, 0x4aad, 0x496e,
0x4837, 0x4708, 0x45e0, 0x44c0, 0x43a6, 0x4293, 0x4187, 0x4081
};
#else // ITO_TAKAGI_YAJIMA
#define LOG2_TAB_ENTRIES (7)
#define TAB_ENTRIES (1 << LOG2_TAB_ENTRIES)
#define TAB_ENTRY_BITS (8)
/* approx. err. ~= 5.585e-3 */
const uint8_t rcp_tab [TAB_ENTRIES] =
{
0xff, 0xfd, 0xfb, 0xf9, 0xf7, 0xf5, 0xf4, 0xf2,
0xf0, 0xee, 0xed, 0xeb, 0xe9, 0xe8, 0xe6, 0xe4,
0xe3, 0xe1, 0xe0, 0xde, 0xdd, 0xdb, 0xda, 0xd8,
0xd7, 0xd5, 0xd4, 0xd3, 0xd1, 0xd0, 0xcf, 0xcd,
0xcc, 0xcb, 0xca, 0xc8, 0xc7, 0xc6, 0xc5, 0xc4,
0xc2, 0xc1, 0xc0, 0xbf, 0xbe, 0xbd, 0xbc, 0xbb,
0xba, 0xb9, 0xb8, 0xb7, 0xb6, 0xb5, 0xb4, 0xb3,
0xb2, 0xb1, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab,
0xaa, 0xa9, 0xa8, 0xa8, 0xa7, 0xa6, 0xa5, 0xa4,
0xa3, 0xa3, 0xa2, 0xa1, 0xa0, 0x9f, 0x9f, 0x9e,
0x9d, 0x9c, 0x9c, 0x9b, 0x9a, 0x99, 0x99, 0x98,
0x97, 0x97, 0x96, 0x95, 0x95, 0x94, 0x93, 0x93,
0x92, 0x91, 0x91, 0x90, 0x8f, 0x8f, 0x8e, 0x8e,
0x8d, 0x8c, 0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x89,
0x88, 0x88, 0x87, 0x87, 0x86, 0x85, 0x85, 0x84,
0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80
};
#endif // ITO_TAKAGI_YAJIMA
#define FP32_MANT_BITS (23)
#define FP32_EXPO_BITS (8)
#define FP32_SIGN_MASK (0x80000000u)
#define FP32_MANT_MASK (0x007fffffu)
#define FP32_EXPO_MASK (0x7f800000u)
#define FP32_MAX_EXPO (0xff)
#define FP32_EXPO_BIAS (127)
#define FP32_INFTY (0x7f800000u)
#define FP32_QNAN_BIT (0x00400000u)
#define FP32_QNAN_INDEFINITE (0xffc00000u)
#define FP32_MANT_INT_BIT (0x00800000u)
uint32_t fp32_div_core (uint32_t x, uint32_t y)
{
uint32_t expo_x, expo_y, expo_r, sign_r;
uint32_t abs_x, abs_y, f, l, p, r, z, s;
int x_is_zero, y_is_zero, normalization_shift;
expo_x = (x & FP32_EXPO_MASK) >> FP32_MANT_BITS;
expo_y = (y & FP32_EXPO_MASK) >> FP32_MANT_BITS;
sign_r = (x ^ y) & FP32_SIGN_MASK;
abs_x = x & ~FP32_SIGN_MASK;
abs_y = y & ~FP32_SIGN_MASK;
x_is_zero = (abs_x == 0);
y_is_zero = (abs_y == 0);
if ((expo_x == FP32_MAX_EXPO) || (expo_y == FP32_MAX_EXPO) ||
x_is_zero || y_is_zero) {
int x_is_nan = (abs_x > FP32_INFTY);
int x_is_inf = (abs_x == FP32_INFTY);
int y_is_nan = (abs_y > FP32_INFTY);
int y_is_inf = (abs_y == FP32_INFTY);
if (x_is_nan) {
r = x | FP32_QNAN_BIT;
} else if (y_is_nan) {
r = y | FP32_QNAN_BIT;
} else if ((x_is_zero && y_is_zero) || (x_is_inf && y_is_inf)) {
r = FP32_QNAN_INDEFINITE;
} else if (x_is_zero || y_is_inf) {
r = sign_r;
} else if (y_is_zero || x_is_inf) {
r = sign_r | FP32_INFTY;
}
} else {
/* normalize any subnormals */
if (expo_x == 0) {
s = clz32 (abs_x) - FP32_EXPO_BITS;
x = x << s;
expo_x = expo_x - (s - 1);
}
if (expo_y == 0) {
s = clz32 (abs_y) - FP32_EXPO_BITS;
y = y << s;
expo_y = expo_y - (s - 1);
}
//
expo_r = expo_x - expo_y + FP32_EXPO_BIAS;
/* extract mantissas */
x = x & FP32_MANT_MASK;
y = y & FP32_MANT_MASK;
#if ITO_TAKAGI_YAJIMA
/* initial approx based on 6 most significant stored mantissa bits */
r = b1tab [y >> (FP32_MANT_BITS - LOG2_TAB_ENTRIES)];
/* make implicit integer bit of mantissa explicit */
x = x | FP32_MANT_INT_BIT;
y = y | FP32_MANT_INT_BIT;
r = r * ((y ^ ((1u << (FP32_MANT_BITS - LOG2_TAB_ENTRIES)) - 1)) >>
(FP32_MANT_BITS + 1 + TAB_ENTRY_BITS - 32));
/* pre-scale y for more efficient fixed-point computation */
z = y << FP32_EXPO_BITS;
#else // ITO_TAKAGI_YAJIMA
/* initial approx based on 7 most significant stored mantissa bits */
r = rcp_tab [y >> (FP32_MANT_BITS - LOG2_TAB_ENTRIES)];
/* make implicit integer bit of mantissa explicit */
x = x | FP32_MANT_INT_BIT;
y = y | FP32_MANT_INT_BIT;
/* pre-scale y for more efficient fixed-point computation */
z = y << FP32_EXPO_BITS;
/* first NR iteration r1 = 2*r0-y*r0*r0 */
f = r * r;
p = umul32hi (z, f << (32 - 2 * TAB_ENTRY_BITS));
r = (r << (32 - TAB_ENTRY_BITS)) - p;
#endif // ITO_TAKAGI_YAJIMA
/* second NR iteration: r2 = r1*(2-y*r1) */
p = umul32hi (z, r << 1);
f = 0u - p;
r = umul32hi (f, r << 1);
/* compute quotient as wide product x*(1/y) = x*r */
l = x * (r << 1);
r = umul32hi (x, r << 1);
/* normalize mantissa to be in [1,2) */
normalization_shift = (r & FP32_MANT_INT_BIT) == 0;
expo_r -= normalization_shift;
r = (r << normalization_shift) | ((l >> 1) >> (32 - 1 - normalization_shift));
if ((expo_r > 0) && (expo_r < FP32_MAX_EXPO)) { /* result is normal */
int32_t rem_raw, rem_inc, inc;
/* align x with product y*quotient */
x = x << (FP32_MANT_BITS + normalization_shift + 1);
/* compute product y*quotient */
y = y << 1;
p = y * r;
/* compute x - y*quotient, for both raw and incremented quotient */
rem_raw = x - p;
rem_inc = rem_raw - y;
/* round to nearest: tie case _cannot_ occur */
inc = abs (rem_inc) < abs (rem_raw);
/* build final results from sign, exponent, mantissa */
r = sign_r | (((expo_r - 1) << FP32_MANT_BITS) + r + inc);
} else if ((int)expo_r >= FP32_MAX_EXPO) { /* result overflowed */
r = sign_r | FP32_INFTY;
} else { /* result underflowed */
int denorm_shift = 1 - expo_r;
if (denorm_shift > (FP32_MANT_BITS + 1)) { /* massive underflow */
r = sign_r;
} else {