In the following, math shown in code format
is computed with IEEE 754 in round-to-nearest mode, and math not in code format is exact.
Let p be the number of bits in the significand.
Let f be the factor 2n-1 for a positive integer n and be exactly representable (n ≤ p).
Let U(x) be the ULP of x. For normal values, U(x) ≤ 21-px.
Let t be f*x
. If f*x
is subnormal, then it is exactly fx. If it is normal, then t = fx+e for some |e| ≤ ½U(fx) ≤ 2-px. Note that if |e| is exactly half an ULP, then it must equal the lowest bit of x that is set (since otherwise e would have more than one bit set and could not be half of an ULP).
Substituting for f, t = (2n-1)x+e.
t+x = (2n-1)x+e+x = 2nx+e.
Consider t+x
. By IEEE-754 requirements of round-to-nearest, this must be within ½ an ULP of t+x, which we know to be 2nx+e. Clearly 2nx is representable (barring overflow), and |e| ≤ ½U(fx) ≤ ½U(2nx). Therefore t+x
must be 2nx unless |e| is exactly half an ULP and the low bit of x’s significand is odd (since an even low bit wins the tie and gives 2nx).
If n is 1, then f is 1, and e is 0. If 2 ≤ n, then |e| ≤ 1/4 U(2nx) < ½U(2nx). So a case where |e| is half an ULP and x’s low bit is odd does not occur.
Therefore t+x
must be 2nx. (Overflow and NaN left as an exercise for the reader.)
Additionally, I tested exhaustively for IEEE-754 32-bit binary floating-point.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…