c++ - A fast method to round a double to a 32-bit int explained

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

When reading Lua’s source code, I noticed that Lua uses a macro to round double values to 32-bit int values. The macro is defined in the llimits.h header file and reads as follows:

union i_cast {double d; int i[2];};
#define double2int(i, d, t) \
    {volatile union i_cast u; u.d = (d) + 6755399441055744.0; \
    (i) = (t)u.i[ENDIANLOC];}

Here ENDIANLOC is defined according to endianness: 0 for little-endian, 1 for big-endian architectures; Lua carefully handles endianness. The t argument is substituted with an integer type like int or unsigned int.

I did a little research and found that there is a simpler format of that macro which uses the same technique:

#define double2int(i, d) \
    {double t = ((d) + 6755399441055744.0); i = *((int *)(&t));}

Or, in a C++-style:

inline int double2int(double d)
{
    d += 6755399441055744.0;
    return reinterpret_cast<int&>(d);
}

This trick can work on any machine using IEEE 754 (which means pretty much every machine today). It works for both positive and negative numbers, and the rounding follows Banker’s Rule. (This is not surprising, since it follows IEEE 754.)
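Reading a double through an int reference, as the C++-style version does, is formally undefined behavior under the strict aliasing rule. A minimal sketch of the same trick using std::memcpy might look like the following. The name double2int_sketch is made up for illustration, and the code assumes IEEE 754 doubles, a little-endian host (so the low 32 bits of the mantissa come first in memory), and the default round-to-nearest-even mode.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not from Lua. Assumes IEEE 754 doubles, a
// little-endian host, and the default round-to-nearest-even mode.
inline std::int32_t double2int_sketch(double d)
{
    d += 6755399441055744.0;            // magic number: 2^52 + 2^51
    std::int32_t low;
    std::memcpy(&low, &d, sizeof low);  // first 4 bytes = low 32 mantissa bits
    return low;
}
```

Under those assumptions, double2int_sketch(1.5) and double2int_sketch(2.5) both give 2, which is the round-half-to-even behavior mentioned above.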

I wrote a little program to test it:

int main()
{
    double d = -12345678.9;
    int i;
    double2int(i, d)
    printf("%d\n", i);
    return 0;
}

And it outputs -12345679, as expected.

I would like to understand how this tricky macro works in detail. The magic number 6755399441055744.0 is actually 2⁵¹ + 2⁵², or 1.5 × 2⁵², and 1.5 in binary can be represented as 1.1. When any 32-bit integer is added to this magic number...

Well, I’m lost from here. How does this trick work?

Update

As @Mysticial points out, this method does not limit itself to a 32-bit int, it can also be expanded to a 64-bit int as long as the number is in the range of 2⁵². (Although the macro needs some modification.)
Some materials say this method cannot be used in Direct3D.
When working with Microsoft assembler for x86, there is an even faster macro written in assembly code (the following is also extracted from Lua source):
```
 #define double2int(i,n)  __asm {__asm fld n   __asm fistp i}
```
There is a similar magic number for single precision numbers: 1.5 × 2²³.
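For the single-precision variant, a hedged sketch could look like this. The name float2int_sketch is hypothetical. Because a float has only 23 mantissa bits, this version subtracts the magic number's own bit pattern instead of truncating to a fixed width, and it is only meant for inputs whose rounded value stays within roughly ±2²². It assumes IEEE 754 binary32 floats and round-to-nearest-even.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper. Assumes IEEE 754 binary32 floats, the default
// round-to-nearest-even mode, and a rounded result within about +/- 2^22.
inline std::int32_t float2int_sketch(float f)
{
    const float magic = 12582912.0f;     // 1.5 * 2^23
    float t = f + magic;                 // forces the value into [2^23, 2^24)
    std::uint32_t bits, magic_bits;
    std::memcpy(&bits, &t, sizeof bits);
    std::memcpy(&magic_bits, &magic, sizeof magic_bits);
    // The integer sits in the low mantissa bits; removing the magic
    // number's bit pattern leaves just that signed integer.
    return static_cast<std::int32_t>(bits) - static_cast<std::int32_t>(magic_bits);
}
```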

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:08:42+0000

A value of the double floating-point type is represented like so:

[Figure: double representation]

and it can be seen as two 32-bit integers; now, the int taken in all the versions of your code (supposing it’s a 32-bit int) is the one on the right in the figure, so what you are doing in the end is just taking the lowest 32 bits of mantissa.

Now, to the magic number; as you correctly stated, 6755399441055744 is 2⁵¹ + 2⁵²; adding such a number forces the double to go into the “sweet range” between 2⁵² and 2⁵³, which, as explained by Wikipedia, has an interesting property:

Between 2⁵² = 4,503,599,627,370,496 and 2⁵³ = 9,007,199,254,740,992, the representable numbers are exactly the integers.

This follows from the fact that the mantissa is 52 bits wide.
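The spacing claim can be checked with std::nextafter. The helper name ulp_above below is an assumption for illustration, and the sketch presumes IEEE 754 doubles.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: gap between x and the next representable double above it.
inline double ulp_above(double x)
{
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

With IEEE 754 doubles, ulp_above(4503599627370496.0) (that is, 2⁵²) is 1.0, while the same call on 2⁵³ gives 2.0 and on 2⁵¹ gives 0.5, so only the range from 2⁵² up to 2⁵³ has a spacing of exactly one.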

The other interesting fact about adding 2⁵¹ + 2⁵² is that it affects the mantissa only in the two highest bits, which are discarded anyway, since we are taking only its lowest 32 bits.
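To see this in the raw bits, one can dump the 64-bit pattern of the double. This is only an illustrative sketch: bits_of is a made-up helper, and it assumes IEEE 754 doubles that share a byte order with 64-bit integers.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: the raw 64-bit pattern of a double.
inline std::uint64_t bits_of(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);  // well-defined way to inspect the bits
    return u;
}
```

Under those assumptions, bits_of(6755399441055744.0) is 0x4338000000000000, and bits_of(6755399441055744.0 + 42.0) is 0x433800000000002A: the small integer lands untouched in the low bits, while the magic number only occupies the top of the word.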

Last but not least: the sign.

IEEE 754 floating point uses a magnitude and sign representation, while integers on “normal” machines use 2’s complement arithmetic; how is this handled here?

We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so no less than −2³¹; call it −a (with a > 0). Such a number is obviously made positive by adding the magic number, and the resulting value is 2⁵² + 2⁵¹ + (−a).

Now, what do we get if we interpret the mantissa in 2’s complement representation? It must be the result of the 2’s complement sum of (2⁵² + 2⁵¹) and (−a). Again, the first term affects only the upper two bits; what remains in bits 0–50 is the 2’s complement representation of (−a) (again, minus the upper two bits).

Since reduction of a 2’s complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits correctly gives us (−a) in 32-bit, 2’s complement arithmetic.
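The negative case can be sketched the same way. The helper low32_after_magic below is hypothetical; it assumes IEEE 754 doubles, round-to-nearest-even, and that doubles and 64-bit integers share a byte order. It takes the full 64-bit pattern, cuts it down to the low 32 bits, and reinterprets those as a signed 32-bit integer.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: low 32 bits of (d + magic), read as a signed int.
inline std::int32_t low32_after_magic(double d)
{
    double t = d + 6755399441055744.0;            // 2^52 + 2^51 + d
    std::uint64_t bits;
    std::memcpy(&bits, &t, sizeof bits);
    std::uint32_t low = static_cast<std::uint32_t>(bits);  // drop upper 32 bits
    std::int32_t result;
    std::memcpy(&result, &low, sizeof result);    // 2's complement reinterpretation
    return result;
}
```

For example, adding −3 gives the pattern 0x4337FFFFFFFFFFFD; its low 32 bits, 0xFFFFFFFD, are exactly −3 in 32-bit 2’s complement.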
