Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c - what happens in the background when converting int to float

I don't quite understand how one casts an int to a float, step by step. Suppose I have a signed integer in binary form and I want to carry out the cast to float by hand; however, I can't work out how. Can someone show me that conversion step by step?

I do that conversion in C all the time, like:

  int a = foo();
  float f = (float) a;

But I haven't figured out what happens in the background, and to understand it properly I want to do the conversion by hand.

EDIT: If you know a lot about these conversions, information about float-to-double conversion, and about float-to-int, would also be welcome.


1 Answer


Floating point values (IEEE754 ones, anyway) basically have three components:

  • a sign s;
  • a series of exponent bits e; and
  • a series of mantissa bits m.

The precision dictates how many bits are available for the exponent and mantissa. Let's examine the value 0.1 for single-precision floating point:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 01111011 10011001100110011001101
           ||||||||||||||||||||||+- 8388608
           |||||||||||||||||||||+-- 4194304
           ||||||||||||||||||||+--- 2097152
           |||||||||||||||||||+---- 1048576
           ||||||||||||||||||+-----  524288
           |||||||||||||||||+------  262144
           ||||||||||||||||+-------  131072
           |||||||||||||||+--------   65536
           ||||||||||||||+---------   32768
           |||||||||||||+----------   16384
           ||||||||||||+-----------    8192
           |||||||||||+------------    4096
           ||||||||||+-------------    2048
           |||||||||+--------------    1024
           ||||||||+---------------     512
           |||||||+----------------     256
           ||||||+-----------------     128
           |||||+------------------      64
           ||||+-------------------      32
           |||+--------------------      16
           ||+---------------------       8
           |+----------------------       4
           +-----------------------       2

The sign is positive, that's pretty easy.

The exponent is 64+32+16+8+2+1 = 123; subtracting the 127 bias gives -4, so the multiplier is 2^-4 or 1/16. The bias is there so that you can get really small numbers (like 10^-30) as well as large ones.

The mantissa is chunky. It consists of 1 (the implicit base) plus, for all the set bits, with each bit being worth 1/2^n as n starts at 1 and increases to the right: {1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.

When you add all these up, you get 1.60000002384185791015625.

When you multiply that by the 2^-4 multiplier, you get 0.100000001490116119384765625, which is why they say you cannot represent 0.1 exactly as an IEEE754 float.

In terms of converting integers to floats, if the mantissa has at least as many bits as the integer needs (including the implicit 1), you can just transfer the integer bit pattern over and select the correct exponent. There will be no loss of precision. For example, a double-precision IEEE754 value (64 bits, 52/53 of those being mantissa) has no problem taking on a 32-bit integer.

If there are more bits in your integer (such as a 32-bit integer and a 32-bit single precision float, which only has 23/24 bits of mantissa) then you need to scale the integer.

This involves stripping off the least significant bits (rounding, actually) so that the value fits into the mantissa bits. That involves loss of precision of course, but that's unavoidable.


Let's have a look at a specific value, 123456789. The following program dumps the bits of each data type.

#include <stdio.h>

static void dumpBits (char *desc, unsigned char *addr, size_t sz) {
    unsigned char mask;
    printf ("%s:\n  ", desc);
    while (sz-- != 0) {
        putchar (' ');
        for (mask = 0x80; mask > 0; mask >>= 1)
            if ((addr[sz] & mask) == 0)
                putchar ('0');
            else
                putchar ('1');
    }
    putchar ('\n');
}

int main (void) {
    int intNum = 123456789;
    float fltNum = intNum;
    double dblNum = intNum;

    printf ("%d %f %f\n", intNum, fltNum, dblNum);
    dumpBits ("Integer", (unsigned char *)(&intNum), sizeof (int));
    dumpBits ("Float", (unsigned char *)(&fltNum), sizeof (float));
    dumpBits ("Double", (unsigned char *)(&dblNum), sizeof (double));

    return 0;
}

The output on my system is as follows:

123456789 123456792.000000 123456789.000000
Integer:
   00000111 01011011 11001101 00010101
Float:
   01001100 11101011 01111001 10100011
Double:
   01000001 10011101 01101111 00110100 01010100 00000000 00000000 00000000

And we'll look at these one at a time. First the integer, simple powers of two:

   00000111 01011011 11001101 00010101
        |||  | || || ||  || |    | | +->          1
        |||  | || || ||  || |    | +--->          4
        |||  | || || ||  || |    +----->         16
        |||  | || || ||  || +---------->        256
        |||  | || || ||  |+------------>       1024
        |||  | || || ||  +------------->       2048
        |||  | || || |+---------------->      16384
        |||  | || || +----------------->      32768
        |||  | || |+------------------->      65536
        |||  | || +-------------------->     131072
        |||  | |+---------------------->     524288
        |||  | +----------------------->    1048576
        |||  +------------------------->    4194304
        ||+---------------------------->   16777216
        |+----------------------------->   33554432
        +------------------------------>   67108864
                                         ==========
                                          123456789

Now let's look at the single precision float. Notice how the bit pattern of the mantissa is a near-perfect match for the integer:

mantissa:       11 01011011 11001101 00011    (spaced out).
integer:  00000111 01011011 11001101 00010101 (untouched).

There's an implicit 1 bit to the left of the mantissa and it's also been rounded at the other end, which is where that loss of precision comes from (the value changing from 123456789 to 123456792 as in the output from that program above).

Working out the values:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm    1/n
0 10011001 11010110111100110100011
           || | || ||||  || |   |+- 8388608
           || | || ||||  || |   +-- 4194304
           || | || ||||  || +------  262144
           || | || ||||  |+--------   65536
           || | || ||||  +---------   32768
           || | || |||+------------    4096
           || | || ||+-------------    2048
           || | || |+--------------    1024
           || | || +---------------     512
           || | |+-----------------     128
           || | +------------------      64
           || +--------------------      16
           |+----------------------       4
           +-----------------------       2

The sign is positive. The exponent is 128+16+8+1 = 153; subtracting the 127 bias gives 26, so the multiplier is 2^26 or 67108864.

The mantissa is 1 (the implicit base) plus (as explained above), {1/2, 1/4, 1/16, 1/64, 1/128, 1/512, 1/1024, 1/2048, 1/4096, 1/32768, 1/65536, 1/262144, 1/4194304, 1/8388608}. When you add all these up, you get 1.83964955806732177734375.

When you multiply that by the 2^26 multiplier, you get 123456792, the same as the program output.

The double bitmask output is:

s eeeeeeeeeee mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
0 10000011001 1101011011110011010001010100000000000000000000000000

I am not going to go through the process of figuring out the value of that beast :-) However, I will show the mantissa next to the integer format to show the common bit representation:

mantissa:       11 01011011 11001101 00010101 000...000 (spaced out).
integer:  00000111 01011011 11001101 00010101           (untouched).

You can once again see the commonality with the implicit bit on the left and the vastly greater bit availability on the right, which is why there's no loss of precision in this case.


In terms of converting between floats and doubles, that's also reasonably easy to understand.

You first have to check for the special values such as NaNs and the infinities. These are indicated by special exponent/mantissa combinations, and it's probably easier to detect them up front and generate the equivalent in the new format.

Then, in the case where you're going from double to float, you obviously have less range available to you, since there are fewer bits in the exponent. If your double is outside the range of a float, you need to handle that.

Assuming it will fit, you then need to:

  • rebase the exponent (the bias is different for the two types);
  • copy as many bits of the mantissa as will fit (rounding if necessary); and
  • pad out the rest of the target mantissa (if any) with zero bits.
