Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
690 views
in Technique[技术] by (71.8m points)

math - Can a IEEE 754 real number "cover" all integers within its range?

The original question was edited (shortened) to focus on a problem of precision, not range.

Single, or double precision, every representation of real number is limited to (-range,+range). Within this range lie some integer numbers (1, 2, 3, 4..., and so on; the same goes with negative numbers).

Is there a guarantee that a IEEE 754 real number (float, double, etc) can "cover" all integers within its range? By "cover" I mean the real number will represent the integer number exactly, not as (for example) "5.000001".

Just as reminder: http://www3.ntu.edu.sg/home/ehchua/programming/java/DataRepresentation.html nice explanation of various number representation formats.

Update:

Because the question is for "can" I am also looking for the fact this cannot be done -- for it quoting a number is enough. For example "no it cannot be done, for example number 1748574 is not represented exactly by float number" (this number is taken out of thin air of course).

For curious reader

If you would like to play with IEEE 754 representation -- on-line calculator: http://www.ajdesigner.com/fl_ieee_754_word/ieee_32_bit_word.php

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

No, not all, but there exists a range within which you can represent all integers accurately.

Structure of 32bit floating point numbers

The 32bit floating point type uses

  • 1 bit for the sign
  • 8 bits for the exponent
  • 23 bits for the fraction (leading 1 implied)

Representing numbers

Basically, you have a number in the form

(-)1.xxxx_xxxx_xxxx_xxxx_xxxx_xxx (binary)

which you then shift left/right with the (unbiased) exponent.

To have it represent an integer requiring n bits, you need to shift it by n-1 bits to the left. (All xes beyond the floating point are simply zero)

Representing integers with 24 bits

It is easy to see, that we can represent all integers requiring 24 bits (and less)

1xxx_xxxx_xxxx_xxxx_xxxx_xxxx.0 (unbiased exponent = 23)

since we can set the xes at will to either 1 or 0.

The highest number we can represent in this fashion is:

1111_1111_1111_1111_1111_1111.0

or 2^24 - 1 = 16777215

The next higher integer is 1_0000_0000_0000_0000_0000_0000. Thus, we need 25 bits.

Representing integers with 25 bits

If you try to represent a 25 bit integer (unbiased exponent = 24), the numbers have the following form:

1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0

The twenty-three digits that are available to you have all been shifted past the floating point. The leading digit is always a 1. In total, we have 24 digits. But since we need 25, a zero is appended.

A maximum is found

We can represent `1_0000_0000_0000_0000_0000_0000 with the form 1_xxxx_xxxx_xxxx_xxxx_xxxx_xxx0.0, by simply assigning 1 to all xes. The next higher integer from that is: 1_0000_0000_0000_0000_0000_0001. It's easy to see that this number cannot be represented accurately, because the form does not allow us to set the last digit to 1: It is always 0.

It follows, that the 1 followed by 24 zeroes is an upper bound for the integers we can accurately represent. The lower bound simply has its sign bit flipped.

Range within which all integers can be represented (including boundaries)

  • 224 as an upper bound
  • -224 as a lower bound

Structure of 64bit floating point numbers

  • 1 bit for the sign
  • 11 exponent bits
  • 52 fraction bits

Range within which all integers can be represented (including boundaries)

  • 254 as an upper bound
  • -254 as a lower bound

This easily follows by applying the same argumentation to the structure of 64bit floating point numbers.

Note: That is not to say these are all integers we can represent, but it gives you a range within which you can represent all integers. Beyond that range, we can only represent a power of two multiplied with an integer from said range.

Combinatorial argument

Simply convincing ourselves that it is impossible for 32bit floating point numbers to represent all integers a 32bit integer can represent, we need not even look at the structure of floating point numbers.

  1. With 32 bits, there are 232 different things we can represent. No more, no less.
  2. A 32bit integer uses all of these "things" to represent numbers (pairwise different).
  3. A 32bit floating point number can represent at least one number with a fractional part.

Thus, it is impossible for the 32bit floating point number to be able to represent this fractional number in addition to all 232 integers.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...