Floating point numbers — Sand or dirt
Real numbers are a very important part of real life and of programming too. Almost every computer language has data types for them. Most of the time, they come in the form of (binary) floating point data types, since those are directly supported by most processors. But these computerized representations of real numbers are often badly understood. This can lead to bad assumptions, mistakes and errors, as well as bug reports like: "The compiler has a bug, this always shows 'not equal'"
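A minimal sketch of the pitfall behind such reports (the article's own examples are in Delphi; this illustration uses Python, whose float is an IEEE 754 double):

```python
x = 0.1
total = 0.0
for _ in range(10):
    total += x       # ten additions of the closest double to 0.1

print(total == 1.0)  # False: the accumulated value is not exactly 1.0
print(total)
```

Each addend is only the nearest representable approximation of 0.1, so the accumulated sum drifts away from the mathematically exact result.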
Experienced floating point users will know that this can be expected, but many people use floating point numbers rather naively; they don't really know how they "work", what their limitations are, why certain errors are likely to happen or how they can be avoided. Anyone using them should know a little bit about them. This article explains them from my point of view, i.e. things I found out the hard way. It may be slightly inaccurate, and probably incomplete, but it should help in understanding floating point, its uses and its limitations. It does not use any complicated formulas or higher scientific explanations.

Floating point types in Delphi

Floating point is the internal format in which "real" numbers, like 0.0745 or 3.141592, are stored. Unlike fixed point representations, which are simply integers scaled by a fixed amount (an example is Delphi's Currency type, a 64-bit integer scaled by 10000), floating point lets the scale vary with the magnitude of the value.

Real numbers

The real-number system is a continuum containing real values from minus infinity (−∞) to plus infinity (+∞). But in a computer, where they are represented in a very limited number of bytes (4, 8 or 10 bytes for Delphi's binary floating point types), only a finite subset of them can be stored. There are several ways in which real numbers can be represented. In written form, the usual way is to represent them as a string of digits, with the decimal point represented by a '.', e.g. 12345.678 or 0.0000345. Another way is to use scientific notation, which means that the number is scaled by powers of 10 to, usually, a number between 1 and 10, e.g. 12345.678 is represented as 1.2345678 × 10^4 or, in short form (the one Delphi uses), as 1.2345678e4.

Internal representation

The way such "real" numbers are represented internally differs a bit from the written notation. The fixed point Currency type simply stores the value as an integer scaled by 10000. The floating point types used in Delphi have an internal representation that is much more like scientific notation.
There is an unsigned integer (its size in bits depends on the type) that represents the digits of the number, the mantissa, and a number that represents the scale, in our case in powers of 2 instead of 10, the exponent. There is also a separate sign bit, which is 1 if the number is negative. So in floating point, a number can be represented as:

value = (−1)^sign × (mantissa / 2^(len−1)) × 2^exp

where len is the length of the mantissa in bits.

Mantissa

The mantissa (the IEEE calls it "significand", but this is a neologism which means something like "which is to be signified", and in my opinion that doesn't make any sense) can be viewed in two ways. Let's disregard the exponent for the moment, and assume its value is such that the number 1.75 is represented by the mantissa. Many texts will tell you that the implicit binary point is viewed to be directly right of the topmost bit of the mantissa, i.e. that the topmost bit represents 2^0, the one below that 2^−1, etc., so a mantissa of binary 1.1100 0000 0000 000 represents 1.0 + 0.5 + 0.25 = 1.75. Other, but not so many, texts simply treat the mantissa as an unsigned integer, scaled by 2^(len−1), where len is the number of bits of the mantissa.

Exponent

The exponent is the power of 2 by which the mantissa must be multiplied to get the number that is represented. Internally, the exponent is often "biased", i.e. it is not stored as a signed number, it is stored as unsigned, and the extremes often have special meanings for the number. This means that, to get the actual value of the exponent, you must subtract a constant value from the stored exponent. For instance, the bias for Single is 127, so a stored exponent of 123 means an actual exponent of 123 − 127 = −4. There are also floating point systems that have a decimal based exponent, i.e. where the value of the exponent represents powers of 10; an example is the BigDecimal implementation mentioned later in this article. This article mainly discusses the binary floating point types used in Delphi: Single, Double and Extended.

Sign bit

The sign bit is quite simple. If the bit is 1, the number is negative, otherwise it is positive.
It is totally independent of the mantissa, so there is no need for a two's complement representation for negative numbers. Zero has a special representation, and you can actually even have −0 and +0 values. In any calculation, these are considered equal.

Normalization and the hidden bit

I'll try to explain normalization and denormals with normal scientific notation first. Take the values 6.123 × 10^−22, 612.3 × 10^−24 and 61.23 × 10^−23 (or, in e-notation, 6.123e−22, 612.3e−24 and 61.23e−23). They all denote the same value, but the normalized form is the one with exactly one non-zero digit before the decimal point: 6.123 × 10^−22.

How is this done in binary? Let's take the number 0.375. This can be calculated as 2^−2 + 2^−3 (0.25 + 0.125), or, in a mantissa, 0.011…bin (disregarding the trailing zeroes), i.e. 0.375 × 2^0. But this is not how floating point numbers are usually stored. The exponent is adjusted such that the mantissa always has its top bit set, except for some special numbers, like 0 or the so called "tiny" (denormalized) values. So the mantissa becomes 1.100…bin and the exponent is decremented by 2. This number still represents the value 0.375, but now as 1.5 × 2^−2. This is how normalization works for binary floating point. It ensures that 1.0 <= mantissa (including hidden bit) < 2.0. Since the top bit of a normalized mantissa is always 1, it does not have to be stored at all: it is the hidden bit. Because of the hidden bit, to calculate the value of such a floating point type, you must mentally put the implicit "1."bin in front of the stored bits of the mantissa. Note that in e.g. the language C, such a value can be written directly as a hexadecimal floating point literal: 1.5 × 2^−2 is 0x1.8p-2.

There is some confusion about how to denote the size (or length, as it is often called) of the mantissa of a type with a hidden bit. Some will use the actually stored length in bits, while others also count the hidden bit. For instance, a Single stores 23 mantissa bits, but counting the hidden bit it has 24 bits of precision.

Denormalized values

With real numbers, to get close to zero, you can simply use a very negative exponent, e.g. 6.123 × 10^−99999. But in floating point, the exponent is limited and can not go below a certain value. The mantissa is limited too. Let's assume that in our scientific notation the exponent can not go below −100 and the mantissa can only have 4 digits. Then a very small normalized value would be 6.123 × 10^−100.
To denote even smaller values, you have to resort to denormals: you drop the rule that the first digit must always be non-zero. Now you can also have values smaller than the smallest normalized (positive) value of 1.000 × 10^−100, like 0.612 × 10^−100, 0.061 × 10^−100 and 0.006 × 10^−100. This also means that the lower you go, the more precision you lose: fewer and fewer significant digits are available. This is similar for binary floating point, except that the digits are just 0 or 1. Sometimes, after an operation, the exponent can not be decremented far enough to represent the result. In that case, the exponent is set to a special value, and the mantissa is not normalized anymore, i.e. the top bit is not 1 and the mantissa is interpreted as something like 0.000xxx…xxxbin, i.e. it has one or more leading zeroes followed by as many significant bits as will fit. Such values are called denormalized or tiny values. Because of the leading zeroes, not every bit is significant anymore, so the precision is lower than for normalized values of the same type.

Other special values

The most obvious special value is 0. Because 0 is denoted by both exponent and mantissa having all zero bits, there are actually two representations of 0, one with the sign bit cleared, and one with the sign bit set (i.e. +0 and −0). In any calculation, these are considered equal and simply represent 0. Not every bit combination represents a number. Some represent +/− infinity, and some are invalid. The latter are called NaN, for Not a Number. The rules for which bit combinations represent what are described in the Delphi help, and in the Delphi DocWiki: Single, Double and Extended. I will not repeat that information here.
IEEE types

The IEEE types used in Delphi are Single (4 bytes), Double (8 bytes) and Extended (10 bytes, the x87 extended precision format).
The following table shows a simple representation of these types:

Type      Size     Mantissa bits      Exponent bits   Exponent bias
Single    32 bit   23 (+1 hidden)     8               127
Double    64 bit   52 (+1 hidden)     11              1023
Extended  80 bit   64 (explicit bit)  15              16383

Due to how floating point is implemented on Win64 (using SSE instead of the x87 FPU), there is no 80-bit floating point type in the 64-bit compiler. That is why, on Win64, Delphi's Extended type is simply an alias for Double.

Using floating point numbers
In the following, I am using the terms small and large. I mean values that have a very low or a very high exponent, respectively, regardless of their sign. That means that small values are very close to 0, while large values are far away from 0. In other words, I am addressing their magnitude, not their signs. As you can see above, the different types have quite different precisions. Internally, for calculations, 32-bit Delphi always uses the Extended type, since the x87 FPU works in extended precision by default. There are many traps, caused by the limitations of how the infinite range of real numbers must be represented in a finite number of bits. Some of these traps are discussed in the following paragraphs.

Rounding

After calculations, e.g. multiplications or additions, the result can contain more significant bits than the type can hold, so the FPU must round the values to make them fit and be normalized again, which means that a number of bits gets lost. How this rounding is done is governed by IEEE rules. But this means that there will be additional tiny inaccuracies. An example:
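The Delphi listing is missing from this copy of the article, but the effect it demonstrates can be reproduced in Python by forcing values through single precision with the struct module (a sketch; Delphi's Single corresponds to the 'f' format):

```python
import struct

def as_single(x):
    """Round a Python float (an IEEE 754 double) to single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

a = as_single(0.1)
print(f'{a:.18f}')                 # the closest Single to 0.1
print(f'{as_single(a * a):.18f}')  # the square, rounded back to Single
print(f'{as_single(0.01):.18f}')   # the closest Single to 0.01: different!
```

The squared-and-rounded result and the directly stored 0.01 land on different representable values, which is exactly the discrepancy discussed below.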
The output is:

0.100000001490116120
0.010000000707805160
0.009999999776482580

As you can see, the closest possible representation of 0.1 in a Single is 0.10000000149011612. If this is squared and then rounded, you get 0.01000000070780516, but the closest representation of 0.01 is 0.00999999977648258. So, in other words, in a Single, 0.1 * 0.1 <> 0.01. Doing multiple calculations like this will slowly add up the errors, and they do not necessarily even each other out. It is very important that you take such errors into consideration and do no more calculations than necessary. It is always a good idea to simplify your expressions and to use professional libraries that know how to avoid unnecessary calculations. As in so many programming problems, the choice of algorithm and of the used types is also very important.

Rounding modes and tie-breaking rules

Rounding is generally done to the nearest more significant digit available. But sometimes there is a tie, when the value to be rounded is exactly between the two nearest digits. In that case, a tie-breaking rule is required, and one very common rule is called banker's rounding (although banks are not known to use, or to have used, it), which says that a tie is rounded to the nearest even more significant digit. This means that 24.05 is rounded to 24.0, but 24.15 to 24.2. Other commonly used tie-breaking rules are: round half up (towards +∞), round half down (towards −∞), round half away from zero and round half towards zero.
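Binary floats can't hold 24.05 exactly, so tie-breaking is easiest to demonstrate with Python's decimal module (a sketch; ROUND_HALF_EVEN is banker's rounding):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

q = Decimal('0.1')  # quantize to one decimal place
print(Decimal('24.05').quantize(q, rounding=ROUND_HALF_EVEN))  # 24.0 (tie goes to even)
print(Decimal('24.15').quantize(q, rounding=ROUND_HALF_EVEN))  # 24.2 (tie goes to even)
print(Decimal('24.05').quantize(q, rounding=ROUND_HALF_UP))    # 24.1 (tie goes up)
```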
Note that there are other rounding modes that do not round to the nearest more significant digit, but round to the more significant digit that is either above (closer to +∞), below (closer to −∞) or closer to 0.

RoundTo and SimpleRoundTo

Unit Math contains a few nice functions to round a floating point value to a given power of ten: RoundTo and SimpleRoundTo.
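As a rough, hypothetical analogue of what such a power-of-ten rounding function does (an illustration, not Delphi's actual implementation):

```python
def round_to(value, digit):
    # Round value to the given power of ten, e.g. digit = -2 -> two decimals.
    # Python's round() ties to even, like banker's rounding.
    factor = 10.0 ** digit
    return round(value / factor) * factor

print(round_to(1234.5678, -2))  # about 1234.57 (subject to binary representation)
print(round_to(1234.5678, 2))   # 1200.0
print(round_to(1234.5678, 0))   # 1235.0
```

Note that the result of decimal rounding, stored back into a binary float, is itself only an approximation, which is the point of the next paragraph.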
For better decimal rounding than these rather simple approaches, take a look at John Herbster's DecimalRounding_JH1 unit on Embarcadero's CodeCentral. It uses a more sophisticated algorithm which produces better results. It implements all the rounding modes I discussed in the Rounding modes and tie-breaking rules section above.

The x87 Floating Point Unit

The x87 FPU knows 4 rounding modes (see the FPU control word section of this article). So how does the FPU round? Say an operation on a Single produces an intermediate result with the mantissa

1.0001 1100 0100 1100 1001 0111

This has one bit more than a Single's 24-bit mantissa can hold, so the last bit must be rounded away. There are two possible values this can be rounded to, the value directly below and the value directly above:

1.0001 1100 0100 1100 1001 011
1.0001 1100 0100 1100 1001 100

Now what happens depends on the rounding mode. If the rounding mode is the default (round to nearest "even"), it will get rounded to the value that has a 0 as least significant bit, here the second one. You can probably guess which of the two values is chosen for the other rounding modes.

Measuring rounding errors

There are ways to measure accumulated rounding errors. The most common methods used are ULP (units in the last place) and the relative or approximation error. Discussing them is outside the scope of this article, so I have to refer you to Wikipedia and the articles mentioned in the References section of this article.

Conversion

It is never a good idea to write code that requires a lot of conversions, for instance code that must convert between several floating point types, since each conversion, especially to a less precise type, can mean the loss of a few bits and therefore increases the inaccuracy. If space or speed are not as important as accuracy, use the Extended type.
The output is:

0.100000000000000000
0.100000001490116120

Literals

In source code, we use decimal numbers. But floating point types are stored as binary. For integers, this is not a big problem, but as soon as fractions are involved, it is. Not every number that can be represented exactly in decimal can be represented exactly in binary, just like certain numbers, e.g. 1/3 or π, can not be represented exactly in decimal format. In binary, only numbers that are sums of powers of 2 can be represented exactly in a binary floating point type (e.g. 3.625 = 2 + 1 + 0.5 + 0.125). A number like 0.1 can not be composed of such powers. The compiler will try to get the best approximation that is possible, but there will always be a small difference.

Comparing values

The above shows that it is never a good idea to compare floating point values directly. Conversions and rounding cause tiny inaccuracies, and these errors can add up the more calculations you do. To accommodate these inaccuracies, it is a good idea to always use a small error value in comparisons. Delphi's Math unit contains functions like SameValue and CompareValue for this purpose.
An ε (epsilon) value is a small value you can use as an error range. These functions either take an ε you provide, or, if you pass nothing or 0, they calculate an ε that takes the magnitude of the operands you are comparing into consideration. So it is usually best to pass only the operands, and not a specific ε, unless you have a really good reason to force one upon the function. An example follows:
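The Delphi listing is lost in this copy; Python's math.isclose plays the same role as such an ε-based comparison (a sketch, using a magnitude-relative tolerance like the one described above):

```python
import math

a = 0.1 + 0.2              # accumulates a tiny rounding error
b = 0.3

print(a == b)              # False: direct comparison of floats
print(math.isclose(a, b))  # True: comparison within a relative tolerance
```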
The output is:

False
True

That is because the two compared values are very close, but not bit-for-bit equal, so the direct comparison fails while the ε-based comparison succeeds. (Exact values extracted using my ExactFloatString unit.)
Catastrophic cancellation

If almost equal values are subtracted (or two values that are almost equal except for their signs are added), the result is a value that is tiny compared to the operands. This tiny value can well be in the range of the roundoff errors mentioned before, so it can't be trusted. It is another situation you should avoid. An example follows:
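A Python double shows the same effect as the Delphi/Extended example below, just at a different magnitude (a sketch; 2 × 10^−16 sits near the last bit of a double at 1.0):

```python
x = 1.0
y = x + 2e-16   # 2e-16 is not exactly representable at this magnitude
print(f'{x:.18f}')
print(f'{y:.18f}')
print(y - x)    # one would expect 2e-16, but the difference is one ULP: 2**-52
```

The subtraction is exact; the untrustworthy part is that y already contains a rounded version of the tiny addend.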
The output from Delphi 2010 is:

16.000000000000000000
16.000000000000000000
1.73472347597681E-0018

One would expect the difference to be 1.0 × 10^−18, but the value you get is 1.735 × 10^−18. Also note that the output doesn't display the added 1.0 × 10^−18 at all: at this magnitude, the displayed digits simply can't show it. This is an example of catastrophic cancellation: a devastating loss of precision when small numbers are computed from large numbers, which themselves are subject to roundoff error.

Greatly differing magnitude

This is more or less the reverse of catastrophic cancellation. If two values differ greatly in magnitude, the smaller of the two might be below the precision of the larger one. So adding the tiny value to such a huge value (or subtracting it) will have no effect. That means that you should take care in which order you do such additions or subtractions. Take the following simple example:
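The lost Delphi listing presumably added three values in two different orders; with Python doubles the same phenomenon looks like this (a sketch):

```python
a, b, c = 1.0, 1e20, -1e20

print(a + (b + c))  # 1.0: b and c cancel first, then a survives
print((a + b) + c)  # 0.0: a is absorbed by the huge b before c cancels it
```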
The results shown are:

1.0000000000
0.0000000000

The first result is what you would expect, but the second one shows that you can not simply assume that

(a + b) + c = a + (b + c)

The addition of floating point values is not associative, so in general

(a + b) + c ≠ a + (b + c)
Note that if you have many values to add, it makes sense to sort them in order of magnitude. A nice explanation is given in an answer to a StackOverflow question by Steve Jessop. Be sure to read the comments too. It comes down to the fact that if you add a tiny number to a big one, the tiny one may not change the big one, but if you add a lot of tiny ones first, they may accumulate to a value that can make a difference when added to the big one. The link gives some examples. Also note the answer by Daniel Pryden recommending the Kahan summation algorithm. Kahan's algorithm accumulates the rounding errors in an extra floating point variable and uses that to get a more correct answer.

Functions requiring real values

Not only are fractions like 0.1 not representable in binary floating point, there are also values that are not representable in any integral number base, like the irrational numbers π or Euler's number e, but also values like √2. Functions based on numbers like these are bound to be inaccurate, especially in a limited format like floating point, and because they require multiple internal calculations, even if these are probably done with greater precision. That is why function calls like sin(π) do not deliver exact results.

Avoiding the traps

There are a few tips to avoid the many traps.
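One such tip is compensated summation. A sketch of the Kahan summation algorithm mentioned above:

```python
def kahan_sum(values):
    total = 0.0
    comp = 0.0                  # running compensation: the lost low-order bits
    for v in values:
        y = v - comp            # correct the next addend by the previous error
        t = total + y
        comp = (t - total) - y  # what was rounded away in total + y
        total = t
    return total

nums = [1.0] + [1e-16] * 100_000
print(sum(nums))                # naive sum: every 1e-16 is absorbed by the 1.0
print(kahan_sum(nums))          # compensated sum keeps the accumulated 1e-11
```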
Internals

So how does this look internally? In the following example I use a Single, because Singles have a readily comprehensible number of bits in the mantissa and exponent. Let me show you how a number like 0.1 is stored in a Single. The number 0.1 is stored as $3DCCCCCD or, in binary:

0011 1101 1100 1100 1100 1100 1100 1101

After ordering the bits, this is:

0 - 0111 1011 - 100 1100 1100 1100 1100 1101

which means: sign = 0 (positive), biased exponent = 0111 1011bin = 123, i.e. an actual exponent of 123 − 127 = −4, and a stored mantissa of $4CCCCD, or, including the hidden bit, $CCCCCD = 13421773.
If 13421773 is multiplied by 2^−4 (0.0625), the result is 838860.8125. After dividing that by 2^23 (8388608), this becomes 0.100000001490116119384765625, which is indeed pretty close to 0.1. The following table shows that this is indeed the closest value, by also calculating the values with one ULP difference, i.e. with mantissas $CCCCCC and $CCCCCE respectively.
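The decomposition above can be checked mechanically. A Python sketch that extracts sign, exponent and mantissa from the Single representation of 0.1:

```python
import struct

bits = struct.unpack('>I', struct.pack('>f', 0.1))[0]
print(f'{bits:#010x}')             # 0x3dcccccd

sign     = bits >> 31              # 1 bit
exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
mantissa = bits & 0x7FFFFF         # 23 stored bits (hidden bit not stored)

print(sign, exponent - 127, hex(mantissa | 0x800000))  # 0 -4 0xcccccd

# Reconstruct: (-1)^sign * (mantissa incl. hidden bit) / 2^23 * 2^exponent
value = (-1) ** sign * (mantissa | 0x800000) / 2**23 * 2.0 ** (exponent - 127)
print(value == struct.unpack('>f', struct.pack('>f', 0.1))[0])  # True
```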
To see how the conversion from text to binary is done (well, more or less), take a look at this StackOverflow answer of mine. In reality, the functions that convert from a string in decimal format to binary floating point are very complicated. The de facto standard C implementation is David Gay's dtoa.c. My BigDecimal implementation can do extremely accurate conversions between decimal format strings like '0.1' and binary floating point types.

The FPU control word

The FPU control word is a word-size set of bits that control the behaviour of the FPU: bits 0 to 5 mask the individual FPU exceptions (invalid operation, denormal operand, zero divide, overflow, underflow and precision), bits 8 and 9 select the precision (single, double or extended) and bits 10 and 11 select the rounding mode.
In Delphi, to control the FPU control word (in Delphi, it is called 8087CW), there are a few functions, mentioned in the help and the DocWiki entry for the FPU Control Word. An example of their use:
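The Delphi listing is missing here. As a loose analogue (an illustration, not Delphi's API), Python's decimal module also lets you mask or unmask individual exceptions via its context traps, much like masking exZeroDivide in the FPU control word:

```python
from decimal import Decimal, DivisionByZero, getcontext

ctx = getcontext()

ctx.traps[DivisionByZero] = False  # "mask" the exception
print(Decimal(1) / Decimal(0))     # Infinity: a special value instead of an error

ctx.traps[DivisionByZero] = True   # unmask it again
try:
    Decimal(1) / Decimal(0)
except DivisionByZero:
    print('exception raised')
```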
There is no exception, since including exZeroDivide in the exception mask means the FPU does not raise one; the operation silently produces an infinity instead.

Investigating floating point types

If you want to investigate or (ab)use the internal formats of the floating point types a little more, you should look for the routines by John Herbster, former member of TeamB. Most of them can be found on Embarcadero's CodeCentral.
But also take a look at my (new) ExactFloatStrings unit, which is included with my BigIntegers unit. It works in Delphi XE3 up to Delphi 10 Seattle; I am working on making it work in Delphi XE2 too. Also note that, from Delphi XE3 on, you can use the helpers for the floating point types by including the SysUtils unit.