Table Maker's Dilemma, or Why Almost All Transcendental Elementary Functions Are Rounded Incorrectly

I was surprised to discover how hard it is to find information on this problem in Russian, as if few people care that the mathematical libraries shipped with modern compilers sometimes fail to return a correctly rounded result. This situation worries me, since I happen to work on the development of exactly such mathematical libraries. In Western literature the problem is covered well, so I decided to present it in a popular-science form, relying on Western sources and a little personal experience.






Friends, for your convenience the article is also available as a video presentation (almost 34 minutes). That format may suit readers who find it difficult to build the necessary mathematical images in their heads, since the presentation contains a lot of illustrative material. The information in the video is completely identical to the content of the article. Please use whichever is convenient.








Let me repeat that this is not a scientific article but a popular-science one; after reading it, you will learn the following:



  • Transcendental elementary functions (exp, sin, log, cosh and others) working in floating-point arithmetic are rounded incorrectly: sometimes they make an error in the last bit.
  • The reason for the errors is not always the laziness or low qualification of the developers, but one fundamental circumstance that modern science has not yet managed to overcome.
  • That fundamental circumstance is called the «Table Maker's Dilemma», and it is the main subject of this article.
  • You will also see why, for example, exp2(x) and pow(2.0, x) may return different results.


To understand this article, you need to be familiar with the IEEE-754 floating-point format. It is enough if you at least understand that, for example, 0x400921FB54442D18 is the number pi in double precision (binary64, or double), that is, you simply understand what I mean by such a notation; I do not require you to do such conversions on the fly. I will remind you about the rounding modes in this article, as they are an important part of the story. It is also desirable to know "programmer's" English, because there will be terms and quotations from Western literature, but you can get by with an online translator.
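By the way, you can check such a notation yourself. Here is a tiny sketch (my own illustration, not part of the original examples) that reinterprets the bit pattern 0x400921FB54442D18 as a double and prints it:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main() {
  uint64_t bits = 0x400921FB54442D18ULL;  // the bit pattern of pi in binary64
  double pi;
  memcpy (&pi, &bits, sizeof pi);  // reinterpret the bits as a double
  printf ("%.17g\n", pi);          // prints 3.1415926535897931
  return 0;
}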



Examples first, so that you immediately understand the subject of the conversation. I will give the code in C++, but if this is not your language, I am sure you will still easily understand what is written. Please look at this code:



#include <stdio.h>
#include <cmath>

int main() {
  float x = 0.00296957581304013729095458984375f;  // A specially chosen number (see below).
  float z;
  z = exp2f(x);  // z = 2**x via the exp2f function.
  printf ("%.8f\n", z);  // Print z with 8 digits after the decimal point.
  z = powf(2.0f, x);  // z = 2**x via explicit exponentiation of two.
  printf ("%.8f\n", z);  // Print again.
  return 0;
}


The number x is deliberately written with this many significant digits so that it is exactly representable in the float type, that is, so that the compiler converts it to binary code without rounding. After all, you know very well that some compilers cannot round without errors (if you don't, please say so in the comments, and I will write a separate article with examples). Next, the program computes 2^x in two ways: the function exp2f(x), and explicit exponentiation of two, powf(2.0f, x). The results, of course, differ, because I said above that elementary functions cannot work correctly in all cases, and I deliberately picked an example that shows this. Here is the output:



1.00206053
1.00206041


Four compilers gave me these values: Microsoft C++ (19.00.23026), Intel C++ 15.0, GCC 6.3.0 and Clang 3.7.0. They differ in the least significant bit. Here are the hexadecimal codes of these numbers:



0x3F804385  // The correctly rounded result (from exp2f).
0x3F804384  // The incorrectly rounded result (from powf).


Please remember this example: later we will use it to examine the essence of the problem. For now, to give you a fuller impression, please look at examples for the double-precision type (double, binary64) with some other elementary functions. The results are in the table below. Correct answers (where available) are marked with * at the end.



Function      Argument                    MS C++                Intel C++             GCC                   Clang
log10(x)      2.60575359533670695e129     0x40602D4F53729E44    0x40602D4F53729E45*   0x40602D4F53729E44    0x40602D4F53729E44
expm1(x)      -1.31267823646623444e-7     0xBE819E53E96DFFA9*   0xBE819E53E96DFFA8    0xBE819E53E96DFFA8    0xBE819E53E96DFFA8
pow(10.0,x)   3.326929759608827789e-15    0x3FF0000000000022    0x3FF0000000000022    0x3FF0000000000022    0x3FF0000000000022
log1p(x)      -1.3969831951387235e-9      0xBE17FFFF4017FCFF*   0xBE17FFFF4017FCFE    0xBE17FFFF4017FCFE    0xBE17FFFF4017FCFE




I hope you have not gotten the impression that I deliberately picked some completely unique tests that you could hardly stumble upon? If so, let's whip up a complete enumeration of all possible fractional arguments of the function 2^x for the float data type. Clearly, we are only interested in values of x between 0 and 1, because other arguments give results that differ only in the exponent field and are of no interest. You understand why:

2^x = 2^[x] · 2^{x}, where [x] is the integer part and {x} is the fractional part of x.
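If you want to convince yourself of this identity in code, here is a small sketch (my own illustration): ldexpf multiplies by 2^[x] exactly, by adjusting only the exponent field:

#include <stdio.h>
#include <cmath>

int main() {
  // 2**x = 2**[x] * 2**{x}: the integer part only shifts the exponent field.
  float x = 5.3f;
  float frac = x - floorf (x);                       // {x}, computed exactly here
  float z1 = exp2f (x);                              // direct computation
  float z2 = ldexpf (exp2f (frac), (int)floorf (x)); // 2**{x}, then an exact shift by [x]
  printf ("%.8g\n%.8g\n", z1, z2);  // on typical implementations both lines match
  return 0;
}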





Having written such a program (the code is hidden under the spoiler below), I checked how many erroneous values the exp2f function produces on the interval x from 0 to 1.



MS C++              Intel C++        GCC   Clang
1,910,726 (0.97%)   90,231 (0.05%)   0     0




The program below shows that the number of tested arguments x was 197,612,997. It turns out that, for example, Microsoft C++ computes the function 2^x incorrectly for almost one percent of them. Don't rejoice, dear fans of GCC and Clang: it is just that this particular function is implemented correctly in those compilers, while others are full of errors.



Brute force code
#include <stdio.h>
#include <cmath>

    // Macros to access the bit patterns of float and double values.
#define FAU(x) (*(unsigned int*)(&x))
#define DAU(x) (*(unsigned long long*)(&x))

    // "Self-written" computation of 2**x for 0 <= x <= 1, correct in the last bit.
    // It uses a minimax polynomial of degree 10; the coefficients were derived
    // automatically, for example in the Maple computer algebra system.
    // Intermediate computations are done in double (that is the whole trick).
    // No attempt is made to exploit FMA instructions or other speed-ups:
    // the goal here is correctness, not performance.
float pow2_minimax_poly_double (float x) {
  double a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10;
  DAU(a0) = 0x3ff0000000000001;
  DAU(a1) = 0x3fe62e42fefa3763;
  DAU(a2) = 0x3fcebfbdff845acb;
  DAU(a3) = 0x3fac6b08d6a26a5b;
  DAU(a4) = 0x3f83b2ab7bece641;
  DAU(a5) = 0x3f55d87e23a1a122;
  DAU(a6) = 0x3f2430b9e07cb06c;
  DAU(a7) = 0x3eeff80ef154bd8b;
  DAU(a8) = 0x3eb65836e5af42ac;
  DAU(a9) = 0x3e7952f0d1e6fd6b;
  DAU(a10)= 0x3e457d3d6f4e540e;
  return (float)(a0+(a1+(a2+(a3+(a4+(a5+(a6+(a7+(a8+(a9+a10*x)*x)*x)*x)*x)*x)*x)*x)*x)*x);
}

int main() {
  unsigned int n = 0;  // Error counter.
  // Enumerate all float values of x in the interval (0,1).
  // Lower bound: 0x33B8AA3B = 0.00000008599132428344091749750077724456787109375,
  //   the smallest x for which 2**x > 1.0f.
  // Upper bound: 0x3F800000 = 1.0.
  for (unsigned int a=0x33B8AA3B; a<0x3F800000; ++a) {
    float x;
    FAU(x) = a;
    float z1 = exp2f (x);  // Possibly wrong answer.
    float z2 = pow2_minimax_poly_double (x);  // Correct answer (reference).
    if (FAU(z1) != FAU(z2)) {  // Compare the bit patterns.
      // Uncomment to see every mismatch (there will be a lot of output).
      //fprintf (stderr, "2**(0x%08X) = 0x%08X, but correct is 0x%08X\n", a, FAU(z1), FAU(z2));
      ++n;
    }
  }
  const unsigned int N = 0x3F800000-0x33B8AA3B;  // Total number of tested arguments.
  printf ("%u wrong results of %u arguments (%.2lf%%)\n", n, N, (float)n/N*100.0f);
  return 0;
}




I will not bore the reader with more examples; the main point was to show that modern implementations of transcendental functions may round the last bit incorrectly, and different compilers make mistakes in different places, yet none of them works correctly everywhere. By the way, the IEEE-754 Standard allows this error in the last bit (I will discuss this below), but it still seems strange to me: okay, double is a large data type, but float can be checked by brute force! Was it that difficult to do? Not difficult at all, and I have already shown an example.



Our enumeration code contains a "self-written" function that computes 2^x correctly using an approximating polynomial of degree 10, and it was written in a few minutes, because such polynomials are derived automatically, for example, in the Maple computer algebra system. It is enough to require that the polynomial provide 54 bits of precision (for this function, 2^x). Why 54? You will find out soon, right after I explain the essence of the problem and tell you why, at present, it is in principle impossible to create fast and correct transcendental functions for the quadruple-precision data type (binary128), although there are already attempts to attack this problem in theory.



Default rounding and the problem with it



If you are not immersed in the development of mathematical libraries, there is nothing wrong with having forgotten the default rounding rule for floating-point numbers in the IEEE-754 Standard. Therefore, let me remind you of it. If you remember everything well, still take a look at least at the end of this section: a surprise awaits you there, a situation where rounding a number can be very difficult.



You can easily recall what "round up" (toward plus infinity), "round down" (toward minus infinity) and "round toward zero" mean just from the names (if anything, there is Wikipedia). The main difficulties for programmers arise with rounding "to the nearest, and in the case of an equal distance to both neighbors, to the one whose last digit is even". Yes, that is how this rounding mode is translated; Western literature calls it briefly: "round to nearest, ties to even".
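As a reminder of how the four modes behave, here is a small sketch (my own illustration; strictly speaking, relying on the dynamic rounding mode requires FENV_ACCESS, for example the -frounding-math flag in GCC, so treat the output as illustrative):

#include <stdio.h>
#include <cfenv>

int main() {
  // The four IEEE-754 rounding modes applied to the same division.
  const int modes[] = { FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO };
  const char *names[] = { "to nearest", "downward", "upward", "toward zero" };
  for (int i = 0; i < 4; ++i) {
    fesetround (modes[i]);
    volatile float one = 1.0f, three = 3.0f;  // volatile: keep the division at run time
    printf ("1/3 rounded %-11s = %.10f\n", names[i], one / three);
  }
  return 0;
}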



This rounding mode is used by default and works as follows. If, as a result of a computation, the mantissa turns out to be longer than the result data type can hold, rounding goes to the nearest of the two representable neighbors. However, a situation may arise where the original number is exactly in the middle between the two nearest values; then the result chosen is the one whose last bit (after rounding) is even, that is, equal to zero. Consider four examples where we need to round to two bits after the binary point (the bracketed bit is the rounding bit):



  1. Round 1.00[1]001. The third bit after the point is 1, and further on there is another 1 (the 6th bit), so the original number is closer to 1.01 than to 1.00, and the rounding is up: we get 1.01.
  2. Round 1.00[1]000. The number is exactly in the middle between 1.00 and 1.01; the last bit of 1.00 is even (zero), so the rounding is down: we get 1.00.
  3. Round 1.01[1]000. The number is exactly in the middle between 1.01 and 1.10; the last bit of 1.10 is even, so the rounding is up: we get 1.10.
  4. Round 1.01[0]111. The number is closer to 1.01 than to 1.10, so the rounding is down: we get 1.01.
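The same "ties to even" behavior is easy to observe in code. A minimal sketch (my own illustration): both doubles below lie exactly halfway between two neighboring floats, and each is rounded to the neighbor whose last mantissa bit is zero:

#include <stdio.h>

int main() {
  // Floats near 1.0 are spaced 2**-23 apart, so 1 + 2**-24 is an exact tie
  // between 1.0f (mantissa ends in ...0) and the next float (ends in ...1).
  double tie1 = 1.0 + 1.0 / 16777216.0;  // 1 + 2**-24
  double tie2 = 1.0 + 3.0 / 16777216.0;  // 1 + 3 * 2**-24, also an exact tie
  printf ("%.10f\n", (float)tie1);  // 1.0000000000: rounded down to the even mantissa
  printf ("%.10f\n", (float)tie2);  // 1.0000002384: rounded up to the even mantissa
  return 0;
}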


From these examples it may seem that everything is simple, but it is not. The fact is that sometimes we cannot say for sure whether we are really in the middle between two values. Look at this example. Suppose we again want to round to two bits after the point:



1.00[1]0000000000000000000000000000000000001



It is now obvious to you that the rounding should be up, that is, to 1.01. However, you can see this only because you are looking at 40 bits after the point. What if your algorithm cannot provide 40 bits of precision and achieves only 30? Then it will produce a different number:



1.00[1]000000000000000000000000000



Unaware that at the 40th position (which the algorithm is unable to compute) a treasured 1 will appear, you round this number down and get 1.00, which is wrong. You have rounded the last bit incorrectly, and that is precisely the subject of our discussion. It follows that just to get the 2nd bit correct, you may have to compute the function to 40 bits! Wow! And what if the "locomotive" of zeros turns out to be even longer? That is what we will discuss in the next section.



By the way, this is the mistake many compilers make when converting the decimal notation of a floating-point number into the resulting binary format. If the decimal number in the program code is too close to the middle between two exactly representable binary values, it will not be rounded correctly. But that is not the topic of this article; it deserves a separate story.



The essence of the problem of rounding the last significant bit



The problem manifests itself for two reasons. The first is the deliberate rejection of time-consuming computation in favor of speed: as long as the specified accuracy is maintained, which exact bits appear in the answer is a secondary matter. The second reason is the Table Maker's Dilemma, which is the main subject of our conversation. Let's consider both reasons in more detail.



First reason



You understand, of course, that transcendental functions are computed by approximate methods, for example, by approximating polynomials or even (rarely) by series expansion. To make the computation as fast as possible, developers agree to perform as few iterations of the numerical method as possible (or to take a polynomial of the lowest possible degree), as long as the algorithm keeps the error within half the value of the last bit of the mantissa. In the literature this is written as 0.5 ulp (ulp = unit in the last place).



For example, for a float number x in the interval (0.5; 1), ulp = 2^-24; on the interval [1; 2), ulp = 2^-23. In other words, if x is in the interval (0; 1), then 2^x is in the interval (1; 2), and to ensure an accuracy of 0.5 ulp you need, roughly speaking, to choose EPS = 2^-24 (this is how we will denote the constant "epsilon", colloquially known as "error" or "accuracy", whichever you prefer; please don't split hairs).
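You can inspect these ulp sizes directly. A small sketch (my own illustration) measuring the distance to the next representable float with nextafterf:

#include <stdio.h>
#include <cmath>

int main() {
  // ulp(x) as the distance from x to the next representable float.
  float a = 0.75f, b = 1.5f;
  printf ("ulp(0.75) = %g\n", nextafterf (a, 2.0f) - a);  // 2**-24, about 5.96e-8
  printf ("ulp(1.5)  = %g\n", nextafterf (b, 2.0f) - b);  // 2**-23, about 1.19e-7
  return 0;
}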



For applied calculations this is enough; the fact that the last bits may not match the exact result does not matter to nearly 100% of programmers, because what is important to them is not which bits come out, but what the accuracy is.



For those who don't follow, here is an example in the decimal system. Take two numbers: 1.999999 and 2.0. Say the first is what the programmer obtained, and the second is the reference, what would have been obtained if our possibilities were unlimited. The difference between them is only one millionth; that is, the answer is computed with an error of EPS = 10^-6. However, there is not a single correct digit in this answer. Is that bad? No; from the point of view of the application program it makes no difference: the programmer will round the answer to, say, two decimal places and get 2.00 (for example, it was about currency, $2.00), which is all he needs. And since he set EPS = 10^-6 in his program, well done: he took a margin for the error of intermediate calculations and solved the problem correctly.



In other words, do not confuse the two: the precision and the number of correct bits (or digits) are different things. Those who need precision (nearly 100% of programmers) are not affected by the discussed problem at all. Those who need the bit sequence to match a correctly rounded reference, for example developers of libraries of elementary functions, worry about this problem a great deal. Still, it is useful for everyone to know about it for general education.



Let me repeat: that was the first side of the problem: the last bits of the answer may be wrong because that is an intentional decision. The main thing is to keep the accuracy of 0.5 ulp (or better), so the numerical algorithm is chosen from that condition alone, as long as it runs extremely fast. The Standard permits implementing elementary functions without correct rounding of the last bit. I quote [1, section 12.1]:

The 1985 version of the IEEE 754 Standard for Floating-Point Arithmetic did not specify anything concerning the elementary function. This was because it has been believed for years that correctly rounded functions would be much too slow at least for some input arguments. The situation changed since then and the 2008 version of the standard recommends (yet does not require) that some functions be correctly rounded.



The following are the functions that are recommended, but not required, to be correctly rounded:



function list




The second reason



Finally we come to the topic of the conversation: the Table Maker's Dilemma (abbreviated TMD). I could not translate its name adequately into Russian; it was introduced by William Kahan (a founding father of IEEE-754) in article [2]. Perhaps if you read that article, you will understand why the name is exactly that. In short, the essence of the dilemma is that we need an absolutely exact rounding of the function z = f(x), as if we had an infinite bit representation of the perfectly computed result z at our disposal. But everyone knows we cannot obtain an infinite sequence. How many bits should we take then? Above I showed an example where we need to see 40 bits of the result in order to get at least 2 correct bits after rounding. And the essence of the TMD is that we do not know in advance to how many bits we must compute the value of z in order to get as many correctly rounded bits as we need. What if it takes a hundred, or a thousand? We do not know in advance!



For example, as I said, for the function 2^x with the float data type, where the fractional part of the mantissa has only 23 bits, we need to compute with an accuracy of 2^-54 for the rounding to come out correctly for all possible arguments x without exception. This estimate is not difficult to obtain by exhaustive search, but for most other functions, and especially for the types double or long double (hit "like" if you know what that is), such estimates are unknown.
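For float you can even watch the "locomotives" being hunted. The sketch below (my own illustration) assumes the platform's long double is the 80-bit x86 type and that exp2l is accurate to about 1 ulp of that type; it scales 2^x so that the rounding bit lands at the units place and measures how close the remaining tail comes to an integer. The run takes a few minutes:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <cmath>

int main() {
  long double min_d = 1.0L;  // smallest observed distance of the tail to an integer
  uint32_t worst = 0;
  for (uint32_t a = 0x33B8AA3B; a < 0x3F800000; ++a) {
    float x;
    memcpy (&x, &a, sizeof x);
    // Multiply by 2**24 (exact): the bits after the 24th now form the fraction.
    long double u = exp2l ((long double)x) * 16777216.0L;
    long double d = fabsl (u - nearbyintl (u));
    if (d < min_d) { min_d = d; worst = a; }
  }
  // A tail at distance d from an integer means a run of roughly -log2(d)-1
  // identical bits right after the rounding bit.
  printf ("worst x = 0x%08X, run length ~ %.0Lf bits\n", worst, -log2l (min_d) - 1.0L);
  return 0;
}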



Let's now understand why this happens. I deliberately gave the very first example of this article in the float data type and asked you to remember it, because float has only 32 bits and is easier to look at; in other data types the situation is similar.



We started with the number x = 0.00296957581304013729095458984375, which is exactly representable in float, that is, it is written so that it converts to binary float without rounding. We compute 2^x, and if we had a calculator of infinite precision, we would have to get (you can check me: the calculation was done in the online system WolframAlpha):



1.0020604729652405753669743044108123031635398201893943954577320057 ...



Let's convert this number to binary; say, 64 bits will be enough:



1.00000000100001110000100[1]000000000000000000000000000001101111101



The rounding bit (the 24th bit after the point) is shown in brackets. Question: which way to round? Up or down? Clearly up, and you know this because you see enough bits to make the decision. But look closely...



After the rounding bit we have 29 zeros. This means we are very, very close to the midpoint between the two nearest numbers, and a slight shift downward would change the direction of rounding. But the question is: will there be such a shift? A numerical algorithm may approach the exact value step by step from different sides, and until we have passed all these 29 zeros and reached an accuracy finer than the value of the very last zero in this "locomotive", we do not know the direction of rounding. What if the correct answer is actually:



1.00000000100001110000100[0]11111111111111111111111111111?



Then the rounding will be down.



We do not know this until our precision reaches the 54th bit after the point. Once the 54th bit is known exactly, we know exactly which of the two nearest numbers we are actually closer to. Such arguments are called hardest-to-round points [1, section 12.3], and the number 54 is called the hardness to round, denoted by the letter m in the cited book.
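Here, by the way, is the punch line of our first example, as a sketch (my own illustration). Even if the platform's double-precision exp2 is correctly rounded (53 significant bits, which is not guaranteed), its result lands exactly on the midpoint between the two neighboring floats, and converting it to float then picks the wrong neighbor:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <cmath>

int main() {
  float x = 0.00296957581304013729095458984375f;
  double d = exp2 ((double)x);  // if correctly rounded, d is EXACTLY the midpoint
                                // between the two neighboring floats
  float z = (float)d;           // "ties to even" then rounds down: the wrong way
  uint32_t bits;
  memcpy (&bits, &z, sizeof bits);
  printf ("0x%08X\n", bits);    // 0x3F804384 here, but the correct answer is 0x3F804385
  return 0;
}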



The hardness to round (m) is the minimum number of bits needed so that, for all arguments of some function f(x) on a preselected range, f(x) is rounded correctly to the last bit (different rounding modes may have different values of m). In other words, for the data type float, for arguments x from the range (0; 1) and for the "nearest, ties to even" rounding mode, the hardness to round is m = 54. This means that for absolutely all x from the interval (0; 1) we can build the same precision EPS = 2^-54 into the algorithm, and all results will be rounded correctly to 23 bits after the binary point.



In fact, some algorithms manage to deliver the exact result even from 53 or even 52 bits; brute force shows this. But theoretically 54 are needed. If it were not possible to run the brute force, we could not "cheat" and save a couple of bits, as I did in the enumeration program above: I took a polynomial of a lower degree than required, and it still works, simply because I got lucky.



So, regardless of the rounding mode, there are two possible situations: either a "locomotive" of zeros or a "locomotive" of ones appears right after the rounding bit. The task of a correct algorithm for computing a transcendental function f(x) is to refine the value of the function until the accuracy is finer than the value of the last bit of this "locomotive", that is, until it becomes clear that subsequent fluctuations of the numerical method will not turn the zeros into ones or vice versa. As soon as everything stabilizes and the algorithm has reached an accuracy beyond the end of the "locomotive", we can round as if we had an infinite number of bits, and this rounding will have the correct last bit. But how can this be achieved?



"Crutches"



As mentioned, the main problem is getting the algorithm past the locomotive of zeros or ones that comes immediately after the rounding bit. When the locomotive has been passed and we see it in full, that is equivalent to these zeros or ones having been computed exactly, and we then know exactly in which direction rounding will occur. But if we do not know the length of the locomotive, how can we design the algorithm?



The first "crutch"



It may seem to the reader that the answer is obvious: take arithmetic with arbitrary precision, set a deliberately excessive number of bits, and if that turns out to be not enough, add more and recompute. In general, that is correct. This is what is done when the speed and resources of the computer do not play a special role. This approach has a name: Ziv's multilevel strategy [1, section 12.3]. Its essence is extremely simple. The algorithm supports computation at several levels: a quick preliminary computation (in most cases it turns out to be final), a slower but more accurate one (which saves us in most critical cases), an even slower and even more accurate one (when things got really "bad"), and so on.



In the overwhelming majority of cases it is enough to take the accuracy slightly better than 0.5 ulp, but if a "locomotive" appears, we increase it. As long as the "locomotive" persists, we increase the accuracy until it is strictly clear that further fluctuations of the numerical method will not affect it. For example, in our case, once we have reached EPS = 2^-54, a 1 appears at the 54th position which, so to speak, "protects" the locomotive of zeros: it guarantees that no quantity greater than or equal to 2^-53 will be subtracted anymore, so the zeros will not flip to ones and drag the rounding bit down with them.
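In code, one level of Ziv's strategy can be sketched like this (my own illustration; f_fast and f_precise are hypothetical evaluators, both stubbed here with exp2 just so the sketch compiles, and the error bound is absolute for simplicity). If adding and subtracting the error bound yields the same float, the fast result already rounds correctly; otherwise we fall to the slower level:

#include <stdio.h>
#include <cmath>

// Hypothetical evaluators: a cheap one with error below EPS_FAST, and a slow,
// much more accurate one. Both are stand-ins; the real ones would be
// different polynomials or multi-precision code.
static double f_fast (double x)    { return exp2 (x); }
static double f_precise (double x) { return exp2 (x); }

static const double EPS_FAST = 1.0 / 1099511627776.0;  // 2**-40, assumed bound

float f_correctly_rounded (float x) {
  double y = f_fast (x);
  // Ziv's rounding test (simplified): if the whole interval [y-eps, y+eps]
  // rounds to one and the same float, that float is the correct answer.
  if ((float)(y - EPS_FAST) == (float)(y + EPS_FAST))
    return (float)y;           // the overwhelming majority of cases end here
  return (float)f_precise (x); // a "locomotive" was detected: take the slow path
}

int main() {
  printf ("%.8f\n", f_correctly_rounded (0.5f));
  return 0;
}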



That was the popular-science account. The full story, including Ziv's rounding test, which shows how to check in a single step whether the required accuracy has been reached, can be found in [1, Chapter 12] or in [3, Section 10.5].



The problem with this approach is obvious. You have to design the algorithm for each transcendental function f(x) so that the precision of the computation can be increased along the way. For a software implementation this is not so scary: for example, Newton's method, roughly speaking, doubles the number of exact bits after the point at each iteration, and you can keep doubling until it is "enough". Admittedly this is a rather time-consuming process, and Newton's method is not always justified, because it requires computing the inverse function f^-1(x), which in some cases may be no simpler than computing f(x) itself. For a hardware implementation, Ziv's strategy is completely unsuitable: an algorithm hardwired into the processor must perform a sequence of actions with a number of bits fixed in advance, and this is quite problematic when we do not know that number beforehand. Reserve extra bits? How many?
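To make the remark about Newton's method concrete, here is a sketch (my own illustration) that computes ln(2) by solving exp(y) = x; watch the error roughly square at every step:

#include <stdio.h>
#include <cmath>

int main() {
  // Newton's method for y = ln(x): iterate y -> y + x*exp(-y) - 1.
  // Each step roughly doubles the number of correct bits.
  double x = 2.0;
  double y = 1.0;  // crude initial guess
  for (int i = 1; i <= 5; ++i) {
    y = y + x * exp (-y) - 1.0;
    printf ("iteration %d: y = %.17f, error = %.3g\n", i, y, fabs (y - log (x)));
  }
  return 0;
}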



A probabilistic approach to the problem [1, section 12.6] allows one to estimate the value of m (recall: the number of bits sufficient for correct rounding). It turns out that the length of the "locomotive" is, in the probabilistic sense, slightly larger than the length of the mantissa. Thus, in most cases it suffices to take m a little more than twice the mantissa length, and only in very rare cases more. I quote the authors of that work: "we deduce that in practice, m must be slightly greater than 2p" (their p is the length of the mantissa including the integer part, that is, p = 24 for float). Further on they show that the probability of error with such a strategy is close to zero but still positive, and experiments confirm this.



Nevertheless, there remain cases where m must be taken even larger, and the worst case is not known in advance. Theoretical worst-case estimates exist [1, section 12.7.2], but they yield unthinkable millions of bits, which is no good. Here is a table from the cited work (it is for the function exp(x) on the interval from -ln(2) to ln(2)):



p                m
24 (binary32)    1,865,828
53 (binary64)    6,017,142
113 (binary128)  17,570,144




Second "crutch"



In practice m will not be so terribly large, and to determine the worst case a second "crutch" is applied, called exhaustive pre-computation. For the data type float (32 bits), if the function f has one argument x, we can easily "run over" all possible values of x. A problem arises only with functions that have more than one argument (pow(x, y) among them), for which nothing of the kind has been invented. After checking all possible values of x, we compute our constant m for every function f(x) and every rounding mode. Then the computation algorithms that are to be implemented in hardware are designed to provide an accuracy of 2^-m. In that case rounding of f(x) is guaranteed to be correct in all cases.



For the type double (64 bits), simple enumeration is practically impossible. And yet it has been carried out! How? The answer is given in [4]. I will describe it very briefly.



The domain of the function f(x) is split into very small segments so that inside each segment f(x) can be replaced by a linear function of the form b - ax (the coefficients a and b are, of course, different for different segments). The size of these segments is computed analytically so that such a linear function is indeed almost indistinguishable from the original within each segment.



Then, after certain scaling and shifting operations, we arrive at the following problem: can the straight line b - ax pass "close enough" to an integer point?



It turns out that a yes-or-no answer is relatively easy to obtain: "yes" if potentially dangerous points lie close to the line, and "no" if no such point can, in principle, come close to it. The beauty of the method is that in practice the answer is "no" in the overwhelming majority of cases, while the rare answer "yes" forces us to go through that segment by brute force to determine which specific points were critical.
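To make the question concrete, here is a brute-force sketch of it (my own illustration with made-up coefficients). Lefèvre's algorithm answers the same question in roughly logarithmic time with a Euclid-like recurrence; this O(N) loop merely shows what is being asked:

#include <stdio.h>
#include <cmath>

int main() {
  // Does the line y = b - a*k, k = 0..N-1, ever come closer than eps
  // to an integer point?
  const double a = 0.7234017283, b = 0.4237890128;  // made-up coefficients
  const double eps = 1e-9;
  const long N = 10000000;
  double best = 1.0;
  long best_k = -1;
  for (long k = 0; k < N; ++k) {
    double y = b - a * (double)k;
    double d = fabs (y - nearbyint (y));  // distance to the nearest integer
    if (d < best) { best = d; best_k = k; }
  }
  printf ("minimum distance %.3g at k = %ld%s\n",
          best, best_k, best < eps ? " (dangerous!)" : "");
  return 0;
}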



Still, this reduces the enumeration of the arguments of f(x) many, many times over and makes it possible to find the hardest-to-round points for types such as double (binary64) and long double (80 bits!). This is done on supercomputers and, of course, on video cards... in the time they have free from mining. However, nobody yet knows what to do with the binary128 data type. Let me remind you that the fractional part of the mantissa of such numbers has 112 bits. Therefore, in the foreign literature on this subject one so far finds only semi-philosophical musings beginning with "we hope...".



The details of the method that quickly decides whether a line passes near integer points would be out of place here. For those who wish to study the process, I recommend looking at the problem of the distance between a straight line and Z^2, for example in article [5]. It describes an improved algorithm that, in the course of its construction, resembles the famous Euclidean algorithm for the greatest common divisor. Here is a picture from [4] and [5] showing the successive transformation of the problem:



[figure from [4] and [5] omitted]



There exist extensive tables containing the worst cases of rounding on various intervals for each transcendental function. They can be found in [1, section 12.8.4] and in [3, section 10.5.3.2], as well as in separate articles, for example in [6].



I will give a few examples, taking random rows from such tables. I emphasize that these are not the worst cases over all x, but only over certain small intervals; consult the sources if you are interested.



Function   x                     f(x) (truncated)      53rd bit and following
log2(x)    1.B4EBE40C95A01P0     1.8ADEAC981E00DP-1    1 0^53 1011...
cosh(x)    1.7FFFFFFFFFFF7P-23   1.0000000000047P0     1 1^89 0010...
ln(1+x)    1.8000000000003P-50   1.7FFFFFFFFFFFEP-50   1 0^99 1000...




How to read the table? The value x is given in hexadecimal floating-point notation for double. First, as expected, comes the leading one, then 52 bits of the fractional part of the mantissa, then the letter P. The letter means "multiplied by two to the power of" and is followed by the exponent. For example, P-23 means that the given mantissa must be multiplied by 2^-23.
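Note that this is exactly the hexadecimal floating-point notation of C/C++ (hex-float literals require C++17). A tiny sketch (my own illustration):

#include <stdio.h>

int main() {
  // The argument from the log2 row of the table, written as a hex-float literal.
  double x = 0x1.B4EBE40C95A01p0;
  printf ("%a\n", x);     // prints the same value back in hex-float form
  printf ("%.17g\n", x);  // and in decimal
  return 0;
}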



Next, imagine that f(x) is computed with infinite precision and the first 53 bits are cut off (without rounding!). These 53 bits (one of them before the point) are shown in the f(x) column. The bits that follow are shown in the last column. The "power" in the bit sequence of the last column denotes the number of repetitions of a bit; for example, 1 0^53 1011 means a bit equal to 1, then 53 zeros, then 1011. Then come three dots, meaning that the remaining bits are of no interest to us at all.



From here on it is a matter of technique: knowing the worst cases for each interval of each particular function, we can choose an approximation for that interval whose accuracy covers the worst case. Just a few years of supercomputer computation, and fast, correct hardware implementations of elementary functions become possible. All that remains is a trifle: to teach at least the compiler developers to use these tables.



Why is this needed?



A great question! After all, I have said repeatedly that almost 100% of programmers do not need an elementary function accurate to a correctly rounded last bit (often they do not even need half of the bits), so why do scientists run supercomputers and compile tables to solve a "useless" problem?



First, the problem is fundamental. It is interesting not so much to obtain exact rounding for the sake of exact rounding, but to understand in principle how this interesting problem could be solved, and what secrets of computational mathematics its solution will reveal to us. How might those secrets be used in other problems? Fundamental sciences are like that: you can work on some "nonsense" for decades, and a hundred years later, thanks to that "nonsense", a scientific breakthrough happens in some other field.



Second, there is the issue of code portability. If a function is allowed to treat the last bits of the result however it wants, then on different platforms and with different compilers slightly different results may be obtained (even if within the declared error). In some cases this does not matter, but in others it can be essential, especially when the program has a bug that shows up on one platform and not on another precisely because of differing result bits. But why should I describe the well-known headache of programs behaving differently? You know all this without me. It would be great to have a mathematical system that works identically on all platforms, no matter how it is compiled. That is exactly what correct rounding of the last bit is needed for.



List of sources



[1] Jean-Michel Muller, "Elementary Functions: Algorithms and Implementation", 2016.

[2] William Kahan, "A Logarithm Too Clever by Half", 2004.

[3] Jean-Michel Muller et al., "Handbook of Floating-Point Arithmetic", 2018.

[4] Vincent Lefèvre, Jean-Michel Muller, "Toward Correctly Rounded Transcendentals", IEEE Transactions on Computers, Vol. 47, No. 11, November 1998, pp. 1235-1243.

[5] Vincent Lefèvre, "New Results on the Distance Between a Segment and Z^2. Application to the Exact Rounding", 17th IEEE Symposium on Computer Arithmetic (Arith'17), June 2005, Cape Cod, MA, United States, pp. 68-75.

[6] Vincent Lefèvre, Jean-Michel Muller, "Worst Cases for Correct Rounding of the Elementary Functions in Double Precision", Rapport de recherche INRIA n°4044, November 2000, 19 pages.


