Accurate and fast calculations for floating point numbers using the example of the sine function. Part 3: fixed-point

We continue the series of articles (part 1 and part 2). In part 2 we looked at what is inside the libm library; in this part we will try to modify the do_sin function slightly to improve its accuracy and speed. Here is that function (do_sin) once again:



(image: the do_sin function listing from libm)



As shown in the previous article, lines 132-145 are executed for x within the range [0.126, 0.855469]. Let's try to write a function that, within these limits, is more accurate and, possibly, faster.



The approach we will use is fairly obvious: we need to carry more significant digits through the intermediate calculations. The obvious solution would be to take the long double type, compute in it, and then convert back. In terms of accuracy this should be fine, but in terms of performance there may be problems: long double is a rather exotic data type, and supporting it is not a priority in modern processors. On x86_64 the SSE/AVX instructions do not work with it at all, so we would have to dust off the x87 math coprocessor.



What, then, should we choose? Let's take a closer look at the ranges of the argument and of the function itself.



Both lie in the neighborhood of 1.0, i.e. we do not actually need floating point here. Let's use a 64-bit integer for the calculation. This gives us an extra 10-11 bits on top of the original precision. Let's figure out how to work with such numbers. A number in this format is represented as a/d, where a is an integer and d is a divisor that we choose to be the same constant for all variables and keep "in our head" rather than in the computer's memory. Some operations on such numbers are shown below:

$$\frac{c}{d} = \frac{a}{d} \pm \frac{b}{d} = \frac{a \pm b}{d}, \qquad
\frac{c}{d} = \frac{a}{d} \cdot \frac{b}{d} = \frac{a \cdot b}{d^2}, \qquad
\frac{c}{d} = \frac{a}{d} \cdot x = \frac{a \cdot x}{d}$$



As you can see, there is nothing complicated here. The last formula shows multiplication by an arbitrary integer. Note also the fairly obvious fact that the product of two unsigned N-bit integers is, in general, a number of up to 2N bits, and that addition can overflow by at most one extra bit.
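To make this concrete, here is a minimal sketch (my illustration, not code from the article) of these three operations with a 32-bit value and the divisor d = 2^32, so that the full product of two values fits into a uint64_t:

#include <cstdint>
#include <cstdio>

// A value is stored as a 32-bit integer a; the number it represents is a / 2^32,
// so representable values lie in [0, 1).
using fx32 = uint32_t;

fx32   from_double(double v) { return (fx32)(v * 4294967296.0); }   // v in [0, 1)
double to_double(fx32 a)     { return (double)a / 4294967296.0; }

// c/d = a/d + b/d = (a + b)/d : plain integer addition (may overflow by one bit).
fx32 fx_add(fx32 a, fx32 b) { return a + b; }

// c/d = (a/d)*(b/d) = a*b/d^2 : take the 32x32 -> 64-bit product and divide by d once.
fx32 fx_mul(fx32 a, fx32 b) { return (fx32)(((uint64_t)a * b) >> 32); }

// c/d = (a/d)*x = a*x/d : multiplication by a plain integer needs no correction.
fx32 fx_scale(fx32 a, uint32_t x) { return a * x; }

int main() {
  fx32 a = from_double(0.25), b = from_double(0.5);
  std::printf("%f %f %f\n",
              to_double(fx_add(a, b)),     // 0.75
              to_double(fx_mul(a, b)),     // 0.125
              to_double(fx_scale(a, 3)));  // 0.75
}

The article below settles on d = 2^64, which is exactly what makes the wide multiplication instructions discussed next so convenient.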



Let's try to choose the divisor d. Obviously, in the binary world it is best to pick a power of two, so that instead of dividing we simply shift the register. But which power of two? A hint can be found in the machine multiplication instructions. For example, the standard MUL instruction on x86 multiplies two registers and writes the result into two registers as well: one register holds the "upper" part of the result, the other the "lower" part.



For example, if we multiply two 64-bit numbers, the result is a 128-bit number written into two 64-bit registers. Let's call RH the "upper" register and RL the "lower" one 1 . Mathematically, the result can then be written as $R = R_H \cdot 2^{64} + R_L$. Now we use the formulas above and write out the multiplication for $d = 2^{64}$:

$$\frac{c}{d} = \frac{a}{2^{64}} \cdot \frac{b}{2^{64}} = \frac{a \cdot b}{2^{128}} = \frac{R_H \cdot 2^{64} + R_L}{2^{128}} = \frac{R_H + R_L \cdot 2^{-64}}{2^{64}}$$



And it turns out that the result of multiplying two such fixed-point numbers is simply the register $R = R_H$: the remaining term $R_L \cdot 2^{-64}$ is less than one and is lost to truncation.



On AArch64 it is even simpler: the UMULH instruction multiplies two registers and writes the "upper" half of the product into a third register.
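Two things used in the code below are not shown in this excerpt (the full source is on the author's github): the mynumber type, presumably the same double / two-32-bit-words union as in the previous parts, and the mul2 helper. Assuming mul2 simply returns the upper 64 bits of the 128-bit product, a minimal sketch for GCC/Clang could look like this; it should compile down to a single MUL on x86_64 and to UMULH on AArch64:

#include <cstdint>

// Upper 64 bits of the 128-bit product of two 64-bit fixed-point numbers,
// i.e. exactly the multiplication formula above. This is an assumption about
// what mul2 does, not the repository's actual implementation.
static inline uint64_t mul2(uint64_t a, uint64_t b) {
  return (uint64_t)(((unsigned __int128)a * b) >> 64);
}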



Well then. We have defined a fixed-point number format, but one problem remains: negative numbers. The terms of the Taylor series alternate in sign. To cope with this, we transform the Horner-scheme formula for evaluating the polynomial into the following form:

$$\sin(x) \approx x\left(1 - x^2\left(\frac{1}{3!} - x^2\left(\frac{1}{5!} - x^2\left(\frac{1}{7!} - x^2 \cdot \frac{1}{9!}\right)\right)\right)\right)$$



You can check that this is mathematically identical to the original formula. But now each parenthesis contains a number of the form $1/(2n+1)! - x^2 \cdot (\cdots)$, which is always positive for our range of x. In other words, this transformation lets the expression be evaluated entirely in unsigned integers.
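Expanding the nested parentheses term by term (a quick check of my own, not part of the original text) gives back the truncated Taylor series:

$$x\left(1 - x^2\left(\frac{1}{3!} - x^2\left(\frac{1}{5!} - x^2\left(\frac{1}{7!} - \frac{x^2}{9!}\right)\right)\right)\right) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}$$

Moreover, every parenthesized value lies in (0, 1), so it fits the unsigned fixed-point format with $d = 2^{64}$.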



constexpr mynumber toint    = {{0x00000000, 0x43F00000}};  /*  18446744073709551616 = 2^64     */
constexpr mynumber todouble = {{0x00000000, 0x3BF00000}};  /*  ~5.42101086242752217003726400434E-20 = 2^-64     */

double sin_e7(double xd) {
  uint64_t x = xd * toint.x;
  uint64_t xx = mul2(x, x);
  uint64_t res = tsx[19]; 
  for(int i = 17; i >= 3; i -= 2) {
    res = tsx[i] - mul2(res, xx);
  }
  res = mul2(res, xx);
  res = x - mul2(x, res);
  return res * todouble.x;
}
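A quick usage check (my example, assuming it is appended to the same file as sin_e7 above; the argument 0.5 lies inside the valid range given earlier):

#include <cstdio>
#include <cmath>

int main() {
  double x = 0.5;  // inside [0.126, 0.855469]
  std::printf("sin_e7:   %.17g\n", sin_e7(x));
  std::printf("std::sin: %.17g\n", std::sin(x));
}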


tsx[i] values
constexpr array<uint64_t, 20> tsx = { // 2^64/i!
    0x0000000000000000LL,
    0x0000000000000000LL,
    0x8000000000000000LL,
    0x2aaaaaaaaaaaaaaaLL, // Change to 0x2aaaaaaaaaaaaaafLL and check.
    0x0aaaaaaaaaaaaaaaLL,
    0x0222222222222222LL,
    0x005b05b05b05b05bLL,
    0x000d00d00d00d00dLL,
    0x0001a01a01a01a01LL,
    0x00002e3bc74aad8eLL,
    0x0000049f93edde27LL,
    0x0000006b99159fd5LL,
    0x00000008f76c77fcLL,
    0x00000000b092309dLL,
    0x000000000c9cba54LL,
    0x0000000000d73f9fLL,
    0x00000000000d73f9LL,
    0x000000000000ca96LL,
    0x0000000000000b41LL, // tsx[18], tsx[19]: 2^64/18! and 2^64/19!,
    0x0000000000000097LL  // needed because sin_e7 starts from tsx[19]
};




Here tsx[i] = 1/i! in the fixed-point format. This time, for convenience, I have put all the code on github (fast_sine), dropped quadmath for compatibility with clang and ARM, and slightly changed the way the error is calculated.
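For reference, the table can be regenerated by dividing 2^64 by i! in 128-bit arithmetic and truncating (a small sketch of mine; the repository may compute the constants differently):

#include <cstdint>
#include <cstdio>

// Print 2^64 / i! truncated to 64 bits, i.e. the tsx[] table above.
// tsx[0] and tsx[1] do not fit into 64 bits and are never used, so print 0 for them.
int main() {
  unsigned __int128 two64 = (unsigned __int128)1 << 64;
  unsigned __int128 fact = 1;
  for (int i = 0; i <= 19; ++i) {
    if (i > 1) fact *= i;
    unsigned long long v = (i < 2) ? 0ULL : (unsigned long long)(two64 / fact);
    std::printf("0x%016llx, // 2^64/%d!\n", v, i);
  }
}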



A comparison of the standard sine function and the fixed-point version is given in the two tables below. The first table shows the calculation accuracy (it is exactly the same on x86_64 and ARM); the second compares performance.



Function    Error rate    Maximum ULP    Average deviation
sin_e7      0.0822187%    0.504787       7.10578e-20
sin_e7a     0.0560688%    0.503336       2.0985e-20
std::sin    0.234681%     0.515376       ---




During testing, the "true" sine value was computed with the MPFR library. The maximum ULP value is the maximum deviation from this "true" value. The error rate is the fraction of cases in which the sine value computed by our function or by libm did not coincide with the "true" value rounded to double. The average deviation shows the "direction" of the calculation error: whether the result is over- or underestimated. As the table shows, our function tends to overestimate the sine. This can be fixed! Who said that the tsx values have to be exactly equal to the Taylor series coefficients? A rather obvious idea suggests itself: vary the coefficients to improve the accuracy of the approximation and remove the constant component of the error. Doing such a variation properly is quite difficult, but we can try. Let's take, for example, the 4th value in the tsx array (tsx[3]) and change its last hex digit from a to f. Rerun the program and look at the accuracy (sin_e7a): it has improved, if only a little. We add this trick to our collection.
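The testing code itself is in the repository; a stripped-down sketch of the idea (my reconstruction, assuming MPFR is used as the reference) could look like this:

#include <mpfr.h>
#include <cmath>
#include <cstdio>

double sin_e7(double xd);   // the fixed-point function above (link against the repository code)

// "True" sine: compute at high precision with MPFR, then round once to double.
double sin_reference(double x) {
  mpfr_t t;
  mpfr_init2(t, 128);                   // 128 bits of working precision
  mpfr_set_d(t, x, MPFR_RNDN);
  mpfr_sin(t, t, MPFR_RNDN);
  double r = mpfr_get_d(t, MPFR_RNDN);
  mpfr_clear(t);
  return r;
}

int main() {
  int bad_e7 = 0, bad_std = 0, n = 0;
  for (double x = 0.126; x < 0.855469; x += 1e-6, ++n) {
    double ref = sin_reference(x);
    if (sin_e7(x)   != ref) ++bad_e7;   // exact comparison: we want the correctly rounded result
    if (std::sin(x) != ref) ++bad_std;
  }
  std::printf("sin_e7: %.6f%%  std::sin: %.6f%%\n",
              100.0 * bad_e7 / n, 100.0 * bad_std / n);
}

The real test also accumulates the maximum ULP and the average deviation, which this sketch omits.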



Now let's look at performance. For testing I took what was at hand: a mobile i5 and a slightly overclocked Raspberry Pi 4 (8 GB), with GCC 10 from the Ubuntu 20.04 x64 distribution on both systems.



Function    x86_64 time, s    ARM time, s
sin_e7      0.174371          0.469210
std::sin    0.154805          0.447807
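The timing numbers above presumably come from the repository's test program; a minimal sketch of such a measurement (my own, with an assumed iteration count) might be:

#include <chrono>
#include <cmath>
#include <cstdio>

double sin_e7(double xd);               // the fixed-point function above

int main() {
  const int N = 100'000'000;            // assumed iteration count
  volatile double sink = 0.0;           // keep the optimizer from removing the loop
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < N; ++i)
    sink = sink + sin_e7(0.126 + 0.729469 * i / N);   // sweep the valid range
  auto t1 = std::chrono::steady_clock::now();
  std::printf("sin_e7: %f s\n", std::chrono::duration<double>(t1 - t0).count());
}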




I do not claim great accuracy for these measurements: variations of several tens of percent are possible depending on processor load. The main conclusion is this: switching to integer arithmetic does not give a performance gain on modern processors 2 . The unimaginable number of transistors in modern chips makes it possible to perform complex floating-point calculations quickly. But I think that on processors such as the Intel Atom, as well as on weak microcontrollers, this approach can give a significant performance gain. What do you think?



While this approach does yield a gain in accuracy, that gain seems more interesting than useful. In terms of performance, the approach may find a niche, for example, in IoT, but for high-performance computing it is no longer mainstream. In today's world of SSE/AVX/CUDA, functions are preferably evaluated in parallel, and in floating-point arithmetic; there are no parallel analogues of the widening MUL instruction. This function is more of a tribute to tradition.



In the next part I will describe how AVX can be used effectively for these calculations. Once again we will dig into the libm code and try to improve it.



1 There are no registers with these names in any processor known to me; the names were chosen just for the example.

2 It should be noted that my ARM board is equipped with a modern hardware FPU. If the processor had to emulate floating-point calculations, the results could be dramatically different.


