Accurate and fast floating-point calculations, using the sine function as an example. Introduction and Part 1

I carefully read the very good articles by ArtemKaravaev on the addition of floating-point numbers. The topic is very interesting, and I would like to continue it and show with examples how to work with floating-point numbers in practice. We will take the GNU glibc library (libm) as a reference, and so that the article is not too boring, we will add a competitive element: we will try not only to reproduce the library code, but also to improve it, making it faster and/or more accurate.



As an example, I have chosen the trigonometric sine function. It is a widespread function whose mathematics is well known from school and university, and its implementation provides many vivid examples of "correct" work with numbers. I will use double as the floating-point type.



A lot is planned in this series of articles, from mathematics to machine code and compiler options. The articles are written in C++, but without "frills": in contrast to C, the working examples will be more readable even for people unfamiliar with the language and will take up fewer lines.



The articles are written by the immersion method: we will examine subtasks that will then come together into a single solution of the problem.



Decomposition of the sine into a Taylor series



The sine function expands into an infinite Taylor series:



$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \dots$$



It is clear that we cannot sum an infinite series, except in cases when there is an analytic formula for the infinite sum, and that is not our case. Suppose we want to calculate the sine on the interval [0, π/2]; working with intervals will be discussed in more detail in Part 3. Knowing that sin(π/2) = 1, we look for the first term that can be discarded, based on the condition (π/2)^n / n! < ε, where ε is the difference between the number 1 and the smallest representable number greater than 1. Roughly speaking, this is the last bit of the mantissa ( wiki ). It is easiest to solve this inequality by brute force. For ε ≈ 2.22×10⁻¹⁶, the term with n = 23 can already be dropped. The correct choice of the number of terms will be discussed in one of the next parts, so for today we will play it safe and take the terms up to n = 25 inclusive.

The last term is approximately 10,000 times smaller than ε.



Simplest solution



Our hands are already itching, so let's write:



Full text of the program for testing
#include <iostream>
#include <iomanip>
#include <cmath>
#include <array>
#include <bitset>
#include <quadmath.h>
// For clang, the quadmath.h shipped with GCC may need an explicit path:
//#include "/usr/lib/gcc/x86_64-linux-gnu/10/include/quadmath.h"
#include <numeric>
#include <limits>
#include <vector>

#include <boost/timer/timer.hpp>
#include <boost/math/special_functions/factorials.hpp>

namespace bm = boost::math;

using namespace std;

typedef union { uint32_t i[2]; double x; } mynumber;

array<double, 26> fc; // Taylor coefficients; sin_e4/sin_e5 read fc[25], so size 26

double sin_e1(double x) {
  double result = 0;
  int sign = 1;
  for(int i = 1; i < 25; i += 2) {
    result += sign * pow(x, i) / bm::unchecked_factorial<double>(i);
    sign = -sign;
  }
  return result;
}

double sin_e2(double x) {
  double result = 0;
  int sign = 1;
  double xx = x * x;
  double pw = x;
  double fti = 1.0;
  for(int i = 1; i < 25; i += 2) {
    fti /= i;
    result += sign * pw * fti;
    fti /= ( i + 1 );
    sign = -sign;
    pw  *= xx;
  }
  return result;
}

double sin_e3(double x) {
  double result = 0;
  for(int i = 25; i >= 1; i -= 2) {
    result += (((i - 1) % 4 == 0) ? 1 : -1 ) * pow(x, i) / bm::unchecked_factorial<double>(i);
  }
  return result;
}

double sin_e4(double x) {
  double xx = x * x;
  double res = fc[25];
  for(int i = 23; i >= 1; i -= 2) {
    res = fc[i] + xx * res;
  }
  return x * res;
}

double sin_e5(double x) {
  double xx = x * x;
  double res = fc[25];
  for(int i = 23; i >= 3; i -= 2) {
    res = fc[i] + xx * res;
  }
  return x + x * xx * res;
}

inline
double fsin(double x) {
  double result;
  // the x87 fsin instruction, used as one of the candidates for comparison
  asm ("fsin" :"=t" (result) : "0" (x));
  return result;
}

#define SIN(a) fsin(a)
//#define SIN(a) sin(a)
//#define SIN(a) sin_e5(a)
// ^^ select the function under test here ^^

int main() {

  __uint128_t ft = 1;
  fc[1] = 1.0;
  for(int i = 2; i < fc.size(); i++) {
    ft *= i;
    // factorial with sign for Taylor series
    fc[i] = (1.0 / ft) * (( (i - 2) % 4 < 2) ? -1 : 1);
  }
  vector<double> xv;
  xv.resize(8 * 2000000);
  // Fill the test points uniformly over the interval from 0 to M_PI/2.
  for (int i = 0; i < xv.size(); i++) {
    xv[i] = (M_PI / 2) * i / double(xv.size());
  }

  double res = 0;
  {
    boost::timer::auto_cpu_timer at;
    for(int i = 0; i < xv.size(); i++) {
      res += SIN(xv[i]);
    }
  }

  int co = 0, cn = 0;
  // Compare the accuracy of the old (libm) and the new function.
  __float128 avg = 0.0, div = 0.0;
  for(int i = 0; i < xv.size(); i++) {
    mynumber dold, dnew;
    dold.x = sin(xv[i]);
    dnew.x = SIN(xv[i]);
    __float128 q = sinq(xv[i]); // sinq serves as the reference ("true") value.
    __float128 dd = __float128(dnew.x) - q;
    // Accumulate the error statistics.
    div += dd * dd;
    avg += dd;
    // If the two results do not match bit for bit,
    // record which of the two is farther from the true value.
    if( dold.i[0] != dnew.i[0] || dold.i[1] != dnew.i[1] ) {
      if( fabsq(q - dold.x) <= fabsq(q - dnew.x) )
        co++;
      else
        cn++;
    }
  }
  avg /= xv.size();
  div /= xv.size();

  cout << res << endl;

  // Cases where the libm result is closer to the true value.
  cout << "Better libm: " <<  co << " / " << xv.size() << "(" << 100.0 * co / xv.size() << "%)" << endl;

  // Cases where the new function's result is closer to the true value.
  cout << "Better new: " <<  cn << " / " << xv.size() << "(" << 100.0 * cn / xv.size() << "%)" << endl;

  // Mean and standard deviation of the new function's error.
  cout << "  Avg / std new: " << double(avg) << " / " << double(sqrtq( div - avg * avg )) << endl;
  return 0;
}








double sin_e1(double x) {
  double result = 0;
  int sign = 1;
  for(int i = 1; i < 25; i += 2) {
    result += sign * pow(x, i) / bm::unchecked_factorial<double>(i);
    sign = -sign;
  }
  return result;
}


I think many of you immediately saw how to speed up this program. How many times do you think your changes can speed it up? The optimized version and the answer to the question are under the spoiler.



Optimized version of the program.
double sin_e2(double x) {
  double result = 0;
  int sign = 1;
  double xx = x * x;
  double pw = x;
  double fti = 1.0;
  for(int i = 1; i < 25; i += 2) {
    fti /= i;
    result += sign * pw * fti;
    fti /= ( i + 1 );
    sign = -sign;
    pw  *= xx;
  }
  return result;
}


About 10,000 times (GNU C++ v10; -O2)



Improving accuracy



Methodology



We will judge the accuracy of the function by two standard parameters.



The standard deviation from the true value sin(float128), and the mean of this deviation. The latter can give important information about how our function behaves: it may systematically underestimate or overestimate the result.



In addition to these parameters, we introduce two more. Alongside our function, we call the library's built-in sin(double). If the results of the two functions, ours and the built-in one, do not match bit for bit, we record in the statistics which of the two is farther from the true value.



Summation order



Let's go back to the original example. How can we increase its accuracy "cheaply"? Those who carefully read the article "Is it possible to add N numbers of type double most accurately?" will most likely give the answer right away: reverse the loop, so that we add from the smallest term in absolute value to the largest.



double sin_e3(double x) {
  double result = 0;
  for(int i = 25; i >= 1; i -= 2) {
    result += (((i - 1) % 4 == 0) ? 1 : -1 ) * pow(x, i) / bm::unchecked_factorial<double>(i);
  }
  return result;
}


The results are shown in the table.



Function | Average error | STD         | Ours is better | libm is better
sin_e1   | -1.28562e-18  | 8.25717e-17 | 0.0588438%     | 53.5466%
sin_e3   | -3.4074e-21   | 3.39727e-17 | 0.0423%        | 10.8049%
sin_e4   |  8.79046e-18  | 4.77326e-17 | 0.0686%        | 27.6594%
sin_e5   |  8.78307e-18  | 3.69995e-17 | 0.0477062%     | 13.5105%


It may seem that using smart summation algorithms would reduce the error almost to zero, but this is not the case. Of course, such algorithms improve accuracy, but to get rid of the errors completely, smart multiplication algorithms are required as well. They exist, but they are very expensive: they involve many extra operations, so their use is not justified here. However, we will return to them later in a different context.



There is very little left: to combine the fast and the accurate algorithms. To do this, let's return to the Taylor series once more, restrict it, for example, to four terms, and make the following transformation.



$$\sin(x) \approx x\left(1 - x^2\left(\frac{1}{3!} - x^2\left(\frac{1}{5!} - x^2\left(\frac{1}{7!} - x^2\cdot\frac{1}{9!}\right)\right)\right)\right)$$





You can expand the parentheses and check that the original expression is recovered. This form fits very easily into a loop.



double sin_e4(double x) {
  double xx = x * x;
  double res = fc[25];
  for(int i = 23; i >= 1; i -= 2) {
    res = fc[i] + xx * res;
  }
  return x * res;
}


It works fast, but accuracy is lost compared to sin_e3. Once again, rounding is the problem. Let's look at the last step of the loop and transform the original expression a bit.

$$\sin(x) \approx x + x\cdot x^2\left(-\frac{1}{3!} + x^2\left(\frac{1}{5!} - \dots\right)\right)$$





And the corresponding code.



double sin_e5(double x) {
  double xx = x * x;
  double res = fc[25];
  for(int i = 23; i >= 3; i -= 2) {
    res = fc[i] + xx * res;
  }
  return x + x * xx * res;
}


The accuracy has doubled compared to libm. If you can guess why the accuracy increased, write in the comments. In addition, sin_e4 has one more, much more unpleasant accuracy-related problem that sin_e5 does not have. Try to guess what it is; I will tell you about it in detail in the next part.



If you like the article, in the next one I will tell you how GNU libc calculates the sine with a maximum error of 0.548 ULP.


