How do you read a double value from a string in C++ code?
std::stringstream in(mystring);
double x, sum = 0;
while (in >> x) {
    sum += x;
}
On an Intel Skylake with the GCC 8.3 compiler, this code parses at 50 MB/s. Hard disks can easily deliver sequential reads at several GB/s, so there is no doubt that it is not the disk read speed that limits us but the parsing speed. How can it be sped up?
The first idea that suggests itself is to give up the conveniences provided by C++ streams and call strtod(3) directly:
double number, sum = 0;
char *end;
do {
    number = strtod(s, &end);
    if (end == s) break;
    sum += number;
    s = end;
} while (s < theend);
The speed rises to 90 MB/s; profiling shows that reading from a stream executes ~1600 instructions per number read, while strtod takes ~1100 instructions per number. The C and C++ standard libraries can be excused by their requirements of universality and portability; but if we limit ourselves to parsing only double and only on x64, we can write much more efficient code: 280 instructions per number are enough.
Parsing integers
Let's start with a subtask: given an eight-digit decimal number, convert it to an int. At school we were all taught to do this with a loop:
int sum = 0;
for (int i = 0; i <= 7; i++) {
    sum = (sum * 10) + (num[i] - '0');
}
return sum;
Compiled with GCC 9.3.1 at -O3, this code processes 3 GB/s on my machine. The obvious way to speed it up is to unroll the loop, but the compiler already does that by itself.
Note that the decimal notation of an eight-digit number fits entirely into an int64_t variable: for example, the string "87654321" is stored in memory the same way as the int64_t value 0x3132333435363738 (the first byte holds the least significant 8 bits, the last byte the most significant). This means that instead of eight memory accesses we can load the whole number at once and extract the value of each digit with a shift:
int64_t sum = *(int64_t*)num;
return (sum & 15) * 10000000 +
       ((sum >> 8) & 15) * 1000000 +
       ((sum >> 16) & 15) * 100000 +
       ((sum >> 24) & 15) * 10000 +
       ((sum >> 32) & 15) * 1000 +
       ((sum >> 40) & 15) * 100 +
       ((sum >> 48) & 15) * 10 +
       ((sum >> 56) & 15);
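A caveat: the cast *(int64_t*)num formally breaks C++ aliasing rules and may perform an unaligned access. The portable way to express the same single load, which GCC and Clang compile into one mov instruction, is memcpy; a minimal sketch, with a helper name of my own choosing:

#include <cstdint>
#include <cstring>

// Load eight characters as one 64-bit integer without undefined behavior;
// optimizing compilers turn this memcpy into a single 8-byte load.
inline int64_t load_eight_chars(const char *num) {
    int64_t sum;
    std::memcpy(&sum, num, sizeof(sum));
    return sum;
}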
There is no speedup yet; on the contrary, the speed drops to 1.7 GB/s! Let's go further: ANDing with the mask 0x000F000F000F000F gives us 0x0002000400060008, the values of the digits standing at even positions of the string. Let's combine each of them with the digit that follows it:
sum = ((sum & 0x000F000F000F000F) * 10) +
      ((sum & 0x0F000F000F000F00) >> 8);
After that, 0x3132333435363738 turns into 0x0015002B00410057, i.e. 0x00(21)00(43)00(65)00(87) with the parenthesized bytes written in decimal: every other byte is zeroed, and each remaining byte contains the binary representation of a two-digit decimal number.
return (sum & 255) * 1000000 +
       ((sum >> 16) & 255) * 10000 +
       ((sum >> 32) & 255) * 100 +
       ((sum >> 48) & 255);
- completes the conversion of an eight-digit number.
The same trick can be repeated with pairs of two-digit numbers:
sum = ((sum & 0x0000007F0000007F) * 100) +
      ((sum >> 16) & 0x0000007F0000007F);
- 0x00(21)00(43)00(65)00(87) becomes 0x0000(4321)0000(8765);
and with the resulting pair of four-digit ones:
return ((sum & 0x3FFF) * 10000) + ((sum >> 32) & 0x3FFF);
- 0x0000(4321)0000(8765) becomes 0x00000000(87654321). The lower half of the resulting int64_t is the desired result. Despite the noticeable simplification of the code (three multiplications instead of seven), the parsing speed (2.6 GB/s) remains worse than that of the naive implementation.
But the pair-combining step can be simplified even further if we notice that the next step looks at the result only through the mask 0x007F007F007F007F, which means any garbage can be left in the bytes being zeroed:
sum = ((sum & 0x0F0F0F0F0F0F0F0F) * 10) + ((sum & 0x0F0F0F0F0F0F0F0F) >> 8) =
    = (((sum & 0x0F0F0F0F0F0F0F0F) * 2560) + (sum & 0x0F0F0F0F0F0F0F0F)) >> 8 =
    = ((sum & 0x0F0F0F0F0F0F0F0F) * 2561) >> 8;
The net result: one mask instead of two, and no addition at all. The two remaining expressions simplify in the same way:
sum = ((sum & 0x00FF00FF00FF00FF) * 6553601) >> 16;
return((sum & 0x0000FFFF0000FFFF) * 42949672960001) >> 32;
This trick is called SIMD within a register (SWAR); with it, the parsing speed grows to 3.6 GB/s.
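Putting the three steps together yields a complete eight-digit converter. The wrapper below is a sketch (the memcpy load and the function name are mine), but essentially the same routine appears in Lemire's library under the name parse_eight_digits_unrolled:

#include <cstdint>
#include <cstring>

// SWAR conversion of exactly eight ASCII digits into an integer:
// three multiply-mask-shift steps instead of a per-digit loop.
uint32_t parse_eight_digits(const char *num) {
    uint64_t sum;
    std::memcpy(&sum, num, 8);                                    // one 8-byte load
    sum = ((sum & 0x0F0F0F0F0F0F0F0F) * 2561) >> 8;               // digits -> two-digit pairs
    sum = ((sum & 0x00FF00FF00FF00FF) * 6553601) >> 16;           // pairs -> four-digit groups
    return ((sum & 0x0000FFFF0000FFFF) * 42949672960001) >> 32;   // groups -> full number
}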
A similar SWAR trick can be used to check if an eight-character string is an eight-digit decimal number:
return ((val & 0xF0F0F0F0F0F0F0F0) |
        (((val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0) >> 4))
       == 0x3333333333333333;
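Wrapped into a self-contained function (the wrapper and the memcpy load are my additions), the check looks like this:

#include <cstdint>
#include <cstring>

// Return true if all eight characters are ASCII decimal digits.
// A byte is a digit iff its high nibble is 3 and its low nibble,
// after adding 6, does not carry into the high nibble.
bool is_eight_digits(const char *chars) {
    uint64_t val;
    std::memcpy(&val, chars, 8);
    return ((val & 0xF0F0F0F0F0F0F0F0) |
            (((val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0) >> 4))
           == 0x3333333333333333;
}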
The anatomy of a double
Within these numbers, the IEEE standard allots 52 bits to the mantissa and 11 bits to the exponent; this means a number is stored as ±(1 + m/2^52)·2^(e−1023), where 0 ≤ m < 2^52 and 1 ≤ e ≤ 2046. In particular, this means that in the decimal notation of a double only the 17 most significant digits matter: the remaining digits cannot affect the double value in any way. And even 17 significant digits is far more than any practical application is likely to need. The picture is complicated slightly by denormalized numbers (from 2^-1074 to 2^-1022, with a correspondingly smaller number of significant bits in the mantissa) and by the rounding requirements: any decimal number must be represented by the nearest binary one, and if it lies exactly in the middle between the two nearest binary values, the one with the even mantissa is preferred. All this guarantees that if computer A converts a number X into a decimal string with 17 significant digits, then any computer B reading that string will get back the very same number X, bit for bit, regardless of the processor models, operating systems, and programming languages on A and B. (Imagine how much fun debugging your code would be if rounding errors differed from one computer to another.)
The rounding requirements are also the source of the misunderstandings recently discussed on Habr: the decimal fraction 0.1 is represented by the nearest binary fraction, equal to exactly 0.1000000000000000055511151231257827021181583404541015625; 0.2 by the one equal to exactly 0.200000000000000011102230246251565404236316680908203125; but their sum is not the binary fraction closest to 0.3: the approximation from below, equal to exactly 0.299999999999999988897769753748434595763683319091796875, turns out to be more accurate. Therefore, the IEEE standard requires that adding 0.1 + 0.2 produce a result different from 0.3.
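This is easy to observe directly; a minimal demonstration, printing 17 significant digits (enough to distinguish any two doubles):

#include <cstdio>

int main() {
    double a = 0.1, b = 0.2, c = 0.3;
    // %.17g prints the 17 significant digits that uniquely identify a double.
    std::printf("0.1 + 0.2 = %.17g\n", a + b);    // 0.30000000000000004
    std::printf("0.3       = %.17g\n", c);        // 0.29999999999999999
    std::printf("equal? %d\n", (a + b == c));     // 0
}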
Parsing double
A straightforward generalization of the integer algorithm would be iterated multiplication and division by 10.0; unlike integer parsing, the digits here would have to be processed from the least significant to the most significant, so that the rounding errors in the high digits "hide" the rounding errors in the low ones. Parsing the mantissa, meanwhile, reduces easily to parsing an integer: it is enough to change the normalization so that the imaginary binary point sits not at the beginning of the mantissa but at its end; to multiply or divide the resulting 53-bit integer by the required power of ten; and finally to subtract 52 from the exponent, i.e. move the point 52 bits back to the left, where the IEEE standard expects it. Beyond that, three important facts are worth noting (a sketch of the resulting multiplication follows the list):
- it is enough to learn to multiply and divide by 5: multiplying and dividing by 2 amounts to incrementing or decrementing the exponent;
- the value ⅕, normalized into a uint64_t, is 0xcccccccccccccccd/2^66: it is accurate to all 64 significant bits (the last bit is rounded up);
- the powers of ten needed for a double range from 10^-324 (where the denormals end) up to 10^308: 324 negative exponents and 309 non-negative ones, 633 values in all, and each of them can be precomputed and stored, normalized to 64 bits the way 0xcccccccccccccccd is. (The 53 bits of a double mantissa would not be enough: for example, the double closest to ⅕ has the mantissa 7205759403792794, which is 0xcccccccccccccccd rounded to 53 bits, and products of 53-bit by 53-bit values could not always be rounded correctly.) The parsed mantissa, held in 64 bits, is multiplied by the 64-bit stored power of ten, and the correctly rounded 53-bit result is extracted from the 128-bit product.
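To illustrate the last point, here is a minimal sketch of the central multiplication, assuming a compiler that provides __uint128_t (GCC and Clang on x64 do); the function and parameter names are illustrative, not taken from Lemire's code:

#include <cstdint>

// Multiply the 64-bit decimal mantissa by a precomputed 64-bit power of ten
// (normalized as described above); the high half of the 128-bit product
// holds the binary mantissa, which is then rounded down to 53 bits.
uint64_t scale_mantissa(uint64_t mantissa, uint64_t power_of_ten_64) {
    __uint128_t product = (__uint128_t)mantissa * power_of_ten_64;
    return (uint64_t)(product >> 64);
}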
The difficulty with standard rounding is that, to find out that the result lands exactly in the middle between two adjacent binary fractions, we need not only 54 significant bits of the result (a zero in the least significant of them means everything is fine, a one means we hit the middle), but also to check whether anything non-zero follows that last bit. In particular, this means that looking at only the 17 most significant digits of the decimal notation is not enough: they determine the binary mantissa only to within ±1, and to choose the rounding direction the lower digits have to be examined as well. For example, 10000000000000003 and 10000000000000005 are the same double value (equal to 10000000000000004), while 10000000000000005.00(many zeros)001 is already a different value (equal to 10000000000000006).
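These boundary cases can be verified with any correctly rounding parser, for example strtod:

#include <cstdio>
#include <cstdlib>

int main() {
    // Both strings land exactly halfway between two doubles and round to the even mantissa:
    std::printf("%.17g\n", std::strtod("10000000000000003", nullptr)); // 10000000000000004
    std::printf("%.17g\n", std::strtod("10000000000000005", nullptr)); // 10000000000000004
    // A non-zero tail after the zeros tips the rounding upward:
    std::printf("%.17g\n", std::strtod("10000000000000005.00000000000000000001",
                                       nullptr));                      // 10000000000000006
}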
Professor Daniel Lemire of TÉLUQ (Quebec's distance-learning university), who invented this parser, published it on GitHub. The library provides two from_chars functions: one parses a string into a double, the other into a float. His experiments showed that in 99.8% of cases it is enough to consider 19 significant decimal digits (64 bits): if two consecutive values of the 64-bit mantissa lead to the same final double, then that is the correct value. Only in the remaining 0.2% of cases, when the two values disagree, does more complex and more general code run, implementing always-correct rounding.
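A typical call looks something like this (assuming the header path and namespace used by the fast_float library):

#include "fast_float/fast_float.h"
#include <cstdio>
#include <system_error>

int main() {
    const char input[] = "3.1416 and more";
    double result;
    // Parses from the start of the range; answer.ptr points past the parsed number.
    auto answer = fast_float::from_chars(input, input + sizeof(input) - 1, result);
    if (answer.ec == std::errc()) {
        std::printf("parsed: %.17g\n", result);
    }
}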