Several years ago I thought and wrote a lot about floating-point math. It was fascinating, and I learned a great deal in the process, but sometimes I go for long stretches without using those hard-won skills in practice. That's why I'm delighted every time I get to work on a bug that requires specialized knowledge. In this article I'll tell three stories about floating-point bugs that I investigated in Chromium.
Part 1: unrealistic expectations
The bug was titled "JSON does not parse 64-bit Integers correctly". It doesn't sound like a floating-point or browser issue at first, but it was filed on crbug.com, so I was asked to take a look. The easiest way to reproduce it is to open the Chrome developer tools (F12 or Ctrl+Shift+I) and paste the following code into the console:
json = JSON.parse('{"x": 2940078943461317278}'); alert(json['x']);
Pasting unknown code into the console window is a great way to get hacked, but this code was simple enough that I could tell it wasn't malicious. In the bug report, the author helpfully spelled out the expected and actual results:
What is the expected behavior? An integer value of 2940078943461317278 should be returned.
What is the error? An integer 2940078943461317000 is returned instead.
The "bug" was found on Linux, and I'm working on Chrome for Windows, but this behavior is cross-platform, and I had knowledge of floating point numbers, so I researched it.
Odd integer behavior like this is potentially a floating-point bug, because JavaScript doesn't really have an integer type. And for the same reason, it is not actually a bug at all.
The number in question is quite large, roughly 2.9e18, and that's the problem. Since JavaScript has no integer type, it uses IEEE-754 double-precision floating point for all of its numbers. This binary format has a sign bit, an 11-bit exponent, and a 53-bit mantissa (yes, that adds up to 65 bits; one mantissa bit is hidden by magic). The double type is so good at storing integers that many JavaScript programmers never notice the missing integer type. Very large numbers, however, destroy the illusion.
A JavaScript number can store any integer up to 2^53 exactly. Beyond that, it can store every even number up to 2^54. Beyond that, every multiple of four up to 2^55, and so on.
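A quick console session makes these thresholds visible (a minimal sketch; any modern JavaScript console will do):

2 ** 53 === 2 ** 53 + 1  // true: the +1 is rounded away
2 ** 53 + 2              // 9007199254740994, even numbers still store exactly
2 ** 54 === 2 ** 54 + 2  // true: above 2^54 only multiples of four survive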
The problem number, written in base-2 exponential notation, is approximately 1.275 * 2^61. Only a tiny fraction of the integers in that range can be represented: adjacent doubles there are 512 apart. Here are the three relevant numbers:
- 2,940,078,943,461,317,278 is the number the bug reporter wanted to store
- 2,940,078,943,461,317,120 is the closest double below it
- 2,940,078,943,461,317,632 is the closest double above it
The desired number falls between those two doubles, and the JSON module (like JavaScript itself, or any correctly implemented text-to-double conversion) did its best and returned the closest double. Simply put, the number the reporter wanted to store cannot be stored in JavaScript's built-in numeric type.
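You can confirm this in the console (a minimal sketch; the right-hand literal is the nearest double from the list above):

JSON.parse('{"x": 2940078943461317278}').x === 2940078943461317120  // true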
So far, everything is clear: if you push the limits of the language, you need to know more about how it works. But one mystery remains. The bug report says that this number was actually returned:
2,940,078,943,461,317,000
This is curious, because it is not the entered number, not the nearest double, and, in fact, not even a number that can be represented as a double!
This puzzle, too, is explained by the JavaScript specification. It says that when printing a number, an implementation must output just enough digits to uniquely identify it, and no more. That is useful for printing numbers like 0.1, which cannot be represented exactly as a double. If JavaScript were required to print 0.1 as the value actually stored, it would have to output:
0.1000000000000000055511151231257827021181583404541015625
That would be an accurate result, but it would just confuse people without adding anything useful. The exact rules can be found in the spec (search for "ToString Applied to the Number Type"). I don't think the spec requires the trailing zeros, but it certainly allows them.
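You can coax the stored value out of the console yourself (a minimal sketch; note that toFixed accepts more than 20 digits only in modern engines):

(0.1).toString()   // "0.1", the shortest string that round-trips
(0.1).toFixed(55)  // "0.1000000000000000055511151231257827021181583404541015625"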
So, when the program runs, JavaScript outputs 2,940,078,943,461,317,000 because:
- The original value was lost when it was stored as a JavaScript number
- The printed number is close enough to the stored value to uniquely identify it
- The printed number is the simplest number that uniquely identifies the stored value
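A minimal sketch of the round trip, using the numbers above:

2940078943461317000 === 2940078943461317120  // true: both round to the same double
(2940078943461317120).toString()             // "2940078943461317000"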
Everything works as intended, so this is not a bug, and the issue was closed as WontFix. The original bug can be found here.
Part 2: bad epsilon
This time there really was a bug, and I fixed it, first in Chromium and then in googletest, to spare future generations of developers the same confusion.
The bug was a flaky test failure that suddenly started happening. We hate flaky test failures. They are especially confusing when they appear in a test that hasn't changed in years. A few weeks later, I was brought in to investigate. The error message (slightly reformatted for line length) started like this:
The difference between expected_microseconds and converted_microseconds is 512, which exceeds 1.0
Yes, that sounds bad. This is a googletest error message saying that two floating-point values that were supposed to be within 1.0 of each other are actually 512 apart. The first piece of evidence was the difference itself: two numbers separated by exactly 2^9 seemed very suspicious. Coincidence? I don't think so. The rest of the message, which showed the two values being compared, convinced me further:
expected_microseconds evaluates to 4.2934311416234112e+18,
converted_microseconds evaluates to 4.2934311416234107e+18
If you've fought with IEEE 754 long enough, you'll immediately understand what's going on.
If you've read the first part, you may feel déjà vu because the numbers are similar. That is pure coincidence: I'm simply using the numbers I actually encountered. This time they were displayed in exponential notation, which adds a little variety to the article.
The underlying problem is a variation of the one from the first part: floating-point numbers are not the real numbers mathematicians use. They become less precise as they get larger, and in the range of the failing numbers every double is necessarily a multiple of 512. A double has 53 bits of precision, and these numbers were much larger than 2^53, so a significant loss of precision was inevitable. Now we can understand the problem.
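You can check that spacing in a JavaScript console, since JS numbers are the same IEEE-754 doubles (a minimal sketch using the value from the error message above):

// Between 2^61 and 2^62 adjacent doubles are 2^(61-52) = 512 apart,
// so every representable value there is a multiple of 512:
4293431141623410688 % 512  // 0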
The test computed the same value in two different ways and then checked that the results were close, with "close" meaning a difference within 1.0. The two computations gave very similar answers, so in most cases the results rounded to the same double. From time to time, however, the correct answer landed near a rounding boundary, and one computation rounded one way while the other rounded the other way. More specifically, the following numbers ended up being compared:
- 4293431141623410688
- 4293431141623411200
Without the exponents it is easier to see that they are exactly 512 apart. The two infinitely precise results produced by the test functions always differed by less than 1.0; when they were values like 429...10653.5 and 429...10654.3, both rounded to 429...10688 and all was well. The problem occurred when the infinitely precise results were close to a value like 4,293,431,141,623,410,944, which lies exactly halfway between two doubles. If one function produced 429...10943.9 and the other 429...10944.1, then those results, separated by a mere 0.2, rounded in opposite directions and ended up 512 apart!
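A minimal sketch of that boundary, again in a JS console (the literals sit on either side of the midpoint 4,293,431,141,623,410,944):

4293431141623410943 === 4293431141623410688  // true: rounds down
4293431141623410945 === 4293431141623411200  // true: rounds up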
That is the nature of a rounding boundary, or step function. You can have two results that are arbitrarily close to each other but on opposite sides of the boundary (the point exactly halfway between two doubles), and they will round in different directions. Changing the rounding mode is often suggested, but it doesn't help: it just moves the boundary.
It's like having a baby around midnight: a tiny difference in timing can permanently change the date (and perhaps the year, century, or millennium) under which the event is recorded.
Perhaps my commit message was overly dramatic, but it was accurate. I felt like the one specialist uniquely equipped to handle the situation:
commit 6c2427457b0c5ebaefa5c1a6003117ca8126e7bc
Author: Bruce Dawson
Date: Fri Dec 08 21:58:50 2017
Fix epsilon calculation for large-double comparisons
My whole life has been leading up to this bug fix.
Indeed, it's rare that I get to land a change in Chromium with a commit message that can reasonably link to two (2!) of my blog posts.
The fix was to compute the difference between two neighboring doubles at the magnitude of the values being compared. This is done with the rarely used nextafter function, more or less like this:
epsilon = nextafter(expected, INFINITY) - expected;
if (epsilon < 1.0)
  epsilon = 1.0;
The nextafter function finds the next double (in this case, toward infinity), and the subtraction (which is exact, conveniently) then gives the gap between adjacent doubles at that magnitude. The algorithm under test was supposed to be accurate to within 1.0, so epsilon needed to be at least that large. This computation of epsilon makes it easy to check whether values are less than 1.0 apart or are adjacent doubles.
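JavaScript has no nextafter, but the same idea can be sketched by stepping the bit pattern of a double (nextUp here is a hypothetical helper, valid only for positive finite inputs):

function nextUp(x) {
  // Reinterpret the double's bits as a 64-bit integer and add one;
  // for positive finite x this yields the next larger double.
  const buf = new ArrayBuffer(8);
  const f64 = new Float64Array(buf);
  const u64 = new BigUint64Array(buf);
  f64[0] = x;
  u64[0] += 1n;
  return f64[0];
}

// The gap between adjacent doubles at the failing magnitude:
nextUp(4293431141623410688) - 4293431141623410688  // 512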
I never investigated why the test suddenly started failing, but I suspect that a change in timer frequency, or in the timer's starting point, caused the numbers to get bigger.
The huge values came from Windows' QueryPerformanceCounter (QPC): the test exercised values as large as std::numeric_limits<int64>::max(), that is, 2^63-1, which is why such enormous numbers showed up in a microseconds test. So, in a sense, the culprit was QueryPerformanceCounter.
Fixing googletest
It annoyed me that understanding the problem required esoteric knowledge of floating-point arcana, so I wanted to fix googletest itself. My first attempt went badly.
I originally tried to make EXPECT_NEAR fail when it was passed a meaninglessly small epsilon, but it turned out that a lot of tests inside Google, and probably many more outside it, misuse EXPECT_NEAR on double values. They pass an epsilon that is too small to be useful, but the numbers they compare are identical, so the tests pass anyway. I fixed a dozen uses of EXPECT_NEAR without getting anywhere near solving the problem, so I gave up.
It was only while writing this post (almost three years after the bug appeared!) that I realized how safely and easily googletest could be fixed. If code uses EXPECT_NEAR with too small an epsilon and the test passes (that is, the values are actually equal), there is no problem. It only becomes a problem when the test fails, so it was enough to check for a too-small epsilon only on failure and print an informative message when that happens.
I made that change, and now the error message for this 2017 failure looks like this:
The difference between expected_microseconds and converted_microseconds is 512, which exceeds abs_error, where
expected_microseconds evaluates to 4.2934311416234112e+18,
converted_microseconds evaluates to 4.2934311416234107e+18, and
abs_error evaluates to 1.0.
The abs_error parameter 1.0 evaluates to 1 which is smaller than the minimum distance between doubles for numbers of this magnitude which is 512, thus making this EXPECT_NEAR check equivalent to EXPECT_EQUAL. Consider using EXPECT_DOUBLE_EQ instead.
Note that EXPECT_DOUBLE_EQ does not actually check for equality; it checks that the doubles are within four units in the last place (ULPs) of each other. You can read more about this concept in my post Comparing Floating Point Numbers.
I hope that most developers who see the new error message will take the right path, and I believe that fixing googletest will ultimately prove more important than fixing the Chromium test.
Part 3: when x + y = x (y != 0)
This is another variation on the precision problems that appear near the limits of the format. Perhaps I just keep finding the same floating-point bug over and over again?
In this part I will also describe debugging techniques you can apply if you want to dig into the Chromium source code or investigate the cause of a crash.
When I came across this problem, I filed a bug report titled "Crash with OOM (Out of Memory) error in chrome://tracing when zooming in"; it doesn't sound like a floating-point bug either.
As usual, I wasn't looking for trouble; I was just studying chrome://tracing, trying to understand some events, when a sad tab suddenly appeared: the tab had crashed.
You can view and upload recent Chrome crashes at chrome://crashes, but I wanted to load the crash dump into a debugger, so I looked up where they are stored locally:
%localappdata%\Google\Chrome\User Data\Crashpad\reports
I loaded the most recent crash dump into windbg (Visual Studio would work too) and started investigating. Since I had the Chrome and Microsoft symbol servers configured and source indexing enabled, the debugger automatically downloaded the PDBs (debug information) and the required source files. Note that this setup is available to everyone: you don't need to be a Google employee or a Chromium developer for the magic to work. Instructions for setting up Chrome/Chromium debugging can be found here. Automatic source download requires Python to be installed.
The crash analysis showed that the out-of-memory error occurred because the v8 (JavaScript engine) function NewFixedDoubleArray tried to allocate an array of 75,209,227 elements, while the maximum size allowed in this context is 67,108,863 (0x3FFFFFF in hex).
The nice thing about crashes I trigger myself is that I can try to reproduce them under closer observation. Experiments showed that while zooming, memory usage stayed stable until I hit a critical point, after which it suddenly skyrocketed and the tab crashed even if I did nothing further.
The problem was that I could easily view the call stack for the crash, but only for the C++ portion of Chrome's code, while the bug itself apparently lived in the chrome://tracing JavaScript code. I tried reproducing it under the debugger with a canary (daily) build of Chrome and got this curious message:
==== JS stack trace ======================================
Unfortunately, no stack trace followed that promising line. After some wandering through the git history, I found that the ability to print JS call stacks on OOM had been added in 2015 and then removed in December 2019.
I was investigating this bug in early January 2020 (remember those innocent, simpler days?), which meant that the OOM stack-trace code had been removed from the canary build but still remained in the stable build.
So my next step was to reproduce the bug in the stable version of Chrome. That gave me the following (edited slightly for clarity):
0: ExitFrame [pc: 00007FFDCD887FBD]
1: drawGrid_ [000016011D504859] [chrome://tracing/tracing.js:~4750]
2: draw [000016011D504821] [chrome://tracing/tracing.js:4750]
In short, the OOM crash was triggered from drawGrid_, which I found (using the Chromium code search page) in x_axis_track.html. After tweaking that file a bit, I narrowed the problem down to a call to updateMajorMarkData. That function contains a loop that calls majorMarkWorldPositions_.push, and that call was the culprit.
It's worth mentioning here that although I develop a browser, I remain the world's worst JavaScript programmer. Skill in C++ systems programming grants no "frontend" magic. Hacking at the JavaScript to understand this bug was a fairly painful process for me.
The loop (which can be viewed here) looked something like this:
for (let curX = firstMajorMark;
curX < viewRWorld;
curX += majorMarkDistanceWorld) {
this.majorMarkWorldPositions_.push(
Math.floor(MAJOR_MARK_ROUNDING_FACTOR * curX) /
MAJOR_MARK_ROUNDING_FACTOR);
}
I added debug print statements before the loop and collected the data shown below. When I zoomed in far enough that the numbers were critical, but not enough to cause a crash, they looked like this:
firstMajorMark: 885.0999999642371
majorMarkDistanceWorld: 1e-13
Then I zoomed in further to trigger the crash and got numbers like these:
firstMajorMark: 885.0999999642371
majorMarkDistanceWorld: 5e-14
885 divided by 5e-14 is about 1.8e16, while the precision of a double is 2^53, which is about 9.0e15. So the bug occurs when majorMarkDistanceWorld (the distance between grid marks) is so small relative to firstMajorMark (the position of the first major grid mark) that the addition in the loop... does nothing. When you add a small number to a large one and the small number is "too small", then (in the default round-to-nearest mode) the large number can remain exactly the same value.
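A minimal sketch of the stuck loop variable, using the logged values above:

const curX = 885.0999999642371;
// Adjacent doubles near 885 are 2^-43 (about 1.14e-13) apart, so a
// step of 5e-14 is under half that gap and rounds away to nothing:
curX + 5e-14 === curX  // true: the loop never advances
curX + 1e-13 === curX  // false: this step still (barely) moves forward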
Because of this, the loop runs forever, and push executes until the array hits its size limit. If there were no size limit, push would keep running until the whole machine ran out of memory. So hooray for the limit. Problem solved?
The fix turned out to be pretty simple: don't draw the grid labels when we can't do so correctly:
if (firstMajorMark / majorMarkDistanceWorld > 1e15) return;
As is often the case with my changes, the bugfix consisted of one line of code and a six-line comment. I'm only surprised there wasn't also a fifty-line commit message in iambic pentameter and a blog post. Wait a minute...
Unfortunately, JavaScript stack traces are still not printed on OOM crashes, because recording call stacks takes memory, which is not safe to allocate at that point. I don't quite know how I would investigate this bug today, now that the OOM stack-trace code has been removed entirely, but I'm sure I would find a way.
So, if you are a JavaScript developer working with extremely large numbers, a test author comparing the largest of values, or implementing a UI with unlimited zoom, it's important to remember that as you approach the limits of floating-point math, those limits will push back.