Exceptionally fast UTF-8 validation

A text string is one of the most common "data types" in programming. When programmers think of a string, they imagine a list or array of characters. This is a "good enough" approximation, but the reality is more complicated.



Characters must be encoded into bits in some way. Most of the strings on the Internet, including this post on Habr, are encoded in UTF-8. The UTF-8 format represents "characters" using one, two, three, or four bytes. It generalizes the ASCII standard, which uses only one byte per character, so any ASCII string is also a valid UTF-8 string.



It is actually a little more complicated, because technically UTF-8 describes code points, and a visible character such as an emoji can consist of several code points... but most programmers do not need such pedantic distinctions.



Other standards exist as well. Some older programming languages such as C# and Java rely on UTF-16, which uses two or four bytes per character. It seemed like a good idea at the time, but the consensus is now shifting toward using UTF-8 everywhere.



Like most encodings, UTF-8 imposes restrictions: not every random sequence of bytes is valid UTF-8. Thus you need to validate strings, that is, check that they really are UTF-8.
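
To make this concrete, here is a minimal sketch of a conventional, branchy validator; the name is_valid_utf8 is ours, and the code is meant for illustration rather than as the optimized routine discussed below. It checks the structural rules of UTF-8: lead bytes, continuation bytes, overlong forms, surrogates, and the U+10FFFF upper bound.

#include <cstddef>
#include <cstdint>

bool is_valid_utf8(const uint8_t *data, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = data[i];
    if (b < 0x80) {                                  // 1-byte (ASCII)
      i += 1;
    } else if ((b & 0xE0) == 0xC0) {                 // 2-byte sequence
      if (b < 0xC2) return false;                    // 0xC0/0xC1 would be overlong
      if (i + 1 >= len || (data[i + 1] & 0xC0) != 0x80) return false;
      i += 2;
    } else if ((b & 0xF0) == 0xE0) {                 // 3-byte sequence
      if (i + 2 >= len) return false;
      uint8_t b1 = data[i + 1], b2 = data[i + 2];
      if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return false;
      if (b == 0xE0 && b1 < 0xA0) return false;      // overlong
      if (b == 0xED && b1 >= 0xA0) return false;     // UTF-16 surrogate range
      i += 3;
    } else if ((b & 0xF8) == 0xF0) {                 // 4-byte sequence
      if (i + 3 >= len) return false;
      uint8_t b1 = data[i + 1], b2 = data[i + 2], b3 = data[i + 3];
      if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80) return false;
      if (b == 0xF0 && b1 < 0x90) return false;      // overlong
      if (b > 0xF4 || (b == 0xF4 && b1 >= 0x90)) return false;  // beyond U+10FFFF
      i += 4;
    } else {
      return false;  // stray continuation byte or invalid lead byte (0xF8..0xFF)
    }
  }
  return true;
}

Such branchy, byte-at-a-time code typically ends up costing several CPU cycles per character on non-ASCII input, which is the baseline the rest of this post improves upon.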



Does it matter? A great deal. For example, a Microsoft web server once had a vulnerability of this kind: it accepted a URI that appeared valid and safe, but once interpreted by the server, it gave the attacker remote access to the disk. Even leaving security concerns aside, you almost certainly do not want to store invalid strings in your database.



Thus, programming languages, web servers, browsers, and database engines constantly validate UTF-8.



If your strings are mostly just ASCII, these checks are fast and UTF-8 validation is not much of a problem. But the days when most strings were ASCII-encoded are gone: we live in a world of emoji and many national alphabets.



Back in 2018, I asked myself: how fast can UTF-8 strings be validated? At the time, I found validation routines that cost several CPU cycles per character. One could have stopped there, but that answer did not satisfy me.



The work took years, but we now seem to have a version that is close to ideal. The new algorithm is an order of magnitude faster than other fast alternatives. We have prepared a research paper, "Validating UTF-8 In Less Than One Instruction Per Byte" (to appear in Software: Practice and Experience), and have also published a benchmarking utility.



All the details are explained in the paper, so we will not repeat them here; we will only sketch the essence. The bulk of UTF-8 validation can be done by examining pairs of consecutive bytes. Once you have checked all byte pairs and flagged every violation detectable from that information, relatively little work remains.
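
As an illustration, each of the following byte pairs is already invalid on its own; spotting such patterns is exactly the kind of information that pairwise checking extracts. The snippet reuses the hypothetical is_valid_utf8 sketch from earlier.

#include <cassert>
#include <cstdint>

int main() {
  const uint8_t stray_continuation[] = {0x41, 0x80};  // continuation byte after an ASCII byte
  const uint8_t truncated_two_byte[] = {0xC2, 0x41};  // 2-byte lead not followed by a continuation
  const uint8_t overlong_slash[]     = {0xC0, 0xAF};  // overlong encoding of '/'
  const uint8_t surrogate_start[]    = {0xED, 0xA0};  // begins a UTF-16 surrogate code point
  const uint8_t out_of_range_lead[]  = {0xF5, 0x80};  // lead byte beyond U+10FFFF

  assert(!is_valid_utf8(stray_continuation, 2));
  assert(!is_valid_utf8(truncated_two_byte, 2));
  assert(!is_valid_utf8(overlong_slash, 2));
  assert(!is_valid_utf8(surrogate_start, 2));
  assert(!is_valid_utf8(out_of_range_lead, 2));
  return 0;
}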



All mainstream processors have fast SIMD instructions. They work with wide registers (128 bits, 256 bits, etc.). Most instruction sets have a "vectorized lookup" instruction that takes, say, 16 byte values (each in the range 0 to 15) and looks them up in a 16-byte table. On Intel and AMD processors, this description matches the pshufb instruction. A value in the range 0 to 15 is called a nibble and spans 4 bits; a byte is made of two nibbles (a low nibble and a high nibble).
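
As a rough sketch (with placeholder table contents, not the tables of the actual validation algorithm), here is how a 16-entry lookup on the low nibbles of 16 bytes can be expressed with the SSSE3 intrinsic _mm_shuffle_epi8, which compiles down to pshufb; the high-nibble lookup works the same way after shifting each byte right by four bits.

#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#include <cstdint>

__m128i lookup_low_nibbles(__m128i input) {
  // Arbitrary 16-entry table: entry i is returned for nibble value i.
  const __m128i table = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
  // Keep only the low nibble (0..15) of each of the 16 input bytes.
  __m128i low_nibbles = _mm_and_si128(input, _mm_set1_epi8(0x0F));
  // A single instruction performs all 16 table lookups in parallel.
  return _mm_shuffle_epi8(table, low_nibbles);
}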



In our algorithm, the vectorized lookup instruction is called three times: once on the low nibble, once on the high nibble, and once on the high nibble of the next byte. We use three corresponding 16-byte lookup tables. If they are chosen carefully, the bitwise AND of the three lookups flags any error.



See the paper for the details, but ultimately nearly all of the UTF-8 validation is done with just five lines of fast, branchless C++ code... and those five lines check blocks of up to 32 bytes at a time...



simd8<uint8_t> classify(simd8<uint8_t> input, simd8<uint8_t> previous_input) {
  // Shift in the last bytes of the previous block so each lane sees the byte before it.
  auto prev1 = input.prev<1>(previous_input);
  // Three table lookups: high nibble of the previous byte, low nibble of the
  // previous byte, and high nibble of the current byte.
  auto byte_1_high = prev1.shift_right<4>().lookup_16(table1);
  auto byte_1_low = (prev1 & 0x0F).lookup_16(table2);
  auto byte_2_high = input.shift_right<4>().lookup_16(table3);
  // The bitwise AND of the three lookups flags the byte positions where an error is detected.
  return (byte_1_high & byte_1_low & byte_2_high);
}


While it is not immediately obvious, this check really is sufficient and completely sound. Only a few inexpensive additional technical steps remain.



As a result, on recent Intel/AMD processors it takes slightly less than one instruction per byte to validate even the worst random input. Because the code is thoroughly optimized, it retires up to three instructions per cycle, and even more. That is, we use a small fraction of a cycle (less than a third) per input byte on a modern CPU, and the processing speed reliably exceeds 12 GB/s.



The lesson is that, while regular lookup tables remain useful, vectorized lookup tables are fundamental building blocks of high-speed algorithms.



If you need fast UTF-8 validation in production, we recommend the simdjson library (version 0.5 or higher). It is well tested and has useful built-in features such as runtime dispatching. Although the library is designed for parsing JSON, you can use it purely for UTF-8 validation, even if you have no JSON at all. It supports 64-bit ARM and x64 processors and has fallback code paths for other CPUs. It is packaged as a single header file along with a single source file, so you can simply drop it into your C++ project.
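
A minimal usage sketch might look like the following; it assumes the validate_utf8 helper that simdjson exposes for this purpose (check the documentation of the version you use for the exact API).

#include <cstdio>
#include <string>
#include "simdjson.h"

int main() {
  // "café" followed by an emoji, written as explicit UTF-8 byte escapes.
  std::string text = "caf\xC3\xA9 \xF0\x9F\x99\x82";
  bool ok = simdjson::validate_utf8(text.data(), text.size());
  std::printf("%s\n", ok ? "valid UTF-8" : "invalid UTF-8");
  return 0;
}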



Previous work. The main credit for popularizing the vectorized classification technique, which is the key to the lookup algorithm, goes to Mula. To my knowledge, Keiser was the first to propose the three-lookup strategy. The first practical SIMD-based UTF-8 validation algorithm was created by K. Willets. Several people, including Z. Wegner, improved on it. Travis Downs contributed clever ideas for speeding up conventional algorithms.



Further reading. If you enjoy this work, you may like other articles on related topics: "Base64 encoding and decoding at almost the speed of a memory copy" (Software: Practice and Experience, 50(2), 2020) and "Parsing Gigabytes of JSON per Second" (VLDB Journal, 28(6), 2019).


