Unicode characters: what every developer should know





If you are writing an international application that works with multiple languages, then you need to know a thing or two about encoding. It determines how text is displayed on the screen. I'll briefly cover the history of encoding and its standardization, and then we'll talk about how it is used. Along the way we'll touch on a little computer science theory.



Introduction to encoding



Computers understand only binary numbers - zeros and ones; this is their language, nothing else. A group of eight bits is called a byte, so eight zeros and ones make up one byte. Inside a computer, everything boils down to binary: programming languages, mouse movements, keystrokes, and all the words on the screen. But if the article you are reading used to be a bunch of zeros and ones, how did those binary numbers become text? Let's figure it out.



A brief history of encoding



At the dawn of its development, the Internet was exclusively English-speaking. Its authors and users did not have to worry about the characters of other languages, and all the needs were fully covered by the American Standard Code for Information Interchange (ASCII) encoding.



ASCII is a table that maps binary numbers to alphabet characters. When the computer receives an entry like this:



01001000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100





then, using ASCII, it converts it into the phrase "Hello world".
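If you want to see this conversion for yourself, here is a minimal Python sketch (my own illustration, not part of ASCII itself) that splits the binary string above into bytes and looks each one up with chr():

# Decode the binary string above using the ASCII table.
binary = (
    "01001000 01100101 01101100 01101100 01101111 00100000 "
    "01110111 01101111 01110010 01101100 01100100"
)

# Convert each 8-bit group to an integer, then look up the character for that code.
text = "".join(chr(int(byte, 2)) for byte in binary.split())
print(text)  # Hello world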



One byte (eight bits) was large enough to hold any English letter, plus a set of control characters, some of which were used by teleprinters, so in those years they were useful (today, much less so). Control characters included, for example, 7 (0111 in binary), which made the computer emit a beep; 8 (1000 in binary), which deleted the last printed character; and 12 (1100 in binary), which erased all text written on the video terminal.



In those days, computers used 8 bits per byte (this was not always the case), so there was no problem. We could store all the control characters, all the digits and English letters, and there was even room left over, since one byte can encode 256 values and ASCII needs only 128. That is, 128 positions in the encoding were still unused.



This is what an ASCII table looks like. Binary numbers encode all upper and lower case letters from A to Z and numbers from 0 to 9. The first 32 positions are reserved for non-printable control characters.





ASCII problems



Positions 128 through 255 were empty, and the public wondered how to fill them. But everyone had different ideas. The American National Standards Institute (ANSI) formulates standards for different industries. It approved ASCII positions 0 through 127, and nobody disputed those. The problem was with the rest of the positions.



This is what filled positions 128-255 in the first IBM computers:





Some squiggles, box-drawing characters, math operators, and accented symbols like é. But the developers of other computer architectures did not support the initiative: everyone wanted to implement their own encoding in the second half of ASCII.



All of these different variants came to be called code pages.



What are ASCII code pages?



Here is a collection of over 465 different code pages! There were different pages even within a single language, for example for Greek and Chinese. How could this mess ever be standardized? Or at least made to work between different languages? Or between different code pages for the same language? And in languages other than English? Chinese has over 100,000 characters; ASCII could not accommodate them even if every empty position were given over to Chinese characters.



This problem even has a name: mojibake (known in Russian as "bnopnya" or "krakozyabry"). That is what garbled text produced by decoding with the wrong encoding is called. Translated from Japanese, mojibake roughly means "character transformation".





An example of mojibake (krakozyabry).



Some kind of madness...



Exactly! There was no way to reliably exchange data. The Internet is just an enormous network connecting computers all over the world. Imagine that every country decided to use its own standard: Greek computers accept only Greek, while English computers send only English. It would be like shouting in an empty cave - nobody would hear you.



ASCII was no longer adequate. Something different had to be created for the worldwide Internet, or we would have had hundreds of code pages to deal with. Unless, of course, you would enjoy reading paragraphs like this:



֎֏ 0590 ֐ ׀ׁׂ׃ׅׄ׆ׇ



This is how Unicode was born



Unicode corresponds to the Universal Coded Character Set (UCS), officially designated ISO/IEC 10646. But usually everyone just calls it Unicode.



This standard solved the problems caused by encodings and code pages. It contains a huge number of code points assigned to characters from languages and cultures around the world. That is, Unicode is a character set: it associates an abstract number with every letter we might want to refer to. And this is done for every symbol, even Egyptian hieroglyphs.



Someone has done a great job of matching every character in every language with a unique code. This is how it looks:



«Hello World»

U+0048 :   H
U+0065 :   e
U+006C :   l
U+006C :   l
U+006F :   o
U+0020 :   (space)
U+0057 :   W
U+006F :   o
U+0072 :   r
U+006C :   l
U+0064 :   d





The U+ prefix indicates that this is a Unicode code point, and the number that follows is that code point written in hexadecimal, a compact way of writing the underlying binary number. You can type anything into an online converter and see how it maps to code points, and you can browse all 143,859 code points in the standard's charts.
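Here is a small Python sketch (my illustration, using only the standard library) that produces the table above: ord() returns the code point of a character, and the format specifier prints it as four hexadecimal digits with the conventional U+ prefix.

# Print the Unicode code point of each character in "Hello World".
for char in "Hello World":
    print(f"U+{ord(char):04X} : {char}")
# U+0048 : H
# U+0065 : e
# ... and so on for the rest of the string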



I'll clarify just in case: we are talking about a large dictionary that assigns code points to all kinds of symbols. It is a very large character set, nothing more.



It remains to add the last ingredient.



Unicode Transformation Format (UTF)



UTF is a way of encoding Unicode code points as bytes. It is defined in the standard and can represent any code point. However, there are several UTF variants, which differ in the size of the unit used for encoding: UTF-8 works with 8-bit units (one to four bytes per code point), UTF-16 with 16-bit units, and UTF-32 with 32-bit units (always four bytes per code point).
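To make the difference concrete, here is a minimal Python sketch (my own illustration) that encodes the same string with all three and prints the resulting byte sequences. Note that Python prepends a byte order mark for "utf-16" and "utf-32"; more on that just below.

# The same string, three encodings, three different byte sequences.
text = "hi €"
for encoding in ("utf-8", "utf-16", "utf-32"):
    data = text.encode(encoding)
    print(encoding, len(data), data.hex(" "))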



But if we have three different encodings, how do we know which one is used in a particular file? One answer is the Byte Order Mark (BOM), also called the encoding signature: a short marker (two to four bytes, depending on the encoding) at the beginning of the file that indicates which encoding is used.
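Here is a sketch of how such detection can look in Python. The helper name sniff_bom is my own; the BOM constants come from the standard codecs module. UTF-32 must be checked before UTF-16, because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 one.

import codecs

# Compare the first bytes of the data against the known byte order marks.
def sniff_bom(data: bytes):
    if data.startswith(codecs.BOM_UTF8):          # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF32_LE) or data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return None  # no BOM: fall back to a default such as UTF-8

The returned name can be passed straight to bytes.decode() to read the file's contents.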



On the Internet, UTF-8 is the most commonly used encoding; it is also listed as the preferred encoding in the HTML5 standard, so I'll give it the most attention.





This graph was built in 2012, when UTF-8 was becoming the dominant encoding. It still is.





The graph shows the prevalence of UTF-8.



What is UTF-8 and how does it work?



UTF-8 encodes each Unicode code point from 0 through 127 in a single byte (just like ASCII). That means if you wrote your program using ASCII and your users read its output as UTF-8, they won't notice anything out of the ordinary - everything works as intended. Note how important this is: UTF-8 had to stay backward compatible with ASCII during its mass adoption, and it doesn't break anything.
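A quick Python sketch (mine, for illustration) shows this compatibility: for pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.

# For ASCII-only text, ASCII and UTF-8 produce the same bytes.
text = "Hello world"
print(text.encode("ascii") == text.encode("utf-8"))  # True
print(text.encode("utf-8").hex(" "))  # 48 65 6c 6c 6f 20 77 6f 72 6c 64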



As the name suggests, UTF-8 works in units of 8 bits (one byte). Some Unicode characters take up several bytes (up to four in the current standard; the original design allowed up to six). This is called variable-length encoding, and the number of bytes differs between languages: English letters need one byte per code point; European languages with Latin-based alphabets, Hebrew, and Arabic are represented with two bytes per code point; Chinese, Japanese, Korean, and other Asian languages use three.



If a character needs more than one byte, a bit pattern in the first byte signals the transition: it says how many continuation bytes follow, and each continuation byte begins with the bits 10.
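The following Python sketch (my illustration) makes the pattern visible: characters from different scripts take different numbers of bytes, the leading bits of the first byte encode the length, and every continuation byte starts with 10.

# Show the UTF-8 byte count and bit patterns for characters from different scripts.
for char in ("a", "é", "ш", "界", "😀"):
    data = char.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in data)
    print(char, len(data), bits)
# a 1 01100001
# é 2 11000011 10101001
# ш 2 11010001 10001000
# 界 3 11100111 10010101 10001100
# 😀 4 11110000 10011111 10011000 10000000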



And so, as if by magic, we all agreed on how to encode Sumerian cuneiform (Habr does not display it), as well as emoji!



To summarize: we first read the BOM (if present) to determine the encoding, then convert the file's bytes into Unicode code points, and then display the characters from the Unicode set.
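As a tiny end-to-end sketch (my own, handling only the UTF-8 BOM case for brevity): read raw bytes, pick an encoding, and decode them into text.

# Raw bytes: UTF-8 BOM followed by "Hello мир" encoded in UTF-8.
raw = b"\xef\xbb\xbfHello \xd0\xbc\xd0\xb8\xd1\x80"

# "utf-8-sig" decodes UTF-8 and strips the BOM if it is present.
encoding = "utf-8-sig" if raw.startswith(b"\xef\xbb\xbf") else "utf-8"
text = raw.decode(encoding)
print(text)  # Hello мир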



Finally about UTF



Encodings are like keys: if I send you data in the wrong one, you won't be able to read anything. Keep this in mind when sending and receiving data. In our day-to-day tools this is often abstracted away, but as programmers it's important to understand what's going on under the hood.



How do we set the encoding? Since HTML is written in English, and almost every encoding handles English just fine, we can specify the encoding at the very beginning of the <head> section:



<html lang="en">
<head>
  <meta charset="utf-8">
</head>





It is important to do this at the very beginning of <head>, because HTML parsing may start over if the encoding currently in use turns out to be wrong. You can also learn the encoding from the Content-Type header of the HTTP request or response.
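For example, a server delivering a UTF-8 HTML page typically sends a header like this:

Content-Type: text/html; charset=utf-8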



If the HTML document contains no mention of the encoding, the HTML5 spec offers an interesting fallback called BOM sniffing: determining the encoding from the byte order mark (BOM) at the start of the file.



Is that all?



Unicode is not yet complete. As with any standard, things get added, removed, and proposed; no specification is ever "complete". There are usually one or two releases a year, and you can find their descriptions on the Unicode site.



I recently read about a very interesting bug related to the incorrect display of Russian Unicode characters on Twitter.



If you have read this far, well done. Here is some homework: see how sites break when the wrong encoding is used. I used this extension for Google Chrome to change the encoding and tried to open different pages - the information became completely unreadable. Try it yourself and see what mojibake looks like. It will help you appreciate how important encoding is.





Conclusion



While writing this article, I learned about Michael Everson. Since 1993, he has proposed over 200 changes to Unicode and added thousands of characters to the standard. As of 2003, he was considered its most prolific contributor. He alone has greatly influenced the face of Unicode. Michael is one of the people who made the Internet what it is today. Very impressive.



I hope I was able to show you what encodings are for, what problems they solve, and what happens when they fail.


