UTF-8 vs UTF-16. A few tips for programmers

Introduction

With the advent of the first devices for digital transmission of information and of electronic computers, the problem arose of encoding text characters as sequences of ones and zeros. The smallest addressable unit of information is the byte. With this in mind, the ASCII (American Standard Code for Information Interchange) code table was developed and standardized in the United States in 1963. The original code is 7-bit and was later extended to 8 bits. The table was intended primarily to encode digits and the letters of the English alphabet. The first 128 characters of the table are shown in Fig. 1:





Fig. 1. The first 128 characters of the ASCII table.

The cell number in the table (Fig. 1) is the symbol code. As an example, consider encoding the word Hello. The numbers of the cells in the ASCII table in which the letters are located: 72 (H), 101 (e), 108 (l), 111 (o). The word code in binary representation looks like this:





00010 010 (H) 10100 110 (e) 00110 110 (l) 00110 110 (l) 11110 110 (o) (most significant bit on the right).





The underscored and bold codes in binary representation correspond to the cell numbers in the table (Fig. 1). The algorithm for generating the code is as follows:





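1. The three most significant bits are a prefix (the letter case): 010 for uppercase letters, 011 for lowercase letters.

2. The remaining five bits are the position of the letter in the alphabet.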
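The bit strings for Hello can be reproduced with a short Python sketch. The helper names `bits_lsb_first` and `split_ascii_letter` are illustrative and do not come from the article. The second helper relies on a property of ASCII letters: the three high-order bits are 010 for uppercase and 011 for lowercase, and the five low-order bits give the letter's position in the alphabet.

```python
def bits_lsb_first(code: int, width: int = 8) -> str:
    """Return the bits of `code` with the least significant bit first,
    i.e. in the notation used for the Hello example."""
    return format(code, f"0{width}b")[::-1]


def split_ascii_letter(ch: str) -> tuple[str, int]:
    """Split an ASCII letter into its 3-bit prefix and 5-bit alphabet position."""
    code = ord(ch)
    prefix = format(code >> 5, "03b")  # three high-order bits
    position = code & 0b11111          # five low-order bits
    return prefix, position


for ch in "Hello":
    print(ch, ord(ch), bits_lsb_first(ord(ch)), split_ascii_letter(ch))
```

For H the loop prints code 72, the bit string 00010010, the prefix 010 and position 8 (H is the eighth letter of the alphabet).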





, 128 ASCII , . 128 (8 256 ) . , , 8 .





Unicode () – , ( ) . , ( ) .





.





The Unicode character set is also known as UCS (Universal Coded Character Set) and is defined by the international standard ISO/IEC 10646.





To store and transmit these codes, several encodings of the UTF (Unicode Transformation Format) family are used: UTF-8, UTF-16 and UTF-32.





UTF-8 is a variable-length encoding in which a character takes 8, 16, 24 or 32 bits.





UTF-16 is a variable-length encoding in which a character takes 16 or 32 bits.





UTF-8 UTF-16 UCS.





UTF-8

UTF-8 is specified in RFC (Request For Comments) 3629. The byte layouts from the RFC:





0xxxxxxx

110xxxxx 10xxxxxx

1110xxxx 10xxxxxx 10xxxxxx

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx





The leading bits of the first byte (the prefix) determine the length of the character:





0 – 8-bit character,

110 – 16-bit character,

1110 – 24-bit character,

11110 – 32-bit character.





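Every following byte starts with the prefix 10, which marks it as a continuation byte (not the start of a character). Its remaining six bits carry data.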
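The byte layouts and prefixes above can be turned into a small hand-written encoder. This is a minimal sketch for illustration only: the function name `utf8_encode` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` does the same job. The loop checks the sketch against that built-in.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand using the RFC 3629 byte layouts."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x80:       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


for cp in (72, 1055, 12549, 68620):
    encoded = utf8_encode(cp)
    # Cross-check against the built-in encoder.
    assert encoded == chr(cp).encode("utf-8")
    # Bits are printed most significant bit first here.
    print(cp, [format(b, "08b") for b in encoded])
```

Note that this sketch prints bits in the usual order, most significant bit on the left, which is the reverse of the notation used in the article's examples.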





The first 128 codes coincide with ASCII. The letters of the Russian alphabet occupy codes 1040-1103.





As an example, consider encoding the string «Папа Hello».





The code in binary representation (most significant bit on the right):





00001011 11111001 (П) 00001011 00001101 (а) 00001011 11111101 (п) 00001011 00001101 (а) 00000100 (space) 00010010 (H) 10100110 (e) 00110110 (l) 00110110 (l) 11110110 (o).





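The letter П has code 1055, which is 10000011111 in binary: 11 bits. These bits fill the x positions of the two-byte layout. The first byte carries the prefix 110 and the upper five bits, and the second byte carries the prefix 10 and the lower six bits. Each letter of Hello takes 1 byte, exactly as in ASCII.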
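Reading works in the opposite direction: the prefix of the first byte tells the decoder how many bytes belong to the character. The sketch below splits the UTF-8 bytes of «Папа Hello» (the string whose bytes are listed in the example) into characters. The names `sequence_length` and `split_characters` are illustrative, and the code checks only the prefixes, not every rule of RFC 3629 (it does not reject overlong forms, for instance).

```python
def sequence_length(first_byte: int) -> int:
    """Return the number of bytes in a character, judged by the lead-byte prefix."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid lead byte")


def split_characters(data: bytes) -> list[bytes]:
    """Split UTF-8 data into per-character byte sequences."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must carry the 10 prefix.
        if len(chunk) != n or any(b >> 6 != 0b10 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence")
        chunks.append(chunk)
        i += n
    return chunks


data = "Папа Hello".encode("utf-8")
parts = split_characters(data)
print(len(data), [len(p) for p in parts])
```

The four Cyrillic letters come out as two-byte sequences, while the space and the Latin letters are one byte each.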





UTF-8 , , , , .





UTF-16

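In 2000, RFC 2781 was published. It describes UTF-16, an encoding that uses 16 or 32 bits per character. Codes 0-55295 and 57344-65535 are encoded with 16 bits as they are, while codes that do not fit into 16 bits are encoded with 32 bits. Consider the same string, «Папа Hello».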





The code in binary representation (most significant bit on the right):





11111000 00100000 (П) 00001100 00100000 (а) 11111100 00100000 (п) 00001100 00100000 (а) 00000100 00000000 (space) 00010010 00000000 (H) 10100110 00000000 (e) 00110110 00000000 (l) 00110110 00000000 (l) 11110110 00000000 (o).





16 , .





Now consider a character whose code is greater than 65535, for example the letter shown in Fig. 2:





Fig. 2. Letter of the ancient Turkic alphabet.

The code of this letter is 68620 (0x10C0C).





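The UTF-16 encoding algorithm for such a character is as follows:

  1. Subtract 0x10000 from the character code. The result fits in 20 bits. Example: 0x10C0C - 0x10000 = 0xC0C.

  2. Split the result into the upper 10 bits and the lower 10 bits. For 0xC0C, which is 00000000110000001100 in binary, the upper 10 bits are 0000000011 (0x003) and the lower 10 bits are 0000001100 (0x00C).

  3. Add 0xD800 (11011000 00000000) and 0x003 (00000000 00000011), the upper 10 bits from step 2. 0xD800 + 0x003 = 0xD803 (11011000 00000011). These are the first 16 bits of the UTF-16 code.

  4. Add 0xDC00 (11011100 00000000) and 0x00C (00000000 00001100), the lower 10 bits from step 2. 0xDC00 + 0x00C = 0xDC0C (11011100 00001100). These are the second 16 bits of the UTF-16 code.

  5. The final UTF-16 code is the results of steps 3 and 4 written one after the other: 0xD803DC0C (11011000 00000011 11011100 00001100).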
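The five steps can be sketched in a few lines of Python. The function name `utf16_surrogates` is illustrative. The final assertion compares the result with Python's built-in `utf-16-be` codec for the same character.

```python
def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Return the two 16-bit units that encode a code point above 0xFFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need two 16-bit units")
    offset = code_point - 0x10000   # step 1: a 20-bit value
    upper = offset >> 10            # step 2: upper 10 bits
    lower = offset & 0x3FF          # step 2: lower 10 bits
    return 0xD800 + upper, 0xDC00 + lower  # steps 3 and 4


first, second = utf16_surrogates(0x10C0C)
print(hex(first), hex(second))

# Step 5: the two units written one after the other, checked against the codec.
combined = first.to_bytes(2, "big") + second.to_bytes(2, "big")
assert combined == chr(0x10C0C).encode("utf-16-be")
```

For code 68620 (0x10C0C) this yields the units 0xD803 and 0xDC0C.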





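The number of bits that UTF-8 and UTF-16 need per character, depending on the code range, is given in Table 1.

Table 1. Number of bits per character.

Code range    0-127   128-2047   2048-32767   32768-65535   65536-1048575   1048576-1114111
UTF-8         8       16         24           24            32              32
UTF-16        16      16         16           16            32              32

Table 1 shows that neither encoding is the most compact everywhere. In the ranges 128-2047 and 65536-1114111, UTF-8 and UTF-16 take the same number of bits. In the range 0-127, UTF-8 is twice as compact, which matters for text made up of Latin letters, digits and punctuation. In the ranges 2048-32767 and 32768-65535, UTF-16 is more compact, for example for the Bopomofo letters used in Chinese (12549-12589). Unicode has no codes above 1114111, so both encodings cover the entire code space.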





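Compare the encodings of the string «Папа Hello» in UTF-8 and UTF-16. In UTF-8 it takes 14 bytes, and in UTF-16 it takes 20 bytes, because the space and every Latin letter carry an extra zero byte (0x00). For text like this, UTF-8 is the more compact encoding.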
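The byte counts can be checked directly with Python's built-in codecs. The helper `encoded_sizes` is an illustrative name. The `utf-16-le` codec is used so that no byte order mark is added to the count. The Bopomofo sample is an assumption chosen to show a range where UTF-16 is more compact.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) of `text` in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = [
    "Папа Hello",          # mixed Cyrillic and Latin text
    "Hello",               # codes 0-127 only
    "\u3105\u3106\u3107",  # three Bopomofo letters, codes 12549-12551
]

for text in samples:
    utf8_size, utf16_size = encoded_sizes(text)
    print(repr(text), utf8_size, utf16_size)
```

Which encoding is smaller depends on the text: Latin-heavy strings favor UTF-8, while strings in the 2048-65535 range favor UTF-16.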





: UTF-8, UTF-16. UTF-8, , .





, , . 12549-12589 , , UTF-16 . , , , . – button. - , . :





1. , , . , , , .





2.  . , . UTF-16. .





3. , -, – , – . , , , , .





For example, the letter ü can be represented by a single code, 252, or by two codes: u, 117, followed by the combining diaeresis ¨, 776.
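The two representations of ü (code 252, or u with code 117 followed by the combining diaeresis with code 776) can be compared in Python. The sketch uses `unicodedata.normalize` from the standard library. Normalizing before comparison is an assumption about how one might deal with the two forms, not something the article prescribes.

```python
import unicodedata

composed = "\u00fc"     # ü as a single code, 252
decomposed = "u\u0308"  # u (117) followed by combining diaeresis (776)

# The strings look the same on screen but are different code sequences.
print(composed == decomposed)
print([ord(c) for c in composed], [ord(c) for c in decomposed])

# Normalization converts between the two forms.
as_nfc = unicodedata.normalize("NFC", decomposed)  # composes into one code
as_nfd = unicodedata.normalize("NFD", composed)    # decomposes into two codes
print(as_nfc == composed, as_nfd == decomposed)

# The encoded sizes also differ between the two forms.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

A plain comparison of the two strings reports them as different, so code that compares or searches text may need to bring both sides to the same normalization form first.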







