How the UTF-8 encoding was invented: excerpts from the correspondence of the creators



Everyone knows the UTF-8 encoding, which has long dominated the Internet space , and has been used for many years. It would seem that everything is known about her, and there is nothing interesting to tell on this topic. If you read popular resources like Wikipedia, then really there is nothing unusual there, except that the English version briefly mentions a strange story about how it was "sketched on a napkin in a diner." 



In fact, the invention of this encoding cannot be so banal, if only because Ken Thompson, a legendary person, had a hand in its creation. He worked with Dennis Ritchie, was one of the creators of UNIX, contributed to the development of C (invented its predecessor - B), and later, while working at Google, took part in the creation of the Go language. 



Before you - the translation of several letters in which the developers recall the history of the creation of the encoding. 



Characters:



ken (at) entrisphere.com - Ken Thompson





Ken Thompson (left) with Dennis Ritchie



"Rob 'Commander' Pike" - Robert Pike , a Canadian programmer who worked on UTF-8 with Ken Thompson







mkuhn (at) acm.org - Markus Kuhn , German computer scientist







henry (at) spsystems.net - Henry Spurser , author of one of the RegExp implementations







Russ Cox < rsc@plan9.bell-labs.com > - Russ Cox , Bell Labs employee who worked on the Plan 9







Greger system Leijonhufvud < greger@friherr.com> - One of the X / Open







Plan 9 staff members - The operating system that first used UTF-8 encoding to provide multilingualism.







UTF-8 - Unicode character encoding










2003 Correspondence





Below is the correspondence between the creators of the encoding, Robert and Ken, which Robert Pike began, complaining that their contribution to the creation of UTF-8 was undeservedly forgotten. Robert asks one of his old acquaintances to rummage through the archives of the mail server and find there evidence of their participation. (approx. per.)



Subject: UTF-8 history
From: "Rob 'Commander' Pike" <r (at) google.com>
Date: Wed, 30 Apr 2003 22:32:32 -0700 (Thu 06:32 BST)
To: mkuhn (at) acm.org, henry (at) spsystems.net
Cc: ken (at) entrisphere.com
      
      







Looking at the conversations about the origins of UTF-8, I see the same story being repeated over and over. 



Wrong version: 



  1. UTF-8 was developed by IBM.
  2. It was implemented in Plan 9 (an operating system developed by Bell Laboratories)


It is not true. I saw firsthand how UTF-8 was invented on a napkin one evening in September 1992 in a New Jersey diner. 



It happened in this way. We used the original ISO 10646 UTF to support 16-bit characters in Plan 9, which we hated, and were about to release Plan 9 when some guys called me late one night, I think they were from IBM. I remember meeting them at the X / Open committee meeting in Austin. They wanted Ken and me to see their FSS / UTF project. 
(, ..) . Bell Labs , Plan 9 — , , , . , .



Unicode , , , , , .



(. .)
We understood why they wanted to change the design and decided that this is a good opportunity to use our experience to develop a new standard and get the guys at X / Open to promote it. We had to tell them about it, and they agreed on the condition that we deal with it quickly. 



Then we went to have a bite to eat, and during dinner, Ken sorted out the pack of beats, and when we got back, they called the guys from X / Open and explained our idea to them. We sent our sketch by mail, and they replied that it was better than theirs (but I remember exactly that they did not show us their version), and asked when we could implement it.
One of the options for delimiting characters was a slash, but this could confuse the file system, it could interpret it as an escape sequence.



(approx. per.)
It seems to me that it happened on Wednesday night. We promised that we would launch the system by Monday, when I think they have some important meeting scheduled. That evening, Ken wrote the encoder / decoder code, and I started to deal with C and the graphics libraries. The next day the code was ready and we started converting the system's text files. By Friday, Plan 9 was already up and running on so-called UTF-8. 



And later the story was rewritten a little.



Why didn't we just use their FSS / UTF? 



As far as I remember, in that first phone call I sang to Desiderata my requirements for encoding, and FSS / UTF did not have at least one thing - the ability to synchronize a stream of bytes taken from the middle of the stream using as few characters as possible for synchronization (see above, about defining character boundaries.



sang them Desiderat
.



, 1971 , : «Desiderata» , , : «». , « » « ». ( .)





Since there was no solution anywhere, we were free to do it however we wanted.

I think "the story was invented by IBM and implemented in Plan 9" has its origins in the RFC 2279 documentation. We were so happy when UTF-8 took root that we didn't tell the story to anyone. 



None of us work at Bell Labs anymore, but I'm sure there is an archive of email that can corroborate our story, and I can ask someone to dig into it.



So all the credit goes to the guys at X / Open and IBM for making it possible and pushing the encoding, but Ken developed it, and I helped him with that, no matter what the history books say. 



-Rob




Date: Sat, 07 Jun 2003 18:44:05 -0700
From: "Rob `Commander' Pike" <r@google.com>
To: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
cc: henry@spsystems.net, ken@entrisphere.com,
Greger Leijonhufvud <greger@friherr.com>
Subject: Re: UTF-8 history
      
      







I asked Russ Cox to dig in the archives. I am attaching his message. I think you will agree that this confirms the story I posted earlier. The email we sent to X / Open (I think Ken edited and circulated this document) includes a new "desideratum number 6" about character boundary detection. 



We will no longer know what impact the original solution from X / Open had on us. Although they are different, they have common characteristics. I don’t remember looking at it in detail, it was too long ago (in the last letter he says that X / Open did not show them their implementation. Approx. Lane) . But I remember very well how Ken wrote sketches on a napkin and then wished we had saved it. 



-Rob




From: Russ Cox <rsc@plan9.bell-labs.com>
To: r@google.com
Subject: utf digging
Date-Sent: Saturday, June 07, 2003 7:46 PM -0400
      
      





User bootes file /sys/src/libc/port/rune.c



was modified by division-heavy on September 4, 1992. The version in the dump has a time of 19:51:55. The next day a comment was added to it, but otherwise it didn't change until November 14, 1996, when it runelen



was sped up by explicitly checking the value rune



instead of using the return value runetochar



. The last change was on May 26, 2001 when it was added runenlen



. (Rune is a structure that contains a Unicode value. Runelen and runetochar are functions that work with this data type. Approx. Trans.)



There were several letters from your mailboxes that were generated by line grapping utf. 



The first is about a file utf.c



that is a copy wctomb



and mbtowc



(character conversion functions. Approx. Per.) That handle the full 6-byte UTF-8, 32-bit encoding runes.



With flow control logic it looks pretty ugly. I guess this code came from that first email. 



In /usr/ken/utf/xutf



I found a copy of what appears to be the source of that non-self-synchronizing encoding method, with the UTF-8 schema appended to the end of the email (starts with "We define 7 types byte



"). 



The version of the letter below, dated September 2 11:44:10 PM, is the first. After several edits, on the morning of September 8th, the second version came out. The mail server logs show how the second version of the letter is sent and, after a while, returns to Ken:



helix: Sep  8 03:22:13: ken: upas/sendmail: remote inet!xopen.co.uk!xojig
>From ken Tue Sep  8 03:22:07 EDT 1992 (xojig@xopen.co.uk) 6833
helix: Sep  8 03:22:13: ken: upas/sendmail: delivered rob From ken Tue Sep 8 03:22:07 EDT 1992 6833
helix: Sep  8 03:22:16: ken: upas/sendmail: remote pyxis!andrew From ken Tue Sep  8 03:22:07 EDT 1992 (andrew) 6833
helix: Sep  8 03:22:19: ken: upas/sendmail: remote coma!dmr From ken Tue Sep  8 03:22:07 EDT 1992 (dmr) 6833
helix: Sep  8 03:25:52: ken: upas/sendmail: delivered rob From ken Tue Sep 8 03:24:58 EDT 1992 141
helix: Sep  8 03:36:13: ken: upas/sendmail: delivered ken From ken Tue Sep 8 03:36:12 EDT 1992 6833
      
      





Good luck.



Files from the mail archive



Next is a file with correspondence from the mail server dump, which Russ Cox attached to his, in response to Robert's request to delve into history. This is the "first version". (approx.)



>From ken Fri Sep  4 03:37:39 EDT 1992

      
      





Here is our suggestion for modifying FSS-UTF. It is about the same thing as in the previous one. I apologize to the author. 



The code has been tested to some extent and should be in pretty good shape. We redesigned the Plan 9 code to use this encoding, and we are going to release the distribution and select university users for initial testing. 



File System Safe Universal Character Set Transformation Format (FSS-UTF)




With the establishment of ISO / IEC 10646 (Unicode) as an international standard and the expectation of widespread adoption of this Universal Coded Character Set (UCS), operating systems historically based on ASCII need to develop ways to represent and handle a large number of characters that can encode with the new standard. UCS has several problems that need to be solved in historically developed operating systems and the environment for programming in C. 



(The text below mentions "historical operating systems" several times. Apparently in the context of "historically working with ASCII encoding." I either omitted this epithet, or replaced it with "existing", etc. approx. lane)



The biggest problem is the encoding scheme used in UCS. Namely, the integration of the UCS standard with existing programming languages, operating systems and utilities. Problems in programming languages ​​and operating systems are being addressed across industries, yet we are still faced with the handling of UCS by operating systems and utilities. 



Among the problems associated with handling UCS in operating systems, the main one is the presentation of data within the file system. The underlying concept is to support the existing operating systems in which the investment has been made and at the same time take advantage of the large number of characters provided by UCS. 



UCS makes it possible to encode multilingual text using a single character set. But UCS and UTF not protect null bytes (the end of lines in some languages. Approx. Per.) And / or a slash in ASCII /, which makes the coding incompatible with Unix. The following proposal provides a Unix-compatible UCS transform format, so that Unix systems can support multilingual text within a single encoding.



This encoding conversion format is intended for encoding files as an intermediate step towards full UCS support. However, since nearly all Unis implementations face the same UCS support issues, this proposal is intended to provide general encoding compatibility at this stage.



Purpose / Objective




Based on the assumption, we get that if almost all the problems of processing and storing UCS in OS file systems are known, then we need to use such a UCS conversion format that will work without violating the structure of the operating system. The goal is that the format conversion process can be used to encode the file.



Criteria for conversion format




The following are guidelines to be followed when defining the UCS conversion format:



  1. Compatibility with existing file systems.

    It is forbidden to use a null byte and a slash character as part of the file name. 
  2. .

    ASCII. UCS, ASCII, ASCII.
  3. UCS .
  4. .
  5. , , .
  6. , (. ..).


FSS-UTF




The proposed UCS transform format encodes UCS values ​​in a range [0,0x7fffffff]



using multiple bytes per character and in lengths of 1, 2, 3, 4, 5, and 6 bytes. In all cases of encoding with more than one byte, the leading byte determines the number of bytes used, with the most significant bit set in each byte. Each byte that does not start with 10XXXXXX is the start of a UCS character sequence.



An easy way to remember the format: the number of high ones in the first byte means the number of bytes in a multibyte character.



Bits    Hex Min  Hex Max  Byte Sequence in Binary
1    7  00000000 0000007f 0vvvvvvv
2   11  00000080 000007FF 110vvvvv 10vvvvvv
3   16  00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
4   21  00010000 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
5   26  00200000 03FFFFFF 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
6   31  04000000 7FFFFFFF 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

      
      





The multibyte-encoded UCD character value is a concatenation of v-bits. If several encoding methods are possible, for example UCS 0, then the shortest is considered acceptable.



Below are examples of implementations of standard C functions wcstombs()



and mbstowcs()



that demonstrate algorithms for converting from UCS to conversion format and converting from conversion format to UCS. Examples of implementations include error checking, some of which may not be needed for consistency:



typedef

struct
{
    int   cmask;
    int   cval;
    int   shift;
    long  lmask;
    long  lval;
} Tab;

static
Tab       tab[] =
{
    0x80, 0x000*6,   0x7F,         0,            /* 1 byte sequence */
    0xE0, 0xC01*6,   0x7FF,        0x80,         /* 2 byte sequence */
    0xF0, 0xE02*6,   0xFFFF,              0x800,        /* 3 byte sequence */
    0xF8, 0xF03*6,   0x1FFFFF,     0x10000,      /* 4 byte sequence */
    0xFC, 0xF84*6,   0x3FFFFFF,    0x200000,     /* 5 byte sequence */
    0xFE, 0xFC5*6,   0x7FFFFFFF,   0x4000000,    /* 6 byte sequence */
    0,                                              /* end of table */
};

int
mbtowc(wchar_t *p, char *s, size_t n)
{
    long l;
    int c0, c, nc;
    Tab *t;

    if(s == 0)
        return 0;

    nc = 0;
    if(n <= nc)
        return -1;
    c0 = *s & 0xff;
    l = c0;
    for(t=tab; t->cmask; t++) {
        nc++;
        if((c0 & t->cmask) == t->cval) {
            l &= t->lmask;
            if(l < t->lval)
                return -1;
            *p = l;
            return nc;
        }
        if(n <= nc)
            return -1;
        s++;
        c = (*s ^ 0x80) & 0xFF;
        if(c & 0xC0)
            return -1;
        l = (l<<6) | c;
    }
    return -1;
}

int
wctomb(char *s, wchar_t wc)
{
    long l;
    int c, nc;
    Tab *t;

    if(s == 0)
        return 0;

    l = wc;
    nc = 0;
    for(t=tab; t->cmask; t++) {
        nc++;
        if(l <= t->lmask) {
            c = t->shift;
            *s = t->cval | (l>>c);
            while(c > 0) {
                c -= 6;
                s++;
                *s = 0x80 | ((l>>c) & 0x3F);
            }
            return nc;
        }
    }
    return -1;
}
      
      





>From ken Tue Sep  8 03:24:58 EDT 1992
      
      





I sent it by mail, but the letter went into a black hole, so I did not receive my copy. This internet address must have been in a coma.



Second version of the letter, with revisions



Then a copy of the letter is attached, which is described above as: "After several corrections, on the morning of September 8, the second version turned out." The repeating part is hidden under the spoiler. (approx. transl.)



>From ken Tue Sep  8 03:42:43 EDT 1992
      
      





I finally got my copy.



--- /usr/ken/utf/xutf from dump of Sep 2 1992 ---
      
      





Hidden text

File System Safe Universal Character Set Transformation Format (FSS-UTF)




ISO/IEC 10646 (Unicode) (UCS), , ASCII, , . UCS , C. 



, UCS. UCS , . , UCS . 



, UCS , . , , , , UCS. 



UCS . UCS UTF ( . . .) / ASCII /, Unix. UCS, Unix, , , Unix- .



, UCS. , Unis UCS, .



/




, UCS ,   UCS, . , .






  , , UCS:



  1. .

  2. .

    ASCII. UCS, ASCII, ASCII.
  3. UCS .
  4. .
  5. , , .
  6. , (. ..).


FSS-UTF




UCS UCS [0,0x7fffffff]



1, 2, 3, 4, 5, 6 . , . , 10XXXXXX, UCS.



: .



Bits    Hex Min  Hex Max  Byte Sequence in Binary
1    7  00000000 0000007f 0vvvvvvv
2   11  00000080 000007FF 110vvvvv 10vvvvvv
3   16  00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
4   21  00010000 001FFFFF 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
5   26  00200000 03FFFFFF 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
6   31  04000000 7FFFFFFF 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

      
      





UCD — v-. , UCS 0, .



C wcstombs()



mbstowcs()



, UCS UCS. , :



typedef

struct
{
    int   cmask;
    int   cval;
    int   shift;
    long  lmask;
    long  lval;
} Tab;

static
Tab       tab[] =
{
    0x80, 0x000*6,   0x7F,         0,            /* 1 byte sequence */
    0xE0, 0xC01*6,   0x7FF,        0x80,         /* 2 byte sequence */
    0xF0, 0xE02*6,   0xFFFF,              0x800,        /* 3 byte sequence */
    0xF8, 0xF03*6,   0x1FFFFF,     0x10000,      /* 4 byte sequence */
    0xFC, 0xF84*6,   0x3FFFFFF,    0x200000,     /* 5 byte sequence */
    0xFE, 0xFC5*6,   0x7FFFFFFF,   0x4000000,    /* 6 byte sequence */
    0,                                              /* end of table */
};

int
mbtowc(wchar_t *p, char *s, size_t n)
{
    long l;
    int c0, c, nc;
    Tab *t;

    if(s == 0)
        return 0;

    nc = 0;
    if(n <= nc)
        return -1;
    c0 = *s & 0xff;
    l = c0;
    for(t=tab; t->cmask; t++) {
        nc++;
        if((c0 & t->cmask) == t->cval) {
            l &= t->lmask;
            if(l < t->lval)
                return -1;
            *p = l;
            return nc;
        }
        if(n <= nc)
            return -1;
        s++;
        c = (*s ^ 0x80) & 0xFF;
        if(c & 0xC0)
            return -1;
        l = (l<<6) | c;
    }
    return -1;
}

int
wctomb(char *s, wchar_t wc)
{
    long l;
    int c, nc;
    Tab *t;

    if(s == 0)
        return 0;

    l = wc;
    nc = 0;
    for(t=tab; t->cmask; t++) {
        nc++;
        if(l <= t->lmask) {
            c = t->shift;
            *s = t->cval | (l>>c);
            while(c > 0) {
                c -= 6;
                s++;
                *s = 0x80 | ((l>>c) & 0x3F);
            }
            return nc;
        }
    }
    return -1;
}
      
      





int mbtowc(wchar_t *p, const char *s, size_t n)
{
       unsigned char *uc;      /* so that all bytes are nonnegative */

       if ((uc = (unsigned char *)s) == 0)
           return 0;               /* no shift states */
       if (n == 0)
           return -1;
       if ((*p = uc[0]) < 0x80)
           return uc[0] != '\0';   /* return 0 for '\0', else 1 */
       if (uc[0] < 0xc0)
       {
           if (n < 2)
               return -1;
           if (uc[1] < 0x80)
               goto bad;
           *p &= 0x3f;
           *p <<= 7;
           *p |= uc[1] & 0x7f;
           *p += OFF1;
           return 2;
       }
       if (uc[0] < 0xe0)
       {
           if (n < 3)
               return -1;
           if (uc[1] < 0x80 || uc[2] < 0x80)
               goto bad;
           *p &= 0x1f;
           *p <<= 14;
           *p |= (uc[1] & 0x7f) << 7;
           *p |= uc[2] & 0x7f;
           *p += OFF2;
           return 3;
       }
       if (uc[0] < 0xf0)
       {
           if (n < 4)
               return -1;
           if (uc[1] < 0x80 || uc[2] < 0x80 || uc[3] < 0x80)
               goto bad;
           *p &= 0x0f;
           *p <<= 21;
           *p |= (uc[1] & 0x7f) << 14;
           *p |= (uc[2] & 0x7f) << 7;
           *p |= uc[3] & 0x7f;
           *p += OFF3;
           return 4;
       }
       if (uc[0] < 0xf8)
       {
           if (n < 5)
               return -1;
           if (uc[1] < 0x80 || uc[2] < 0x80 || uc[3] < 0x80 || uc[4] < 0x80)
               goto bad;
           *p &= 0x07;
           *p <<= 28;
           *p |= (uc[1] & 0x7f) << 21;
           *p |= (uc[2] & 0x7f) << 14;
           *p |= (uc[3] & 0x7f) << 7;
           *p |= uc[4] & 0x7f;
           if (((*p += OFF4) & ~(wchar_t)0x7fffffff) == 0)
               return 5;
       }
bad:;
       errno = EILSEQ;
       return -1;
}

      
      





We define 7 byte types:



T0 0xxxxxxx      7 free bits
Tx 10xxxxxx      6 free bits
T1 110xxxxx      5 free bits
T2 1110xxxx      4 free bits
T3 11110xxx      3 free bits
T4 111110xx      2 free bits
T5 111111xx      2 free bits

      
      





The coding looks like this:



>From hex Thru hex      Sequence             Bits
00000000  0000007f      T0                   7
00000080  000007FF      T1 Tx                11
00000800  0000FFFF      T2 Tx Tx             16
00010000  001FFFFF      T3 Tx Tx Tx          21
00200000  03FFFFFF      T4 Tx Tx Tx Tx       26
04000000  FFFFFFFF      T5 Tx Tx Tx Tx Tx    32

      
      





Some notes:



  1. Two bytes can encode power of 2 ^ 11 characters, but only 2 ^ 11-2 ^ 7 will be used. Codes in the range 0-7F will be considered invalid. I think this is better than adding a bunch of "magic" constants with no real benefit. This remark applies to all longer sequences.
  2. Sequences of 4, 5, and 6 bytes exist for political reasons only. I would prefer to remove them.
  3. The 6 byte sequence spans 32 bits, the FSS-UTF proposal spans only 31.
  4. All sequences are synchronized to any byte that is not Tx.





***



This short correspondence put everything in its place. Although that “legendary napkin” has not survived, there were enough extracts from the mail server archive for the community to recognize their merits. Wikipedia added the names of Ken and Robert and a fun fact about a napkin in a diner, and on the Internet this story is circulated and discussed "as is", in the form of simple text, containing several letters and part of a dump from the mail server. 



The Plan 9 operating system has long been forgotten, no one remembers what it was written for and why it was “number nine”, and UTF-8, almost thirty years later, is still relevant and is not going to retire.



It would seem that this is just an encoding, but even such a simple story can be entertaining if you delve into it a little. At the dawn of technology development, it was impossible to predict what would shoot and go down in history, and what would be forgotten.



Archives source






All Articles