Emoji under the hood

image




Over the past few weeks, Nikita Prokopov has been implementing emoji support for Skija. He decided to share a few small details of how this "greatest innovation in human communication since the invention of the letter image" works under the hood.



Translator's note: Habr does not support emoji, so I had to work around that by replacing the emoji with pictures.



Unicode



Each character on a computer is encoded with a number. The most popular encoding is Unicode, and the two most common subvariants are UTF-8 and UTF-16.



Unicode reserves a 21-bit space (2^21, roughly 2 million values) for characters called "codepoints"; the valid range, U+0000 to U+10FFFF, covers about 1.1 million of them. Only ~150k characters are currently defined. All languages, living and dead, plus assorted decorations, were crammed into these 150,000 symbols. Among them are letters in alternative font styles and upside-down letters: image, as well as «GHz» as a single glyph: image.



There is a rightwards two-headed arrow with tail and double vertical stroke: image, a seven-eyed monster: image, and a duck:



image




Pay attention to the block of Egyptian hieroglyphs (U+13000–U+1342F); there are many interesting things in it:



image




Basic emoji



Emoji are just Unicode characters, located mostly in U+1F300-1F6FF and U+1F900-1FAFF:



image




Emoji behave like ordinary letters: you can perform all the same operations on them as on letters (translator's note: just not on Habr!). When you type “A”, the computer sees U+0041. When you type image, the computer sees U+1F335.
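A minimal Python sketch of the same idea: a letter and an emoji are both just codepoints. The emoji is written as an escape sequence, and the variable names are only illustrative.

```python
import unicodedata

letter = "A"
cactus = "\U0001F335"  # the emoji behind U+1F335, written as an escape

print(hex(ord(letter)))          # 0x41
print(hex(ord(cactus)))          # 0x1f335
print(unicodedata.name(cactus))  # CACTUS

# The usual string operations work on emoji as on any other character.
text = letter + cactus
print(len(text))                              # 2 codepoints
print(len(text.encode("utf-8")))              # 5 bytes: 1 for "A", 4 for the emoji
print(len(cactus.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (a surrogate pair)
```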



Emoji are fonts



Why are they displayed as pictures? Bitmap fonts. A font can contain fun PNG images as glyphs instead of boring black-and-white vector shapes.



image



Every OS comes with a pre-installed emoji font. On macOS/iOS it is Apple Color Emoji, on Windows it is Segoe UI Emoji, and on Android it is Noto Color Emoji.



Emojis, like fonts, look different on different devices. Some applications have their own emoji: WhatsApp, Twitter, Facebook.



image



Fallback fonts



You write text in some font, so how do emoji end up in it? And why does Russian text look poor in Clubhouse or on Medium?



image




Say you type the character U+1F419, and your font is, for example, San Francisco. San Francisco does not have a glyph for U+1F419, so your OS starts looking for another font that does.



U+1F419 is only available in Apple Color Emoji, so you see this: image.

That is why, whichever font you use, emoji look the same.
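The fallback lookup can be sketched as a toy model in Python. The coverage tables below are invented for illustration (real fonts cover far more, and a real OS uses a much more elaborate fallback chain); `pick_font` is a hypothetical helper, not an actual OS API.

```python
# Toy coverage tables: which codepoints each font has glyphs for.
# These sets are illustrative assumptions, not real font data.
FONT_COVERAGE = {
    "San Francisco": set(range(0x20, 0x7F)),   # pretend: printable ASCII only
    "Apple Color Emoji": {0x1F419, 0x1F335},   # pretend: a couple of emoji
}

def pick_font(char, preferred, fallbacks):
    """Return the first font in the chain that has a glyph for `char`."""
    for font in [preferred, *fallbacks]:
        if ord(char) in FONT_COVERAGE.get(font, set()):
            return font
    return None  # a real renderer would draw a "missing glyph" box here

print(pick_font("A", "San Francisco", ["Apple Color Emoji"]))
print(pick_font("\U0001F419", "San Francisco", ["Apple Color Emoji"]))
```

In this model the letter stays in the preferred font, while the octopus is always resolved to the emoji font, no matter which text font was requested.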



image




Variation selector-16



Some emoji started out as plain symbols back in 1993, in the Miscellaneous Symbols (U+2600-26FF) and Dingbats (U+2700-27BF) blocks:



image




These glyphs are just like letters: black and white. Many fonts have their own version of image (U+2702 BLACK SCISSORS):



image




Apple Color Emoji has its own version:



image




How does the OS know whether to display image or image if both have the same codepoint, U+2702?



Meet U+FE0F, also known as VARIATION SELECTOR-16. It is a hint for the text renderer to switch to the emoji presentation.



image




Simple, elegant, and no need to allocate new codepoints. Both variants image have the same meaning, just a slightly different style.
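In code, the two variants differ only by the trailing selector. A small Python sketch (the constant names are illustrative):

```python
import unicodedata

SCISSORS = "\u2702"
VS16 = "\uFE0F"

text_style = SCISSORS          # the renderer may pick a monochrome glyph
emoji_style = SCISSORS + VS16  # hint: prefer the emoji glyph

print(unicodedata.name(SCISSORS))                # BLACK SCISSORS
print(unicodedata.name(VS16))                    # VARIATION SELECTOR-16
print(len(text_style), len(emoji_style))         # 1 2
print([f"U+{ord(c):04X}" for c in emoji_style])  # ['U+2702', 'U+FE0F']
```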



Grapheme clusters



Here we run into another problem: our emoji is no longer one codepoint, but two. This means we need a way to define where a character ends.



Grapheme clusters come to the rescue. A grapheme cluster is a sequence of codepoints that a human perceives as a single character.



Grapheme clusters were not invented just for emoji; they apply to regular alphabets as well. image is a single grapheme cluster, even though it consists of two codepoints: U+0055 LATIN CAPITAL LETTER U followed by U+0308 COMBINING DIAERESIS.



Grapheme clusters add a lot of complexity for programmers. You can't just do substring(0, 10) to take the first 10 characters: you might split an emoji in half.



Reversing a string has to be done carefully, too. U+263A U+FE0F makes sense, but U+FE0F U+263A doesn't.



image




Finally, you cannot just call .length on a string. Well, you can, but the result will surprise you. If you are a developer, try running image in your browser console.
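The three pitfalls above can be reproduced in Python. The facepalm sequence is an example chosen here, not necessarily the one in the picture. Note that Python's `len` counts codepoints, while JavaScript's `.length` counts UTF-16 code units, so the sketch computes both.

```python
# FACE PALM + skin tone modifier + ZWJ + MALE SIGN + VS16: one visible emoji
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

codepoints = len(facepalm)                            # what Python's len() reports
utf16_units = len(facepalm.encode("utf-16-le")) // 2  # what JavaScript's .length reports
utf8_bytes = len(facepalm.encode("utf-8"))
print(codepoints, utf16_units, utf8_bytes)            # 5 7 17

# Slicing by codepoint cuts the cluster apart: only base + skin tone survive.
first_two = facepalm[:2]
print([f"U+{ord(c):X}" for c in first_two])           # ['U+1F926', 'U+1F3FC']

# Naive reversal puts the variation selector before its base character.
smiley = "\u263A\uFE0F"
reversed_naive = smiley[::-1]
print([f"U+{ord(c):X}" for c in reversed_naive])      # ['U+FE0F', 'U+263A']
```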



Programmer tip: if you work with text, get a library that understands grapheme clusters. For C, C++ and the JVM that can be ICU; Swift does everything right by default; for other languages, look for one yourself.
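To give a feel for what such a library does, here is a deliberately simplified splitter in Python. It is a rough approximation and NOT the full Unicode segmentation algorithm: it only glues combining marks, variation selectors, skin tone modifiers, tag characters, ZWJ sequences and pairs of regional indicators, and ignores many other rules (Hangul, for instance). The function names are hypothetical; for real work, use a proper library.

```python
import unicodedata

ZWJ = "\u200D"

def is_extender(ch):
    """Codepoints that attach to the preceding character in this simplified model."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, incl. U+FE0F, U+20E3
        or ch == ZWJ
        or 0x1F3FB <= cp <= 0x1F3FF                     # skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F                     # tag characters
    )

def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def rough_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules only)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        glue = bool(prev) and (
            is_extender(ch)
            or prev.endswith(ZWJ)  # whatever follows a ZWJ joins the cluster
            or (is_regional_indicator(ch)
                and len(prev) == 1 and is_regional_indicator(prev))
        )
        if glue:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

sample = "U\u0308ber \U0001F926\U0001F3FC\u200D\u2642\uFE0F"
parts = rough_clusters(sample)
print(len(sample), len(parts))            # 11 codepoints, 6 clusters
safe_reverse = "".join(reversed(parts))   # reverses clusters, not codepoints
safe_prefix = "".join(parts[:1])          # keeps the U and its diaeresis together
```

Length, substring and reverse are then done on the list of clusters rather than on the raw string.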



image




This thing has a length of 65 and cannot be split. Live with it now.



Skin Tone Modifier



Most human emoji depict an abstract yellow person. When skin tone was added in 2015, instead of adding a new codepoint for each combination of emoji and skin tone, only five new codepoints were added: U+1F3FB..U+1F3FF.



They are not meant to be used on their own, but appended to existing emoji. Together they form a ligature: if we type image (U+1F44B WAVING HAND SIGN) followed by U+1F3FD MEDIUM SKIN TONE MODIFIER, we get image.



image does not have its own codepoint (it is a sequence of two: U+1F44B U+1F3FD), but it has its own unique look. In total, with the help of five modifiers, ~280 human emoji were turned into 1680 variations. Here are some dancers:



image
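The skin tone mechanism described above is easy to sketch in Python: append one of the five modifier codepoints to the base emoji. The helper name `codepoints` is illustrative.

```python
WAVING_HAND = "\U0001F44B"
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]  # the five modifiers

def codepoints(s):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):X}" for c in s]

# The default yellow hand plus one variant per modifier.
variants = [WAVING_HAND] + [WAVING_HAND + tone for tone in SKIN_TONES]
for v in variants:
    print(codepoints(v))

medium = WAVING_HAND + "\U0001F3FD"
print(codepoints(medium))   # ['U+1F44B', 'U+1F3FD']

# ~280 emoji, each in the default look plus five tones:
total = 280 * (1 + len(SKIN_TONES))
print(total)                # 1680
```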




Zero-width Joiner



Let's say your friend just sent you a photo of an apple she is growing in her garden. You need to reply, but how? You can send image WOMAN (U+1F469) followed by image SHEAF OF RICE (U+1F33E). On their own they are displayed as two separate emoji: image, but if you put U+200D between them, you get a farmer: image



U+200D is called the Zero-width Joiner, or ZWJ for short. It works in a similar way to what we saw with skin tone, but this time you can combine two self-contained emoji into one. Not all combinations work, but many do, sometimes in surprising ways!
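In Python terms, a ZWJ sequence is just the parts joined by U+200D. A small sketch (`zwj_join` is a hypothetical helper); whether the result renders as a single emoji depends on the font:

```python
ZWJ = "\u200D"
WOMAN = "\U0001F469"
RICE = "\U0001F33E"

side_by_side = WOMAN + RICE   # two separate emoji
farmer = WOMAN + ZWJ + RICE   # one emoji, where the font supports the sequence

def zwj_join(*parts):
    """Glue several emoji together with zero-width joiners."""
    return ZWJ.join(parts)

print([f"U+{ord(c):X}" for c in zwj_join(WOMAN, RICE)])
# ['U+1F469', 'U+200D', 'U+1F33E']
```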



Some examples:



image




One weird inconsistency I noticed is that hair color is done through ZWJ, while skin tone is just an emoji modifier without ZWJ. Why? I have no idea.



image




Unfortunately, some emojis are not implemented as combinations with ZWJ. I consider this a missed opportunity:



image




How do you type a ZWJ? You can't. But you can copy it from here: “”. Note: this is a special character, so expect it to behave strangely. You cannot see it, but it is there. (Translator's note: the original article includes it here, but Habr does not allow it.)



Another big area where ZWJ shines is composing families and relationships. Here's a short story to illustrate:



image




Flags



Country flags are part of the Unicode standard, but for some reason are not implemented on Windows. If you are reading this in a Windows browser, sorry!



Flags do not have dedicated codepoints. Instead, they are two-letter ligatures.



image




Left: Windows, right: Mac



They do not use regular letters, though. Instead, the “regional indicator symbol letter” alphabet (U+1F1E6..1F1FF) is used. These letters are not used for anything other than composing flags.
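A sketch of the mapping in Python: each ASCII letter A..Z is shifted into the regional indicator range. The function name `flag` is illustrative, and it does not check whether the pair is one of the valid combinations.

```python
REGIONAL_INDICATOR_A = 0x1F1E6

def flag(country_code):
    """Map a two-letter code such as "DE" to a pair of regional indicators."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)

print([f"U+{ord(c):X}" for c in flag("DE")])   # ['U+1F1E9', 'U+1F1EA']
```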



What happens if you put two random letters together? Not much: image (except that text editing starts to behave strangely).



If you want to experiment, feel free to copy and combine from this alphabet: image



There are 258 valid two-letter combinations. Can you find them all?



A fun side effect of the two-letter ligature: image



Sequences of tags



Two-letter ligatures are cool, but don't you want to be cooler? How about 32-letter ligatures? Meet tag sequences.



A tag sequence is a regular emoji, followed by a run of yet another kind of Latin letters (tag characters, U+E0020..E007E), and terminated by U+E007F CANCEL TAG.



They are currently only used for these three flags: England, Scotland and Wales:



image
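A sketch of how such a sequence is assembled, in Python. The base emoji (U+1F3F4 WAVING BLACK FLAG) and the tags `gbeng`, `gbsct` and `gbwls` are not spelled out in the text above; they are the values these three flags use as far as I know, so treat them as an assumption. `tag_sequence` is a hypothetical helper.

```python
BLACK_FLAG = "\U0001F3F4"   # WAVING BLACK FLAG, used as the base emoji
CANCEL_TAG = "\U000E007F"

def tag_sequence(base, tag):
    """Append `tag` as tag characters (U+E0020..E007E), then CANCEL TAG."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be printable ASCII")
    # Each tag character mirrors an ASCII character, shifted by 0xE0000.
    return base + "".join(chr(0xE0000 + ord(c)) for c in tag) + CANCEL_TAG

england = tag_sequence(BLACK_FLAG, "gbeng")
scotland = tag_sequence(BLACK_FLAG, "gbsct")
wales = tag_sequence(BLACK_FLAG, "gbwls")

print([f"U+{ord(c):X}" for c in england])
# ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0065', 'U+E006E', 'U+E0067', 'U+E007F']
```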




Keycaps



Not super exciting, but necessary for completeness: keycap sequences use yet another convention.



It looks like this: take a digit, * or #, turn it into an emoji with U+FE0F, and wrap it in a square with U+20E3 COMBINING ENCLOSING KEYCAP.



image




There are 12 of them:



image
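The keycap recipe translates directly into Python. A small sketch (the names `keycap` and `KEYCAP_BASES` are illustrative):

```python
VS16 = "\uFE0F"
ENCLOSING_KEYCAP = "\u20E3"   # COMBINING ENCLOSING KEYCAP
KEYCAP_BASES = set("0123456789#*")

def keycap(ch):
    """Build a keycap sequence: base character + VS16 + enclosing keycap."""
    if ch not in KEYCAP_BASES:
        raise ValueError("keycaps exist only for digits, # and *")
    return ch + VS16 + ENCLOSING_KEYCAP

keycaps = [keycap(c) for c in "0123456789#*"]
print(len(keycaps))                           # 12
print([f"U+{ord(c):04X}" for c in keycap("7")])
# ['U+0037', 'U+FE0F', 'U+20E3']
```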




Unicode updates



Unicode is updated every year and emoji are a core part of every release. For example, in Unicode 13 (March 2020) 55 new emojis were added.



At the time of this writing, neither the latest macOS (11.2.3) nor iOS (14.4.1) supports emoji from Unicode 13, such as: image



Here's what I see in March 2021: image



But thanks to the magic of ZWJ, I can still understand what is meant, just not in the most optimal way.



Conclusion



To summarize, there are seven ways to encode emoji:



  1. Single codepoint image
  2. Single codepoint + variation selector-16 image
  3. Skin Tone Modifier image
  4. Sequencing with a zero-width joiner image
  5. Flags image
  6. Sequence of tags image
  7. Keycap sequence image


Methods 1-4 can be combined to build a rather complex message:



image




If you are a programmer, remember to always use a library such as ICU for:



  • extracting a substring
  • measuring string length
  • reversing a string


The keyword to google is "grapheme cluster". This applies to emoji, Western diacritics, Indic and Korean scripts, so please be careful.



image







