At the start of the talk, however, I said it would not be yet another exposé from the "misconceptions about X that programmers believe in" series. You can find any number of those. I don't like such articles, though: they list things that are supposedly false, but rarely explain why they are false or what should be done instead. I suspect people just read them, congratulate themselves on the achievement, and then go find interesting new ways to make mistakes the articles did not mention, because they never really understood the problems behind the errors.
So in my talk I tried to explain a few problems as well as I could, along with how to solve them; I like this approach much more. One topic I touched on only in passing (a single slide and a couple of mentions on other slides) is the complexity of character case. There is an official Correct Answer™ for the problem I discussed, case-insensitive comparison of identifiers, and in my talk I gave the best solution I know of that uses only the Python standard library.
However, I only briefly mentioned the deeper complexities of Unicode case, and I want to spend some time on the details. They are interesting, and understanding them can help you make decisions when designing and writing text-processing code. So I offer you the opposite of the "misconceptions about X that programmers believe in" articles: "truths that programmers should know."
One more thing: Unicode is full of terminology. In this article I will mainly use the terms "uppercase" and "lowercase", since those are the terms the Unicode standard uses. If you prefer other names for them, that's fine. I will also use the term "character" a lot, which some may object to. Yes, in Unicode the concept of a "character" is not always what people expect, so it is often best avoided in favor of other terms. In this article, however, I use the term the way Unicode does: to describe an abstract entity about which statements can be made. Where it matters, I'll use more specific terms such as "code point" to clarify.
There are more than two cases
Native speakers of European languages are used to their languages using letter case to signal specific things. In English [and Russian], for example, we usually start a sentence with an uppercase letter and continue mostly in lowercase. Proper names also begin with uppercase letters, and many acronyms and abbreviations are written entirely in uppercase.
And we usually assume that there are only two cases. There is the letter "A" and there is the letter "a". One is uppercase, the other lowercase, isn't that so?
However, Unicode has three cases: uppercase, lowercase, and title case [titlecase]. In English, this is how titles are written, for example "Avengers: Infinity War". Typically the first letter of each word is simply written in uppercase (and, depending on the rules and style in use, some words, such as articles, are not capitalized).
The Unicode standard gives an example of a titlecase character: U+01F2 LATIN CAPITAL LETTER D WITH SMALL Z. It looks like this: ǲ.
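As a small illustration, Python's standard `unicodedata` module and the built-in string methods can be used to inspect this character. This is only a sketch; the variable names are made up for the example, and the results depend on the Unicode database bundled with your Python version.

```python
import unicodedata

# U+01F2 LATIN CAPITAL LETTER D WITH SMALL Z, the titlecase member of a three-way set
dz_title = "\u01f2"

category = unicodedata.category(dz_title)  # general category of the character
dz_upper = dz_title.upper()                # uppercase form of the same letter
dz_lower = dz_title.lower()                # lowercase form of the same letter
dz_retitled = dz_lower.title()             # title-casing the lowercase form

for label, ch in [("title", dz_title), ("upper", dz_upper), ("lower", dz_lower)]:
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The three forms are distinct code points, which is why a two-case model cannot describe them.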
Such characters are sometimes needed to deal with the consequences of one of the earliest decisions made for the Unicode standard: round-trip compatibility with existing text encodings. It would be more convenient for Unicode to build such sequences out of the standard's combining characters. However, many existing systems had already allocated space for precomposed sequences. For example, in ISO-8859-1 ("latin-1"), the character "é" has a precomposed form with the value 0xe9. In Unicode, it would be preferable to write this letter as a separate "e" followed by a combining accent. But to ensure full compatibility with existing encodings such as latin-1, Unicode also assigns code points to precomposed characters, for example U+00E9 LATIN SMALL LETTER E WITH ACUTE.
Although the code point of this character is the same as its latin-1 byte value, you should not rely on this. Unicode encodings will not necessarily preserve these values as bytes. For example, in UTF-8 the code point U+00E9 is written as the byte sequence 0xc3 0xa9.
And, of course, existing encodings contained characters that needed special handling in title case, which is why they were included in Unicode "as is". If you want to look at them, search your favorite Unicode database for characters in the Lt category ("Letter, titlecase").
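The difference between the code point and its encoded bytes can be checked directly in Python. The sketch below uses only `str.encode` and `unicodedata.normalize`; the variable names are illustrative.

```python
import unicodedata

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = e_acute.encode("latin-1")  # one byte, equal to the code point value
utf8_bytes = e_acute.encode("utf-8")      # two bytes, unrelated to 0xe9

# Canonical decomposition: a plain "e" followed by U+0301 COMBINING ACUTE ACCENT
decomposed = unicodedata.normalize("NFD", e_acute)

print(latin1_bytes, utf8_bytes, [f"U+{ord(c):04X}" for c in decomposed])
```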
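If Python is the most convenient "Unicode database" at hand, one way to run that search is to scan every code point with `unicodedata.category`. This is a brute-force sketch, and the exact list depends on the Unicode version shipped with your interpreter.

```python
import sys
import unicodedata

# Collect every code point whose general category is Lt ("Letter, titlecase").
titlecase_chars = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

for ch in titlecase_chars:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```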
There are several ways to define case
The Unicode standard (§4.2) lists three different definitions of case. Perhaps the choice among the three has already been made for you by your programming language; otherwise, it depends on your specific goal. The definitions are:
- A character is uppercase if it is in the Lu category ("Letter, uppercase"), and lowercase if it is in the Ll category ("Letter, lowercase"). The standard acknowledges the limitation of this definition: each character can be assigned to only one category. Because of this, many characters that "ought to be" uppercase or lowercase do not qualify, since they belong to some other category.
- A character is uppercase if it has the derived Uppercase property, and lowercase if it has the derived Lowercase property. This combines definition 1 with other character properties that can imply case.
- A character is uppercase if it does not change when mapped to uppercase, and lowercase if it does not change when mapped to lowercase. This is a fairly general definition, but it can also behave non-intuitively.
If you are working with a limited subset of characters (specifically, with letters), definition 1 may be enough for you. If your repertoire is broader and includes letter-like characters that are not letters, definition 2 may suit you. Definition 3 is the one the Unicode standard recommends in §4.2:
Programmers manipulating Unicode strings should work with string functions such as isLowerCase (and its functional cousin toLowerCase) if they do not work directly with character properties.
The functions mentioned there are defined in §3.13 of the Unicode standard. Formally, definition 3 uses the isLowerCase and isUpperCase functions from §3.13, which are defined in terms of the fixed points of toLowerCase and toUpperCase, respectively.
If your programming language has functions for checking or converting the case of strings or individual characters, it is worth finding out which of these definitions the implementation uses. If you're interested, the isupper() and islower() methods in Python use definition 2.
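The three definitions can be approximated in Python as follows. This is a rough sketch rather than a faithful implementation of the standard: the function names are invented for the example, definition 2 simply trusts `str.isupper()` and `str.islower()`, and definition 3 compares a character with the result of `str.upper()` and `str.lower()` without the normalization step the standard describes.

```python
import unicodedata

def case_by_category(ch):
    """Definition 1: look only at the general category (Lu / Ll)."""
    return {"Lu": "upper", "Ll": "lower"}.get(unicodedata.category(ch), "neither")

def case_by_property(ch):
    """Definition 2 (approximation): rely on str.isupper() / str.islower()."""
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    return "neither"

def case_by_mapping(ch):
    """Definition 3 (approximation): unchanged by the corresponding case mapping."""
    return {"upper": ch.upper() == ch, "lower": ch.lower() == ch}

# U+00AA FEMININE ORDINAL INDICATOR is letter-like but is in neither Lu nor Ll
for ch in ["A", "a", "\u00aa"]:
    print(repr(ch), case_by_category(ch), case_by_property(ch), case_by_mapping(ch))
```

For ordinary letters all three agree; for a character such as U+00AA the definitions start to diverge.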
You cannot tell a character's case from its appearance or name
For many characters, you can tell the case just by looking at them. For example, "A" is uppercase. The character's name says so as well: "LATIN CAPITAL LETTER A". Sometimes, however, this method does not work. Take the code point U+1D34. It looks like this: ᴴ. Unicode gives it the name MODIFIER LETTER CAPITAL H. So it's uppercase, right?
In fact, it has the derived Lowercase property, so by definition 2 it is lowercase, even though it visually resembles an uppercase H and its name contains the word "CAPITAL".
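This can be checked from Python, assuming (as noted earlier) that `str.islower()` and `str.isupper()` follow definition 2. The tuple below is just an illustrative way of collecting the facts.

```python
import unicodedata

modifier_h = "\u1d34"

facts = (
    unicodedata.name(modifier_h),      # the official character name
    unicodedata.category(modifier_h),  # general category (not Lu or Ll)
    modifier_h.isupper(),
    modifier_h.islower(),
)
print(facts)
```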
Some characters have no case at all
Definition D135 in §3.13 of the Unicode standard states:
A character C is cased if and only if C has the Lowercase or Uppercase property, or has a General_Category value of Titlecase_Letter.
This means that a great many Unicode characters, in fact most of them, are caseless. Questions about their case do not make sense, and case conversions do not affect them. However, definition 3 will still give an answer to such a question.
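A minimal sketch of that definition in Python, assuming `str.islower()` and `str.isupper()` reflect the Lowercase and Uppercase properties; `is_cased` is a hypothetical helper name, not a standard-library function.

```python
import unicodedata

def is_cased(ch):
    """Approximate D135: Lowercase or Uppercase property, or category Lt."""
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"

# A Latin letter, a titlecase digraph, a digit, and a CJK ideograph
for ch in ["A", "\u01f2", "1", "\u4e2d"]:
    print(f"U+{ord(ch):04X}", is_cased(ch))
```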
Some characters behave as if they had multiple cases
The consequence is that if you use definition 3 and ask whether a caseless character is uppercase or lowercase, the answer you get is "yes".
The Unicode standard gives the example (Table 4-1, line 7) of the character U+02BD MODIFIER LETTER REVERSED COMMA (which looks like this: ʽ). It has neither the derived Lowercase nor the derived Uppercase property, and it is not in the Lt category, so it has no case. At the same time, converting it to uppercase does not change it, and converting it to lowercase does not change it either. So, according to definition 3, it answers "yes" to both questions: "are you uppercase?" and "are you lowercase?"
This may seem like a source of needless confusion, but the point is that definition 3 works with any sequence of Unicode characters and simplifies case conversion algorithms (caseless characters simply map to themselves).
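The "yes to both" behavior is easy to reproduce. In this sketch the two helper functions are hypothetical stand-ins for definition 3, built on `str.upper()` and `str.lower()`.

```python
def is_upper_by_mapping(text):
    """Definition 3 (approximation): unchanged by uppercasing."""
    return text.upper() == text

def is_lower_by_mapping(text):
    """Definition 3 (approximation): unchanged by lowercasing."""
    return text.lower() == text

reversed_comma = "\u02bd"  # MODIFIER LETTER REVERSED COMMA, a caseless character

print(is_upper_by_mapping(reversed_comma), is_lower_by_mapping(reversed_comma))
print(reversed_comma.isupper(), reversed_comma.islower())
```

The same holds for any string made only of caseless characters, such as "123".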
Case is context sensitive
You might think that if the Unicode case conversion tables cover all characters, then conversion is just a matter of finding the right row in a table. For example, the Unicode database says that the lowercase of U+0041 LATIN CAPITAL LETTER A is U+0061 LATIN SMALL LETTER A. Simple, isn't it?
One example where this approach does not work is Greek. The character Σ, that is, U+03A3 GREEK CAPITAL LETTER SIGMA, maps to two different characters when converted to lowercase, depending on where it appears in the word. At the end of a word, its lowercase form is ς (U+03C2 GREEK SMALL LETTER FINAL SIGMA). Anywhere else it is σ (U+03C3 GREEK SMALL LETTER SIGMA).
This means that case mapping is neither one-to-one nor transitive. Another example is ß (U+00DF LATIN SMALL LETTER SHARP S, or Eszett). Its uppercase form is "SS", although there is now also a dedicated uppercase form (ẞ, U+1E9E LATIN CAPITAL LETTER SHARP S). And converting "SS" to lowercase gives "ss", so (using the Unicode terminology for case conversion): toLowerCase(toUpperCase(ß)) != ß.
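Python's `str.lower()` appears to apply this context rule, so the effect can be observed directly. The sample words below are only illustrations.

```python
word = "ΟΔΟΣ"    # capital sigma at the end of the word
other = "ΣΟΦΙΑ"  # capital sigma at the start of the word

lowered_word = word.lower()
lowered_other = other.lower()

# Show which lowercase sigma each conversion produced
print(lowered_word, f"U+{ord(lowered_word[-1]):04X}")
print(lowered_other, f"U+{ord(lowered_other[0]):04X}")
```

A per-character table lookup cannot produce both results, because the same input code point yields different outputs.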
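The failed round trip looks like this in Python (variable names are illustrative):

```python
sharp_s = "\u00df"          # ß, LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

uppercased = sharp_s.upper()     # one character becomes two
round_trip = uppercased.lower()  # and does not come back as ß

print(uppercased, round_trip, round_trip == sharp_s)
print(capital_sharp_s.lower() == sharp_s)
```

Note that the string also changes length, so code that assumes case conversion preserves length or indices can break.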
Case is locale dependent
Different languages have different case conversion rules. The best-known example: i (U+0069 LATIN SMALL LETTER I) and I (U+0049 LATIN CAPITAL LETTER I) convert to each other in most locales. Most, but not all. In the az and tr locales (Turkic languages), the uppercase of i is İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), and the lowercase of I is ı (U+0131 LATIN SMALL LETTER DOTLESS I). Sometimes, getting this right really is a matter of life and death.
Unicode itself does not handle every possible case conversion rule for every locale. The Unicode database contains only general, locale-independent rules for converting all characters, plus special rules for some languages and compound forms: Lithuanian, the Turkic languages, and some features of Greek. Nothing else is there. §3.13 of the standard mentions this and recommends adding locale-specific conversion rules where needed.
One English-language example is the title-casing of certain names. "o'brian" must be converted to "O'Brian" (not "O'brian"). Yet "it's" must be converted to "It's" and not to "It'S". Another example that Unicode does not handle is the Dutch letter combination "ij", both letters of which must be uppercased in title case when it appears at the beginning of a word. Thus, the largest bay in the Netherlands is "IJsselmeer" in title case, not "Ijsselmeer". Unicode does have the characters Ĳ (U+0132 LATIN CAPITAL LIGATURE IJ) and ĳ (U+0133 LATIN SMALL LIGATURE IJ) if you need them. By default, case conversion maps them to each other (although Unicode normalization forms that use compatibility equivalence will split them into two separate characters).
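Python's built-in `str.upper()` and `str.lower()` take no locale argument, so they apply the default mappings. A naive workaround is sketched below; `upper_tr` and `lower_tr` are hypothetical helpers that handle only the four precomposed characters mentioned above and ignore decomposed sequences, so treat them as an illustration rather than a complete Turkic tailoring.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

def upper_tr(text):
    """Uppercase with the Turkic i/İ and ı/I pairing (simplified)."""
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()

def lower_tr(text):
    """Lowercase with the Turkic İ/i and I/ı pairing (simplified)."""
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

print("istanbul".upper(), upper_tr("istanbul"))  # default vs. tailored
print("ISPARTA".lower(), lower_tr("ISPARTA"))
```

For real applications, a library that implements locale tailoring (such as ICU bindings) is likely a better choice than hand-written replacements.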
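Python's `str.title()` is a convenient way to see these limitations, since it applies a simple word-boundary rule with no language-specific tailoring. The sketch below also checks the ligature behavior described above.

```python
import unicodedata

samples = ["o'brian", "it's", "ijsselmeer"]
titled = {s: s.title() for s in samples}
for original, result in titled.items():
    print(original, "->", result)

small_ij = "\u0133"    # LATIN SMALL LIGATURE IJ
capital_ij = "\u0132"  # LATIN CAPITAL LIGATURE IJ

ligature_upper = small_ij.upper()                        # default mapping pairs the ligatures
ligature_nfkc = unicodedata.normalize("NFKC", capital_ij)  # compatibility form splits it
```

The apostrophe rule that happens to suit "o'brian" is the same rule that mangles "it's", which is why this cannot be fixed without knowing the language and the word.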
Let's return to the material from the talk. The complexity of Unicode case handling means that case-insensitive comparisons cannot be done with the standard lowercase or uppercase conversion functions found in many programming languages. For such comparisons, Unicode has the concept of case folding, and §3.13 of the standard defines the toCaseFold and isCaseFolded functions.
You might think that case folding is the same as converting to lowercase, but it is not. The Unicode standard warns that a case-folded string is not necessarily lowercase. It gives the Cherokee language as an example: there, a case-folded string will still contain uppercase characters.
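The Cherokee behavior can be observed with `str.casefold()`, assuming a Python version whose Unicode database includes the Cherokee small letters (Unicode 8.0 or later).

```python
cherokee_small_a = "\uab70"    # CHEROKEE SMALL LETTER A
cherokee_capital_a = "\u13a0"  # CHEROKEE LETTER A (the uppercase form)

folded_small = cherokee_small_a.casefold()
folded_capital = cherokee_capital_a.casefold()

print(f"U+{ord(folded_small):04X}", f"U+{ord(folded_capital):04X}")
print(folded_small.isupper())  # the folded form is an uppercase character
print("A".casefold())          # whereas Latin folds to lowercase
```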
One of the slides in my talk implements the recommendation of Unicode Technical Report #36 as fully as Python allows: NFKC normalization is applied, and then the casefold() method (available only in Python 3+) is called on the resulting string. Even so, some edge cases are missed, and this is not exactly what is recommended for comparing identifiers. The bad news first: Python does not expose enough Unicode properties to filter out characters that are not in XID_Start or XID_Continue, or characters that have the Default_Ignorable_Code_Point property. As far as I know, it does not support the NFKC_Casefold mapping either. There is also no easy way to use the modified NFKC from UAX #31, §5.1.
The good news is that most of these edge cases do not involve any real security risk from the characters in question. And case folding is in principle not defined as a normalization-preserving operation (hence the NFKC_Casefold mapping, which re-normalizes to NFC after case folding). Generally, when comparing, you don't care whether both strings are normalized after preprocessing. You care that the preprocessing is consistent and guarantees that only strings that "should" differ are still different afterwards. If this worries you, you can manually re-normalize after case folding.
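The approach described here amounts to only a couple of lines. In this sketch `fold_identifier` and `identifiers_match` are hypothetical names, and, as noted above, it does not filter by XID_Start, XID_Continue, or Default_Ignorable_Code_Point.

```python
import unicodedata

def fold_identifier(value):
    """NFKC-normalize, then case-fold, for comparison purposes only."""
    return unicodedata.normalize("NFKC", value).casefold()

def identifiers_match(left, right):
    """Compare two identifiers case-insensitively after the same preprocessing."""
    return fold_identifier(left) == fold_identifier(right)

print(identifiers_match("Stra\u00dfe", "STRASSE"))  # ß folds to "ss"
print(identifiers_match("\ufb01le", "FILE"))        # the "fi" ligature is split by NFKC
print(identifiers_match("alice", "bob"))
```

The folded value is meant for comparison, not for display or storage as the canonical form of the identifier.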
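Manual re-normalization after folding could look like the sketch below. The function name is made up for the example, and the choice of NFKC for the second pass is an assumption made to stay consistent with the first pass.

```python
import unicodedata

def fold_and_renormalize(value):
    """NFKC-normalize, case-fold, then normalize again (assumed: NFKC)."""
    folded = unicodedata.normalize("NFKC", value).casefold()
    return unicodedata.normalize("NFKC", folded)

# U+01F0 LATIN SMALL LETTER J WITH CARON is one character whose folded form
# may differ from its normalized form; compare the two intermediate results.
sample = "\u01f0"
folded_only = unicodedata.normalize("NFKC", sample).casefold()
renormalized = fold_and_renormalize(sample)
print([f"U+{ord(c):04X}" for c in folded_only])
print([f"U+{ord(c):04X}" for c in renormalized])
```

Whichever variant you pick, the important part is applying the same one to both strings being compared.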
Enough for now
This article, like the talk before it, is not exhaustive, and it is hardly possible to fit all of this material into a single post. I hope it has been a useful overview of the complexities of the topic and gives you enough starting points to look for further information. So, in principle, I can stop here.
Would it be naive to hope that other people will stop writing exposés from the "misconceptions about X that programmers believe in" series and start writing articles like "truths that programmers should know"?