A History of Two Standard C Libraries

Today I got a bug report from a Debian user who fed some bullshit to the scdoc utility and got a SIGSEGV. Investigating the problem gave me an excellent opportunity to compare musl libc and glibc. First, let's take a look at the stack trace:

==26267==ERROR: AddressSanitizer: SEGV on unknown address 0x7f9925764184 (pc 0x0000004c5d4d bp 0x000000000002 sp 0x7ffe7f8574d0 T0)
==26267==The signal is caused by a READ memory access.
    #0 0x4c5d4d in parse_text /scdoc/src/main.c:223:61
    #1 0x4c476c in parse_document /scdoc/src/main.c
    #2 0x4c3544 in main /scdoc/src/main.c:763:2
    #3 0x7f99252ab0b2 in __libc_start_main /build/glibc-YYA7BZ/glibc-2.31/csu/../csu/libc-start.c:308:16
    #4 0x41b3fd in _start (/scdoc/scdoc+0x41b3fd)

The source code at that line reads:

if (!isalnum(last) || ((p->flags & FORMAT_UNDERLINE) && !isalnum(next))) {

Hint: p is a valid, non-NULL pointer. The variables last and next are of type uint32_t. The segfault happens on the second call to isalnum. And, most importantly, it is reproducible only with glibc, not with musl libc. If you had to re-read the code several times, you are not alone: there is simply nothing here that should trigger a segfault.

Since it was clear that glibc was involved, I grabbed its sources and went looking for the implementation of isalnum, bracing myself to run into some stupid crap. But before I get to the stupid crap, of which, believe me, there is plenty, let's first take a quick look at the good version. This is how isalnum is implemented in musl libc:

int isalnum(int c)
{
	return isalpha(c) || isdigit(c);
}

int isalpha(int c)
{
	return ((unsigned)c|32)-'a' < 26;
}

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

As expected, this function works without a segfault for any value of c, because why the hell should isalnum segfault at all?
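Why this works for any input deserves a quick sketch. The functions below use the same expressions as the musl snippets, but under hypothetical ascii_* names so they do not clash with <ctype.h>; the comments are my reading of the trick, not musl's own documentation.

```c
#include <stdint.h>

/* Hypothetical names; the expressions are the ones quoted from musl. */
static int ascii_isalpha(int c)
{
	/* OR-ing with 32 folds 'A'..'Z' onto 'a'..'z'. The subtraction is
	   done in unsigned arithmetic, so anything below 'a' wraps around
	   to a huge value and a single comparison checks both bounds. */
	return ((unsigned)c | 32) - 'a' < 26;
}

static int ascii_isdigit(int c)
{
	/* Same wraparound trick for the range '0'..'9'. */
	return (unsigned)c - '0' < 10;
}

static int ascii_isalnum(int c)
{
	return ascii_isalpha(c) || ascii_isdigit(c);
}

/* A uint32_t codepoint far outside unsigned char is simply "not alnum":
   no table, no pointer, nothing to dereference. */
static int emoji_is_alnum(void)
{
	uint32_t emoji = 0x1F600;
	return ascii_isalnum((int)emoji);
}
```

Calling ascii_isalnum('Q') or ascii_isalnum('7') yields 1, while '_' and the emoji codepoint yield 0. There is no memory access anywhere, so there is nothing that could fault.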

Okay, now let's compare this with the glibc implementation. As soon as you open the header, you are greeted with the typical GNU nonsense, but let's skip that and try to find isalnum.

The first result is this:

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  // ...
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

It looks like an implementation detail, let's move on.

__exctype (isalnum);

But what is __exctype? Let's go back up a few lines...

#define __exctype(name) extern int name (int) __THROW

Okay, apparently this is just a prototype. It is not clear, however, why a macro is needed here. Looking further...

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// ...

So, this looks like it might be something useful. What is __isctype_f? Scrolling up...

#ifndef __cplusplus
# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
#elif defined __USE_EXTERN_INLINES
# define __isctype_f(type) \
  __extern_inline int                                                         \
  is##type (int __c) __THROW                                                  \
  {                                                                           \
    return (*__ctype_b_loc ())[(int) (__c)] & (unsigned short int) _IS##type; \
  }
#endif

Well, here we go... Okay, we'll figure it out somehow. Apparently, __isctype_f is an inline function... wait, it is defined in the #elif branch of an #ifndef __cplusplus block, so it only applies to C++. Dead end. Where the hell is isalnum actually defined? Looking further... Maybe this is it?

#if !defined __NO_CTYPE
# ifdef __isctype_f
__isctype_f (alnum)
// ...
# elif defined __isctype
# define isalnum(c)     __isctype((c), _ISalnum) // <-

Hey, this is the "implementation detail" we saw earlier. Remember?

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  // ...
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

Let's try to quickly pick this macro apart:

# include <bits/endian.h>
# if __BYTE_ORDER == __BIG_ENDIAN
#  define _ISbit(bit)   (1 << (bit))
# else /* __BYTE_ORDER == __LITTLE_ENDIAN */
#  define _ISbit(bit)   ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8))
# endif

What the fuck is this? Okay, let's move on and treat it as just a magic constant. The other macro being called is __isctype, which is similar to the __isctype_f we saw a moment ago. Let's take another look at the #ifndef __cplusplus branch:

#ifndef __cplusplus
# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
#elif defined __USE_EXTERN_INLINES
// ...
#endif

Uh ...

Well, at least we found a pointer dereference that might explain the segfault. What is __ctype_b_loc?

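/* These are defined in ctype-info.c.
   The declarations here must match those in localeinfo.h.

   In the thread-specific locale model (see `uselocale' in <locale.h>)
   we cannot use global variables for these as was done in the past.
   Instead, the following accessor functions return the address of
   each variable, which is local to the current thread if multithreaded.

   These point into arrays of 384, so they can be indexed by any `unsigned
   char' value [0,255]; by EOF (-1); or by any `signed char' value
   [-128,-1).  ISO C requires that the ctype functions work for `unsigned
   char' values and for EOF; we also support negative `signed char' values
   for broken old programs.  The case conversion arrays are of `int's
   rather than `unsigned char's because tolower (EOF) must be EOF, which
   doesn't fit into an `unsigned char'.  But today more important is that
   the arrays are also used for multi-byte character sets.  */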
extern const unsigned short int **__ctype_b_loc (void)
     __THROW __attribute__ ((__const__));
extern const __int32_t **__ctype_tolower_loc (void)
     __THROW __attribute__ ((__const__));
extern const __int32_t **__ctype_toupper_loc (void)
     __THROW __attribute__ ((__const__));
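To make the layout in that comment concrete, here is a minimal model of it. This is not glibc's code: model_table, model_b and checked_lookup are hypothetical names, and the only facts taken from the comment are the 384 entries and the valid index range of -128 to 255.

```c
/* Hypothetical model: 384 entries of 2-byte flags, all zero here. */
static unsigned short model_table[384];

/* The pointer handed to callers is biased by 128 entries, so index 0
   means the character '\0' and index -128 is the first entry. */
static const unsigned short *model_b = model_table + 128;

/* Unlike the macro quoted earlier, this lookup refuses indexes that
   fall outside the table. Returns 1 and stores the entry on success,
   0 if c is out of range. */
static int checked_lookup(long long c, unsigned short *out)
{
	if (c < -128 || c > 255)
		return 0;
	*out = model_b[c];
	return 1;
}
```

The __isctype macro has no such range check, so a uint32_t codepoint such as 0x1F600 indexes far past the last entry, into whatever memory happens to be there, if any.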

How cool of you, glibc! I just love dealing with locales. Anyway, gdb is attached to my crashed application, and with everything I have learned in mind, I write this monstrosity:

(gdb) print ((unsigned int **(*)(void))__ctype_b_loc)()[next]
Cannot access memory at address 0x11dfa68

Segfault found. There was a line about this in the comment: "ISO C requires that the ctype functions work for `unsigned char' values and for EOF". If we look this up in the spec, we see:

In all cases [of the functions declared in ctype.h] the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF.

Now it is obvious how to fix the problem. My bad. It turns out that I cannot hand isalnum an arbitrary UCS-32 codepoint to check whether it falls in the ranges 0x30-0x39, 0x41-0x5A, and 0x61-0x7A.
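For illustration, here are two ways such a fix could look. This is a sketch under my own assumptions, not the actual scdoc patch: both function names are hypothetical, and the first one simply spells out the three ranges mentioned above.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical helper: checks the three ASCII ranges directly, so it is
   defined for every uint32_t value and never touches locale tables. */
static int codepoint_is_ascii_alnum(uint32_t cp)
{
	return (cp >= 0x30 && cp <= 0x39)
		|| (cp >= 0x41 && cp <= 0x5A)
		|| (cp >= 0x61 && cp <= 0x7A);
}

/* Hypothetical alternative: keep calling isalnum, but only with values
   inside the domain the standard allows. Anything above 0x7F is
   treated as "not alphanumeric" here, which is an assumption about
   what the caller wants. */
static int guarded_isalnum(uint32_t cp)
{
	return cp <= 0x7F && isalnum((int)cp);
}
```

The first variant does not depend on the current locale at all; the second still goes through the libc table, just with an argument that is guaranteed to be in range.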

But here I will take the liberty of suggesting: maybe isalnum should never segfault, regardless of what it is given? Maybe even if the specification allows it, that doesn't mean it should be done this way? Maybe, just as a crazy idea, the behavior of this function should not involve five macros, a check for whether you are using a C++ compiler, a dependence on your architecture's byte order, a lookup table, thread-local locale data, and two pointer dereferences?

Let's take another look at the musl version as a quick reminder:

int isalnum(int c)
{
	return isalpha(c) || isdigit(c);
}

int isalpha(int c)
{
	return ((unsigned)c|32)-'a' < 26;
}

int isdigit(int c)
{
	return (unsigned)c-'0' < 10;
}

That's how it goes.

Translator's Note: Thanks to MaxGraey for linking to the original.
