A great feature was recently added to the mainline of the Clang compiler . Using the [[clang::musttail]]

or attributes, __attribute__((musttail))

you can now get guaranteed tail calls in C, C ++, and Objective-C.

int g(int);
int f(int x) {
    __attribute__((musttail)) return g(x);
}

( Online compiler )

Usually these calls are associated with functional programming, but I'm only interested in them from a performance point of view. In some cases, with their help, you can get better code from the compiler - at least with the available compilation technologies - without resorting to an assembler.

The use of this approach in parsing the serialization protocol gave amazing results: we were able to parse at speeds in excess of 2 Gb / s. , more than twice as fast as the previous best solution. This speedup will be useful in many situations, so it would be incorrect to limit itself to just the statement "tail calls == double speedup". However, these challenges are the key element that made these speed gains possible.

I will tell you what the advantages of tail calls are, how we parse the protocol using them, and how we can extend this technique to interpreters. I think that thanks to it, interpreters of all major languages written in C (Python, Ruby, PHP, Lua, etc.) can get significant performance gains. The main drawback is related to portability: today it musttail

is a non-standard compiler extension. While I hope it will catch on, it will take some time before the extension spreads wide enough for the C compiler on your system to support it. When building, you can sacrifice efficiency in exchange for portability if it turns out to be musttail

unavailable.

Tail Call Basics

A tail call is a call to any function in the tail position, the last action before the function returns a result. When optimizing tail calls, the compiler compiles the instruction for the tail call jmp

, not call

. This does not perform overhead actions that would normally allow the callee to g()

return to the caller f()

, such as creating a new stack frame or passing a return address. Instead, it f()

refers directly to g()

it as if it were part of itself, and g()

returns the result directly to the caller f()

. This optimization is safe because the stack frame f()

is no longer needed after the start of the tail call, because it became impossible to access any local variable f()

.

Even if it looks trivial, this optimization has two important features that open up new possibilities for writing algorithms. First, by executing n consecutive tail calls, the memory stack is reduced from O (n) to O (1). This is important because the stack is limited and overflow can crash the program. So, without this optimization, some algorithms are dangerous. Secondly, it jmp

eliminates the overhead of call

and as a result the function call becomes as efficient as any other branch. Both of these features allow tail calls to be used as an efficient alternative to conventional iterative control structures like for

and while

.

This idea is not new at all and dates back to 1977, when Guy Steele wrote a whole paper in which he argued that procedure calls are cleaner than architectures GOTO

, while optimizing tail calls does not lose speed. It was one of the "Lambda Works" written between 1975 and 1980, which formulated many of the ideas behind Lisp and Scheme.

Tail call optimization is nothing new, even for Clang. He could optimize them before, like GCC and many other compilers. In fact, the attribute musttail

in this example does not change the output of the compiler at all: Clang has already optimized tail calls for -O2

.

New here is a guarantee . While compilers are often successful in optimizing tail calls, this is the “best possible” thing and you cannot rely on it. In particular, the optimization will surely not work in unoptimized builds: the online compiler . In this example, the tail call is compiled into call

, so we are back on a stack of O (n) size. This is why we need musttail

: Until we have a guarantee from the compiler that our tail calls will always be optimized in all assembly modes, it will be unsafe to write algorithms with such calls for iteration. And sticking to code that only works with optimizations enabled is a pretty tough constraint.

The problem with interpreter loops

Compilers are incredible technology, but they are not perfect. Mike Pall, the author of LuaJIT, decided to write the LuaJIT 2.x interpreter in assembly language, not C, and he called this the main factor that made the interpreter so fast . Paul later explained in more detail why C compilers have a hard time finding the main interpreter loops . Two main points:

These observations reflect well our experience in optimizing the parsing of the serialization protocol. And tail calls will help us to solve both problems.

You might find it odd to compare interpreter loops to serialization protocol parsers. However, their unexpected similarity is determined by the nature of the wire-format of the protocol: it is a set of key-value pairs in which the key contains the field number and its type. The key works like an opcode in the interpreter: it tells us what operation needs to be performed to parse this field. The field numbers in the protocol can go in any order, so you need to be ready at any time to jump to any part of the code.

It would be logical to write such a parser using a loop while

with a nested expression switch

... This has been the best approach for parsing a serialization protocol over the lifetime of the protocol. For example, here is the parsing code from the current C ++ version . If we represent the control flow graphically, we get the following scheme:

But the diagram is not complete, because problems can arise at almost every stage. The field type might be wrong, or the data might be corrupted, or we might just jump to the end of the current buffer. The complete diagram looks like this:

We need to stay on the fast paths (blue) as long as possible, but when faced with a difficult situation, we will have to handle it using the fallback code. Such paths are usually longer and more complex than fast paths, they involve more data and often use awkward calls to other functions to handle even more complex cases.

In theory, if you combine this schema with the profile, the compiler will get all the information it needs to generate the most optimal code. But in practice, with this size of function and the number of connections, we often have to struggle with the compiler. It throws out an important variable that we want to store in the register. It applies the manipulation of the stack frame that we want to wrap around the call to the fallback function. It concatenates identical code paths that we want to keep separate for branch prediction. It all looks like trying to play the piano with mittens.

Improving interpreter loops with tail calls

The reasoning described above is largely a rephrasing of Paul's observations about the main cycles of the interpreter . But instead of throwing ourselves into assembler, we found that tailored architecture can give us the control we need to get near-optimal code out of C. I worked on this with my colleague Gerben Stavenga, who came up with most of the architecture. Our approach is similar to the wasm3 WebAssembly interpreter , which calls this pattern "metamachine" .

We put the code for our 2- gigabit parser in upb, a small protobuf library in C. It is fully working and passes all tests for compliance with the serialization protocol, but has not yet been deployed anywhere, and the architecture has not been implemented in the C ++ version of the protocol. But when the extension came to Clang musttail

(and upb was updated to use it ), one of the main barriers to the full implementation of our fast parser dropped.

We have moved away from one big parsing function and apply its own small function for each operation. Each tail function calls the next operation in the sequence. For example, here is a function for parsing a single field of a fixed size (the code is simplified compared to upb, I have removed many small details of the architecture).

The code

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef void *upb_msg;
struct upb_decstate;
typedef struct upb_decstate upb_decstate;

// The standard set of arguments passed to each parsing function.
// Thanks to x86-64 calling conventions, these will be passed in registers.
#define UPB_PARSE_PARAMS                                          \
  upb_decstate *d, const char *ptr, upb_msg *msg, intptr_t table, \
      uint64_t hasbits, uint64_t data
#define UPB_PARSE_ARGS d, ptr, msg, table, hasbits, data

#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

const char *fallback(UPB_PARSE_PARAMS);
const char *dispatch(UPB_PARSE_PARAMS);

// Code to parse a 4-byte fixed field that uses a 1-byte tag (field 1-15).
const char *upb_pf32_1bt(UPB_PARSE_PARAMS) {
  // Decode "data", which contains information about this field.
  uint8_t hasbit_index = data >> 24;
  size_t ofs = data >> 48;

  if (UNLIKELY(data & 0xff)) {
    // Wire type mismatch (the dispatch function xor's the expected wire type
    // with the actual wire type, so data & 0xff == 0 indicates a match).
    MUSTTAIL return fallback(UPB_PARSE_ARGS);
  }

  ptr += 1;  // Advance past tag.

  // Store data to message.
  hasbits |= 1ull << hasbit_index;
  memcpy((char*)msg + ofs, ptr, 4);

  ptr += 4;  // Advance past data.

  // Call dispatch function, which will read the next tag and branch to the
  // correct field parser function.
  MUSTTAIL return dispatch(UPB_PARSE_ARGS);
}

For such a small and simple function, Clang generates code that is almost impossible to beat:

upb_pf32_1bt:                           # @upb_pf32_1bt
        mov     rax, r9
        shr     rax, 24
        bts     r8, rax
        test    r9b, r9b
        jne     .LBB0_1
        mov     r10, r9
        shr     r10, 48
        mov     eax, dword ptr [rsi + 1]
        mov     dword ptr [rdx + r10], eax
        add     rsi, 5
        jmp     dispatch                        # TAILCALL
.LBB0_1:
        jmp     fallback                        # TAILCALL

Note that there is no prologue or epilogue, no register preemption, no stack usage at all. The only exits are jmp

from two tail calls, but no code is needed to pass parameters, because the arguments are already in the correct registers. Perhaps the only possible improvement we see here is a conditional jump for a tail call jne fallback

instead of jne

a subsequent call jmp

.

If you saw a disassembly of this code without symbolic information, you would not realize that this was the entire function. It could equally well be the base unit of a larger function. And that's exactly what we do: we take the interpreter loop, which is a large and complex function, and program it block by block, passing control flow between them through tail calls. We have complete control over the distribution of registers on the boundary of each block (at least six registers), and since the function is simple enough and does not preempt registers, we have achieved our goal of storing the most important state along all the fast paths.

We can independently optimize each sequence of instructions. And the compiler will handle all sequences independently too, because they are located in separate functions (if necessary, you can prevent inlining with noinline

). This is how we solve the problem described above, in which the code from the fallback paths degrades the quality of the code of the fast paths. By placing the slow paths in completely separate functions, the speed stability of the fast paths can be guaranteed. The beautiful assembler remains unchanged, it is not affected by any changes in other parts of the parser.

If we apply this pattern to Paul's example from LuaJIT , then we can roughly correlate his handwritten assembler with small degradations in code quality :

#define PARAMS unsigned RA, void *table, unsigned inst, \
               int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
typedef void (*op_func)(PARAMS);
void fallback(PARAMS);

#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

void ADDVN(PARAMS) {
    op_func *op_table = table;
    unsigned RC = inst & 0xff;
    unsigned RB = (inst >> 8) & 0xff;
    unsigned type;
    memcpy(&type, (char*)&regs[RB] + 4, 4);
    if (UNLIKELY(type > -13)) {
        return fallback(ARGS);
    }
    regs[RA] += consts[RC];
    inst = *op_p++;
    unsigned op = inst & 0xff;
    RA = (inst >> 8) & 0xff;
    inst >>= 16;
    MUSTTAIL return op_table[op](ARGS);
}

The resulting assembler:

ADDVN:                                  # @ADDVN
        movzx   eax, dh
        cmp     dword ptr [r9 + 8*rax + 4], -12
        jae     .LBB0_1
        movzx   eax, dl
        movsd   xmm0, qword ptr [r8 + 8*rax]    # xmm0 = mem[0],zero
        mov     eax, edi
        addsd   xmm0, qword ptr [r9 + 8*rax]
        movsd   qword ptr [r9 + 8*rax], xmm0
        mov     edx, dword ptr [rcx]
        add     rcx, 4
        movzx   eax, dl
        movzx   edi, dh
        shr     edx, 16
        mov     rax, qword ptr [rsi + 8*rax]
        jmp     rax                             # TAILCALL
.LBB0_1:
        jmp     fallback

I see only one improvement opportunity here besides the above jne fallback

: for some reason the compiler doesn't want to generate jmp qword ptr [rsi + 8*rax]

, instead it loads into rax

and then executes jmp rax

. These are minor coding issues that I hope Clang will fix soon without too much difficulty.

Limitations

One of the main drawbacks of this approach is that all of these beautiful assembly language sequences are catastrophically pessimized in the absence of tail calls. Any non-tailored call will create a stack frame and push a lot of data onto the stack.

#define PARAMS unsigned RA, void *table, unsigned inst, \
               int *op_p, double *consts, double *regs
#define ARGS RA, table, inst, op_p, consts, regs
typedef void (*op_func)(PARAMS);
void fallback(PARAMS);

#define UNLIKELY(x) __builtin_expect(x, 0)
#define MUSTTAIL __attribute__((musttail))

void ADDVN(PARAMS) {
    op_func *op_table = table;
    unsigned RC = inst & 0xff;
    unsigned RB = (inst >> 8) & 0xff;
    unsigned type;
    memcpy(&type, (char*)&regs[RB] + 4, 4);
    if (UNLIKELY(type > -13)) {
        // When we leave off "return", things get real bad.
        fallback(ARGS);
    }
    regs[RA] += consts[RC];
    inst = *op_p++;
    unsigned op = inst & 0xff;
    RA = (inst >> 8) & 0xff;
    inst >>= 16;
    MUSTTAIL return op_table[op](ARGS);
}

Suddenly we get this

ADDVN:                                  # @ADDVN
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        push    rax
        mov     r15, r9
        mov     r14, r8
        mov     rbx, rcx
        mov     r12, rsi
        mov     ebp, edi
        movzx   eax, dh
        cmp     dword ptr [r9 + 8*rax + 4], -12
        jae     .LBB0_1
.LBB0_2:
        movzx   eax, dl
        movsd   xmm0, qword ptr [r14 + 8*rax]   # xmm0 = mem[0],zero
        mov     eax, ebp
        addsd   xmm0, qword ptr [r15 + 8*rax]
        movsd   qword ptr [r15 + 8*rax], xmm0
        mov     edx, dword ptr [rbx]
        add     rbx, 4
        movzx   eax, dl
        movzx   edi, dh
        shr     edx, 16
        mov     rax, qword ptr [r12 + 8*rax]
        mov     rsi, r12
        mov     rcx, rbx
        mov     r8, r14
        mov     r9, r15
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        jmp     rax                             # TAILCALL
.LBB0_1:
        mov     edi, ebp
        mov     rsi, r12
        mov     r13d, edx
        mov     rcx, rbx
        mov     r8, r14
        mov     r9, r15
        call    fallback
        mov     edx, r13d
        jmp     .LBB0_2

To avoid this, we tried to call other functions only through inlining or tail calls. This can be tedious if the operation has many places in which an unusual situation may occur that is not an error. For example, when we parse the serialization protocol, integer variables are often one byte long, but longer ones are not an error. Inlining the handling of such situations can degrade the quality of the fast path if the fallback code is too complex. But the tail call of the fallback function does not make it easy to return to the operation when handling an abnormality, so the fallback function must be able to complete the operation. This leads to duplication and complication of the code.

Ideally this problem can be solved by adding __attribute __ ((preserve_most))in a fallback function followed by a normal call, not a tail call. The attribute preserve_most

makes the callee responsible for preserving almost all registers. This allows you to delegate the task of preempting registers to fallback functions, if necessary. We experimented with this attribute, but ran into mysterious issues that we couldn't solve. Perhaps we made a mistake somewhere, we will return to this in the future.

The main limitation is that musttail

it is not portable. I really hope that the attribute will take root, it will be implemented in GCC, Visual C ++ and other compilers, and one day it will even be standardized. But this will not happen soon, but what should we do now?

When the expansion musttail

is not available, you need to execute at least one correct one return

without a tail call for each theoretical iteration of the loop . We have not yet implemented such a fallback in the upb library, but I think that it will turn into a macro that, depending on availability, musttail

will either make tail calls or simply return.

Parsim Protobuff at> 2 Gb / s: How I Learned to Love Tail Recursion in C

Tail Call Basics

The problem with interpreter loops

Improving interpreter loops with tail calls

Limitations

More articles: