Implementing Simple Cooperative Streams in C

Hello, Habr!



Thank you for your attention to our previous translated post on REST . Today we propose to look at the topic of system design from a slightly different angle and publish a translation of an article by Stephen Brennan, a Linux luminary, who talks about his own implementation of multitasking in userspace and how it can be of benefit.



Multitasking, like many other capabilities provided by the operating system, we not only take for granted, but also perceived as something commonplace. With our powerful computers and smartphones, the idea of ​​a computer not being able to juggle hundreds of processes seems strange. I think this is one of the possibilities that made computers so useful, but it is because of this that they are so complex, sometimes it seems like magic.



It is difficult to dabble with code that implements multitasking, and it is not always clear in which cases it is better to implement it yourself, so that you do not have to write an entire operating system. I am quite sure that it is impossible to fully understand the phenomenon until you realize it yourself. Therefore, I decided to write an article in which I will tell you how you can play with a simple threading implementation. In this article, we will implement simple streams in a regular

C program (not an operating system).



Lyrical digression about setjmp and longjmp



The scheduler will be highly dependent on the functions setjmp()and longjmp(). They seem a little magical, so I'll first describe what they do, and then I'll take a little time and tell you exactly how.



The function setjmp()allows you to record information about at what stage of execution the program was, so that you can then again jump to this point. A type variable is passed to it jmp_buf, in which we will store this information. The first time it returns, the function setjmp()returns 0.



Later, you can use the function longjmp(jmp_buf, value)to instantly resume execution from the point where it was called setjmp(). In your program, this situation will look like it setjmp()came back a second time. The argument will return this timevaluethat you passed longjmp()- it is more convenient to distinguish the second return from the first. Here's an example to illustrate this point:



#include <stdio.h>
#include <setjmp.h>

jmp_buf saved_location;
int main(int argc, char **argv)
{
    if (setjmp(saved_location) == 0) {
        printf("We have successfully set up our jump buffer!\n");
    } else {
        printf("We jumped!\n");
        return 0;
    }

    printf("Getting ready to jump!\n");
    longjmp(saved_location, 1);
    printf("This will never execute...\n");
    return 0;
}


If we compile and run this program, we get the following output:



We have successfully set up our jump buffer!
Getting ready to jump!
We jumped!


Wow! It's just like a goto statement, but in this case it can even be used to jump outside of a function. It is also more difficult to read than gotoit is because it looks like a normal function call. If your code is used in abundance setjmp()and longjmp(), then it will be very difficult to read it for anyone (including you).



As is the case goto, it is generally recommended to avoid setjmp()and longjmp(). But likegoto, the above functions can be useful if used sparingly and consistently. The scheduler needs to be able to switch contexts, so we'll use these features responsibly. Most importantly, we will be using these functions from our API so that our planner users do not have to deal with this kind of complexity.



Setjmp and longjmp won't save your stack

True, functions setjmp()andlongjmp()are not intended to support any hopping. They were designed for a very specific practical case. Imagine that you are performing some complex operation, such as making an HTTP request. In this case, a complex set of function calls will be involved, and if any of them fail, you will have to return a special error code from each of them. Such code will look like in the following listing, wherever you call the function (perhaps dozens of times):



int rv = do_function_call();
if (rv != SUCCESS) {
    return rv;
}


The meaning setjmp()and longjmp()is what setjmp()helps to stake out a place before embarking on the task really difficult. Then you can centralize all of your error handling in one place:



int rv;
jmp_buf buf;
if ((rv = setjmp(buf)) != 0) {
    /*    */
    return;
}
do_complicated_task(buf, args...);


If any function involved in fails do_complicated_task(), it just happens longjmp(buf, error_code). This means that each function in the composition do_complicated_task()can assume that any function call is successful, which means that you can not put this code to handle errors in each function call (in practice, this is almost never done, but this is a topic for a separate article) ...



The basic idea here is that longjmp()it only allows you to jump out of deeply nested functions. You cannot jump into that deeply nested function that you jumped out of earlier. This is what the stack looks like when you jump out of the function. Asterisk (*) means the stack pointer at which it is stored setjmp().



  | Stack before longjmp    | Stack after longjmp
      +-------------------------+----------------------------
stack | main()              (*) | main()
grows | do_http_request()       |
down  | send_a_header()         |
 |    | write_bytes()           |
 v    | write()  - fails!       |


As you can see, you can only move backwards on the stack, so there is no danger of data corruption. On the other hand, imagine what it would be like if you wanted to jump between tasks. If you call setjmp()and then return, do some other things and try to resume the work that you have already been doing before, then a problem will arise:



      | Stack at setjmp() | Stack later      | Stack after longjmp()
      +-------------------+------------------+----------------------
stack | main()            | main()           | main()
grows | do_task_one()     | do_task_two()    | do_stack_two()
down  | subtask()         | subtask()        | subtask()
 |    | foo()             |                  | ???
 v    | bar()         (*) |              (*) | ???               (*)


The stack pointer saved setjmp()will point to a stack frame that no longer exists and may have been overwritten by other data at some point in time. If we try to longjmp()jump back to the function from which we returned with help , then very strange things will begin that may well lead to the collapse of the entire program.



Moral: If you are going to use setjmp()and longjmp()to jump between complex tasks like these, you must ensure that each task has its own separate stack. In this case, the problem is completely eliminated, since when longjmp()the stack pointer is reset, the program itself will replace the stack with the desired one, and no stack erasure will occur.



Let's write a scheduler API



The digression is a bit lengthy, but armed with what we've learned, we can now implement user-space flows. To begin with, I note that it is very useful to design the API yourself for initializing, creating and starting threads. If we do this in advance, we will have a much better understanding of what exactly we are trying to build!



void scheduler_init(void);
void scheduler_create_task(void (*func)(void*), void *arg);
void scheduler_run(void);


These functions will be used to initialize the scheduler, add tasks, and then finally start the tasks in the scheduler. Once launched, it scheduler_run()will run until all tasks are completed. Currently running tasks will have the following APIs:



void scheduler_exit_current_task(void);
void scheduler_relinquish(void);


The first function is responsible for exiting the task. Exit from the task is also possible when returning from its function, so this construction exists only for convenience. The second function describes how our threads will tell the scheduler that another task needs to run for a while. When a task calls scheduler_relinquish(), it can be temporarily suspended while other tasks are running; but eventually the function will return and the first task can continue.



For the sake of a concrete example, let's consider a hypothetical use case for our API, with which we will test the scheduler:



#include <stdlib.h>
#include <stdio.h>

#include "scheduler.h"

struct tester_args {
    char *name;
    int iters;
};

void tester(void *arg)
{
    int i;
    struct tester_args *ta = (struct tester_args *)arg;
    for (i = 0; i < ta->iters; i++) {
        printf("task %s: %d\n", ta->name, i);
        scheduler_relinquish();
    }
    free(ta);
}

void create_test_task(char *name, int iters)
{
    struct tester_args *ta = malloc(sizeof(*ta));
    ta->name = name;
    ta->iters = iters;
    scheduler_create_task(tester, ta);
}

int main(int argc, char **argv)
{
    scheduler_init();
    create_test_task("first", 5);
    create_test_task("second", 2);
    scheduler_run();
    printf("Finished running all tasks!\n");
    return EXIT_SUCCESS;
}


In this example, we create two tasks that perform the same function, but take different arguments; thus, their implementation can be tracked separately. Each task performs a set number of iterations. At each iteration, it prints a message and then lets another task run. We expect to see something like the output of the program:



task first: 0
task second: 0
task first: 1
task second: 1
task first: 2
task first: 3
task first: 4
Finished running all tasks!


Let's implement the scheduler API



To implement the API, we need some kind of internal representation of the task. So, let's get down to business; Let's collect the fields we need:



struct task {
    enum {
        ST_CREATED,
        ST_RUNNING,
        ST_WAITING,
    } status;

    int id;

    jmp_buf buf;

    void (*func)(void*);
    void *arg;

    struct sc_list_head task_list;

    void *stack_bottom;
    void *stack_top;
    int stack_size;
};


Let's discuss each of these fields separately. All created tasks must be in the "created" state prior to execution. When a task starts executing, it goes into the “running” state, and if the task ever needs to wait for some asynchronous operation, it can be put into the “waiting” state. A field idis simply a unique identifier for a task. This bufcontains information about when the longjmp()task will be executed to resume. The funcand fields argare passed to scheduler_create_task()and are required to start the task. The field is task_listrequired to implement a doubly linked list of all tasks. The fields stack_bottom, stack_topand stack_sizeall belong to a separate stack dedicated specifically to this task. Bottom is the address returned bymalloc()but "top" is a pointer to the address directly above the given region in memory. Since the x86 stack grows downward, you need to set the stack pointer to a value stack_top, not stack_bottom.



In such conditions, you can implement the function scheduler_create_task():



void scheduler_create_task(void (*func)(void *), void *arg)
{
    static int id = 1;
    struct task *task = malloc(sizeof(*task));
    task->status = ST_CREATED;
    task->func = func;
    task->arg = arg;
    task->id = id++;
    task->stack_size = 16 * 1024;
    task->stack_bottom = malloc(task->stack_size);
    task->stack_top = task->stack_bottom + task->stack_size;
    sc_list_insert_end(&priv.task_list, &task->task_list);
}


By using static int, we guarantee that every time the function is called, the id field is incremented, and there is a new number. Everything else should be clear without explanation, except for the function sc_list_insert_end()that simply adds struct taskto the global list. The global list is stored inside a second structure that contains all of the scheduler's private data. Below is the structure itself, as well as its initialization function:



struct scheduler_private {
    jmp_buf buf;
    struct task *current;
    struct sc_list_head task_list;
} priv;

void scheduler_init(void)
{
    priv.current = NULL;
    sc_list_init(&priv.task_list);
}


The field is task_listused to refer to a task list (not surprisingly). The field currentstores the task that is currently being performed (or null, if there are no such tasks now). Most importantly, the field bufwill be used to jump into the code scheduler_run():



enum {
    INIT=0,
    SCHEDULE,
    EXIT_TASK,
};

void scheduler_run(void)
{
    /*     ! */
    switch (setjmp(priv.buf)) {
    case EXIT_TASK:
        scheduler_free_current_task();
    case INIT:
    case SCHEDULE:
        schedule();
        /*       ,    */
        return;
    default:
        fprintf(stderr, "Uh oh, scheduler error\n");
        return;
    }
}


As soon as the function is called scheduler_run(), we set the buffer setjmp()so that we can always return to that function. The first time, 0 (INIT) is returned, and we call immediately schedule(). Subsequently, we can pass SCHEDULE or EXIT_TASK constants in longjmp(), which will provoke different behaviors. For now, let's ignore the EXIT_TASK case and jump straight to the implementation schedule():



static void schedule(void)
{
    struct task *next = scheduler_choose_task();

    if (!next) {
        return;
    }

    priv.current = next;
    if (next->status == ST_CREATED) {
        /*
         *     .   
         * ,        .
         */
        register void *top = next->stack_top;
        asm volatile(
            "mov %[rs], %%rsp \n"
            : [ rs ] "+r" (top) ::
        );

        /*
         *   
         */
        next->status = ST_RUNNING;
        next->func(next->arg);

        /*
         *     ,    .   – ,   
         *   
         */
        scheduler_exit_current_task();
    } else {
        longjmp(next->buf, 1);
    }
    /*   */
}


First, we call the inner function to select the next task to run. This scheduler will work like a regular carousel, so it will simply select a new task from the list. If this function returns NULL, then we have no more tasks to perform, and we return. Otherwise, we must either start the execution of the task (if it is in the ST_CREATED state), or resume its execution.



To run the created task, we use the assembly instruction for x86_64 to assign the field to a stack_topregister rsp(stack pointer). Then we change the state of the task, run the function and exit. Note: setjmp()both longjmp()store and rearrange stack pointers, so here we only have to use assembler to change the stack pointer.



If the task has already been started, then the field bufmust contain the context we need longjmp()to resume the task, so that's what we do.

Next, let's look at a helper function that selects the next task to run. This is the heart of the scheduler, and as I said earlier, this scheduler works like a carousel:



static struct task *scheduler_choose_task(void)
{
    struct task *task;

    sc_list_for_each_entry(task, &priv.task_list, task_list, struct task)
    {
        if (task->status == ST_RUNNING || task->status == ST_CREATED) {
            sc_list_remove(&task->task_list);
            sc_list_insert_end(&priv.task_list, &task->task_list);
            return task;
        }
    }

    return NULL;
}


If you are not familiar with my linked list implementation (taken from the Linux kernel), no big deal. A function sc_list_for_each_entry()is a macro that allows you to iterate over all the tasks in the task list. The first selectable task (not in a pending state) that we find is removed from its current position and moved to the end of the task list. This ensures that the next time we run the scheduler, we will receive a different task (if there is one). We return the first task available for selection, or NULL if there are no tasks at all.



Finally, let's move on to the implementation scheduler_relinquish()to see how the task can self-eliminate:



void scheduler_relinquish(void)
{
    if (setjmp(priv.current->buf)) {
        return;
    } else {
        longjmp(priv.buf, SCHEDULE);
    }
}


This is another use case for the function setjmp()in our scheduler. Basically, this option may seem a little confusing. When a task calls this function, we use it setjmp()to save the current context (including the actual stack pointer). Then we use longjmp()it to enter the scheduler (again in scheduler_run()) and pass the SCHEDULE function; thus we ask you to assign a new task.



When the task is resumed, the function setjmp()returns with a nonzero value, and we exit any task that we might have been doing before!

Finally, here's what happens when a task exits (this is done either explicitly, by calling the exit function, or by returning from the corresponding task function):



void scheduler_exit_current_task(void)
{
    struct task *task = priv.current;
    sc_list_remove(&task->task_list);
    longjmp(priv.buf, EXIT_TASK);
    /*   */
}

static void scheduler_free_current_task(void)
{
    struct task *task = priv.current;
    priv.current = NULL;
    free(task->stack_bottom);
    free(task);
}


This is a two part process. The first function is returned directly by the task itself. We are deleting the entry corresponding to this one from the list of tasks, since it will no longer be assigned. Then, using, longjmp()we return to the function scheduler_run(). This time we use EXIT_TASK. This is how we tell the scheduler what to call before assigning a new task scheduler_free_current_task(). If you go back to the description scheduler_run(), you will see that this is exactly what it does scheduler_run().



We did this in two steps, since whenscheduler_exit_current_task(), it actively uses the stack contained in the task structure. If you free the stack and continue using it, then there is a chance that the function can still access the same stack memory that we just freed! To ensure that this does not happen, we will have to longjmp()revert to the scheduler using a separate stack with help . Then we can safely release the data related to the task.



So we have analyzed the entire implementation of the scheduler entirely. If we tried to compile it by adding my linked list implementation and the main program above to it, we would get a fully working scheduler! In order not to bother you with copy-paste, I direct you to the repository on github , which contains all the code for this article.



What is the use of the described approach?



If you have read so far, I think there is no need to convince you that the example is interesting. But at first glance, it may not seem very useful. After all, you can use "real" threads in C, which can run in parallel and don't have to wait for each other until one of them calls scheduler_relinquish().



However, I see this as a starting point for a whole series of exciting implementations of useful features. We can talk about heavy I / O tasks, about implementing a single-threaded asynchronous application, in the same way as the new async utilities in Python work. Generators and coroutines can also be implemented using such a system. Finally, with some hard work, this system can also be befriended with the "real" threads of the operating system to provide additional concurrency where needed. An interesting project is hidden behind each of these ideas, and I recommend that you try to complete one of them yourself, dear reader.



It's safe?



I think it's more likely no than yes! It is not safe to consider inline assembly code that affects the stack pointer. Don't risk using these things in production, but be sure to tinker with them and research!



A safer implementation of such a system can be built using a "non-context" API (see man getcontext), which allows switching between these types of user-space "streams" without embedding assembly code. Unfortunately, such an API is not covered by the standards (it has been removed from the POSIX specification). But it can still be used as part of glibc.



How can such a mechanism be made displacing?



This scheduler, as presented here, works only if the threads explicitly transfer control back to the scheduler. This is not good for a general program, for example, for an operating system, since a poorly made thread can block the execution of all others (although this did not prevent the use of cooperative multitasking in MS-DOS!) I do not think that this makes cooperative multitasking clearly bad; it all depends on the application.



When using a non-standard “out-of-context” API, POSIX signals will store the context of the code that was previously executed. By setting the timer to beep, the user-space scheduler can indeed provide a working version of preemptive multitasking! This is another cool project that deserves a separate article.



All Articles