Epoll implementation, part 4

This is the last in a series of four articles (Part 1, Part 2, Part 3) on the implementation of epoll. Here we will talk about how epoll transfers events from kernel space to user space, and how the edge-triggered and level-triggered modes are implemented. This article was written later than the others. When I started working on the first one, the most recent stable Linux kernel was 3.16.1. At the time of this writing, it is already version 4.1, and this article is based on the code of that kernel version. The code, however, has not changed very much, so readers of the previous articles need not worry that the implementation has changed significantly.










Interacting with user space



In the previous articles, I spent quite a lot of time explaining how the event handling system in the kernel works. But, as you know, the kernel needs to pass information about events to a program running in user space so that the program can act on it. This is mainly done with the epoll_wait(2) system call.
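To make the user-space side concrete, here is a minimal sketch of a program-side call to epoll_wait(2). The function name wait_for_pipe_event() and the choice of a pipe as the watched descriptor are assumptions made for illustration only. They do not come from the kernel source.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: watches the read end of a pipe, makes it readable,
 * and asks the kernel for pending events. Returns the epoll_wait() result,
 * or -1 on a setup error. On success *out_events holds the reported mask. */
int wait_for_pipe_event(uint32_t *out_events)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    int n = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        write(fds[1], "x", 1) == 1) {
        struct epoll_event out[4];
        /* The buffer, maxevents = 4 and timeout = 1000 ms are the values
         * that the system call hands on to the kernel side. */
        n = epoll_wait(epfd, out, 4, 1000);
        assert(n <= 4);              /* never more than maxevents */
        if (n > 0)
            *out_events = out[0].events;
    }

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Called once, the helper should return 1 with EPOLLIN set in the mask, because exactly one watched descriptor has data waiting.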



The code for this function can be found on line 1961 of the file fs/eventpoll.c. The function itself is very simple. After some routine checks, it gets the pointer to the eventpoll structure from the file descriptor and calls the following function:



error = ep_poll(ep, events, maxevents, timeout);


The ep_poll() function



The function ep_poll() is declared on line 1585 of the same file. It starts by checking whether the user has requested a timeout. If so, the function initializes a wait queue entry and sets the timeout to the value specified by the user. If the user does not want to wait at all, that is, timeout = 0, the function goes straight to the block of code labelled check_events:, which is responsible for copying events.



If the user has specified a timeout and there are no events to report (which is determined by calling ep_events_available(ep)), the function ep_poll() adds itself to the wait queue ep->wq (remember what we talked about in the third article of this series). There we mentioned that ep_poll_callback(), as part of its work, wakes up any processes that are waiting on the ep->wq queue.



The function then goes to sleep by calling schedule_hrtimeout_range(). Here are the circumstances under which a "sleeping" process can "wake up":



  1. The timeout has expired.
  2. The process received a signal.
  3. A new event has arisen.
  4. Nothing happened, and the scheduler just decided to activate the process.


In scenarios 1, 2, and 3, the function sets the appropriate flags and exits the wait loop. In the last case, the function simply goes back to sleep.
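The first two wake-up scenarios can be observed from user space. The sketch below is an illustration under stated assumptions: the helper sleep_in_epoll() and its parameters are hypothetical, and the program only shows what the caller of epoll_wait(2) sees, not the kernel code itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/time.h>
#include <unistd.h>

/* The handler only needs to exist so that SIGALRM interrupts the sleep. */
static void on_alarm(int sig) { (void)sig; }

/* Hypothetical helper: sleeps in epoll_wait() on a pipe that never becomes
 * readable. If alarm_ms > 0, a SIGALRM is scheduled to arrive during the
 * sleep. Returns the epoll_wait() result and stores errno in *err. */
int sleep_in_epoll(int timeout_ms, int alarm_ms, int *err)
{
    int fds[2];
    *err = 0;
    if (pipe(fds) != 0) {
        *err = errno;
        return -1;
    }
    int epfd = epoll_create1(0);
    assert(epfd >= 0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[0] };
    int n = epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
    if (n == 0) {
        if (alarm_ms > 0) {
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_handler = on_alarm;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGALRM, &sa, NULL);

            struct itimerval it;
            memset(&it, 0, sizeof it);
            it.it_value.tv_sec = alarm_ms / 1000;
            it.it_value.tv_usec = (alarm_ms % 1000) * 1000;
            setitimer(ITIMER_REAL, &it, NULL);
        }
        struct epoll_event out[1];
        n = epoll_wait(epfd, out, 1, timeout_ms);
    }
    if (n < 0)
        *err = errno;

    close(epfd);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

With no alarm, sleep_in_epoll(0, 0, &err) returns 0 at once and sleep_in_epoll(50, 0, &err) returns 0 after the timeout expires. With an alarm that fires during the sleep, the call returns -1 and err is EINTR, which is the user-visible side of scenario 2.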



After this part of the work is done, ep_poll() continues with the code in the check_events: block.



This block first checks whether any events are available and then makes the following call, which is where the most interesting things happen.



ep_send_events(ep, events, maxevents)


The function ep_send_events() is declared on line 1546. Once called, it calls the function ep_scan_ready_list(), passing in a callback, ep_send_events_proc(). The function ep_scan_ready_list() loops through the list of ready file descriptors and calls ep_send_events_proc() for each ready event it finds. It will become clear below that the callback mechanism is needed for the sake of safety and code reuse.



The function ep_scan_ready_list() first moves the list of ready file descriptors from the eventpoll structure into a local variable. It then sets the ovflist field of the eventpoll structure to NULL (its default value is EP_UNACTIVE_PTR).



Why do the authors of epoll use ovflist? This is done to keep epoll highly efficient. You may notice that after the list of ready file descriptors has been taken from the eventpoll structure, ep_scan_ready_list() sets ovflist to NULL. As a result, ep_poll_callback() does not try to attach an event that is being passed to user space back to ep->rdllist, which could lead to big problems. Thanks to ovflist, ep_scan_ready_list() does not need to hold the lock ep->lock while copying events to user space. As a result, the overall performance of the solution is improved.
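The idea can be modelled in a few lines of ordinary C. The following is a deliberately simplified, single-threaded sketch: struct toy_ep, the toy_* functions, and the fixed-size arrays are all hypothetical stand-ins, and the scanning flag plays the role that the NULL versus EP_UNACTIVE_PTR value of ovflist plays in the kernel. Locking, duplicate checks, and leftover events are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 8

/* Hypothetical, simplified stand-in for struct eventpoll. */
struct toy_ep {
    int rdllist[MAX_ITEMS];   /* ready list */
    int rdl_len;
    int ovflist[MAX_ITEMS];   /* overflow list, used only during a scan */
    int ovf_len;
    int scanning;             /* models ovflist != EP_UNACTIVE_PTR */
};

/* Models ep_poll_callback(): a descriptor has just become ready. */
void toy_callback(struct toy_ep *ep, int fd)
{
    if (ep->scanning) {
        /* A scan is in progress: park the event on the side. */
        if (ep->ovf_len < MAX_ITEMS)
            ep->ovflist[ep->ovf_len++] = fd;
    } else if (ep->rdl_len < MAX_ITEMS) {
        ep->rdllist[ep->rdl_len++] = fd;
    }
}

/* Models the start of the scan: steal the ready list into txlist and
 * switch the callback over to the overflow list. Returns the item count. */
int toy_scan_begin(struct toy_ep *ep, int *txlist)
{
    int n = ep->rdl_len;
    for (int i = 0; i < n; i++)
        txlist[i] = ep->rdllist[i];
    ep->rdl_len = 0;
    ep->scanning = 1;
    return n;
}

/* Models the end of the scan: move parked events to the ready list and
 * switch the callback back to it. */
void toy_scan_end(struct toy_ep *ep)
{
    for (int i = 0; i < ep->ovf_len && ep->rdl_len < MAX_ITEMS; i++)
        ep->rdllist[ep->rdl_len++] = ep->ovflist[i];
    ep->ovf_len = 0;
    ep->scanning = 0;
    assert(ep->rdl_len <= MAX_ITEMS);
}
```

Between toy_scan_begin() and toy_scan_end(), the scanner owns txlist exclusively, so in this model it could copy events out without touching the shared ready list at all. Events that arrive in that window surface again once the scan ends.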



After that, ep_send_events_proc() walks the list of ready file descriptors it has been given and calls their poll() methods again in order to make sure that the event really happened. Why does epoll check events again here? This is done to make sure that the event (or events) registered by the user is still available. Consider a situation where a file descriptor was added to the list of ready file descriptors because of EPOLLOUT while the user program is writing to that descriptor. After the program finishes writing, the file descriptor may no longer be writable. Epoll needs to handle situations like this correctly. Otherwise, the user would receive EPOLLOUT at a moment when a write operation would block.
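The effect of this second check can be seen from user space with a pipe. In the sketch below, the helper name epollout_before_and_after_fill() is hypothetical, and the interpretation in the comments assumes the level-triggered behaviour described in this article: the descriptor is reported as writable once, the pipe is then filled, and the next call no longer reports EPOLLOUT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: stores the epoll_wait() results obtained before and
 * after the pipe is filled. Returns 0 on success, -1 on a setup error. */
int epollout_before_and_after_fill(int *before, int *after)
{
    int fds[2];
    int rc = -1;
    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLOUT, .data.fd = fds[1] };
        int flags = fcntl(fds[1], F_GETFL);
        if (flags >= 0 &&
            fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == 0 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[1], &ev) == 0) {
            struct epoll_event out[1];

            /* The pipe is empty, so the write end is writable. */
            *before = epoll_wait(epfd, out, 1, 0);

            /* Fill the pipe until the kernel refuses to take more data. */
            char chunk[4096];
            memset(chunk, 0, sizeof chunk);
            for (;;) {
                ssize_t w = write(fds[1], chunk, sizeof chunk);
                if (w < 0 && errno != EINTR)
                    break;           /* EAGAIN: the pipe is full */
            }

            /* The descriptor is polled again before anything is reported,
             * so a full pipe does not produce a stale EPOLLOUT. */
            *after = epoll_wait(epfd, out, 1, 0);
            assert(*after <= 1);
            rc = 0;
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

On a typical Linux system the first result should be 1 and the second 0.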



Here, however, it is worth mentioning one detail. The function ep_send_events_proc() makes every effort to ensure that user-space programs receive accurate event notifications. It is still possible, though unlikely, that the set of available events changes after ep_send_events_proc() has called poll(). In this case, a user-space program might receive a notification about an event that no longer exists. This is why it is considered good practice to always use non-blocking sockets with epoll. This prevents your application from blocking unexpectedly.
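A minimal sketch of that practice follows. The helper names set_nonblocking() and drain_fd() are hypothetical. The point is only that a stale notification turns into an EAGAIN result instead of a blocked read.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Puts fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Reads everything that is currently available on a non-blocking fd.
 * Returns the number of bytes read, or -1 on a real error. If the
 * notification was stale, the result is simply 0. */
ssize_t drain_fd(int fd)
{
    assert(fd >= 0);
    char buf[4096];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;            /* end of file */
        if (errno == EINTR)
            continue;                /* interrupted, try again */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;            /* nothing more to read right now */
        return -1;
    }
}
```

If drain_fd() is called on a descriptor that has already been emptied, it returns 0 immediately, whereas a blocking read() would hang until new data arrived.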



After checking the event mask, ep_send_events_proc() simply copies the event structure to the buffer provided by the user-space program.



Edge-triggered and level-triggered



Now we can finally discuss the difference between edge triggering (ET) and level triggering (LT) in terms of their implementation. It comes down to the following fragment of ep_send_events_proc():



else if (!(epi->event.events & EPOLLET)) {
    list_add_tail(&epi->rdllink, &ep->rdllist);
}


It's very simple! For a level-triggered descriptor (one registered without EPOLLET), the function ep_send_events_proc() adds the event back to the list of ready file descriptors. As a result, on the next call to ep_poll() the same file descriptor will be checked again. Since ep_send_events_proc() always calls poll() on a file before reporting it to the user-space application, this increases the system load slightly (compared to ET) if the file descriptor is no longer ready. But the point of all this is, as mentioned above, to avoid reporting events that are no longer available.
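The difference is easy to observe from a program. In the sketch below, the helper poll_twice_without_reading() is a hypothetical name. It makes a pipe readable once and then calls epoll_wait(2) twice without consuming the data.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. extra_flags is 0 for level-triggered mode or
 * EPOLLET for edge-triggered mode. results[] receives the return values
 * of two successive epoll_wait() calls. Returns 0 on success, -1 on a
 * setup error. */
int poll_twice_without_reading(uint32_t extra_flags, int results[2])
{
    int fds[2];
    int rc = -1;
    if (pipe(fds) != 0)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = {
            .events = EPOLLIN | extra_flags,
            .data.fd = fds[0],
        };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
            write(fds[1], "x", 1) == 1) {
            struct epoll_event out[1];
            /* The byte written above is never read, so the descriptor
             * stays readable for both calls. */
            results[0] = epoll_wait(epfd, out, 1, 0);
            results[1] = epoll_wait(epfd, out, 1, 0);
            assert(results[0] <= 1 && results[1] <= 1);
            rc = 0;
        }
        close(epfd);
    }
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

In level-triggered mode both calls report the descriptor (1 and 1), because it is put back on the ready list and is still readable when it is polled again. With EPOLLET the second call reports nothing (1 and 0), because no new data has arrived since the first notification.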



After ep_send_events_proc() finishes copying events, it returns the number of events copied, so that the user-space application knows how many entries in its buffer are valid.



When ep_send_events_proc() has finished, the function ep_scan_ready_list() needs to clean up a little. First, it returns to the list of ready file descriptors the events that were left unprocessed by ep_send_events_proc(). This can happen if the number of available events exceeds the size of the buffer provided by the user program. It also quickly attaches all events from ovflist, if any, back to the list of ready file descriptors. Then ovflist is set to EP_UNACTIVE_PTR again. As a result, new events will once more be attached to the main ready list (rdllist). The function finishes by waking up any other "sleeping" processes if there are still events available.
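The handling of leftover events can be checked from user space by offering a buffer that is smaller than the number of ready descriptors. This is a sketch under stated assumptions: the helper collect_in_batches() is hypothetical, and edge-triggered mode is used only so that each descriptor is reported exactly once, which makes the counts easy to read.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 3

/* Hypothetical helper: makes three pipes readable and then calls
 * epoll_wait() three times with room for only two events per call.
 * counts[] receives the three return values. Returns 0 on success,
 * -1 on a setup error. */
int collect_in_batches(int counts[3])
{
    int fds[NPIPES][2];
    int opened = 0;
    int ok = 1;

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    for (int i = 0; i < NPIPES && ok; i++) {
        if (pipe(fds[i]) != 0) {
            ok = 0;
            break;
        }
        opened++;
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLET,
            .data.fd = fds[i][0],
        };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i][0], &ev) != 0 ||
            write(fds[i][1], "x", 1) != 1)
            ok = 0;
    }

    if (ok) {
        struct epoll_event out[2];
        for (int call = 0; call < 3; call++) {
            /* maxevents = 2: at most two events fit into the buffer. */
            counts[call] = epoll_wait(epfd, out, 2, 0);
            assert(counts[call] <= 2);
        }
    }

    for (int i = 0; i < opened; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    close(epfd);
    return ok ? 0 : -1;
}
```

The expected counts are 2, 1, and 0. The event that did not fit into the first buffer is not lost. It is delivered by the second call, and the third call finds nothing left to report.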



Summary



This concludes the fourth and final article in the series on the implementation of epoll. While writing these articles, I was impressed by the tremendous amount of thought that the authors of the Linux kernel code have put into achieving maximum efficiency and scalability. I am grateful to all the authors of the Linux code for making their knowledge available to everyone who needs it by sharing the results of their work.



How do you feel about open source software?









