Epoll implementation, part 2

When we published the translation of the first article in the series on the epoll implementation, we ran a survey on whether it was worth continuing to translate the series. More than 90% of the participants were in favor of translating the remaining articles, so today we are publishing the translation of the second article in the series.







The ep_insert() function



The ep_insert() function is one of the most important functions in the epoll implementation. Understanding how it works is essential to understanding how exactly epoll learns about new events from the files it is watching.



The definition of ep_insert() can be found at line 1267 of fs/eventpoll.c. Let's look at some code snippets from this function:



user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
  return -ENOSPC;


In this snippet, ep_insert() first checks whether the total number of files the current user is watching has reached the value specified in /proc/sys/fs/epoll/max_user_watches. If user_watches >= max_user_watches, the function returns immediately, and errno is set to ENOSPC.
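
The same check can be mimicked in user space. This is a minimal sketch rather than kernel code: struct user_account, try_add_watch(), watch_limit_demo(), and the limit of 2 are hypothetical names and values invented for illustration, and C11 atomics stand in for the kernel's atomic_long_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-user accounting the kernel keeps. */
struct user_account {
    atomic_long epoll_watches;   /* number of files this user currently watches */
};

/* Hypothetical limit; on a real system the value comes from
 * /proc/sys/fs/epoll/max_user_watches. */
static const long max_user_watches = 2;

/* Returns 0 on success, or -ENOSPC once the limit has been reached. */
static int try_add_watch(struct user_account *user)
{
    long user_watches = atomic_load(&user->epoll_watches);

    if (user_watches >= max_user_watches)
        return -ENOSPC;

    /* In this sketch the counter is incremented right away. */
    atomic_fetch_add(&user->epoll_watches, 1);
    return 0;
}

/* Small self-check: two watches fit, the third one is rejected. */
static int watch_limit_demo(void)
{
    struct user_account user;

    atomic_init(&user.epoll_watches, 0);
    assert(try_add_watch(&user) == 0);
    assert(try_add_watch(&user) == 0);
    return try_add_watch(&user);
}
```

Calling watch_limit_demo() returns the negative error code of the rejected third attempt, which is the same convention the kernel snippet above uses.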



Then ep_insert() allocates memory using the Linux kernel's slab allocator:



if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
  return -ENOMEM;


If the function managed to allocate memory for a struct epitem, it performs the following initialization:



/*  ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;


After that, ep_insert() tries to register a callback with the file descriptor. But before we can talk about that, we need to get acquainted with some important data structures.



The poll_table structure is an important entity used by the VFS poll() implementation. (I understand that this can be confusing, so let me clarify: the poll() I mention here is the implementation of the poll() file operation, not the poll() system call.) It is declared in include/linux/poll.h:



typedef struct poll_table_struct {
  poll_queue_proc _qproc;
  unsigned long _key;
} poll_table;


The poll_queue_proc type represents a callback function that looks like this:



typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);


The _key member of poll_table is not quite what it first appears to be. Despite a name that suggests some kind of "key", _key actually stores the mask of the events we are interested in. In the epoll implementation, _key is set to ~0 (the bitwise complement of 0). This means that epoll wants to be told about events of every kind. This makes sense: user-space applications can change the event mask at any time using epoll_ctl(), so accepting all events from the VFS and then filtering them inside epoll keeps things simpler.
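
The "subscribe to everything, filter later" idea can be sketched as follows. This is an illustration under assumptions, not the kernel's code: struct watched_item, filter_events(), and mask_filter_demo() are hypothetical names, and the POLLIN/POLLOUT constants from <poll.h> are used only as convenient event bits.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical, simplified stand-in for an epoll item: it only keeps
 * the event mask the application asked for. */
struct watched_item {
    unsigned long requested;   /* events the user is interested in */
};

/* The low-level layer is asked for every event (all bits set)... */
static const unsigned long subscribe_all = ~0UL;

/* ...and the interesting ones are picked out afterwards. */
static unsigned long filter_events(const struct watched_item *item,
                                   unsigned long reported)
{
    return reported & subscribe_all & item->requested;
}

static unsigned long mask_filter_demo(void)
{
    struct watched_item item = { .requested = POLLIN };

    /* The file reports that it is both readable and writable. */
    unsigned long reported = POLLIN | POLLOUT;
    assert(filter_events(&item, reported) == POLLIN);

    /* The user changes the mask later; the subscription itself
     * does not have to change. */
    item.requested = POLLOUT;
    return filter_events(&item, reported);
}
```

Because the subscription mask never changes, updating the user's mask is a plain field assignment in this sketch.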



To make it easy to recover the original epitem structure inside the poll_queue_proc callback, epoll uses a simple structure called ep_pqueue, which wraps a poll_table together with a pointer to the corresponding epitem structure (fs/eventpoll.c, line 243):



/* Wrapper struct used by poll queueing */
struct ep_pqueue {
  poll_table pt;
  struct epitem *epi;
};


Then ep_insert() initializes a struct ep_pqueue. The following code first stores, in the epi member of ep_pqueue, a pointer to the epitem structure for the file we are trying to add. It then stores ep_ptable_queue_proc() in the _qproc member of the embedded poll_table and sets its _key to ~0.



/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);


ep_insert() then calls ep_item_poll(epi, &epq.pt);, which results in a call to the poll() implementation associated with the file.



Let's look at an example that uses the poll() implementation of the Linux TCP stack and see what exactly this implementation does with poll_table.



The tcp_poll() function is the poll() implementation for TCP sockets. Its code can be found in net/ipv4/tcp.c, on line 436. Here is a snippet of this code:



unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
  unsigned int mask;
  struct sock *sk = sock->sk;
  const struct tcp_sock *tp = tcp_sk(sk);

  sock_rps_record_flow(sk);

  sock_poll_wait(file, sk_sleep(sk), wait);

  /* ... */
}


The tcp_poll() function calls sock_poll_wait(), passing sk_sleep(sk) as the second argument and wait as the third (wait is the poll_table that was previously passed to tcp_poll()).



What is sk_sleep()? As it turns out, it is just a getter for the wait queue of a particular sock structure (include/net/sock.h, line 1685):



static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
  BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
  return &rcu_dereference_raw(sk->sk_wq)->wait;
}


What does sock_poll_wait() do with the wait queue? It turns out that this function performs a simple check and then calls poll_wait() with the same parameters. The poll_wait() function then calls the callback we specified and passes it the wait queue (include/linux/poll.h, line 42):



static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
  if (p && p->_qproc && wait_address)
    p->_qproc(filp, wait_address, p);
}


In the case of epoll, _qproc is the ep_ptable_queue_proc() function, defined in fs/eventpoll.c on line 1091:



/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
       poll_table *pt)
{
  struct epitem *epi = ep_item_from_epqueue(pt);
  struct eppoll_entry *pwq;

  if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
    init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
    pwq->whead = whead;
    pwq->base = epi;
    add_wait_queue(whead, &pwq->wait);
    list_add_tail(&pwq->llink, &epi->pwqlist);
    epi->nwait++;
  } else {
    /* We have to signal that an error occurred */
    epi->nwait = -1;
  }
}


First, ep_ptable_queue_proc() tries to recover the epitem structure that corresponds to the file whose wait queue we are working with. Since epoll uses the ep_pqueue wrapper structure, recovering epitem from a poll_table pointer is a simple pointer operation.
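
That pointer operation is the familiar "container of" trick. Here is a minimal user-space sketch of it; all the *_sketch types, the container_of_sketch() macro, item_from_table(), and container_of_demo() are simplified, hypothetical stand-ins and do not reproduce the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-ins; the field layouts are illustrative only. */
struct poll_table_sketch { unsigned long key; };
struct epitem_sketch { int fd; };

struct ep_pqueue_sketch {
    struct poll_table_sketch pt;
    struct epitem_sketch *epi;
};

/* Step back from a pointer to a member to the structure that contains it. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct epitem_sketch *item_from_table(struct poll_table_sketch *pt)
{
    return container_of_sketch(pt, struct ep_pqueue_sketch, pt)->epi;
}

static int container_of_demo(void)
{
    struct epitem_sketch item = { .fd = 7 };
    struct ep_pqueue_sketch epq = { .pt = { .key = ~0UL }, .epi = &item };

    /* Only the embedded poll table is handed to the callback... */
    struct poll_table_sketch *pt = &epq.pt;

    /* ...yet the callback can still find the item it belongs to. */
    assert(item_from_table(pt) == &item);
    return item_from_table(pt)->fd;
}
```

The callback signature never has to mention the item: wrapping the table in a larger structure is enough to carry the extra pointer along.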



After that, ep_ptable_queue_proc() allocates the memory needed for a struct eppoll_entry. This structure acts as the "glue" between the wait queue of the watched file and the corresponding epitem structure for that file. It is extremely important for epoll to know where the wait queue head of the watched file is; otherwise, epoll would not be able to unregister from the wait queue later. The eppoll_entry structure also includes a wait queue entry (pwq->wait) whose wakeup function is ep_poll_callback(). pwq->wait is perhaps the most important part of the entire epoll implementation, since it is used for the following tasks:



  1. Monitoring events that occur on a specific watched file.
  2. Waking up other processes when the need arises.


Then ep_ptable_queue_proc() attaches pwq->wait to the wait queue of the target file (whead). The function also adds the struct eppoll_entry to the linked list in struct epitem (epi->pwqlist) and increments epi->nwait, which represents the length of the epi->pwqlist list.
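
The bookkeeping described here (append to a list, bump a counter, flag failures with -1) can be sketched in user space. The list below only imitates the style of the kernel's list_head; struct list_node, struct item_sketch, struct entry_sketch, attach_entry(), and pwqlist_demo() are hypothetical names, and malloc() stands in for the slab allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly linked list in the style of list_head. */
struct list_node { struct list_node *next, *prev; };

static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail_sketch(struct list_node *node, struct list_node *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

struct item_sketch {
    struct list_node pwqlist;   /* list of attached wait entries */
    int nwait;                  /* length of pwqlist, or -1 after a failure */
};

struct entry_sketch {
    struct list_node llink;     /* links the entry into pwqlist; kept first */
    struct item_sketch *base;   /* back pointer to the owning item */
};

/* Attach one more wait entry; on allocation failure flag the item with -1. */
static void attach_entry(struct item_sketch *item)
{
    struct entry_sketch *e;

    if (item->nwait >= 0 && (e = malloc(sizeof(*e))) != NULL) {
        e->base = item;
        list_add_tail_sketch(&e->llink, &item->pwqlist);
        item->nwait++;
    } else {
        item->nwait = -1;
    }
}

static int pwqlist_demo(void)
{
    struct item_sketch item = { .nwait = 0 };
    int count = 0;

    list_init(&item.pwqlist);
    attach_entry(&item);
    attach_entry(&item);

    /* Walk the list and free the entries. llink is the first member,
     * so a node pointer is also a pointer to its entry. */
    struct list_node *n = item.pwqlist.next;
    while (n != &item.pwqlist) {
        struct list_node *next = n->next;
        free((struct entry_sketch *)n);
        count++;
        n = next;
    }
    assert(count == item.nwait);
    return count;
}
```

Once nwait has been set to -1, every later call to attach_entry() takes the else branch, so a single failure stays visible to whoever inspects the item afterwards.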



And here I have a question. Why does epoll use a linked list to store eppoll_entry structures within the epitem of a single file? Doesn't an epitem need just one eppoll_entry?



I cannot answer this question with certainty. As far as I can tell, unless someone uses epoll instances in some crazy loops, the epi->pwqlist list will contain only one struct eppoll_entry, and epi->nwait will most likely be 1 for most files.



The good news is that the uncertainty around epi->pwqlist does not affect anything I discuss below, namely how Linux notifies epoll instances of events occurring on watched files.



Remember what we talked about in the previous section? epoll appends a wait_queue_t to the wait queue of the target file (to its wait_queue_head_t). Although wait_queue_t is most commonly used as a mechanism for waking up processes, it is essentially just a structure that stores a pointer to a function, which is called when Linux decides to wake up the wait_queue_t entries attached to a wait_queue_head_t. In this function, epoll can decide what to do with the wakeup signal, but epoll does not have to wake up any process! As you will see later, a call to ep_poll_callback() usually wakes nothing up.
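
To make the "a wait queue entry is just a function pointer" point concrete, here is a small user-space sketch. It is not the kernel's wait queue API: the *_sketch types, add_wait_entry(), wake_up_sketch(), record_events(), and wake_callback_demo() are hypothetical names, and a singly linked list replaces the real queue.

```c
#include <assert.h>
#include <stddef.h>

struct wait_entry_sketch;
typedef int (*wake_func_sketch)(struct wait_entry_sketch *entry,
                                unsigned long events);

/* An entry is little more than a function pointer plus private data. */
struct wait_entry_sketch {
    wake_func_sketch func;
    struct wait_entry_sketch *next;
    void *private_data;
};

struct wait_head_sketch { struct wait_entry_sketch *first; };

static void add_wait_entry(struct wait_head_sketch *head,
                           struct wait_entry_sketch *e)
{
    e->next = head->first;
    head->first = e;
}

/* "Waking up" a queue just means calling every registered function. */
static void wake_up_sketch(struct wait_head_sketch *head, unsigned long events)
{
    for (struct wait_entry_sketch *e = head->first; e; e = e->next)
        e->func(e, events);
}

/* A callback that wakes nobody: it only records what happened. */
static int record_events(struct wait_entry_sketch *entry, unsigned long events)
{
    unsigned long *pending = entry->private_data;

    *pending |= events;
    return 0;
}

static unsigned long wake_callback_demo(void)
{
    unsigned long pending = 0;
    struct wait_head_sketch head = { .first = NULL };
    struct wait_entry_sketch entry = {
        .func = record_events, .next = NULL, .private_data = &pending,
    };

    add_wait_entry(&head, &entry);
    wake_up_sketch(&head, 0x1);
    wake_up_sketch(&head, 0x4);
    assert(pending == 0x5);
    return pending;
}
```

Nothing in wake_up_sketch() knows or cares whether a callback resumes a process; that decision belongs entirely to the function stored in the entry.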



I suppose it's also worth noting that the wakeup mechanism used by poll() is completely implementation dependent. For TCP socket files, the wait queue head is the sk_wq member stored in the sock structure. This also explains the need for the ep_ptable_queue_proc() callback when working with the wait queue. Since the wait queue head can live in completely different places in the implementations for different files, we have no way of finding the wait_queue_head_t we need without using a callback.
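
The following sketch illustrates why the callback is needed. Two made-up "file types" keep their wait queue heads in different places, and only their own poll routines know where. Everything here (the *_sketch types, fake_socket, fake_pipe, remember_head(), queue_discovery_demo()) is hypothetical and heavily simplified; only the shape of poll_wait_sketch() follows the poll_wait() code shown earlier.

```c
#include <assert.h>
#include <stddef.h>

struct wait_head_sketch { int id; };
struct poll_table_sketch;

typedef void (*queue_proc_sketch)(struct wait_head_sketch *head,
                                  struct poll_table_sketch *table);

struct poll_table_sketch {
    queue_proc_sketch qproc;
    struct wait_head_sketch *seen;   /* filled in by the callback */
};

/* Same shape as poll_wait(): call back only when everything is set. */
static void poll_wait_sketch(struct wait_head_sketch *head,
                             struct poll_table_sketch *p)
{
    if (p && p->qproc && head)
        p->qproc(head, p);
}

/* Two "file types" that keep their wait queue heads in different places. */
struct fake_socket { int state; struct wait_head_sketch wq; };
struct fake_pipe   { struct wait_head_sketch readers; char buf[16]; };

static void fake_socket_poll(struct fake_socket *s, struct poll_table_sketch *p)
{
    poll_wait_sketch(&s->wq, p);        /* the file reveals its own queue */
}

static void fake_pipe_poll(struct fake_pipe *pp, struct poll_table_sketch *p)
{
    poll_wait_sketch(&pp->readers, p);
}

/* The watcher's callback: remember which queue to hook into. */
static void remember_head(struct wait_head_sketch *head,
                          struct poll_table_sketch *p)
{
    p->seen = head;
}

static int queue_discovery_demo(void)
{
    struct fake_socket sock = { .state = 0, .wq = { .id = 1 } };
    struct fake_pipe pipe_file = { .readers = { .id = 2 } };
    struct poll_table_sketch pt = { .qproc = remember_head, .seen = NULL };

    fake_socket_poll(&sock, &pt);
    assert(pt.seen == &sock.wq);

    fake_pipe_poll(&pipe_file, &pt);
    assert(pt.seen == &pipe_file.readers);
    return pt.seen->id;
}
```

The watcher never needs to know the layout of fake_socket or fake_pipe; each poll routine hands over the right queue head through the callback.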



When exactly is sk_wq in the sock structure woken up? As it turns out, the Linux socket system follows the same "OO" design principles as the VFS. The sock structure gets the following hooks on line 2312 of net/core/sock.c:



void sock_init_data(struct socket *sock, struct sock *sk)
{
  //  ...
  sk->sk_data_ready  =   sock_def_readable;
  sk->sk_write_space =  sock_def_write_space;
  //  ...
}


In sock_def_readable() and sock_def_write_space(), wake_up_interruptible_sync_poll() is called on (struct sock)->sk_wq in order to invoke the wakeup callbacks.



When are sk->sk_data_ready() and sk->sk_write_space() called? It depends on the implementation. Let's take TCP sockets as an example. sk->sk_data_ready() is called in the bottom half of the interrupt handler when a TCP connection completes the three-way handshake, or when a buffer is received for a given TCP socket. sk->sk_write_space() is called when the buffer state changes from full to available. If you keep this in mind while reading the following topics, especially the one about edge triggering, those topics will look more interesting.
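
The hook pattern itself is easy to reproduce in user space. In this sketch, struct sock_sketch, the def_* handlers, on_segment_received(), on_buffer_drained(), and hooks_demo() are all hypothetical names; the counters simply stand in for the wakeups that the real default handlers trigger.

```c
#include <assert.h>

/* Hypothetical miniature of a socket object with overridable hooks. */
struct sock_sketch {
    int readable_wakeups;
    int writable_wakeups;
    void (*data_ready)(struct sock_sketch *sk);
    void (*write_space)(struct sock_sketch *sk);
};

/* Default handlers; in this sketch they only count invocations. */
static void def_readable(struct sock_sketch *sk)    { sk->readable_wakeups++; }
static void def_write_space(struct sock_sketch *sk) { sk->writable_wakeups++; }

/* Install the default hooks, as an init routine would. */
static void sock_init_sketch(struct sock_sketch *sk)
{
    sk->readable_wakeups = 0;
    sk->writable_wakeups = 0;
    sk->data_ready = def_readable;
    sk->write_space = def_write_space;
}

/* Protocol code never names the default handlers; it only calls the hooks. */
static void on_segment_received(struct sock_sketch *sk) { sk->data_ready(sk); }
static void on_buffer_drained(struct sock_sketch *sk)   { sk->write_space(sk); }

static int hooks_demo(void)
{
    struct sock_sketch sk;

    sock_init_sketch(&sk);
    on_segment_received(&sk);
    on_segment_received(&sk);
    on_buffer_drained(&sk);

    assert(sk.writable_wakeups == 1);
    return sk.readable_wakeups;
}
```

Because callers go through the function pointers, a different protocol could install its own handlers without touching the code that reports incoming data or freed buffer space.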



Summary



This concludes the second article in the series on the epoll implementation. Next time, we will talk about what exactly epoll does in the callback registered in the socket's wait queue.












