After publishing the first part of this translation of a series of articles on epoll, we surveyed readers on whether to continue translating the series. More than 90% of survey participants were in favor of translating the remaining articles. Therefore, today we publish a translation of the second article in the series.
The ep_insert() function
The function ep_insert() is one of the most important functions in the epoll implementation. Understanding how it works is essential to understanding exactly how epoll learns about new events on the files it is watching.
The declaration of ep_insert() can be found on line 1267 of the file fs/eventpoll.c. Let's look at some code snippets from this function:
user_watches = atomic_long_read(&ep->user->epoll_watches);
if (unlikely(user_watches >= max_user_watches))
return -ENOSPC;
In this snippet, ep_insert() first checks that the total number of files the current user is watching does not exceed the value specified in /proc/sys/fs/epoll/max_user_watches. If user_watches >= max_user_watches, the function immediately returns with errno set to ENOSPC.
Next, ep_insert() allocates memory using the Linux kernel's slab memory-management mechanism:
if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
return -ENOMEM;
If enough memory could be allocated for the struct epitem, the following initialization is performed:
/* ... */
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
ep_set_ffd(&epi->ffd, tfile, fd);
epi->event = *event;
epi->nwait = 0;
epi->next = EP_UNACTIVE_PTR;
After that, ep_insert() will try to register a callback with the file being watched. But before we can talk about that, we need to get acquainted with some important data structures.
The structure poll_table is an important entity used by the VFS implementation of poll(). (I realize this can be confusing, so let me clarify: the poll() mentioned here is the poll() file operation, not the poll() system call.) It is declared in include/linux/poll.h:
typedef struct poll_table_struct {
poll_queue_proc _qproc;
unsigned long _key;
} poll_table;
poll_queue_proc is a callback function type that looks like this:
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_table_struct *);
The _key member of poll_table is not what it first appears to be. Despite a name suggesting some kind of "key", _key actually stores the mask of the events we are interested in. In the epoll implementation, _key is set to ~0 (the complement of 0). This means epoll wants to receive information about events of every kind, which makes sense: user-space applications can change the event mask at any time using epoll_ctl(), so it is easier to accept all events from the VFS and then filter them inside the epoll implementation.
To make it easy for the poll_queue_proc to recover the original epitem structure, epoll uses a simple structure called ep_pqueue, which wraps a poll_table together with a pointer to the corresponding epitem (file fs/eventpoll.c, line 243):
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
poll_table pt;
struct epitem *epi;
};
Then ep_insert() initializes a struct ep_pqueue. The following code writes into the epi member of ep_pqueue a pointer to the epitem corresponding to the file we are trying to add, then writes ep_ptable_queue_proc() into the _qproc member and sets _key to ~0:
/* Initialize the poll table using the queue callback */
epq.epi = epi;
init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
ep_insert() will then call ep_item_poll(epi, &epq.pt), which results in a call to the poll() implementation associated with the file.
Let's take the poll() implementation from the Linux TCP stack as an example and see what exactly it does with the poll_table.
The function tcp_poll() is the poll() implementation for TCP sockets. Its code can be found in the file net/ipv4/tcp.c, on line 436. Here is a snippet of that code:
unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
unsigned int mask;
struct sock *sk = sock->sk;
const struct tcp_sock *tp = tcp_sk(sk);
sock_rps_record_flow(sk);
sock_poll_wait(file, sk_sleep(sk), wait);
// ...
}
tcp_poll() calls sock_poll_wait(), passing sk_sleep(sk) as the second argument and wait as the third (this is the poll_table previously passed into tcp_poll()).
What is sk_sleep()? It turns out to be just a getter for the event wait queue of a particular sock structure (file include/net/sock.h, line 1685):
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
return &rcu_dereference_raw(sk->sk_wq)->wait;
}
What is sock_poll_wait() going to do with the wait queue? It turns out this function performs a simple check and then calls poll_wait() with the same parameters. poll_wait() then invokes the callback we specified, passing it the wait queue (file include/linux/poll.h, line 42):
static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
if (p && p->_qproc && wait_address)
p->_qproc(filp, wait_address, p);
}
In the case of epoll, _qproc is the function ep_ptable_queue_proc(), declared in the file fs/eventpoll.c on line 1091.
/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
poll_table *pt)
{
struct epitem *epi = ep_item_from_epqueue(pt);
struct eppoll_entry *pwq;
if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
/* We have to signal that an error occurred */
epi->nwait = -1;
}
}
First, ep_ptable_queue_proc() recovers the epitem structure that corresponds to the file whose wait queue we are working with. Since epoll uses the wrapper structure ep_pqueue, recovering the epitem from a poll_table pointer is a simple pointer operation.
After that, ep_ptable_queue_proc() allocates memory for a struct eppoll_entry. This structure acts as the "glue" between the wait queue of the file being watched and the corresponding epitem for that file. It is extremely important for epoll to know where the wait queue head of the watched file is; otherwise epoll would be unable to unregister from the wait queue later. The eppoll_entry structure also includes a wait queue entry (pwq->wait) whose wake-up function is set to ep_poll_callback(). pwq->wait is perhaps the most important piece of the entire epoll implementation, since it is used to solve the following tasks:
- Monitoring events that occur on a specific watched file.
- Resuming other processes when the need arises.
ep_ptable_queue_proc() then attaches pwq->wait to the wait queue of the target file (whead). It also adds the struct eppoll_entry to the linked list in struct epitem (epi->pwqlist) and increments epi->nwait, which represents the length of epi->pwqlist.
And here I have a question. Why does epoll use a linked list to store eppoll_entry structures within the epitem of a single file? Doesn't an epitem need just one eppoll_entry?
I cannot answer this question with certainty. As far as I can tell, unless someone uses epoll instances in some crazy loops, epi->pwqlist will contain only one struct eppoll_entry, and epi->nwait will most likely be 1 for most files.
The good news is that the ambiguity around epi->pwqlist does not affect anything I will talk about below, namely how Linux notifies epoll instances of events occurring on watched files.
Remember what we talked about in the previous section? epoll appends a wait_queue_t to the wait list of the target file (the wait_queue_head_t). Although a wait_queue_t is most commonly used as a mechanism for resuming processes, it is essentially just a structure storing a pointer to a function that Linux will call when it decides to wake up the wait_queue_t entries attached to a wait_queue_head_t. In that function epoll can decide what to do with the wake-up signal, and epoll does not have to resume any process! As you will see later, usually when ep_poll_callback() is invoked, no process is resumed at all.
It is also worth noting that the wake-up mechanism used by poll() is entirely implementation dependent. For TCP socket files, the wait queue head is the sk_wq member stored in the sock structure. This also explains the need for the ep_ptable_queue_proc() callback: since the head of the wait queue can live in a completely different place in each file implementation, we have no way to find the wait_queue_head_t we need without using a callback.
When exactly does the wake-up of sk_wq in the sock structure happen? It turns out the Linux socket subsystem follows the same "OO" design principles as the VFS. The sock structure declares the following hooks on line 2312 of the file net/core/sock.c:
void sock_init_data(struct socket *sock, struct sock *sk)
{
// ...
sk->sk_data_ready = sock_def_readable;
sk->sk_write_space = sock_def_write_space;
// ...
}
Both sock_def_readable() and sock_def_write_space() call wake_up_interruptible_sync_poll() on (struct sock)->sk_wq in order to invoke the registered wake-up callback functions.
When are sk->sk_data_ready() and sk->sk_write_space() called? That depends on the implementation. Take TCP sockets as an example: sk->sk_data_ready() is called in the bottom half of the interrupt handler when a TCP connection completes the three-way handshake, or when data arrives in the buffer of a TCP socket; sk->sk_write_space() is called when the buffer state changes from full to available. Keeping this in mind when analyzing the following topics, especially the one about edge triggering, will make them more interesting.
Summary
This concludes the second article in the series on the epoll implementation. Next time we will talk about what exactly epoll does in the callback it registers in the socket's wake-up queue.
Have you used epoll?