BPF for Linux Monitoring Book

Hello Habitants! The BPF virtual machine is one of the most important components of the Linux kernel. Used wisely, it enables systems engineers to find failures and solve even the most complex problems. You will learn how to write programs that monitor and modify the behavior of the kernel, how to safely inject code to observe kernel events, and much more. David Calavera and Lorenzo Fontana will help you unlock the power of BPF. Expand your knowledge of performance optimization, networking, and security.

- Use BPF to trace and modify Linux kernel behavior.
- Inject code to safely monitor events in the kernel, without the need to recompile the kernel or reboot the system.
- Use handy code examples in C, Go, or Python.
- Stay in control by owning the BPF program life cycle.





Linux Kernel Security, Capabilities, and Seccomp



BPF provides a powerful way to extend the kernel without compromising stability, security, or speed. For this reason, the kernel developers thought it would be a good idea to use its versatility to improve process isolation in Seccomp by implementing Seccomp filters backed by BPF programs, also known as Seccomp BPF. In this chapter, we explain what Seccomp is and how it is used. Then you will learn how to write Seccomp filters using BPF programs. After that, we look at the built-in BPF hooks that the kernel provides for Linux security modules.



Linux Security Modules (LSM) is a framework that provides a set of functions that can be used to implement various security models in a standardized way. An LSM can be used directly in the kernel source tree, as is the case for AppArmor, SELinux, and Tomoyo.



Let's start by discussing Linux capabilities.



Capabilities



The idea behind Linux capabilities is to grant an unprivileged process permission to perform a specific task without resorting to suid or otherwise making the process fully privileged. This reduces the attack surface while still allowing the process to do what it needs. For example, if your application needs to open a privileged port, say 80, instead of running the process as root, you can simply give it the CAP_NET_BIND_SERVICE capability.



Consider a Go program named main.go:



package main

import (
	"log"
	"net/http"
)

func main() {
	// Port 80 is privileged: binding it needs root or CAP_NET_BIND_SERVICE.
	log.Fatalf("%v", http.ListenAndServe(":80", nil))
}


This program starts an HTTP server on port 80, which is a privileged port. We would normally run it right after compiling:



$ go build -o capabilities main.go
$ ./capabilities


However, since we are not granting root privileges, this code will throw an error when binding the port:



2019/04/25 23:17:06 listen tcp :80: bind: permission denied
exit status 1


capsh (capability shell wrapper) is a tool that launches a shell with a specific set of capabilities.


In this case, as already mentioned, instead of granting full root rights, you can allow binding to privileged ports by enabling cap_net_bind_service along with the other capabilities the program already needs. To do this, we can wrap our program in capsh:



# capsh --caps='cap_net_bind_service+eip cap_setpcap,cap_setuid,cap_setgid+ep' \
   --keep=1 --user="nobody" \
   --addamb=cap_net_bind_service -- -c "./capabilities"


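Let's break this command down.



  • capsh - we use capsh as a wrapper.
  • --caps='cap_net_bind_service+eip cap_setpcap,cap_setuid,cap_setgid+ep' - since we need to change the user (we don't want to run as root), we specify cap_net_bind_service along with the capabilities needed to actually change the user ID from root to nobody, namely cap_setuid and cap_setgid.
  • --keep=1 - keeps the capabilities that were set when switching away from the root account.
  • --user="nobody" - the program will ultimately run as the user nobody.
  • --addamb=cap_net_bind_service - sets an ambient capability, because ambient capabilities are cleared after switching from root.
  • -- -c "./capabilities" - finally, runs our program.


Ambient capabilities are a special kind of capability that child programs inherit when the current program executes them with execve(). A capability can be ambient only if it is both permitted and inheritable.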


You are probably wondering what +eip means after a capability in the --caps option. These flags name the capability sets to which the capability is added:



- permitted (p): the process is allowed to hold the capability and to activate it;



- effective (e): the capability is active, so the kernel honors it in permission checks;



- inheritable (i): the capability can be passed on to programs started by the process.



Since we want to use cap_net_bind_service, it must be effective, hence the e flag. Our command then starts a shell, which in turn launches the capabilities binary, so the capability must be inheritable, hence the i flag. Finally, the capability must be in the permitted set, which bounds what can be made effective, hence the p flag. Together this gives cap_net_bind_service+eip.
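As an illustrative sketch (not part of the book's text), the three sets behind the e, i, and p flags can be read for the current thread with the raw capget system call. The names cap_sets, read_cap_sets, and cap_in_set are hypothetical helpers introduced here, and the example assumes a Linux machine with the kernel headers installed.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The three capability sets of a thread, as 64-bit masks. */
struct cap_sets {
  uint64_t effective;   /* e: honored by the kernel right now */
  uint64_t permitted;   /* p: upper bound for the effective set */
  uint64_t inheritable; /* i: can be passed on across execve() */
};

/* Fill `out` with the calling thread's capability sets using the raw
 * capget system call. Returns 0 on success, -1 on failure (errno is set). */
static int read_cap_sets(struct cap_sets *out) {
  struct __user_cap_header_struct hdr = {
      .version = _LINUX_CAPABILITY_VERSION_3,
      .pid = 0, /* 0 means "the calling thread" */
  };
  struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = {{0}};

  if (syscall(SYS_capget, &hdr, data) != 0)
    return -1;

  /* Version 3 splits each 64-bit set into two 32-bit halves. */
  out->effective = ((uint64_t)data[1].effective << 32) | data[0].effective;
  out->permitted = ((uint64_t)data[1].permitted << 32) | data[0].permitted;
  out->inheritable =
      ((uint64_t)data[1].inheritable << 32) | data[0].inheritable;
  return 0;
}

/* Return 1 if capability number `cap` is present in `set`, 0 otherwise. */
static int cap_in_set(uint64_t set, int cap) {
  return (int)((set >> cap) & 1);
}
```

Calling read_cap_sets from the capabilities binary and testing cap_in_set(sets.effective, CAP_NET_BIND_SERVICE) would be one way to confirm that the capsh wrapper did what we asked.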



You can check the result with ss. We have trimmed the output a little to fit on the page, but it still shows the bound port and a user ID other than 0, in this case 65534:



# ss -tulpn -e -H | cut -d' ' -f17-
128 *:80 *:*
users:(("capabilities",pid=30040,fd=3)) uid:65534 ino:11311579 sk:2c v6only:0


In this example we used capsh, but you can write your own wrapper using libcap. See man 3 libcap for more information.



When writing programs, developers quite often do not know in advance all the capabilities a program needs at runtime; moreover, these capabilities may change in new versions.



To better understand the capabilities our program needs, we can use the BCC capable tool, which attaches a kprobe to the cap_capable kernel function:



/usr/share/bcc/tools/capable
TIME      UID  PID    TID    COMM           CAP  NAME                  AUDIT
10:12:53  0    424    424    systemd-udevd  12   CAP_NET_ADMIN         1
10:12:57  0    1103   1101   timesync       25   CAP_SYS_TIME          1
10:12:57  0    19545  19545  capabilities   10   CAP_NET_BIND_SERVICE  1


We can achieve the same by using bpftrace with the one-line kprobe in the cap_capable kernel function:



bpftrace -e \
   'kprobe:cap_capable {
      time("%H:%M:%S ");
      printf("%-6d %-6d %-16s %-4d %d\n", uid, pid, comm, arg2, arg3);
    }' \
    | grep -i capabilities


This prints something like the following if our program exercises its capabilities after the kprobe is attached:



12:01:56 1000 13524 capabilities 21 0
12:01:56 1000 13524 capabilities 21 0
12:01:56 1000 13524 capabilities 21 0
12:01:56 1000 13524 capabilities 12 0
12:01:56 1000 13524 capabilities 12 0
12:01:56 1000 13524 capabilities 12 0
12:01:56 1000 13524 capabilities 12 0
12:01:56 1000 13524 capabilities 10 1


The fifth column is the capability the process needs. Because this output includes non-audit events, we see all the non-audit checks and, last, the required capability with the audit flag (the final column) set to 1. The capability we are interested in is CAP_NET_BIND_SERVICE, which is defined as a constant with ID 10 in the kernel source file include/uapi/linux/capability.h:



/* Allows binding to TCP/UDP sockets below 1024 */
/* Allows binding to ATM VCIs below 32 */
#define CAP_NET_BIND_SERVICE 10
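To make the numeric CAP column in the traces easier to read, here is a small sketch that maps numbers to names and back. It covers only the four capabilities that appear in the outputs above (number 21 is CAP_SYS_ADMIN); cap_table, cap_name, and cap_number are hypothetical names introduced for this illustration, and the constants come from linux/capability.h.

```c
#include <linux/capability.h>
#include <stddef.h>
#include <string.h>

/* One row per capability seen in the traces above. */
struct cap_entry {
  int number;
  const char *name;
};

static const struct cap_entry cap_table[] = {
    {CAP_NET_BIND_SERVICE, "CAP_NET_BIND_SERVICE"}, /* 10 */
    {CAP_NET_ADMIN, "CAP_NET_ADMIN"},               /* 12 */
    {CAP_SYS_ADMIN, "CAP_SYS_ADMIN"},               /* 21 */
    {CAP_SYS_TIME, "CAP_SYS_TIME"},                 /* 25 */
};

static const size_t cap_table_len = sizeof(cap_table) / sizeof(cap_table[0]);

/* Return the name for a capability number, or NULL if it is not listed. */
static const char *cap_name(int number) {
  for (size_t i = 0; i < cap_table_len; i++) {
    if (cap_table[i].number == number)
      return cap_table[i].name;
  }
  return NULL;
}

/* Return the number for a capability name, or -1 if it is not listed. */
static int cap_number(const char *name) {
  for (size_t i = 0; i < cap_table_len; i++) {
    if (strcmp(cap_table[i].name, name) == 0)
      return cap_table[i].number;
  }
  return -1;
}
```

A complete tool would list every constant from the header; the short table keeps the sketch focused on the values discussed here.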


Capabilities are often used by container runtimes such as runC or Docker so that containers run unprivileged and are allowed only the capabilities needed to run most applications. When an application requires specific capabilities, Docker can provide them with --cap-add:



docker run -it --rm --cap-add=NET_ADMIN ubuntu ip link add dummy0 type dummy


This command will provide the container with the CAP_NET_ADMIN capability, which will allow it to configure a network link to add the dummy0 interface.



The next section shows how to achieve similar restrictions through filtering, using a different technique that lets us programmatically implement our own filters.



Seccomp



Seccomp stands for Secure Computing. It is a security layer implemented in the Linux kernel that allows developers to filter certain system calls. Although Seccomp is comparable to Linux capabilities, its ability to handle specific system calls makes it much more flexible than capabilities.



Seccomp and Linux's capabilities are not mutually exclusive, and are often used together to benefit from both approaches. For example, you might want to give a process the CAP_NET_ADMIN capability, but not allow it to accept socket connections by blocking the accept and accept4 system calls.



The Seccomp filtering method is based on BPF filters operating in SECCOMP_MODE_FILTER mode, and system call filtering is performed in the same way as for packets.



Seccomp filters are loaded using prctl via the PR_SET_SECCOMP operation. These filters take the form of a BPF program that runs for each Seccomp packet, represented by the seccomp_data structure. This structure contains the system call number, the architecture, the CPU instruction pointer at the time of the system call, and up to six system call arguments, each expressed as a uint64.



This is how the seccomp_data structure looks in the kernel source file linux/seccomp.h:



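struct seccomp_data {
  int nr;                    /* system call number */
  __u32 arch;                /* AUDIT_ARCH_* value */
  __u64 instruction_pointer; /* CPU instruction pointer */
  __u64 args[6];             /* up to six system call arguments */
};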


As you can see from this structure, we can filter by the system call, its arguments, or a combination of both.
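A filter addresses these fields by byte offset, so it helps to see where each one sits. The sketch below is an illustration added here: arg_low_word_offset and arg_high_word_offset are hypothetical helpers, and they rely on the compiler-provided __BYTE_ORDER__ macro (available in GCC and Clang). Because a cBPF load reads 32 bits at a time, a 64-bit argument has to be inspected as two separate words.

```c
#include <linux/seccomp.h>
#include <stddef.h>

/* Byte offsets a filter can pass to BPF_LD + BPF_W + BPF_ABS. */
enum {
  OFF_NR = offsetof(struct seccomp_data, nr),
  OFF_ARCH = offsetof(struct seccomp_data, arch),
  OFF_IP = offsetof(struct seccomp_data, instruction_pointer),
  OFF_ARGS = offsetof(struct seccomp_data, args),
};

/* Offset of the low 32 bits of system call argument n (0..5). */
static size_t arg_low_word_offset(unsigned n) {
  size_t base = OFF_ARGS + n * sizeof(__u64);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return base + sizeof(__u32);
#else
  return base;
#endif
}

/* Offset of the high 32 bits of system call argument n (0..5). */
static size_t arg_high_word_offset(unsigned n) {
  size_t base = OFF_ARGS + n * sizeof(__u64);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return base;
#else
  return base + sizeof(__u32);
#endif
}
```

With these offsets, a filter that wants to match on an argument would load both halves and compare each against the corresponding half of the expected value.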



After receiving each Seccomp packet, the filter must process it to reach a final decision and tell the kernel what to do next. The final decision is expressed as one of the following return values (status codes).



- SECCOMP_RET_KILL_PROCESS - terminates the entire process immediately; the filtered system call is not executed.



- SECCOMP_RET_KILL_THREAD - terminates the current thread immediately; the filtered system call is not executed.



- SECCOMP_RET_KILL - alias for SECCOMP_RET_KILL_THREAD, kept for backward compatibility.



- SECCOMP_RET_TRAP - the system call is not executed, and the SIGSYS (Bad System Call) signal is sent to the calling task.



- SECCOMP_RET_ERRNO - the system call is not executed, and the SECCOMP_RET_DATA portion of the filter's return value is passed to user space as errno. Different errno values are returned depending on the cause of the error. The error numbers are listed in the next section.



- SECCOMP_RET_TRACE - notifies a ptrace tracer that requested PTRACE_O_TRACESECCOMP, so that it can intercept the system call and observe or control the process. If no tracer is attached, the system call is not executed and fails with errno set to ENOSYS.



- SECCOMP_RET_LOG - the system call is allowed and logged.



- SECCOMP_RET_ALLOW - the system call is simply allowed.
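Each of these return values is a 32-bit number in which the upper 16 bits name the action and the lower 16 bits carry optional data, such as the errno value for SECCOMP_RET_ERRNO. A minimal sketch of packing and unpacking such a value follows; ret_action, ret_data, and ret_errno are hypothetical helper names, while the masks SECCOMP_RET_ACTION_FULL and SECCOMP_RET_DATA come from linux/seccomp.h (reasonably recent kernel headers are assumed).

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>

/* Action part of a filter return value (upper 16 bits). */
static uint32_t ret_action(uint32_t ret) {
  return ret & SECCOMP_RET_ACTION_FULL;
}

/* Data part of a filter return value (lower 16 bits). */
static uint32_t ret_data(uint32_t ret) {
  return ret & SECCOMP_RET_DATA;
}

/* Build the value a filter returns to fail a system call with `err`. */
static uint32_t ret_errno(int err) {
  return SECCOMP_RET_ERRNO | ((uint32_t)err & SECCOMP_RET_DATA);
}
```

For example, ret_errno(EPERM) yields a value whose action is SECCOMP_RET_ERRNO and whose data is the numeric value of EPERM.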



ptrace is a system call for implementing tracing mechanisms on a process, called the tracee, with the ability to monitor and control its execution. The tracing program can effectively influence execution and change the tracee's memory and registers. In the context of Seccomp, ptrace is used when triggered by the SECCOMP_RET_TRACE status code, so the tracer can prevent the system call from being executed and implement its own logic.


Seccomp errors



From time to time, when working with Seccomp, you will encounter various errors, which are reported through a return value of type SECCOMP_RET_ERRNO. To signal an error, the seccomp system call returns -1 instead of 0.



The following errors are possible:



- EACCES - the caller is not allowed to make the system call. This usually happens because it neither has the CAP_SYS_ADMIN capability nor has set no_new_privs with prctl (more on that later);



- EFAULT - the args pointer passed to the seccomp system call is not a valid address;



- EINVAL - there can be four reasons here:

  - the requested operation is unknown or is not supported by the kernel in the current configuration;

  - the specified flags are invalid for the requested operation;

  - the operation includes BPF_ABS, but there are problems with the specified offset, which may exceed the size of the seccomp_data structure;

  - the number of instructions passed to the filter exceeds the maximum;



- ENOMEM - not enough memory to run the program;



- EOPNOTSUPP - the operation was SECCOMP_GET_ACTION_AVAIL, but the kernel does not support the return action given in the arguments;



- ESRCH - a problem occurred while synchronizing another thread;



- ENOSYS - no tracer is attached to the SECCOMP_RET_TRACE action.



prctl is a system call that allows a user-space program to manipulate (set and get) specific aspects of a process, such as endianness, thread names, secure computing (Seccomp) mode, privileges, Perf events, and so on.


Seccomp may sound like a sandboxing technology, but it isn't. Seccomp is a facility that lets users build a sandboxing mechanism. Now let's look at how to write programs that interact with it through a filter of our own.



Sample BPF Seccomp filter



Here we will show how to combine the two actions discussed earlier, namely:



- write the Seccomp BPF program, which will be used as a filter with different return codes depending on the decisions made;



- load the filter using prctl.



First you need headers from the standard library and the Linux kernel:



#include <errno.h>
#include <linux/audit.h>
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <linux/unistd.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>


Before trying this example, make sure the kernel is compiled with CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER set to y. On a running machine, you can check this as follows (provided the kernel exposes its configuration in /proc/config.gz):



zcat /proc/config.gz | grep -i CONFIG_SECCOMP
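When /proc/config.gz is not available, a program can ask the kernel directly. The following sketch uses prctl with PR_GET_SECCOMP; seccomp_mode and seccomp_supported are hypothetical helper names. Two caveats apply: the call reveals only whether CONFIG_SECCOMP is present and says nothing about CONFIG_SECCOMP_FILTER, and a thread that is already in strict Seccomp mode is killed if it makes this call.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Ask the kernel for the calling thread's Seccomp mode.
 *   0  - Seccomp is not active for this thread
 *   2  - a Seccomp filter is active (and lets this prctl call through)
 *  -1  - with errno == EINVAL, the kernel was built without CONFIG_SECCOMP */
static int seccomp_mode(void) {
  return prctl(PR_GET_SECCOMP, 0, 0, 0, 0);
}

/* Return 1 if the running kernel answered the query, 0 otherwise. */
static int seccomp_supported(void) {
  return seccomp_mode() >= 0;
}
```

Inside a container, a result of 2 is common, because many container runtimes install a default Seccomp profile.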



The rest of the code is a two-part install_filter function. The first part contains our list of BPF filtering instructions:



static int install_filter(int nr, int arch, int error) {
  struct sock_filter filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, arch))),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 0, 3),
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, nr))),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, nr, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (error & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
  };


Instructions are set using the BPF_STMT and BPF_JUMP macros defined in the linux/filter.h file.

Let's go through the instructions.



- BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, arch))) - loads (BPF_LD) one word (BPF_W) into the accumulator; the word is read from the packet data at a fixed offset (BPF_ABS), here the offset of the arch field.



- BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, arch, 0, 3) - uses BPF_JEQ to check whether the architecture value in the accumulator equals the constant (BPF_K) arch. If so, it jumps at offset 0, that is, to the next instruction; otherwise, it jumps at offset 3, which skips the remaining checks and lands on the final instruction, SECCOMP_RET_ALLOW, because arch does not match.



- BPF_STMT(BPF_LD + BPF_W + BPF_ABS, (offsetof(struct seccomp_data, nr))) - loads (BPF_LD) one word (BPF_W) into the accumulator from the fixed offset (BPF_ABS) of the nr field, which holds the system call number.



- BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, nr, 0, 1) - compares the system call number with the value of the nr variable. If they are equal, execution continues with the next instruction, which disallows the system call; otherwise, it skips one instruction and allows the system call with SECCOMP_RET_ALLOW.



- BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (error & SECCOMP_RET_DATA)) - terminates the program with BPF_RET and returns SECCOMP_RET_ERRNO with the error number taken from the error variable.



- BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW) - terminates the program with BPF_RET and allows the system call with SECCOMP_RET_ALLOW.
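One way to convince yourself of these jump offsets is to step through the six instructions in user space. The sketch below is a deliberately tiny interpreter written for this illustration, not the kernel's implementation: run_filter and simulate are hypothetical names, only the three opcodes used by the filter are handled, and anything else is treated as SECCOMP_RET_KILL_THREAD. Nothing is installed into the kernel, so it is safe to run anywhere the Linux headers are present.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>

/* Interpret the small cBPF subset used by the filter above. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
                           const struct seccomp_data *data) {
  uint32_t acc = 0; /* the cBPF accumulator */

  for (size_t pc = 0; pc < len; pc++) {
    const struct sock_filter *ins = &prog[pc];

    switch (ins->code) {
    case BPF_LD | BPF_W | BPF_ABS: /* load a 32-bit word at offset k */
      if (ins->k > sizeof(*data) - sizeof(acc))
        return SECCOMP_RET_KILL_THREAD;
      memcpy(&acc, (const char *)data + ins->k, sizeof(acc));
      break;
    case BPF_JMP | BPF_JEQ | BPF_K: /* skip jt or jf instructions */
      pc += (acc == ins->k) ? ins->jt : ins->jf;
      break;
    case BPF_RET | BPF_K: /* final verdict */
      return ins->k;
    default: /* opcode outside this sketch's subset */
      return SECCOMP_RET_KILL_THREAD;
    }
  }
  return SECCOMP_RET_KILL_THREAD; /* ran off the end of the program */
}

/* Build the same six instructions as install_filter (blocking write on
 * x86-64 with EPERM) and evaluate them for a made-up system call. */
static uint32_t simulate(int nr, uint32_t arch) {
  struct sock_filter filter[] = {
      BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
      BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, AUDIT_ARCH_X86_64, 0, 3),
      BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
      BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
      BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
      BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
  };
  struct seccomp_data data = {.nr = nr, .arch = arch};

  return run_filter(filter, sizeof(filter) / sizeof(filter[0]), &data);
}
```

Calling simulate(__NR_write, AUDIT_ARCH_X86_64) ends at the SECCOMP_RET_ERRNO instruction, while a different system call number or a different architecture value ends at SECCOMP_RET_ALLOW.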



SECCOMP IS CBPF

You might be wondering why a list of instructions is used instead of a compiled ELF object or a C program that is then JIT-compiled.



There are two reasons for this.



• First, Seccomp uses cBPF (classic BPF), not eBPF, which means it lacks eBPF's general-purpose register set and works mainly through an accumulator that stores the last computation result, as you can see in the example.



• Second, Seccomp takes a pointer to an array of BPF instructions directly and nothing else. The macros that we have used only help to specify these instructions in a form convenient for programmers.


If you need more help understanding this assembly, consider pseudocode that does the same:



if (arch != AUDIT_ARCH_X86_64) {
    return SECCOMP_RET_ALLOW;
}
if (nr == __NR_write) {
    return SECCOMP_RET_ERRNO;
}
return SECCOMP_RET_ALLOW;


After defining the filter code as an array of sock_filter structures, you need to define a sock_fprog containing the code and the computed filter length. This data structure is passed later as an argument when the filter is installed for the process:



struct sock_fprog prog = {
   .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
   .filter = filter,
};


There is only one thing left to do in the install_filter function: load the program itself! To do this, we use prctl with the PR_SET_SECCOMP option to enter secure computing mode. We then tell it to load the filter in SECCOMP_MODE_FILTER mode; the filter itself is contained in the prog variable of type sock_fprog:



  if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
    perror("prctl(PR_SET_SECCOMP)");
    return 1;
  }
  return 0;
}


Finally, we can use our install_filter function, but before that we need to use prctl to set PR_SET_NO_NEW_PRIVS for the current execution and thus avoid a situation where child processes get more privileges than their parent. With that set, the prctl call in the install_filter function works without root privileges.



Now we can call the install_filter function. Let's block every write system call on the x86-64 architecture and answer each attempt with a permission-denied error (EPERM). After installing the filter, we continue execution by running the command passed as the first argument:



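int main(int argc, char const *argv[]) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s \"<command>\"\n", argv[0]);
    return 1;
  }
  /* Lets an unprivileged process install a Seccomp filter. */
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
    perror("prctl(NO_NEW_PRIVS)");
    return 1;
  }
  /* Make every write system call on x86-64 fail with EPERM. */
  if (install_filter(__NR_write, AUDIT_ARCH_X86_64, EPERM))
    return 1;
  return system(argv[1]);
}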


Let's get started. We can use either clang or gcc to compile our program, either way, it's just compiling the main.c file with no special options:



clang main.c -o filter-write


As noted, we have blocked all writes in the program. To test this, we need a program that writes something to its output; ls seems like a good candidate. This is how it normally behaves:



ls -la
total 36
drwxr-xr-x 2 fntlnz users 4096 Apr 28 21:09 .
drwxr-xr-x 4 fntlnz users 4096 Apr 26 13:01 ..
-rwxr-xr-x 1 fntlnz users 16800 Apr 28 21:09 filter-write
-rw-r--r-- 1 fntlnz users 19 Apr 28 21:09 .gitignore
-rw-r--r-- 1 fntlnz users 1282 Apr 28 21:08 main.c


Great! Here is how our wrapper program is used: we simply pass the program we want to test as the first argument:



./filter-write "ls -la"


When executed, this program produces completely empty output. However, we can use strace to see what's going on:



strace -f ./filter-write "ls -la"


The output is heavily abridged, but the relevant part shows that the writes are blocked with the EPERM error, the same one we configured. This means that the program outputs nothing because it cannot use the write system call:



[pid 25099] write(2, "ls: ", 4) = -1 EPERM (Operation not permitted)
[pid 25099] write(2, "write error", 11) = -1 EPERM (Operation not permitted)
[pid 25099] write(2, "\n", 1) = -1 EPERM (Operation not permitted)
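The same effect can be observed from C without strace. The sketch below is an added illustration under a few assumptions: it runs on x86-64 or AArch64 Linux, the environment permits installing Seccomp filters, and check_filter_in_child and NATIVE_AUDIT_ARCH are hypothetical names. The filter is installed only in a forked child, so the calling process keeps its ability to write.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

/* Fork a child, block write() there with EPERM, and report what the child
 * saw. Returns 0 if write failed with EPERM, 1 if it did not, 2 or 3 if
 * one of the prctl calls failed, and -1 if fork or waitpid failed. */
static int check_filter_in_child(void) {
  pid_t pid = fork();
  if (pid < 0)
    return -1;

  if (pid == 0) {
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, NATIVE_AUDIT_ARCH, 0, 3),
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET + BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
      _exit(2);
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
      _exit(3);

    /* A zero-length write succeeds without a filter; here it must fail. */
    errno = 0;
    ssize_t n = write(STDOUT_FILENO, "", 0);
    _exit(n == -1 && errno == EPERM ? 0 : 1);
  }

  int status = 0;
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child reports its result through its exit status precisely because it can no longer print anything once the filter is active.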


Now you understand how Seccomp BPF works and have a good idea of what can be done with it. But wouldn't you like to do the same with eBPF instead of cBPF in order to use its full power?



When thinking about eBPF programs, most people assume that you simply write them and load them with administrator privileges. While this is generally true, the kernel implements a set of mechanisms to protect eBPF objects at various levels. These mechanisms are called BPF LSM hooks.



BPF LSM hooks



To provide architecture-independent monitoring of system events, LSM implements the concept of hooks. A hook call is technically similar to a system call, but it is system independent and integrated with the LSM framework. Hooks provide an abstraction layer that helps avoid the problems that arise when dealing with system calls on different architectures.



At the time of this writing, the kernel has seven hooks associated with BPF programs, and SELinux is the only built-in LSM that implements them.



The source code for the hooks is located in the kernel tree in the include/linux/security.h file:



extern int security_bpf(int cmd, union bpf_attr *attr, unsigned int size);
extern int security_bpf_map(struct bpf_map *map, fmode_t fmode);
extern int security_bpf_prog(struct bpf_prog *prog);
extern int security_bpf_map_alloc(struct bpf_map *map);
extern void security_bpf_map_free(struct bpf_map *map);
extern int security_bpf_prog_alloc(struct bpf_prog_aux *aux);
extern void security_bpf_prog_free(struct bpf_prog_aux *aux);


Each of them will be called at different stages of execution:



- security_bpf - performs initial checks on executed BPF system calls;



- security_bpf_map - checks when the kernel returns a file descriptor for a map;



- security_bpf_prog - checks when the kernel returns a file descriptor for an eBPF program;



- security_bpf_map_alloc - initializes the security field inside BPF maps;



- security_bpf_map_free - cleans up the security field inside BPF maps;



- security_bpf_prog_alloc - initializes the security field inside BPF programs;



- security_bpf_prog_free - cleans up the security field inside BPF programs.



Now, seeing all this, we understand that the idea behind the BPF LSM hooks is that they can provide protection for every eBPF object, ensuring that only those with the appropriate privileges can perform operations on maps and programs.



Summary



Security is not something you can enforce in a one-size-fits-all manner for anything you want to protect. It is important to be able to protect systems at different levels and in different ways. Believe it or not, the best way to secure a system is to organize different layers of protection from different angles, so that a breach of one layer does not give access to the entire system. The kernel developers have done a great job of providing us with a set of different layers and touchpoints. We hope we have given you a good understanding of what these layers are and how to use BPF programs to work with them.



About the authors



David Calavera is CTO at Netlify. He has worked for Docker Support and has contributed to the development of Runc, Go and BCC tools, as well as other open source projects. Known for his work on Docker projects and the development of the Docker plugin ecosystem. David is very fond of flame graphs and always strives to optimize performance.



Lorenzo Fontana is part of the open source development team at Sysdig, where he is mainly involved in Falco, a Cloud Native Computing Foundation project that provides container runtime security and anomaly detection through the kernel module and eBPF. He is passionate about distributed systems, software-defined networking, the Linux kernel, and performance analysis.



»More details about the book can be found on the website of the publishing house

» Table of Contents

» Excerpt



Habitants get a 25% discount with the coupon Linux



Upon payment for the paper version of the book, an e-book is sent by e-mail.


