Linux containers in a couple of lines of code

As a continuation of the previous article on KVM, we are publishing a new translation that explains how containers work, using the launch of a busybox Docker image as an example.


This article on containers is a continuation of the previous article on KVM. I'd like to show you exactly how containers work by running a busybox Docker image in our own little container.



Unlike a virtual machine, the term container is rather vague and imprecise. What we usually call a container is a self-contained bundle of code with all of its required dependencies that can be shipped together and run in an isolated environment inside the host operating system. If that sounds like a description of a virtual machine, let's dig deeper into the topic and see how containers are actually implemented.



BusyBox Docker



Our main goal is to run the regular busybox Docker image, but without Docker. Docker uses btrfs as the filesystem for its images. Let's download the image and unpack it into a directory:



mkdir rootfs
docker export $(docker create busybox) | tar -C rootfs -xvf -


We now have the busybox image filesystem unpacked into the rootfs folder. Of course, you can run ./rootfs/bin/sh and get a working shell, but if we look at the list of processes, files, or network interfaces, we can see that we still have access to the entire OS.



So let's try to create an isolated environment.



Clone



Since we want to control what the child process has access to, we will use clone(2) instead of fork(2). Clone does almost the same thing, but lets us pass flags indicating which resources we want to share (with the host).



The following flags are available:



  • CLONE_NEWNET - isolated network devices
  • CLONE_NEWUTS - hostname and domain name (UTS, UNIX Time-Sharing System)
  • CLONE_NEWIPC - IPC objects
  • CLONE_NEWPID - process identifiers (PID)
  • CLONE_NEWNS - mount points (file systems)
  • CLONE_NEWUSER - users and groups


In our experiment, we will try to isolate processes, IPC, network and file systems. So let's start:



#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child's stack; the stack grows downwards on x86, so we pass a
   pointer to the end of this buffer to clone(). */
static char child_stack[1024 * 1024];

int child_main(void *arg) {
  printf("Hello from child! PID=%d\n", getpid());
  return 0;
}

int main(int argc, char *argv[]) {
  int flags =
      CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWNET;
  int pid = clone(child_main, child_stack + sizeof(child_stack),
                  flags | SIGCHLD, argv + 1);
  if (pid < 0) {
    fprintf(stderr, "clone failed: %d\n", errno);
    return 1;
  }
  waitpid(pid, NULL, 0);
  return 0;
}


The code must be run with superuser privileges, otherwise cloning will fail.
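If you prefer to fail with a friendlier message, you can check the effective UID before calling clone(). This is just a minimal sketch of mine, not part of the original program:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: creating these namespaces normally requires
   CAP_SYS_ADMIN, so bail out early with a readable message instead
   of letting clone() fail with EPERM. */
static int require_root(void) {
  if (geteuid() != 0) {
    fprintf(stderr, "please run as root (e.g. via sudo)\n");
    return -1;
  }
  return 0;
}

Calling require_root() at the top of main() and returning early on failure keeps the error message readable.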



The experiment gives an interesting result: the child's PID is 1. We know very well that the init process usually has PID 1, but here the child process gets its own isolated process list, in which it is the first process.
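To convince ourselves that the child really lives in separate namespaces, we can compare the namespace links the kernel exposes under /proc. The following standalone sketch is not part of the original code; it simply prints the namespace identifiers of the calling process, and the values will differ between the host shell and our container for the namespaces we unshared:

#include <stdio.h>
#include <unistd.h>

/* Print the namespace identifiers of the current process by reading
   the /proc/self/ns/* symlinks (e.g. "pid:[4026531836]"). */
static void print_ns(const char *name) {
  char path[64], link[64];
  snprintf(path, sizeof(path), "/proc/self/ns/%s", name);
  ssize_t n = readlink(path, link, sizeof(link) - 1);
  if (n > 0) {
    link[n] = '\0';
    printf("%-4s -> %s\n", name, link);
  }
}

int main(void) {
  const char *names[] = {"pid", "net", "ipc", "mnt", "uts"};
  for (unsigned i = 0; i < sizeof(names) / sizeof(names[0]); i++)
    print_ns(names[i]);
  return 0;
}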



Working shell



To make it easier to explore the new environment, let's start a shell in the child process so that we can run arbitrary commands, much like docker run:



int child_main(void *arg) {
  char **argv = (char **)arg;
  execvp(argv[0], argv);
  return 0;
}


Now launching our application with /bin/sh as the argument opens a real shell in which we can enter commands. This result shows how wrong we were to talk about isolation:



# echo $$
1
# ps
  PID TTY          TIME CMD
 5998 pts/31   00:00:00 sudo
 5999 pts/31   00:00:00 main
 6001 pts/31   00:00:00 sh
 6004 pts/31   00:00:00 ps


As we can see, the shell process itself has PID 1, but in fact it can see and access all the other processes of the host OS. The reason is that the process list is read from procfs, which is still inherited from the host.



So, let's unmount procfs:



umount2("/proc", MNT_DETACH);




Now ps, mount, and other commands fail when the shell starts because procfs is no longer mounted. Still, this is better than leaking the parent's procfs.



Chroot



Usually chroot is used to change the root directory, but we will use the alternative, pivot_root. This system call moves the current root filesystem into a subdirectory and makes another directory the new root:



int child_main(void *arg) {
  /* Unmount procfs */
  umount2("/proc", MNT_DETACH);
  /* Pivot root */
  mount("./rootfs", "./rootfs", "bind", MS_BIND | MS_REC, "");
  mkdir("./rootfs/oldrootfs", 0755);
  syscall(SYS_pivot_root, "./rootfs", "./rootfs/oldrootfs");
  chdir("/");
  umount2("/oldrootfs", MNT_DETACH);
  rmdir("/oldrootfs");
  /* Re-mount procfs */
  mount("proc", "/proc", "proc", 0, NULL);
  /* Run the process */
  char **argv = (char **)arg;
  execvp(argv[0], argv);
  return 0;
}


It makes sense to mount tmpfs to /tmp, sysfs to /sys, and create a valid /dev filesystem, but I'll skip this step for brevity.
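If you do want to flesh that step out, the skipped mounts could look roughly like this. This is a minimal sketch of mine, not the author's code; it assumes it is called right after pivot_root and before execvp, and error handling is omitted:

#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Mount tmpfs and sysfs inside the new root and create a few of the
   standard character devices. The major/minor numbers (1,3 for
   /dev/null and so on) are the conventional Linux ones. */
static void setup_extra_mounts(void) {
  mount("tmpfs", "/tmp", "tmpfs", 0, NULL);
  mount("sysfs", "/sys", "sysfs", 0, NULL);
  mount("tmpfs", "/dev", "tmpfs", MS_NOSUID, "mode=755");
  mknod("/dev/null", S_IFCHR | 0666, makedev(1, 3));
  mknod("/dev/zero", S_IFCHR | 0666, makedev(1, 5));
  mknod("/dev/random", S_IFCHR | 0666, makedev(1, 8));
  mknod("/dev/urandom", S_IFCHR | 0666, makedev(1, 9));
}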



Now we only see the files from the busybox image, as if we were using chroot:



/ # ls
bin   dev   etc   home  proc  root  sys   tmp   usr   var

/ # mount
/dev/sda2 on / type ext4 (rw,relatime,data=ordered)
proc on /proc type proc (rw,relatime)

/ # ps
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    4 root      0:00 ps

/ # ps ax
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    5 root      0:00 ps ax


At this point, the container looks quite isolated, perhaps even too isolated: we can't ping anything, and the network doesn't seem to work at all.



Network



Creating a new network namespace was just the beginning! We need to assign network interfaces to it and configure them so that packets are forwarded correctly.



If you don't have a br0 interface, you need to create it manually (brctl is part of the bridge-utils package in Ubuntu):



brctl addbr br0
ip addr add dev br0 172.16.0.100/24
ip link set br0 up
sudo iptables -A FORWARD -i wlp3s0  -o br0 -j ACCEPT
sudo iptables -A FORWARD -o wlp3s0 -i br0 -j ACCEPT
sudo iptables -t nat -A POSTROUTING -s 172.16.0.0/16 -j MASQUERADE


In my case, wlp3s0 was the main WiFi network interface and 172.16.x.x was the network for the container. You may also need to enable IP forwarding (net.ipv4.ip_forward = 1) for the NAT rules to take effect.



Our container launcher needs to create a veth interface pair, veth0 and veth1, attach veth0 to br0, and set up routing inside the container.



In the main() function, we run these commands before cloning:



system("ip link add veth0 type veth peer name veth1");
system("ip link set veth0 up");
system("brctl addif br0 veth0");


After the call to clone() returns, we move veth1 into the new child namespace:



char ip_link_set[4096];
snprintf(ip_link_set, sizeof(ip_link_set) - 1, "ip link set veth1 netns %d",
         pid);
system(ip_link_set);


Now if we run ip link in the container shell, we will see the loopback interface and a veth1@xxxx interface. But the network still doesn't work. Let's set a unique hostname in the container and configure the routes:



int child_main(void *arg) {

  ....

  sethostname("example", 7);
  system("ip link set veth1 up");

  char ip_addr_add[4096];
  snprintf(ip_addr_add, sizeof(ip_addr_add),
           "ip addr add 172.16.0.101/24 dev veth1");
  system(ip_addr_add);
  system("route add default gw 172.16.0.100 veth1");

  char **argv = (char **)arg;
  execvp(argv[0], argv);
  return 0;
}


Let's see how it looks:



/ # ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
47: veth1@if48: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue qlen 1000
    link/ether 72:0a:f0:91:d5:11 brd ff:ff:ff:ff:ff:ff

/ # hostname
example

/ # ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1): 56 data bytes
64 bytes from 1.1.1.1: seq=0 ttl=57 time=27.161 ms
64 bytes from 1.1.1.1: seq=1 ttl=57 time=26.048 ms
64 bytes from 1.1.1.1: seq=2 ttl=57 time=26.980 ms
...


Works!



Conclusion



The complete source code is available here. If you find a bug or have a suggestion, please leave a comment!



Of course, Docker can do much more! But it's amazing how many suitable APIs the Linux kernel has and how easy it is to use them to achieve OS-level virtualization.



We hope you enjoyed the article. You can find the author's projects on GitHub and follow him on Twitter or subscribe via RSS to keep up with the news.


