Over the past few months, I've spent a lot of my personal time learning how Linux containers work. In particular, what exactly does it do
docker run
. In this article, I'm going to summarize what I've found out and try to show how the individual elements form a big picture. We'll start our journey by creating an alpine container using the docker run:
$ docker run -i -t --name alpine alpine ash
This container will be used in the output below. When the docker run command is called, it parses the parameters passed to it on the command line and creates a JSON object to represent the object that docker needs to create. This object is then sent to the docker daemon over the UNIX domain socket /var/run/docker.sock. To monitor API calls, we can use the strace utility :
$ strace -s 8192 -e trace=read,write -f docker run -d alpine
[pid 13446] write(3, "GET /_ping HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/1.13.1 (linux)\r\n\r\n", 79) = 79
[pid 13442] read(3, "HTTP/1.1 200 OK\r\nApi-Version: 1.26\r\nDocker-Experimental: false\r\nServer: Docker/1.13.1 (linux)\r\nDate: Mon, 19 Feb 2018 16:12:32 GMT\r\nContent-Length: 2\r\nContent-Type: text/plain; charset=utf-8\r\n\r\nOK", 4096) = 196
[pid 13442] write(3, "POST /v1.26/containers/create HTTP/1.1\r\nHost: docker\r\nUser-Agent: Docker-Client/1.13.1 (linux)\r\nContent-Length: 1404\r\nContent-Type: application/json\r\n\r\n{\"Hostname\":\"\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":[],\"Cmd\":null,\"Image\":\"alpine\",\"Volumes\":{},\"WorkingDir\":\"\",\"Entrypoint\":null,\"OnBuild\":null,\"Labels\":{},\"HostConfig\":{\"Binds\":null,\"ContainerIDFile\":\"\",\"LogConfig\":{\"Type\":\"\",\"Config\":{}},\"NetworkMode\":\"default\",\"PortBindings\":{},\"RestartPolicy\":{\"Name\":\"no\",\"MaximumRetryCount\":0},\"AutoRemove\":false,\"VolumeDriver\":\"\",\"VolumesFrom\":null,\"CapAdd\":null,\"CapDrop\":null,\"Dns\":[],\"DnsOptions\":[],\"DnsSearch\":[],\"ExtraHosts\":null,\"GroupAdd\":null,\"IpcMode\":\"\",\"Cgroup\":\"\",\"Links\":null,\"OomScoreAdj\":0,\"PidMode\":\"\",\"Privileged\":false,\"PublishAllPorts\":false,\"ReadonlyRootfs\":false,\"SecurityOpt\":null,\"UTSMode\":\"\",\"UsernsMode\":\"\",\"ShmSize\":0,\"ConsoleSize\":[0,0],\"Isolation\":\"\",\"CpuShares\":0,\"Memory\":0,\"NanoCpus\":0,\"CgroupParent\":\"\",\"BlkioWeight\":0,\"BlkioWeightDevice\":null,\"BlkioDeviceReadBps\":null,\"BlkioDeviceWriteBps\":null,\"BlkioDeviceReadIOps\":null,\"BlkioDeviceWriteIOps\":null,\"CpuPeriod\":0,\"CpuQuota\":0,\"CpuRealtimePeriod\":0,\"CpuRealtimeRuntime\":0,\"CpusetCpus\":\"\",\"CpusetMems\":\"\",\"Devices\":[],\"DiskQuota\":0,\"KernelMemory\":0,\"MemoryReservation\":0,\"MemorySwap\":0,\"MemorySwappiness\":-1,\"OomKillDisable\":false,\"PidsLimit\":0,\"Ulimits\":null,\"CpuCount\":0,\"CpuPercent\":0,\"IOMaximumIOps\":0,\"IOMaximumBandwidth\":0},\"NetworkingConfig\":{\"EndpointsConfig\":{}}}\n", 1556) = 1556
[pid 13442] read(3, "HTTP/1.1 201 Created\r\nApi-Version: 1.26\r\nContent-Type: application/json\r\nDocker-Experimental: false\r\nServer: Docker/1.13.1 (linux)\r\nDate: Mon, 19 Feb 2018 16:12:32 GMT\r\nContent-Length: 90\r\n\r\n{\"Id\":\"b70b57c5ae3e25585edba898ac860e388582391907be4070f91eb49f4db5c433\",\"Warnings\":null}\n", 4096) = 281
This is where the real fun begins. As soon as the docker daemon receives the request, it parses the output and communicates with containerd via the gRPC API to configure the runtime (or runtime) of the container using the parameters passed on the command line. To observe this interaction, we can use the ctr utility:
$ ctr --address "unix:///run/containerd.sock" events
TIME TYPE ID PID STATUS
time="2018-02-19T12:10:07.658081859-05:00" level=debug msg="Calling POST /v1.26/containers/create"
time="2018-02-19T12:10:07.676706130-05:00" level=debug msg="container mounted via layerStore: /var/lib/docker/overlay2/2beda8ac904f4a2531d72e1e3910babf145c6e68dfd02008c58786adb254f9dc/merged"
time="2018-02-19T12:10:07.682430843-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/attach?stderr=1&stdin=1&stdout=1&stream=1"
time="2018-02-19T12:10:07.683638676-05:00" level=debug msg="Calling GET /v1.26/events?filters=%7B%22container%22%3A%7B%22d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f%22%3Atrue%7D%2C%22type%22%3A%7B%22container%22%3Atrue%7D%7D"
time="2018-02-19T12:10:07.684447919-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/start"
time="2018-02-19T12:10:07.687230717-05:00" level=debug msg="container mounted via layerStore: /var/lib/docker/overlay2/2beda8ac904f4a2531d72e1e3910babf145c6e68dfd02008c58786adb254f9dc/merged"
time="2018-02-19T12:10:07.885362059-05:00" level=debug msg="sandbox set key processing took 11.824662ms for container d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f"
time="2018-02-19T12:10:07.927897701-05:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-container\", Id:\"d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f\", Status:0x0, Pid:\"\", Timestamp:(*timestamp.Timestamp)(0xc420bacdd0)}"
2018-02-19T17:10:07.927795344Z start-container d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f 0
time="2018-02-19T12:10:07.930283397-05:00" level=debug msg="libcontainerd: event unhandled: type:\"start-container\" id:\"d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f\" timestamp:<seconds:1519060207 nanos:927795344 > "
time="2018-02-19T12:10:07.930874606-05:00" level=debug msg="Calling POST /v1.26/containers/d1a6d87886e2d515bfff37d826eeb671502fa7c6f47e422ec3b3549ecacbc15f/resize?h=35&w=115"
Configuring the runtime of a container is a fairly significant task. Namespaces must be configured, the image must be mounted, security controls must be enabled (application protection profiles, seccomp profiles, capabilities), etc., etc., etc. You can get a pretty good idea everything you need to set up the runtime by looking at the output
docker inspect containerid
and the runtime spec file config.json
(more on that in a moment ).
Strictly speaking, containerd does not create a container runtime. It sets up the environment and then calls containerd-shimto run the container runtime through the configured OCI runtime (controlled by the containerd "βruntime" parameter). Most modern systems run the container runtime based on runc . We can observe this using the pstree utility :
$ pstree -l -p -s -T
systemd,1 --switched-root --system --deserialize 24
ββdocker-containe,19606 --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m --debug
β ββdocker-containe,19834 93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /var/run/docker/libcontainerd/93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /usr/libexec/docker/docker-runc-current
Since pstree strips off the process name, we can check the PID with ps :
$ ps auxwww | grep [1]9606
root 19606 0.0 0.2 685636 10632 ? Ssl 13:01 0:00 /usr/libexec/docker/docker-containerd-current --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m --debug
$ ps auxwww | grep [1]9834
root 19834 0.0 0.0 527748 3020 ? Sl 13:01 0:00 /usr/libexec/docker/docker-containerd-shim-current 93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /var/run/docker/libcontainerd/93a619715426f613646359863e77cc06fa85502273df931517ec3f4aaae50d5a /usr/libexec/docker/docker-runc-current
When I first started exploring the interactions between dockerd, containerd, and shim , I didn't fully understand what shim was for . Luckily Google led to excellent writing by Michael Crosby . Shim serves several purposes:
- Allows starting containers without daemons.
- STDIO FD containerd docker.
- containerd .
The first and second key points are very important. These features allow you to decouple the container from the docker daemon , allowing dockerd to be updated or restarted without affecting running containers. Very effective! I mentioned that shim is responsible for running the runc to actually start the container. Runc needs two things to do its job : the spec file and the path to the root filesystem image (the combination of which is called a bundle ). To see how this works, we can create rootfs by exporting the alpine docker image :
$ mkdir -p alpine/rootfs
$ cd alpine
$ docker export d1a6d87886e2 | tar -C rootfs -xvf -
time="2018-02-19T12:54:13.082321231-05:00" level=debug msg="Calling GET /v1.26/containers/d1a6d87886e2/export"
.dockerenv
bin/
bin/ash
bin/base64
bin/bbconfig
.....
The export option accepts a container, which you can find in the output
docker ps -a
. To create a spec file, you can use the runc spec command :
$ runc spec
This will create a spec file named
config.json
in your current directory. This file can be customized according to your needs and requirements. Once you are satisfied with the file, you can run runc with the rootfs directory as the only argument (the container configuration will be read from the file config.json
):
$ runc run rootfs
This simple example will create an alpine ash wrapper:
$ runc run rootfs
/ # cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.7.0
PRETTY_NAME="Alpine Linux v3.7"
HOME_URL="http://alpinelinux.org"
BUG_REPORT_URL="http://bugs.alpinelinux.org"
The ability to create containers and play with the runc runtime specification is incredibly powerful. You can evaluate different application profiles, test Linux capabilities, and experiment with every aspect of the container runtime without having to install docker. I've only scratched the surface a bit and would highly recommend reading the runc and containerd documentation . Very cool tools!
Learn more about the course.