Why strace doesn't work in Docker

While editing the capabilities page for How Containers Work, I needed to explain why strace doesn't work in Docker. Here's what happened when I ran strace in a Docker container on my laptop:



$ docker run  -it ubuntu:18.04 /bin/bash
$ # ... install strace ...
root@e27f594da870:/# strace ls
strace: ptrace(PTRACE_TRACEME, ...): Operation not permitted


strace works through the ptrace system call, so if ptrace isn't permitted, strace won't work! This is easy to fix, though, and on my laptop I did it like this:



docker run --cap-add=SYS_PTRACE  -it ubuntu:18.04 /bin/bash


But I wasn't interested in just making the problem go away. I wanted to figure out why it happens in the first place. So why does strace fail, and why does --cap-add=SYS_PTRACE fix it?



Hypothesis 1: Container processes are missing the CAP_SYS_PTRACE capability



Since --cap-add=SYS_PTRACE reliably fixes the problem, I had always assumed that processes in Docker containers simply don't have the CAP_SYS_PTRACE capability by default. But for two reasons, that doesn't quite add up.



Reason 1: As an experiment, I ran strace as a regular user and could trace any of my own processes without trouble, yet CAP_SYS_PTRACE was nowhere to be found among my current process's capabilities:



$ getpcaps $$
Capabilities for `11589': =


Reason 2: man capabilities describes CAP_SYS_PTRACE like this:



CAP_SYS_PTRACE
       * Trace arbitrary processes using ptrace(2);


The whole point of CAP_SYS_PTRACE is to let you take control of an arbitrary process belonging to any user, the way root can. Running ptrace on an ordinary process owned by your own user should not require this capability.
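The getpcaps check above can also be done by reading the effective capability mask straight from /proc. Below is a minimal sketch, assuming the CapEff line of /proc/<pid>/status holds a hexadecimal bitmask and that CAP_SYS_PTRACE is bit 19 (its number in linux/capability.h). The helper names hasCap and parseCapEff are made up for this illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysPtrace is the bit index of CAP_SYS_PTRACE (19 in linux/capability.h).
const capSysPtrace = 19

// hasCap reports whether bit capBit is set in a capability mask.
func hasCap(mask uint64, capBit uint) bool {
	return mask&(1<<capBit) != 0
}

// parseCapEff extracts the CapEff mask from the text of /proc/<pid>/status.
// (Hypothetical helper, written for this example.)
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "CapEff:" {
			return strconv.ParseUint(fields[1], 16, 64)
		}
	}
	return 0, fmt.Errorf("no CapEff line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("CapEff=%016x CAP_SYS_PTRACE=%v\n", mask, hasCap(mask, capSysPtrace))
}
```

An empty capability set, like the `=` that getpcaps printed, corresponds to a CapEff mask of zero, so hasCap returns false for every capability.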



I also ran one more check: I started a Docker container with docker run --cap-add=SYS_PTRACE -it ubuntu:18.04 /bin/bash, then dropped the CAP_SYS_PTRACE capability, and strace kept working correctly even without it. Why?!



Hypothesis 2: Is it something about user namespaces?



My next (and much less well-founded) hypothesis was something like "hmm, maybe the process is in a different user namespace and strace doesn't work ... just because?" That isn't a very coherent line of reasoning, but I tried looking at the problem from this angle anyway.



So, is the process in a different user namespace? This is how it looks in the container:



root@e27f594da870:/# ls /proc/$$/ns/user -l
... /proc/1/ns/user -> 'user:[4026531837]'


And this is how it looks on the host:



bork@kiwi:~$ ls /proc/$$/ns/user -l
... /proc/12177/ns/user -> 'user:[4026531837]'


Root in the container is the same user as root on the host, because both processes show the same user namespace ID (4026531837), so nothing on that front should stop strace from working. As you can see, this hypothesis wasn't a very good one, but at the time I hadn't yet realized that users in the container and on the host are the same, and the approach seemed interesting to me.
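The comparison above boils down to checking whether two /proc/<pid>/ns/user links point at the same namespace ID. Here is a rough sketch of that check, assuming the link target always has the form user:[N] as in the listings above. The function names parseNSInode and sameUserNS are invented for the example.

```go
package main

import (
	"fmt"
	"os"
)

// parseNSInode extracts the number from a namespace link target
// such as "user:[4026531837]".
func parseNSInode(link string) (uint64, error) {
	var inode uint64
	if _, err := fmt.Sscanf(link, "user:[%d]", &inode); err != nil {
		return 0, fmt.Errorf("unexpected namespace link %q: %w", link, err)
	}
	return inode, nil
}

// sameUserNS reports whether two link targets name the same user namespace.
func sameUserNS(a, b string) (bool, error) {
	ia, err := parseNSInode(a)
	if err != nil {
		return false, err
	}
	ib, err := parseNSInode(b)
	if err != nil {
		return false, err
	}
	return ia == ib, nil
}

func main() {
	// Equivalent to `ls -l /proc/$$/ns/user`, but for this process.
	link, err := os.Readlink("/proc/self/ns/user")
	if err != nil {
		fmt.Println("cannot read namespace link:", err)
		return
	}
	inode, err := parseNSInode(link)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("this process is in user namespace %d\n", inode)
}
```

Running it once on the host and once inside the container, then comparing the two numbers, gives the same answer as the two ls commands.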



Hypothesis 3: The ptrace system call is blocked by a seccomp-bpf rule



I already knew that Docker uses a seccomp-bpf filter to stop container processes from running a large number of system calls, and it turns out that ptrace is on the list of calls blocked by default! (Strictly speaking, the list is an allowlist and ptrace simply isn't on it, but the result is the same.)



Now it's clear why strace doesn't work in a Docker container: if ptrace is blocked entirely, then obviously nothing that calls it can work.



Let's test this hypothesis and see whether we can use strace in a Docker container with all seccomp rules disabled:



$ docker run --security-opt seccomp=unconfined -it ubuntu:18.04  /bin/bash
$ strace ls
execve("/bin/ls", ["ls"], 0x7ffc69a65580 /* 8 vars */) = 0
... it works fine ...
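Another way to see the difference between the two containers is the Seccomp field in /proc/<pid>/status, which the kernel sets to 0 (disabled), 1 (strict) or 2 (filter). The sketch below reads that field. I would expect 2 in a default container and 0 with seccomp=unconfined, but that expectation is an assumption and was not part of the experiment above. The helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// seccompMode returns the value of the "Seccomp:" field found in the
// text of /proc/<pid>/status, and whether the field was present.
func seccompMode(status string) (string, bool) {
	for _, line := range strings.Split(status, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "Seccomp:" {
			return fields[1], true
		}
	}
	return "", false
}

// describeMode maps the kernel's numeric seccomp mode to a readable name.
func describeMode(mode string) string {
	switch mode {
	case "0":
		return "disabled"
	case "1":
		return "strict"
	case "2":
		return "filter"
	default:
		return "unknown"
	}
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	mode, ok := seccompMode(string(data))
	if !ok {
		fmt.Println("no Seccomp field (kernel may lack seccomp support)")
		return
	}
	fmt.Printf("seccomp mode: %s (%s)\n", mode, describeMode(mode))
}
```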


Great! Everything works, and the mystery is solved! Except ...



Why does --cap-add=SYS_PTRACE solve the problem?



We still haven't explained why --cap-add=SYS_PTRACE makes the problem go away. The man page for docker run describes the --cap-add argument like this:



--cap-add=[]
   Add Linux capabilities


None of this has anything to do with seccomp rules! So what's going on?



Let's take a look at the Docker source code.



When the documentation doesn't help, all that's left is to dive into the source.

Go has one nice feature here: because dependencies are vendored into the repository, you can grep through the whole thing and find the code you're interested in. So I cloned github.com/moby/moby and searched it with something like rg CAP_SYS_PTRACE.



Here's what I think is going on: in containerd's seccomp implementation, in contrib/seccomp/seccomp_default.go, there is a lot of code that makes sure that if a process has a capability, a seccomp rule also lets it use the system calls that go with that capability.



		case "CAP_SYS_PTRACE":
			s.Syscalls = append(s.Syscalls, specs.LinuxSyscall{
				Names: []string{
					"kcmp",
					"process_vm_readv",
					"process_vm_writev",
					"ptrace",
				},
				Action: specs.ActAllow,
				Args:   []specs.LinuxSeccompArg{},
			})




There is also code in moby, in profiles/seccomp/seccomp.go and in the default seccomp profile, that does something similar, so we've probably found our answer!



Docker's --cap-add does more than it says



In the end, it seems that --cap-add doesn't do exactly what the man page says, and should really be called something like --cap-add-and-also-whitelist-some-extra-system-calls-if-required. That makes sense: if you have a capability like CAP_SYS_PTRACE, which is supposed to let you use the process_vm_readv system call, but the seccomp profile blocks that call, the capability isn't much help to you. So allowing the process_vm_readv and ptrace system calls whenever CAP_SYS_PTRACE is granted looks reasonable.
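The interaction described above can be modelled in a few lines. This is only a toy model, not Docker's or containerd's actual code: the only real data in it is the CAP_SYS_PTRACE syscall list from the excerpt, while the base allowlist and the function name buildAllowlist are hypothetical.

```go
package main

import "fmt"

// capSyscalls maps a capability to the extra system calls it unlocks.
// Only the CAP_SYS_PTRACE entry is taken from the excerpt above.
var capSyscalls = map[string][]string{
	"CAP_SYS_PTRACE": {"kcmp", "process_vm_readv", "process_vm_writev", "ptrace"},
}

// buildAllowlist returns the set of permitted system calls: the base
// allowlist plus whatever each granted capability unlocks. Anything not
// in the returned set is treated as blocked.
func buildAllowlist(base, caps []string) map[string]bool {
	allowed := make(map[string]bool)
	for _, name := range base {
		allowed[name] = true
	}
	for _, c := range caps {
		for _, name := range capSyscalls[c] {
			allowed[name] = true
		}
	}
	return allowed
}

func main() {
	// Hypothetical, heavily shortened base allowlist.
	base := []string{"read", "write", "execve"}

	for _, caps := range [][]string{nil, {"CAP_SYS_PTRACE"}} {
		allowed := buildAllowlist(base, caps)
		fmt.Printf("caps=%v ptrace allowed=%v\n", caps, allowed["ptrace"])
	}
}
```

In this model, granting the capability is the only thing that changes, yet ptrace moves from blocked to allowed, which matches what --cap-add=SYS_PTRACE appears to do.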



It turns out strace works in newer versions of Docker



Thanks to this commit, Docker 19.03 finally allows the ptrace system call on kernel versions 4.8 and higher. My laptop, however, is still on Docker 18.09.7, which obviously doesn't include that commit.
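To check whether a given machine meets the kernel side of that condition, a small version comparison is enough. The following is a sketch under assumptions: it reads the release string from /proc/sys/kernel/osrelease and only looks at the major and minor numbers. The 4.8 threshold comes from the text above, and the function names are invented. It says nothing about which Docker version is installed.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseKernel extracts the major and minor numbers from a kernel release
// string such as "4.15.0-58-generic". Anything after the minor is ignored.
func parseKernel(release string) (major, minor int, err error) {
	_, err = fmt.Sscanf(release, "%d.%d", &major, &minor)
	return major, minor, err
}

// kernelAtLeast reports whether release is at least wantMajor.wantMinor.
func kernelAtLeast(release string, wantMajor, wantMinor int) (bool, error) {
	major, minor, err := parseKernel(release)
	if err != nil {
		return false, err
	}
	return major > wantMajor || (major == wantMajor && minor >= wantMinor), nil
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	release := strings.TrimSpace(string(data))
	ok, err := kernelAtLeast(release, 4, 8)
	if err != nil {
		fmt.Println("cannot parse kernel release:", err)
		return
	}
	fmt.Printf("kernel %s is 4.8 or newer: %v\n", release, ok)
}
```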



That's all!



This turned out to be a fun problem to dig into, and I think it's a good example of how the moving parts inside containers interact in non-trivial ways.



If you liked this post, you might like my magazine "How Containers Work". Its 24 pages explain the Linux kernel features that make containers work, including capabilities and seccomp-bpf.


