How Ivan localized the error in the backend

In the comments to one of my articles about basic Linux shell commands for testers, it was rightly noted that it did not specify the use of commands during testing. I thought it was better late than never, so I decided to tell the story of the Backend QA engineer Vanya, who encountered unexpected service behavior and tried to figure out exactly where the error happened.







What Vanya tested



Vanya knew that he was going to test the "nginx + service" bundle.

Here I will immediately make a remark: such a bundle was chosen for this article simply because it can most clearly demonstrate the use of various utilities when debugging a problem and because it is very simple to configure and raise. In real terms, it can be either just a service or a chain of services that make requests to each other.


The default HTTP server Python SimpleHTTPServer acts as a service, which, in response to a request without parameters, displays the contents of the current directory:



[root@ivan test_dir_srv]# ls -l
total 0
-rw-r--r-- 1 root root 0 Aug 25 11:23 test_file
[root@ivan test_dir_srv]# python3 -m http.server --bind 127.0.0.1 8000
Serving HTTP on 127.0.0.1 port 8000 (http://127.0.0.1:8000/) ...


Nginx is configured as follows:



upstream test {
    server 127.0.0.1:8000;
}

server {
    listen       80;

    location / {
        proxy_pass http://test;
    }
}


Vanya needed to test a single test case: to check that the request for / works. He checked and everything worked:



MacBook-Pro-Ivan:~ ivantester$ curl http://12.34.56.78
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
<li><a href="test_file">test_file</a></li>
</ul>
<hr>
</body>
</html>


But then at one point on the test bench, the developers updated something, and Vanya received an error:



MacBook-Pro-Ivan:~ ivantester$ curl http://12.34.56.78
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.14.2</center>
</body>
</html>


He decided not to throw this incomprehensible error to the developers, but to get ssh access to the server and figure out what was going on there. He had little knowledge in the field of this kind of debugging of problems, but he really wanted to learn, so he armed himself with search engines, logic and went to localize the bug.



Vanya's first thought: logs



Indeed, if an error has occurred, then you just need to find it in the log file. But first you need to find the log file itself. Vanya went to Google and found out that often the logs are in the / var / log directory . Indeed, the nginx directory was found there:



[root@ivan ~]# ls /var/log/nginx/
access.log  access.log-20200831  error.log  error.log-20200831


Ivan looked at the last lines of the error log and understood what was wrong: the developers made a mistake in the nginx configuration, a typo crept into the upstream port.



[root@ivan ~]# tail /var/log/nginx/error.log
2020/08/31 04:36:21 [error] 15050#15050: *90452 connect() failed (111: Connection refused) while connecting to upstream, client: 31.170.95.221, server: , request: "GET / HTTP/1.0", upstream: "http://127.0.0.1:8009/", host: "12.34.56.78"


What conclusion can be drawn from this? Logs are the best friend of testers and developers when localizing bugs. If there is some unexpected behavior of the service, but there is nothing in the logs, then this is a reason to return the task to development with a request to add logs. After all, if nginx hadn't written to the log about an unsuccessful attempt to reach the upstream, then, you see, it would have been more difficult to look for a problem?


At that moment, Vanya thought: “What if the nginx logs were in a different directory? How would I find them? " In a couple of years, Vanya will have more experience with services in Linux, and he will know that the path to the log file is often passed to the service as a command line argument, or it is contained in a configuration file, the path to which is also often passed to the service as a command line argument. Well, ideally, the path to the log file should be written in the service documentation.



By the way, through the configuration file you can find the path to the log file in nginx:



[root@ivan ~]# ps ax | grep nginx | grep master
root     19899  0.0  0.0  57392  2872 ?        Ss    2019   0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
[root@ivan ~]# grep "log" /etc/nginx/nginx.conf
error_log  /var/log/nginx/error.log warn;
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
    access_log  /var/log/nginx/access.log  main;


What if there is nothing in the logs?



In his free time, Vanya decided to think about how he would have coped with the task if nginx had not written anything to the log. Vanya knew that the service was listening on port 8000, so he decided to look at the traffic on this port on the server. The tcpdump utility helped him with this . With the correct configuration, it saw a request and a response:



Dump traffic on port 8000
[root@ivan ~]# tcpdump -nn -i lo -A port 8000
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
09:10:42.114284 IP 127.0.0.1.33296 > 127.0.0.1.8000: Flags [S], seq 3390176024, win 43690, options [mss 65495,sackOK,TS val 830366494 ecr 0,nop,wscale 8], length 0
E..<..@.@..............@.............0.........
1~c.........
09:10:42.114293 IP 127.0.0.1.8000 > 127.0.0.1.33296: Flags [S.], seq 4147196208, ack 3390176025, win 43690, options [mss 65495,sackOK,TS val 830366494 ecr 830366494,nop,wscale 8], length 0
E..<..@.@.<..........@...110.........0.........
1~c.1~c.....
09:10:42.114302 IP 127.0.0.1.33296 > 127.0.0.1.8000: Flags [.], ack 1, win 171, options [nop,nop,TS val 830366494 ecr 830366494], length 0
E..4..@.@..............@.....111.....(.....
1~c.1~c.
09:10:42.114329 IP 127.0.0.1.33296 > 127.0.0.1.8000: Flags [P.], seq 1:88, ack 1, win 171, options [nop,nop,TS val 830366494 ecr 830366494], length 87
E.....@.@..b...........@.....111...........
1~c.1~c.GET / HTTP/1.0
Host: test
Connection: close
User-Agent: curl/7.64.1
Accept: */*

09:10:42.114333 IP 127.0.0.1.8000 > 127.0.0.1.33296: Flags [.], ack 88, win 171, options [nop,nop,TS val 830366494 ecr 830366494], length 0
E..4R/@.@............@...111...p.....(.....
1~c.1~c.
09:10:42.115062 IP 127.0.0.1.8000 > 127.0.0.1.33296: Flags [P.], seq 1:155, ack 88, win 171, options [nop,nop,TS val 830366494 ecr 830366494], length 154
E...R0@.@............@...111...p...........
1~c.1~c.HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/3.7.2
Date: Mon, 07 Sep 2020 13:10:42 GMT
Content-type: text/html; charset=utf-8
Content-Length: 340

09:10:42.115072 IP 127.0.0.1.33296 > 127.0.0.1.8000: Flags [.], ack 155, win 175, options [nop,nop,TS val 830366494 ecr 830366494], length 0
E..4.@.@..............@...p.11......(.....
1~c.1~c.
09:10:42.115094 IP 127.0.0.1.8000 > 127.0.0.1.33296: Flags [P.], seq 155:495, ack 88, win 171, options [nop,nop,TS val 830366494 ecr 830366494], length 340
E...R1@.@..<.........@...11....p.....|.....
1~c.1~c.<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
<li><a href="test_file">test_file</a></li>
</ul>
<hr>
</body>
</html>

09:10:42.115098 IP 127.0.0.1.33296 > 127.0.0.1.8000: Flags [.], ack 495, win 180, options [nop,nop,TS val 830366494 ecr 830366494], length 0
E..4.
@.@..............@...p.13......(.....
1~c.1~c.
09:10:42.115128 IP 127.0.0.1.8000 > 127.0.0.1.33296: Flags [F.], seq 495, ack 88, win 171, options [nop,nop,TS val 830366494 ecr 830366494], length 0
E..4R2@.@............@...13....p.....(.....
1~c.1~c.
09:10:42.115264 IP 127.0.0.1.33296 > 127.0.0.1.8000: Flags [F.], seq 88, ack 496, win 180, options [nop,nop,TS val 830366495 ecr 830366494], length 0
E..4..@.@..............@...p.13 .....(.....
1~c.1~c.
09:10:42.115271 IP 127.0.0.1.8000 > 127.0.0.1.33296: Flags [.], ack 89, win 171, options [nop,nop,TS val 830366495 ecr 830366495], length 0
E..4R3@.@............@...13 ...q.....(.....
1~c.1~c.
^C
12 packets captured
24 packets received by filter
0 packets dropped by kernel




With an incorrect configuration (with port 8009 in the nginx upstream), there was no traffic on port 8000. Vanya was delighted: now even if the developers forgot to write to the log in case of network errors, you can still at least find out if the traffic is going to the desired host or port.



What conclusion can be drawn from this story? Even if there are no logs, Linux has utilities that can help with localizing problems.


And if not a network?



Everything worked well, but one day Vanya received an error again, this time different:



MacBook-Pro-Ivan:~ ivantester$ curl http://12.34.56.78
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
        <title>Error response</title>
    </head>
    <body>
        <h1>Error response</h1>
        <p>Error code: 404</p>
        <p>Message: File not found.</p>
        <p>Error code explanation: HTTPStatus.NOT_FOUND - Nothing matches the given URI.</p>
    </body>
</html>


Vanya went to the server again, but this time the problem was not related to the network. The service log also said File not found , and Vanya decided to figure out why such an error suddenly appeared. He knows that there is a process python3 -m http.server , but he does not know the contents of which directory this service displays (or, in other words, which is the current working directory of this process). He finds out with the lsof command :



[root@ivan ~]# ps aux | grep python | grep "http.server"
root     20638  0.0  0.3 270144 13552 pts/2    S+   08:29   0:00 python3 -m http.server
[root@ivan ~]# lsof -p 20638 | grep cwd
python3 20638 root  cwd    DIR     253,1      4096 1843551 /root/test_dir_srv2


It can also be done using the pwdx command or using the proc directory :



[root@ivan ~]# pwdx 20638
20638: /root/test_dir_srv2
[root@ivan ~]# ls -l /proc/20638/cwd
lrwxrwxrwx 1 root root 0 Aug 31 08:37 /proc/20638/cwd -> /root/test_dir_srv2


Such a directory really exists on the server, and it contains a file named test_file . What's the matter? Ivan googled and found the strace utility , with which you can see what system calls a process performs (by the way, there is a good article on strace on Habré, and not even one ). You can either start a new process through strace , or connect with this utility to an already running process. The second option suited Vanya:



Strace output
[root@ivan ~]# strace -ff -p 20638
strace: Process 20638 attached
restart_syscall(<... resuming interrupted poll ...>) = 0
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 1 ([{fd=4, revents=POLLIN}])
accept4(4, {sa_family=AF_INET, sin_port=htons(57530), sin_addr=inet_addr("127.0.0.1")}, [16], SOCK_CLOEXEC) = 5
clone(child_stack=0x7f2beeb28fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f2beeb299d0, tls=0x7f2beeb29700, child_tidptr=0x7f2beeb299d0) = 21062
futex(0x11204d0, FUTEX_WAIT_PRIVATE, 0, NULLstrace: Process 21062 attached
 <unfinished ...>
[pid 21062] set_robust_list(0x7f2beeb299e0, 24) = 0
[pid 21062] futex(0x11204d0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 20638] <... futex resumed> )       = 0
[pid 20638] futex(0x921c9c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 27, {1598879772, 978949000}, ffffffff <unfinished ...>
[pid 21062] futex(0x921c9c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x921c98, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
[pid 20638] <... futex resumed> )       = 0
[pid 20638] futex(0x921cc8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 21062] futex(0x921cc8, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 20638] <... futex resumed> )       = 0
[pid 20638] futex(0x921cc8, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 20638] poll([{fd=4, events=POLLIN}], 1, 500 <unfinished ...>
[pid 21062] recvfrom(5, "GET / HTTP/1.1\r\nConnection: upgr"..., 8192, 0, NULL, NULL) = 153
[pid 21062] stat("/root/test_dir_srv/", 0x7f2beeb27350) = -1 ENOENT (No such file or directory)
[pid 21062] open("/root/test_dir_srv/", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 21062] write(2, "127.0.0.1 - - [31/Aug/2020 09:16"..., 70) = 70
[pid 21062] write(2, "127.0.0.1 - - [31/Aug/2020 09:16"..., 60) = 60
[pid 21062] sendto(5, "HTTP/1.0 404 File not found\r\nSer"..., 184, 0, NULL, 0) = 184
[pid 21062] sendto(5, "<!DOCTYPE HTML PUBLIC \"-//W3C//D"..., 469, 0, NULL, 0) = 469
[pid 21062] shutdown(5, SHUT_WR)        = 0
[pid 21062] close(5)                    = 0
[pid 21062] madvise(0x7f2bee329000, 8368128, MADV_DONTNEED) = 0
[pid 21062] exit(0)                     = ?
[pid 21062] +++ exited with 0 +++
<... poll resumed> )                    = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500)   = 0 (Timeout)
poll([{fd=4, events=POLLIN}], 1, 500^Cstrace: Process 20638 detached
 <detached ...>




Usually strace output is quite voluminous (and maybe very large), so it is more convenient to immediately redirect it to a file and then look for the necessary system calls in it. In this case, you can immediately find that the service is trying to open the directory / root / test_dir_srv / - someone renamed it and did not restart the service after that, so it returns 404.



If you immediately understand which system calls you need to look at, you can use the option -e :



[root@ivan ~]# strace -ff -e trace=open,stat -p 20638
strace: Process 20638 attached
strace: Process 21396 attached
[pid 21396] stat("/root/test_dir_srv/", 0x7f2beeb27350) = -1 ENOENT (No such file or directory)
[pid 21396] open("/root/test_dir_srv/", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 21396] +++ exited with 0 +++
^Cstrace: Process 20638 detached


: « » , strace. , , (, / ), . ltrace.


- ?



Vanya did not stop there and found out that there is a GNU Project Debugger - GDB . With its help, you can " get into" the process and even slightly modify it. And Vanya decided to try to find the last error using GDB . He suggested that since the service displays the contents of the directory, then you can try to put a breakpoint on the open () function and see what happens:

Gdb utility output
[root@ivan ~]# gdb -p 23998
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 23998
… <        debugging symbols...>
...
0x00007f2284c0b20d in poll () at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-34.el7.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) set follow-fork-mode child
(gdb) b open
Breakpoint 1 at 0x7f2284c06d20: open. (2 locations)
(gdb) c
Continuing.
[New Thread 0x7f227a165700 (LWP 24030)]
[Switching to Thread 0x7f227a165700 (LWP 24030)]

Breakpoint 1, open64 () at ../sysdeps/unix/syscall-template.S:81
81	T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) n
83	T_PSEUDO_END (SYSCALL_SYMBOL)
(gdb) n
_io_FileIO___init___impl (opener=<optimized out>, closefd=<optimized out>, mode=<optimized out>, nameobj=0x7f227a68f6f0, self=0x7f227a68f6c0) at ./Modules/_io/fileio.c:381
381	                Py_END_ALLOW_THREADS
(gdb) n
379	                self->fd = open(name, flags, 0666);
(gdb) n
381	                Py_END_ALLOW_THREADS
(gdb) print name
$1 = 0x7f227a687c90 "/root/test_dir_srv/"
(gdb) q
A debugging session is active.

	Inferior 1 [process 23998] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/local/bin/python3.7, process 23998
[Inferior 1 (process 23998) detached]




After the c ( continue ) command, Vanya launched curl in another console , hit a breakpoint in the debugger and began to execute this program (that is, the service) step by step. As soon as he found open in some path name , he printed the value of this variable and saw " / root / test_dir_srv / ".

GDB is a powerful tool, and here's the simplest use case. Sometimes it can help in reproducing any complex cases (for example, you can pause the process at the right time and reproduce the race condition), it also helps with reading core dump files.


What if Docker?



At one point, DevOps decided that the service would now be deployed with a Docker container, and it was necessary to retest all the cases that Vanya found. Vanya easily googled the following:



  1. You can use tcpdump , strace and gdb inside a container, but you need to keep in mind Linux capabilities (there is an article that explains why strace did not work in a container without --cap-add = SYS_PTRACE ).
  2. The --pid option can be used .


But he wondered if it was possible to see all the traffic going to the container (or from the container) directly from the host. In tcpdump have the possibility to display any traffic interface (option -i ), each container corresponds to one virtual interface Veth (this can be seen, for example, through the ifconfig or the ip a ), but how do you know what the container which interface corresponds to? If the container does not use host networking , then inside it there will be an eth0 network interface through which it can communicate over the network with other containers and the host. All that remains is to find the ifindex of which interface on the host matches the iflink interfaceeth0 of the container (you can read what this means here ).



[root@ivan ~]# for f in `ls /sys/class/net/veth*/ifindex`; do echo $f; cat $f; done | grep -B 1 `docker exec test_service_container cat /sys/class/net/eth0/iflink` | head -1
/sys/class/net/veth6c18dba/ifindex


Now you can run tcpdump on the veth6c18dba interface :



tcpdump -i veth6c18dba


But there is an easier way: you can find the IP address of the container on its network and listen to traffic on it:



[root@ivan ~]# docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test_service_container
172.17.0.10
[root@ivan ~]# tcpdump -i any host 172.17.0.10


Conclusion: debugging in a Docker container is not a big deal. The utilities work in it, and you can use docker logs to read the logs .


conclusions



As a responsible engineer, Vanya decided to briefly outline new information for himself in the internal knowledge base. Here's what he wrote:



  • Logs are man's best friend. If unexpected behavior of the service is encountered and at the same time it does not write anything to the log, this is a reason to ask the developers to add logs.
  • , , . , Linux , .
  • tcpdump. , .
  • «» strace, ltrace gdb.
  • , Docker-.
  • /proc/PID. , /proc/PID/fd .
  • Utilities ps , ls , stat , lsof , ss , du , top , free , ip , ldconfig , ldd and others can also help to get various information about the system, files or processes .


I hope this story was useful to you, and at least once it will help you understand what is the matter when you debug something in Linux. If Vanya missed something, share it in the comments!



All Articles