The syscall is the mechanism by which user programs interact with the Linux kernel, and strace is a powerful tool for keeping track of them. To better understand how the operating system works, it is helpful to understand how they work.
The operating system can be divided into two modes of operation:
- Kernel mode is the privileged mode used by the operating system kernel.
- User mode is the mode in which most user applications run.
Users typically use command line utilities and a graphical interface (GUI) for their daily work. At the same time, system calls invisibly work in the background, referring to the kernel to do the work.
System calls are very similar to function calls in the sense that arguments are passed in and they return values. The only difference is that system calls work at the kernel level, but functions do not. Switching from user mode to kernel mode is done using a special interrupt mechanism .
Most of these details are hidden from the user in the system libraries (glibc on Linux systems). System calls are universal in nature, but despite this, the mechanics of their execution are largely hardware-dependent.
This article explores several practical examples of parsing system calls using
strace
. The examples use Red Hat Enterprise Linux, but all commands should work on other Linux distributions:
[root@sandbox ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[root@sandbox ~]#
[root@sandbox ~]# uname -r
3.10.0-1062.el7.x86_64
[root@sandbox ~]#
First, make sure you have the necessary tools installed on your system. You
strace
can check if it is installed using the command below. To view the version, strace
run it with the -V parameter:
[root@sandbox ~]# rpm -qa | grep -i strace
strace-4.12-9.el7.x86_64
[root@sandbox ~]#
[root@sandbox ~]# strace -V
strace -- version 4.12
[root@sandbox ~]#
If
strace
not installed, install by running:
yum install strace
For example, create a test directory in
/tmp
and two files using the command touch
:
[root@sandbox ~]# cd /tmp/
[root@sandbox tmp]#
[root@sandbox tmp]# mkdir testdir
[root@sandbox tmp]#
[root@sandbox tmp]# touch testdir/file1
[root@sandbox tmp]# touch testdir/file2
[root@sandbox tmp]#
(I
/tmp
only use a directory because everyone has access to it, but you can use any other directory .)
Use the command to
ls
check that testdir
files have been created in the directory :
[root@sandbox tmp]# ls testdir/
file1 file2
[root@sandbox tmp]#
You probably use the command
ls
every day without realizing that system calls are running under the hood. This is where abstraction comes into play. This is how this command works:
-> (glibc) ->
The command
ls
calls functions from the Linux system libraries (glibc). These libraries, in turn, call system calls, which do most of the work.
If you want to know what functions were called from the glibc library, then use the command
ltrace
followed by the command ls testdir/
:
ltrace ls testdir/
If
ltrace
not installed, then install:
yum install ltrace
There will be a lot of information on the screen, but don't worry - we'll cover that later. Here are some of the important library functions from the output
ltrace
:
opendir("testdir/") = { 3 }
readdir({ 3 }) = { 101879119, "." }
readdir({ 3 }) = { 134, ".." }
readdir({ 3 }) = { 101879120, "file1" }
strlen("file1") = 5
memcpy(0x1665be0, "file1\0", 6) = 0x1665be0
readdir({ 3 }) = { 101879122, "file2" }
strlen("file2") = 5
memcpy(0x166dcb0, "file2\0", 6) = 0x166dcb0
readdir({ 3 }) = nil
closedir({ 3 })
By examining this output, you can probably understand what's going on. The named directory is
testdir
opened using a library function opendir
, followed by calls to functions readdir
that read the contents of the directory. Finally, a function is called closedir
that closes the previously opened directory. For now, ignore other functions such as strlen
and memcpy
.
As you can see, you can easily see the library functions being called, but in this article we will focus on the system calls that are called by the system library functions.
To view system calls use
strace
with the command ls testdir
as shown below. And again you get a bunch of incoherent information:
[root@sandbox tmp]# strace ls testdir/
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL) = 0x1f12000
<<< truncated strace output >>>
write(1, "file1 file2\n", 13file1 file2
) = 13
close(1) = 0
munmap(0x7fd002c8d000, 4096) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
[root@sandbox tmp]#
As a result of execution,
strace
you will receive a list of system calls executed during the execution of the command ls
. All system calls can be divided into the following categories:
- Process management
- File management
- Directory and file system management
- Other
There is a convenient way to analyze the received information - write the output to a file using the option
-o
.
[root@sandbox tmp]# strace -o trace.log ls testdir/
file1 file2
[root@sandbox tmp]#
This time, there will be no data on the screen - the command will
ls
work as expected, displaying a list of files and writing all output strace
to a file trace.log
. For a simple command, the ls
file contains almost 100 lines:
[root@sandbox tmp]# ls -l trace.log
-rw-r--r--. 1 root root 7809 Oct 12 13:52 trace.log
[root@sandbox tmp]#
[root@sandbox tmp]# wc -l trace.log
114 trace.log
[root@sandbox tmp]#
Take a look at the first line in the file
trace.log
:
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
- At the beginning of the line is the name of the system call being executed - execve.
- The text in parentheses is the arguments passed to the system call.
- The number after the = sign (in this case, 0) is the value returned by the system call.
Now the result doesn't seem too scary, does it? And you can apply the same logic for other lines as well.
Pay attention to the only command that you called -
ls testdir
. You know the name of the directory used by the command ls
, so why not use grep
for testdir
in the file trace.log
and see what it finds? Look closely at the result:
[root@sandbox tmp]# grep testdir trace.log
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[root@sandbox tmp]#
Going back to the above analysis
execve
, can you tell what the next system call is doing?
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
You don't have to remember all the system calls and what they do: everything is in the documentation. Man pages are in a hurry to help! Make sure the package is installed before running man command
man-pages
:
[root@sandbox tmp]# rpm -qa | grep -i man-pages
man-pages-3.53-5.el7.noarch
[root@sandbox tmp]#
Remember that you need to add "2" between the command
man
and the syscall name. If you read man
about man
( man man
), you will see that section 2 is reserved for system calls. Likewise, if you need information about library functions, you need to add 3 between man
and the name of the library function.
Below are the section numbers
man
:
1. .
2. (, ).
3. ( ).
4. ( /dev).
To view the documentation for a system call, run man with the name of that system call.
man 2 execve
According to the documentation, a system call
execve
executes a program that is passed to it in parameters (in this case it is ls
). Additional parameters for ls are also passed to it. In this example, it is testdir
. Hence, this system call simply runs ls
with testdir
as a parameter:
'execve - execute program'
'DESCRIPTION
execve() executes the program pointed to by filename'
The next system call
stat
is passed a parameter testdir
:
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
To view the documentation use
man 2 stat
. The stat system call returns information about the specified file. Remember that everything in Linux is a file, including directories.
Next, the system call
openat
opens testdir
. Note that the return value is 3. This is the file descriptor that will be used in subsequent system calls:
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
Now open the file
trace.log
and notice the line following the system call openat
. You will see a system call getdents
that does most of the work required to execute the command ls testdir
. Now let's execute grep getdents
for the file trace.log
:
[root@sandbox tmp]# grep getdents trace.log
getdents(3, /* 4 entries */, 32768) = 112
getdents(3, /* 0 entries */, 32768) = 0
[root@sandbox tmp]#
The documentation (
man getdents
) says that it getdents
reads directory entries, which is what we actually need. Note that the argument for getdent
3 is the file descriptor obtained earlier from the system call openat
.
Now that the contents of the directory have been received, we need a way to display the information in the terminal. So, we do
grep
for another system call write
, which is used to output to the terminal:
[root@sandbox tmp]# grep write trace.log
write(1, "file1 file2\n", 13) = 13
[root@sandbox tmp]#
In the arguments, you can see the names of the files to be output:
file1
and file2
. For the first argument (1), remember that on Linux, three file descriptors are opened by default for any process:
- 0 - standard input stream
- 1 - standard output stream
- 2 - standard error stream
Thus, the system call
write
takes file1
and file2
the standard output, which is a terminal, denotes the number 1.
Now that you know what the system calls done most of the work for the team
ls testdir/
. But what about the other 100+ system calls in the file trace.log
?
The operating system does a lot of supporting things to start the process, so a lot of what you see in the file
trace.log
is initializing and cleaning up the process. Take a full look at the trace.log file and try to understand what happens when the command is run ls
.
Now you can analyze system calls for any program. The strace utility also provides many useful command line options, some of which are described below.
By default
strace
, it does not display all information about system calls. However, it has an option -v verbose
that will show additional information about each system call:
strace -v ls testdir
It is good practice to use a parameter
-f
to keep track of child processes created by a running process:
strace -f ls testdir
But what if you only want the names of the system calls, the number of times they were run, and the percentage of time spent executing? You can use the option
-c
to get these statistics:
strace -c ls testdir/
If you want to trace a certain system call, for example,
open
and ignore others, then you can use the option -e
with the name of the system call:
[root@sandbox tmp]# strace -e open ls testdir
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
file1 file2
+++ exited with 0 +++
[root@sandbox tmp]#
What if you need to filter by multiple system calls? Don't worry, you can use the same option
-e
and separate the required system calls with a comma. For example, for write
and getdent
:
[root@sandbox tmp]# strace -e write,getdents ls testdir
getdents(3, /* 4 entries */, 32768) = 112
getdents(3, /* 0 entries */, 32768) = 0
write(1, "file1 file2\n", 13file1 file2
) = 13
+++ exited with 0 +++
[root@sandbox tmp]#
So far, we've only tracked explicit command runs. But what about the commands that were run earlier? What if you want to track demons? For this, you
strace
have a special option -p
to which you can pass the process ID.
We will not start the daemon, but use a command
cat
that displays the contents of the file passed to it as an argument. But if you do not specify an argument, then the command cat
will simply wait for input from the user. After entering text, it will display the entered text on the screen. And so on until the user clicks Ctrl+C
to exit.
Run the command
cat
on one terminal.
[root@sandbox tmp]# cat
On another terminal, find the process ID (PID) with the command
ps
:
[root@sandbox ~]# ps -ef | grep cat
root 22443 20164 0 14:19 pts/0 00:00:00 cat
root 22482 20300 0 14:20 pts/1 00:00:00 grep --color=auto cat
[root@sandbox ~]#
Now start
strace
with the option -p
and PID you found with ps
. After starting, it strace
will display information about the process to which it connected, as well as its PID. Now strace
monitors the system calls made by the command cat
. The first system call you'll see is read, waiting for input from thread 0, that is, from standard input, which is now the terminal the command is running on cat
:
[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0,
Now go back to the terminal where you left the command running
cat
and enter some text. For demonstration, I entered x0x0
. Please note that I cat
simply repeated what I entered and x0x0
the screen will appear twice.
[root@sandbox tmp]# cat
x0x0
x0x0
Go back to the terminal where you
strace
connected to the process cat
. Now you see two new system calls: the previous one read
, which has now read x0x0
, and one more for writing write
, which writes x0x0
back to the terminal, and again a new one read
, which is waiting for a read from the terminal. Note that standard input (0) and standard output (1) are on the same terminal:
[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0, "x0x0\n", 65536) = 5
write(1, "x0x0\n", 5) = 5
read(0,
Imagine the benefits of launching
strace
for daemons: you can see everything that is happening in the background. Complete the commandcat
by clicking Ctrl+C
... This will also terminate the sessionstrace
since the monitored process has been terminated.
To view the timestamps of system calls use the option
-t
:
[root@sandbox ~]#strace -t ls testdir/
14:24:47 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
14:24:47 brk(NULL) = 0x1f07000
14:24:47 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2530bc8000
14:24:47 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:24:47 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
What if you want to know the time spent between system calls? There is a handy option
-r
that shows the time taken to execute each system call. Pretty helpful, isn't it?
[root@sandbox ~]#strace -r ls testdir/
0.000000 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
0.000368 brk(NULL) = 0x1966000
0.000073 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb6b1155000
0.000047 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000119 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
Conclusion
The utility is
strace
very handy for learning system calls in Linux. For other command line options, see man and online documentation.