Understanding system calls in Linux with strace

The translation of the article has been prepared especially for students of basic and advanced courses Administrator Linux.






The syscall is the mechanism by which user programs interact with the Linux kernel, and strace is a powerful tool for keeping track of them. To better understand how the operating system works, it is helpful to understand how they work.



The operating system can be divided into two modes of operation:



  • Kernel mode is the privileged mode used by the operating system kernel.
  • User mode is the mode in which most user applications run.




Users typically use command line utilities and a graphical interface (GUI) for their daily work. At the same time, system calls invisibly work in the background, referring to the kernel to do the work.



System calls are very similar to function calls in the sense that arguments are passed in and they return values. The only difference is that system calls work at the kernel level, but functions do not. Switching from user mode to kernel mode is done using a special interrupt mechanism .



Most of these details are hidden from the user in the system libraries (glibc on Linux systems). System calls are universal in nature, but despite this, the mechanics of their execution are largely hardware-dependent.



This article explores several practical examples of parsing system calls using strace. The examples use Red Hat Enterprise Linux, but all commands should work on other Linux distributions:



[root@sandbox ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[root@sandbox ~]#
[root@sandbox ~]# uname -r
3.10.0-1062.el7.x86_64
[root@sandbox ~]#




First, make sure you have the necessary tools installed on your system. You stracecan check if it is installed using the command below. To view the version, stracerun it with the -V parameter:



[root@sandbox ~]# rpm -qa | grep -i strace
strace-4.12-9.el7.x86_64
[root@sandbox ~]#
[root@sandbox ~]# strace -V
strace -- version 4.12
[root@sandbox ~]#




If stracenot installed, install by running:



yum install strace




For example, create a test directory in /tmpand two files using the command touch:



[root@sandbox ~]# cd /tmp/
[root@sandbox tmp]#
[root@sandbox tmp]# mkdir testdir
[root@sandbox tmp]#
[root@sandbox tmp]# touch testdir/file1
[root@sandbox tmp]# touch testdir/file2
[root@sandbox tmp]#




(I /tmponly use a directory because everyone has access to it, but you can use any other directory .)



Use the command to lscheck that testdirfiles have been created in the directory :



[root@sandbox tmp]# ls testdir/
file1  file2
[root@sandbox tmp]#




You probably use the command lsevery day without realizing that system calls are running under the hood. This is where abstraction comes into play. This is how this command works:



   ->    (glibc) ->  




The command lscalls functions from the Linux system libraries (glibc). These libraries, in turn, call system calls, which do most of the work.



If you want to know what functions were called from the glibc library, then use the command ltracefollowed by the command ls testdir/:



ltrace ls testdir/




If ltracenot installed, then install:



yum install ltrace




There will be a lot of information on the screen, but don't worry - we'll cover that later. Here are some of the important library functions from the output ltrace:



opendir("testdir/")                                  = { 3 }
readdir({ 3 })                                       = { 101879119, "." }
readdir({ 3 })                                       = { 134, ".." }
readdir({ 3 })                                       = { 101879120, "file1" }
strlen("file1")                                      = 5
memcpy(0x1665be0, "file1\0", 6)                      = 0x1665be0
readdir({ 3 })                                       = { 101879122, "file2" }
strlen("file2")                                      = 5
memcpy(0x166dcb0, "file2\0", 6)                      = 0x166dcb0
readdir({ 3 })                                       = nil
closedir({ 3 })    




By examining this output, you can probably understand what's going on. The named directory is testdiropened using a library function opendir, followed by calls to functions readdirthat read the contents of the directory. Finally, a function is called closedirthat closes the previously opened directory. For now, ignore other functions such as strlenand memcpy.



As you can see, you can easily see the library functions being called, but in this article we will focus on the system calls that are called by the system library functions.



To view system calls use stracewith the command ls testdiras shown below. And again you get a bunch of incoherent information:



[root@sandbox tmp]# strace ls testdir/
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL)                               = 0x1f12000
<<< truncated strace output >>>
write(1, "file1  file2\n", 13file1  file2
)          = 13
close(1)                                = 0
munmap(0x7fd002c8d000, 4096)            = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
[root@sandbox tmp]#


As a result of execution, straceyou will receive a list of system calls executed during the execution of the command ls. All system calls can be divided into the following categories:



  • Process management
  • File management
  • Directory and file system management
  • Other




There is a convenient way to analyze the received information - write the output to a file using the option -o.



[root@sandbox tmp]# strace -o trace.log ls testdir/
file1  file2
[root@sandbox tmp]#




This time, there will be no data on the screen - the command will lswork as expected, displaying a list of files and writing all output straceto a file trace.log. For a simple command, the lsfile contains almost 100 lines:



[root@sandbox tmp]# ls -l trace.log
-rw-r--r--. 1 root root 7809 Oct 12 13:52 trace.log
[root@sandbox tmp]#
[root@sandbox tmp]# wc -l trace.log
114 trace.log
[root@sandbox tmp]#




Take a look at the first line in the file trace.log:



execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0




  • At the beginning of the line is the name of the system call being executed - execve.
  • The text in parentheses is the arguments passed to the system call.
  • The number after the = sign (in this case, 0) is the value returned by the system call.




Now the result doesn't seem too scary, does it? And you can apply the same logic for other lines as well.



Pay attention to the only command that you called - ls testdir. You know the name of the directory used by the command ls, so why not use grepfor testdirin the file trace.logand see what it finds? Look closely at the result:



[root@sandbox tmp]# grep testdir trace.log
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[root@sandbox tmp]#




Going back to the above analysis execve, can you tell what the next system call is doing?



execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0




You don't have to remember all the system calls and what they do: everything is in the documentation. Man pages are in a hurry to help! Make sure the package is installed before running man command man-pages:



[root@sandbox tmp]# rpm -qa | grep -i man-pages
man-pages-3.53-5.el7.noarch
[root@sandbox tmp]#




Remember that you need to add "2" between the command manand the syscall name. If you read manabout man( man man), you will see that section 2 is reserved for system calls. Likewise, if you need information about library functions, you need to add 3 between manand the name of the library function.



Below are the section numbers man:



1.       .
2.   (,  ).
3.   (  ).
4.   (    /dev).




To view the documentation for a system call, run man with the name of that system call.



man 2 execve




According to the documentation, a system call execveexecutes a program that is passed to it in parameters (in this case it is ls). Additional parameters for ls are also passed to it. In this example, it is testdir. Hence, this system call simply runs lswith testdiras a parameter:



'execve - execute program'

'DESCRIPTION
       execve()  executes  the  program  pointed to by filename'




The next system call statis passed a parameter testdir:



stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0




To view the documentation use man 2 stat. The stat system call returns information about the specified file. Remember that everything in Linux is a file, including directories.



Next, the system call openatopens testdir. Note that the return value is 3. This is the file descriptor that will be used in subsequent system calls:



openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3




Now open the file
trace.log
and notice the line following the system call openat. You will see a system call getdentsthat does most of the work required to execute the command ls testdir. Now let's execute grep getdentsfor the file trace.log:



[root@sandbox tmp]# grep getdents trace.log
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
[root@sandbox tmp]#




The documentation ( man getdents) says that it getdentsreads directory entries, which is what we actually need. Note that the argument for getdent3 is the file descriptor obtained earlier from the system call openat.



Now that the contents of the directory have been received, we need a way to display the information in the terminal. So, we do grepfor another system call write, which is used to output to the terminal:



[root@sandbox tmp]# grep write trace.log
write(1, "file1  file2\n", 13)          = 13
[root@sandbox tmp]#




In the arguments, you can see the names of the files to be output: file1and file2. For the first argument (1), remember that on Linux, three file descriptors are opened by default for any process:



  • 0 - standard input stream
  • 1 - standard output stream
  • 2 - standard error stream




Thus, the system call writetakes file1and file2the standard output, which is a terminal, denotes the number 1.



Now that you know what the system calls done most of the work for the team ls testdir/. But what about the other 100+ system calls in the file trace.log?



The operating system does a lot of supporting things to start the process, so a lot of what you see in the file trace.logis initializing and cleaning up the process. Take a full look at the trace.log file and try to understand what happens when the command is run ls.



Now you can analyze system calls for any program. The strace utility also provides many useful command line options, some of which are described below.



By default strace, it does not display all information about system calls. However, it has an option -v verbosethat will show additional information about each system call:



strace -v ls testdir




It is good practice to use a parameter -fto keep track of child processes created by a running process:



strace -f ls testdir




But what if you only want the names of the system calls, the number of times they were run, and the percentage of time spent executing? You can use the option -cto get these statistics:



strace -c ls testdir/




If you want to trace a certain system call, for example, openand ignore others, then you can use the option -ewith the name of the system call:



[root@sandbox tmp]# strace -e open ls testdir
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
file1  file2
+++ exited with 0 +++
[root@sandbox tmp]#




What if you need to filter by multiple system calls? Don't worry, you can use the same option -eand separate the required system calls with a comma. For example, for writeand getdent:



[root@sandbox tmp]# strace -e write,getdents ls testdir
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
write(1, "file1  file2\n", 13file1  file2
)          = 13
+++ exited with 0 +++
[root@sandbox tmp]#




So far, we've only tracked explicit command runs. But what about the commands that were run earlier? What if you want to track demons? For this, you stracehave a special option -pto which you can pass the process ID.



We will not start the daemon, but use a command catthat displays the contents of the file passed to it as an argument. But if you do not specify an argument, then the command catwill simply wait for input from the user. After entering text, it will display the entered text on the screen. And so on until the user clicks Ctrl+Cto exit.



Run the command caton one terminal.



[root@sandbox tmp]# cat




On another terminal, find the process ID (PID) with the command ps:



[root@sandbox ~]# ps -ef | grep cat
root      22443  20164  0 14:19 pts/0    00:00:00 cat
root      22482  20300  0 14:20 pts/1    00:00:00 grep --color=auto cat
[root@sandbox ~]#




Now start stracewith the option -pand PID you found with ps. After starting, it stracewill display information about the process to which it connected, as well as its PID. Now stracemonitors the system calls made by the command cat. The first system call you'll see is read, waiting for input from thread 0, that is, from standard input, which is now the terminal the command is running on cat:



[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0,




Now go back to the terminal where you left the command running catand enter some text. For demonstration, I entered x0x0. Please note that I catsimply repeated what I entered and x0x0the screen will appear twice.



[root@sandbox tmp]# cat
x0x0
x0x0




Go back to the terminal where you straceconnected to the process cat. Now you see two new system calls: the previous one read, which has now read x0x0, and one more for writing write, which writes x0x0back to the terminal, and again a new one read, which is waiting for a read from the terminal. Note that standard input (0) and standard output (1) are on the same terminal:



[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0, "x0x0\n", 65536)                = 5
write(1, "x0x0\n", 5)                   = 5
read(0,




Imagine the benefits of launching stracefor daemons: you can see everything that is happening in the background. Complete the command
cat
by clicking
Ctrl+C
... This will also terminate the session
strace
since the monitored process has been terminated.



To view the timestamps of system calls use the option -t:



[root@sandbox ~]#strace -t ls testdir/

14:24:47 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
14:24:47 brk(NULL)                      = 0x1f07000
14:24:47 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2530bc8000
14:24:47 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:24:47 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3




What if you want to know the time spent between system calls? There is a handy option -rthat shows the time taken to execute each system call. Pretty helpful, isn't it?



[root@sandbox ~]#strace -r ls testdir/

0.000000 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
0.000368 brk(NULL)                 = 0x1966000
0.000073 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb6b1155000
0.000047 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000119 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3




Conclusion



The utility is stracevery handy for learning system calls in Linux. For other command line options, see man and online documentation.






All Articles