Debugging the kernel from the command line with bpftrace

This is another article in the series "BPF for the little ones" ( 0 , 1 , 2 ) and the first in a series of practical articles on Linux tracing with modern tools.







From it you will learn about the program and the language bpftrace



- the easiest way to dive into the world of BPF from a practical point of view, even if you know absolutely nothing about BPF. The utility bpftrace



allows you to create handler programs directly from the command line using a simple language and connect them to a huge number of kernel and user space events. Take a look at the KPDV below ... congratulations, you already know how to trace system calls with bpftrace



!







Unlike the previous articles in the series, this one is relatively short and its main part is written in the format of a tutorial, so after five minutes of reading you can create handlers and collect statistics about any event in the Linux kernel right from the command line. The end of the article discusses alternatives - ply



and BCC. Minutes in the phrase "after five minutes of reading" are meant to be Mercurial. Unique skills are not guaranteed to emerge after five minutes of reading.









Content





Installing bpftrace



. sudo apt install bpftrace



( ). bpftrace



, .







. bpftrace



, , , , BTF. , bpftrace



.







master- bpftrace docker image bpftrace



. , :







$ docker pull quay.io/iovisor/bpftrace:latest
$ cd /tmp
$ docker run -v $(pwd):/o quay.io/iovisor/bpftrace:latest /bin/bash -c "cp /usr/bin/bpftrace /o"
$ sudo ./bpftrace -V
bpftrace v0.11.4
      
      





bpftrace glibc



, docker image glibc



.







, . , execve(2)



, :







$ sudo ./bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(uptr(args->filename))); }'
Attaching 1 probe...
bash -> /bin/echo
bash -> /usr/bin/ls
gnome-shell -> /bin/sh
sh -> /home/aspsk/bin/geeqie
sh -> /usr/local/sbin/geeqie
...
      
      





, , , :







$ sudo mv /tmp/bpftrace /usr/local/bin
      
      





, , , BTF, bpftrace



, docker . bpftrace



. .







. bpftrace



/ BTF, kernel headers. , bpftrace



.







BTF, , BTF kernel headers, , , . BTF , .







?



. , , , . , . : « ?!» , ,







$ sudo apt install bpftrace
The following NEW packages will be installed:
  bpftrace
Preparing to unpack .../bpftrace_0.11.0-1_amd64.deb ...
Unpacking bpftrace (0.11.0-1) ...
Setting up bpftrace (0.11.0-1) ...
Processing triggers for man-db (2.9.3-2) ...
$
      
      





, bpftrace



. ? , bpftrace



. - :









, , - , .







Bpftrace: hello world



bpftrace



awk , BEGIN



END



, bpftrace



, . :







# bpftrace -e 'BEGIN { printf("Hello world!\n"); }'
Attaching 1 probe...
Hello world!
^C
      
      





"Hello world!"



. , Ctrl-C



, bpftrace



— . bpftrace



exit



. , END



:







# bpftrace -e '
 BEGIN { printf("Hello world!\n"); exit(); }
 END { printf("So long\n"); }
 '
Attaching 2 probes...
Hello world!
So long
      
      





Kprobes —



— , , , , kprobes (Kernel Probe — ) , , . , , kprobes



, , , , Linux BPF.







, schedule



:







$ sudo bpftrace -e '
    k:schedule { @[comm] = count(); }
    i:s:2 { exit();}
    END { print(@, 10); clear(@); }
'
Attaching 3 probes...
@[Timer]: 147
@[kworker/u65:0]: 147
@[kworker/7:1]: 153
@[kworker/13:1]: 158
@[IPC I/O Child]: 170
@[IPC I/O Parent]: 172
@[kworker/12:1]: 185
@[Web Content]: 229
@[Xorg]: 269
@[SCTP timer]: 1566
      
      





@



.







-? :







$ sudo bpftrace -l 'k:*' | wc -l
61106
      
      





. , , kprobe, , , kprobes.







kretprobes



kprobe



kretprobe



. kprobe



, kretporobe



. retval



.







, vfs_write



( ):







$ sudo bpftrace -e 'kr:vfs_write { @ = hist(retval); } i:s:2 { exit(); }'
Attaching 2 probes...

@:
[1]                  606 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)             1223 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)               0 |                                                    |
[32, 64)              25 |@                                                   |
      
      





uprobes uretprobes



, ( ), . kprobes



. bpftrace



uprobes



uretprobes



— .







, ( UID ):







$ sudo bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: [%d]: \"%s\"\n", uid, str(uptr(retval))); }'
Attaching 1 probe...
readline: [1000]: "echo "hello habr""
readline: [0]: "echo "hello from root""
^C
      
      





, 2



CONFIG_DEBUG_INFO_BTF=y



, kprobes, , bpf trampolines. BTF , . , / .







Tracepoints —



- , , , . tracepoints . :







$ sudo bpftrace -l 't:*'
      
      





, kprobes



:







$ sudo bpftrace -l 't:*' | wc -l
1782
      
      





tracepoints , API: , tracepoint, , . , bpftrace



, tracepoint, :







$ sudo bpftrace -lv tracepoint:thermal:thermal_temperature
tracepoint:thermal:thermal_temperature
    __data_loc char[] thermal_zone;
    int id;
    int temp_prev;
    int temp;
      
      





kprobe



, BTF, , , . BTF kprobes



kfuncs



.







usdt —



. ( ):







#include <sys/sdt.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct timeval tv;

    for (;;) {
        gettimeofday(&tv, NULL);
        DTRACE_PROBE1(test, probe, tv.tv_sec);
        sleep(1);
    }
}
      
      





test:probe



, tv.tv_sec



— . , systemtap-sdt-dev



( ). , :







$ cc /tmp/test.c -o /tmp/test
$ sudo bpftrace -l 'usdt:/tmp/test'
usdt:/tmp/test:test:probe
      
      





/tmp/test



,







$ sudo bpftrace -e 'usdt:/tmp/test:test:probe { printf("  %u\n", arg0); }'
Attaching 1 probe...
  1612903991
  1612903992
  1612903993
  1612903994
  1612903995
...
      
      





arg0



tv.tv_sec



, breakpoint.







Perf



bpftrace



, Perf. , bpftrace



:







  • software



    : -
  • hardware



    : PMCs
  • interval



    :
  • profile



    :


`software`



perf, :







# bpftrace -l 's:*'
software:alignment-faults:
software:bpf-output:
software:context-switches:
software:cpu-clock:
software:cpu-migrations:
software:dummy:
software:emulation-faults:
software:major-faults:
software:minor-faults:
software:page-faults:
software:task-clock:
      
      





, , ,







# bpftrace -e 'software:cpu-migrations:10 { @[comm] = count(); }'
Attaching 2 probes...
^C@[kworker/u65:1]: 1
@[bpftrace]: 1
@[SCTP timer]: 2
@[Xorg]: 2
      
      





CPU . perf_event_open(2)



, , cpu-migrations



, man- PERF_COUNT_SW_CPU_MIGRATIONS



.







`hardware`



hardware counters, bpftrace



BPF. , :







bpftrace -l 'h*'
hardware:backend-stalls:
hardware:branch-instructions:
hardware:branch-misses:
hardware:bus-cycles:
hardware:cache-misses:
hardware:cache-references:
hardware:cpu-cycles:
hardware:frontend-stalls:
hardware:instructions:
hardware:ref-cycles:
      
      





( , . PERF_COUNT_HW_BRANCH_MISSES



):







bpftrace -e 'hardware:branch-misses:100000 { @[tid] = count(); }'
Attaching 3 probes...
@[1055]: 4
@[3078]: 4
@[1947]: 5
@[1066]: 6
@[2551]: 6
@[0]: 29
      
      





`interval` `profile`



interval



profile



. CPU, — CPU.







, . , :







$ sudo bpftrace -e '
    kr:vfs_write { @ = hist(retval); }
    interval:s:2 { print(@); clear(@); }
'
      
      





vfs_write



, interval



, , @



.







profile



:







# bpftrace -e '
  profile:hz:99 { @[kstack] = count(); }
  i:s:10 { exit(); }
  END { print(@,1); clear(@); }
'
Attaching 3 probes...
@[
    cpuidle_enter_state+202
    cpuidle_enter+46
    call_cpuidle+35
    do_idle+487
    cpu_startup_entry+32
    start_secondary+345
    secondary_startup_64+182
]: 14455
      
      





profile



99 , exit()



, END



@



— ( , , , ).







Bpftrace: tutorial





, bpftrace



:







# bpftrace
      
      





, . , bpftrace



. , bpftrace



. , — .







-l



. . ( - , , .) :







# bpftrace -l '*kill_all*'
kprobe:rfkill_alloc
kprobe:kill_all
kprobe:btrfs_kill_all_delayed_nodes
      
      





tracepoints



:







# bpftrace -l 't:*kill*'
tracepoint:cfg80211:rdev_rfkill_poll
tracepoint:syscalls:sys_enter_kill
tracepoint:syscalls:sys_exit_kill
tracepoint:syscalls:sys_enter_tgkill
tracepoint:syscalls:sys_exit_tgkill
tracepoint:syscalls:sys_enter_tkill
tracepoint:syscalls:sys_exit_tkill
      
      





tracepoint:syscalls



, , bpftrace



. X



:







tracepoint:syscalls:sys_enter_X
tracepoint:syscalls:sys_exit_X
      
      





- , , execve(2)



. , , - tracepoint



-v



, :







# bpftrace -lv 't:s*:sys_*_execve'
tracepoint:syscalls:sys_enter_execve
    int __syscall_nr;
    const char * filename;
    const char *const * argv;
    const char *const * envp;
tracepoint:syscalls:sys_exit_execve
    int __syscall_nr;
    long ret;
      
      





( *



, syscalls



sys_enter_execve



sys_exit_execve



). , -lv



args



:







# bpftrace -e '
    t:s*:sys_enter_execve { printf("ENTER: %s\n", str(uptr(args->filename))); }
    t:s*:sys_exit_execve { printf("EXIT: %s: %d\n", comm, args->ret); }
'
Attaching 2 probes...
ENTER: /bin/ls
EXIT: ls: 0
ENTER: /bin/lss
EXIT: bash: -2
      
      





.







args->filename



. , str



, : , uptr



. bpftrace



, . , , bpftrace -lv



, , , execve ( __user



).







comm



, . args->ret



. , " " , . , - , : /bin/ls



exec



0 ( ls



/bin/ls



), /bin/lss



exec



-2



bash



.







. , . , . uptr



, .







`bpftrace`



bpftrace



awk (. 6 The AWK Programming Language



) .







<probe> <filter> { <action> }
      
      





,







# bpftrace -e 'BEGIN { printf("Hello\n") } END { printf("World\n") }'
      
      





<probe>



BEGIN



END



, <action>



printf



.

<filter>



,

,







# bpftrace -e 'p:s:1 /cpu == 0/ { printf("  CPU%d\n", cpu); }'
Attaching 1 probe...
  CPU0
  CPU0
^C
      
      





, CPU 0.







. , /cpu == 0/



?







<filter>



. , write



. — sys_enter_write



sys_exit_write



:







# cat /tmp/write-times.bt
t:syscalls:sys_enter_write {
  @[tid] = nsecs
}

t:syscalls:sys_exit_write /@[tid]/ {
  @write[comm] = sum((nsecs - @[tid]) / 1000);
  delete(@[tid]);
}

END {
  print(@write, 10);
  clear(@write);
  clear(@);
}
      
      





, . :







# bpftrace /tmp/write-times.bt
      
      





, sys_enter_write



, write



@



, tid



.







, sys_exit_write



, /@[tid]/



, . , , - write



. ( ) @write



@[tid]



.







, ^C



, END



, @write



@



, bpftrace



.







. , /@[tid]/



?







:



bpftrace



C , , :?



, ++



, --



. if {} else {}



. while



unroll



( , BPF ). , .







: . $



, :







# bpftrace -e 'BEGIN { $x = 1; printf("%d\n", ++$x); exit(); }'
# bpftrace -e 'BEGIN { if (1) { $x = 1; } printf("%d\n", ++$x); exit(); }'
      
      





— :







# bpftrace -e 'BEGIN { $x = 1; exit(); } END { printf("%d\n", $x); }'
      
      





, , @



. "" @



, write



(@[tid]



). ( bpftrace



— . , , : . , , .)







, : ? bpftrace



. kprobes



, tracepoints



— , tracepoint, Perf, , — . tracepoints



, kprobes



, Perf.







kprobes



arg0



, arg1



, .. , . :







#include <linux/skbuff.h>
#include <linux/ip.h>

k:icmp_echo {
    $skb = (struct sk_buff *) arg0;
    $iphdr = (struct iphdr *) ($skb->head + $skb->network_header);
    @pingstats[ntop($iphdr->saddr), ntop($iphdr->daddr)]++;
}
      
      





, . kprobe



icmp_echo



, ICMPv4 echo request. , arg0



, — sk_buff



, . IP @pingstats



. , , IP ! , kprobe, user space, .







. , kernel headers. sk_buff



iphdr



, . ( BTF, — , .) icmp_echo



sk_buff



$skb



. $skb



, , , , iphdr



. ntop



bpftrace



, IP .







. . uptr



kptr



. : vfs_write



, — struct file



, <linux/fs.h>



. vfs_write



. (Hint: struct file



kretprobe



?)







. bpftrace



, , , sudo



.









ping



— , CPU. count



. :







# bpftrace -e 'p:hz:5000 { @x++; @y = count(); } i:s:10 { exit(); }'
Attaching 2 probes...
@x: 760528
@y: 799002
      
      





10 5000 16 @x



@y



. ++



. count()



, CPU- : CPU , .







bpftrace



, - . — . .







# bpftrace -e 'kr:vfs_write { @ = hist(retval); } i:s:10 { exit() }'
Attaching 2 probes...

@:
[1]             14966407 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)             1670 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)             123 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)        7531982 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
      
      





vfs_write



, , - ! ( 20 ):







# bpftrace -e '
    kr:vfs_write /retval == 1/ { @[pid, comm] = hist(retval); }
    i:s:10 { exit() }
    END { print(@, 1); clear(@); }'
Attaching 3 probes...
@[133835, dd]:
[1]             14254511 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      
      





, , retval



. , — . , END



, . ? :







# tr '\000' ' ' < /proc/133835/cmdline
dd if=/dev/zero of=/dev/null bs=1
      
      





. bpftrace



, .







: flame graphs



profile



. . , profile



:







profile:hz:rate
profile:s:rate
profile:ms:rate
profile:us:rate
      
      





HZ, — , , .







, , , :







bpftrace -e 'profile:hz:99 { print(kstack); exit() }'
Attaching 1 probe...

        cpuidle_enter_state+202
        cpuidle_enter+46
        call_cpuidle+35
        do_idle+487
        cpu_startup_entry+32
        rest_init+174
        arch_call_rest_init+14
        start_kernel+1724
        x86_64_start_reservations+36
        x86_64_start_kernel+139
        secondary_startup_64+182
      
      





-.







, , . flame graphs, .







, bpftrace



:







# bpftrace -e 'profile:hz:333 { @[kstack] = count(); } i:s:10 { exit() }' > /tmp/raw
# wc -l /tmp/raw
3374 /tmp/raw
      
      





333 ( kstack



@



, kstack



— ).







FlameGraph :







$ git clone https://github.com/brendangregg/FlameGraph.git
$ cd FlameGraph
$ ./stackcollapse-bpftrace.pl /tmp/raw > /tmp/ready
$ ./flamegraph.pl /tmp/ready > /tmp/kstack.svg
      
      





bpftrace



, ( , gist SVG):













, CPU dd



, /dev/zero



/dev/null



. , .







. bpftrace



. FlameGraph



.









, bpftrace



. , .







BCC:



BCC , BPF — 100 . — , , Netflix, Facebook . . , .







libbcc



libbpf



, -C , . , . .







BCC - . , bpfcc-tools



. . ,







# funccount-bpfcc 'vfs_*' -d 5
Tracing 67 functions for "b'vfs_*'"... Hit Ctrl-C to end.

FUNC                                    COUNT
b'vfs_statfs'                               1
b'vfs_unlink'                               1
b'vfs_lock_file'                            2
b'vfs_fallocate'                           31
b'vfs_statx_fd'                            32
b'vfs_getattr'                             80
b'vfs_getattr_nosec'                       88
b'vfs_open'                               108
b'vfs_statx'                              174
b'vfs_writev'                            2789
b'vfs_write'                             6554
b'vfs_read'                              7363
Detaching...
      
      





vfs_



. -p



, PID . , mplayer



, :







# funccount-bpfcc 'vfs_*' -d 5 -p 29380
Tracing 67 functions for "b'vfs_*'"... Hit Ctrl-C to end.

FUNC                                    COUNT
b'vfs_write'                              208
b'vfs_read'                               629
Detaching...
      
      





BCC



BCC. mutex_lock



mutex_unlock



. , .







#! /usr/bin/python3

from bcc import BPF
from ctypes import c_int
from time import sleep, strftime
from sys import argv

b = BPF(text="""
BPF_PERCPU_ARRAY(mutex_stats, u64, 2);

static inline void inc(int key)
{
    u64 *x = mutex_stats.lookup(&key);
    if (x)
        *x += 1;
}

void do_lock(struct pt_regs *ctx) { inc(0); }
void do_unlock(struct pt_regs *ctx) { inc(1); }
""")

b.attach_kprobe(event="mutex_lock", fn_name="do_lock")
b.attach_kprobe(event="mutex_unlock", fn_name="do_unlock")

print("%-8s%10s%10s" % ("TIME", "LOCKED", "UNLOCKED"))

while 2 * 2 == 4:

    try:
        sleep(1)
    except KeyboardInterrupt:
        exit()

    print("%-8s%10d%10d" % (
          strftime("%H:%M:%S"),
          b["mutex_stats"].sum(0).value,
          b["mutex_stats"].sum(1).value))

    b["mutex_stats"].clear()
      
      





. , — BPF



:







from bcc import BPF
      
      





BPF, . BPF



-C.







BPF_PERCPU_ARRAY(mutex_stats, u64, 2);

static inline void inc(int key)
{
    u64 *x = mutex_stats.lookup(&key);
    if (x)
        *x += 1;
}

void do_lock(struct pt_regs *ctx)   { inc(0); }
void do_unlock(struct pt_regs *ctx) { inc(1); }
      
      





C, , BPF



C.

, mutex_stats



u64



, . , PERCPU , , CPU . inc



, mutex_stats



. .

, do_lock



do_unlock



.







BPF



libllvm



, , . kprobes



:







b.attach_kprobe(event="mutex_lock", fn_name="do_lock")
b.attach_kprobe(event="mutex_unlock", fn_name="do_unlock")
      
      





:







print("%-8s%10s%10s" % ("TIME", "LOCKED", "UNLOCKED"))

while 2 * 2 == 4:

    try:
        sleep(1)
    except KeyboardInterrupt:
        exit()

    print("%-8s%10d%10d" % (
          strftime("%H:%M:%S"),
          b["mutex_stats"].sum(0).value,
          b["mutex_stats"].sum(1).value))

    b["mutex_stats"].clear()
      
      





mutex_stats



. sum



mutex_stats



, CPU:







sum(index) {
    result = 0
      CPU {
        result += cpu->mutex_stats[index]
    }
    return result
}
      
      





. :







$ sudo ./bcc-tool-example
TIME        LOCKED  UNLOCKED
18:06:33     11382     12993
18:06:34     11200     12783
18:06:35     18597     22553
18:06:36     20776     25516
18:06:37     59453     68201
18:06:38     49282     58064
18:06:39     25541     27428
18:06:40     22094     25280
18:06:41      5539      7250
18:06:42      5662      7351
18:06:43      5205      6905
18:06:44      6733      8438
      
      





- 18:06:35 youtube Firefox, youtube 18:06:40 . , , youtube .







, C, libbpf



CO-RE. libbpf



, libllvm



, , . , BCC libbpf



+ CO-RE BCC, . libbpf-tools. BCC to libbpf conversion guide ( ).







Ply: bpftrace



ply



, Tobias Waldekranz 2015 , bpftrace



. awk- BPF, ,







ply 'tracepoint:tcp/tcp_receive_reset {
    printf("saddr:%v port:%v->%v\n", data->saddr, data->sport, data->dport);
}'
      
      





ply



: libc



(). , BPF . , , ply



ply script language -> BPF



.







, ply



, , — ply



, , . ply



, , , , C libbpf



— , ., , Building BPF applications with libbpf-bootstrap.









:









bpftrace



, BCC Linux:







  • The bpftrace One-Liner Tutorial. bpftrace



    , . , , .







  • bpftrace Reference Guide. , bpftrace



    , . , bpftrace



    .







  • BCC Tutorial. bpftrace



    ( libbpf



    ), BCC, BCC Reference Guide, , .







  • Brendan Gregg, «BPF Performance Tools». BPF Linux — BCC. BPF .







  • Brendan Gregg, «Systems Performance: Enterprise and the Cloud, 2nd Edition (2020)». «Systems Performance». : BPF, Solaris, . «BPF Performance Tools» «?», «?», ( BPF ).










All Articles