How do profilers work in Ruby and Python?

The translation of the article was prepared in anticipation of the start of the advanced course "Python Developer" .



The original article can be read here .










Hello! As an aperitif to the Ruby profiler, I wanted to talk about how existing Ruby and Python profilers work. It will also help answer the question that many people ask me: "How to write a profiler?"



In this article, we will focus on processor profilers (and not, let's say, memory / heap profilers). I will cover some basic approaches to writing a profiler, provide code examples and show many examples of popular profilers in Ruby and Python, and also show you how they work under the hood.



Probably, there may be errors in the article (in preparation for writing it, I partially looked through the code of 14 libraries for profiling, and I was not familiar with many of them until now), so please let me know if you find them ...



2 types of profilers



There are two main types of processor profilers - sampling and tracing profilers.



Tracing profilers record every function call in your program, ultimately providing a report. Sampling profilers take a statistical approach, they write the stack every few milliseconds, generating a report based on this data.



The main reason to use a sampling profiler instead of a tracing profiler is because it is lightweight. You take 20 or 200 pictures per second - it doesn't take much time. These profilers will be very effective if you have a serious performance problem (80% of the time is spent calling one slow function), since 200 snapshots per second will be enough to identify the problem function!



Profilers



Next, I will give a general summary of the profilers discussed in this article ( from here ). I will explain the terms used in this article ( setitimer , rb_add_event_hook , ptrace ) a little later. Interestingly, all profilers are implemented using a small set of basic features.



Python profilers







"Gdb hacks" is not really a Python profiler - it links to a website that explains how to implement a hacker profiler as a shell script wrapper around gdb . This is specifically about Python, as newer versions of gbd will actually deploy the Python stack for you. Something like pyflame for the poor.



Ruby profilers







Almost all of these profilers live inside your process



Before we get into the details of these profilers, there is one very important thing - all of these profilers, except pyflame , run inside your Python / Ruby process. If you are inside a Python / Ruby program, you usually have easy access to the stack. For example, here's a simple Python program that prints the contents of the stack of each running thread:



import sys
import traceback

def bar():
    foo()

def foo():
    for _, frame in sys._current_frames().items():
        for line in traceback.extract_stack(frame):
            print line

bar()


Here is the console output. You can see that it has function names from the stack, line numbers, file names - whatever you might need if you are profiling.



('test2.py', 12, '<module>', 'bar()')
('test2.py', 5, 'bar', 'foo()')
('test2.py', 9, 'foo', 'for line in traceback.extract_stack(frame):')


It's even simpler in Ruby: you can use puts caller to get the stack.



Most of these profilers are performance extensions of C, so they differ slightly, but such extensions for Ruby / Python programs also have easy access to the call stack.



How tracing profilers work



I have listed all Ruby and Python tracing profiles in the tables above: rblineprof, ruby-prof, line_profiler and cProfile . They all work in a similar way. They record every function call and are C extensions to reduce overhead.



How do they work? In both Ruby and Python, you can specify a callback that is triggered when various interpreter events occur (for example, "a function call" or "line of code execution"). When the callback is called, it writes the stack for later analysis.



It's helpful to see exactly where these callbacks are in the code, so I'll link to the relevant lines of code on github.



In Python, you can customize the callback with PyEval_SetTraceor PyEval_SetProfile. This is described in the documentation sectionProfiling and Tracing in Python. It says, " PyEval_SetTracesimilar to PyEval_SetProfileexcept that the trace function receives line number events."



The code:



  • line_profilersets up its callback using PyEval_SetTrace: see line_profiler.pyxline 157
  • cProfilesets up its callback using PyEval_SetProfile: see _lsprof.c line 693 (cProfile is implemented using lsprof )


In Ruby, you can customize your callback with rb_add_event_hook. I couldn't find any documentation about it, but this is how it looks.



rb_add_event_hook(prof_event_hook,
      RUBY_EVENT_CALL | RUBY_EVENT_RETURN |
      RUBY_EVENT_C_CALL | RUBY_EVENT_C_RETURN |
      RUBY_EVENT_LINE, self);


Signature prof_event_hook:



static void
prof_event_hook(rb_event_flag_t event, VALUE data, VALUE self, ID mid, VALUE klass)




Something like PyEval_SetTracein Python, but in a more flexible form - you can choose which events you want to be notified about (for example, "only function calls").



The code:



  • ruby-prof rb_add_event_hook : ruby-prof.c 329
  • rblineprof rb_add_event_hook : rblineprof.c 649


tracing-



The main disadvantage of tracing profilers implemented in this way is that they add a fixed amount of code for each function / line call that is executed. It can make you make the wrong decisions! For example, if you have two implementations of something - one with a lot of function calls and one without, which usually take the same amount of time, then the first, with a lot of function calls, will appear slower when profiling.



To illustrate, I created a small file named test.pywith the following content and compared the execution time python -mcProfile test.pyand python test.py. python. test.pycompleted in about 0.6 s, and in python -mcProfile test.pyabout 1 s. So for this particular example I cProfileadded an extra ~ 60% overhead.

The documentation cProfilesays:

Python's interpreted nature adds so much run-time overhead that deterministic profiling tends to add a little processing overhead in normal applications.


This seems like a pretty reasonable statement - the previous example (which makes 3.5 million function calls and nothing else) is obviously not a regular Python program, and almost any other program will have less overhead.

I haven't checked the ruby-profRuby tracing profiler, but its README says the following:

Most programs will run about half as slow, while highly recursive programs (such as the Fibonacci series test) will run three times slower .


How sampling profilers generally work: setitimer



Time to talk about the second kind of profiler: sampling profilers!

Most sampling profilers in Ruby and Python are implemented using a system call setitimer. What it is?



Let's say you want to take a snapshot of the program stack 50 times per second. This can be done as follows:



  • Ask the Linux kernel to send you a signal every 20 milliseconds (using a system call setitimer);
  • Register a signal handler for a stack snapshot when a signal is received;
  • When profiling is complete, ask Linux to stop signaling you and provide the result!


If you want to see a practical use case setitimerfor implementing a sampling profiler, I think the stacksampler.pybest example is a useful, working profiler, and it is about 100 lines in Python. And this is very cool!



The reason it stacksampler.pyonly takes 100 lines in Python is because when you register a Python function as a signal handler, the function is passed onto the current stack of your program. Therefore, stacksampler.pyregistering a signal handler is very simple:



def _sample(self, signum, frame):
   stack = []
   while frame is not None:
       stack.append(self._format_frame(frame))
       frame = frame.f_back

   stack = ';'.join(reversed(stack))
   self._stack_counts[stack] += 1


It simply pops a stack from a frame and increments the number of times that a particular stack has been viewed. So simple! So cool!



Let's take a look at all of our other profilers that they use setitimerand find out where in the code they call setitimer:



  • stackprof (Ruby): stackprof.c 118
  • perftools.rb (Ruby): , , , , gem (?)
  • stacksampler (Python): stacksampler.py 51
  • statprof (Python): statprof.py 239
  • vmprof (Python): vmprof_unix.c 294


The important thing about setitimer- you need to decide how to count the time. Do you want 20ms real time? 20ms user cpu time? 20ms user + system cpu time? If you look closely at the links above, you will notice that these profilers actually use different things setitimer- sometimes the behavior is customizable and sometimes not. The man page is setitimershort and well worth reading for all the possible configurations. pointed out one interesting use-case



@mgedminon Twittersetitimer . This issue and this issue reveal more details.



One INTERESTING drawback of profilers based onsetitimer- what timers trigger signals! Signals sometimes interrupt system calls! System calls sometimes take a few milliseconds! If you take snapshots too often, you can make the program execute system calls indefinitely!



Sampling profilers that don't use setitimer



There are several sampling profilers that do not use setitimer:



  • pyinstrumentuses PyEval_SetProfile(so it's kind of a tracing profiler), but it doesn't always collect stack snapshots when the trace callback is called. Here is the code that picks the timing of the stack trace snapshot . Read more about this solution in this blog . (basically: setitimerallows you to profile only the main thread in Python)
  • pyflameprofiles Python code outside of a process using a system call ptrace. He uses a loop where he takes pictures, sleeps for a certain amount of time and does the same thing again. Here's a call to wait.
  • python-flamegraphtakes a similar approach where it starts a new thread in your Python process and gets stack traces, sleeps and retries. Here's a call to wait .


All 3 of these profilers take real-time snapshots.



Pyflame blog posts



I've spent almost all of my time in this post on profilers other than pyflame, but it actually interests me the most because it profiles your Python program from a separate process, which is why I want my Ruby profiler to work similarly.



Of course, everything is a little more complicated than I described. I won't go into details, but Evan Klitzke has written many good articles about this on his blog:





More information can be found at eklitzke.org . These are all very interesting things that I'm going to read about more closely - perhaps it ptracewill turn out to be better than process_vm_readvfor implementing a Ruby profiler! It has process_vm_readvless overhead as it does not stop the process, but it can also give you an incorrect snapshot as it does not stop the process :). In my experiments, getting conflicting images was not a big problem, but I think that here I will conduct a series of experiments.



That's all for today!



There are many important subtleties that I did not go into in this post - for example, I basically said that vmprofand stacksampler- are similar (they are not -vmprofsupports string profiling and profiling of Python functions written in C, which I believe makes the profiler more complex). But they have some of the same basic principles, and so I think today's review will be a good starting point.






TDD with and without pytest. Free webinar






Read more:






All Articles