Today we are sharing a translation of an article by the creator of FunctionTrace, a Python profiler with an intuitive graphical interface that can profile multiprocess and multithreaded applications while using an order of magnitude fewer resources than other Python profilers. Whether you're just learning web development in Python or have been using the language for years, it's always good to understand what your code is doing. Below, the author explains how the project came about and how it was built.
Introduction
The Firefox Profiler was a cornerstone of Firefox during the Project Quantum era. When you open an example recording, you see a powerful web-based performance analysis interface that includes call trees, stack charts, flame graphs, and more. All data filtering, zooming, slicing, and transformation actions are preserved in a URL that can be shared. Results can be attached to a bug report, your findings can be documented, compared with other recordings, or handed off for further investigation. Firefox DevEdition has a built-in profiling workflow, which makes this kind of communication easy. Our goal is to empower all developers, even outside of Firefox, to collaborate productively.
Previously, the Firefox Profiler gained the ability to import other formats, starting with Linux perf and Chrome profiles. Over time, developers have added more formats. Today, the first projects are emerging that target the Firefox Profiler as the front end for their own analysis tools. FunctionTrace is one such project. Matt tells the story of how the tool was made.
FunctionTrace
I recently created a tool to help developers better understand what's going on in their Python code. FunctionTrace is a non-sampling profiler for Python that runs on unmodified applications with very low overhead: less than 5%. Importantly, it is integrated with the Firefox Profiler, which lets you interact with profiles graphically, making it easier to spot patterns and make changes to your codebase.
I'll go over the design goals of FunctionTrace and share the technical details of its implementation. At the end, we'll play with a little demo.
An example of a FunctionTrace profile opened in Firefox Profiler.
Tech debt as motivation
Codebases tend to grow over time, especially when many people work on complex projects. Some languages cope with this better than others. Java, for example, has had powerful IDE support for decades; Rust's strong typing makes refactoring very easy. In other languages, it can feel like a growing codebase simply becomes harder to maintain. This is especially true for older Python code. At least we're all on Python 3 by now, right?
Making large-scale changes or refactoring unfamiliar code can be extremely difficult. I find it much easier to change code correctly when I can see everything the program does and how its pieces interact. Often, I even catch myself rewriting pieces of code that I never intended to touch: the inefficiency is obvious once I see it visualized.
I wanted to understand what was going on in code without having to read hundreds of files, but I couldn't find a tool that fit my needs. I had also lost interest in building such a tool myself because of the amount of UI work involved, and the interface was essential. My hopes of quickly understanding program execution were rekindled when I stumbled upon the Firefox Profiler.
The profiler provided all the hard-to-build pieces: an intuitive, open-source user interface that displays stack charts, time-correlated log markers, and flame graphs, plus the stability that comes from being tied to a major web browser. Any tool that can write a properly formatted JSON profile can reuse all of the graphical analysis capabilities mentioned above.
FunctionTrace design
Luckily, I already had a week of vacation planned when I discovered the Firefox Profiler, and I had a friend who wanted to build the tool with me. He also took time off that week.
Objectives
We had several goals when we started developing FunctionTrace:
- The ability to see everything that happens in the program.
- Support for multithreaded and multiprocess applications.
- Overhead low enough that profiling doesn't distort the program's behavior.
The first goal had a significant impact on the design; the last two added engineering complexity. We both knew from past experience with similar tools how frustrating it is when calls to short-lived functions are invisible. If you record at 1 ms granularity but have important functions that run faster than that, you miss a huge amount of what is happening inside your program.
We also knew we needed to observe every function call, which ruled out using a sampling profiler. In addition, I had recently spent time with code in which Python processes executed other Python code, often through an intermediate shell script. Because of that, we wanted to be able to trace child processes as well.
Initial implementation
To support multiple processes and their descendants, we settled on a client-server model. Python clients send trace data to a Rust server, which aggregates and compresses it before generating a profile that the Firefox Profiler can consume. We chose Rust for several reasons, including its strong typing, its consistent performance and predictable memory usage, and the ease of prototyping and refactoring.
We prototyped the client as a Python module, invoked as python -m functiontrace code.py. This made it easy to use Python's built-in trace hooks to log execution. The original implementation looked like this:
import json, sys, time

def profile_func(frame, event, arg):
    # Forward every Python/C call and return, with a timestamp, to the server.
    if event in ("call", "return", "c_call", "c_return"):
        data = (event, time.time())
        server.sendall(json.dumps(data).encode())  # `server`: connected socket

sys.setprofile(profile_func)
The server listens on a Unix domain socket, reads the data from the client, and converts it into the Firefox Profiler's JSON format.
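As a rough illustration of the listening side, here is a minimal Python sketch (the real server is written in Rust; the socket path and the handler function are hypothetical):

import socket

# Minimal sketch of a Unix domain socket listener. The path below is
# hypothetical; the real FunctionTrace server is implemented in Rust.
PATH = "/tmp/functiontrace.sock"
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(PATH)
server.listen()

conn, _ = server.accept()
while chunk := conn.recv(65536):
    handle_trace_data(chunk)  # hypothetical: parse and aggregate trace events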
The profiler supports importing various profile types, such as perf logs, but we decided to emit the profiler's internal JSON format directly. It takes less space and requires less maintenance than adding a new supported format. Importantly, the profiler maintains backward compatibility between profile versions: any profile written for the current version of the format is automatically converted to the latest version when it is loaded in the future. The profiler also refers to strings by integer identifiers, which saves a lot of space through deduplication (trivial to implement with indexmap).
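The idea behind that deduplication is a simple string table. Here's a minimal Python sketch of the technique (the server does the equivalent in Rust with indexmap):

# String table: each unique string is stored once and referenced by its index.
string_ids = {}   # string -> integer id
strings = []      # id -> string; this array is what ends up in the profile

def intern(s):
    if s not in string_ids:
        string_ids[s] = len(strings)
        strings.append(s)
    return string_ids[s]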
Several optimizations
The original code mostly worked: on every function call and return, Python invoked the hook, and the hook sent a JSON message over the socket for the server to convert into the desired format. But it was incredibly slow. Even after batching the socket calls, we saw at least 8x overhead on some test programs.
After seeing such costs, we dropped down to the C level, using the Python C API, and got the overhead down to a factor of 1.1x on the same programs. That enabled another key optimization: replacing time.time() calls with rdtsc operations via clock_gettime(). We reduced the per-call performance overhead to a few instructions plus generating 64 bits of data, which is much more efficient than a chain of Python calls and complex arithmetic on a performance-critical path.
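For intuition, the closest Python-level analogue of that timestamp path is the raw monotonic clock; a hedged sketch (FunctionTrace does this inside its C module, not through these wrappers):

import time

# CLOCK_MONOTONIC_RAW (Linux) is backed by the raw hardware tick source;
# fetching an integer nanosecond count avoids building a float per event.
timestamp = time.clock_gettime_ns(time.CLOCK_MONOTONIC_RAW)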
I mentioned that tracing multiple threads and child processes is supported. This was one of the most difficult parts of the client, so it is worth discussing some lower-level details.
Support for multiple threads
A handler for all threads is installed via threading.setprofile(). It is registered while the thread's state is being set up, which guarantees that Python is running and that the GIL is currently held. This simplifies some of our assumptions:
// This is installed as the setprofile() handler for new threads by
// threading.setprofile(). On its first execution, it initializes tracing for
// the thread, including creating the thread state, before replacing itself with
// the normal Fprofile_FunctionTrace handler.
static PyObject* Fprofile_ThreadFunctionTrace(..args..) {
Fprofile_CreateThreadState();
// Replace our setprofile() handler with the real one, then manually call
// it to ensure this call is recorded.
PyEval_SetProfile(Fprofile_FunctionTrace);
Fprofile_FunctionTrace(..args..);
Py_RETURN_NONE;
}
When the Fprofile_ThreadFunctionTrace() hook is called, it allocates a ThreadState structure containing the information the thread needs to log events and communicate with the server. It then sends an init message to the profile server, notifying it that a new thread has started and passing some initial information: the time, the PID, and so on. After this initialization, we replace the hook with Fprofile_FunctionTrace(), which does the actual tracing.
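At the Python level, the same self-replacing hook pattern looks roughly like this (a simplified sketch using the public threading API; the helper names are hypothetical):

import sys, threading

def bootstrap_hook(frame, event, arg):
    # First event on a new thread: set up per-thread state, then replace
    # this bootstrap hook with the real tracing hook for this thread.
    init_thread_state()               # hypothetical per-thread setup
    sys.setprofile(tracing_hook)      # hypothetical real tracing hook
    tracing_hook(frame, event, arg)   # record the event that triggered us

# New threads will start with bootstrap_hook as their profile function.
threading.setprofile(bootstrap_hook)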
Support for child processes
When working with multiple processes, we assume that child processes are started via the Python interpreter. Unfortunately, child processes are not invoked with -m functiontrace, so we wouldn't know to trace them. To ensure that child processes are traced, we modify the $PATH environment variable at startup so that python resolves to an executable that knows about functiontrace:
import os, tempfile

# Generate a temp directory to store our wrappers in. We'll temporarily
# add this directory to our $PATH.
tempdir = tempfile.mkdtemp(prefix="py-functiontrace")
os.environ["PATH"] = tempdir + os.pathsep + os.environ["PATH"]

# Generate wrappers for the various Python versions we support to ensure
# they're included in our $PATH. PYTHON_TEMPLATE (defined elsewhere) is a
# small wrapper script that re-invokes the interpreter with -m functiontrace.
wrap_pythons = ["python", "python3", "python3.6", "python3.7", "python3.8"]
for python in wrap_pythons:
    with open(os.path.join(tempdir, python), "w") as f:
        f.write(PYTHON_TEMPLATE.format(python=python))
    os.chmod(f.name, 0o755)
Inside the wrapper, the real interpreter is invoked with the -m functiontrace argument. Finally, an environment variable is set at startup that records which socket to use for communicating with the profile server. If a newly initializing client sees this variable already set, it recognizes that it is a child process. It then connects to the existing server instance, allowing its trace to be correlated with that of the original client.
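A minimal sketch of that handshake, assuming a hypothetical variable name and a hypothetical helper for starting the server:

import os

SOCK_ENV = "FUNCTIONTRACE_SOCKET"  # hypothetical name; the real one may differ

if SOCK_ENV in os.environ:
    # We're a child process: connect to the already-running server.
    socket_path = os.environ[SOCK_ENV]
else:
    # Top-level process: start a server and advertise its socket to children.
    socket_path = start_server()   # hypothetical helper
    os.environ[SOCK_ENV] = socket_path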
FunctionTrace now
FunctionTrace's implementation today has a lot in common with the one described above. At a high level, the client is attached by running FunctionTrace via a call like python -m functiontrace code.py. This loads a Python module for some setup work, then hands off to the C module to install the various trace hooks. These include the sys.setprofile hooks mentioned above, memory allocation hooks, and custom hooks on interesting functions like builtins.print or builtins.__import__. In addition, a functiontrace-server instance is spawned, a socket is set up for communicating with it, and we ensure that future threads and child processes will talk to the same server.
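As an illustration of what such a custom hook might look like at the Python level (a hedged sketch; the marker helper is hypothetical, and the real hooks live in the C module):

import builtins

_real_print = builtins.print

def traced_print(*args, **kwargs):
    record_marker("print")  # hypothetical: emit a log marker for the profile
    return _real_print(*args, **kwargs)

builtins.print = traced_print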
On every trace event, the Python client emits a MessagePack record, containing minimal event information and a timestamp, into a per-thread memory buffer. When the buffer fills up (every 128 KB), it is flushed to the server over the shared socket and the client goes on with its work. The server listens to each client asynchronously, quickly draining their traces into separate buffers to avoid blocking them. A thread corresponding to each client can then parse every trace event and convert it into the final format. Once all connected clients exit, the per-thread logs are merged into one complete profile log. Finally, everything is written to a file, which can later be loaded into the Firefox Profiler.
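A minimal sketch of that client-side buffering, assuming the msgpack package and an already-connected sock (names are illustrative, not FunctionTrace's actual internals):

import msgpack

BUFFER_LIMIT = 128 * 1024  # flush threshold from the text: 128 KB
buffer = bytearray()

def record_event(event, timestamp):
    # Append a compact MessagePack record; flush in large batches so the
    # hot path rarely touches the socket.
    buffer.extend(msgpack.packb((event, timestamp)))
    if len(buffer) >= BUFFER_LIMIT:
        sock.sendall(buffer)
        buffer.clear()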
Lessons learned
A Python C module gives you significantly more power and performance, but it comes at a cost: more code is required, good documentation is harder to find, and few utilities are readily available. Still, C modules seem to be an underutilized tool for writing high-performance Python modules; I say this based on some of the FunctionTrace profiles I've seen. We recommend a balance: write most of the non-performance-critical code in Python, and call into C for the inner loops and setup code in the parts of your program where Python doesn't shine.
JSON encoding and decoding can be incredibly slow when human readability isn't required. We switched to MessagePack for client-server communication and found it just as easy to work with, while cutting some benchmark times in half!
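A hedged micro-benchmark sketch of that comparison for a trace-like record (numbers will vary by machine; this is illustrative, not the benchmark we ran):

import json, timeit
import msgpack

record = ("call", 123456789)
print("json:   ", timeit.timeit(lambda: json.dumps(record).encode(), number=100_000))
print("msgpack:", timeit.timeit(lambda: msgpack.packb(record), number=100_000))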
Supporting multi-threaded profiling in Python is quite difficult, and it's understandable why it wasn't a key feature of earlier Python profilers. It took several different approaches and plenty of segfaults before we had a good grasp of how to work with the GIL while maintaining high performance.