How to write neat machine learning pipelines

Hello, Habr.



The topic of pipelining and parallelizing machine learning has been part of our work for a long time. In particular, we wonder whether a specialized book with an emphasis on Python is enough for it, or whether broader and possibly more complex literature is needed. We decided to translate an introductory article on machine learning pipelines that covers both architectural and more applied considerations. Let's discuss whether searching in this direction is worthwhile.





Have you ever written a machine learning pipeline that took a long time to run? Or worse: have you reached the stage where you need to save intermediate parts of the pipeline to disk so that you can study its stages one by one, relying on checkpoints? Or worse still: have you ever tried to refactor such ugly machine learning code before putting it into production, only to find that it took months? Yes, everyone who has worked on machine learning pipelines for long enough has dealt with this. So why not build a good pipeline that gives us enough flexibility and makes it easy to refactor the code for later shipping to production?



First, let's define a machine learning pipeline and discuss the idea of using checkpoints between pipeline stages. Then we'll see how to implement such checkpoints so that you don't shoot yourself in the foot when moving the pipeline to production. We will also discuss data streaming and the trade-offs around object-oriented programming (OOP) encapsulation that you face when specifying hyperparameters in pipelines.



WHAT ARE PIPELINES?



A pipeline is a sequence of data-transformation steps. It follows the old pipe-and-filter design pattern (think of unix bash commands with "|" pipes or ">" redirect operators). However, pipelines are objects in code. So you can have a class for each filter (that is, for each pipeline stage), plus another class that combines all of these stages into a finished pipeline. Some pipelines can combine other pipelines in series or in parallel, have many inputs or outputs, and so on. It is convenient to think of machine learning pipelines as:



  • Pipe and filters. The pipeline stages process data, and they manage their internal state, which can be learned from the data.
  • Composites. Pipelines can be nested: a whole pipeline can itself be treated as a single stage inside a larger pipeline. A pipeline stage is not necessarily a pipeline, but a pipeline is, by definition, at least a pipeline stage itself.
  • Directed acyclic graphs (DAG). The output of one stage can be sent to several other stages, and the resulting outputs can then be recombined, and so on. A side note: even though pipelines are acyclic, they can process items one by one, and if their state changes (for example, by calling fit_transform each time), they can be viewed as unfolding recurrently through time, keeping their state between items (think of an RNN). Seen this way, pipelines are a natural fit for online learning when you put them into production and keep training them on new data.








Pipeline methods

Pipelines (or pipeline stages) must have the following two methods:



  • " Fit " for training on data and acquiring a state (eg, such a state is the weights of a neural network)
  • β€œ Transform ” (or β€œpredict”) for actually processing the data and generating a prediction.
  • Note: If a pipeline stage does not require one of these methods, then the stage can inherit from NonFittableMixin or NonTransformableMixin, which will provide an implementation of one of these methods by default so that it does nothing.
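To make this concrete, here is a minimal sketch of a pipeline stage with these two methods. The class below is purely illustrative (it is not the exact base class or API of any particular library), and it learns a simple per-column mean as its "state":

import numpy as np


class MeanCenteringStep:
    """A minimal pipeline stage: 'fit' learns state, 'transform' uses it."""

    def __init__(self):
        self.means_ = None  # learned state, analogous to a model's weights

    def fit(self, data_inputs, expected_outputs=None):
        # Learn the per-column mean from the training data.
        self.means_ = np.mean(data_inputs, axis=0)
        return self

    def transform(self, data_inputs):
        # Use the learned state to process new data.
        return data_inputs - self.means_


step = MeanCenteringStep().fit(np.array([[1.0, 2.0], [3.0, 4.0]]))
print(step.transform(np.array([[2.0, 3.0]])))  # [[0. 0.]]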




The following methods can also optionally be defined in pipeline stages:



  • "fit_transform", which fits and then transforms the data in a single pass; this allows for code optimizations when the two methods have to be executed one directly after the other.
  • "setup", which is called on each pipeline stage before fitting. For example, if a stage contains a TensorFlow, PyTorch, or Keras neural network, it can build its neural graph and register it to work with the GPU in the "setup" method (see the sketch after this list). It is not recommended to build graphs directly in the stage constructors, for several reasons: before anything runs, the stages may be copied many times with different hyperparameters as part of an automatic machine learning algorithm that searches for the best hyperparameters for you.
  • "teardown", which is functionally the opposite of "setup": it releases the resources.
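Here is a hedged sketch of what deferring heavy resource creation to "setup" can look like. The class and method bodies are illustrative placeholders, not real TensorFlow/PyTorch/Keras code:

class NeuralNetworkStep:
    """Illustrative stage that defers expensive resource creation to setup()."""

    def __init__(self, hidden_size=128):
        # Only store lightweight hyperparameters here: the stage may be
        # copied many times (e.g., by an AutoML search) before being used.
        self.hidden_size = hidden_size
        self.model = None

    def setup(self):
        # Create the heavy resources (neural graph, GPU session, ...) lazily,
        # right before fitting, instead of in __init__.
        self.model = self._build_model()
        return self

    def teardown(self):
        # Release the resources acquired in setup().
        self.model = None
        return self

    def _build_model(self):
        # Placeholder standing in for a real model build.
        return {"hidden_size": self.hidden_size, "weights": None}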




Methods for hyperparameter control are also provided by default.
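As a rough illustration (the method and attribute names below are assumptions made for the sketch, not necessarily the exact API of any given library), hyperparameter control usually means getters and setters for both the current values and the space they are sampled from:

from scipy.stats import uniform


class HyperparameterMixin:
    """Illustrative default hyperparameter accessors for a pipeline stage."""

    def __init__(self, hyperparams=None, hyperparams_space=None):
        self.hyperparams = hyperparams or {}
        self.hyperparams_space = hyperparams_space or {}

    def get_hyperparams(self):
        # Current hyperparameter values of this stage.
        return dict(self.hyperparams)

    def set_hyperparams(self, hyperparams):
        # Overwrite hyperparameter values, e.g. during an AutoML search.
        self.hyperparams.update(hyperparams)
        return self

    def get_hyperparams_space(self):
        # Statistical distributions from which values can be sampled.
        return dict(self.hyperparams_space)

    def set_hyperparams_space(self, hyperparams_space):
        self.hyperparams_space.update(hyperparams_space)
        return self


step = HyperparameterMixin(
    hyperparams={"learning_rate": 0.01},
    hyperparams_space={"learning_rate": uniform(loc=0.001, scale=0.1)},
)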







Pipeline re-fitting, mini-batching, and online learning



For algorithms that use mini-batching, such as training deep neural networks (DNNs), or algorithms that learn online, such as reinforcement learning (RL), it is ideal for pipelines or their stages to allow several fit calls to be chained one right after another, adjusting to the mini-batch sizes on the fly. Some pipelines and pipeline stages support this, but in others the fit achieved so far may be reset simply because the "fit" method is called again. It all depends on how you programmed your pipeline stage. Ideally, a pipeline stage should only be reset after "teardown" is called and then "setup" is called again before the next fit, and its data should not be flushed between fits or during a transform.
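A hedged sketch of such a re-fittable stage, with a simple running statistic standing in for a real neural network (all names are illustrative):

import numpy as np


class RunningMeanStep:
    """Illustrative stage whose repeated fit() calls refine the fit with each
    new mini-batch instead of resetting it."""

    def __init__(self):
        self.setup()

    def setup(self):
        self.count_ = 0
        self.mean_ = None

    def teardown(self):
        # Only here is the learned state flushed.
        self.count_ = 0
        self.mean_ = None

    def fit(self, data_inputs):
        batch = np.asarray(data_inputs, dtype=float)
        batch_mean = batch.mean(axis=0)
        new_count = self.count_ + len(batch)
        if self.mean_ is None:
            self.mean_ = batch_mean
        else:
            # Incrementally update the mean: state is kept between fits.
            self.mean_ += (batch_mean - self.mean_) * (len(batch) / new_count)
        self.count_ = new_count
        return self

    def transform(self, data_inputs):
        return np.asarray(data_inputs, dtype=float) - self.mean_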



USING CHECKPOINTS IN PIPELINES



Using checkpoints in pipelines is good practice, right up until you need to reuse that code for other purposes and to change the data. If you don't use the right abstractions in your code, you may be shooting yourself in the foot.



Advantages of using checkpoints in pipelines:



  • Checkpoints can speed up the workflow if the stages you are programming and debugging are in the middle or at the end of the pipeline. They remove the need to recompute the first pipeline stages every time.
  • Checkpoints can also speed up hyperparameter tuning: when the hyperparameters (and code) of the earlier stages remain unchanged, their cached intermediate results can be reused across trials instead of being recomputed; when they change, the cache should be invalidated and those stages re-run.
  • Perhaps you have limited computing resources, and the only viable option is to run one stage at a time on the available hardware. You can set a checkpoint, then add a few more stages after it, and the data will be picked up from where you left off when you re-execute the whole structure.




Disadvantages of using checkpoints in pipelines:



  • Checkpointing uses the disk, so if you do it wrong, it can slow your code down. To speed things up, you can at least use a RAM disk or mount the cache folder into your RAM.
  • It can take a lot of disk space, or a lot of RAM when using a directory mounted into RAM.
  • The state stored on disk is harder to manage: your program carries the extra complexity needed to make the code run faster. Note that, from a functional programming perspective, your functions and code are no longer pure, since you have to manage the side effects associated with disk usage. Side effects associated with managing the state on disk (your cache) can become a breeding ground for all sorts of weird bugs.



Some of the most difficult bugs in programming are known to arise from cache invalidation issues.



There are only two hard things in Computer Science: cache invalidation and naming things. - Phil Karlton




Advice on properly managing state and cache in pipelines



It is well known that programming frameworks and design patterns can be constraining, for the simple reason that they impose certain rules. Hopefully this is done to keep your code-management tasks as simple as possible, so that you avoid mistakes and your code doesn't end up messy. Here are my two cents on design in the context of pipelines and state management:



PIPELINE STAGES SHOULD NOT MANAGE THE CHECKPOINTING OF THE DATA THEY PRODUCE




Instead, checkpointing should be managed by a dedicated pipelining library that can do all of this for you.



Why?



Why shouldn't pipeline stages control the placement of checkpoints in the data they produce? For the same good reasons you use a library or framework instead of reproducing its functionality yourself:



  • You will have a simple toggle switch that makes it easy to fully activate or deactivate checkpointing before deploying to production.
  • Caching and cache invalidation are handled for you: when the data, the source code, or the hyperparameters of a stage change, the corresponding checkpoints are invalidated and recomputed instead of being silently reused. That is exactly the class of cache bugs that is hardest to hunt down by hand.
  • The input/output (I/O) work of serializing and reloading data is kept out of your pipeline stages, so the stages themselves stay free of disk-related side effects. Do you really want every stage to reinvent its own saving format?
  • Cached results are named and laid out on disk (or on a RAM disk) in a consistent way, so you can resume from where you left off instead of restarting the whole pipeline after a crash or when working on limited hardware.
  • When many trials are run with different hyperparameters, the stages whose hyperparameters have not changed can share their cached results, which dramatically speeds up hyperparameter tuning.
  • Finally, when it is time to ship to production, you can drop checkpointing entirely without refactoring the stages themselves, because they never knew about it in the first place.




That's cool. With the right abstraction, you can now program machine learning pipelines so that the hyperparameter tuning phase speeds up dramatically: you cache the intermediate result of each trial and skip the pipeline stages again and again whenever the hyperparameters of those intermediate stages remain unchanged. Moreover, when you are ready to release the code to production, you can immediately turn caching off entirely, instead of spending a whole month refactoring the code to do so.
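As a rough illustration of the single-toggle idea, here is how the same thing can be approximated with scikit-learn's built-in transformer caching (this is plain scikit-learn, not the dedicated checkpointing abstraction discussed above):

from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

USE_CACHE = True  # single switch: flip to False before shipping to production

pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    # When enabled, fitted transformers are cached on disk and reused instead
    # of being re-fit when the same transformer is fit again with the same
    # parameters and data (useful during hyperparameter search).
    memory=mkdtemp() if USE_CACHE else None,
)

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
pipeline.fit(X, y)

The point is the single switch: experimentation runs with the cache on, and production runs with it off, without touching the stages themselves.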



Don't hit this wall.



STREAMING DATA IN MACHINE LEARNING PIPELINES



Parallel processing theory tells us that pipelines are a data streaming tool that lets you parallelize the pipeline stages. The laundry example illustrates both the problem and its solution well. For instance, the second stage of the pipeline can start processing partial output from the first stage while the first stage is still computing new data; for the second stage to work, it is not necessary for the first stage to finish processing all of the data. Let's call these special pipelines streaming pipelines (see here and here).



Don't get me wrong: working with scikit-learn pipelines is very enjoyable. But they are not designed for streaming. Not just scikit-learn: most existing pipeline libraries don't take advantage of streaming even when they could. There are multithreading issues across the Python ecosystem. In most pipeline libraries, each stage is completely blocking and requires all of the data to be transformed at once. Only a few streaming libraries are available.



Activating streaming can be as simple as using a StreamingPipeline class instead of Pipeline to chain the stages one after another, specifying the mini-batch size and the queue size (to avoid excessive RAM consumption and ensure more stable behavior in production). Ideally, such a structure also requires multithreaded queues with semaphores, as described in the producer-consumer problem, to pass information from one pipeline stage to the next.
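A minimal sketch of that producer-consumer idea with a bounded queue between two stages, using plain Python threading rather than any particular pipeline library's API:

import queue
import threading

SENTINEL = object()  # signals that the upstream stage is done


def stage_one(outputs: queue.Queue):
    # Producer: emits mini-batches as soon as they are ready.
    for start in range(0, 100, 10):
        outputs.put(list(range(start, start + 10)))
    outputs.put(SENTINEL)


def stage_two(inputs: queue.Queue, results: list):
    # Consumer: starts working on partial data without waiting for stage one
    # to finish processing everything.
    while True:
        batch = inputs.get()
        if batch is SENTINEL:
            break
        results.append([x * 2 for x in batch])


# The bounded queue caps RAM usage: stage one blocks when it gets too far ahead.
q = queue.Queue(maxsize=4)
results = []
producer = threading.Thread(target=stage_one, args=(q,))
consumer = threading.Thread(target=stage_two, args=(q, results))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(results), "mini-batches processed")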



At our company, Neuraxle already manages to do one thing better than scikit-learn: sequential mini-batch pipelines, available through the MiniBatchSequentialPipeline class. For now this is not multithreaded (but that is planned). At a minimum, it already passes data through the pipeline as mini-batches during fitting or transformation before collecting the results, which lets you work with large pipelines just as in scikit-learn, but now with mini-batching, as well as many other possibilities, including hyperparameter spaces, setup methods, automatic machine learning, and so on.



Our Parallel Data Streaming Solution in Python



  • The fit and/or transform methods can be called many times in a row to improve the fit with new mini-batches.
  • The data is split into mini-batches, and the mini-batches are passed between the stages through queues, so a stage can start working as soon as the previous stage has produced something.
  • Each stage is serialized and sent to its worker, where its setup method is called before fitting. This matters for stages that wrap, say, a TensorFlow graph, which sits on top of C++ and GPU code rather than plain Python and cannot simply be pickled; by default, joblib-based savers handle ordinary Python objects, and custom savers handle the rest.
  • The queues between the stages are bounded, which limits RAM consumption and keeps the pipeline stable in production even when one stage is temporarily slower than another.
  • At the end of a streaming pipeline, a Joiner collects the results of the mini-batches and joins them back together, so that from the outside the pipeline still behaves like a single fit or transform call on the whole dataset.




Moreover, we want any object in Python to be shareable between workers, which means it must be serializable and reloadable. The code can then be sent dynamically for processing on any worker (which may be another computer or process), even if the necessary code is not already present on that worker. This is done with a chain of serializers specific to each class that embodies a pipeline stage. By default, each stage has a serializer that handles regular Python code, while more intricate stages use custom serializers for code that runs on the GPU or imports code written in other languages. Models are simply serialized using their savers and then reloaded on the worker. If the worker is local, the objects can be serialized to a disk located in RAM, or to a directory mounted in RAM.
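A small sketch of the default-serializer idea for ordinary Python objects, using joblib for illustration; the RAM-backed path below is an assumption and depends on the operating system:

import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A RAM-backed directory such as /dev/shm (Linux) keeps checkpoint I/O fast;
# fall back to an ordinary temporary directory elsewhere.
cache_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
model_path = os.path.join(cache_dir, "step_model.joblib")

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

# Default "saver": plain joblib serialization is enough for ordinary Python objects.
joblib.dump(model, model_path)

# On the worker side, the stage's model is reloaded from the (RAM-backed) path.
reloaded = joblib.load(model_path)
print(reloaded.score(X, y))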



TRADE-OFFS OF ENCAPSULATION



There is one more annoying thing in most pipeline libraries for machine learning: how hyperparameters are handled. Take scikit-learn as an example. Hyperparameter spaces (also known as statistical distributions of hyperparameter values) often have to be specified outside the pipeline, with underscores as separators between the names of the pipeline stages. While Random Search and Grid Search let you explore hyperparameter grids or hyperparameter probability spaces defined with scipy distributions, scikit-learn itself does not provide default hyperparameter spaces for each classifier and transformer. That responsibility could instead be given to each of the objects in the pipeline. The object would then be self-sufficient and contain its own hyperparameters. This does not violate the single-responsibility principle, the open/closed principle, or the SOLID principles of object-oriented programming.
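For contrast, here is roughly what the scikit-learn approach described above looks like, with the hyperparameter distribution declared outside the pipeline and addressed through the stage name and a double-underscore separator:

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scikit-learn way: the hyperparameter distribution lives outside the
# pipeline and is addressed as "<stage name>__<parameter name>".
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=5,
    random_state=0,
)

X, y = make_classification(n_samples=200, random_state=0)
search.fit(X, y)
print(search.best_params_)

A self-contained stage would instead carry the same distribution as its own attribute (as in the hyperparameter sketch earlier), so it can be moved between pipelines without renaming anything.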



COMPATIBILITY AND INTEGRATION



When coding machine learning pipelines, it is useful to keep in mind that they must be compatible with many other tools, in particular scikit-learn, TensorFlow, Keras, PyTorch, and many other machine learning and deep learning libraries.

For example, we wrote a .tosklearn() method that turns pipeline stages, or an entire pipeline, into a BaseEstimator, the base object of the scikit-learn library. For other machine learning libraries, the task comes down to writing a new class that inherits from our BaseStep and overriding, in library-specific code, the fit and transform operations and possibly setup and teardown. You also need to define a saver that will save and load your model. Here is the documentation for the BaseStep class, with examples.
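A hedged sketch of what such a wrapper stage could look like; the real library's BaseStep and saver mechanics are only approximated here with a plain class:

import joblib


class WrappedEstimatorStep:
    """Illustrative wrapper turning an external estimator into a pipeline stage."""

    def __init__(self, wrapped_estimator):
        self.wrapped = wrapped_estimator

    def fit(self, data_inputs, expected_outputs):
        self.wrapped.fit(data_inputs, expected_outputs)
        return self

    def transform(self, data_inputs):
        # "transform" plays the role of "predict" for a final estimator.
        return self.wrapped.predict(data_inputs)

    # A minimal "saver": how the stage persists and reloads its model.
    def save(self, path):
        joblib.dump(self.wrapped, path)
        return self

    def load(self, path):
        self.wrapped = joblib.load(path)
        return self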



CONCLUSION



To summarize, machine learning pipeline code that is ready to go to production has to meet many quality criteria, and those are quite achievable if you adhere to the right design patterns and structure your code well. Note the following:



  • In machine learning code, it makes sense to use pipelines and to define each pipeline stage as an instance of a class.
  • The whole structure can then be optimized with checkpoints, which help to find the best hyperparameters and to run the same code on the same data repeatedly (but possibly with different hyperparameters or modified source code).
  • Checkpoints can be stored on a RAM disk or in a directory mounted into RAM to keep disk I/O from slowing the pipeline down; caching and cache invalidation are best left to the pipelining library rather than to the stages themselves.
  • Finally, to stay compatible with other machine learning libraries, it is enough to inherit from a base class such as BaseStep, override the fit and transform operations, and define a saver for your model.


