Industrial Machine Learning: 10 Design Principles




New services and applications appear every day that let you build incredible things: from software that controls a SpaceX rocket to a smartphone app that talks to the kettle in the next room.



Sooner or later, every novice programmer, whether a passionate startup founder, a full-stack developer, or a data scientist, comes to realize that there are rules of programming and software development that greatly simplify life.



In this article, I will briefly describe 10 principles for building industrial machine learning so that it can be easily integrated into an application or service, based on the 12-factor app methodology proposed by the Heroku team. My goal is to raise awareness of this methodology, which can help many developers and data science practitioners.



This article is the prologue to a series on industrial machine learning. In later articles I will show how to actually build a model and run it in production, how to create an API for it, and walk through examples from various fields and companies that have embedded ML in their systems.



Principle 1. One code base



In the early stages, some programmers, out of laziness (or for other reasons), forget about Git. Either they forget about it entirely, passing files to each other over a shared drive or by message, or they fail to think through their workflow, each committing to their own branch and then straight to master.



This principle says: have one codebase and many deployments.



Git is useful both in production and in research and development (R&D), where it tends to be used less often.



For example, in the R&D phase you can keep commits with different data-processing methods and models, then pick the best one and easily continue working with it.



In production, Git is indispensable: you will constantly need to see how your code has changed, know which model gave the best results, which code actually ran, and why it stopped working or started producing incorrect results. That is what commits are for!



You can also package your project, publish it (for example, on Gemfury), and then simply import functions from it in other projects instead of rewriting them a thousand times, but more on that later.



Principle 2. Clearly declare and isolate dependencies



Each project uses libraries that you import from outside in order to apply them somewhere. Whether they are Python libraries, libraries for other languages, or system tools, your task is:



  • Declare every dependency, with its version, in a manifest (for Python, typically requirements.txt).
  • Isolate the dependencies of each project (for example, in a virtual environment) so that nothing leaks in from the surrounding system.



Developers who join your team later will then be able to quickly see which libraries and versions the project uses, and you will be able to control exactly which libraries and versions are installed for a specific project, which helps you avoid incompatibilities between libraries or their versions.



Your app also should not rely on system tools that happen to be installed on a particular OS. These tools must be declared in the dependency manifest as well, to avoid situations where the version of a tool (or its very presence) differs from one system to another.



So even if curl is available on almost every machine, you should still declare it as a dependency: when you migrate to another platform, it may be missing, or the version may not be the one you need.



For example, your requirements.txt might look like this:



# Model Building Requirements
numpy>=1.18.1,<1.19.0
pandas>=0.25.3,<0.26.0
scikit-learn>=0.22.1,<0.23.0
joblib>=0.14.1,<0.15.0

# testing requirements
pytest>=5.3.2,<6.0.0

# packaging
setuptools>=41.4.0,<42.0.0
wheel>=0.33.6,<0.34.0

# fetching datasets
kaggle>=1.5.6,<1.6.0


Principle 3. Configurations



Many have heard stories of developers accidentally pushing code to public GitHub repositories with passwords and AWS keys inside, then waking up the next day to a bill of $6,000, or even $50,000.







Of course, these cases are extreme, but very revealing. If you store credentials or other configuration data inside your code, you are making a mistake, and I think there is no need to explain why.



The alternative is to store configuration in environment variables.



Examples of data that is usually stored in environment variables:



  • Domain names
  • API URLs / URIs
  • Public and private keys
  • Contacts (mail, phones, etc.)


This way, you don't have to change the code all the time if your configuration variables change. This will save you time, effort and money.



For example, if you use the Kaggle API in your tests (say, to download data and run the model to verify that it works at startup), then private Kaggle keys such as KAGGLE_USERNAME and KAGGLE_KEY should live in environment variables.
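A sketch of reading such credentials from the environment; the helper function and the demo fallback values are made up for the example (real values come from the shell or your CI service's settings):

```python
import os
from typing import Optional

def get_credential(name: str, default: Optional[str] = None) -> str:
    """Read a secret from the environment instead of hard-coding it in source."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# For the demo only: provide values if the variables are not already set.
os.environ.setdefault("KAGGLE_USERNAME", "demo_user")
os.environ.setdefault("KAGGLE_KEY", "demo_key")

username = get_credential("KAGGLE_USERNAME")
key = get_credential("KAGGLE_KEY")
```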



Principle 4: Third Party Services



The idea here is to design the program so that, as far as the code is concerned, there is no distinction between local and third-party resources. For example, you can connect either a local MySQL database or a third-party one, and the code does not change. The same goes for various APIs such as Google Maps or the Twitter API.



To disconnect one third-party service or connect another, you only need to change the keys in the configuration, that is, in the environment variables I discussed above.



So, for example, instead of hard-coding the path to dataset files inside the code, it is better to use the pathlib library and declare the dataset paths in config.py. Then, whatever service you run on (for example, CircleCI), the program can resolve the dataset paths against the file system of the new environment.
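A config.py along those lines might look like this (the directory and file names are placeholders):

```python
from pathlib import Path

# Resolve the project root from this file's location rather than hard-coding
# an absolute path, so the same code works locally and on a CI service.
try:
    PACKAGE_ROOT = Path(__file__).resolve().parent
except NameError:  # fallback for interactive sessions where __file__ is unset
    PACKAGE_ROOT = Path.cwd()

DATASET_DIR = PACKAGE_ROOT / "datasets"
TRAINING_DATA_FILE = DATASET_DIR / "train.csv"
TESTING_DATA_FILE = DATASET_DIR / "test.csv"
```

Any module in the project then imports these constants instead of building its own paths.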



Principle 5. Build, release, runtime



Many people in data science benefit from learning software engineering skills. If we want our program to crash as rarely as possible and to run smoothly for as long as possible, we need to divide the release of a new version into three stages:



  1. Build. The code, together with its dependencies, is converted into an executable bundle (a build).
  2. Release. The build is combined with the current config, producing a release with a unique version that is ready to run.
  3. Runtime. The release is executed as one or more processes in the target environment.


Such a release system for new versions of a model, or of the whole pipeline, allows a division of roles between administrators and developers, makes versions traceable, and prevents unwanted downtime.



For the release stage, many services exist in which you describe the processes to run in a .yml file (in CircleCI, for example, this is config.yml). The wheel format is great for building packages for your projects.
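As an illustration, a minimal CircleCI config.yml might look like this; the Docker image tag, the tests/ path, and the exact commands are assumptions for the sketch:

```yaml
# Hypothetical CircleCI 2.x pipeline: install dependencies, test, then package
version: 2
jobs:
  build:
    docker:
      - image: circleci/python:3.8
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: pytest tests/
      - run: python setup.py sdist bdist_wheel
```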



You can build packages with different versions of your machine learning model, then reference the required package versions in order to use the functions you wrote there. This helps you create an API for your model, and the package can be hosted, for example, on Gemfury.



Principle 6. Run your model as one or more processes



Moreover, the processes should not share data. Each process must exist on its own, and all data must live separately, for example in third-party services such as MySQL or others, depending on what you need.



In other words, do not store data inside the process's file system: otherwise it can be wiped during the next release, a configuration change, or a migration of the system the program runs on.



There is one exception: for machine learning projects, you may cache installed libraries so that you do not reinstall them for every new version when no libraries or versions have changed. This shortens the time it takes to launch your model in production.



To run the model as several processes, you can create a .yml file in which you just indicate the necessary processes and their sequence.



Principle 7. Disposability



The processes that run your model in the application should be easy to start and stop. This lets you deploy code and configuration changes quickly, scale quickly and flexibly, and prevents the working version from breaking.



That is, your process with a model should:



  • Minimize startup time. Ideally, startup (loading the model, opening connections, and so on) takes no more than a few seconds; a short start makes releases and scaling faster.
  • Shut down gracefully. On a SIGTERM signal, the process should stop accepting new work, finish its current job, and exit. DevOps practice also demands robustness against sudden death: if the process is killed without warning, nothing should be lost or corrupted!
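A graceful stop can be sketched in Python with a SIGTERM handler; the worker loop below and its "inference" step (just doubling the input) are placeholders for a real serving process:

```python
import signal

shutdown_requested = False

def handle_sigterm(signum, frame):
    """On SIGTERM, ask the worker to finish its current batch and stop."""
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

def serve(batches):
    """Process batches until done or until a graceful shutdown is requested."""
    processed = []
    for batch in batches:
        if shutdown_requested:
            break  # stop cleanly between batches, never mid-batch
        processed.append(batch * 2)  # placeholder for real model inference
    return processed
```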


Principle 8: Continuous Deployment / Integration



Many companies separate the application development team from the deployment team (the one that makes the application available to end users). This can greatly slow down software development and progress on improving it. It also erodes the DevOps culture, in which development and integration are, roughly speaking, combined.



Therefore, this principle states that your development environment should be as close as possible to your production environment.



This will allow:



  1. Drastically reduce release time.
  2. Reduce the number of errors caused by code incompatibility.
  3. Reduce the overall workload, since the developers and the people deploying the application are now one team.


The tools that allow you to work with this are CircleCI, Travis CI, GitLab CI and others.



You can quickly add to the model, update it, and launch it immediately, and in case of failure it is easy to roll back very quickly to the working version without the end user even noticing. This is especially quick and easy if you have good tests.



Minimize the differences!



Principle 9. Your Logs



Logs are recorded events, usually in text form, that occur inside the application (an event stream). A simple example: "2020-02-02 - system level - process name". They exist so that the developer can literally see what happens while the program is running: watch the processes progress and check that they behave as intended.



This principle says that you should not store logs inside your file system; instead, simply write them to the system's standard output, stdout. That way the stream can be watched in the terminal during development.



Does this mean you should never save logs at all? Of course not. It is just that your application should not be the one doing it; leave that to third-party services. Your application may only redirect logs to a file or to the terminal for live viewing, or forward them to a general-purpose storage system (such as Hadoop). The application itself should not store logs or interact with them.
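With Python's standard logging module, sending the event stream to stdout takes only a few lines; the logger name and format string below are arbitrary choices for the example:

```python
import logging
import sys

# Emit logs to stdout only; storing them is the platform's job, not the app's.
logger = logging.getLogger("model_service")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(name)s - %(message)s")
)
logger.addHandler(handler)

logger.info("model loaded")
```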



Principle 10. Test!



For industrial machine learning, this phase is extremely important, as you need to understand that the model is working correctly and gives what you want.



Tests can be written with pytest and run against a small dataset if you have a regression or classification task.



Do not forget to fix the same seed for deep learning models, so that they do not produce different results on every run.
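The idea can be shown with a toy "training" function that involves randomness; pinning the seed makes two runs identical, so an assertion on the output will not be flaky (the function and the seed value are made up for the example):

```python
import random

SEED = 42  # any fixed value works; what matters is that it never changes

def train_with_noise(data, seed=SEED):
    """Toy stand-in for a training step that involves randomness."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, 0.01) for x in data]

# Two runs with the same seed give identical results.
run_one = train_with_noise([1.0, 2.0, 3.0])
run_two = train_with_noise([1.0, 2.0, 3.0])
```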



This was a short description of the 10 principles. Of course, it is hard to use them without trying them and seeing how they work, so this article is just a prologue to a series of interesting articles in which I will show how to create industrial machine learning models, how to integrate them into systems, and how these principles can make life easier for all of us.



I will also try to cover any cool principles that readers leave in the comments.


