Our primary focus as Data Scientists is processing data and developing and improving machine learning models. There is a widespread opinion that data processing is the most time-consuming stage of the entire project, and that the model's accuracy determines the success of the data product. However, the industry is now in a transition "from the era of discovery to the era of implementation" (AI Superpowers: China, Silicon Valley, and the New World Order by Kai-Fu Lee). The picture is widening, and the focus is shifting from building a model to delivering the model to users as a service, and from the model's raw performance to its business value. The best-known example here is Netflix, which never used the winning models from its $1 million prize competition because of engineering costs (Netflix Never Used Its $1 Million Algorithm Due To Engineering Costs, WIRED), despite the significant performance gains those engines promised.
From understanding to reality (slides from the Strata Data conference talk Kubeflow explained: Portable machine learning on Kubernetes)
The implementation of the model is extremely important, and data products can now be considered software products, because they have a similar project structure, management, and life cycle. Therefore, we are entitled to use all the well-known techniques from software development to deploy machine learning models to production.
Containerization is a technique widely used to deploy software products both to cloud platforms and to on-premises servers. Essentially, it means packaging code and its dependencies into a box called a container. Here is a definition of a container in the context of software development:
From the Docker site:
A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
Docker is a platform that can help you accelerate the development, containerization, and deployment of your machine learning model into other environments. In this series of articles, I'll show you how to store a trained model, use it as an API endpoint, containerize your ML application, and run it on the Docker engine.
Question one: "Why Docker?"
Before we start, you will need to register with a Docker ID if you don't have one, and then use that ID to download and install Docker on your machine.
When I first started my job at a bank, I was assigned a project involving data processing, and the first MVP (minimum viable product) had to be delivered within a month. That sounds stressful, but our team uses the Agile methodology for all major products, and the main goal of this MVP was to test the hypothesis about the product's usefulness and effectiveness (for more on Agile, see Eric Ries's book The Lean Startup). My manager wanted me to deploy my model on his laptop, that is, run it there and use it for predictions.
If you imagined all the steps that I needed to take to prepare the manager's laptop to run my project, then you might have many questions, such as:
- What operating system does the model need to run on, given that he uses both a MacBook and a ThinkPad? I could, of course, ask him, but suppose that at that moment in my life my boss was very nasty and did not want me to have this information. (This thought experiment is here to highlight the operating-system dependency issue; my boss is actually a really good person.)
- Second question: "Does he have Python installed?" If so, which version, 2 or 3? Which one: 2.6, 2.7 or 3.7?
- What about required packages like scikit-learn, pandas, and numpy? Does he have the same versions that I have on my machine?
With all these questions in mind, here is what I would have had to do on his computer to get my model running on it.
- Install Python.
- Install all packages.
- Set up environment variables.
- Transfer the code to the machine.
- Run the code with the required parameters.
All of these steps take a lot of effort, and there is a risk of incompatibility whenever the code runs in a different environment.
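Those manual steps are exactly what a Dockerfile automates: it records them once so any machine with Docker can reproduce the same environment. Below is a minimal sketch, not the actual Dockerfile behind the image used in this article; the file names main.py and requirements.txt and the base image tag are assumptions for illustration.

```dockerfile
# Base image with Python preinstalled -- no manual Python setup on the target machine
FROM python:3.7-slim

WORKDIR /app

# Install pinned package versions, so everyone gets the same scikit-learn/pandas/numpy
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the application code and the stored model into the image
COPY main.py model.pkl ./

# Run the code with the required parameters
CMD ["python", "main.py"]
```

Building this file with `docker build` bakes the whole environment into an image, which is what lets a single `docker run` replace the five-step checklist above.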
So, if you already have Docker installed and running, you can open a terminal and run the following command:
docker run --rm -p 5000:5000 datascienceexplorer/classifier
After a couple of minutes, you should see something similar in your terminal:
* Serving Flask app "main" (lazy loading)
* Environment: production
WARNING: Do not use the development server in a production environment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
Now open your favorite browser and go to this address:
http://localhost:5000/apidocs/
Click on the predict row of the API and then on the "Try it out" button on the right; the interface will look like this:
Swagger page for the API on the backend
Remember the classic Iris flowers dataset you have played with? This little application predicts the species of a flower from a few measurements using a classification model. In fact, you are already using my machine learning model as a service, and all you installed was Docker: I didn't need you to install Python or any packages on your machine.
This is the strength of Docker. It helps me solve dependency problems so that I can quickly deploy my code across different environments, or in this case, your machine.
DevOps Data Science
Now, hopefully, I've motivated you enough to keep reading. If you just want to skip this part and go straight to the code, that's fine, because it means you want to containerize your machine learning model with Docker and expose it as a service. However, for now let's set the material about machine learning and Docker aside for a moment and think about DevOps in Data Science and why it is needed there at all.
What is DevOps?
From Wikipedia:
DevOps is a set of practices that combines software development and IT operations, with the goal of shortening the system development life cycle and providing continuous delivery of high-quality software.
The goal of software developers is to deliver code with all the required functionality on time, while usability, reliability, scalability, networking, firewalls, infrastructure, and so on often remain operations problems. Because of differing end goals and, most likely, KPIs, these teams usually don't get along under one roof. A DevOps specialist can therefore act as a liaison and help these teams work together, or even take on the responsibilities of both sides, so that in the end you have one team leading development from start to finish. After all, you can't just hand your computer to the client just because the code runs fine on it.
But I'm happy with my Jupyter notebook!
Data Scientists have a similar story: again, you can't just hand the client your laptop running Jupyter Notebook and tell them to use it. We need a way to serve the model so that it can handle a large number of users anytime, anywhere, and stay up with minimal downtime (usability, reliability, scalability).
For this reason, companies look for Data Scientists with DevOps skills who can deploy and maintain their machine learning models in production and deliver business value to the company, rather than just proving concepts and polishing model accuracy. Such people are called unicorns.
There are many ways to deploy a machine learning model, but Docker is a powerful tool that gives you the flexibility you need while keeping your code robust and encapsulated. Of course, we won't ask our customers to install Docker and open a terminal to run it. But this containerization stage will eventually become the foundation when you start working on real projects, where you have to deploy your models to cloud platforms or on-premises servers.
Storing the trained model
Back at university, we were taught that a Data Science project consists of six stages, as shown in the picture below. If automating the model and deploying it to production is our ultimate goal, then how do we "move" the model to the deployment stage?
Six stages of a Data Science project
The easiest approach you can think of is to copy everything from our notebook, paste it into a .py file, and run it. However, every time we need a prediction, we will run this file and retrain the model on the same data. While this might somehow work for simple models with a small training dataset, it is not practical for complex models with a lot of training data (think how long it takes to train an ANN or CNN model). It means that when a user sends the model a request for a prediction, they would have to wait minutes or even hours for the result, because most of that time would be spent in the model-training stage.
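The difference can be sketched with a toy "model" whose training is deliberately the expensive part. The train and predict functions below are hypothetical stand-ins, not scikit-learn code; the point is only the pattern of when training runs.

```python
# Toy illustration: training is expensive, prediction is cheap.

def train():
    # Stand-in for a long model-training step (imagine fitting an ANN/CNN here).
    weights = sum(i * i for i in range(1_000_000))
    return weights

def predict(model, x):
    # Stand-in for a cheap prediction step.
    return (model + x) % 3

# Naive approach: the expensive train() runs on every single request.
def predict_naive(x):
    return predict(train(), x)

# Better: train once at startup, then reuse the in-memory model per request.
MODEL = train()

def predict_fast(x):
    return predict(MODEL, x)

print(predict_naive(1) == predict_fast(1))  # True: same answer, far less work per request
```

Storing the trained object on disk, as described next, extends this "train once, reuse many times" idea across separate program runs.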
How to store the model immediately after it has been trained?
In most cases, a machine learning model in Python lives in memory as a Python object while the code is executing and is destroyed when the program finishes. If we could save that object to disk right after the model is trained, then the next time we need a prediction we could simply load the finished model into memory and skip the initialization and training stages. In computer science, the process of converting an object into a stream of bytes for storage is called serialization. In Python, this is easily done with the pickle package, which is part of the standard library; Python developers call the process of serializing an object with pickle "pickling".
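As a minimal illustration of serialization, here is a pickle round trip using a plain dictionary as a stand-in for a trained model object; the mechanics are identical for any picklable Python object, including a fitted scikit-learn estimator.

```python
import pickle

# A stand-in for a trained model object (any picklable object works the same way).
model = {"name": "knn", "n_neighbors": 3}

# Serialize ("pickle") the object into a byte stream.
data = pickle.dumps(model)

# Deserialize ("unpickle") the bytes back into an equivalent object.
restored = pickle.loads(data)

print(restored == model)  # True: the restored object is equal to the original
```

With `pickle.dump`/`pickle.load` the same byte stream can go to and from a file on disk, which is exactly what the notebook snippet below does.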
In a Jupyter notebook, you can easily save the model object (in my case "knn") to a .pkl file in the same directory as the code:
import pickle

# Serialize the trained model to model.pkl in the current directory
with open('./model.pkl', 'wb') as model_pkl:
    pickle.dump(knn, model_pkl)
Saving the model to the current directory
I recommend taking my notebook from here, so that we get similar results later on. Alternatively, you can use your own model, but make sure you have all the required packages as well as the correct model inputs.
That completes the first step: you have saved the trained model. Next we will reuse the model for prediction, but more on that in the second part of the article.