In this post, I put together a checklist that I constantly refer to while working on a comprehensive machine learning project.
Why do I need a checklist at all?
Since you have to deal with numerous elements of a project (data preparation, questions, models, tweaks, etc.), it's easy to lose track. A checklist guides you through the next steps and nudges you to check whether each task was completed successfully.
When you are struggling to find a starting point, a checklist helps you extract the right information (data) from the right sources, establish relationships, and uncover correlations.
It is good practice to run every part of the project through this kind of review.
As Atul Gawande says in his book "The Checklist Manifesto",
the volume and complexity of what we know has exceeded our individual ability to deliver its benefits correctly, safely, or reliably.
So let me walk you through this clear and concise list of actions that will reduce your workload and improve your results.
Machine Learning Projects Checklist
Here are 8 steps you should follow in almost every project. Some of the steps can be performed in a different order.
1. Define the problem from a high-level perspective
This is about understanding and formulating the business logic of the problem. It should tell you:
- the nature of the problem (supervised / unsupervised, classification / regression),
- the type of solutions you can develop,
- the metrics you should use to measure performance,
- whether machine learning is the right approach to solving the problem,
- the manual approach to solving the problem,
- the inherent prerequisites (assumptions) of the problem.
2. Define data sources and get data
In most cases, this step can come before the previous one if you already have the data and want to formulate questions (problems) around it in order to make better use of it.
Based on your problem definition, you will need to identify the data sources, which can be a database, a data warehouse, sensors, and so on. To deploy the application to production, this step must be automated with data pipelines that feed incoming data into the system; a minimal sketch follows the list below.
- list the sources and the amount of data you need.
- check whether the location of the data will be a problem.
- check whether you are allowed to use the data for your purposes.
- get the data and convert it into a workable format.
- check the data types (text, categorical, numeric, time series, images).
- set aside a sample for final testing.
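As a rough illustration, here is a minimal sketch of this step, assuming tabular data in a CSV file and pandas / scikit-learn as the tooling; the file paths are placeholders, not part of the original checklist:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw data into a workable format (a CSV file is assumed here;
# swap in a database query or an API call for your own source).
df = pd.read_csv("data/raw_data.csv")

# Set aside a held-out test sample right away, so the final evaluation
# is never influenced by exploration or model selection.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

train_df.to_csv("data/train.csv", index=False)
test_df.to_csv("data/test.csv", index=False)
```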
3. Initial data exploration
At this stage, you study all the features that affect your result / forecast / target. If you have a huge amount of data, work with a sample at this step to keep the analysis manageable; a short exploration sketch follows the list below.
Steps:
- use Jupyter Notebook, as it provides a simple and intuitive interface for exploring data.
- define the target variable
- define feature types (categorical, numeric, text, etc.)
- analyze the relationship between features.
- add multiple data visualizations to easily interpret the impact of each feature on the target variable.
- document your research results.
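Below is a small, non-exhaustive exploration sketch. It assumes the training sample saved in the previous step and a numeric column named "target" as the target variable; both names are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path: the training sample saved in step 2.
train_df = pd.read_csv("data/train.csv")

# Basic structure: feature types, non-null counts, summary statistics.
train_df.info()
print(train_df.describe())

# Relationship between the numeric features and the target
# ("target" is a placeholder column name).
numeric = train_df.select_dtypes("number")
print(numeric.corr()["target"].sort_values(ascending=False))

# Quick visual check of each numeric feature's distribution.
train_df.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()
```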
4. Exploratory data analysis for data preparation
It's time to apply the lessons from the previous step by writing functions for data transformation, cleanup, feature selection / engineering, and scaling; a pipeline sketch follows the list below.
- Write functions to transform data and automate the process for upcoming data batches.
- Write functions to clean up data (impute missing values and handle outliers).
- Write functions for feature selection and engineering: remove redundant features, apply format conversions and other mathematical transformations.
- Feature scaling: standardize or normalize features.
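One common way to package these functions is a scikit-learn ColumnTransformer, so the same preprocessing can be reapplied to every new batch of data. This is only a sketch; the column names below are hypothetical and should be replaced with the features of your own dataset:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists; replace with the features from your own dataset.
numeric_features = ["age", "income"]
categorical_features = ["city", "segment"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # impute missing values
    ("scale", StandardScaler()),                    # feature scaling
])

categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# One reusable object that can be applied to any new batch of data.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])
```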
5. Develop a basic model and then explore other models to select the best
Create a very basic baseline model that will serve as a reference point for all other, more complex machine learning models; a comparison sketch follows the list below. Checklist of steps:
- Train several commonly used models, such as Naive Bayes, linear regression, and SVM, using default parameters.
- Measure and compare the performance of each model against the baseline and against the others.
- Use N-fold cross-validation for each model and calculate the mean and standard deviation of the N-fold performance metrics.
- Explore the features that have the greatest impact on the target.
- Analyze the types of errors that models make when predicting.
- Engineer the features differently.
- Repeat the above steps several times (by trial and error) to make sure you are using the right features in the right format.
- Shortlist the top models based on their performance metrics.
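A minimal comparison loop might look like the sketch below. It assumes X_train and y_train are the already-preprocessed feature matrix and target from step 4, and that accuracy is the chosen metric; adjust both assumptions to your problem:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# X_train / y_train are assumed to be the already-preprocessed training data
# from step 4 (dense numeric feature matrix and target vector).
models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

for name, model in models.items():
    # 5-fold cross-validation: report mean and standard deviation per model.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```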
6. Fine-tune your models from the shortlist and check for ensemble methods
This is one of the crucial steps as you get closer to the final solution; a tuning and ensembling sketch follows the list below. Key points include:
- Hyperparameter tuning using cross validation.
- Use auto tuning techniques like random search or grid search to find the best configuration for your top models.
- Test ensemble methods such as a voting classifier.
- Test the models with as much data as possible.
- Once the work is done, evaluate the model on the test sample set aside at the beginning to check how well it generalizes.
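For illustration, here is a sketch of grid search plus a simple voting ensemble with scikit-learn. The parameter grid is illustrative rather than a recommendation, and X_train / y_train / X_test / y_test are assumed to be the prepared training data and the sample set aside in step 2:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid search over a small hyperparameter grid for one shortlisted model;
# the parameter values here are illustrative, not recommendations.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(probability=True), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("best params:", search.best_params_)

# A simple soft-voting ensemble of two shortlisted models.
ensemble = VotingClassifier(
    estimators=[
        ("svm", search.best_estimator_),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Final check on the held-out test sample from step 2.
print("test accuracy:", ensemble.score(X_test, y_test))
```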
7. Document your code and communicate your solution
Communication takes many forms, and you need to keep all existing and potential stakeholders in mind. The main points include:
- Document your code as well as your approach to the entire project.
- Create a dashboard with a tool like Voilà, or an insightful presentation with self-explanatory visuals.
- Write a blog / report on how you analyzed features, tested various transformations, etc. Describe your learning curve (failures and methods that worked)
- Finish with the main results and future scope (if any).
8. Deploy your model to production, monitoring
If your project requires testing the deployment on real data, you should create a web application or a REST API so the model can be used across platforms (web, Android, iOS). Key points (these will vary by project) include the following, with a minimal serving sketch after the list:
- Save your final trained model to an h5 or pickle file.
- Serve your model through a web service; you can use Flask to develop it.
- Connect input sources and set up ETL pipelines.
- Manage dependencies with pipenv and Docker / Kubernetes (depending on scaling requirements).
- You can use AWS, Azure, or Google Cloud Platform to deploy your service.
- Monitor performance on real data, or simply let people use your model with their own data.
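As a minimal illustration of the first two points, the sketch below pickles a trained model (the "ensemble" from step 6 is assumed to be in scope) and serves it through a tiny Flask endpoint; the route name and payload format are made up for the example:

```python
import pickle

from flask import Flask, jsonify, request

# Save the final trained model ("ensemble" from step 6 is assumed here).
with open("model.pkl", "wb") as f:
    pickle.dump(ensemble, f)

app = Flask(__name__)

# Load the saved model once at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON payload with a "features" list in the same order
    # the model was trained on (a made-up contract for this example).
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```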
Note: the checklist can be adapted depending on the complexity of your project.