Hello! This is Pyotr Lukyanchenko (PetrPavlovich). This checklist is a collection of thoughts that have taken shape over years full of bumps and mistakes.
1. Statement of the problem
Always double-check the problem you are about to solve. What exactly are you going to do: classify something? Predict a number? A clear understanding of the task determines everything you do next.
2. Data (Garbage In = Garbage Out)
Always make sure there are no duplicates in the data. The phrase "Garbage In = Garbage Out" means that if the data was collected carelessly, the result will be just as careless. This, by the way, is why Data Engineer is a separate profession: specialists who, often with heroic effort, clean up truly awful data. They know how to spot outliers and anomalies, remove or correct them, so that analysts can later work with high-quality data sets.
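As a minimal illustration (not part of the original checklist), here is a small pandas sketch of that kind of clean-up: dropping exact duplicates and flagging outliers with a simple 1.5 × IQR rule. The toy DataFrame, the column name "value", and the threshold are all illustrative assumptions.

```python
# A minimal cleaning sketch (hypothetical DataFrame `df` with a numeric column "value").
import pandas as pd

df = pd.DataFrame({"value": [1.0, 1.0, 2.5, 3.1, 250.0, 2.9, 3.0]})  # toy data

# 1. Remove exact duplicate rows.
df = df.drop_duplicates()

# 2. Flag outliers with a simple IQR rule (1.5 * IQR beyond the quartiles).
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

clean = df[mask]        # rows to keep
outliers = df[~mask]    # rows to inspect (or correct) before modelling
print(clean, outliers, sep="\n")
```

In real projects the rule for what counts as an outlier depends on the subject area (see the next point), so the threshold here is only a starting point.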
3. Subject area
Always know the subject area in which you are building your regression. It helps you test hypotheses for realism, and with that understanding you will avoid wasting effort on silly regressions along the lines of "How the rate of glacier melting affects the growth of the rabbit population in Australia."
4. Model logic
You cannot work without logic. It is very important to understand the logic of the model: whether the relationship itself makes sense. Without that, the result may even look high quality, yet it cannot be interpreted. So if a relationship seems to have no logic behind it, it is better not to build the regression at all, because the output will be nonsense that leads to further erroneous decisions.
5. Metrics on the test set matter more than metrics on the training set
When we train a regression, we optimize a training metric, typically MSE or some alternative. And once we have fitted many regressions, we can compare them with each other; here a metric such as R-squared comes into play.
The training metric and the evaluation (test) metric are two different metrics. A model that has learned the training data well will not necessarily do well on the test data. Each of these metrics must be chosen carefully and correctly.
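To make the distinction concrete, here is a minimal sketch on synthetic data: a least-squares fit (which minimizes MSE on the training set) evaluated with MSE and R-squared on a held-out test set. The data and the 70/30 split are assumptions made up for the example.

```python
# Train on one metric (MSE via least squares), evaluate on held-out data
# with MSE and R-squared. Synthetic data; the numbers are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)   # linear signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # least squares minimizes MSE on train

print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```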
6. The simpler the regression, the better it will work
And the more complex the regression, the more likely it is that something will go wrong.
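A small sketch of this point, again on synthetic data invented for illustration: a plain linear fit versus a needlessly complex degree-15 polynomial on the same noisy sample. The complex model typically looks better on the training set and worse on the held-out set; the exact numbers depend on the random draw.

```python
# Simple vs. overly complex regression on the same noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.3, size=60)    # linear signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:2d}: "
          f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```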
7. Better a good regression now than a perfect one in an hour
If you've come up with a good regression solution, it is often best to stop there. Don't chase something perfect and super precise: sometimes an attempt to improve actually makes things worse. Yes, we would all like 100% accurate predictions, but in real life there is no 100% quality. Even the best quality metrics on Kaggle are around 96-98%.
There is still a lot of manual intellectual labor in calibrating models, and it requires real skill from a specialist. Yes, we all strive for AutoML, i.e. having Python automatically select the best model, but so far that remains out of reach, and without understanding the mathematical apparatus it is impossible to choose the right model. Imagine that you are given a time series like the one in the chart below and asked, "Please predict ...".
On such a data set you can build a large number of different regressions, and each will give its own forecast. How to choose the best forecast, how to identify outliers in data, and many other practical things are what we cover in the advanced course Mathematics for Data Science.
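As a toy illustration of that situation (not the approach from the course), here is a sketch that fits several simple trend regressions to the same synthetic series and compares their forecasts on a held-out period. The series, the candidate models, and the split are all assumptions made up for the example.

```python
# Several regressions on one (synthetic) time series, each giving its own forecast;
# a held-out period at the end is used to compare them.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
t = np.arange(120).reshape(-1, 1)                       # time index
y = 0.05 * t[:, 0] + np.sin(t[:, 0] / 6.0) + rng.normal(0, 0.3, size=120)

train_t, test_t = t[:100], t[100:]                      # last 20 points held out
train_y, test_y = y[:100], y[100:]

candidates = {
    "linear trend": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "quadratic trend": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "cubic trend": make_pipeline(PolynomialFeatures(3), LinearRegression()),
}

for name, model in candidates.items():
    model.fit(train_t, train_y)
    forecast = model.predict(test_t)                    # each model's own forecast
    print(f"{name:16s} hold-out MSE = {mean_squared_error(test_y, forecast):.3f}")
```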
So if you are already working in Data Science, or are just about to move into the field, but your mathematics is at the level of "passed something back at university", that is where you will pick up the missing skills.
You can find even more useful material in Pyotr's Telegram channel.