It is already clear that COVID-19 will remain on the agenda in 2021. That raises natural questions: do we have tools to predict the rise and fall in incidence, and can we forecast how events will develop a week, a month, or even a year ahead? Let's figure it out.
Given: colossal data science capabilities, three talented specialists.
Find: Ways to predict the spread of COVID-19 a week ahead.
Solution:
In fact, there will be three solutions, so follow our publications. Today we discuss one of them with Vladislav Kramarenko, who built the model that produced the most accurate forecast* for the entire world one week ahead.
- Vladislav, hello. Let's discuss in detail what you did: what worked, what still needs work, what mistakes were made and how they can be accounted for in the future. Let's start with the main thing: which machine learning algorithm did you use?
- I settled on gradient boosting. The difficulty was that different boosting algorithms give different pictures. AdaBoost scored best for me, followed by CatBoost.
- So you tried several, and AdaBoost performed best?
- Yes. AdaBoost was the best; it gave the most moderate forecast. When an algorithm sees that everything is growing rapidly, it assumes everything will keep growing rapidly, and the other boosting models sent the forecast sky-high. AdaBoost was the most conservative.
- How did you train the model?
- The biggest difficulty in problems like this is finding the right way to train the model, that is, choosing the training and test samples correctly. If we take a single day as the test unit and split all the data into training and test sets, we end up predicting only 1 day ahead. That is not difficult: you just scatter the days randomly between training and test, and one day can be predicted. I discarded this idea immediately and predicted the last week instead: I cut off the last week, gave the remaining days to the training data, and predicted the last week honestly, day after day, using only data from a week earlier. But a difficulty arose here too. I built a model that predicted the second week perfectly and added a bunch of features that helped with that, but it turned out that a model that predicts the second week very well predicts the third very poorly. I am starting to think it might be easier to fit the data by hand without machine learning, and that such a model could be better.
- You mean eyeballing the numbers and extending the line by hand?
- Analyze the monthly statistics. The data fits some curve well. These statistics are rather strange: not all sick people get into them, so they do not reflect the true number of cases. I know some people use the SEIR model (an epidemiological model) for this task. I considered it too, but decided that it requires knowing exactly how many people are sick, and we do not know that. The model depends on how many people one person infects and how many people fall ill. Without that data we cannot work with it. In my opinion, such a model would give an erroneous prediction. *
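A minimal sketch of the "cut off the last week" split described above. The function name, the toy series, and the 7-day horizon parameter are illustrative assumptions, not code from Vladislav's solution.

```python
from datetime import date, timedelta

def split_last_week(rows, horizon_days=7):
    """Split (day, value) rows by date, holding out the last `horizon_days` days as the test set."""
    rows = sorted(rows, key=lambda r: r[0])
    cutoff = rows[-1][0] - timedelta(days=horizon_days)
    train = [r for r in rows if r[0] <= cutoff]   # everything up to the cutoff date
    test = [r for r in rows if r[0] > cutoff]     # the final week only
    return train, test

# Hypothetical toy series: 21 days of cumulative counts.
start = date(2020, 4, 1)
series = [(start + timedelta(days=i), 100 + 10 * i) for i in range(21)]
train, test = split_last_week(series)
```

Unlike a random split, every test day here lies strictly after every training day, so the model never sees the future of the period it is scored on.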
* We will analyze the advantages and disadvantages of the SEIR model with Nikolai Kobalo in the next article.
It seems reasonable to me that people doing this should first run everything on a computer and then review and correct the output by hand. The machine sometimes produces all sorts of nonsense. For example, it sees that the number of cases in China has not grown for a long time, but it also sees other regions where nothing grew for a long time and then explosive growth began. From this the machine "concludes" that the same must happen in China, which in fact has already reached a plateau. So instead of 80k it abruptly predicts a million. I had this in one of my models.
- And what about traditional models? What do you think of time series methods like ARIMA?
- I tried ARIMA a couple of times, but it never gave a better result than gradient boosting. It would seem that ARIMA can describe any process, but it turned out that it does not always work better. It also has a lot of parameters, the process must be stationary, and so on. Even after differencing (the "integrated" part of ARIMA), there is no guarantee that the result will be stationary.
- A question about trees. Trees don't extrapolate. How do you get them to extrapolate?
- To do this, you need to predict not the total number of infected but something else. If we predict the total, then in a region like Moscow the forecast will fail, since trees cannot output more than they saw in the training sample. I took the logarithm of the ratio of today's case count to that of the previous days. Values like these (0.3, 1, maybe 2) are present in the training sample, so the model can learn them. Clearly we will not be able to predict a sharp 500-fold increase; that is beyond this model. But if we take, for example, the ratio of today's increase to yesterday's increase, the figure will be about one, and the sample contains many such values, so in this case the model predicts very well.
- So the target in the final model was the logarithm of the ratio of today's case count to yesterday's?
- Yes. I also tried the ratio of deltas: today's increase divided by yesterday's increase. That worked well too. But the total count and the daily increase in cases worked poorly as targets.
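The target transformation discussed above can be sketched as follows. This is an illustration under stated assumptions: the function names and toy numbers are hypothetical, and the "model" is replaced by simply repeating the last observed log-ratio. It shows why a bounded target still lets the reconstructed total exceed anything in the training data.

```python
import math

def log_ratio_targets(cumulative):
    """log(cases[t] / cases[t-1]) for each day t >= 1; assumes all counts are positive."""
    return [math.log(cur / prev) for prev, cur in zip(cumulative, cumulative[1:])]

def rollout(last_value, predicted_log_ratios):
    """Turn predicted daily log-ratios back into cumulative counts."""
    out = []
    value = last_value
    for lr in predicted_log_ratios:
        value = value * math.exp(lr)  # undo the logarithm, then apply the ratio
        out.append(value)
    return out

# Hypothetical cumulative counts for one region.
cumulative = [100, 120, 150, 180]
targets = log_ratio_targets(cumulative)

# Stand-in for a model: repeat the last observed log-ratio for 3 days.
forecast = rollout(cumulative[-1], [targets[-1]] * 3)
```

The predicted log-ratios stay within the range seen in training, yet the rolled-out totals keep growing past the largest observed count.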
- What did you take as explanatory variables?
- I took roughly the 4 previous days. That worked. I also took information about the population, the number of smokers, and so on, and added many other statistics. Then I spent a week checking which factors improve the score and which do not. But the situation changes too much; these factors turned out to be unstable, essentially random.
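One way to turn "the 4 previous days" into explanatory variables is a sliding window over the daily series. The sketch below assumes a plain list of daily values; the function name and the sample numbers are hypothetical.

```python
def make_lag_rows(values, n_lags=4):
    """Build (features, target) pairs where the features are the n_lags previous values."""
    rows = []
    for t in range(n_lags, len(values)):
        rows.append((values[t - n_lags:t], values[t]))
    return rows

# Hypothetical daily log-ratios for one region.
daily_log_ratios = [0.30, 0.25, 0.22, 0.20, 0.18, 0.15]
rows = make_lag_rows(daily_log_ratios)
```

Each row pairs four consecutive past values with the value that followed them, which is the shape a tree-based regressor expects.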
- What turned out to be stable, apart from the previous values?
- The most important feature was the number of days since the first case, the tenth, the hundredth... At first I took the number of days since the first infected person, but I realized this was not very good, since the first infected person is often quickly isolated and does not cause a sharp rise. So I started counting from 10 infected, and then moved on to 100 and 1000 infected.
At the third stage of this task, I added 50 and 500 infected, and this played a cruel joke on me: the model overfit heavily and began to predict the following week poorly.
Among other important data, I tried the self-isolation index. In some weeks it gave a strong improvement, and in others it made no difference at all. I also used data on the level of healthcare: how much money goes to doctors, how many doctors the country has, how many elderly people there are, and so on. This was done to predict mortality.
There were various problems that I wanted to solve. Take self-isolation, for example. I realized that the level of self-isolation does not affect tomorrow's figures but those two weeks later. And it is not certain that self-isolation affects the number of cases; it may be the other way around, with the number of cases affecting the level of self-isolation.
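The "days since the N-th case" features might be computed along these lines. The function name, the toy series, and the choice of -1 as the value before the threshold is reached are assumptions made for illustration.

```python
def days_since_threshold(cumulative, threshold):
    """For each day, days elapsed since cumulative cases first reached `threshold`.

    Days before the threshold is reached get -1 (an arbitrary sentinel chosen for this sketch).
    """
    first_day = None
    feature = []
    for day, cases in enumerate(cumulative):
        if first_day is None and cases >= threshold:
            first_day = day
        feature.append(-1 if first_day is None else day - first_day)
    return feature

# Hypothetical cumulative counts for one region.
cumulative = [0, 1, 3, 12, 40, 110, 300, 1200]
features = {n: days_since_threshold(cumulative, n) for n in (10, 100, 1000)}
```

Adding more thresholds (such as 50 and 500) produces strongly correlated columns, which is consistent with the overfitting described above.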
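If the effect of self-isolation shows up about two weeks later, the index can be shifted before it is used as a feature. The sketch below assumes a daily list of index values; the function name, the 14-day lag, and the `None` fill are illustrative choices. It only aligns the series in time and says nothing about which way the causality runs.

```python
def lagged(values, lag_days, fill=None):
    """Shift a daily series so that position t holds the value from day t - lag_days."""
    n = len(values)
    head = [fill] * min(lag_days, n)          # days with no value available yet
    tail = list(values[:max(n - lag_days, 0)])
    return head + tail

# Hypothetical self-isolation index for 20 days.
isolation_index = list(range(20))
shifted = lagged(isolation_index, 14)
```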
See also: Causal Inference in ML (https://ods.ai/tracks/causal-inference-in-ml-df2020/), 2020.
- What conclusion would you draw about ML models in general, beyond this problem? It sounds like you are saying that models need to be "looked after"...
- There are tasks that a computer solves much better than a human. For example, the last competition I took part in was on the Unified State Exam in Russian. My model did better on those tasks than I did. But that is text processing...
Why are there so many cases in St. Petersburg and Moscow? Here everyone gets tested. I would not say that other regions test so widely. Say the statistics show 100 people. What does this mean? That they got infected three weeks ago. As a result, we are predicting not the number of cases but some other figure, and how that figure relates to the number of cases is not very clear. A computer cannot predict anything properly if we feed it numbers we do not understand.
For anyone interested, here is my solution: https://github.com/vlomme/sberbank-covid19-forecast-2020
* «Forecast the Global Spread of COVID-19»