Every time I take a taxi, a trip report arrives in my inbox. Among other things, it contains the date and time of the trip, the car model and the driver's name. That gave me an idea: analyze my Yandex Taxi reports and pull the most interesting facts out of them. You have probably wondered too how many times you have ridden in the same car, or been driven by the same driver.
The task makes a good exercise for novice analysts. It has a bit of everything: Python with pandas, HTML parsing, regular expressions, and databases with SQL.
Getting the data
This is the less interesting part: here I describe how I pulled the data out of the mailbox. A link to the Python notebook with the code is at the end. The easiest way was to export the mailbox in *.mbox format. That is simpler than dealing with the Gmail API, which is where my mailbox lives. Trips taken after the export will not be added automatically, but for our purposes that is not critical.
To parse the archive we use the mailbox library. It gives access to the basic properties of each message in the mailbox, including the sender and the message body.
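As a rough illustration of this step, here is a minimal sketch using the standard-library mailbox module. The function names, the demo sender address and the demo message text are made up for the example; only the sender domain taxi.yandex.ru comes from the article.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def message_body(msg):
    """Join the decoded text parts of a message (plain text and/or HTML)."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and attachments
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def select_messages(mbox_path, sender_domain="taxi.yandex.ru"):
    """Return sender, subject and body for messages whose From header mentions sender_domain."""
    selected = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            sender = str(msg.get("From", ""))
            if sender_domain not in sender:
                continue
            selected.append(
                {
                    "from": sender,
                    "subject": str(msg.get("Subject", "")),
                    "body": message_body(msg),
                }
            )
    finally:
        box.close()
    return selected


def build_demo_mbox(path):
    """Write a tiny mbox with one taxi report and one unrelated message (both invented)."""
    box = mailbox.mbox(path)
    demo = [
        ("Yandex Taxi <no-reply@taxi.yandex.ru>", "trip report"),  # hypothetical address
        ("Someone <someone@example.com>", "hello"),
    ]
    for sender, text in demo:
        msg = EmailMessage()
        msg["From"] = sender
        msg["Subject"] = "demo"
        msg.set_content(text)
        box.add(msg)
    box.flush()
    box.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
    build_demo_mbox(demo_path)
    for item in select_messages(demo_path):
        print(item["from"], "|", item["body"].strip())
```

With a real export, only `select_messages("path/to/export.mbox")` would be needed; the demo builder is there just so the sketch runs without a mailbox at hand.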
After selecting the relevant messages, namely those from the sender taxi.yandex.ru, we immediately run into a problem: Yandex periodically changes the structure of its reports. The structure changed substantially only once, though, this year. Before that, all the trip information came as continuous text; now it is laid out as a table. So I had to write two separate extraction functions: if the message contains continuous text, we find the fields we need with regular-expression patterns; if it contains a table, we parse the message's HTML with Beautiful Soup. We load the resulting data into a dataframe and into a cloud database, so that we do not have to re-scan the whole mailbox on every run.
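The two-branch extraction could look roughly like the sketch below. Everything specific in it is an assumption: the field labels (Date, Driver, Car), the sample layouts and the function names are invented, since the real reports use their own wording. To keep the example dependency-free, the table branch uses the standard-library html.parser as a stand-in for Beautiful Soup.

```python
import re
from html.parser import HTMLParser

# Hypothetical mask for the old continuous-text reports.
TEXT_PATTERN = re.compile(
    r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2}).*?"
    r"Driver:\s*(?P<name>.+?)\s+"
    r"Car:\s*(?P<car>\S+)\s+(?P<car_model>\S+)\s+(?P<number>\S+)",
    re.DOTALL,
)


def parse_text_report(text):
    """Extract trip fields from a continuous-text report; None if the mask does not match."""
    match = TEXT_PATTERN.search(text)
    return match.groupdict() if match else None


class _CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table_report(html):
    """Extract trip fields from a report laid out as label/value table rows (assumed layout)."""
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    fields = {
        row[0].rstrip(":").lower(): row[1]
        for row in collector.rows
        if len(row) == 2
    }
    try:
        car, car_model, number = fields["car"].split(None, 2)
        return {
            "date": fields["date"],
            "name": fields["driver"],
            "car": car,
            "car_model": car_model,
            "number": number,
        }
    except (KeyError, ValueError):
        return None


def parse_report(body):
    """Pick the parser by report layout: HTML table or continuous text."""
    if "<table" in body.lower():
        return parse_table_report(body)
    return parse_text_report(body)
```

Both branches return the same dictionary shape, so the rows can go into one dataframe or one database table regardless of which report layout they came from.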
Looking at the trips
With the data in structured form, it is interesting to look at the statistics.
. , 24 , . . , , , . .
3 , , .
DATE | NAME | CAR | CAR_MODEL | NUMBER | NAME_HASH |
---|---|---|---|---|---|
2020-06-23 | | Toyota | Camry | 37077 | -2596682743997844296 |
2020-06-17 | | Toyota | Camry | 37077 | -2596682743997844296 |
2020-06-05 | | Toyota | Camry | 37077 | -2596682743997844296 |
2019-11-27 | | Toyota | Camry | 37077 | -1058569546058211362 |
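The rows above share one NUMBER but carry two different NAME_HASH values. A grouping query along the following lines could surface such repeats. This is only a sketch: it uses an in-memory SQLite database as a stand-in for the cloud database, and the table name trips and the helper name are assumptions. The data are the four rows from the table (the NAME column is left out).

```python
import sqlite3

# Rows copied from the table above: date, car, car_model, number, name_hash.
ROWS = [
    ("2020-06-23", "Toyota", "Camry", "37077", -2596682743997844296),
    ("2020-06-17", "Toyota", "Camry", "37077", -2596682743997844296),
    ("2020-06-05", "Toyota", "Camry", "37077", -2596682743997844296),
    ("2019-11-27", "Toyota", "Camry", "37077", -1058569546058211362),
]


def trips_per_car_and_driver(rows):
    """Count trips for every (car number, driver hash) pair, most frequent first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trips ("
            "date TEXT, car TEXT, car_model TEXT, number TEXT, name_hash INTEGER)"
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?, ?)", rows)
        query = (
            "SELECT number, name_hash, COUNT(*) AS trip_count "
            "FROM trips "
            "GROUP BY number, name_hash "
            "ORDER BY trip_count DESC, name_hash"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for number, name_hash, trip_count in trips_per_car_and_driver(ROWS):
        print(number, name_hash, trip_count)
```

For these four rows the query returns two groups: three trips for the first hash and one trip for the second, both with car number 37077.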
DATE | TARIF | NAME | CAR | CAR_MODEL | NUMBER | NAME_HASH |
---|---|---|---|---|---|---|
2017-10-11 | | | Hyundai | i40 | 20377 | 7008433025181534578 |
2020-04-16 | + | | Toyota | Camry | 67877 | 7008433025181534578 |
2018-04-11 | | | Kia | Rio | 67077 | -2646868843695703984 |
2020-04-17 | + | | Kia | Optima | 58777 | -2646868843695703984 |
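In this table each NAME_HASH appears with two different cars. The query below is one way to find such cases. It is a sketch under the same assumptions as before: an in-memory SQLite table named trips and a hypothetical helper name, with the data taken from the four rows above (TARIF and NAME omitted).

```python
import sqlite3

# Rows copied from the table above: date, car, car_model, number, name_hash.
ROWS = [
    ("2017-10-11", "Hyundai", "i40", "20377", 7008433025181534578),
    ("2020-04-16", "Toyota", "Camry", "67877", 7008433025181534578),
    ("2018-04-11", "Kia", "Rio", "67077", -2646868843695703984),
    ("2020-04-17", "Kia", "Optima", "58777", -2646868843695703984),
]


def drivers_with_several_cars(rows):
    """Return driver hashes seen with more than one car, with first and last trip dates."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trips ("
            "date TEXT, car TEXT, car_model TEXT, number TEXT, name_hash INTEGER)"
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?, ?)", rows)
        query = (
            "SELECT name_hash, COUNT(DISTINCT number), MIN(date), MAX(date) "
            "FROM trips "
            "GROUP BY name_hash "
            "HAVING COUNT(DISTINCT number) > 1 "
            "ORDER BY name_hash"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in drivers_with_several_cars(ROWS):
        print(row)
```

ISO-formatted dates sort correctly as text, which is why MIN and MAX on the date column give the first and last trip.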
, , 150-200 . .
, , , , 29 , .
pc - 1 , m
29 - , 1346 - .
, m m+n:
pc , , . , , . , , .
, pc = 0.0062, :
- 257
- 7194
- , +. .. , 2 .
pc = 0.0076. :
- 7039
- 209
The Python notebook with the code is available here: https://colab.research.google.com/drive/1eltee0HilqqVQxpreC9-0w4b08EpMAgM?usp=sharing