How I Analyzed My Taxi Rides

Every time I take a taxi, a trip report arrives in my inbox. Among other details, each report contains the date and time of the trip, the car model, and the driver's name. That gave me an idea: analyze my Yandex Taxi reports and pull out the most interesting information. You have probably wondered, too, how many times you have ridden in the same car or been driven by the same driver.





The task described here can be a good exercise for novice analysts. It has a bit of everything: Python with pandas, HTML parsing, regular expressions, and databases with SQL.





Getting the data

This is the less interesting part: here I describe how I retrieved the data from the mailbox. For the code itself, I will attach a link to the Python notebook at the end. The easiest way was to export the mailbox in .mbox format. That is simpler than dealing with the Gmail API, which is where my mailbox lives. Trips taken after the export will not be added automatically, but for our purposes this is not critical.





To parse the archive, we will use the mailbox library. It gives access to the basic properties of each message in the mailbox, including the sender and the message body itself.
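As a rough illustration of this step, the sketch below reads an .mbox file with the standard-library mailbox module and keeps only the messages whose From header mentions taxi.yandex.ru. The helper names (taxi_messages, build_demo_mbox), the no-reply@taxi.yandex.ru address, and the demo messages are assumptions made so the example runs without a real export; the notebook's actual code may differ.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

SENDER = "taxi.yandex.ru"  # sender domain named in the article


def taxi_messages(mbox_path, sender=SENDER):
    """Yield (Date header, message) for each message whose From header contains `sender`."""
    for msg in mailbox.mbox(mbox_path):
        if sender in str(msg.get("From", "")):
            yield msg.get("Date"), msg


def build_demo_mbox(path):
    """Write a tiny two-message mbox (hypothetical data) to stand in for a real export."""
    box = mailbox.mbox(path)
    for from_addr, subject in [
        ("no-reply@taxi.yandex.ru", "Trip report"),  # hypothetical address
        ("news@example.com", "Newsletter"),
    ]:
        m = EmailMessage()
        m["From"] = from_addr
        m["Subject"] = subject
        m["Date"] = "Tue, 23 Jun 2020 09:15:00 +0300"
        m.set_content("placeholder body")
        box.add(m)
    box.flush()
    box.close()


demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
build_demo_mbox(demo_path)
found = [(date, msg["Subject"]) for date, msg in taxi_messages(demo_path)]
print(found)
```

With a real export, only the path passed to taxi_messages would change; the message body can then be read from each returned message for parsing.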





After selecting the relevant messages, namely those from the sender taxi.yandex.ru, we immediately run into a problem: Yandex periodically changes the structure of its reports. The structure has changed substantially only once, though, this year. Before that, all the trip information came as continuous text; now it is laid out as a table. So I had to write two separate extraction functions: if the message contains continuous text, we find the fields we need with regular-expression patterns; if it contains a table, we parse the message's HTML with Beautiful Soup. We load the resulting data into a dataframe and a cloud database so that we do not have to re-scan the entire mailbox on every run.
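The two-format approach might look roughly like the sketch below. Both sample reports, the field labels, and the regular expression are invented for illustration, since the real report layouts are not reproduced in this article. To keep the example free of third-party dependencies, the table branch uses the standard-library html.parser in place of Beautiful Soup, so it is a stand-in for the approach rather than the notebook's code.

```python
import re
from html.parser import HTMLParser

# Hypothetical old-style report: one run of plain text (real wording differs).
OLD_STYLE = "Trip on 2019-11-27 at 08:41. Driver: Ivan. Car: Toyota Camry. Cost: 350 rub."

# Hypothetical new-style report: label/value pairs in an HTML table.
NEW_STYLE = """
<table>
  <tr><td>Date</td><td>2020-06-23</td></tr>
  <tr><td>Driver</td><td>Ivan</td></tr>
  <tr><td>Car</td><td>Toyota Camry</td></tr>
</table>
"""

TEXT_PATTERN = re.compile(
    r"Trip on (?P<date>\d{4}-\d{2}-\d{2}).*?"
    r"Driver: (?P<driver>[^.]+)\..*?"
    r"Car: (?P<car>[^.]+)\."
)


def parse_text_report(body):
    """Extract the fields from a plain-text report with one regular expression."""
    match = TEXT_PATTERN.search(body)
    return match.groupdict() if match else None


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            if not self.rows:  # tolerate a cell that appears outside any row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_table_report(body):
    """Read label/value rows from an HTML table into the same dict shape."""
    parser = CellCollector()
    parser.feed(body)
    fields = {row[0].lower(): row[1] for row in parser.rows if len(row) == 2}
    return {key: fields.get(key) for key in ("date", "driver", "car")}


def parse_report(body):
    """Dispatch on the layout: table markup is treated as the new format."""
    if "<table" in body.lower():
        return parse_table_report(body)
    return parse_text_report(body)


print(parse_report(OLD_STYLE))
print(parse_report(NEW_STYLE))
```

Returning the same dictionary shape from both branches means the rows can go into a single dataframe regardless of which layout a given message used.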





Looking at the trips

With the data in structured form, it is interesting to look at the statistics.










Trip cost over time






A narrow peak for trips to work, a wide one for trips from work.
























. , 24 , . . , , , . .










3 , , .





| DATE | NAME | CAR | CAR_MODEL | NUMBER | NAME_HASH |
|---|---|---|---|---|---|
| 2020-06-23 | | Toyota | Camry | 37077 | -2596682743997844296 |
| 2020-06-17 | | Toyota | Camry | 37077 | -2596682743997844296 |
| 2020-06-05 | | Toyota | Camry | 37077 | -2596682743997844296 |
| 2019-11-27 | | Toyota | Camry | 37077 | -1058569546058211362 |














| DATE | TARIF | NAME | CAR | CAR_MODEL | NUMBER | NAME_HASH |
|---|---|---|---|---|---|---|
| 2017-10-11 | | | Hyundai | i40 | 20377 | 7008433025181534578 |
| 2020-04-16 | + | | Toyota | Camry | 67877 | 7008433025181534578 |
| 2018-04-11 | | | Kia | Rio | 67077 | -2646868843695703984 |
| 2020-04-17 | + | | Kia | Optima | 58777 | -2646868843695703984 |
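The rows in the tables above appear to show both kinds of repeat: one plate number that turns up with more than one NAME_HASH, and NAME_HASH values that turn up with more than one plate number. A minimal sketch of how such repeats could be found is below. It uses only the standard library (a pandas groupby would do the same job), the rows are copied from the tables, and the helper name group_values is an assumption for illustration.

```python
from collections import defaultdict

# Rows copied from the tables above: (DATE, CAR, CAR_MODEL, NUMBER, NAME_HASH).
TRIPS = [
    ("2020-06-23", "Toyota", "Camry", "37077", -2596682743997844296),
    ("2020-06-17", "Toyota", "Camry", "37077", -2596682743997844296),
    ("2020-06-05", "Toyota", "Camry", "37077", -2596682743997844296),
    ("2019-11-27", "Toyota", "Camry", "37077", -1058569546058211362),
    ("2017-10-11", "Hyundai", "i40", "20377", 7008433025181534578),
    ("2020-04-16", "Toyota", "Camry", "67877", 7008433025181534578),
    ("2018-04-11", "Kia", "Rio", "67077", -2646868843695703984),
    ("2020-04-17", "Kia", "Optima", "58777", -2646868843695703984),
]

NUMBER, NAME_HASH = 3, 4  # column positions inside each row


def group_values(trips, key_index, value_index):
    """Map each distinct key to the set of values seen alongside it."""
    groups = defaultdict(set)
    for trip in trips:
        groups[trip[key_index]].add(trip[value_index])
    return groups


cars_per_driver = group_values(TRIPS, NAME_HASH, NUMBER)
drivers_per_car = group_values(TRIPS, NUMBER, NAME_HASH)

# Drivers seen in more than one car, and cars seen with more than one driver.
drivers_with_several_cars = {
    name_hash: sorted(cars) for name_hash, cars in cars_per_driver.items() if len(cars) > 1
}
cars_with_several_drivers = {
    number: len(hashes) for number, hashes in drivers_per_car.items() if len(hashes) > 1
}

print(drivers_with_several_cars)
print(cars_with_several_drivers)
```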





, , 150-200 . .





Distribution of intervals between trips with a recurring driver
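One way the intervals behind such a distribution could be computed is sketched below. It groups trips by NAME_HASH and takes the difference between consecutive trip dates. Two assumptions to note: the interval is measured here in calendar days, whereas the article may count the number of trips in between, and the function name repeat_intervals is hypothetical. The sample rows are the dates and hashes from the tables earlier in the article.

```python
from collections import defaultdict
from datetime import date

# (trip date, NAME_HASH) pairs taken from the tables earlier in the article.
TRIPS = [
    (date.fromisoformat("2020-06-23"), -2596682743997844296),
    (date.fromisoformat("2020-06-17"), -2596682743997844296),
    (date.fromisoformat("2020-06-05"), -2596682743997844296),
    (date.fromisoformat("2019-11-27"), -1058569546058211362),
    (date.fromisoformat("2017-10-11"), 7008433025181534578),
    (date.fromisoformat("2020-04-16"), 7008433025181534578),
    (date.fromisoformat("2018-04-11"), -2646868843695703984),
    (date.fromisoformat("2020-04-17"), -2646868843695703984),
]


def repeat_intervals(trips):
    """Return the sorted gaps, in days, between consecutive trips with the same driver."""
    days_by_driver = defaultdict(list)
    for day, driver in trips:
        days_by_driver[driver].append(day)

    intervals = []
    for days in days_by_driver.values():
        days.sort()
        # Pair each trip with the next one by the same driver.
        intervals.extend((later - earlier).days for earlier, later in zip(days, days[1:]))
    return sorted(intervals)


intervals = repeat_intervals(TRIPS)
print(intervals)
```

Drivers seen only once contribute no interval, so the list covers repeat encounters only; a histogram of it would give a plot of the kind captioned above.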

, , , , 29 , .










Distribution of intervals between trips with a recurring car






 pc - 1 , m 





P = \frac{1}{N} (1 - p_c)^m






\frac{1}{N \cdot p_c} = \frac{29}{1346}

29 - , 1346 - .





, m m+n:





P_{m,\,m+n} = \frac{(1 - p_c)^m}{N} \cdot \frac{1 - (1 - p_c)^{m+n}}{p_c}






 pc  , , . , , . , , .





, pc = 0.0062, :





  • - 257





  • - 7194





- , +. .. , 2 .










pc = 0.0076. :





  • - 7039





  • - 209










Link to the notebook with the code: https://colab.research.google.com/drive/1eltee0HilqqVQxpreC9-0w4b08EpMAgM?usp=sharing







