From heuristics to machine learning: search suggestions in Citymobil





Hello! My name is Mikhail Dyachkov, and I do machine learning at Citymobil. Today I will tell you about our new algorithm for generating search suggestions for destination addresses. You will learn how a seemingly simple task turned into an interesting problem, and how, we hope, solving it made users' lives a little easier. We continue to closely monitor the new algorithm and will keep tweaking it to maintain a high ranking quality. We will roll the algorithm out to all users within the next few weeks, but we are already ready to talk about the long road from heuristics to a machine learning algorithm running in production.



I think it is worth starting with what an ideal taxi ordering scenario looks like from the user interface perspective. We would like the application to figure out by itself where from / where to / when / in which car the user wants to go. In this article, we'll look at our answer to the "where to" question.



One of the central elements of the first screen (the one the user sees after logging in) is search suggestions. In the geo-search team we call them "suggests" (from the English "suggestion"). They offer the user final route addresses (points "B") from their trip history, based on the current pin position (i.e., the pickup point) and the time of day, without the user typing a search query. With suggests we try to help the user form an order "in one click". In the current version of the iOS client application, the suggests look like this:







Through its algorithms for generating search results, geo-search can affect some of the most important product metrics of the client application: the time it takes to order a taxi (Time to Order, or T2O) and the number of actions the client performs to form an order (Actions to Order, or A2O). Therefore, we decided to develop an algorithm that predicts the most probable route endpoints (points "B") for a given location and time of day.



Heuristic



One of the backend developers on the geo-search team (vasilesk, hi!) came up with a fairly simple heuristic for generating suggests that worked for both the start point "A" and the end point "B". It should be noted right away that the heuristic worked not on the user's trip history but on the history of clicks on search results, which caused some problems. We call these objects "picks" (from the English "pick"). The heuristic looked like this:



  1. For the current user, take all of their historical picks.
  2. Filter them, keeping those with the same goal (from / to).
  3. Merge picks that lie within 300 meters of each other into one candidate address (for the "from" goal we group by the "from" point, for the "to" goal by the "to" point); this smooths out GPS positioning noise.
  4. For each candidate address, aggregate statistics over its picks: how many there are, how long ago the most recent one was made, and so on.
  5. Take the time of day into account: picks made between 3:00 and 14:00 are treated separately from the rest, so that candidates are matched to the time of the current request.
  6. Assign each candidate a weight (score) based on these statistics: the more often and the more recently an address was picked, the higher the score.
  7. Sort the candidates by score and return the top ones as suggestions (a rough code sketch follows this list).
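A rough sketch of a heuristic along these lines; the data layout, scoring weights, and the way frequency, recency, and time of day are combined here are illustrative assumptions, not our production logic:

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(a, b):
    """Approximate distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def is_daytime(dt):
    """The 3:00-14:00 vs. rest-of-day split mentioned in the heuristic."""
    return 3 <= dt.hour < 14


def suggest(picks, goal, now, radius_m=300, top_n=3):
    """picks: list of dicts like {'goal': 'to', 'point': (lat, lon), 'ts': datetime}."""
    # Steps 1-2: keep only the user's picks with the requested goal ('from' / 'to').
    relevant = [p for p in picks if p["goal"] == goal]

    # Step 3: merge picks lying within radius_m of each other (smooths GPS noise).
    groups = []  # each group: {'point': (lat, lon), 'picks': [...]}
    for p in relevant:
        for g in groups:
            if haversine_m(g["point"], p["point"]) <= radius_m:
                g["picks"].append(p)
                break
        else:
            groups.append({"point": p["point"], "picks": [p]})

    # Steps 4-6: score each candidate by frequency, recency and time-of-day match.
    def score(g):
        latest = max(p["ts"] for p in g["picks"])
        recency = 1.0 / (1.0 + (now - latest).days)
        same_daypart = sum(is_daytime(p["ts"]) == is_daytime(now) for p in g["picks"])
        return len(g["picks"]) + same_daypart + recency

    # Step 7: sort by score and return the best candidates as suggestions.
    return sorted(groups, key=score, reverse=True)[:top_n]
```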


This algorithm worked for a while and was generally good enough for an MVP (we'll talk about metrics a little later), but it had a number of drawbacks. Besides its rather primitive logic, it was based not on trips but on the user's picks, so users sometimes got unobvious suggestions. Also, because of the "specific" way the pick history was stored, it was rather hard to run quick analytics on it. For these reasons we decided to try machine learning. Next, we will consider the formulation of the ranking problem and reduce our problem to binary classification.



Ranking problem statement



Before talking about features, metrics, and a model, we need to figure out what kind of problem we are solving. Let's go iteratively and first state the general ranking problem. It looks like this:



$X$ — a set of objects.

$X^\ell = \{x_1, \dots, x_\ell\}$ — the training sample.

$i \prec j$ — the correct order on pairs $(i, j)$.

Task: construct a ranking function $a: X \to \mathbb{R}$ such that

$i \prec j \Rightarrow a(x_i) < a(x_j)$.





Now let's formulate the task of ranking search results for a query. It differs from the general ranking problem in that instead of one common set of objects to sort, there are two sets: $D$, the documents, and $Q$, the queries.



$D$ — a collection of documents (answers).

$Q$ — a set of queries.

$D_q \subseteq D$ — the set of documents found for query $q$.

$X = Q \times D$ — objects are "query, document" pairs: $x = (q, d)$, $q \in Q$, $d \in D_q$.

$Y$ — an ordered set of grades (ratings).

$y(q, d): X \to Y$ — relevance scores.

The higher the score $y(q, d)$, the more relevant document $d$ is to query $q$.

The correct order is defined only between documents found for the same query $q$:

$(q, d) \prec (q, d') \Leftrightarrow y(q, d) < y(q, d')$.



In our task of recommending route endpoints, the set of grades is binary: a suggested address is either relevant or irrelevant to the user (we ignore complex routes with multiple endpoints). If we consider the task in the context of a user, then $q$ is a request to the service containing the client's id, geo-position, date, and time; $D_q$ is the set of historical endpoints "B" of the user's trips (we build suggestions only from addresses of past trips). And every candidate answer $d \in D_q$ for the query $q$ is either relevant to the user (from the current point and at the current time, the user wants to go exactly there) or irrelevant.



For the sake of completeness, it remains only to describe how the sample of query-response pairs with a target is built. For simplicity, consider one customer who had 5 trips, ordered from the first to the last. For the first trip we know nothing about the user's history, so we cannot offer a suggest with the described machine learning algorithm (the heuristic for new users works here). For the second trip, we know the endpoint of the first trip and can offer this address to the user if it passes the post-processing procedure (located more than 1 km from the current location, in the same region, etc.). For the third trip we may already have one or two possible suggests: one if the endpoint of the second trip coincided with that of the first, two if the endpoints differed. If a suggest coincided with the actual endpoint "B" (that is, fell into the same hexagon of a fixed size), we set the target to 1, otherwise to 0. Following this algorithm, we form all possible "query - (possible) response" pairs for each client.
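To make the pair-generation procedure more concrete, here is a minimal sketch of how such a sample could be assembled; the trip record layout and the helpers `same_hex` and `passes_postprocessing` are assumptions for illustration, not our actual code:

```python
def build_training_pairs(trips, same_hex, passes_postprocessing):
    """Build (query, candidate, target) triples from one user's trips,
    ordered from the very first trip to the last."""
    pairs = []
    known_endpoints = []  # points "B" seen in the user's previous trips
    for trip in trips:
        query = {
            "client_id": trip["client_id"],
            "position": trip["pickup"],   # geo-position at order time
            "ts": trip["ts"],             # date and time of the request
        }
        for candidate in known_endpoints:
            # Candidates must survive post-processing (more than 1 km away,
            # same region, and so on) to become suggestions at all.
            if not passes_postprocessing(query, candidate):
                continue
            # Target is 1 if the candidate falls into the same fixed-size
            # hexagon as the trip's real endpoint "B", otherwise 0.
            target = 1 if same_hex(candidate, trip["dropoff"]) else 0
            pairs.append((query, candidate, target))
        # The endpoint of this trip becomes a candidate for later trips.
        if not any(same_hex(e, trip["dropoff"]) for e in known_endpoints):
            known_endpoints.append(trip["dropoff"])
    return pairs
```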



Thus, we have reduced the ranking problem to a binary classification problem. Now we can talk about quality assessment metrics.



Metrics



In ranking problems, the metric that shows the share of correct answers among the top n documents of the ranked list $D_q$ for query $q$ is called Precision@n. We are interested in Precision@1/2/3, since the total Click Through Rate of the first three positions is about 95%. At the same time, there is only one correct final address (assuming the user wants to go to an address from their history), so this metric simply shows the share of cases when the correct endpoint "B" falls into the top 1/2/3 addresses suggested by our algorithm.



Recall that in our problem $Y = \{0, 1\}$, $y(q, d)$ is the relevance label, and $a(q, d)$ is the ranking function we are looking for. Then Precision@n can be written as:

$P_n(q) = \frac{1}{n} \sum_{i=1}^{n} y(q, d_q^{(i)})$,

where $d_q^{(i)}$ is the document at position $i$ of the ranked list for query $q$.
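A direct translation of this formula into code (a sketch; `ranked_relevance` is the list of binary labels $y(q, d)$ for one query, ordered by the model's score):

```python
def precision_at_n(ranked_relevance, n):
    """Precision@n for one query: the share of relevant documents in the top n."""
    return sum(ranked_relevance[:n]) / n


def mean_precision_at_n(relevance_per_query, n):
    """Precision@n averaged over queries; the Precision@1/2/3 numbers reported
    later in the post are averages of this kind."""
    return sum(precision_at_n(r, n) for r in relevance_per_query) / len(relevance_per_query)
```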



Features and model





The features for the model in our problem can be divided into several blocks:



  • Only for the document $d \in D_q$ (the final address, point "B").
  • Only for the query $q$ (the starting address, point "A").
  • Common to the query and the document $(q, d)$ (the route from "A" to "B").
  • General user features.


Here are some examples for each of them.



Examples of features only for the document (point "B"); a short sketch of computing a few of them follows this list:



  1. Number of trips to point "B" in the last K days.
  2. The number of trips to point "B" by day of the week and time of day.
  3. When was the previous trip to point “B”.
  4. Flag that the previous trip was made to point "B".
  5. Is point "B" a favorite address / home / work.
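As a small illustration, this is how a few of the point-"B" features above could be computed from the trip history; the trip record layout, the `same_hex` helper, and the value of K are assumptions:

```python
from datetime import timedelta


def point_b_features(query, candidate, trips, same_hex, k_days=30):
    """Compute a few of the document (point "B") features for one candidate."""
    past = [t for t in trips if t["ts"] < query["ts"]]
    to_b = [t for t in past if same_hex(t["dropoff"], candidate)]
    recent_to_b = [t for t in to_b if query["ts"] - t["ts"] <= timedelta(days=k_days)]
    prev_trip = max(past, key=lambda t: t["ts"]) if past else None
    last_to_b = max((t["ts"] for t in to_b), default=None)
    return {
        "trips_to_b_last_k_days": len(recent_to_b),
        "days_since_last_trip_to_b": (query["ts"] - last_to_b).days if last_to_b else -1,
        "previous_trip_was_to_b": int(prev_trip is not None
                                      and same_hex(prev_trip["dropoff"], candidate)),
    }
```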


Examples of features only for the query $q$ (point "A" + date/time):



  1. Day of the week and time of day.
  2. Number of trips from point "A".
  3. Number of trips from point "A" in the last K days.
  4. When was the previous trip from point "A".
  5. Is point "A" a favorite address / home / work.
  6. City / region of the query q.
  7. Popularity of point "A".


Examples of features common to the query and the document $(q, d)$ (the route from "A" to "B"):



  1. Distance and travel time from "A" to "B".
  2. Number of trips on this route.
  3. When was the previous trip on this route.


General user features:



  1. Number of trips in the last K days.
  2. Total number of trips.
  3. Historical trip statistics (mean, quantiles, median trip distance, etc.).


As a result, we got more than 100 features describing a "query-document" pair. Since we want to maximize Precision@1/2/3, it is natural to predict the probability of the user's trip to a specific endpoint and rank the candidates by this probability. We tried different algorithms and different loss functions and settled on gradient boosting over trees with logloss. Here are the results in comparison with the heuristic that was running at the time:



                Heuristic   ML algorithm
Precision@1     0.657       0.789
Precision@2     0.719       0.872
Precision@3     0.761       0.923
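A hedged sketch of the training setup just described: scikit-learn's gradient boosting is used here as a stand-in for whichever boosting implementation actually runs in production, and the data is a synthetic placeholder for the real "query-document" feature matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Placeholder data: in reality X is the feature matrix of "query-document"
# pairs and y is the binary target from the pair-generation procedure above.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient boosting over trees; the default loss is logloss (binomial deviance).
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

val_proba = model.predict_proba(X_val)[:, 1]
print("validation logloss:", log_loss(y_val, val_proba))
```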




Production



Naturally, before inventing complex algorithms, features, and model training, you need to think about how all of this will work in production under load, without forgetting about scaling. Together with the backend development team, we sketched out a rough design of the service. We decided to wrap the trained machine learning model in the async-await web framework Sanic, to which the search service sends requests. In addition to vertical scaling, we made it possible to deploy on multiple machines: requests to the service go to the URL of a load balancer, which then proxies them to one of the machines round-robin. After implementing the first prototype of the service, we realized that we could significantly reduce the number of queries to MySQL. Since every shift of the pin that selects the pickup point is a new search request, and therefore a request to our service, we decided to cache the user's trip history in Redis for N minutes from the moment of the request. Thanks to this, we reduced the load on the database threefold. As a result, the service scheme can be represented as follows:







We store the requests to the service and its responses in ElasticSearch, and we send the metrics responsible for the stability of the service to NewRelic for monitoring.
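A stripped-down sketch of such a service; the endpoint path, Redis keys, cache TTL, and helper names are illustrative assumptions, not the real ones:

```python
import json

from redis import asyncio as aioredis
from sanic import Sanic, response

app = Sanic("suggest_service")
CACHE_TTL = 600  # "N minutes" of trip-history cache; the value here is assumed


@app.before_server_start
async def setup_redis(app, _loop):
    app.ctx.redis = aioredis.from_url("redis://localhost")


async def fetch_trips_from_mysql(client_id):
    """Stand-in for the real MySQL query returning the user's past trips."""
    return []


def rank_candidates(payload, trips):
    """Stand-in for feature generation + model scoring (fleshed out further below)."""
    return []


async def load_trip_history(app, client_id):
    """Return the user's trip history, hitting MySQL only on a cache miss."""
    key = f"trips:{client_id}"
    cached = await app.ctx.redis.get(key)
    if cached is not None:
        return json.loads(cached)
    trips = await fetch_trips_from_mysql(client_id)
    await app.ctx.redis.set(key, json.dumps(trips), ex=CACHE_TTL)
    return trips


@app.post("/suggest")
async def suggest(request):
    payload = request.json  # client id, geo-position, date and time
    trips = await load_trip_history(request.app, payload["client_id"])
    return response.json({"suggestions": rank_candidates(payload, trips)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```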



The general workflow of our service:



  1. The search service sends a request to the search hints service.
  2. The balancer selects one of the machines and sends this request to it.
  3. Inside the machine, the request goes to one of the free workers or waits in the queue.
  4. Inside the worker:
    1. We validate the incoming request.
    2. We query Redis; if the user's order history is not cached, we go to MySQL and write the fetched data to Redis.
    3. We do basic data preprocessing and collect features for the model.
    4. We call predict_proba() for all generated candidate suggests and sort them by the predicted "probability" (see the sketch after this list).
    5. We do additional post-processing of the data and form the response.
    6. We return the response to the search service.
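What step 4.4 might look like in code, fleshing out the rank_candidates stub from the service sketch above; build_features, same_hex, and the candidate layout are assumptions carried over from the earlier sketches:

```python
def deduplicate_endpoints(trips, same_hex):
    """Collapse the trip history into a list of distinct dropoff points "B"."""
    endpoints = []
    for trip in trips:
        if not any(same_hex(trip["dropoff"], e) for e in endpoints):
            endpoints.append(trip["dropoff"])
    return endpoints


def rank_candidates(query, trips, model, build_features, same_hex, top_n=3):
    """Score every candidate endpoint with the model and sort by probability."""
    candidates = deduplicate_endpoints(trips, same_hex)
    if not candidates:
        return []
    features = [build_features(query, c, trips) for c in candidates]  # numeric vectors
    probabilities = model.predict_proba(features)[:, 1]  # P(user goes to this "B")
    ranked = sorted(zip(candidates, probabilities), key=lambda cp: cp[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
```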


What's next?



We are now actively testing our model with switchback testing, in order to then draw conclusions and add improvements to the algorithm that raise the ranking quality. In the future we will try to add more features to the model and, together with the product designers, prepare a new design for displaying suggests. Of course, it would be great if our application itself understood where from / where to / when / in which car the user wants to go. We are working in all these directions so that ordering a taxi really takes one click.


