Catching Prices: A Practical Guide to Fishing in the Sea of Procurement
Let's start with the simplest way: fishing with our bare hands. We open the database of goods from past purchases and start looking for a similar product by eye. Chances are high that we won't have caught anything by evening.
Let's try to filter the database of purchased goods somehow. Every purchased item is assigned an OKPD2 code, from the All-Russian Classifier of Products by Type of Economic Activity; the code encodes information about what kind of product it is.
For example, the classifier is hierarchical: a code such as 10.20.13 ("fish, frozen") narrows down from class 10 (food products) through group 10.20 (processed and preserved fish) to the specific product type.
Now we have a fishing rod and things have improved a little, but for some reason the fish still won't take the hook.
Don't get discouraged; let's use some bait. We need a way to automatically identify similar products within a single OKPD2 code. To do that, the semantic meaning of a product description has to be represented numerically, as a tensor. For this we will use Word2Vec, a mechanism for mapping words into a vector space: it turns a word into a fixed-size sequence of numbers, called a vector or tensor, and the model is trained specifically so that these vectors capture the semantic meaning of words. We convert all the products from the database that share our OKPD2 code into tensors. Great: we now have the tensor of our product and a pile of tensors of other products.
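As a quick illustration of this semantic property, here is a minimal sketch using the gensim library; the model file name and the example words are assumptions for illustration, not artifacts of the original project:

```python
from gensim.models import KeyedVectors

# Load a pre-trained Word2Vec model (file name and format are placeholders).
model = KeyedVectors.load_word2vec_format("word2vec_100d.bin", binary=True)

# A single word becomes a fixed-size vector of numbers.
vector = model["salmon"]  # e.g. a 100-dimensional numpy array
print(vector.shape)       # (100,)

# Semantically related words end up close to each other in vector space.
print(model.most_similar("salmon", topn=3))
```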
We will search for the closest product by cosine distance: the more similar a product from the database is to ours, the smaller the cosine distance. We pick the product with the minimum cosine distance, and that will be our desired fish.
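For reference, the cosine distance between two vectors $a$ and $b$ is one minus their cosine similarity:

$$d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Vectors pointing in the same direction give a distance of 0, while unrelated vectors give values close to 1.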
Let's walk through the described method in practice. We start by converting the product description into a tensor. First the description is tokenized, that is, split into separate words. To map words to their semantic meaning, a pre-trained Word2Vec model with a dimension of 100 was used (that is, each word is represented by a set of 100 numbers).
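A minimal sketch of this step, reusing the model loaded above; the whitespace tokenizer here is an assumption, and a real pipeline may tokenize more carefully:

```python
import numpy as np

def embed_words(description: str, model) -> np.ndarray:
    """Tokenize a product description and look up a Word2Vec vector per word."""
    tokens = description.lower().split()
    # Skip out-of-vocabulary words so one unknown token doesn't break the pipeline.
    vectors = [model[token] for token in tokens if token in model]
    return np.array(vectors)  # shape: (n_known_words, 100)

word_embeddings = embed_words("Big fish", model)
```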
We now have an array of embeddings. To compute a cosine distance, we need a single vector representation of the whole text of the product description. The simplest implementation is to take the average of all the word embeddings in the description; this distorts the semantic meaning of the product somewhat, but that is not critical for this task.
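The averaging itself is then one line; a sketch, which assumes the description contains at least one word known to the model:

```python
def text_vector(description: str, model) -> np.ndarray:
    """Represent a whole description as the mean of its word embeddings."""
    return embed_words(description, model).mean(axis=0)  # shape: (100,)
```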
After we have turned our product and all products with the same OKPD2 code into embeddings, the next step is to calculate the cosine distance between them.
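Putting the pieces together, a sketch of the nearest-product search; `same_okpd2_products` is an assumed name for the list of candidate descriptions, and scipy's `cosine` returns exactly the cosine distance defined above:

```python
from scipy.spatial.distance import cosine

# `same_okpd2_products` is an assumed list of description strings for
# products sharing our OKPD2 code.
our_vector = text_vector("Big fish", model)
distances = {
    name: cosine(our_vector, text_vector(name, model))
    for name in same_okpd2_products
}

# The smaller the cosine distance, the more similar the product.
best_match = min(distances, key=distances.get)
```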
As you can see in the diagram below, the products closest to “Big fish” are “Salmon huge” and “Gold fish”.
Thus, we can assume that the price of the big fish lies in the range between the prices of the goldfish and the huge salmon. The following results were obtained on real data:
Sometimes this approach does not work well, for example when the database contains no product similar to ours. Then the caught fish is too small, and the module returns an empty range.
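One possible way to detect this case is a distance threshold; the sketch below and its 0.5 cutoff are assumptions for illustration, not the original module's logic:

```python
MAX_DISTANCE = 0.5  # assumption: a cutoff to be tuned on real data

def price_range(distances, prices):
    """Return None (an empty range) when even the best match is too far away.

    distances: {product_name: cosine distance to our product}
    prices:    {product_name: known price}
    """
    nearest = sorted(distances, key=distances.get)[:2]
    if not nearest or distances[nearest[0]] > MAX_DISTANCE:
        return None  # no sufficiently similar product
    bounds = [prices[name] for name in nearest]
    return min(bounds), max(bounds)
```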
The last method we will look at is net fishing, with Yandex as the net. A search query is formed from the product description, and the first 20 results are taken for further analysis; there is no point in taking more, since the relevance of later results is questionable. The texts of these 20 results are passed to a price extractor, a model that picks prices out of the text and forms a price range from them. All sorts of fish end up in a net, so this range is very noisy. To filter it, let's drop the outliers: if the collected prices are normally distributed, we can keep only about 68 percent of them, namely those lying within one standard deviation of the mean, that is, between the mean minus one standard deviation and the mean plus one standard deviation. This is how the product price ranges from Yandex look:
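A sketch of this one-sigma filter, assuming `prices` holds the numbers extracted from the search results:

```python
import numpy as np

def filter_prices(prices):
    """Keep prices within one standard deviation of the mean.

    For normally distributed data this retains roughly 68% of values.
    `prices` is assumed to be the list of numbers extracted from the
    first 20 search results.
    """
    prices = np.asarray(prices, dtype=float)
    mean, std = prices.mean(), prices.std()
    kept = prices[(mean - std <= prices) & (prices <= mean + std)]
    return kept.min(), kept.max()  # the filtered price range

print(filter_prices([100, 105, 98, 110, 4999, 102]))  # the 4999 outlier is dropped
```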
All the fishing methods we have considered have their pros and cons. Each fisherman chooses the preferred option according to their own requirements for accuracy and speed. Happy catching!