How we built automatic selection of similar products




In previous articles, I described how we learned to match products from different sources and fill in a product card: characteristics, images, description. Once supplier prices, competitor prices and product characteristics are known, the logical next step is to find analogs, or simply products with similar properties.



This information can be used in several ways. For example, the product card can show the customer several similar items, in case one of them suits them better. If something is out of stock, a list of similar products that are in stock is also useful. Another option is to give this information to call center employees so that they can quickly offer analogs (or be able to offer them at all) when the requested product is unavailable or when an analog better fits the customer's wishes.



How can you tell whether products are similar? You could compare their characteristics: the more of them match, the more similar the products. Unfortunately, it is not that simple. First, in practice there are almost no products with all characteristics filled in; 80% is a good result. Second, some characteristics are more important than others. For example, a 65" TV is completely different from a 22" TV, even though both have 2 USB ports. Another example: a metal case and an aluminum case are much closer to each other than either is to a plastic one, although these are three different values.



Thus, to select similar products, we need to solve the following tasks:



  1. Assign weights to the characteristics. The diagonal size is important, the number of USB ports less so.
  2. Determine the range of values of each characteristic and define a distance function between values on it.
  3. Decide on a strategy for cases when a characteristic is known for one product but not for the other.
  4. Given the distances between the values of all characteristics, calculate the distance between the products.
  5. Think about performance: computing the distances for all pairs has complexity O(N^2). If about 50 million distances for 10 thousand products does not seem like a big problem, then roughly 45 billion for 300 thousand is already a lot.
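The figures in the last item follow from the number of unordered pairs, N(N-1)/2. A minimal sketch of that arithmetic (the function name `pair_count` is ours, used only for illustration):

```python
def pair_count(n: int) -> int:
    """Number of unordered product pairs among n products: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (10_000, 300_000):
        # About 50 million pairs for 10 thousand products,
        # tens of billions for 300 thousand.
        print(f"{n:>7} products -> {pair_count(n):,} pairs")
```

Growing the catalog 30 times multiplies the number of pairs by about 900, which is why the quadratic approach stops being practical for large categories.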


Let's solve these problems. To some extent, this will be research work.



How We Determine Feature Weights



We relied on two basic ideas when assigning weights.

  • Characteristics that affect the price are important. The converse is not necessarily true: the color of a mobile phone, for example, is quite important but hardly affects the price.
  • To identify important characteristics that do not affect the price, we assume that they are, on average, filled in more often.


Next, for each category, we assign weights to all characteristics as follows:

  1. If the characteristic is numeric, we compute its Pearson correlation with price.
  2. If it is an enumeration with mutually exclusive (non-numeric) values, we order its values by the average price of products having each value and compute the Spearman correlation with price.
  3. If multiple choice is allowed, we reduce it to a set of mutually exclusive yes/no characteristics and compute the Spearman correlation of each with price. We reduce the resulting coefficient depending on the number of options.
  4. We compute the percentage of filled values for each characteristic and increase or decrease the weight obtained earlier accordingly.
  5. The resulting values can be used as weights, but in practice the best results come from applying one more nonlinear, order-preserving transformation.
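The steps above can be sketched roughly as follows. This is an illustration under our own assumptions, not the production code: the way the correlation and the fill rate are combined (a simple average) and the final transformation (a square root) are hypothetical choices, and the multiple-choice case from step 3 is left out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def characteristic_weight(values, prices, numeric):
    """values holds one entry per product, None where the value is not filled."""
    filled = [(v, p) for v, p in zip(values, prices) if v is not None]
    if len(filled) < 2:
        return 0.0
    vs, ps = zip(*filled)
    if numeric:
        corr = pearson(vs, ps)                      # step 1
    else:
        # Step 2: order categories by mean price, then rank-correlate with price.
        by_value = {}
        for v, p in filled:
            by_value.setdefault(v, []).append(p)
        mean_price = {v: sum(p) / len(p) for v, p in by_value.items()}
        corr = spearman([mean_price[v] for v in vs], ps)
    fill_rate = len(filled) / len(values)           # step 4
    raw = (abs(corr) + fill_rate) / 2               # ASSUMPTION: simple average
    return sqrt(raw)                                # ASSUMPTION: monotone transform (step 5)
```

With five TVs priced 100, 200, 350, 600 and 900, diagonals of 22, 32, 43, 55 and 65 inches get a noticeably higher weight from this sketch than USB port counts of 2, 3, 2, 3 and 2, because the diagonal correlates with price and the port count does not.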


Each step has its own nuances. For example, how should the price be calculated if only retail prices are known in one case, only wholesale prices in another, and both in a third? Or what if one of the stores made a mistake and sells a bedside table at the price of a cabinet from the same series?



How we calculate the distance between products



When choosing the algorithm for the distance between values of a characteristic, we need to keep in mind how we will compute the distance between products from the distances between individual characteristics and their weights. Intuition suggests starting with an ordinary distance in n-dimensional space, i.e. the square root of the sum of the squared distances between characteristics.

Intuition further suggests that in this case the distance function between values should be distributive and, better still, satisfy the triangle inequality. I cannot prove that these requirements are correct, but we will stick to them.
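The n-dimensional distance described here can be sketched as below. How exactly the weights enter the formula and what happens when a value is unknown are not spelled out in this text, so scaling each axis by its weight and the `missing_distance` default are assumptions made only for illustration.

```python
from math import sqrt


def product_distance(char_distances, weights, missing_distance=0.5):
    """Weighted Euclidean distance between two products.

    char_distances maps a characteristic name to the distance between the
    two products' values (0 means identical), or None / absent when the
    value is unknown for at least one product.
    weights maps a characteristic name to its weight.
    """
    total = 0.0
    for name, weight in weights.items():
        d = char_distances.get(name)
        if d is None:
            d = missing_distance  # ASSUMPTION: fixed penalty for unknown values
        total += (weight * d) ** 2  # ASSUMPTION: the weight scales the axis
    return sqrt(total)


if __name__ == "__main__":
    weights = {"diagonal": 3.0, "usb_ports": 0.5}
    print(product_distance({"diagonal": 0.7, "usb_ports": 0.0}, weights))
```

With these hypothetical weights, a large difference in diagonal dominates the result even when the USB port counts match exactly.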



Then the following can serve as distance functions:



  • β€” , . , 35 , β€” 75 , 40 . .
  • β€” (, ?), . .
  • , .


Now about performance. In practice, we can compute pairwise distances between 30 thousand products in a reasonable time (up to 5 minutes). Some categories contain more products, though: the catalog may hold a hundred thousand mattresses, for example, and then the time grows roughly tenfold.



The optimization for this case looks like this: we sort all products by the value of the characteristic with the highest weight, which takes O(N log N) and is faster than O(N^2). Then we split the products into overlapping groups (say, overlapping by 20%) and compute pairwise distances within each group. As a result, processing time grows as O(N^2) for categories of up to 30 thousand products and as O(N log N) beyond 30 thousand.
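A rough sketch of this grouping scheme, assuming hashable product identifiers and a `key` function that returns the value of the highest-weight characteristic (both names, as well as `group_size`, are ours and chosen for illustration):

```python
from itertools import combinations


def overlapping_groups(items, key, group_size, overlap=0.2):
    """Sort items by key and yield consecutive windows of group_size items,
    each sharing roughly `overlap` of its items with the next window."""
    ordered = sorted(items, key=key)
    if not ordered:
        return
    step = max(1, round(group_size * (1 - overlap)))
    start = 0
    while True:
        yield ordered[start:start + group_size]
        if start + group_size >= len(ordered):
            break
        start += step


def candidate_pairs(items, key, group_size, overlap=0.2):
    """Yield each pair of items that share at least one group, once."""
    seen = set()
    for group in overlapping_groups(items, key, group_size, overlap):
        # Windows are slices of one sorted list, so a repeated pair always
        # appears as the same (earlier, later) tuple.
        for pair in combinations(group, 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


if __name__ == "__main__":
    products = list(range(10))  # stand-ins for product ids
    pairs = list(candidate_pairs(products, key=lambda p: p, group_size=5))
    print(len(pairs), "pairs instead of", 10 * 9 // 2)
```

For a fixed group size the number of pairs grows linearly with the number of products, so the sort becomes the dominant term. The price is that two products far apart in the sort order are never compared.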





Results



Here are several examples of the automatic search for similar products using this algorithm (the first product in each table is the one for which we were looking for similar products).













| Characteristic | Bosch WLT24540OE | Bosch WLN24240OE | Samsung WW80K6210RW | Bosch WLT24460OE | Siemens WS12T440OE | Siemens WS12T540OE |
|---|---|---|---|---|---|---|
| Type | automatic | automatic | automatic | automatic | automatic | automatic |
| Installation | free-standing | free-standing | free-standing | free-standing | free-standing | free-standing |
| Loading | front | front | front | front | front | front |
| Maximum load | 7 | 7 | 8 | 7 | 7 | 7 |
| Colour | white | white | white | white | white | white |
| Energy class | A+++ | A+++ | A+++ | A+++ | A+++ | A+++ |
| Spin class | B | B | B | B | B | B |
| Number of programs | 14 | 15 | 14 | 15 | | 14 |
| Hatch color | silver | white | black | silver | silver | silver |
| Maximum spin speed | 1200 | 1200 | 1200 | 1200 | 1200 | 1200 |
| Remaining program time indicator | + | | + | + | + | |
| Power consumption | 2300.00 | | | 2300.00 | 2300.00 | |
| Imbalance control | + | + | + | + | + | + |
| Body material | plastic | plastic | plastic | plastic | plastic | plastic |
| Power cord length | 1.75 | 1.75 | | | | |
| Embedding | | | | | | under the countertop |
| Number of drums | | 1 | | | | 1 |
| Spin speed selection | + | + | + | + | + | + |
| Spin cancellation | + | + | + | + | + | + |
| Bubble generator | | | + | | | |
| All programs | synthetics | additional rinse | additional rinse | additional rinse | additional rinse | eco wash |
| Market launch date | 2016 | 2016 | 2016 | 2015 | | |
| Electricity consumption per cycle | 0.91 | 0.91 | | 0.96 | 0.91 | 0.91 |























  • Hotpoint-Ariston WMTF 701 H CIS
  • Hotpoint-Ariston WMTL 601 L CIS
  • Gorenje WT62093 468938
  • Whirlpool AWE 7515/1
  • Zanussi ZWY51004WA











These examples show that the algorithm did a reasonable job. In the first case it selected free-standing automatic front-loading washing machines of the same depth and with approximately the same maximum load (I am no great expert on washing machines, but these are the characteristics that seem important to me). In the second case it also selected free-standing automatic washing machines, but top-loading ones; the width and depth of the suggested options are the same. In neither case did it suggest activator-type, built-in or compact wall-mounted machines, although they are in the catalog.



Most likely, a large-appliance specialist could have done a better job: we discussed the results in different categories with salespeople, and they approved most of the options but also suggested ones that were not in our results. Looking at the results as a buyer, I find such recommendations useful and did not find any gross misses.



Thus, with this algorithm in place and only a product name as input, we can automatically find the product at suppliers and competitors, fill in its characteristics, select images and even suggest analogs. This greatly simplifies the work of content managers and sales managers.


