Why searching very short documents with regular full-text search is difficult, and what to do about it.
Introduction
We all deal with full-text search constantly: finding documents that match a search phrase. The best-known example is Google Search.
. , , Elasticsearch. .
DD Planet B2B- Elasticsearch. ( ), .
, Elasticsearch, — , , . .
:
T0=" »",
T1=" ",
T2=" ",
:
"": {0, 1}
"": {0}
"": {1, 2}
"": {2}
— , . , . , , « ». «» {2}, «» — {0}. , . , {0, 2} c ½. , , TF-IDF, .
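The term-to-document mapping above has the shape of an inverted index. As a minimal sketch of that idea, the Python below builds such an index and scores documents by the fraction of query tokens they contain. The three documents and the scoring rule are hypothetical stand-ins chosen to reproduce the same structure; they are not the article's original texts, and a real engine such as Elasticsearch uses more elaborate ranking.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Score each document by the fraction of query tokens it contains."""
    tokens = query.lower().split()
    scores = defaultdict(float)
    for token in tokens:
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1.0 / len(tokens)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical short documents (assumption, not taken from the article).
docs = ["red apple", "red car", "fast car"]
index = build_inverted_index(docs)

for term, doc_ids in index.items():
    print(term, sorted(doc_ids))
print(search(index, "fast apple"))
```

With documents this short, a two-word query whose words occur in different documents matches each of them with the same score of one half, so the ranking carries almost no information.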
, , , -, :
- .
: « » « » « » , , « » « », « ». , .
. : , . , , TF-IDF, . - .
— , , « 4», «4», « », « 4» . .
— Elasticsearch . , , .
- .
, . , « » « Windows» «» .
NLP
NLP . NLP (Natural Language Processing) — , .
NLP - , - . , .
«»
NLP — Paraphrase Identification — (, ) , ( ). : « 17:00» « ». ? , .
. . DeepPavlov.ai [1], , . , .
. ( ), . .. -.
, DeepPavlov, — , .
,
, . ? , , Elasticsearch
: , . .
, : — ,
- identity: $d(x, y) = 0$ if and only if $x = y$;
- symmetry: $d(x, y) = d(y, x)$;
- triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$.
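The three metric axioms can be checked empirically on a finite sample. The sketch below is only a brute-force sanity check under that assumption: it can expose violations but cannot prove that a function is a metric. The helper name `metric_violations` and the two example functions are illustrative, not part of the article.

```python
from itertools import product


def metric_violations(points, d, tol=1e-9):
    """Return (axiom_name, points) tuples for every axiom that d breaks."""
    violations = []
    for x, y in product(points, repeat=2):
        # Identity: distance is zero exactly when the points coincide.
        if (d(x, y) <= tol) != (x == y):
            violations.append(("identity", (x, y)))
        # Symmetry: the order of the arguments must not matter.
        if abs(d(x, y) - d(y, x)) > tol:
            violations.append(("symmetry", (x, y)))
    for x, y, z in product(points, repeat=3):
        # Triangle inequality: a detour through y is never shorter.
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violations.append(("triangle", (x, y, z)))
    return violations


def absolute(a, b):
    return abs(a - b)


def squared(a, b):
    return (a - b) ** 2


points = [0.0, 1.0, 3.0]
print(metric_violations(points, absolute))
print(metric_violations(points, squared))
```

The absolute difference passes all three checks on this sample, while the squared difference is symmetric but breaks the triangle inequality, which is the property that tree-based nearest neighbor search depends on.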
? (Nearest neighbor search) — . vantage-point tree,
Vantage-point tree
, vantage-point tree [3]. ball-tree, . . , . (vantage-point) ( ).
, (
, K
K ,
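To make the structure concrete, here is a minimal sketch of a vantage-point tree with a K-nearest-neighbor search. It assumes `d` is a true metric, picks vantage points at random, and splits at the median distance. The names `VPNode`, `build_vp_tree`, and `nearest` are illustrative and do not come from the article or from any particular library.

```python
import heapq
import random
import statistics


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree with d(point, p) <= radius
        self.outside = outside  # subtree with d(point, p) > radius


def build_vp_tree(points, d, rng=None):
    """Recursively build a vantage-point tree over points using metric d."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    distances = [d(vantage, p) for p in points]
    radius = statistics.median(distances)
    inside = [p for p, dist in zip(points, distances) if dist <= radius]
    outside = [p for p, dist in zip(points, distances) if dist > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, d, rng),
                  build_vp_tree(outside, d, rng))


def nearest(root, query, d, k=1):
    """Return the k closest points as (distance, point), nearest first (k >= 1)."""
    heap = []    # max-heap via negated distances: the best k candidates so far
    counter = 0  # unique tie-breaker so heapq never compares the points

    def visit(node):
        nonlocal counter
        if node is None:
            return
        dist = d(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, counter, node.point))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, counter, node.point))
        counter += 1
        if dist <= node.radius:
            near, far = node.inside, node.outside
        else:
            near, far = node.outside, node.inside
        visit(near)
        # tau is the current search radius: distance to the k-th best candidate.
        tau = -heap[0][0] if len(heap) == k else float("inf")
        # By the triangle inequality the far side can only hold a better
        # candidate if the query lies within tau of the splitting sphere.
        if abs(dist - node.radius) <= tau:
            visit(far)

    visit(root)
    return sorted(((-neg, p) for neg, _, p in heap), key=lambda item: item[0])


def absolute(a, b):
    return abs(a - b)


tree = build_vp_tree([1, 4, 9, 16, 25, 36], absolute)
print(nearest(tree, 11, absolute, k=2))
```

The pruning step is the only place where the triangle inequality is used. If `d` violates it, the search may skip a subtree that contains a closer point.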
— , . , , . cosine Doc2Vec — .
$d(x, y) = f(x, y) + \varepsilon \cdot S_{\mathrm{Doc2Vec}}(x \cdot y)$
ε — .
. ? , , , float32. - .
x , , .y
$d(x, y) = f(x, y) + f(y, x)$
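The symmetrization in this formula can be sketched in a few lines. The wrapper below works for any two-argument function. The function `missing_fraction` is a hypothetical asymmetric score invented for this illustration; it is not the article's f.

```python
def symmetrize(f):
    """Build d(x, y) = f(x, y) + f(y, x), which is symmetric by construction."""
    def d(x, y):
        return f(x, y) + f(y, x)
    return d


def missing_fraction(x, y):
    """Hypothetical asymmetric score: share of x's words that are absent from y."""
    x_words, y_words = set(x.split()), set(y.split())
    return len(x_words - y_words) / len(x_words) if x_words else 0.0


d = symmetrize(missing_fraction)

print(missing_fraction("red apple", "red"), missing_fraction("red", "red apple"))
print(d("red apple", "red"), d("red", "red apple"))
```

Swapping the arguments changes the raw score but not the symmetrized one. Symmetry alone does not make d a metric: the triangle inequality still has to be checked separately.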
. . ,d ( x , z ) ≤ d ( x , y ) + d ( y , z )
x=" ", y=" ", z=" "
,
- . , Doc2Vec — .d ( x , z ) ≥ d ( x , y ) + d ( y , z ) .
, , — , , . , : [2]. — , .
. ( ). , , ( ). , . «» .
( ), . .
. ? , . : , ? vantage-point tree , — vantage-point.
, [2], . , . .
« ». , . , .
. , . GitHub pip install nlp-text-search
.
[1] DeepPavlov documentation. http://docs.deeppavlov.ai/en/master/
[2] Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 311–321. http://web.cs.iastate.edu/~honavar/nndatastructures.pdf