How to build full-text search using neural networks

Why searching very short documents with regular full-text search is difficult, and what to do if you need it anyway.





Introduction



We all constantly deal with full-text search: finding documents that match a search phrase. The best-known example is Google Search.



. , , Elasticsearch. .



DD Planet B2B- Elasticsearch. ( ), .





, Elasticsearch, — , , . .



:



T0=" »", 
T1=" ", 
T2=" ",


:



"": {0, 1}
"": {0}
"": {1, 2}
"": {2}


— , . , . , , « ». «» {2}, «» — {0}. , . , {0, 2} c ½. , , TF-IDF, .
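As a rough illustration of the index shown above, here is a minimal sketch of an inverted index with a naive relevance score. The three sample documents and the query are hypothetical stand-ins chosen only to reproduce the posting sets {0, 1}, {0}, {1, 2}, {2}; the function names `build_index` and `search` are likewise assumptions, and the scoring (fraction of query terms found in a document) is a simplification, not what Elasticsearch actually computes.

```python
from collections import defaultdict

# Hypothetical stand-in documents; they only mirror the posting-set pattern.
docs = {0: "blue car", 1: "blue bike", 2: "red bike"}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Score each document by the fraction of query terms it contains."""
    terms = query.lower().split()
    scores = defaultdict(float)
    for term in terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1 / len(terms)
    return dict(scores)


index = build_index(docs)
result = search(index, "red car")  # each of the two terms hits one document
```

With a two-word query where each word matches a different document, both documents end up with a score of ½, which is why such short documents are hard to rank by term overlap alone.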





, , , -, :



  1. .

    : « » « » « » , , « » « », « ». , . 

    . : , . , , TF-IDF, .
  2. .

    — , , « 4», «4», « », « 4» . . 

    — Elasticsearch . , , .

  3. .

    , . , « » « Windows» «» .  



NLP



NLP . NLP (Natural Language Processing) — , . 



NLP - , - . , .



«»



NLP — Paraphrase Identification — (, ) , ( ).  : « 17:00» « ». ? , .



. . DeepPavlov.ai [1], , . , .



. ( ), . .. -. 



, DeepPavlov, — , .



,



, . ? , O(N) , Elasticsearch O(logN). .



: , . . 



, : — , d , :



  1. d(x,y)=0 if and only if x=y.
  2. d(x,y)=d(y,x).
  3. d(x,z) ≤ d(x,y)+d(y,z) (the triangle inequality).
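The three axioms can be spot-checked numerically on a sample of points. The sketch below is an illustration under assumptions: `check_metric` is a hypothetical helper, and a check on a finite sample can only find violations, it cannot prove that a function is a metric. Euclidean distance passes, while squared Euclidean distance fails the triangle inequality on collinear points.

```python
import itertools
import math


def check_metric(d, points, tol=1e-9):
    """Return the names of the metric axioms that d violates on the sample."""
    violated = set()
    for x, y in itertools.product(points, repeat=2):
        # Axiom 1: distance is zero exactly when the points coincide.
        if (d(x, y) <= tol) != (x == y):
            violated.add("identity")
        # Axiom 2: symmetry.
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
    for x, y, z in itertools.product(points, repeat=3):
        # Axiom 3: triangle inequality.
        if d(x, z) > d(x, y) + d(y, z) + tol:
            violated.add("triangle")
    return violated


def squared(a, b):
    """Squared Euclidean distance: symmetric, but not a metric."""
    return math.dist(a, b) ** 2


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
euclidean_violations = check_metric(math.dist, points)
squared_violations = check_metric(squared, points)
```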


? (Nearest neighbor search) — . vantage-point tree, O(logN) [2]. , , , , Kd-. , .



Vantage-point tree



, vantage-point tree [3]. ball-tree, . . , . (vantage-point) ( ).





, ( S vantage-point), . — . . , S , . , .



, K X ( ). , (, ). D — . , .





X , «» . D. «» ? , D>T ( X ), . , «» .





K , DT, «» . .
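A compact sketch of a vantage-point tree with K-nearest-neighbour search may make the procedure concrete. It is an illustration under assumptions, not the article's implementation: the names `VPNode`, `build` and `nearest` are hypothetical, the vantage point is picked at random, the threshold T is taken as the median distance to the vantage point, and D denotes the distance from the query X to the vantage point. The pruning relies on the triangle inequality, so `d` must be a metric for the results to be exact.

```python
import heapq
import itertools
import math
import random
import statistics


class VPNode:
    def __init__(self, point, threshold=0.0, inside=None, outside=None):
        self.point = point          # vantage point stored in this node
        self.threshold = threshold  # radius T that splits the other points
        self.inside = inside        # points p with d(point, p) <= T
        self.outside = outside      # points p with d(point, p) > T


def build(points, d, rng=None):
    """Recursively build a vantage-point tree over points for metric d."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp)
    dists = [d(vp, p) for p in points]
    t = statistics.median(dists)
    inside = [p for p, dist in zip(points, dists) if dist <= t]
    outside = [p for p, dist in zip(points, dists) if dist > t]
    return VPNode(vp, t, build(inside, d, rng), build(outside, d, rng))


def nearest(root, x, k, d):
    """Return the k points closest to x as (distance, point), nearest first."""
    if k <= 0:
        return []
    heap = []                    # max-heap on distance via negated keys
    order = itertools.count()    # tie-breaker so points are never compared

    def tau():
        # Current search radius: distance of the worst of the k best so far.
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node is None:
            return
        dist = d(x, node.point)  # D in the text
        if len(heap) < k:
            heapq.heappush(heap, (-dist, next(order), node.point))
        elif dist < tau():
            heapq.heapreplace(heap, (-dist, next(order), node.point))
        if dist <= node.threshold:
            visit(node.inside)
            # The outside half can only help if the search ball crosses T.
            if dist + tau() > node.threshold:
                visit(node.outside)
        else:
            visit(node.outside)
            if dist - tau() <= node.threshold:
                visit(node.inside)

    visit(root)
    return sorted(((-neg, p) for neg, _, p in heap), key=lambda item: item[0])


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 1.0)]
tree = build(points, math.dist)
neighbours = nearest(tree, (1.2, 1.1), k=2, d=math.dist)
```

Each subtree is skipped whenever the ball of the current best K candidates cannot reach it, which is what brings the expected number of distance evaluations well below a full scan on well-behaved data.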





(f(x,y)) vantage-point tree :



  1. — , . , , . cosine Doc2Vec — . 



    d(x,y) = f(x,y) + ε·S_Doc2Vec(x,y)



    ε — .



  2. . ? , , , float32. - . x y, , . 



    d(x,y)=f(x,y)+f(y,x)



  3. d(x,z) ≤ d(x,y)+d(y,z). . ,



    x=" ", y=" ", z=" "


    , d(x,z)d(x,y)+d(y,z). - . , Doc2Vec — . 
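The two corrections given by the formulas above, d(x,y)=f(x,y)+f(y,x) and the ε-weighted auxiliary term, can be sketched as small wrappers around a score function. Everything concrete here is an assumption made for illustration: `f` is a toy asymmetric score standing in for the neural network, `s` is a Jaccard word distance standing in for the Doc2Vec cosine distance, the value of `eps` is arbitrary, and chaining the two wrappers is one possible way to combine them. Neither wrapper does anything about the triangle inequality.

```python
def symmetrize(f):
    """d(x, y) = f(x, y) + f(y, x): same value for either argument order."""
    return lambda x, y: f(x, y) + f(y, x)


def with_auxiliary(f, s, eps=1e-3):
    """d(x, y) = f(x, y) + eps * s(x, y), with s a cheap auxiliary distance."""
    return lambda x, y: f(x, y) + eps * s(x, y)


def f(x, y):
    """Toy asymmetric score (assumption): share of x's words missing from y."""
    xs, ys = set(x.split()), set(y.split())
    return len(xs - ys) / max(len(xs), 1)


def s(x, y):
    """Toy auxiliary distance (assumption): Jaccard distance of word sets."""
    xs, ys = set(x.split()), set(y.split())
    return 1 - len(xs & ys) / max(len(xs | ys), 1)


d = symmetrize(with_auxiliary(f, s))

x, y = "usb cable", "usb cable 2m"
asymmetric = f(x, y) != f(y, x)         # the raw score depends on the order
zero_for_different = f(x, y) == 0       # the raw score cannot tell x from y
separated = with_auxiliary(f, s)(x, y)  # the auxiliary term pushes it above 0
```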





, , — , , . , : [2]. — , . 



. ( ). , , ( ). , . «» .







( ), . . 





. ? , . : , ? vantage-point tree , — vantage-point. 





, [2], . , . . 



« ». , . , . 



The library is published on GitHub and can be installed with pip install nlp-text-search.





[1] DeepPavlov documentation. http://docs.deeppavlov.ai/en/master/



[2] Yianilos, P. N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 311–321. http://web.cs.iastate.edu/~honavar/nndatastructures.pdf



[3] http://stevehanov.ca/blog/?id=130



