Semantic Search: From Simple Jaccard Similarity to Complex SBERT

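This article covers the traditional approaches to similarity search (Jaccard similarity, w-shingling, and Levenshtein distance) as well as vector-based ones: TF-IDF, BM25, and Sentence-BERT. The examples are also available as Jupyter notebooks.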






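Similarity search is one of the fastest-growing areas of AI and machine learning. At its core, it is the process of matching relevant pieces of information together. There is a good chance you found this article through a search engine, most likely Google. Perhaps you searched for something like "what is semantic similarity search?" or "traditional vs vector similarity search". Google processed your query and used many of the same similarity search fundamentals covered in this article to bring you here.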





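If similarity search is at the heart of a company worth $1.65 trillion [1], it is probably worth learning more about. Similarity search is a complex topic, and there are countless techniques for building effective search engines. This article covers some of the most interesting and powerful of them, focusing on semantic search: how they work, what they are good at, and how to implement them.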





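We start in the traditional camp, where the key players are:

  • Jaccard similarity.

  • w-shingling.

  • Levenshtein distance.

All three are useful metrics for similarity search. We cover each in turn: Jaccard similarity, w-shingling, and Levenshtein distance.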





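Jaccard similarity is a simple but sometimes powerful similarity metric. Given two sequences A and B, we count the elements shared by both and divide by the total number of unique elements across the two.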





Jaccard similarity measures the intersection between two sequences versus combining two sequences

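Take two sequences of integers that share two unique values, 3 and 4. Together the sequences contain eight unique values, so the Jaccard similarity is 2/8, or 0.25. The same operation works for text data: we simply replace integers with tokens.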
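The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the original notebook code: the two integer sequences and the two short sentences are hypothetical, chosen only so that the integer case shares 3 and 4 out of eight unique values.

```python
def jaccard(x, y):
    """Jaccard similarity: size of the intersection over size of the union."""
    x, y = set(x), set(y)
    union = x | y
    if not union:  # two empty inputs: treat similarity as 0.0
        return 0.0
    return len(x & y) / len(union)

# Hypothetical sequences: eight unique values overall, sharing 3 and 4
a = [0, 1, 2, 3, 4]
b = [3, 4, 5, 6, 7]
print(jaccard(a, b))  # 0.25

# The same function works on text once sentences are split into tokens
s1 = "the cat sat on the mat".split()
s2 = "the dog sat on the log".split()
print(jaccard(s1, s2))  # 3 shared tokens out of 7 unique tokens
```

Converting to sets first means repeated tokens count only once, which matches the "unique values" wording in the explanation.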





Jaccard similarity calculated between two sentences a and b

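As we would expect, sentences b and c score much higher. The metric is not perfect, though: two sentences that share nothing but words like 'the', 'a', and 'is' can return a high Jaccard score despite being semantically dissimilar. These shortcomings can be partially addressed with preprocessing such as stop-word removal and stemming/lemmatization. As we will see, some methods avoid these problems altogether.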





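w-shingling uses exactly the same intersection / union logic, but applies it to "shingles". A 2-shingle of sentence a looks something like this: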





a = {'his thought', 'thought process', 'process is', ...}
      
      



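We then apply the same intersection / union calculation to the shingled sentences. Using 2-shingles, sentences b and c have a similarity of 0.125.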
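A short sketch of the shingling step follows. The sentences here are hypothetical stand-ins for b and c (the full sentences are not reproduced in the text), selected so that the 2-shingle score comes out at the 0.125 quoted above. The function names are illustrative as well.

```python
def shingles(text, w=2):
    """Return the set of w-shingles (runs of w consecutive tokens)."""
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_similarity(x, y, w=2):
    """Jaccard similarity computed over w-shingles instead of single tokens."""
    sx, sy = shingles(x, w), shingles(y, w)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0

# Hypothetical stand-ins for sentences b and c
b = "there is an art to getting your way and throwing bananas on to the street is not it"
c = "it is not often you find soggy bananas on the street"

print(sorted(shingles(b) & shingles(c)))  # ['bananas on', 'is not', 'the street']
print(shingle_similarity(b, c))           # 0.125
```

Because shingles preserve local word order, two sentences must share adjacent word pairs to score at all, which makes the measure stricter than token-level Jaccard.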





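Another popular metric for comparing two strings is the Levenshtein distance: the number of operations required to change one string into the other. It is calculated with: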





Levenshtein distance formula
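For reference, the standard recursive definition of the Levenshtein distance can be written as below. It uses the same symbols as the discussion that follows (strings a and b, positions i and j); the indicator notation on the last term is the usual textbook convention and is an assumption about how the original figure was typeset.

```latex
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0,\\[4pt]
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,\,j) + 1\\
\operatorname{lev}_{a,b}(i,\,j-1) + 1\\
\operatorname{lev}_{a,b}(i-1,\,j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
```

Here $1_{(a_i \neq b_j)}$ equals 1 when the characters at positions i and j differ and 0 when they match.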

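The formula looks fairly complicated; if you understand it, great! If not, don't worry, we will break it down. The variables a and b are our two strings, and i and j are the character positions in a and b respectively. So, given the strings: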





"Levenshtein" and the misspelled "Livinshten"

we would index the characters like this:





We index the word itself, starting from 1 to its length, the zero index means "nothing" (more on this later).

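Easy! A good way to grasp the logic behind the formula is to visualize the Wagner-Fischer algorithm, which uses a simple matrix to compute the Levenshtein distance. We place the two words a and b along the two axes of the matrix, including the "nothing" character as an empty first position: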





The empty Wagner-Fischer matrix, which we will use to calculate the Levenshtein distance between 'Levenshtein' and 'Livinshten'.

— :





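We then iterate through every position in the matrix and apply the formula. Its first step is the condition if min(i, j) = 0, which simply asks whether either of the two positions i and j is 0. If so, we move to max(i, j), which assigns the current cell the larger of i and j: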





Along the outer edges, where i and/or j is 0, the cell is filled with max(i, j).

When min(i, j) == 0, the cell value is max(i, j).





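That takes care of the outer edges of the matrix, but we still need the inner values, which is where the optimal path lies. Back to if min(i, j) = 0: what if neither position is 0? Then we move on to the more complex part of the equation, under min {. We calculate a value for each of its rows and take the minimum. We already know the values it needs: they are in the matrix.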





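For each new cell, we take the minimum over the three neighboring cells (above, to the left, and diagonally above-left).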

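lev(i-1, j) and the other terms are all indexing operations: each one extracts the value at that position in the matrix. The first two terms always add 1, and we then take the minimum of the three. One detail remains: the +1 on the diagonal term lev(i-1, j-1) is applied only if a[i] != b[j], as the penalty for mismatched characters.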





If a[i] != b[j], we add 1 to the diagonal value: this is the penalty for mismatched characters.

All of these steps are combined into a single loop over the full matrix.





— .





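We have now calculated every value in the matrix. Each value is the number of operations required to convert string a up to position i into string b up to position j. We want the number of operations to convert all of a into all of b, so we take the bottom-right value of the array, lev[-1, -1].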





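The optimal path through the matrix: position [-1, -1], at the bottom right, holds the Levenshtein distance between the two strings.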
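The matrix procedure described above can be sketched as follows. This is a minimal pure-Python version using nested lists, so the bottom-right value is read as `lev[-1][-1]` rather than `lev[-1, -1]`. The function name is illustrative. Note that the deletion and insertion terms always cost 1, while the diagonal (substitution) term costs 1 only when the characters differ.

```python
def levenshtein(a, b):
    """Levenshtein distance computed with the Wagner-Fischer matrix."""
    # Row 0 and column 0 stand for the empty "nothing" prefix
    rows, cols = len(a) + 1, len(b) + 1
    lev = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if min(i, j) == 0:
                # Outer edges: converting to or from an empty prefix
                lev[i][j] = max(i, j)
            else:
                deletion = lev[i - 1][j] + 1
                insertion = lev[i][j - 1] + 1
                # Python strings are 0-indexed, hence a[i - 1] and b[j - 1]
                substitution = lev[i - 1][j - 1] + (a[i - 1] != b[j - 1])
                lev[i][j] = min(deletion, insertion, substitution)
    return lev[-1][-1]

print(levenshtein("Levenshtein", "Livinshten"))  # 3
```

For the example pair, the three operations are two substitutions ('e' to 'i', twice) and one deletion (the 'i' before the final 'n').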

For vector-based search, we typically find one of several vector-building methods:





  • TF-IDF.





  • BM25.





  • word2vec/doc2vec.





  • BERT.





  • USE.





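Combined with some implementation of approximate nearest neighbors (ANN), these vector-based methods are the MVPs of similarity search. We will cover TF-IDF, BM25, and BERT-based approaches, as these are the most common and span both sparse and dense vector representations.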





1. TF-IDF

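The respected grandfather of vector similarity search, born back in the 1970s. It consists of two parts: Term Frequency (TF) and Inverse Document Frequency (IDF). The TF component counts the number of times a term appears in a document and divides this by the total number of terms in that same document.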





The term frequency (TF) component of TF-IDF counts the frequency of our query ("bananas") and divides it by the frequency of all tokens.

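That is the first half of the calculation: the frequency of the query within the document, f(q, D), over the frequency of all terms within the document, f(t, D). TF is a good measure, but it does not let us distinguish between common and uncommon words. If we searched for "the" using TF alone, it would be assigned the same relevance as "bananas". That is fine until we start comparing documents or searching with longer queries: we do not want words like 'the', 'is', or 'it' to rank as highly as 'bananas' or 'street'.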





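Ideally, matches between rarer words should score higher. To achieve this, we multiply TF by a second term, IDF. The inverse document frequency measures how common a word is across all of our documents.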





The inverse document frequency (IDF) component of TF-IDF counts the number of documents that contain our query.

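In this example we have three sentences. The IDF of the common word 'is' is much lower than that of the rarer word 'forest'. If we then search for both 'is' and 'forest', we combine TF and IDF as follows.

We calculate TF('is', D) and TF('forest', D) for documents a, b, and c. The IDF value is computed across all documents, so IDF('is') and IDF('forest') are each calculated only once. The TF-IDF value for each word in each document is then the product of TF and IDF. Sentence a scores highest for 'forest', while 'is' always scores 0, because IDF('is') is 0.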
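The TF and IDF calculations for 'is' and 'forest' can be sketched like this. The three sentences are hypothetical stand-ins for documents a, b, and c, chosen so that 'is' occurs in every document and 'forest' only in a. The base-10 logarithm is also an assumption, as the text does not state which base is used.

```python
import math

# Hypothetical three-sentence corpus standing in for documents a, b and c
docs = {
    "a": "purple is the best city in the forest".split(),
    "b": "there is an art to getting your way and throwing bananas on to the street is not it".split(),
    "c": "it is not often you find soggy bananas on the street".split(),
}
corpus = list(docs.values())

def tf(word, doc):
    """Term frequency: occurrences of the word divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """Inverse document frequency: log10(N / number of documents containing the word)."""
    n_containing = sum(1 for doc in corpus if word in doc)
    # Assumes the word occurs in at least one document
    return math.log10(len(corpus) / n_containing)

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

for word in ("is", "forest"):
    scores = {name: round(tf_idf(word, doc, corpus), 4) for name, doc in docs.items()}
    print(word, scores)
```

Since 'is' appears in all three documents, its IDF is log10(3/3) = 0 and every TF-IDF score for it is 0, whatever the term frequency.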





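That's great, but where does vector similarity search come into this? We take our vocabulary (a list of all the words in our dataset) and calculate the TF-IDF for each and every word.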





We calculate the TF-IDF for every word in the vocabulary to create a TF-IDF vector.

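Putting all of this together gives us a TF-IDF vector for each document. Note that vocabulary sizes can easily reach 20,000 words or more, so the vectors produced this way are extremely sparse and cannot encode semantic meaning.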
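As a rough sketch of the vector-building step, the snippet below computes one TF-IDF value per vocabulary word for each document. The corpus is a hypothetical three-sentence example and the helper name `tf_idf_vector` is illustrative. A real pipeline would more likely use a library vectorizer, which also applies smoothing that this sketch omits.

```python
import math

# Hypothetical tokenized corpus
docs = [
    "purple is the best city in the forest".split(),
    "there is an art to getting your way and throwing bananas on to the street is not it".split(),
    "it is not often you find soggy bananas on the street".split(),
]

# Vocabulary: every unique token in the dataset, in a fixed order
vocab = sorted({token for doc in docs for token in doc})

def tf_idf_vector(doc, corpus, vocab):
    """One TF-IDF value per vocabulary word, in vocabulary order."""
    vector = []
    for word in vocab:
        tf = doc.count(word) / len(doc)
        n_containing = sum(1 for d in corpus if word in d)  # at least 1 by construction
        idf = math.log10(len(corpus) / n_containing)
        vector.append(tf * idf)
    return vector

vectors = [tf_idf_vector(doc, docs, vocab) for doc in docs]

zeros = sum(1 for value in vectors[0] if value == 0)
print(f"{zeros} of {len(vocab)} entries in the first vector are zero")
```

Even with this tiny vocabulary, most entries are zero, which illustrates why such vectors are called sparse.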





2. BM25

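The successor to TF-IDF, Okapi BM25, is the result of optimizing TF-IDF, primarily to normalize results based on document length. TF-IDF is great, but it can return questionable results once we start comparing multiple mentions. Take two 500-word articles: if article A mentions a term six times and article B mentions it twelve times, should we consider article A half as relevant? Probably not. BM25 addresses this by modifying TF-IDF: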





The BM25 formula
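For reference, a commonly used form of the Okapi BM25 score for a single query term q in document D is given below. The symbols f(q, D), k, and b follow the surrounding text. The names |D| (document length), d_avg (average document length), N (number of documents), and N(q) (number of documents containing q) are notation assumed here, and the original figure may have used a slightly different variant.

```latex
\mathrm{BM25}(q, D) =
\underbrace{\frac{f(q, D)\,(k + 1)}
                 {f(q, D) + k\left(1 - b + b\,\dfrac{|D|}{d_{\mathrm{avg}}}\right)}}_{\text{TF part}}
\cdot
\underbrace{\ln\!\left(\frac{N - N(q) + 0.5}{N(q) + 0.5} + 1\right)}_{\text{IDF part}}
```

The parameter k controls how quickly the TF part saturates as f(q, D) grows, and b controls how strongly the score is normalized by document length.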

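The equation looks nasty, but it is still just our TF-IDF formula with a few new parameters! Let's compare the two TF components: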





The TF part of BM25 (left) compared to the TF part of TF-IDF (right).

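Then there is the IDF part, which does not introduce any new parameters: it is simply a rearrangement of the IDF from TF-IDF.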





The IDF part of BM25 (left) compared to the IDF of TF-IDF (right).

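So what is the result of this modification? If we take a sequence of 12 tokens and gradually feed it more and more matching tokens, we get the following scores: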





Comparison of TF-IDF (top) and BM25 (bottom) for a sequence of 12 tokens and an increasing number of relevant tokens (x-axis).

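The TF-IDF score increases linearly with the number of relevant tokens: if the frequency doubles, so does the TF-IDF score. The BM25 score, by contrast, levels off. Sounds good! But how do we implement it in Python? Again, we keep it simple, as with the TF-IDF implementation.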





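We use the default values for k and b, and the results look promising. The query 'purple' matches only sentence a, while 'bananas' scores reasonably for both b and c, slightly higher for c because it contains fewer words. To build vectors from this, we do exactly what we did for TF-IDF.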
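A minimal BM25 sketch for single-word queries is shown below. The corpus is a hypothetical three-sentence example, and k = 1.2 and b = 0.75 are commonly used defaults, assumed here because the text does not state the values it used. The IDF term uses the "+1 inside the logarithm" variant, which keeps scores non-negative.

```python
import math

# Hypothetical corpus standing in for sentences a, b and c
docs = {
    "a": "purple is the best city in the forest".split(),
    "b": "there is an art to getting your way and throwing bananas on to the street is not it".split(),
    "c": "it is not often you find soggy bananas on the street".split(),
}
corpus = list(docs.values())

def bm25(word, doc, corpus, k=1.2, b=0.75):
    """BM25 score of a single query word for one document."""
    freq = doc.count(word)
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    # TF part: saturates as freq grows and is normalized by document length
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(doc) / avg_len))
    # IDF part: rarer words receive a larger weight
    n_q = sum(1 for d in corpus if word in d)
    idf = math.log((len(corpus) - n_q + 0.5) / (n_q + 0.5) + 1)
    return tf * idf

for word in ("purple", "bananas"):
    print(word, {name: round(bm25(word, doc, corpus), 4) for name, doc in docs.items()})
```

With this corpus, 'purple' scores above zero only for a, and 'bananas' scores higher for c than for b because c is the shorter document.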





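Again, just as with our TF-IDF vectors, these are sparse vectors. They cannot encode semantic meaning and focus on syntax instead.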





, .





3. BERT

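BERT (Bidirectional Encoder Representations from Transformers) is a hugely popular transformer model used for almost everything in NLP. Through 12 (or so) encoder layers, BERT encodes a large amount of information into a set of dense vectors.

Each dense vector typically contains 768 values, and we usually have 512 of these vectors for each sequence encoded by BERT. These vectors can be viewed as numerical representations of language. We can extract them from different layers if we want, but typically they are taken from the final layer. With two correctly encoded dense vectors, we can use a similarity metric such as cosine similarity to calculate their semantic similarity. Vectors that are more closely aligned are more semantically alike, and vice versa.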





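A smaller angle between vectors (calculated with cosine similarity) means they are more closely aligned, which for dense vectors correlates with greater semantic similarity.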
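Cosine similarity itself takes only a few lines. The sketch below uses plain Python lists and toy two- and three-dimensional vectors as illustrative assumptions. With real 768-dimensional embeddings, a tensor library would normally do the work.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)  # assumes neither vector is all zeros

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 1.0: same direction, different length
```

Because the score depends only on direction, scaling a vector does not change its similarity to other vectors.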

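There is one problem: each sequence is represented by 512 vectors, not one. This is where another brilliant adaptation of BERT comes in. Sentence-BERT lets us create a single vector that represents the full sequence, known as a sentence vector [2]. There are two ways to implement SBERT: the easy way, using the sentence-transformers library, or the slightly less easy way, using transformers and PyTorch.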





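We will cover both, starting with the transformers and PyTorch approach so that we can see how these vectors are built. If you have used the HF transformers library, the first steps will look familiar: we initialize the SBERT model and tokenizer, tokenize the text, and process the tokens through the model.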





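We have added a new sentence here: sentence g carries the same semantic meaning as b, without using the same keywords. Because they share no words, all of the previous methods would struggle to find similarity between the two. Remember this for later.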





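We now have vectors of length 768, but these are not sentence vectors: there is one vector for each token in the sequence (128 here because we are using SBERT; for BERT it would be 512). To create a sentence vector, we need to apply a mean pooling operation.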





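First, we multiply each value in the embeddings tensor by its corresponding attention_mask value. The attention_mask holds ones for "real tokens" (that is, not padding tokens) and zeros everywhere else, so this operation lets us ignore the padding positions.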
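The masking step, followed by averaging over the real tokens, can be illustrated without any tensor library. This is a pure-Python stand-in for what would normally be done on PyTorch tensors. The toy embeddings (4 token positions, 3 dimensions) and the function name `mean_pool` are assumptions for illustration only.

```python
def mean_pool(embeddings, attention_mask):
    """Average the token vectors, ignoring positions where the mask is 0."""
    dim = len(embeddings[0])
    summed = [0.0] * dim
    for vector, mask in zip(embeddings, attention_mask):
        for d in range(dim):
            summed[d] += vector[d] * mask  # padding tokens contribute nothing
    n_real = max(sum(attention_mask), 1)   # guard against an all-padding input
    return [value / n_real for value in summed]

# Toy input: three real tokens followed by one padding position
embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, masked out below
]
attention_mask = [1, 1, 1, 0]

print(mean_pool(embeddings, attention_mask))  # [2.0, 2.0, 2.0]
```

Dividing by the number of real tokens rather than the full sequence length keeps padded and unpadded versions of the same sentence at the same pooled vector.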





, , .





If we visualize the resulting scores, the more similar sentences are easy to identify:





Heatmap showing the cosine similarity between our SBERT sentence vectors, including the score between sentences b and g.

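Think back to earlier, when we noted that sentences b and g have essentially the same meaning while sharing none of the same keywords. We would hope that SBERT, with its superior semantic representations, identifies the two as similar, and indeed their similarity score is 0.66 (highlighted in the heatmap). The alternative (easy) approach is to use sentence-transformers, which produces the same output as above with far less code.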





, , . , Bert!





  • [1] Market Capitalization of Alphabet (GOOG), Companies Market Cap.





  • [2] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.









Colab
  • Jaccard,





  • Levenshtein,





  • TF-IDF,





  • BM25,





  • SBERT.




