, , , ; , , — TF-IDF BM25 Sentence-BERT. , Jupyter.
— . , . , — , Google. , - "what is semantic similarity search?" "traditional vs vector similarity search". Google , .
$ 1,65 — [1], , . — , . , . , , . .
:
.
.
.
, : , .
— , . A B: .
: , :
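Here the two sequences share two unique values, 3 and 4, out of eight unique values in total, so the similarity score is 2/8, or 0.25.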
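The calculation can be sketched in a few lines of Python. This is a minimal sketch, not the article's own notebook code: the two integer sequences are illustrative values chosen so that they share 3 and 4 and contain eight unique values overall, and the function name `jaccard` is an assumption.

```python
def jaccard(x, y):
    """Jaccard similarity: shared unique elements divided by all unique elements."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(x & y) / len(union)


# Illustrative sequences (assumed): ten integers, eight unique, sharing 3 and 4.
a = [0, 1, 2, 3, 4]
b = [3, 4, 5, 6, 7]

score = jaccard(a, b)
print(score)  # 2 shared / 8 unique = 0.25
```

Converting to sets first means repeated elements count once, which is why the denominator is the number of unique values rather than the combined length of both sequences.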
, , , b c . , — , , 'the', 'a', 'is' . ., , , . , -, / . . , , .
— / — "". 2- a :
a = {'his thought', 'thought process', 'process is', ...}
/ :
Between b and c, this gives a score of 0.125.
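A minimal sketch of 2-gram shingling followed by the same set-overlap score. The sentences for b and c are assumed example sentences (they are not reproduced in the text above) and the helper names are hypothetical. With these inputs the three shared shingles out of 24 unique ones give 0.125.

```python
def shingles(text, n=2):
    """Return the set of n-word shingles (word n-grams) in a sentence."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(x, y):
    """Jaccard similarity of two sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Assumed example sentences for b and c.
b = "there is an art to getting your way and throwing bananas on to the street is not it"
c = "it is not often you find soggy bananas on the street"

b_shingles = shingles(b)
c_shingles = shingles(c)

shared = b_shingles & c_shingles
score = jaccard(b_shingles, c_shingles)

print(sorted(shared))  # ['bananas on', 'is not', 'the street']
print(score)           # 3 shared / 24 unique = 0.125
```

Because a shingle keeps word order, two sentences only match where they share a run of adjacent words, not merely the same vocabulary.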
— — , , :
; , ! , — . a b , i j — a b . , :
, :
! — — , . a b , :
— :
, . if min(i, j) = 0 — , i j ? , max(i, j), — i j:
If min(i, j) == 0, the distance is max(i, j).
, — . if min(i, j) = 0 — , 0? min {. , . — :
lev(i-1, j), — , . . . +1 , a[i] != b[i] — .
:
— .
, — , a i b j. a b, lev[-1, -1].
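The recurrence described above can be sketched as a small dynamic-programming routine. This is a plain-Python sketch under stated assumptions: the function name and the example strings are illustrative, and the matrix is a list of lists, so the final cell is read as `lev[-1][-1]` rather than `lev[-1, -1]`.

```python
def levenshtein_matrix(a, b):
    """Build the full matrix; cell [i][j] is the distance between a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    lev = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if min(i, j) == 0:
                # One prefix is empty: the distance is the length of the other.
                lev[i][j] = max(i, j)
            else:
                deletion = lev[i - 1][j] + 1
                insertion = lev[i][j - 1] + 1
                # +1 only when the current characters differ (0-based indexing).
                substitution = lev[i - 1][j - 1] + (a[i - 1] != b[j - 1])
                lev[i][j] = min(deletion, insertion, substitution)
    return lev


# Illustrative strings (assumed), a common textbook pair.
matrix = levenshtein_matrix("kitten", "sitting")
distance = matrix[-1][-1]
print(distance)  # 3
```

Keeping the whole matrix is convenient for inspection. If only the final distance is needed, two rows are enough.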
:
TF-IDF.
BM25.
word2vec/doc2vec.
BERT.
USE.
(ANN) , , — . TF-IDF, BM25 BERT, , .
1. TF-IDF
, 1970- . : Term Frequency (TF) Inverse Document Frequency (IDF). TF , , .
, f(q, D) f(t, D) . TF — , . "the", TF, , "bananas". , - . , , 'the', 'is' 'it', , 'bananas' 'street'.
, . TF — IDF. , .
. IDF 'is', , 'forest'. 'is' 'forest', TF IDF :
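We compute TF('is', D) and TF('forest', D) for documents a, b and c. IDF is computed across all documents, so IDF('is') and IDF('forest') are each computed once. The TF-IDF score is the product of TF and IDF. Only a contains 'forest', while 'is' always scores 0, because IDF('is') is 0.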
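A minimal sketch of the TF, IDF and TF-IDF calculations for 'is' and 'forest'. The three sentences are assumed example documents, the base-10 logarithm is an assumption (the text does not state the base), and the function names are hypothetical.

```python
import math

# Assumed example documents a, b and c, pre-split into words.
docs = {
    "a": "purple is the best city in the forest".split(),
    "b": "there is an art to getting your way and throwing bananas on to the street is not it".split(),
    "c": "it is not often you find soggy bananas on the street".split(),
}
corpus = list(docs.values())


def tf(term, doc):
    """Term frequency: f(q, D) divided by the total number of terms in D."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log10(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # convention chosen here for terms absent from the corpus
    return math.log10(len(corpus) / n_containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


for term in ("is", "forest"):
    for name, doc in docs.items():
        print(term, name, round(tf_idf(term, doc, corpus), 4))
```

Since 'is' appears in all three documents, its IDF is log10(3/3) = 0, and every TF-IDF score for it collapses to 0 regardless of how often it occurs.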
, ? , ( ) TF-IDF .
TF-IDF, :
TF-IDF. , 20 000 , , , - .
2. BM25
TF-IDF, Okapi BM25, TF-IDF . TF-IDF — , , . 500 , "" 6 , — 12 , ? , . BM25 TF-IDF:
, , TF-IDF ! TF:
IDF, — IDF TF-IDF.
, ? 12 "" , :
TF-IDF . , — TF-IDF. ! Python? , TF-IDF.
k b, . 'purple' a, 'bananas' b, c, c — . , , TF-IDF.
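A minimal sketch of a BM25 score with the k and b parameters. It uses one common formulation of Okapi BM25 with a +1 inside the IDF logarithm. The defaults k=1.2 and b=0.75 are widely used values, not values confirmed by the text, and the example sentences are the same assumed documents a, b and c.

```python
import math

# Assumed example documents a, b and c, pre-split into words.
docs = {
    "a": "purple is the best city in the forest".split(),
    "b": "there is an art to getting your way and throwing bananas on to the street is not it".split(),
    "c": "it is not often you find soggy bananas on the street".split(),
}
corpus = list(docs.values())


def bm25(term, doc, corpus, k=1.2, b=0.75):
    """BM25 score of one term for one document (k and b defaults are assumptions)."""
    freq = doc.count(term)
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    # TF part: grows with freq but saturates, and is normalised by document length.
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(doc) / avgdl))
    # IDF part: rarer terms weigh more; the +1 keeps the value positive.
    n_q = sum(1 for d in corpus if term in d)
    idf = math.log((len(corpus) - n_q + 0.5) / (n_q + 0.5) + 1)
    return tf * idf


for term in ("purple", "bananas"):
    for name, doc in docs.items():
        print(term, name, round(bm25(term, doc, corpus), 4))
```

With these inputs 'bananas' scores higher in c than in b even though it appears once in each, because c is shorter than the average document and b is longer. Raising b strengthens that length penalty, and raising k delays the point where repeated mentions stop adding much.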
, TF-IDF, . [] , .
, .
3. BERT
BERT (Bidirectional Encoder Representations from Transformers) — , NLP . 12 ( ) BERT .
768 , 512 BERT . , . ( ) , . , , , , . , .
: 512 — . - — — BERT. Sentence-BERT , , [2]. SBERT: sentence-transformers transformers PyTorch.
, transformers PyTorch, . HF, : SBERT , .
, g , b, — . - — .
768, , ( SBERT, 128, BERT 512). .
, , — embeddings attention_mask. attention_mask , " " (, ), — .
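The pooling step that combines token embeddings with the attention_mask can be sketched without any deep-learning library. This is a plain-Python illustration of the arithmetic only, under stated assumptions: toy 3-dimensional vectors stand in for the 768-dimensional ones, the function names are hypothetical, and in practice the same operation would be done on PyTorch tensors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding tokens."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vector[d]
    count = max(sum(attention_mask), 1)  # avoid dividing by zero on an all-padding input
    return [value / count for value in summed]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy input (assumed): two real tokens followed by one padding token.
token_embeddings = [
    [1.0, 2.0, 0.0],
    [3.0, 0.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, must not affect the result
]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_embeddings, attention_mask)
print(sentence_vector)  # [2.0, 1.0, 1.0]
print(cosine(sentence_vector, [2.0, 1.0, 1.0]))
```

Without the mask, the padding vector would be averaged in and shift every sentence vector toward whatever values the model emits at padded positions, which distorts the similarity scores computed afterwards.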
, , .
, :
, b g , . , SBERT — , , 0,66 ( ). , () SBERT. , , :
, , . , Bert!