
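This article walks through six approaches to similarity search: three traditional ones (Jaccard similarity, w-shingling, and Levenshtein distance) and three vector-based ones (TF-IDF, BM25, and Sentence-BERT). The code for each method is available as a Jupyter notebook.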
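Similarity search is the process of matching relevant pieces of information to each other. There is a good chance you found this article through a search engine, most likely Google. Perhaps you typed something like "what is semantic similarity search?" or "traditional vs vector similarity search". Google processed that query and, using many of the techniques described here, brought you to this page.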
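Alphabet, Google's parent company, has a market capitalization of about $1.65 trillion [1], which gives a sense of how much value good search can create.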
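We start with the traditional methods:
Jaccard similarity.
w-shingling.
Levenshtein distance.
All three are useful metrics for similarity search, and each is covered in turn below.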
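Jaccard similarity is a simple but sometimes powerful similarity metric. Given two sequences A and B, we count the elements they share and divide by the number of unique elements across both:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$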
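Take two sequences of integers as an example.
Suppose they share two unique integers, 3 and 4, and contain ten integers in total, of which eight are unique: 2/8 gives a Jaccard similarity of 0.25. The same operation works for text: we simply replace the integers with tokens.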
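The calculation above can be sketched in a few lines of Python. The function name `jaccard` and the two integer sequences are illustrative assumptions, chosen so that they share 3 and 4 and contain eight unique values between them; returning 0.0 for two empty inputs is also a convention picked for this sketch.

```python
def jaccard(x, y):
    """Jaccard similarity of two sequences, treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(x & y) / len(union)

# Hypothetical sequences: ten integers in total, eight unique, sharing 3 and 4.
a = [1, 2, 3, 4, 5]
b = [3, 4, 6, 7, 8]
print(jaccard(a, b))  # 0.25

# The same function works on tokens instead of integers.
print(jaccard("the cat sat".split(), "the cat ran".split()))  # 0.5
```

Converting to sets first means repeated tokens count once, which matches the "unique elements" wording of the definition.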

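As expected, sentences b and c score much higher than the other pairs. This is not a perfect solution, though: two sentences that share nothing but words like 'the', 'a', and 'is' can still return a high Jaccard score despite having different meanings. Preprocessing can reduce the problem, for example stop-word removal or stemming/lemmatization. As we will see, however, some methods avoid the problem altogether.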
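A closely related technique is w-shingling. It uses exactly the same intersection / union logic, but applies it to "shingles" of w consecutive words. The 2-shingles of sentence a look like this:
a = {'his thought', 'thought process', 'process is', ...}
We then compute intersection / union between the shingled sentences in the same way as before.
With 2-shingles we find three matching shingles between sentences b and c, giving a similarity of 0.125.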
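A minimal sketch of w-shingling, assuming whitespace tokenization. The helper names `shingles` and `shingle_similarity` and the two example sentences are hypothetical and are not the article's sentences a, b, and c.

```python
def shingles(text, w=2):
    """Return the set of w-word shingles in a whitespace-tokenized text."""
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_similarity(x, y, w=2):
    """Intersection / union of the two shingle sets."""
    sx, sy = shingles(x, w), shingles(y, w)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0

# Contains 'his thought', 'thought process', 'process is' (set order may vary).
print(shingles("his thought process is"))

# Two shared shingles out of six unique ones.
print(shingle_similarity("his thought process is slow",
                         "her thought process is fast"))
```

Larger values of w make matches stricter, because longer runs of words have to line up exactly.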
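Another popular metric for comparing two strings is the Levenshtein distance: the number of single-character operations needed to turn one string into the other. It is defined as:

$$\operatorname{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1, j) + 1 \\ \operatorname{lev}_{a,b}(i, j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$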
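This is a fairly intimidating formula; if it already makes sense to you, great! If not, don't worry, we will break it down. The variables a and b are our two strings, and i and j are character positions in a and b respectively.

A good way to grasp the logic behind the formula is to visualize the Wagner-Fischer algorithm, which computes the Levenshtein distance using a simple matrix. We place the words a and b along the two axes of the matrix, each preceded by an empty character.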

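We then fill in the matrix cell by cell.

We iterate over every position and apply the formula. The first case is if min(i, j) = 0: all it asks is whether i or j equals 0. If so, the cell takes the value max(i, j), which is simply our position along i or j.

For min(i, j) == 0, we output max(i, j).

That takes care of the outer edges of the matrix, so next we move inside. When min(i, j) = 0 does not hold, that is, when neither index is 0, we move to the min { part of the formula. It needs three values and keeps the smallest of them. All three are already known from neighbouring cells.

lev(i-1, j) and the other two terms are simple lookups of values that are already in the matrix. The two edit terms each add 1, and we take the minimum of the three results. One detail remains: the +1 on the diagonal term lev(i-1, j-1) applies only if a[i] != b[j], as the penalty for a mismatched character.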
![If a[i] != b[j], we add 1](https://habrastorage.org/getpro/habr/upload_files/12c/fbc/914/12cfbc914c84442521dd0be415c77d46.png)
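Repeating this step for every cell fills the whole matrix.
Each cell holds the Levenshtein distance between a up to position i and b up to position j. We want the distance between the complete strings a and b, so we take the bottom-right value, lev[-1, -1].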
![The Levenshtein distance is the value at lev[-1, -1]](https://habrastorage.org/getpro/habr/upload_files/c9d/56c/572/c9d56c5721de0206c637ddecac7be442.png)
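The matrix procedure described above can be sketched as follows. This is one possible pure-Python version, not necessarily the notebook's code: row 0 and column 0 stand for the empty prefix, so the character compared at cell (i, j) is `a[i - 1]` against `b[j - 1]`. The function name and the test strings are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Levenshtein distance via the Wagner-Fischer matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    # lev[i][j] = distance between the first i chars of a and the first j chars of b
    lev = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if min(i, j) == 0:
                # Outer edge: one of the prefixes is empty.
                lev[i][j] = max(i, j)
            else:
                # Diagonal step costs 1 only when the characters differ.
                substitution = lev[i - 1][j - 1] + (a[i - 1] != b[j - 1])
                lev[i][j] = min(lev[i - 1][j] + 1,   # deletion
                                lev[i][j - 1] + 1,   # insertion
                                substitution)
    return lev[-1][-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Note that the +1 is added to the deletion and insertion terms before the minimum is taken; adding a mismatch penalty after taking the minimum of the raw neighbours gives wrong results for some inputs.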
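For vector-based search, we typically find one of several vector-building methods:
TF-IDF.
BM25.
word2vec/doc2vec.
BERT.
USE.
Paired with some implementation of approximate nearest neighbors (ANN), these vector-based methods are the workhorses of similarity search. We will cover TF-IDF, BM25, and BERT-based approaches, which are among the most common and cover both sparse and dense vector representations.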
1. TF-IDF
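The respected grandfather of vector similarity search, TF-IDF dates back to the 1970s. It consists of two parts: Term Frequency (TF) and Inverse Document Frequency (IDF). The TF component counts how many times a term appears in a document and divides that by the total number of terms in the same document:

$$\mathrm{TF}(q, D) = \frac{f(q, D)}{f(t, D)}$$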
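Here, f(q, D) is the frequency of the query q in document D, and f(t, D) is the frequency of all terms t in D. TF is the first half of the calculation: the more often the query appears in the current document, the higher the score. On its own, though, a word like "the" gets a higher TF than a word like "bananas", which is the opposite of what we want. Rare words should count for more: 'the', 'is', and 'it' are less informative than 'bananas' or 'street'.
Ideally, matches on rarer words should score higher. To achieve that, we multiply TF by the second term, IDF, which measures how common a word is across all of our documents:

$$\mathrm{IDF}(q) = \log\frac{N}{N(q)}$$

where N is the number of documents and N(q) is the number of documents that contain q.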
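In this example we have three sentences. The IDF of the common word 'is' is much lower than the IDF of the rarer word 'forest'. If we search for both 'is' and 'forest', we combine TF and IDF as follows.

We calculate TF('is', D) and TF('forest', D) for each of the documents a, b, and c. IDF is computed across all documents, so IDF('is') and IDF('forest') are calculated only once. TF-IDF is then the product of TF and IDF. Sentence a scores highest for 'forest', while 'is' always scores 0, because IDF('is') is 0.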
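The TF, IDF, and TF-IDF calculations can be sketched as below. The three documents are hypothetical stand-ins for sentences a, b, and c, chosen so that 'is' appears in every document and 'forest' in only one. The base-10 logarithm and the zero returned for unseen words are assumptions of this sketch.

```python
import math

# Hypothetical documents, already lowercased and stripped of punctuation.
docs = {
    "a": "the forest is quiet and the forest is dark".split(),
    "b": "the street is busy".split(),
    "c": "it is raining".split(),
}

def tf(word, doc):
    """Occurrences of word in doc divided by the number of terms in doc."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log10(N / N(q)); returns 0.0 for a word that occurs in no document."""
    n_q = sum(1 for doc in docs.values() if word in doc)
    if n_q == 0:
        return 0.0  # assumption: treat unseen words as carrying no weight
    return math.log10(len(docs) / n_q)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

for name, doc in docs.items():
    print(name, tf_idf("is", doc, docs), tf_idf("forest", doc, docs))
```

Because 'is' occurs in all three documents, its IDF is log10(3/3) = 0, so its TF-IDF is 0 everywhere regardless of how often it appears.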
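That is all well and good, but where do vectors come in? We take our vocabulary (a list of every word in the dataset) and calculate TF-IDF for each word in it.

Putting the values together gives us one TF-IDF vector per document.
That is our TF-IDF vector. Note that a vocabulary can easily exceed 20,000 words, so vectors built this way are extremely sparse, and they do not encode any semantic meaning.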
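To make the vector-building step concrete, here is a small sketch that computes one TF-IDF value per vocabulary word. The documents, the name `tf_idf_vector`, and the base-10 logarithm are assumptions for illustration only.

```python
import math

# Hypothetical tokenized documents.
docs = [
    "the forest is quiet".split(),
    "the street is busy".split(),
    "bananas are purple".split(),
]
# Vocabulary: every unique word in the dataset, in a fixed order.
vocab = sorted({word for doc in docs for word in doc})

def tf_idf_vector(doc, docs, vocab):
    """One TF-IDF value per vocabulary word, in vocabulary order."""
    n = len(docs)
    vector = []
    for word in vocab:
        tf = doc.count(word) / len(doc)
        n_q = sum(1 for d in docs if word in d)  # >= 1 for every vocab word
        vector.append(tf * math.log10(n / n_q))
    return vector

vectors = [tf_idf_vector(doc, docs, vocab) for doc in docs]
print(vocab)
print(vectors[0])
```

Even in this tiny example most entries of each vector are zero; with a realistic vocabulary the proportion of zeros is far higher.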
2. BM25
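The successor to TF-IDF, Okapi BM25 is the result of optimizing TF-IDF, primarily to normalize results by document length. TF-IDF works well, but it can give questionable results when we compare documents that mention a term several times. Suppose two articles of 500 words each mention a given term 6 times and 12 times respectively. Should the first be considered half as relevant? Probably not. BM25 addresses this by modifying TF-IDF:

$$\mathrm{BM25}(D, q) = \frac{f(q, D) \cdot (k + 1)}{f(q, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)} \cdot \log\left(\frac{N - N(q) + 0.5}{N(q) + 0.5} + 1\right)$$

where |D| is the length of document D and d_avg is the average document length.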
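The equation looks unpleasant, but it is nothing more than our TF-IDF formula with a few new parameters! Compare the two TF components:

$$\frac{f(q, D)}{f(t, D)} \quad \text{(TF-IDF)} \qquad \frac{f(q, D) \cdot (k + 1)}{f(q, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)} \quad \text{(BM25)}$$

Then there is the IDF part, which introduces no new parameters at all; it simply rearranges the IDF from TF-IDF:

$$\log\frac{N}{N(q)} \quad \text{(TF-IDF)} \qquad \log\left(\frac{N - N(q) + 0.5}{N(q) + 0.5} + 1\right) \quad \text{(BM25)}$$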
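So what does this modification change? Take a sequence of 12 tokens and gradually replace more and more of them with "matching" tokens, scoring it with both methods at each step.

The TF-IDF score grows linearly with the number of matching tokens: if the frequency doubles, so does the TF-IDF score. BM25, by contrast, flattens out as the count grows, because f(q, D) appears in both the numerator and the denominator of its TF term. Sounds good! So how do we implement it in Python? As with TF-IDF, we keep the implementation simple.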
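We use the default values for the parameters k and b, and the results look promising. The query 'purple' matches only sentence a, while 'bananas' scores reasonably for both b and c, slightly higher for c because c contains fewer words. To build vectors from these scores, we do exactly what we did for TF-IDF.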
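A minimal BM25 sketch consistent with the description above. The three documents are hypothetical, chosen so that 'purple' occurs only in the first and 'bananas' in the second and third, with the third being shorter. The defaults k=1.2 and b=0.75 are commonly used values and are an assumption here, as is the natural logarithm.

```python
import math

# Hypothetical tokenized documents a, b, c.
docs = [
    "purple is the best city in the forest".split(),
    "there is an art to getting your way and throwing bananas on the street is not it".split(),
    "it is not often you find soggy bananas on the street".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def bm25(word, doc, k=1.2, b=0.75):
    """BM25 score of a single query word against one document."""
    freq = doc.count(word)
    # TF part: saturates as freq grows, normalized by document length.
    tf = (freq * (k + 1)) / (freq + k * (1 - b + b * len(doc) / avgdl))
    # IDF part: number of documents containing the word.
    n_q = sum(1 for d in docs if word in d)
    idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
    return tf * idf

for name, doc in zip("abc", docs):
    print(name, round(bm25("purple", doc), 4), round(bm25("bananas", doc), 4))
```

Raising b strengthens the length normalization, and raising k lets repeated occurrences keep adding to the score for longer before it saturates.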
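As with the TF-IDF vectors, these are sparse vectors. They cannot encode semantic meaning and focus on syntax instead.
Next, let's look at how we can begin to take semantics into account.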
3. BERT
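BERT (Bidirectional Encoder Representations from Transformers) is a hugely popular transformer model used for almost everything in NLP. Through 12 (or so) encoder layers, BERT encodes a large amount of information into a set of dense vectors.
Each dense vector typically contains 768 values, and BERT usually produces 512 of these vectors for each encoded sequence. The vectors can be seen as numerical representations of language. They can be extracted from different layers if needed, but typically come from the final layer. Given two correctly encoded dense vectors, we can use a similarity metric such as cosine similarity to measure how semantically similar they are. Vectors that are more closely aligned are more similar in meaning, and vice versa.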

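There is one problem: each sequence is represented by 512 vectors, not by a single one. This is where another, rather brilliant, adaptation of BERT comes in. Sentence-BERT lets us create a single vector that represents the full sequence, known as a sentence vector [2]. There are two ways to use SBERT: the easy way with the sentence-transformers library, or the slightly harder way with transformers and PyTorch.
We will cover both, starting with transformers and PyTorch, to build an intuition for how the vectors are produced. If you have used the HF transformers library, the first steps will look familiar: we initialize the SBERT model and tokenizer, tokenize the text, and pass the tokens through the model.
We have added a new sentence, g, which carries the same meaning as b without using the same keywords. With no shared words, all of the previous methods would struggle to find any similarity between these two sentences. Keep that in mind for later.
The output vectors have length 768, but they are not yet sentence vectors, because there is one vector for every token in the sequence (128 here because we are using SBERT; for BERT it is 512). To get a sentence vector, we apply mean pooling.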
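First, we multiply each value in the embeddings tensor by the corresponding attention_mask value. The attention_mask contains ones where there are "real tokens" (that is, not padding tokens) and zeros elsewhere, so this operation lets us ignore everything that is not a real token.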
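The masking and averaging step can be illustrated without any deep learning library. The sketch below uses plain Python lists and a toy input; the name `mean_pool` and the numbers are assumptions, and a real implementation would apply the same arithmetic to the model's output tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, mask in zip(token_embeddings, attention_mask):
        for d in range(dim):
            summed[d] += vector[d] * mask  # padding rows contribute nothing
    count = max(sum(attention_mask), 1e-9)  # guard against an all-padding input
    return [value / count for value in summed]

# Toy input: three token vectors of size 2; the last position is padding.
embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(embeddings, mask))  # [2.0, 3.0]
```

Dividing by the number of real tokens rather than the sequence length keeps short, heavily padded sentences from being pulled toward zero.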
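Those are our sentence vectors, and with them we can measure similarity by computing the cosine similarity between each pair.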
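For reference, cosine similarity between two vectors can be sketched in pure Python as below. The function name and the toy vectors are assumptions; real sentence vectors would have 768 dimensions, and the function raises ZeroDivisionError for an all-zero vector.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy "sentence vectors" and their pairwise similarity matrix.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
scores = [[cosine_similarity(u, v) for v in vectors] for u in vectors]
for row in scores:
    print([round(s, 2) for s in row])
```

The score depends only on the angle between the vectors, so two sentence vectors of very different magnitude can still be rated as highly similar.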
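If we visualize the resulting array, the more similar sentence pairs are easy to spot.

Now recall that sentences b and g mean essentially the same thing while sharing no keywords. We would hope that SBERT, with its richer semantic representation of language, identifies them as similar, and it does: their similarity is the second-highest score, at 0.66. The alternative (easy) approach is to use sentence-transformers, which produces the same result with far less code.
Which is, of course, much easier. That's all for this walk through Jaccard, Levenshtein, and BERT!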
[1] Market Capitalization of Alphabet (GOOG), Companies Market Cap.
[2] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
Colab notebooks:
Jaccard,
Levenshtein,
TF-IDF,
BM25,
SBERT.