Small and fast BERT for Russian

BERT is a neural network that is quite good at understanding the meaning of texts in natural language. First released in 2018, this model revolutionized computational linguistics. The basic version takes a long time to pre-train, reading millions of texts and gradually mastering the language; after that, it can be fine-tuned on your own applied task, such as classifying comments or extracting names, titles, and addresses from text. The standard version of BERT is rather large: it weighs more than 600 megabytes and takes about 120 milliseconds to process a sentence (on a CPU). In this post, I present a scaled-down version of BERT for Russian: 45 megabytes and 6 ms per sentence. There is already a TinyBERT for English from Huawei, and there is my shrunken FastText model, but a small Russian (and English) BERT seems to have appeared for the first time. But how good is it?





Distillation - the way to smallness

To create a small BERT, I decided to train my neural network using ready-made models as teachers. I'll explain in more detail now.





In short, BERT works like this: first, the tokenizer splits the text into tokens (pieces ranging in size from a single letter to a whole word), embeddings for these tokens are looked up in a table, and then these embeddings are updated several times by the self-attention mechanism so that they take the context (neighboring tokens) into account. During pre-training, the classic BERT solves two tasks: it guesses which tokens in a sentence were replaced with the special [MASK] token, and whether two sentences followed each other in the original text. As was shown later, the second task is not really needed. But the [CLS] token, which is placed before the beginning of the text and whose embedding was used for this second task, continues to be used, and I also placed my bet on it.
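For illustration, here is a minimal sketch of what the model actually receives as input (I use the multilingual BERT tokenizer here purely as an example): the text is cut into word pieces, and the special [CLS] and [SEP] tokens are added around them.

# pip install transformers
from transformers import AutoTokenizer

# bert-base-multilingual-cased is used here only as a familiar example
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

encoded = tokenizer("Маленький и быстрый BERT")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# a list of word pieces wrapped in the special tokens: ['[CLS]', ..., '[SEP]'];
# the exact split depends on the vocabulary, but [CLS] always comes first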





Distillation is a way of transferring knowledge from one model to another, and it is faster than training a model from raw text alone. For a text with a [MASK] token, there is one "right" answer: the token that actually stood in place of the mask. But a large model also knows which other tokens would be appropriate in this context, and this knowledge is useful for teaching the small model. It can be conveyed by making the small model not only predict a high probability for the correct token, but reproduce the teacher's entire probability distribution over possible masked tokens for the given text.
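A minimal sketch of what such a distillation loss can look like, assuming we already have the masked-position logits of the teacher and the student over a shared vocabulary (the temperature and the function itself are my illustration, not the exact training code):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and the student's distributions
    over the vocabulary for each masked position.
    Both tensors have shape (num_masked_positions, vocab_size)."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL divergence;
    # the temperature**2 factor keeps gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2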





I took the vocabulary from bert-multilingual (to keep the balance between Russian and English), but cut it down from 120 thousand tokens to about 30 thousand. The embedding dimension was reduced from 768 to 312, and the number of layers from 12 to 3. The embeddings were initialized from bert-multilingual; the rest of the weights were initialized randomly.
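In terms of the transformers library, creating such a shrunken architecture looks roughly like this (the attention-head count and the feed-forward size below are my illustrative assumptions, not necessarily the exact values of the published model):

from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=30_000,       # the vocabulary cut down from ~120 thousand tokens
    hidden_size=312,         # instead of 768
    num_hidden_layers=3,     # instead of 12
    num_attention_heads=12,  # an assumption: the hidden size must be divisible by it
    intermediate_size=600,   # an assumption for the feed-forward width
)
student = BertForMaskedLM(config)
print(f"{sum(p.numel() for p in student.parameters()) / 1e6:.1f}M parameters")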





Besides the language itself, I wanted my BERT to learn good sentence embeddings, so I took several models that handle Russian well as teachers: RuBERT (Russian-only, with a version fine-tuned for sentence similarity), LaBSE (multilingual, trained on translation ranking), Laser (multilingual, with its own sentence encoder) and USE (multilingual as well). Most of them represent a sentence by the embedding of its [CLS] token, and I bet on the same approach; LaBSE additionally relies on translation ranking, while Laser produces sentence embeddings without any CLS token. I also involved T5. In total, my model was trained on several tasks in parallel:





  • Guess the masked tokens (more precisely, full word masks).





  • Translation ranking in the style of LaBSE: given a Russian sentence, pick out its English translation among the other sentences in the batch. I sampled hard negatives to make the task harder (a rough sketch of this and of the CLS-alignment losses is given right after the list).





  • Reproduce the probability distribution of masked tokens predicted by bert-base-multilingual-cased (so that the model learns to understand context, i.e. the language itself).





  • Bring its CLS embeddings (after a linear projection) close to the CLS embeddings of DeepPavlov/rubert-base-cased-sentence (a model trained to compare the meaning of sentences).





  • Bring its CLS embeddings (after another projection) close to the CLS embeddings of LaBSE.





  • Bring its CLS embeddings (after another projection) close to the sentence embeddings of LASER.





  • Bring its CLS embeddings (after another projection) close to the sentence embeddings of USE.





  • Decode text back (with a small T5 decoder) from the CLS embedding.
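The embedding-related items above can be sketched roughly as follows (my own simplification of the idea, not the actual training code): the student's CLS embedding goes through a separate linear projection per teacher and is pulled towards that teacher's sentence embedding, while translation ranking is an in-batch classification over cosine similarities of Russian and English sentence embeddings.

import torch
import torch.nn.functional as F

def cls_alignment_loss(student_cls, teacher_emb, projection):
    """Pull the projected student CLS embedding towards a teacher's sentence embedding.
    student_cls: (batch, 312), teacher_emb: (batch, d_teacher),
    projection: torch.nn.Linear(312, d_teacher)."""
    return F.mse_loss(projection(student_cls), teacher_emb)

def translation_ranking_loss(ru_emb, en_emb, scale=10.0):
    """In-batch ranking: the i-th Russian sentence should be closest
    to its own English translation among all translations in the batch."""
    ru = F.normalize(ru_emb, dim=-1)
    en = F.normalize(en_emb, dim=-1)
    logits = ru @ en.T * scale                          # (batch, batch) cosine similarities
    targets = torch.arange(ru.size(0), device=ru.device)  # the diagonal is the correct pairing
    return F.cross_entropy(logits, targets)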





With so many tasks, a proper ablation study would be in order, but I did not have the resources for it. I trained the model in Colab, manually lowering the learning rate whenever the loss stopped improving. As training data I used parallel Russian-English sentence pairs from several sources, including OPUS-100 and Tatoeba, about 2.5 million pairs in total. The resulting model, which I called rubert-tiny (or simply tiny), I published on Huggingface.





How to use it?

In Python you will need the transformers and sentencepiece packages. The snippet below loads the model and computes a normalized 312-dimensional CLS embedding of a text.





# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny")
# model.cuda()  # uncomment it if you have a GPU

def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
      
      



The sentence embeddings can be used directly as features in downstream models. Alternatively, the model can be fine-tuned on your own task in the same way as any other model from Huggingface (there are plenty of tutorials on this).
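For example, fine-tuning the model for binary comment classification could look roughly like this (a toy sketch: the two example texts and all hyperparameters are mine, and in practice you would loop over a real labelled dataset):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny")
model = AutoModelForSequenceClassification.from_pretrained(
    "cointegrated/rubert-tiny", num_labels=2)   # e.g. positive vs negative comments

# a toy batch; in practice, iterate over your own labelled data
texts = ["отличный сервис", "ужасное обслуживание"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
outputs = model(**batch, labels=labels)  # the classification head computes the loss itself
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))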





How small and fast is the model? I compared it with several other BERTs that understand Russian. The table shows the time needed to encode a single sentence and the size of each model.









Model                                    Time on CPU, ms   Time on GPU, ms   Size
cointegrated/rubert-tiny                 6                 3                 45 MB
bert-base-multilingual-cased             125               8                 680 MB
DeepPavlov/rubert-base-cased-sentence    110               8                 680 MB
sentence-transformers/LaBSE              120               8                 1.8 GB
sberbank-ai/sbert_large_nlu_ru           420               16                1.6 GB





I measured the time in Colab (Intel(R) Xeon(R) CPU @ 2.00GHz and a Tesla P100-PCIE GPU) with batch size 1. On the GPU the differences are small, because with single sentences most of the time goes into overhead rather than into the actual computation.
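For reference, here is a sketch of how such single-sentence timings can be obtained (an approximation of the setup described above, not the original benchmarking code):

import time
import torch
from transformers import AutoTokenizer, AutoModel

def mean_encoding_time_ms(model_name, text="пример предложения", n_runs=100):
    """Average wall-clock time of encoding a single sentence, batch size 1."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        model(**batch)                      # warm-up run
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**batch)
    return (time.perf_counter() - start) / n_runs * 1000

print(mean_encoding_time_ms("cointegrated/rubert-tiny"))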





So rubert-tiny runs roughly 20 times faster than the full-size models on a CPU, and it is small enough to fit even into a free Heroku dyno. The model is clearly small and fast; the question is how well it actually understands Russian.





The ability of BERT-like models to solve applied tasks can be evaluated in different ways. The most thorough one is to fine-tune the whole model on each task, but that is slow and expensive. Instead, I used the models as frozen feature extractors: I took their sentence embeddings and trained simple classifiers such as KNN on top of them.
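With the embed_bert_cls function from the snippet above, this kind of feature-extractor evaluation can be sketched as follows (the tiny labelled dataset here is purely hypothetical):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical labelled data: short texts and their class labels
train_texts, train_labels = ["текст 1", "текст 2"], ["a", "b"]
test_texts, test_labels = ["текст 3"], ["a"]

# embed_bert_cls, model and tokenizer are defined in the snippet above
X_train = np.stack([embed_bert_cls(t, model, tokenizer) for t in train_texts])
X_test = np.stack([embed_bert_cls(t, model, tokenizer) for t in test_texts])

# the embeddings are already L2-normalized, so Euclidean KNN behaves like cosine KNN
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)
print(clf.score(X_test, test_labels))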





The first benchmark that comes to mind is RussianSuperGLUE, but its tasks assume full fine-tuning of the model, which does not fit this setup. Another one is RuSentEval, but its probing tasks are mostly linguistic, whereas I wanted to measure practical usefulness. So I put together my own set of applied tasks. Here it is:





STS: a standard task of estimating the semantic similarity of short texts (a translated version of the STS-B dataset). For each pair of sentences, the model has to predict how close they are in meaning, for example 4 on a scale up to 5. The best model here, LaBSE, reaches 77%; mine gets 65%, which is not bad considering that it is 40 times smaller.





Paraphraser: a similar task on a Russian corpus of news-headline paraphrases. Here there are three classes: exact paraphrase, loose paraphrase, and not a paraphrase. My model reaches 43% accuracy on it.





XNLI: natural language inference. Given two sentences, you need to determine whether the second one follows from the first, contradicts it, or neither. The best results here belong to the DeepPavlov model (it was trained on a very similar task).





SentiRuEval2016: sentiment classification of tweets about banks and telecom companies. Here my model lags behind the best ones by only about 5 points.





OKMLCup: detection of toxic comments from the Odnoklassniki social network. Measured by ROC AUC, my model is on the level of bert-base-cased-multilingual.





Inappropriateness: detection of utterances that are not explicitly toxic but can still harm the reputation of the speaker. My model achieves 68% AUC here (the best models reach about 79%).





Intent classification: short utterances spread over 18 classes, classified with KNN over the sentence embeddings. LaBSE scores 75% here, my model 68%, the DeepPavlov model 60%, and the remaining models 58% and 56%.





The same intent classification, but with cross-lingual transfer: training on English examples and testing on Russian ones. Here LaBSE is far ahead of everyone, since it was trained specifically to align sentences across languages.





factRuEval-2016: named entity recognition (persons, organizations, locations). This is a token classification task, which I measured with token-level F1 (not the canonical way, but simple). It turned out that my embeddings are poorly suited for NER: my model scores 43%, while the full-size models get 67-69%.





RuDReC: recognition of medical entities in drug reviews. Again a token classification task, with a similar picture: my model gets 58% versus 62-67% for the larger ones.





The weak results on NER are not surprising: during distillation I focused on the CLS embedding of the whole text rather than on the embeddings of individual tokens, so the token-level representations of my model turned out mediocre (which is logical, given what it was trained on). This could probably be improved with additional training, and I hope to get around to it.
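For completeness, per-token embeddings for such token-level tasks come from the same forward pass; a minimal sketch of extracting them (reusing the model and tokenizer loaded earlier) could look like this:

import torch

def embed_tokens(text, model, tokenizer):
    """Return the tokens of the text and their contextual embeddings
    (including [CLS] and [SEP]), shape (num_tokens, hidden_size)."""
    t = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**{k: v.to(model.device) for k, v in t.items()})
    tokens = tokenizer.convert_ids_to_tokens(t["input_ids"][0].tolist())
    return tokens, out.last_hidden_state[0].cpu()

tokens, vectors = embed_tokens("Пример текста", model, tokenizer)
print(len(tokens), vectors.shape)   # N tokens and an (N, 312) tensor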





By the way, while testing LaBSE I noticed that a large share of its weights is taken up by token embeddings for the hundred or so languages it supports, most of which are useless for Russian and English. So I published LaBSE-en-ru, a version with the embeddings of all tokens except the Russian and English ones removed: it shrank from 1.8 GB to 0.5 GB without losing quality on these two languages. So if rubert-tiny is not accurate enough for you, and the models from DeepPavlov and Sber are too heavy, try LaBSE-en-ru.
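The idea behind such shrinking is simple: keep only the rows of the input embedding matrix that correspond to the tokens you actually need, and rebuild the tokenizer with the same reduced vocabulary. A rough sketch of the embedding part (hypothetical code illustrating the idea, not the script actually used for LaBSE-en-ru):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("sentence-transformers/LaBSE")

# hypothetical: ids of the tokens worth keeping (special tokens plus those
# that actually occur in a large Russian-English corpus), collected beforehand
keep_ids = [0, 100, 101, 102, 103]

old_embeddings = model.get_input_embeddings().weight.data
new_embeddings = torch.nn.Embedding(len(keep_ids), old_embeddings.size(1))
new_embeddings.weight.data = old_embeddings[keep_ids].clone()
model.set_input_embeddings(new_embeddings)
model.config.vocab_size = len(keep_ids)
# the tokenizer has to be rebuilt so that token ids match the new matrix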





To sum up: my model is not the most accurate one, but it is tiny and very fast, which makes it a good fit for prototypes and for services where response time and memory matter more than the last few points of quality. You can download this small BERT for Russian and English here: https://huggingface.co/cointegrated/rubert-tiny.





There is a lot of work ahead: on the one hand, I want to teach a small BERT to solve tasks from RussianSuperGLUE (and not only those); on the other hand, I want to bring good small models for controllable text generation into Russian (I have already started doing this with T5). So like this post, subscribe to my channel about NLP, suggest interesting problems in the comments or in private messages, and if you get a chance to try rubert-tiny, be sure to leave feedback!

I myself wonder what will happen next.







