More efficient pre-training of NLP models with ELECTRA

Recent developments in pre-training language models have led to significant advances in Natural Language Processing (NLP), giving rise to highly effective models such as BERT, RoBERTa, XLNet, ALBERT, T5, and many more. These methods differ in architecture but share the same idea: use large amounts of unlabeled text to build a general model of natural language understanding, which is then fine-tuned on specific downstream tasks such as sentiment analysis or question answering.



The existing pre-training methods fall mainly into two categories:



  • Language Models (LMs), such as GPT, which process the input text from left to right, predicting the next word given the preceding context;
  • Masked Language Models (MLMs), such as BERT, RoBERTa, and ALBERT, which instead predict the identities of a small number of words that have been masked out of the input.


The advantage of MLMs is that they are bidirectional: they "see" the text on both sides of the token being predicted, whereas LMs look in only one direction. However, MLMs (and related models such as XLNet) also have a downside that stems from their pre-training task: instead of predicting every word of the input sequence, they predict only a small masked subset, about 15% of the tokens, which reduces the amount learned from each sentence.
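To make the difference in training signal concrete, here is a minimal Python sketch (my illustration, not code from the post) contrasting how many prediction targets a left-to-right LM and a BERT-style MLM get from the same toy sentence; the token list and the 15% masking rate are the only details carried over from the text.

    import random

    tokens = ["the", "chef", "cooked", "the", "meal"]

    # Left-to-right LM: every position after the first is a prediction target.
    lm_contexts = [tokens[:i] for i in range(1, len(tokens))]
    lm_targets = tokens[1:]                                   # 4 of 5 tokens are predicted

    # BERT-style MLM: only ~15% of positions are masked and predicted.
    random.seed(0)
    n_masked = max(1, round(0.15 * len(tokens)))
    mask_positions = sorted(random.sample(range(len(tokens)), k=n_masked))
    mlm_inputs = ["[MASK]" if i in mask_positions else t for i, t in enumerate(tokens)]
    mlm_targets = [tokens[i] for i in mask_positions]         # only 1 of 5 tokens is predicted

    print(lm_targets)    # ['chef', 'cooked', 'the', 'meal']
    print(mlm_inputs)    # e.g. ['the', 'chef', 'cooked', '[MASK]', 'meal']
    print(mlm_targets)   # e.g. ['the']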



image3



Existing pre-training methods and their disadvantages. Left: traditional language models (e.g., GPT) use only the context to the left of the current word. Right: masked language models (e.g., BERT) look in both directions, but predict only a small masked subset of the words in each input.



In the paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators", the authors take a different approach to language pre-training that keeps the benefits of BERT-style models while learning far more efficiently. ELECTRA, short for Efficiently Learning an Encoder that Classifies Token Replacements Accurately, is a new pre-training method that outperforms existing techniques given the same compute budget. For example, ELECTRA matches the performance of RoBERTa and XLNet on the GLUE benchmark while using less than ¼ of their compute, and achieves state-of-the-art results on the SQuAD question-answering benchmark. ELECTRA also works well at small scale: it can be trained on a single GPU in a few days to a higher accuracy than GPT, a model that uses over 30 times more compute. The ELECTRA code has been released as open source on top of TensorFlow.





ELECTRA uses a new pre-training task called replaced token detection (RTD), which trains a bidirectional model (like an MLM) while learning from all input positions (like an LM). Inspired by generative adversarial networks (GANs), ELECTRA learns to distinguish "real" input tokens from "fake" ones. Instead of corrupting the input by replacing tokens with "[MASK]" (as in BERT), RTD corrupts it by replacing some tokens with incorrect but plausible fakes. For example, in the figure below the word "cooked" could be replaced with "ate": this makes some sense, but it does not fit the context as well as the original word. The pre-training task then requires the model (i.e., the discriminator) to determine, for every token of the corrupted input, whether it is the original token or a replacement. Crucially, this binary classification is applied to every input token rather than to a small masked subset (15% in the case of BERT), which makes RTD more efficient than MLM: ELECTRA "sees" more training signal per example and therefore needs fewer examples to reach the same quality. At the same time, RTD leads to powerful representation learning, because to solve the task the model has to learn an accurate representation of the data distribution.
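As an illustration of how RTD training examples are built (a toy sketch of my own, not the released implementation), the snippet below corrupts the sentence from the figure and derives a per-token label for the discriminator; the generator's proposals are hard-coded stand-ins for samples from a small masked language model.

    # Toy construction of a replaced-token-detection example.
    original = ["the", "chef", "cooked", "the", "meal"]
    masked_positions = [0, 2]                       # positions handed to the generator

    # Hypothetical generator proposals for the masked positions.
    generator_samples = {0: "the", 2: "ate"}        # position 0 happens to be guessed correctly

    corrupted = [generator_samples.get(i, tok) for i, tok in enumerate(original)]

    # Discriminator target: 1 = replaced, 0 = original. A generator guess that matches
    # the original token counts as "original".
    rtd_labels = [int(tok != orig) for tok, orig in zip(corrupted, original)]

    print(corrupted)    # ['the', 'chef', 'ate', 'the', 'meal']
    print(rtd_labels)   # [0, 0, 1, 0, 0]  (a label for every position, not just the masked ones)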



image4



Replaced token detection trains a bidirectional model while learning from all input positions.



The replacement tokens come from another neural network called the generator. The generator can be any model that produces a distribution over tokens, but ELECTRA uses a small masked language model (i.e., a BERT-like model with a small hidden size) that is trained jointly with the discriminator. Although the structure of a generator feeding a discriminator resembles a GAN, the generator is trained with maximum likelihood to predict masked words rather than adversarially, because applying GANs to text is difficult. Both networks are transformer text encoders. After pre-training, the generator is discarded and only the discriminator (the ELECTRA model) is fine-tuned on downstream NLP tasks.
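A schematic sketch of this joint setup is shown below, assuming PyTorch, toy dimensions, and plain embeddings standing in for the two transformer encoders; the shared token embeddings follow the paper's setup, while the loss weighting and all sizes here are illustrative rather than the released code.

    import torch
    import torch.nn as nn

    vocab_size, hidden = 100, 16
    shared_embeddings = nn.Embedding(vocab_size, hidden)      # token embeddings shared by both networks

    generator_head = nn.Linear(hidden, vocab_size)            # predicts tokens at masked positions
    discriminator_head = nn.Linear(hidden, 1)                 # scores each token as replaced vs. original

    tokens = torch.randint(0, vocab_size, (1, 8))             # one toy sequence of 8 token ids
    mask = torch.zeros(1, 8, dtype=torch.bool)
    mask[0, [1, 5]] = True                                    # positions to corrupt

    # Generator: masked-language-model loss (maximum likelihood) on masked positions only.
    gen_hidden = shared_embeddings(tokens)                    # stand-in for a small transformer encoder
    gen_logits = generator_head(gen_hidden)
    mlm_loss = nn.functional.cross_entropy(gen_logits[mask], tokens[mask])

    # Build the corrupted input by sampling the generator's predictions at masked positions.
    # Sampling is not differentiated through, so the generator learns only from its MLM loss.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
    corrupted = tokens.clone()
    corrupted[mask] = sampled

    # Discriminator: binary "was this token replaced?" loss over every position.
    disc_hidden = shared_embeddings(corrupted)                # stand-in for the full transformer encoder
    disc_logits = discriminator_head(disc_hidden).squeeze(-1)
    is_replaced = (corrupted != tokens).float()
    rtd_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # Joint objective: generator MLM loss plus a weighted discriminator loss.
    total_loss = mlm_loss + 50.0 * rtd_loss                   # weighting constant is illustrative
    total_loss.backward()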



image1



An illustration of replaced token detection. The fake tokens are sampled from a small masked language model (MLM) that is trained jointly with ELECTRA.





ELECTRA was compared with state-of-the-art NLP models, and the comparison showed that, given the same compute budget, it substantially improves over previous approaches, performing on par with RoBERTa and XLNet while using less than 25% of their compute.



image2



The x-axis shows the amount of compute used to train a model (measured in FLOPs), and the y-axis shows the GLUE score. ELECTRA learns much more efficiently than existing pre-trained NLP models. Note that the current best models on GLUE, such as T5, do not fit on this plot because they use far more compute than the others (about 10 times more than RoBERTa).



To push efficiency further, the authors experimented with ELECTRA-Small, a small model that can be trained to good quality on a single GPU in 4 days. Although it does not reach the accuracy of the larger models, which require many TPUs to train, ELECTRA-Small still performs well, even outperforming GPT while using 1/30 of its compute.



Finally, to check whether these strong results hold at scale, the authors trained ELECTRA-Large using much more compute (roughly the same amount as RoBERTa and about 10% of the compute of T5). This model sets a new state of the art for a single model on the SQuAD 2.0 question-answering benchmark (see the table below) and outperforms RoBERTa, XLNet, and ALBERT on the GLUE leaderboard. The huge T5-11b model still scores higher on GLUE, but ELECTRA is 1/30 of its size and uses 10% of the compute needed to train T5.



image5



SQuAD 2.0 scores for ELECTRA-Large compared with other state-of-the-art models (non-ensemble models only).



Releasing ELECTRA



The code for both pre-training ELECTRA and fine-tuning it on downstream NLP tasks, such as text classification, question answering, and sequence tagging, has been released as open source. The code supports quickly training a small ELECTRA model on a single GPU. The weights of the pre-trained ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small models have also been published. While ELECTRA models are currently available only for English, the developers plan to release models pre-trained on other languages in the future.
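For readers who just want to try the published weights, one convenient path (separate from the original TensorFlow release, so treat the details as an assumption) is the Hugging Face transformers wrappers, which mirror the checkpoints; the sketch below loads the small discriminator and asks it which tokens of a corrupted sentence look replaced.

    import torch
    from transformers import ElectraForPreTraining, ElectraTokenizerFast

    # "google/electra-small-discriminator" is the small discriminator checkpoint on the Hugging Face hub.
    name = "google/electra-small-discriminator"
    tokenizer = ElectraTokenizerFast.from_pretrained(name)
    model = ElectraForPreTraining.from_pretrained(name)

    # "ate" has been swapped in for "cooked"; the discriminator scores every token.
    inputs = tokenizer("the chef ate the meal", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                       # one score per token; > 0 means "replaced"

    for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), logits[0]):
        print(f"{token:>8}  replaced={score.item() > 0}")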



Authors





