30 billion parameters: is it realistic to teach Russian GPT-3 at home?

Not so long ago, Sberbank, and then Yandex, announced the creation of super-large Russian language models similar to GPT-3. They not only generate believable text (articles, songs, blogs, etc.), but also solve many different problems, and these problems can often be set in Russian without programming and additional training - something very close to "universal" artificial intelligence ... But, as the authors of Sberbank write in their blog, “ such experiments are available only to companies with significant computing resources.". Training models with billions of parameters costs several tens, or even hundreds of millions of rubles. It turns out that individual developers and small companies are now excluded from the process and can now only use models trained by someone. In this article, I'll try to challenge this thesis by talking about the results of trying to train a model with 30 billion parameters on two RTX 2080Ti cards.





How this article is structured, which parts you can skip

In the first part, I will briefly describe what the language model is and what it gives. If you know this, you can skip the first part without prejudice.





The second part explains the problem with training large models and discusses the possibilities for solving it.





Part 3 talks about the results of trying to train a model with 30 billion parameters on two RTX 2080Ti cards





In conclusion, we discuss why you need to bother learning language models on your own. This part is somewhat subjective and expresses my personal opinion, if you are not interested in the subjective opinion, then you can skip it.





In general, I tried to make the article understandable for a wide range of developers and self-sufficient, so I included a small background to the question and an explanation of important terms for understanding, as a result of which the article became a little bloated. Experts can replace that some things are explained not quite accurately and not completely, but here I tried to prefer an intuitive presentation to a mathematically rigorous one.





With this preface, let's get down to the main presentation.





What are language models and why it matters

, ( , ). , , . .









. - , . :









, . , :





-





, .





, — , «» . , , :





– , , . , , , , .





, . , , , « . ? :» . « :» , . , .





, Google Brain [1] OpenAI [2,3]





, , , , , , .





? , . — . — .





, :





1. *













GPU/





GPU





,





RuGPT3-small





117





5376





32





940





RuGPT3-medium





345





32256





64





5.6





RuGPT3-large





762





43008





128





7.5





RuGPT3-XL





1.3





61440





256





10.7





RuGPT3-12B





12





609484.8*





?





106





YaLM-10B





13





626688.0





?





109





* . ,





** , OpenAI





GPT-

GPT-2 c 340 1.5 . , GPT-2 «» ( ). GPT-3 , , ruGPT3-small ruGPT3-large GPT-2, . GPT-3 GPT-2 , OpenAI 175 . , . ruGPT3 GPT-3, GPT-2 .





, , . , «» - ( ).





«» GPT-3 175 .





, , . , - , . , .





, . 100 — , . , , , "" .





:

, «» - , - , . , , , , .





. , , Portia , . Portia — , . . , , . Portia , , , [4]





, , — , , .





, , Mixture of Experts (MoE). [5]. , , «» . . , ( 1).





Figure 1. Simplified illustration of the Mixture of Experts engine
1. Mixture of Experts

2020 Google Brain , , . Google Switch Transformer [6].





Switch Transformer GPT-2/3 , . ( Masked Language Objective). , .





, Switch Transformer 2.5-6 , ( — . dense) , .





-, - . Switch Transformer c 390 SuperGlue , 13 (SuperGlue 10 ).





( 2, ), , , Switch . , Switch , «». , ( « ?»), Switch .





Figure 2. Comparison of the results of Switch models with monolithic (Dense) models
2. Switch (Dense)

, - ( — ).





, Switch ( ), GPU, , (. . ). .





, , . ?





, (. . , ). [7, 8], . — ( ), .





20GB Common Crawl, - (7GB) (5GB). RTX 2080Ti ( 70000 ).





, GM30B 30 ( 160 GB ). (32 GB ), , .





«» , . , 100, 100





2.









( - )





RuGPT3-small





30.91





RuGPT3-medium





22.96





RuGPT3-large





21.34





GM30B





18.02





, , 1. , , , . , , .





, — - . , . , - ( , , ).





, , . , . ( YaLM, «», ).





: . ,





:





ruGPT3-medium: , .





ruGPT3-large: , , , . , , , "" !





ruGPT3-XL: , . ,





YALM ( «»): , . , , , .





GT30B: , , . , "", , (2), .





« » ruGPT3-large — - .





: , . , . , *





:





:





ruGPT3-medium: , . , , .





ruGPT3-large:  , — , , , , . , , .





ruGPT3-XL: : " ". . .





YAML (c «»):— ! — , . — ! —





GM30B: , . , , , , : - - , , .





* . « ». GM30B , .





, , . . 400Kb (400Kb , « » ).





Figure 3. Comparison of the efficiency of retraining GM30B and ruGPT3-medium
3. GM30B ruGPT3-medium

3 ruGPT3-medium. , .





. , (, ).





, . , , .





.





PARus «Russian SuperGlue». :





« PARus , , , . , 50%»





, « », : « -» « ». , .





3. PARus













RuGPT3-small





0.562





RuGPT3-medium





0.598





RuGPT3-large





0.584





RuGPT3-XL





0.676





GM30B





0.704





YaLM 1B





0.766





Leaderboard «Russian SuperGlue». . GM30B zero-shot learning, , , . .





, , GM30B , RuGPT3-XL.





Russian SuperGlue c PARus

? , , , ruGPT3/YaLM , . .





?

, . ( GitHub Copilot), , , , , - . , , . , , , -.





. GPT3 OpenAI , , «» . «» .





, OpenAI

GPT-3 , , , .. , . , , . " - ". , - . , , .





, «» , , . , Microsoft OpenAI 1 , GPT3 ( OpenAI). Microsoft , , GitHub Copilot, GPT3. , , , source.dev, API , , API , .





( ). . OpenAI GPT-2, , . , talktotransformer InferKit. InferKit , . , API GPT-3 OpenAI InferKit , .





, OpenAI ? talktotransformer, InferKit . .. .





, , , , — GPU , ruGPT3-XL .





, , - . — , ( ) — .





, IT, . - «» - . , , .





— . , , GPT-3 . , , , . , - ( ruGPT3 YaLM - Yet Anoter Language Model - , ).





, . , .





1. Trinh, Trieu H., and Quoc V. Le. "Do language models have common sense?." (2018). https://openreview.net/forum?id=rkgfWh0qKX





2. Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.





3. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.





4. Prete, Frederick R., ed. Complex worlds from simpler nervous systems. MIT press, 2004.





5. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton,and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017





6. Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv preprint arXiv:2101.03961 (2021).





7. Liang, Yuchen, et al. "Can a Fruit Fly Learn Word Embeddings?." arXiv preprint arXiv:2101.06887 (2021).





8. Ryali, Chaitanya, et al. "Bio-inspired hashing for unsupervised similarity search." International Conference on Machine Learning . PMLR, 2020 and








All Articles