GPT-2 the windbag: Russian edition



Having plunged into the topic of DL NLP, I came across an interesting repository on the Internet. It is nothing less than a Russian GPT-2! A Russian-language one, that is. And not some small 117M model, but a full 1250M, which is already quite serious. The author did a lot of work adapting the original model and preparing the training corpus, and the training itself, I suppose, took plenty of time and compute. In short, respect to comrade l4rz! I plan to follow in his footsteps and train the system on my own corpus (which I am currently preparing); fortunately, the author left fairly detailed instructions on how to approach such a large-scale task. I will report back on the results! ;)



In the meantime, purely for fun, I propose to the public some experiments with the system, modeled on the post about the fancier GPT-3 (although the author of that post, clearly, does not consider GPT-3 to be anything outstanding). On the one hand, the system presented here is much simpler than GPT-3; on the other, it is still a trained Russian-language model! I think it's fun.



Disclaimer. The results are largely determined by the corpus the network was trained on. Here is what the author says about it: "I scraped a couple of Russian press sites, parsed HTML with beautifulsoup4 and saved parsed texts as well as metadata (headers, TL;DRs, timestamps) for further sorting and postprocessing in PKLs... In order to push things further the 4Gb dataset (415M tokens) was augmented with 3Gb of filtered fanfics, becoming a 7Gb one (862M tokens)."
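The corpus-preparation step the author describes (parse HTML, keep the text plus metadata, dump everything to PKLs) can be sketched roughly like this. This is only an illustration: I use a stdlib `html.parser` stand-in instead of the author's beautifulsoup4, and the record layout (`url`, `text`, `timestamp` fields) is my own guess, not the actual format of his PKLs.

```python
import pickle
import time
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal visible-text extractor (a stdlib stand-in for beautifulsoup4)."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def article_record(html, url):
    """Hypothetical record layout: parsed text plus metadata for later sorting."""
    parser = TextExtractor()
    parser.feed(html)
    return {"url": url, "text": "\n".join(parser.chunks), "timestamp": time.time()}

record = article_record(
    "<html><body><h1>Заголовок</h1><p>Текст статьи.</p></body></html>",
    "https://example.com/article",
)
blob = pickle.dumps(record)  # this byte string is what would go into a PKL file
```

Storing pickles rather than raw text keeps the metadata attached to each article, which makes the later sorting and filtering passes the author mentions straightforward.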



The experimental technique is simple: I run the network three times with the same input. The network's output is marked in bold... I made no edits except purely cosmetic ones, like restoring missing spaces (which were dropped because of my tinkering with the code*). On the whole, the experiments repeat those described in the article about GPT-3, and I decided not to comment on them in any way... ;)



Experiment #1

>>>

You pour yourself a glass of cranberry juice, but then absentmindedly add a teaspoon of grape juice to it. It looks fine. You try to sniff it, but you have a bad cold, so you can't smell anything. You are very thirsty. Therefore, you drink coffee with honey and sugar.
>>> new

You pour yourself a glass of cranberry juice, but then absentmindedly add a teaspoon of grape juice to it. It looks fine. You try to sniff it, but you have a bad cold, so you can't smell anything. You are very thirsty. So you pour yourself coffee again. You feel nauseous. There is nothing wrong with that, you do not know how to drink it anyway. You feel something very bad, but nobody pays attention to you.
>>> new

You pour yourself a glass of cranberry juice, but then absentmindedly add a teaspoon of grape juice to it. It looks fine. You try to sniff it, but you have a bad cold, so you can't smell anything. You are very thirsty. So you keep eating if you want, but don't know what it looks like.


Experiment #2

>>>

You have hosted a small party. You want to serve dinner in the living room. The dining room table is wider than the doorway, and you'll have to drink a glass of hot milk to move it into the living room.
>>> new

You have hosted a small party. You want to serve dinner in the living room. The dining room table is wider than the doorway and takes a few steps to move into the living room.
>>> new

You have hosted a small party. You want to serve dinner in the living room. The dining room table is wider than the doorway, and it takes a lot of work to move it into the living room.


Experiment #3

>>>

You are a defense attorney and you need to go to court today. While dressing in the morning, you find that the pants from your suit are very dirty. However, your swim trunks are clean and very stylish. These are high-end swim trunks from France that Isabelle gave you for your birthday. What do you decide to wear to the trial?

Oh yeah*
>>> new

You are a defense attorney and you need to go to court today. While dressing in the morning, you find that the pants from your suit are very dirty. However, your trunks are clean and very stylish. These are high-end swim trunks from France that Isabelle gave you for your birthday. You decide what to wear to the trial. If you don't like something, take clothes from your wardrobe with you.
>>> new

You are a defense attorney and you need to go to court today. While dressing in the morning, you find that the pants from your suit are very dirty. However, your trunks are clean and very stylish. These are high-end swim trunks from France that Isabelle gave you for your birthday. What do you decide to wear to the trial?

- I have many questions for a lawyer.


Experiment #4

>>>

Yesterday I left my clothes at the dry cleaner, and today I still need to pick them up. Where are my clothes?

- Idiot, I need to get him back, not do everything there according to plan.
>>> new

Yesterday I left my clothes at the dry cleaner, and today I still need to pick them up. Where are my clothes?

- No, I don't know what these things are, but I can always take one, and this is not always pleasant.
>>> new

Yesterday I left my clothes at the dry cleaner, and today I still need to pick them up. Where are my clothes?

- Okay. I won't bother him," I replied.



Experiment #5

>>>

Janet and Penny went to the store to buy a present for Jack. Janet said, "I'll buy Jack a top." “Don't buy Jack a top,” says Penny. “He has a top. He'll get it from me when we choose, and if you don't buy Jack, you'll sell Jack.
>>> new

Janet and Penny went to the store to buy a present for Jack. Janet said, "I'll buy Jack a top." “Don't buy Jack a top,” says Penny. “He has a top. He'll get his piece from Jack.
>>> new

Janet and Penny went to the store to buy a present for Jack. Janet said, "I'll buy Jack a top." “Don't buy Jack a top,” says Penny. “He has a top. He will receive. I'll take it. He is always there. And I won't sell. Only unlike Jack. He is not interested in what did not please him so. He will not buy anything other than buying himself.


Experiment #6

>>>

At the party, I poured myself a glass of lemonade, but it was too sour, so I added a little sugar. There was no spoon on hand, so I stirred the drink with a cigarette. But that turned out to be a very bad idea because I had been drinking for a long time. Wow, what a wonderful taste!
>>> new

At the party, I poured myself a glass of lemonade, but it was too sour, so I added a little sugar. There was no spoon on hand, so I stirred the drink with a cigarette. But that turned out to be a very bad idea, because the lemonade kept me from drinking.
>>> new

At the party, I poured myself a glass of lemonade, but it was too sour, so I added a little sugar. There was no spoon on hand, so I stirred the drink with a cigarette. But that turned out to be a very bad idea because it was too hot, so I started drinking from my throat.


Experiment #7

>>>

[The Russian text of this experiment's prompt and first output (a multiple-choice question with options a) and b)) did not survive extraction; only punctuation remains.]



* A small lyrical digression. I didn't like GPT-2's unlimited text generation (half a page at a time), as it was in the source code. So I cut the output down to a minimum, but made it possible to continue recursively (the previous output is fed back into the input). Thus I cheat a little, because I decide myself when to stop. In this case the network actually produced a result that can, with some stretch, be interpreted as an answer (albeit a paradoxical one), but if we continue the generation, then... Judge for yourself:
Correct answer:

2. A) You DO NOT need to know the rest.

3. B) Dependence.

4. On what basis do you want to drink it?


>>> new

…The correct answer: these are toxic substances.
>>> new

…The correct answer is: "Everything is ready."
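The truncated-but-recursive generation described in the footnote above (short outputs, with the previous output fed back into the input until a human decides to stop) can be sketched like this. The `generate` callable here is a toy stand-in, not the actual GPT-2 sampling code from the repository.

```python
def continue_recursively(generate, prompt, max_steps, stop=None):
    """Feed the accumulated text back into the model so each short
    continuation extends the previous one; a human (or a stop predicate)
    decides when enough is enough."""
    text = prompt
    for _ in range(max_steps):
        text = generate(text)
        if stop is not None and stop(text):
            break
    return text

# Toy stand-in for the real sampling call: appends one word per step.
words = iter(["The", "correct", "answer", "is", "ready."])
def toy_generate(text):
    return text + " " + next(words)

result = continue_recursively(toy_generate, ">>>", max_steps=5)
print(result)  # >>> The correct answer is ready.
```

The "cheating" the footnote admits to lives entirely in the stopping decision: the model itself has no notion of being done, so where the transcript ends is the operator's choice.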


And with that, that's all...



PS If the community suggests where I can host a 5Gb model (so that it would be available via something like wget), I will add a link to a Colab notebook to the article, and anyone who wants will be able to try the system live... ;) Otherwise my home "hosting", I'm afraid, will not withstand the habr-effect. In the meantime, I can try your text as input, if anyone is interested!



UPDATE: The community, represented by grigorov, responded, so here is the promised notebook! Now you can experiment yourself, compare with the original (link from DesertFlow's post "GPT-2 neural network from OpenAI. Quick start") and maybe draw some conclusions. ;) For example: does the language matter when training a language model?



AUTHOR'S COMMENT: Hi,



yes, of course, I do not mind - otherwise I would not upload the model here.



>>> Does language matter when training a language model?



Of course it does: I noticed that models with a small number of parameters handle Russian worse. I suppose this is due to the more complex (less formalized) semantics of Russian as compared to English; I wrote about it in my writeup. Also, the convention for rendering dialogue in Russian, where each line of dialogue starts on a new line preceded by a dash, without indicating who is speaking, does not help the model correctly identify the structure of the dialogue at all (and it also makes training harder, because the model learns to structure any text the same way; the same effect is observed whenever markup leaks into the training data).



Another point that I missed (it seemed obvious to me): if you want to fine-tune this model, you need to use the sentencepiece dictionary (sp.*) that comes with the model.
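Why reusing the checkpoint's sentencepiece dictionary matters can be shown with two toy vocabularies (both invented for illustration; real sentencepiece vocabularies are far larger): the same token id decodes to different subword pieces under different vocabularies, so fine-tuning with a mismatched dictionary silently feeds the model garbage.

```python
# Toy subword vocabularies. The pretrained model learned that ids [1, 2]
# spell "при" + "вет" ("привет"); a different vocabulary maps those same
# ids to unrelated pieces.
pretrain_vocab = {0: "<unk>", 1: "при", 2: "вет", 3: "мир"}
mismatched_vocab = {0: "<unk>", 1: "мир", 2: "при", 3: "вет"}

ids = [1, 2]

decoded_with_right_vocab = "".join(pretrain_vocab[i] for i in ids)
decoded_with_wrong_vocab = "".join(mismatched_vocab[i] for i in ids)

print(decoded_with_right_vocab)  # привет
print(decoded_with_wrong_vocab)  # мирпри -- garbage in, garbage trained on
```

The fix is exactly what the author says: point the fine-tuning run at the sp.* files shipped with the checkpoint rather than training a fresh tokenizer on your own corpus.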



Also, the 1250M model was trained mainly on news and press, and later on fanfiction, which is reflected in the nature of the results.


