We periodically write on Medium about projects that participants create as part of our educational programs, for example, how to build a spoken oracle. Today we are ready to share the results of the spring 2020 semester of the course.
Some data and analytics
This year we broke all records for course enrollment: at the beginning of February, about 800 people had signed up. Let's be honest, we were not ready for so many participants, so we improvised a lot along the way. But we will write about that next time.
Back to the participants. Did everyone finish the course? The answer is, of course, predictable: with each new assignment, the number of active participants shrank. Whether because of quarantine or for other reasons, only half remained by the middle of the course. Then it was time to decide on projects. The participants proposed seventy projects, and the most popular was Tweet Sentiment Extraction: nineteen teams tried their hand at this Kaggle competition.
More about the projects presented
Last week we held the final session of the course, where several teams presented their projects. If you missed the open seminar, we have prepared a recording. Below we briefly describe the implemented cases.
Kaggle Jigsaw: Multilingual Toxic Comment Classification
Roman Shchekin (QtRoS), Denis Grushentsev (evilden), Maxim Talimanchuk (mtalimanchuk)
This competition is a continuation of the popular Jigsaw competitions on toxic text detection; in this case, however, training is done on English data while testing is done on multilingual data (including Russian). Submissions are evaluated with the ROC AUC metric. The team took bronze (132nd place out of 1621) with a ROC AUC of ~0.9463. The final model was an ensemble of classifiers:
- XLM-RoBERTa large
- Naive Bayes
- BERT base
- BERT base multilingual
- USE multilingual
XLM-RoBERTa large with a 1024×1 linear layer was trained on the base dataset with the AdamW optimizer. The multilingual USE model was used as is (pretrained on 16 languages) without fine-tuning. Using BERT base was possible thanks to automatic translation of the test dataset into English. The training set was expanded with additional datasets.
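Blending the per-model predictions can be sketched as a weighted average of each classifier's toxicity probabilities. The model names and weights below are illustrative assumptions, not the team's actual configuration:

```python
def ensemble(per_model_probs, weights):
    """Weighted average of per-model toxicity probabilities.

    per_model_probs maps a model name to a list of probabilities,
    one per comment; weights maps the same names to blend weights.
    """
    total = sum(weights.values())
    n_comments = len(next(iter(per_model_probs.values())))
    blended = []
    for i in range(n_comments):
        s = sum(w * per_model_probs[name][i] for name, w in weights.items())
        blended.append(s / total)
    return blended


# Hypothetical probabilities for three comments from four of the models.
per_model_probs = {
    "xlm_roberta_large": [0.91, 0.12, 0.78],
    "bert_base_en_translated": [0.88, 0.20, 0.70],
    "use_multilingual": [0.85, 0.15, 0.65],
    "naive_bayes": [0.80, 0.30, 0.60],
}
weights = {
    "xlm_roberta_large": 0.4,
    "bert_base_en_translated": 0.3,
    "use_multilingual": 0.2,
    "naive_bayes": 0.1,
}

blended = ensemble(per_model_probs, weights)  # one score per comment
```

Since ROC AUC depends only on the ranking of scores, a weighted average like this often improves the metric even when the individual models are calibrated differently.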
The project presentation is available here.
The GitHub of the project is available at this link.
On BERT Distillation
Nikita Balagansky
As you know, models based on the BERT architecture achieve impressive quality but still lag far behind in inference speed, because BERT carries a very large number of weights. There are several ways to shrink such a model; one of them is distillation. The idea behind distillation is to train a smaller "student" model that mimics the behavior of the larger "teacher" model. The Russian student model was trained on a news dataset for 100 hours on four 1080 Ti cards. As a result, the student model turned out to be 1.7 times smaller than the original. The student and teacher were compared on the Mokoron sentiment classification dataset, where the student performed comparably to the teacher. The training script was written using the catalyst package. You can read more about the project on Medium.
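The core of distillation is a loss that pushes the student's output distribution toward the teacher's. A minimal sketch in pure Python, following the standard temperature-softened KL formulation (the exact loss used in this project may differ):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher T yields softer teacher targets."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice this term is combined with the usual hard-label cross-entropy; the loss is zero when the student reproduces the teacher's logits exactly and grows as the distributions diverge.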
The project presentation is available here.
The GitHub of the project is available at this link.
Picture: rasa.com
Open Data Science Question Answering
Ilya Sirotkin, Yuri Zelensky, Ekaterina Karpova
It all started with a post in ODS from Ekaterina Karpova. The idea was quite ambitious: to create an autoresponder for questions in the ODS Slack community, based on the collected dataset of questions and answers. However, preliminary analysis revealed that most questions are quite unique, and creating a labeled test set for quality assessment is rather laborious. It was therefore decided to first build a classifier that determines which ODS Slack channel a question belongs to. It would help ODS newcomers ask their questions in the channel with the relevant topic. The pwROC-AUC metric was chosen for quality assessment.
As part of the project, a comparative analysis of popular text classification models was carried out. The best of them, the RuBERT-based model from DeepPavlov, reached 0.995 pwROC-AUC. Such high quality numbers indicate a high degree of separation (and separability) in the original data. The only channel that proved problematic for all the models tested is _call_4_colaboration; why this particular channel is difficult has not yet been determined.
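The post does not spell out the pw (pairwise/per-channel) variant of the metric, but the binary ROC AUC it builds on has a simple rank interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A small sketch for intuition:

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive example
    is scored above a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 0.995 therefore means the model ranks a question from the right channel above one from a wrong channel in almost every pair, which is why the authors speak of highly separable data.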
Having dealt with this task, the team has not given up hope of returning to the original goal of answering questions from ODS users.
The project presentation is available here.
The GitHub of the project is available at this link.
Russian Aspect-Based Sentiment Analysis
Dmitry Bunin
In this project, the task was to determine the sentiment expressed toward a given object in a text (task C from the Dialogue Evaluation 2015 competition). Both Russian and English data were used. The comparison mainly covered modern models based on the ELMo (from the RusVectores package) and BERT (from the DeepPavlov package) architectures. The ELMo + CNN model on Russian data showed quality comparable to the best model from the competition, despite the small training sample and strong class imbalance.
The project presentation is available here.
The GitHub of the project is available at this link.
Kaggle: Tweet Sentiment Extraction
Kirill Gerasimov
According to the terms of the competition, the task was to extract from a tweet the key word or phrase that defines its sentiment. The word-level Jaccard score was used as the quality metric. In this competition, all participants faced noisy data and ambiguous labeling. The team took a public notebook with a RoBERTa-base model as the baseline. This model uses a reading-comprehension approach in which the start and end of the key phrase are predicted (with the hard constraint that the end comes after the start). As is traditional, an ensemble of different models performed better than individual models. The result was bronze (135th place out of 2100 teams). In the experience of the competition winner, two-level labeling gives even better scores.
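The word-level Jaccard score mentioned above can be computed along these lines: both the predicted and the true phrase are split into lowercase word sets, and the ratio of their intersection to their union is taken (a zero-denominator guard is added here for safety):

```python
def jaccard(predicted, target):
    """Word-level Jaccard similarity between two phrases."""
    a = set(predicted.lower().split())
    b = set(target.lower().split())
    c = a & b
    union_size = len(a) + len(b) - len(c)
    if union_size == 0:  # both phrases empty
        return 1.0
    return len(c) / union_size
```

For example, predicting "love this" when the gold phrase is "I love this" shares two of three distinct words, giving a score of 2/3; the leaderboard metric is this value averaged over all tweets.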
The project presentation is available here.
The GitHub of the project is available at this link.
Automatic Solution of the Unified State Exam
Mikhail Teterin and Leonid Morozov
The goal of this project was to improve quality metrics on three tasks from the AI Journey 2019 competition (automatic solution of the Unified State Exam), namely:
- search for main information in the text;
- determining the meaning of a word in a given context;
- placement of punctuation marks in sentences.
On all three tasks, the best competition solution was surpassed. Much of the improvement comes from using additional training data. Models based on RuBERT from DeepPavlov showed the best quality in these solutions.
The project presentation is available here.
The GitHub of the project is available at this link.
In this article, we have described some of the projects presented at the seminar, but of course there were more of them.
Thanks to everyone who took an active part in the course and did not give up. And for those who are just learning and looking for interesting problems in NLP, we recommend the DeepPavlov Contribute project. The future of Conversational AI is in your hands!