Sentiment analysis in Russian-language texts, part 1: introduction

Sentiment analysis has become a powerful tool for large-scale processing of opinions expressed in any text source. Its practical application to English is well developed, which cannot be said of Russian. In this series of articles, we will look at how and for what purposes sentiment analysis has been applied to Russian-language texts, what results were achieved, and what problems arose, and we will also talk a little about promising directions. Unlike previous works, I focused on practical applications rather than on the approaches themselves and their classification quality. The first part is introductory: we will consider what sentiment analysis is and how it has been used over the past 8 years to analyze Russian-language texts. In the second part, which comes out next week, we will take a closer look at each of the 32 major studies I found. In the third and final part (again, next week), we will talk about the common difficulties researchers face, as well as promising directions for the future.



NB: The article was written for a scientific journal, so there will be many links to sources.


1. Introduction



Sentiment analysis is a class of content analysis methods in computational linguistics whose main task is to classify text according to its sentiment. With sentiment analysis, researchers can summarize the sentiment of texts and draw conclusions on different topics. For example, it has been used to predict the securities market [1], calculate an index of subjective well-being [2], predict election results [3], and assess reactions to events or news [4]. Sentiment analysis for English is already well developed [5] - [7], while other languages, Russian in particular, have received much less attention so far. According to a study by Omnibus GFK [9], 75.4% of Russians (90 million people) over the age of 16 use the Internet. There are Russian-speaking diasporas on all continents, but the bulk of Russian speakers live in the CIS, mostly in Russia and Ukraine. According to a study by W3Techs, Russian is one of the most widespread languages on the Internet: as of April 2020, 8.6% of the 10 million most popular websites in the world were in Russian. Russian-language texts are therefore an important source of data for automatic analysis, and for sentiment analysis in particular.



Only one survey [10], by Viksna and Jekabsons, is devoted to sentiment analysis of Russian-language texts. Several others [11] - [14] mention the topic in the context of a general comparison with existing approaches. Some other studies address specific aspects of sentiment analysis of Russian-language texts: for example, evaluation of the best approaches [15] - [18], comparison of neural network architectures for sentiment analysis [19], [20], and comparison of open Russian-language sentiment lexicons [21]. However, all of these studies focused on the approaches themselves and their classification quality rather than on practical application and the results of the analysis. I considered only works in which analysis results were obtained from real data, and left out those devoted only to training classifiers. This article is a condensed translation of an article published in IEEE Access. If you want more details, or would simply rather read in English, you can find it here.



The second section briefly describes the task of sentiment analysis and current approaches; if you are already familiar with them, feel free to skip it. The third section is one of the main ones: it examines how sentiment analysis has been used for Russian-language texts and describes the 32 main studies, their insights and weaknesses. The fourth section focuses on current challenges, and the fifth on promising directions.



2. A brief overview of sentiment analysis methods



Sentiment analysis is a class of content analysis methods in computational linguistics whose main task is to classify text according to its sentiment. In simple cases, the task reduces to a binary classification of texts into positive and negative. In some cases, a third class of neutral texts is added. More advanced approaches attempt to identify the emotional states associated with a text, such as fear, anger, sadness, or happiness. In a number of approaches, texts are assigned values on a predetermined scale, for example from -2 for negative to 2 for positive, so that the analysis becomes a regression problem. Aspect-based sentiment analysis is a subtype of sentiment analysis whose task is to determine the attitude towards a specific aspect of the main subject of discussion. All approaches to sentiment analysis can be divided into three groups.



The first group is rule-based approaches. Most often, they rely on manually defined classification rules and sentiment lexicons, in which words are annotated with their sentiment. The rules usually determine the class of a text from sentiment-bearing keywords and their combinations with other keywords [22] - [24]. Although they are very effective within a specific subject area, rule-based methods generalize poorly. They are also extremely time-consuming to create, especially when no suitable sentiment lexicon is available. This is particularly true for Russian, which has fewer resources than English, especially for sentiment analysis. The largest Russian-language sentiment lexicons are RuSentiLex [25] and LINIS Crowd [26], but they only record polarity from positive to negative, with no information about specific emotions. Thus, Russian has no counterparts to such powerful English-language resources with rich emotional annotation as SenticNet [27], SentiWordNet [28] and SentiWords [29].
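
To make the idea of keyword rules concrete, here is a minimal sketch of a lexicon-based classifier with a single negation rule. The four-word lexicon, its weights, and the function names are hypothetical and chosen only for illustration; they are not taken from RuSentiLex, LINIS Crowd, or any of the cited systems.

```python
import re

# Toy lexicon with hypothetical weights (NOT taken from RuSentiLex or LINIS Crowd).
LEXICON = {"хороший": 1.0, "отличный": 2.0, "плохой": -1.0, "ужасный": -2.0}
# Words that flip the polarity of the token that follows them.
NEGATIONS = {"не", "нет"}


def rule_based_score(text):
    """Sum lexicon weights over tokens, flipping the sign after a negation."""
    tokens = re.findall(r"\w+", text.lower())
    score = 0.0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        weight = LEXICON.get(token, 0.0)
        score += -weight if negate else weight
        negate = False  # a negation only affects the next token
    return score


def classify(text):
    """Map the summed score to one of three classes."""
    score = rule_based_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Note that the sketch matches exact word forms only. Russian is highly inflected, so a practical system would normalize words (for example, by lemmatization) before looking them up, which is part of why building such rules by hand is laborious.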



The second group is machine learning approaches. They extract features from text automatically and apply machine learning algorithms. The classical algorithms for polarity classification are the Naive Bayes classifier [30], decision trees [31], logistic regression [32] and support vector machines [33]. In recent years, researchers have turned to deep learning methods, which significantly outperform traditional methods in sentiment analysis [34]. This is borne out by the history of the SemEval competition, where the leading solutions successfully used convolutional (CNN) and recurrent (RNN) neural networks [35] - [37], as well as transfer learning [38]. One of the main features of machine learning systems is automatic feature extraction from text. Simple approaches to representing text in vector space typically use the bag-of-words model. More complex systems generate word embeddings with distributional semantics models such as Word2Vec [39], GloVe [40], or FastText [41]. There are also algorithms that generate embeddings at the sentence or paragraph level and are designed for transfer learning across natural language processing tasks. These include ELMo [42], Universal Sentence Encoder (USE) [27], Bidirectional Encoder Representations from Transformers (BERT) [43], Enhanced Language Representation with Informative Entities (ERNIE) [44], and XLNet [45]. One of their main disadvantages as embedding generators is that they need large amounts of text for training. However, this holds for machine learning methods in general, since all supervised learning algorithms require labeled datasets for training.
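
As an illustration of the simplest pipeline mentioned above, the following sketch combines a bag-of-words representation with a multinomial Naive Bayes classifier using add-one smoothing. It is written in plain Python so that it runs without extra libraries; the class name, the four training sentences, and their labels are invented for the example and do not come from any of the cited studies.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Bag-of-words tokenization: lowercase word tokens, order ignored later."""
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            logp = math.log(n_docs / total_docs)  # class prior
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # skip words never seen during training
                logp += math.log(
                    (counts[token] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical labeled examples, far too few for real use.
TRAIN = [
    ("отличный фильм очень понравился", "positive"),
    ("хороший сервис быстрая доставка", "positive"),
    ("ужасный фильм скучный сюжет", "negative"),
    ("плохой сервис долгая доставка", "negative"),
]
model = NaiveBayes().fit(*zip(*TRAIN))
```

Unlike the rule-based case, nothing here encodes which words are positive or negative; the classifier infers that from labeled data, which is exactly why labeled datasets are a prerequisite.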



The third group is hybrid approaches, which combine approaches of the two previous types. For example, Kumar and his colleagues developed a hybrid framework for sentiment analysis in Persian that combines linguistic rules, convolutional neural networks and LSTMs for sentiment classification [46]. Meskele and Frasincar proposed ALDONAr, a hybrid model for aspect-based analysis that combines a sentiment ontology for capturing sentiment information, BERT for word embeddings, and two CNN layers for extended sentiment classification [47]. The model showed an accuracy of 83.8% on the SemEval 2015 Task 12 dataset [48] and 87.1% on the SemEval 2016 Task 5 dataset [49]. Language models are often used in hybrid algorithms, as are rule-based solutions [50] - [52]. On the one hand, a combination of rule-based methods and machine learning usually produces more accurate results. On the other hand, hybrid approaches inherit the difficulties and limitations of their constituent algorithms.
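
One simple way to combine the two families is to let the rules decide when the lexicon signal is strong and to fall back to a trained model otherwise. The sketch below shows only that generic pattern; it is not the architecture of [46] or [47], and the lexicon, the threshold, and the stand-in model are all hypothetical.

```python
import re
from typing import Callable

# Toy lexicon with hypothetical weights, for illustration only.
LEXICON = {"отличный": 2.0, "хороший": 1.0, "плохой": -1.0, "ужасный": -2.0}


def lexicon_score(text: str) -> float:
    """Rule-based component: sum of lexicon weights over the tokens."""
    return sum(LEXICON.get(token, 0.0) for token in re.findall(r"\w+", text.lower()))


def hybrid_classify(
    text: str,
    ml_predict: Callable[[str], str],
    threshold: float = 1.5,
) -> str:
    """Trust the rules when they are confident, otherwise ask the model."""
    score = lexicon_score(text)
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return ml_predict(text)  # any trained classifier can be plugged in here


def dummy_model(text: str) -> str:
    """Stand-in for a trained classifier; always answers 'neutral'."""
    return "neutral"
```

The design also makes the inherited limitations visible: the first branch is only as good as the lexicon, and the fallback is only as good as the model's training data.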



3. Search for sources



To find key publications on applied sentiment analysis of Russian-language texts, I searched scientific databases that cover the leading computer science journals and conferences: IEEE Xplore, ACM Digital Library, ScienceDirect, SAGE Journals Online, and Springer Link. To broaden the range of sources, in addition to English-language articles I also studied Russian-language articles from the Russian Science Citation Index (RSCI). The search used the query (("SENTIMENT" OR "POLARITY") AND ("ANALYSIS" OR "DETECTION" OR "CLASSIFICATION" OR "OPINION MINING" OR "TOPIC MODELING") AND ("RUSSIAN" OR "RUSSIA")). Most of the relevant articles were found in ScienceDirect, Springer Link and RSCI. I also reviewed preprints by leading researchers so as not to miss new developments. As a result, I collected several thousand potentially relevant articles, not counting gray literature and preprints. Preference was given to the most recent and most cited works. I then analyzed the titles, keywords, and introductions of the remaining publications to narrow down the selection. To improve the quality of the sample, only peer-reviewed articles were considered. I excluded gray sources (for example, work in progress, editorials, dissertations), as well as sources unsuitable for my research (those that do not apply sentiment classification models). Finally, for detailed discussion in this article, I manually selected 32 major publications that described at least one practical application of sentiment analysis to Russian-language texts.



4. Applications of sentiment analysis to Russian-language texts



Fig. 1. Categories of the reviewed studies by data source.



I categorized the approaches by data source, because approaches within one category then share similar goals, challenges and limitations. Although some categories contain only one study, I kept them separate because of fundamental differences in the approaches used, the results and the difficulties. Keep in mind also that Russian has been studied less in terms of sentiment analysis, so the number of works is limited. Fig. 1 presents the set of categories.



Most of the approaches relied on social media data to gauge user attitudes toward different topics, for example attitudes and opinions about the conflict in Ukraine and problems related to migrants. Over the last decade, many social networks have turned into modern tools for social engagement [53], so they can be treated as open and widely available sources of public opinion, or at least as some reflection of it [54]. User-generated content (UGC) from social networks, the most common source of information, was examined along three lines: attitudes towards different topics; social mood indices; and the ways users interact with content expressing different moods. Attitudes towards different topics were studied from several angles: for example, attitudes towards migrants and ethnic groups (e.g., [55]), expressions of sentiment during the Ukrainian crisis (e.g., [56]), measurement of the level of social tension (e.g., [57]), or discourse on particular important issues (e.g., [58]). Typically, these approaches combine topic modeling and sentiment analysis to identify themes and the sentiments associated with them. In many studies (e.g., [59] - [67]) that apply topic modeling without subsequent polarity classification (and are therefore not covered in this article), sentiment analysis is mentioned as a direction for further work. Another group of studies (e.g., [68]) computes social mood indices from opinions expressed on social networks as an alternative to the traditional index of subjective well-being. Finally, a further group (e.g., [69]) examines patterns of user interaction with content depending on its emotional tone. One of the main difficulties in such studies is extracting representative data samples and selecting relevant texts for subsequent analysis.



The next most common source of information is reviews of products and services. They were analyzed in terms of the characteristics of the reviewers themselves (e.g., [70]), of the products and services (e.g., [71]), and of the sellers (e.g., [72]). Unlike user-generated data from social networks, older data is not difficult to access. Review sites often let users assign a rating in addition to the review text, so formally there is no need to build a sentiment classification model, because the rating classes are already known. Nevertheless, some studies use sentiment classification models purely out of academic interest. Because social media posts and user reviews often reflect subjective points of view, analyzing this data differs from analyzing news. Journalists usually try to avoid judgments and outright bias, doubt and ambiguity, since objectivity, or at least neutrality, is at the heart of their profession [73]. Therefore, journalists often do not use words from positive or negative vocabulary and resort to other ways of expressing their opinion [74].



The third main source was news from the media, which was analyzed along two lines: the sentiment of the news itself (e.g., [75]) and economic and business forecasts based on news sentiment (e.g., [76]). Unlike user-generated data from social networks, older data is not difficult to access, because the media usually do not restrict access to it. However, the authors of some studies tried to determine the public's attitude to specific topics from news, which, in my opinion, requires further elaboration. Of course, the media can be considered a reflection of public opinion, but in some cases editorial policy may influence how material is presented, so news does not always reflect public opinion. Researchers have paid somewhat less attention to the most recent direction, sentiment analysis of textbooks: such studies appeared only in 2019. These works compare the sentiments expressed in different textbooks (e.g., [77]) and study the impact of these sentiments on the educational process (e.g., [78]). The main challenge is the lack of sentiment lexicons and training datasets oriented to textbooks. Moreover, when texts are analyzed at the document level, it becomes difficult to assign a text to a single sentiment class, because textbook texts are long and may contain several different emotions at once.



To capture a wider range of opinions, some studies work with mixed data sources. In this group, researchers usually study attitudes towards different topics, such as the Ukrainian crisis (e.g., [79]) or media coverage of Alexei Navalny (e.g., [80]). Because the sources are mixed, such data can support a wide variety of research. However, along with the wider range of opinions, the authors also inherit the difficulties and limitations of each source.



A summary of the approaches found is presented in Table 1. Looking at the distribution of articles by year, the number of studies on the sentiment of Russian-language texts grew in 2014-2016 and peaked in 2017. The number of articles per journal or conference proceedings varies somewhat: only seven journals and proceedings published more than one of the analyzed articles. The largest number of the articles found were published in the proceedings of the conference “Digital Transformations and Global Society”.



Table 1. Summary of Discovered Studies. RB - rule-based approaches, ML - machine learning approaches, UNK - unknown approaches, WL - word-level analysis, DL - document-level analysis.



Reference  Approach                 Level   Category  Purpose
[81]       ML (Logit)               DL      UGC
[82]       ML (Logit)               DL
[83]       ML (Logit)               DL
[84]       RB (SentiStrength)       DL
[55]       ML (SVM)                 DL
[85]       RB (custom)              DL
[86]       RB (POLYARNIK)           DL
[87]       RB (SentiMental)         DL
[88]       UNK (IQBuzz)             DL
[56]       RB (custom)              DL
[89]       ML (SVM)                 DL
[57]       RB (SentiStrength)       DL
[58]                                DL
[90]       RB (SentiStrength)       DL                2014
[91]       RB (SentiStrength)       DL                2011-2012
[92]       ML (NBC)                 DL
[93]       RB (custom)              WL, DL
[68]       ML (GBM)                 DL
[69]       ML (BiGRU)               DL
[70]                                DL
[71]       ML (NB, SGD)             DL
[72]       ML (RNTN)                DL
[94]       RB (custom)              DL
[95]       RB (custom)              DL
[96]       RB (custom)              DL
[75]       UNK (Medialogia)         DL
[76]       ML (SVM)                 DL
[77]       RB (custom)              WL
[78]       ML ( )                   DL
[97]       UNK (Crimson Hexagon)    DL
[79]       UNK (Crimson Hexagon)    DL
[80]       UNK (Medialogia)         DL


Rule-based approaches (40.63%) and machine learning approaches (37.5%) were used about equally often. Studies in the first group most often used either custom rule-based models or SentiStrength [22], which was the most popular of the third-party ready-to-use solutions. In the second group, logistic regression [32], support vector machines [33] and the naive Bayes classifier [30] were used most often. Simple machine learning methods were the most popular, and only 16.7% of the machine learning studies used neural networks. However, since 2019 the share of machine learning approaches has significantly exceeded the share of rule-based approaches. 15.6% of the studies found used third-party cloud services such as Medialogia, IQBuzz, and Crimson Hexagon for sentiment analysis. In these cases, I could not determine the approaches used, because there is no official information on the classification algorithms applied.
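
The shares quoted above can be checked against Table 1 with a short tally. The sketch below transcribes the approach group of each of the 32 references from the table (an empty string marks the two rows where no approach is listed); the variable names are mine. It gives 13 of 32 rule-based studies (40.625%, reported as 40.63%), 12 of 32 machine learning studies (37.5%), 5 of 32 studies using third-party services (15.625%, reported as 15.6%), and 2 neural models among the 12 machine learning studies (about 16.7%).

```python
from collections import Counter

# Approach group per reference number, transcribed from Table 1.
# "" marks rows where the table lists no approach.
TABLE_1 = {
    81: "ML", 82: "ML", 83: "ML", 84: "RB", 55: "ML", 85: "RB", 86: "RB",
    87: "RB", 88: "UNK", 56: "RB", 89: "ML", 57: "RB", 58: "", 90: "RB",
    91: "RB", 92: "ML", 93: "RB", 68: "ML", 69: "ML", 70: "", 71: "ML",
    72: "ML", 94: "RB", 95: "RB", 96: "RB", 75: "UNK", 76: "ML", 77: "RB",
    78: "ML", 97: "UNK", 79: "UNK", 80: "UNK",
}
# Rows whose listed model is a neural network: BiGRU [69] and RNTN [72].
NEURAL = {69, 72}

counts = Counter(TABLE_1.values())
total = len(TABLE_1)

# Percentage of all studies per approach group.
shares = {group: 100 * n / total for group, n in counts.items()}
# Percentage of machine learning studies that use a neural network.
neural_share = 100 * len(NEURAL) / counts["ML"]

for group in ("RB", "ML", "UNK"):
    print(f"{group}: {counts[group]} of {total} studies, {shares[group]:.3f}%")
print(f"neural networks: {len(NEURAL)} of {counts['ML']} ML studies, {neural_share:.1f}%")
```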



In several cases, I found methodological flaws, including missing descriptions of preprocessing, data labeling, the training process, and classification quality metrics. In a number of cases, the classification model was not validated on a dataset from the relevant subject area. This is especially true of sentiment analysis with rule-based approaches or third-party services: researchers usually did not manually label a set of texts and therefore could not assess the quality of the classification.
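
Checking a classifier against even a small manually labeled sample is enough to report a quality metric. The sketch below computes per-class and macro-averaged F1 from two lists of labels; the function names and the four example labels are hypothetical, and the classifier that produced the predictions could equally be a rule-based tool or a third-party service.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """F1 score for every label that occurs in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # true label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        predicted_n = tp[label] + fp[label]
        actual_n = tp[label] + fn[label]
        precision = tp[label] / predicted_n if predicted_n else 0.0
        recall = tp[label] / actual_n if actual_n else 0.0
        denominator = precision + recall
        scores[label] = 2 * precision * recall / denominator if denominator else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, predicted)
    return sum(scores.values()) / len(scores)


# Hypothetical manual labels and classifier output for four texts.
GOLD = ["positive", "negative", "neutral", "negative"]
PRED = ["positive", "negative", "negative", "negative"]
```

Macro averaging weights every class equally, so a classifier that never predicts a rare class (here, "neutral") is penalized even if its overall accuracy looks acceptable.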



5. Next



The second part of the article will be released next week; in it, we will take a closer look at each of the 32 main studies that I found. In the third and final part (again, next week), we will talk about the common difficulties researchers face, as well as promising directions for the future. If you want to read the entire article at once and in English, go here.



6. Sources



A complete list of sources can be found here.


