Introduction
Sentiment analysis is a natural language processing (NLP) technique used to determine whether a piece of text is positive, negative, or neutral.
Sentiment analysis is fundamental to understanding the emotional nuances of language. This, in turn, helps automatically sort the opinions behind reviews, social media discussions, comments, and more.
Although sentiment analysis has become extremely popular in recent years, work on it has been going on since the early 2000s. Traditional machine learning techniques such as Naive Bayes, Logistic Regression, and Support Vector Machines (SVMs) are widely used for large volumes of data because they scale well. In practice, deep learning (DL) methods have been shown to provide the best accuracy for a variety of NLP tasks, including sentiment analysis; however, they tend to be slower and more expensive to train and use.
In this article, I want to propose a little-known alternative that combines speed and quality. A baseline model is needed for comparative assessment and conclusions; I chose the time-tested and popular BERT.
Data
The data is a set of tweets labeled by sentiment: the text lives in the Tweet column and the label in the Sentiment column.
The target has 3 classes: negative, neutral, and positive.
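The loading step is not shown in the article, so here is a minimal sketch; the file name train.csv is hypothetical, and only the Tweet and Sentiment columns are assumed (they are the ones used by the code below).

import pandas as pd

# Hypothetical file name; only the 'Tweet' and 'Sentiment' columns are used later.
df_train = pd.read_csv('train.csv')[['Tweet', 'Sentiment']].dropna()

# Check the balance of the three sentiment classes.
print(df_train['Sentiment'].value_counts(normalize=True))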
BERT
The model is loaded from TensorFlow Hub. TensorFlow Hub is a repository of trained machine learning models that are ready for fine-tuning and deployable anywhere. With just a few lines of code, you can reuse trained models such as BERT and Faster R-CNN.
!pip install tensorflow_hub
!pip install tensorflow_text
small_bert/bert_en_uncased_L-4_H-512_A-8 is one of the smaller BERT models referenced in "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models". The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models; however, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
bert_en_uncased_preprocess is the matching text preprocessing model for BERT. It uses an English vocabulary extracted from Wikipedia and BooksCorpus. Text inputs are normalized the "uncased" way: the text is lower-cased before tokenization into word pieces, and any accent markers are stripped.
tfhub_handle_encoder = \
"https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"
tfhub_handle_preprocess = \
"https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
I will not complicate the model with anything extra: the goal here is not to chase SOTA (State-of-the-Art) results but to show the approach.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 - registers the ops required by the preprocessing model
from tensorflow.keras.optimizers import Adam

def build_classifier_model():
    # Raw strings go straight into the model; preprocessing is part of the graph.
    text_input = tf.keras.layers.Input(
        shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(
        tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(
        tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    # Use the pooled [CLS] representation for classification.
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(
        3, activation='softmax', name='classifier')(net)
    model = tf.keras.Model(text_input, net)
    # The head already applies softmax, so the loss expects probabilities, not logits.
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
    metric = tf.metrics.CategoricalAccuracy('accuracy')
    optimizer = Adam(
        learning_rate=5e-05, epsilon=1e-08, decay=0.01, clipnorm=1.0)
    model.compile(
        optimizer=optimizer, loss=loss, metrics=[metric])
    model.summary()
    return model
I set aside 30% of the data for validation.
from sklearn.model_selection import train_test_split

# Stratified 70/30 split to preserve the class balance.
train, valid = train_test_split(
    df_train,
    train_size=0.7,
    random_state=0,
    stratify=df_train['Sentiment'])

y_train, X_train = \
    train['Sentiment'], train.drop(['Sentiment'], axis=1)
y_valid, X_valid = \
    valid['Sentiment'], valid.drop(['Sentiment'], axis=1)

# One-hot encode the three sentiment classes for the softmax head.
y_train_c = tf.keras.utils.to_categorical(
    y_train.astype('category').cat.codes.values, num_classes=3)
y_valid_c = tf.keras.utils.to_categorical(
    y_valid.astype('category').cat.codes.values, num_classes=3)
Everything is ready to train the model.
classifier_model = build_classifier_model()

history = classifier_model.fit(
    x=X_train['Tweet'].values,
    y=y_train_c,
    validation_data=(X_valid['Tweet'].values, y_valid_c),
    epochs=5)
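The accuracy reported below is presumably measured on the validation split; a short sketch of how it can be obtained from the compiled model:

# Accuracy of the fine-tuned model on the held-out validation tweets.
loss, accuracy = classifier_model.evaluate(X_valid['Tweet'].values, y_valid_c)
print(f'BERT Accuracy: {accuracy}')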
BERT Accuracy: 0.833859920501709
The confusion matrix shows, for each true class, how the model's predictions are distributed, which makes it easy to see how often each class is predicted correctly and which classes the model confuses with each other.
The classification report summarizes per-class precision, recall, and F1-score.
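The plots themselves are not reproduced here, but both reports are easy to generate with scikit-learn. A minimal sketch, assuming classifier_model and the validation split defined above:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predicted class probabilities and hard labels from the BERT model.
y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)
y_pred_bert = np.argmax(y_proba_bert, axis=1)

# True labels encoded with the same categorical codes used for training.
y_true = y_valid.astype('category').cat.codes.values

print(confusion_matrix(y_true, y_pred_bert))
print(classification_report(y_true, y_pred_bert))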
The quality looks solid. Now let's see how a much simpler and faster model compares.
CatBoost
CatBoost is a high-performance, open-source library for gradient boosting on decision trees. Starting with release 0.19.1, it supports text features for classification out of the box.
I do not expect CatBoost to beat a carefully fine-tuned transformer on quality; its main advantage is speed. In my case, training CatBoost on the same data was on the order of 20-40 times faster than fine-tuning BERT, and the resulting CatBoost model is much cheaper to use. The question is how much quality has to be sacrificed for that speed.
!pip install catboost
No special text preprocessing is needed: CatBoost handles tokenization and dictionary building itself. Let's start with a helper function for training the model; the overfitting detector (od_type/od_wait) stops training when accuracy on the evaluation set stops improving.
from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',  # overfitting detector counts iterations without improvement
        od_wait=500,     # stop if accuracy has not improved for 500 iterations
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=True,
        use_best_model=True)
When working with text, the data is passed to CatBoost via a Pool. Pool is CatBoost's internal data container that holds the feature matrix, the labels, and metadata such as which columns are categorical or text features.
text_features is a one-dimensional array of text column indices (specified as integers) or names (specified as strings). It is used only when the data argument is a two-dimensional feature matrix (one of: list, numpy.ndarray, pandas.DataFrame, pandas.Series). If any elements of this array are specified as names instead of indices, names must be provided for all columns: either pass them explicitly via the feature_names parameter, or pass a pandas.DataFrame with column names specified in the data parameter.
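The snippets above do not show how train_pool and valid_pool are built. A minimal sketch, assuming the same train/validation split used for BERT and the Tweet column as the only feature, declared as a text feature:

from catboost import Pool

# Wrap the raw tweets and labels; declaring 'Tweet' as a text feature lets
# CatBoost tokenize it and build dictionaries on its own.
train_pool = Pool(
    data=X_train[['Tweet']],
    label=y_train,
    text_features=['Tweet'])

valid_pool = Pool(
    data=X_valid[['Tweet']],
    label=y_valid,
    text_features=['Tweet'])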
The most useful parameters for text processing are:
tokenizers: tokenizers used to preprocess the text features before the dictionaries are built.
dictionaries: dictionaries used to preprocess the text features (which tokens to keep and how large the vocabulary may grow).
feature_calcers: feature calcers that compute new numeric features (for example, Bag-of-Words) from the preprocessed text.
I did not spend much time tuning these parameters; the values below were chosen mostly intuitively.
model = fit_model(
train_pool, valid_pool,
learning_rate=0.35,
tokenizers=[
{
'tokenizer_id': 'Sense',
'separator_type': 'BySense',
'lowercasing': 'True',
'token_types':['Word', 'Number', 'SentenceBreak'],
'sub_tokens_policy':'SeveralTokens'
}
],
dictionaries = [
{
'dictionary_id': 'Word',
'max_dictionary_size': '50000'
}
],
feature_calcers = [
'BoW:top_tokens_count=10000'
]
)
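The accuracy reported below is measured on the validation pool; a short sketch of how it can be computed (accuracy_score and the y_valid series from the earlier split are assumed):

from sklearn.metrics import accuracy_score

# Class predictions on the validation pool; reshape in case CatBoost
# returns them as a column vector in the multiclass setting.
y_pred_cb = model.predict(valid_pool).reshape(-1)

print('CatBoost model accuracy:', accuracy_score(y_valid, y_pred_cb))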
CatBoost model accuracy: 0.8299104791995787
The accuracy is only slightly below BERT's. Can we get a bit more out of the two models together? Both output class probabilities, so the simplest option is to blend them by averaging the predictions.
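Neither probability matrix is defined in the snippets above, so here is a minimal sketch of how they can be obtained. It assumes classifier_model and model from the previous sections; note that CatBoost orders predict_proba columns by sorted class label, which matches the pandas categorical codes used for the BERT targets.

import numpy as np

# Class probabilities for the three classes on the validation set.
y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)  # shape (n, 3)
y_proba_cb = model.predict_proba(valid_pool)                      # shape (n, 3)

# True labels encoded with the same categorical codes as for BERT training;
# the blended predictions below can be scored against them with
# sklearn.metrics.accuracy_score(y_true, y_proba_avg).
y_true = y_valid.astype('category').cat.codes.values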
y_proba_avg = np.argmax((y_proba_cb + y_proba_bert)/2, axis=1)
The simple average gives a noticeable boost:
Average accuracy: 0.855713533438652
To sum up, in this article we:
trained a baseline model with BERT;
created a model with CatBoost using its built-in text processing capabilities;
looked at what happens if the outputs of both models are averaged.
In my opinion, complex and slow SOTA solutions can be avoided in most cases, especially when speed is a critical requirement.
CatBoost provides excellent sentiment analysis capabilities for text right out of the box. For competition enthusiasts on platforms such as Kaggle and DrivenData, CatBoost can provide a good model both as a baseline solution and as part of an ensemble.
The code from the article can be viewed here.