Introduction
Sentiment analysis is a natural language processing (NLP) technique used to determine whether a piece of text is positive, negative, or neutral.
Sentiment analysis is fundamental to understanding the emotional nuances of language. This, in turn, helps automatically sort the opinions behind reviews, social media discussions, comments, and more.
Although sentiment analysis has become extremely popular in recent years, work on it has been going on since the early 2000s. Traditional machine learning techniques such as Naive Bayes, Logistic Regression, and Support Vector Machines (SVMs) are widely used for large volumes of data because they scale well. In practice, deep learning (DL) methods have been shown to provide the best accuracy for a variety of NLP tasks, including sentiment analysis; however, they tend to be slower and more expensive to train and use.
In this article, I want to propose a little-known alternative that combines speed and quality. A baseline model is needed for comparative assessment and conclusions; I chose the time-tested and popular BERT.
Data
The data is a set of tweets labeled by sentiment: the text lives in the Tweet column and the label in the Sentiment column.
The target has 3 classes: negative, neutral, and positive.
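The loading step is not shown in the article, so here is a minimal sketch; the file name train.csv is hypothetical, and only the Tweet and Sentiment columns are assumed (they are the ones used by the code below).

import pandas as pd

# Hypothetical file name; only the 'Tweet' and 'Sentiment' columns are used later.
df_train = pd.read_csv('train.csv')[['Tweet', 'Sentiment']].dropna()

# Check the balance of the three sentiment classes.
print(df_train['Sentiment'].value_counts(normalize=True))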
BERT
The model is loaded from TensorFlow Hub. TensorFlow Hub is a repository of trained machine learning models that are ready for fine-tuning and deployable anywhere. With just a few lines of code, you can reuse trained models such as BERT and Faster R-CNN.
!pip install tensorflow_hub
!pip install tensorflow_text
small_bert/bert_en_uncased_L-4_H-512_A-8 is one of the smaller BERT models referenced in "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models". The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models; however, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
bert_en_uncased_preprocess is the matching text preprocessing model for BERT. It uses an English vocabulary extracted from Wikipedia and BooksCorpus. Text inputs are normalized the "uncased" way: the text is lower-cased before tokenization into word pieces, and any accent markers are stripped.
tfhub_handle_encoder = \
"https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"
tfhub_handle_preprocess = \
"https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
I will not complicate the model with anything extra: the goal here is not to chase SOTA (State-of-the-Art) results but to show the approach.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 - registers the ops required by the preprocessing model
from tensorflow.keras.optimizers import Adam

def build_classifier_model():
    # Raw strings go straight into the model; preprocessing is part of the graph.
    text_input = tf.keras.layers.Input(
        shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(
        tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(
        tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    # Use the pooled [CLS] representation for classification.
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(
        3, activation='softmax', name='classifier')(net)
    model = tf.keras.Model(text_input, net)
    # The head already applies softmax, so the loss expects probabilities, not logits.
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
    metric = tf.metrics.CategoricalAccuracy('accuracy')
    optimizer = Adam(
        learning_rate=5e-05, epsilon=1e-08, decay=0.01, clipnorm=1.0)
    model.compile(
        optimizer=optimizer, loss=loss, metrics=[metric])
    model.summary()
    return model
I set aside 30% of the data for validation.
from sklearn.model_selection import train_test_split

# Stratified 70/30 split to preserve the class balance.
train, valid = train_test_split(
    df_train,
    train_size=0.7,
    random_state=0,
    stratify=df_train['Sentiment'])

y_train, X_train = \
    train['Sentiment'], train.drop(['Sentiment'], axis=1)
y_valid, X_valid = \
    valid['Sentiment'], valid.drop(['Sentiment'], axis=1)

# One-hot encode the three sentiment classes for the softmax head.
y_train_c = tf.keras.utils.to_categorical(
    y_train.astype('category').cat.codes.values, num_classes=3)
y_valid_c = tf.keras.utils.to_categorical(
    y_valid.astype('category').cat.codes.values, num_classes=3)
Everything is ready to train the model.
classifier_model = build_classifier_model()

history = classifier_model.fit(
    x=X_train['Tweet'].values,
    y=y_train_c,
    validation_data=(X_valid['Tweet'].values, y_valid_c),
    epochs=5)
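The accuracy reported below is presumably measured on the validation split; a short sketch of how it can be obtained from the compiled model:

# Accuracy of the fine-tuned model on the held-out validation tweets.
loss, accuracy = classifier_model.evaluate(X_valid['Tweet'].values, y_valid_c)
print(f'BERT Accuracy: {accuracy}')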
BERT Accuracy: 0.833859920501709
The confusion matrix shows, for each true class, how the model's predictions are distributed, which makes it easy to see how often each class is predicted correctly and which classes the model confuses with each other.
The classification report summarizes per-class precision, recall, and F1-score.
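The plots themselves are not reproduced here, but both reports are easy to generate with scikit-learn. A minimal sketch, assuming classifier_model and the validation split defined above:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predicted class probabilities and hard labels from the BERT model.
y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)
y_pred_bert = np.argmax(y_proba_bert, axis=1)

# True labels encoded with the same categorical codes used for training.
y_true = y_valid.astype('category').cat.codes.values

print(confusion_matrix(y_true, y_pred_bert))
print(classification_report(y_true, y_pred_bert))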
The quality looks solid. Now let's see how a much simpler and faster model compares.
CatBoost
CatBoost is a high-performance, open-source library for gradient boosting on decision trees. Starting with release 0.19.1, it supports text features for classification out of the box.
I do not expect CatBoost to beat a carefully fine-tuned transformer on quality; its main advantage is speed. In my case, training CatBoost on the same data was on the order of 20-40 times faster than fine-tuning BERT, and the resulting CatBoost model is much cheaper to use. The question is how much quality has to be sacrificed for that speed.
!pip install catboost
No special text preprocessing is needed: CatBoost handles tokenization and dictionary building itself. Let's start with a helper function for training the model; the overfitting detector (od_type/od_wait) stops training when accuracy on the evaluation set stops improving.
from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',  # overfitting detector counts iterations without improvement
        od_wait=500,     # stop if accuracy has not improved for 500 iterations
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=True,
        use_best_model=True)
When working with text, the data is passed to CatBoost via a Pool. Pool is CatBoost's internal data container that holds the feature matrix, the labels, and metadata such as which columns are categorical or text features.
text_features is a one-dimensional array of text column indices (specified as integers) or names (specified as strings). It is used only when the data argument is a two-dimensional feature matrix (one of: list, numpy.ndarray, pandas.DataFrame, pandas.Series). If any elements of this array are specified as names instead of indices, names must be provided for all columns: either pass them explicitly via the feature_names parameter, or pass a pandas.DataFrame with column names specified in the data parameter.
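The snippets above do not show how train_pool and valid_pool are built. A minimal sketch, assuming the same train/validation split used for BERT and the Tweet column as the only feature, declared as a text feature:

from catboost import Pool

# Wrap the raw tweets and labels; declaring 'Tweet' as a text feature lets
# CatBoost tokenize it and build dictionaries on its own.
train_pool = Pool(
    data=X_train[['Tweet']],
    label=y_train,
    text_features=['Tweet'])

valid_pool = Pool(
    data=X_valid[['Tweet']],
    label=y_valid,
    text_features=['Tweet'])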
The most useful parameters for text processing are:
tokenizers: tokenizers used to preprocess the text features before the dictionaries are built.
dictionaries: dictionaries used to preprocess the text features (which tokens to keep and how large the vocabulary may grow).
feature_calcers: feature calcers that compute new numeric features (for example, Bag-of-Words) from the preprocessed text.
I did not spend much time tuning these parameters; the values below were chosen mostly intuitively.
model = fit_model(
train_pool, valid_pool,
learning_rate=0.35,
tokenizers=[
{
'tokenizer_id': 'Sense',
'separator_type': 'BySense',
'lowercasing': 'True',
'token_types':['Word', 'Number', 'SentenceBreak'],
'sub_tokens_policy':'SeveralTokens'
}
],
dictionaries = [
{
'dictionary_id': 'Word',
'max_dictionary_size': '50000'
}
],
feature_calcers = [
'BoW:top_tokens_count=10000'
]
)
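The accuracy reported below is measured on the validation pool; a short sketch of how it can be computed (accuracy_score and the y_valid series from the earlier split are assumed):

from sklearn.metrics import accuracy_score

# Class predictions on the validation pool; reshape in case CatBoost
# returns them as a column vector in the multiclass setting.
y_pred_cb = model.predict(valid_pool).reshape(-1)

print('CatBoost model accuracy:', accuracy_score(y_valid, y_pred_cb))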
CatBoost model accuracy: 0.8299104791995787
The accuracy is only slightly below BERT's. Can we get a bit more out of the two models together? Both output class probabilities, so the simplest option is to blend them by averaging the predictions.
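Neither probability matrix is defined in the snippets above, so here is a minimal sketch of how they can be obtained. It assumes classifier_model and model from the previous sections; note that CatBoost orders predict_proba columns by sorted class label, which matches the pandas categorical codes used for the BERT targets.

import numpy as np

# Class probabilities for the three classes on the validation set.
y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)  # shape (n, 3)
y_proba_cb = model.predict_proba(valid_pool)                      # shape (n, 3)

# True labels encoded with the same categorical codes as for BERT training;
# the blended predictions below can be scored against them with
# sklearn.metrics.accuracy_score(y_true, y_proba_avg).
y_true = y_valid.astype('category').cat.codes.values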
y_proba_avg = np.argmax((y_proba_cb + y_proba_bert)/2, axis=1)
The simple average gives a noticeable boost:
Average accuracy: 0.855713533438652
To sum up, in this article we:
trained a baseline model with BERT;
created a model with CatBoost using its built-in text processing capabilities;
looked at what happens if the outputs of both models are averaged.
In my opinion, complex and slow SOTA solutions can be avoided in most cases, especially when speed is a critical requirement.
CatBoost provides excellent sentiment analysis capabilities for text right out of the box. For competition enthusiasts on platforms such as Kaggle and DrivenData, CatBoost can provide a good model both as a baseline solution and as part of an ensemble.
The code from the article can be viewed here.