Unconventional Sentiment Analysis: BERT vs CatBoost

Introduction

Sentiment analysis is a natural language processing (NLP) technique used to determine whether a piece of text is positive, negative, or neutral.





Sentiment analysis is fundamental to understanding the emotional nuances of language. This, in turn, helps to automatically sort the opinions behind reviews, social media discussions, comments, and more.





Although sentiment analysis has become extremely popular in recent years, research on it dates back to the early 2000s. Traditional machine learning techniques such as Naive Bayes, Logistic Regression, and Support Vector Machines (SVMs) are widely used for large volumes of data because they scale well. In practice, deep learning (DL) methods have been proven to provide the best accuracy for a variety of NLP tasks, including sentiment analysis; however, they tend to be slower and more expensive to train and use.





by Giacomo Veneri

In this article, I want to propose a little-known alternative that combines speed and quality. For comparative assessments and conclusions, a baseline model is needed; I chose the time-tested and popular BERT.





Data

The data is a set of tweets, each manually annotated with its sentiment. A record consists of the tweet text and a sentiment label.

The sentiment labels were reduced to 3 classes: negative, neutral, and positive.
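
A quick look at the class balance might look like this (a sketch: the file name here is an assumption, but the 'Tweet' and 'Sentiment' columns match the code used below):

import pandas as pd

# Illustrative loading step; adjust the path to your copy of the data
df_train = pd.read_csv('train.csv')          # columns: 'Tweet', 'Sentiment'
print(df_train['Sentiment'].value_counts())  # expect three classes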





BERT

I take the model from TensorFlow Hub. TensorFlow Hub is a repository of trained machine learning models, ready for fine-tuning and deployable anywhere. With just a few lines of code, you can reuse trained models such as BERT and Faster R-CNN.





!pip install tensorflow_hub
!pip install tensorflow_text
      
      



small_bert/bert_en_uncased_L-4_H-512_A-8 is one of the Small BERT models described in the paper "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models". Small BERTs have the same general architecture as the original BERT but fewer and/or smaller Transformer blocks, which makes them a good fit when speed matters more than squeezing out the last fraction of accuracy.





bert_en_uncased_preprocess is the matching text preprocessing model for this encoder. It uses an English vocabulary extracted from Wikipedia and BooksCorpus. The text is normalized the "uncased" way: it is lower-cased before tokenization into word pieces, and any accent markers are stripped.





tfhub_handle_encoder = \
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"
tfhub_handle_preprocess = \
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
      
      



I will not use a full-size model here; a compact one is enough for a baseline, since the goal is a fair comparison rather than chasing SOTA (state-of-the-art) results.





import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401, required by the preprocessing model
from tensorflow.keras.optimizers import Adam

def build_classifier_model():
    # Raw text goes in, class probabilities come out
    text_input = tf.keras.layers.Input(
        shape=(), dtype=tf.string, name='text')

    preprocessing_layer = hub.KerasLayer(
        tfhub_handle_preprocess, name='preprocessing')

    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(
        tfhub_handle_encoder, trainable=True, name='BERT_encoder')

    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(
        3, activation='softmax', name='classifier')(net)
    model = tf.keras.Model(text_input, net)

    # The classifier ends in softmax, so the loss takes probabilities,
    # not logits
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
    metric = tf.metrics.CategoricalAccuracy('accuracy')
    optimizer = Adam(
        learning_rate=5e-05, epsilon=1e-08, decay=0.01, clipnorm=1.0)
    model.compile(
        optimizer=optimizer, loss=loss, metrics=metric)
    model.summary()
    return model
      
      



I hold out 30% of the data for validation.





from sklearn.model_selection import train_test_split

# Stratified 70/30 split to preserve the class balance
train, valid = train_test_split(
    df_train,
    train_size=0.7,
    random_state=0,
    stratify=df_train['Sentiment'])

y_train, X_train = \
    train['Sentiment'], train.drop(['Sentiment'], axis=1)
y_valid, X_valid = \
    valid['Sentiment'], valid.drop(['Sentiment'], axis=1)

# One-hot encode the labels for the categorical cross-entropy loss
y_train_c = tf.keras.utils.to_categorical(
    y_train.astype('category').cat.codes.values, num_classes=3)
y_valid_c = tf.keras.utils.to_categorical(
    y_valid.astype('category').cat.codes.values, num_classes=3)
      
      



Now we can build and train the model.





classifier_model = build_classifier_model()

history = classifier_model.fit(
    x=X_train['Tweet'].values,
    y=y_train_c,
    validation_data=(X_valid['Tweet'].values, y_valid_c),
    epochs=5)
      
      



BERT Accuracy: 0.833859920501709
      
      



A confusion matrix is a table that shows how the model's predictions line up with the true labels: each row corresponds to an actual class and each column to a predicted class (or vice versa). It makes it easy to see which classes the model confuses with each other.





A classification report summarizes the main quality metrics per class: precision, recall, and F1-score.





The numbers look good for a baseline.
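
Here is a minimal sketch of how both reports can be produced with scikit-learn (assuming the validation split and the trained classifier_model from above):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Class probabilities from the fine-tuned BERT model on the validation set
y_proba_bert = classifier_model.predict(X_valid['Tweet'].values)
y_pred_bert = np.argmax(y_proba_bert, axis=1)

# Integer-coded true labels, matching the encoding used for training
y_true = y_valid.astype('category').cat.codes.values

print(confusion_matrix(y_true, y_pred_bert))
print(classification_report(y_true, y_pred_bert))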





CatBoost

CatBoost is a high-performance, open-source library for gradient boosting on decision trees. Starting from version 0.19.1, it supports text features for classification out of the box.





The main advantage is that CatBoost handles text without any manual preprocessing. Just as important is speed: in this setup, CatBoost trains roughly 20–40 times faster than a BERT-like model while staying competitive in quality.





!pip install catboost
      
      



Let's wrap training in a helper function; it trains on GPU with overfitting detection and returns the model fitted on the best iteration.





from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        task_type='GPU',
        iterations=5000,
        eval_metric='Accuracy',
        od_type='Iter',  # overfitting detector: stop after od_wait
        od_wait=500,     # iterations without improvement
        **kwargs
    )
    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
        plot=True,
        use_best_model=True)
      
      



CatBoost takes its data wrapped in a Pool. A Pool is a container that holds the features and labels together with metadata, such as which columns should be treated as text features.





text_features is a one-dimensional array of text column indices (specified as integers) or names (specified as strings). Names can be used only if the dataset is given with column names (possible types: list, numpy.ndarray, pandas.DataFrame, pandas.Series). If any element of this array is a name rather than an index, names must be provided for all columns: either pass them via the feature_names parameter, or use a pandas.DataFrame whose columns are named.
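
Creating the pools is then straightforward (a sketch, reusing the train/validation split from the BERT section):

from catboost import Pool

# Mark the tweet column as a text feature
train_pool = Pool(data=X_train, label=y_train, text_features=['Tweet'])
valid_pool = Pool(data=X_valid, label=y_valid, text_features=['Tweet'])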







Text processing in CatBoost is controlled by three groups of parameters:





  • tokenizers define how the text columns are split into tokens before the dictionary is built.





  • dictionaries define the dictionaries built from those tokens and used to preprocess the text features.





  • feature_calcers define the calculators that turn the preprocessed text into new numeric features (for example, Bag-of-Words).





I picked these values by hand; tuning them further is left as an exercise.





model = fit_model(
    train_pool, valid_pool,
    learning_rate=0.35,
    tokenizers=[
        {
            'tokenizer_id': 'Sense',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types':['Word', 'Number', 'SentenceBreak'],
            'sub_tokens_policy':'SeveralTokens'
        }      
    ],
    dictionaries = [
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '50000'
        }
    ],
    feature_calcers = [
        'BoW:top_tokens_count=10000'
    ]
)
      
      



(Training plots: accuracy and loss)
CatBoost model accuracy: 0.8299104791995787
      
      



Both models end up with comparable accuracy. But what happens if we combine them? A simple option is to average the predicted class probabilities of the two models: blending works best when the models make different kinds of errors, which is exactly what we would expect from two such different architectures.
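
For that we need per-class probabilities from each model (a sketch; y_proba_bert comes from the BERT evaluation step above):

# CatBoost class probabilities on the validation set
y_proba_cb = model.predict_proba(valid_pool)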





y_proba_avg = np.argmax((y_proba_cb + y_proba_bert)/2, axis=1)
      
      



The averaged predictions beat both individual models.
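
A one-line check (a sketch; y_true comes from the BERT evaluation step above):

from sklearn.metrics import accuracy_score

print('Average accuracy:', accuracy_score(y_true, y_proba_avg))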





Average accuracy: 0.855713533438652
      
      



To summarize, in this article we:





  • trained a baseline model by fine-tuning BERT;





  • created a model with CatBoost using its built-in text processing capabilities;





  • looked at what happens if the outputs of both models are averaged.





In my opinion, complex and slow SOTA solutions can be avoided in most cases, especially when speed is a critical requirement.





CatBoost provides excellent sentiment analysis capabilities right out of the box. For competition enthusiasts on platforms such as Kaggle and DrivenData, CatBoost can deliver a good model both as a baseline solution and as part of an ensemble.





The code from the article can be viewed here.







