Text processing and analysis in Python with Spark NLP

Nowadays, hardly any project can do without text processing and analysis, and it just so happens that Python has a wide range of libraries and frameworks for NLP tasks. The tasks range from the routine, such as sentiment and emotion analysis or named entity recognition (NER), to the more interesting: bots, comparing dialogues in support chats to check whether your tech support or sales staff follow their text scripts, or post-processing text after speech-to-text.

A huge number of tools are available for solving NLP problems. Here is a short list:

  • CoreNLP

  • NLTK

  • TextBlob

  • spaCy

  • Spark NLP

As you may have guessed, this article focuses on the last one, since it covers almost everything the other libraries can do. It offers both free pretrained models and paid, highly specialized ones, for example for healthcare.

To run Spark NLP you need Java 8, which is required by Apache Spark, the framework Spark NLP runs on. Experiments on a server or local machine require a minimum of 16 GB of RAM. It is better to install everything on a Linux distribution (difficulties may arise on macOS). Personally, I chose an Ubuntu instance on AWS.

apt-get -qy install openjdk-8-jdk
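
Since everything here depends on Java 8, it can be worth checking which Java the machine actually picks up. Below is a minimal sketch, assuming `java` is on the PATH. The helper names `parse_java_major` and `installed_java_major` are made up for this illustration and are not part of Spark NLP.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the output of `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    first = int(match.group(1))
    # Java 8 and older report themselves as 1.x, for example "1.8.0_292".
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> int:
    """Run `java -version` and parse what it prints (it writes to stderr)."""
    proc = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(proc.stderr or proc.stdout)
```

On a machine prepared as described here, `installed_java_major()` would be expected to return 8. Any other value suggests that a different JDK comes first on the PATH.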

You also need to install Python 3 and the related libraries:

apt-get -qy install build-essential python3 python3-pip python3-dev gnupg2

pip install nlu==1.1.3

pip install pyspark==2.4.7

pip install spark-nlp==2.7.4
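
Because the three packages are pinned to specific versions, a quick check that the environment really matches the pins can save some debugging later. The sketch below is only an illustration. `PINNED` and `find_mismatches` are hypothetical names, and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata

# Versions pinned by the pip commands in this article.
PINNED = {"nlu": "1.1.3", "pyspark": "2.4.7", "spark-nlp": "2.7.4"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every package that is
    missing or installed in a version other than the pinned one."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = installed
    return mismatches
```

Calling `find_mismatches(PINNED)` returns an empty dict when all three pins are satisfied. The `get_version` parameter exists only so the function can be tried out without the packages installed.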

Spark NLP pipeline example

Named entities can be obtained in two ways. The first assembles the pipeline from Spark NLP annotators:

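import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter

# Start a Spark session with Spark NLP available
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('ner_dl_bert', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

# Merges token-level tags into entity chunks; the output column name is an assumption
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[
    documentAssembler, tokenizer, embeddings, ner_model, ner_converter
])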

import nlu

# The same NER task through the higher-level NLU wrapper.
# `text` is the string to analyze.
pipeline = nlu.load('ner')
result = pipeline.predict(
    text, output_level='document'
)
Note that both ways of obtaining named entities need time to initialize Apache Spark, preload the models, and establish a connection between the Python interpreter and Spark via pyspark. You therefore do not want to rerun the script above 10-100 times. Instead, preload the pipelines once and then process text simply by calling predict. In my case, I initialize the pipelines I need when the Celery workers start.

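import nlu
from celery.signals import worker_process_init

# PipelineRegistry is a project-specific helper: any object with a
# register(name, pipeline) method fits here.
pipeline_registry = PipelineRegistry()

def get_pipeline_registry():
    # Load each pipeline once so that later calls only pay for predict().
    pipeline_registry.register('sentiment', nlu.load('en.sentiment'))
    pipeline_registry.register('ner', nlu.load('ner'))
    pipeline_registry.register('stopwords', nlu.load('stopwords'))
    pipeline_registry.register('stemmer', nlu.load('stemm'))
    pipeline_registry.register('emotion', nlu.load('emotion'))
    return pipeline_registry

# Assumption: the hook is connected to Celery's worker_process_init signal,
# which fires when each worker process starts.
@worker_process_init.connect
def init_worker(**kwargs):
    print("Initializing pipeline_factory...")
    get_pipeline_registry()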
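
The `PipelineRegistry` class itself is not shown in the listing. One possible shape for it, purely as an assumption, is a thin wrapper around a dict. Only `register(name, pipeline)` appears in the code above. The `get` and `predict` methods and the `EchoPipeline` stand-in are hypothetical additions so that the sketch runs without Spark.

```python
class PipelineRegistry:
    """Keeps already-loaded pipelines in memory, keyed by name."""

    def __init__(self):
        self._pipelines = {}

    def register(self, name, pipeline):
        """Store a loaded pipeline under a name."""
        self._pipelines[name] = pipeline

    def get(self, name):
        """Return a registered pipeline, or raise KeyError with a clear message."""
        try:
            return self._pipelines[name]
        except KeyError:
            raise KeyError(f"pipeline {name!r} is not registered") from None

    def predict(self, name, text, **kwargs):
        """Run text through the predict() method of the named pipeline."""
        return self.get(name).predict(text, **kwargs)


class EchoPipeline:
    """Stand-in for a loaded NLU pipeline; it only echoes its input."""

    def predict(self, text, **kwargs):
        return {"text": text, "options": kwargs}


registry = PipelineRegistry()
registry.register("echo", EchoPipeline())
result = registry.predict("echo", "Hello from Berlin", output_level="document")
```

With real pipelines registered at worker start-up, a task would only call something like `registry.predict('ner', text)` and would never pay the Spark and model loading cost again.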

In this way, you can perform NLP tasks without headaches and with a minimum of effort.

