Text processing and analysis in Python with Spark NLP

Nowadays, hardly any project can do without text analysis and processing, and it just so happens that Python offers a wide range of libraries and frameworks for NLP tasks. The tasks range from the trivial, such as sentiment analysis and named entity recognition (NER), to more interesting ones: chatbots, comparing dialogues in support chats to check whether support or sales staff stick to their text scripts, or post-processing text produced by SpeechToText.

A huge number of tools are available for solving NLP problems. Here is a short list:
  • CoreNLP
  • NLTK
  • TextBlob
  • spaCy
  • Spark NLP

As you might guess, this article will focus on the last one, since it covers almost everything the libraries above can do. There are both free pre-trained models and paid, highly specialized ones, for example, for healthcare.

To run Spark NLP you need Java 8: it is required by Apache Spark, the framework Spark NLP runs on. Experiments on a server or a local machine require at least 16 GB of RAM. It is better to install everything on a Linux distribution (difficulties may arise on macOS); I personally chose an Ubuntu instance on AWS.

apt-get -qy install openjdk-8-jdk

You also need to install Python 3 and the related packages:

apt-get -qy install build-essential python3 python3-pip python3-dev gnupg2
pip install nlu==1.1.3
pip install pyspark==2.4.7
pip install spark-nlp==2.7.4

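To check that everything is wired up correctly, you can run a quick sanity check (a minimal sketch; the printed versions should match the packages pinned above):

import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

print(sparknlp.version())  # should print 2.7.4
print(spark.version)       # should print 2.4.7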

You can also try all of this in Colab. Spark NLP is built around the concept of a pipeline: the text passes through a chain of stages, and each stage consumes the output of the previous ones.

Spark NLP pipeline example

Let's assemble a pipeline that extracts named entities from text (the same can be done in Colab):

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter

# Start (or reuse) a Spark session with Spark NLP registered
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('ner_dl_bert', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

Here is what each stage does:

  1. documentAssembler - wraps the raw text into a Document annotation, the entry point of the pipeline
  2. tokenizer - splits the document into tokens
  3. embeddings - computes a BERT embedding vector for each token
  4. ner_model - assigns a NER label to each token, for example: October 28, 1955 = DATE
  5. ner_converter - merges adjacent tokens with the same label into a single chunk, such as October 28, 1955

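The assembled pipeline can then be fitted and applied to text. Below is a minimal sketch (the sample sentence is my own; since every stage is pretrained, fitting on an empty DataFrame trains nothing and merely produces a PipelineModel):

from sparknlp.base import LightPipeline

# Fit on an empty DataFrame just to obtain a PipelineModel
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)

# LightPipeline runs in-process and is much faster for single documents
light = LightPipeline(pipeline_model)
annotations = light.annotate('Bill Gates was born on October 28, 1955.')
print(annotations['ner_chunk'])  # ['Bill Gates', 'October 28, 1955']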

If you, like me, don't feel like assembling pipelines by hand, take a look at nlu, a high-level wrapper around Spark NLP from the same developers (johnsnowlabs), which does all of the above in a couple of lines:

import nlu

# Any string or list of strings can be passed as input
text = 'Bill Gates was born on October 28, 1955.'

pipeline = nlu.load('ner')
result = pipeline.predict(
    text, output_level='document'
).to_dict(orient='records')

The result is the same set of recognized named entities, just without the boilerplate.

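If I remember the nlu API correctly, several models can also be combined in a single load call by separating their names with spaces (treat the exact references as an assumption):

import nlu

# Load NER and sentiment into one pipeline at once
pipeline = nlu.load('ner sentiment')
print(pipeline.predict('I love Paris in the spring.'))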

I would also note that both ways of obtaining named entities need some time to initialize: Apache Spark has to start, the models have to be preloaded, and pyspark has to establish the connection between the Python interpreter and Spark. So you really don't want to restart a script like the one above 10-100 times; it is better to do the loading once and then simply process texts by calling predict. In my case, I initialize the pipelines I need when the Celery workers start:

import logging
import nlu
from celery.signals import worker_process_init

# A minimal registry that keeps the loaded pipelines by name
class PipelineRegistry:
    def __init__(self):
        self._pipelines = {}

    def register(self, name, pipeline):
        self._pipelines[name] = pipeline

    def get(self, name):
        return self._pipelines[name]

pipeline_registry = PipelineRegistry()

def get_pipeline_registry():
    # Preload every pipeline once per worker process
    pipeline_registry.register('sentiment', nlu.load('en.sentiment'))
    pipeline_registry.register('ner', nlu.load('ner'))
    pipeline_registry.register('stopwords', nlu.load('stopwords'))
    pipeline_registry.register('stemmer', nlu.load('stemm'))
    pipeline_registry.register('emotion', nlu.load('emotion'))
    return pipeline_registry

@worker_process_init.connect
def init_worker(**kwargs):
    logging.info("Initializing pipeline_factory...")
    get_pipeline_registry()

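A Celery task can then fetch the preloaded pipeline instead of loading it on every call. A hypothetical sketch (the app name, broker URL and task are my own illustration):

from celery import Celery

app = Celery('nlp_tasks', broker='redis://localhost:6379/0')

@app.task
def extract_entities(text):
    # Reuse the pipeline preloaded in init_worker instead of reloading it
    pipeline = pipeline_registry.get('ner')
    return pipeline.predict(text, output_level='document').to_dict(orient='records')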

In this way, you can solve NLP tasks with a minimum of effort and without unnecessary headaches.