Text processing and analysis in Python with Spark NLP

Nowadays, hardly any project can do without text analysis and processing, and it just so happens that Python offers a wide range of libraries and frameworks for NLP tasks. The tasks range from the trivial, such as sentiment analysis and named entity recognition (NER), to more interesting ones: chatbots, comparing dialogues in support chats to check whether support or sales staff stick to their text scripts, or post-processing text produced by SpeechToText.

A huge number of tools are available for solving NLP problems. Here is a short list:
  • CoreNLP
  • NLTK
  • TextBlob
  • spaCy
  • Spark NLP

As you might guess, this article will focus on the last one, since it covers almost everything the libraries above can do. There are both free pre-trained models and paid, highly specialized ones, for example, for healthcare.

To run Spark NLP you need Java 8: it is required by Apache Spark, the framework Spark NLP runs on. Experiments on a server or a local machine require at least 16 GB of RAM. It is better to install everything on a Linux distribution (difficulties may arise on macOS); I personally chose an Ubuntu instance on AWS.

apt-get -qy install openjdk-8-jdk

You also need to install Python 3 and the related packages:

apt-get -qy install build-essential python3 python3-pip python3-dev gnupg2
pip install nlu==1.1.3
pip install pyspark==2.4.7
pip install spark-nlp==2.7.4

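To check that everything is wired up correctly, you can run a quick sanity check (a minimal sketch; the printed versions should match the packages pinned above):

import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

print(sparknlp.version())  # should print 2.7.4
print(spark.version)       # should print 2.4.7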

You can also try all of this in Colab. Spark NLP is built around the concept of a pipeline: the text passes through a chain of stages, and each stage consumes the output of the previous ones.

Spark NLP pipeline example

Let's assemble a pipeline that extracts named entities from text (the same can be done in Colab):

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter

# Start (or reuse) a Spark session with Spark NLP registered
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('ner_dl_bert', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

Here is what each stage does:

  1. documentAssembler - wraps the raw text into a Document annotation, the entry point of the pipeline
  2. tokenizer - splits the document into tokens
  3. embeddings - computes a BERT embedding vector for each token
  4. ner_model - assigns a NER label to each token, for example: October 28, 1955 = DATE
  5. ner_converter - merges adjacent tokens with the same label into a single chunk, such as October 28, 1955

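The assembled pipeline can then be fitted and applied to text. Below is a minimal sketch (the sample sentence is my own; since every stage is pretrained, fitting on an empty DataFrame trains nothing and merely produces a PipelineModel):

from sparknlp.base import LightPipeline

# Fit on an empty DataFrame just to obtain a PipelineModel
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)

# LightPipeline runs in-process and is much faster for single documents
light = LightPipeline(pipeline_model)
annotations = light.annotate('Bill Gates was born on October 28, 1955.')
print(annotations['ner_chunk'])  # ['Bill Gates', 'October 28, 1955']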

If you, like me, don't feel like assembling pipelines by hand, take a look at nlu, a high-level wrapper around Spark NLP from the same developers (johnsnowlabs), which does all of the above in a couple of lines:

import nlu

# Any string or list of strings can be passed as input
text = 'Bill Gates was born on October 28, 1955.'

pipeline = nlu.load('ner')
result = pipeline.predict(
    text, output_level='document'
).to_dict(orient='records')

The result is the same set of recognized named entities, just without the boilerplate.

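If I remember the nlu API correctly, several models can also be combined in a single load call by separating their names with spaces (treat the exact references as an assumption):

import nlu

# Load NER and sentiment into one pipeline at once
pipeline = nlu.load('ner sentiment')
print(pipeline.predict('I love Paris in the spring.'))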

I would also note that both ways of obtaining named entities need some time to initialize: Apache Spark has to start, the models have to be preloaded, and pyspark has to establish the connection between the Python interpreter and Spark. So you really don't want to restart a script like the one above 10-100 times; it is better to do the loading once and then simply process texts by calling predict. In my case, I initialize the pipelines I need when the Celery workers start:

import logging
import nlu
from celery.signals import worker_process_init

# A minimal registry that keeps the loaded pipelines by name
class PipelineRegistry:
    def __init__(self):
        self._pipelines = {}

    def register(self, name, pipeline):
        self._pipelines[name] = pipeline

    def get(self, name):
        return self._pipelines[name]

pipeline_registry = PipelineRegistry()

def get_pipeline_registry():
    # Preload every pipeline once per worker process
    pipeline_registry.register('sentiment', nlu.load('en.sentiment'))
    pipeline_registry.register('ner', nlu.load('ner'))
    pipeline_registry.register('stopwords', nlu.load('stopwords'))
    pipeline_registry.register('stemmer', nlu.load('stemm'))
    pipeline_registry.register('emotion', nlu.load('emotion'))
    return pipeline_registry

@worker_process_init.connect
def init_worker(**kwargs):
    logging.info("Initializing pipeline_factory...")
    get_pipeline_registry()

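A Celery task can then fetch the preloaded pipeline instead of loading it on every call. A hypothetical sketch (the app name, broker URL and task are my own illustration):

from celery import Celery

app = Celery('nlp_tasks', broker='redis://localhost:6379/0')

@app.task
def extract_entities(text):
    # Reuse the pipeline preloaded in init_worker instead of reloading it
    pipeline = pipeline_registry.get('ner')
    return pipeline.predict(text, output_level='document').to_dict(orient='records')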

In this way, you can solve NLP tasks with a minimum of effort and without unnecessary headaches.