Nowadays, hardly any project gets by without text analysis and processing, and Python happens to have a wide range of libraries and frameworks for NLP tasks. They range from the trivial, such as sentiment and mood analysis or named entity recognition (NER), to more interesting ones: chatbots, comparing dialogues in support chats (say, to check whether your support or sales staff actually follow their text scripts), or post-processing text after SpeechToText.
A huge number of tools are available for solving NLP problems. This article will focus on Spark NLP, since it covers almost everything the other popular NLP libraries can do. There are both free pretrained models and paid, highly specialized ones, for example, for healthcare.
To run Spark NLP you need Java 8: it is required by Apache Spark, on top of which Spark NLP runs. Experiments on a server or a local machine require at least 16 GB of RAM. It is best to install everything on a Linux distribution (difficulties may arise on macOS); I personally went with an Ubuntu instance on AWS.
apt-get -qy install openjdk-8-jdk
You also need to install Python 3 and the related libraries:
apt-get -qy install build-essential python3 python3-pip python3-dev gnupg2
pip install nlu==1.1.3
pip install pyspark==2.4.7
pip install spark-nlp==2.7.4
You can also try everything out in Google Colab. Spark NLP is built around the pipeline concept: a pipeline is a chain of stages, where each stage takes the annotations produced by the previous ones as input and adds its own. Here is an example of a pipeline for named entity recognition:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel, NerConverter
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')
embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
.setInputCols(['document', 'token']) \
.setOutputCol('embeddings')
ner_model = NerDLModel.pretrained('ner_dl_bert', 'en') \
.setInputCols(['document', 'token', 'embeddings']) \
.setOutputCol('ner')
ner_converter = NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('ner_chunk')
nlp_pipeline = Pipeline(stages=[
documentAssembler,
tokenizer,
embeddings,
ner_model,
ner_converter
])
documentAssembler - transforms the raw text into a Document annotation, the entry point of any Spark NLP pipeline;
tokenizer - splits the document into tokens;
embeddings - computes a BERT embedding for each token;
ner_model - assigns a named-entity tag to each token, for example: October 28, 1955 = DATE;
ner_converter - merges consecutive tagged tokens into whole entity chunks, such as "October 28, 1955".
If you do not want to assemble a pipeline by hand, there is a higher-level wrapper over Spark NLP from the same John Snow Labs (johnsnowlabs) team, the NLU library, which achieves the same result in a couple of lines:
import nlu
pipeline = nlu.load('ner')
result = pipeline.predict(
text, output_level='document'
).to_dict(orient='records')
The output is a list of dictionaries with the recognized named entities.
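For illustration, here is how such a list of records could be post-processed. Note that the exact keys in the NLU output vary between versions, so `entities` and `entities_class` below are assumptions for the sketch, not guaranteed column names:

```python
# Hypothetical NER output: one record per entity, with the entity
# text and its predicted label.
records = [
    {'entities': 'October 28, 1955', 'entities_class': 'DATE'},
    {'entities': 'London', 'entities_class': 'LOC'},
    {'entities': 'May 1, 1956', 'entities_class': 'DATE'},
]

def group_entities(rows):
    """Group entity strings by their predicted label."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row['entities_class'], []).append(row['entities'])
    return grouped

print(group_entities(records))
# {'DATE': ['October 28, 1955', 'May 1, 1956'], 'LOC': ['London']}
```

This kind of grouping is handy when, for instance, you only care about dates or locations in the recognized text.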
I would also like to note that both ways of obtaining named entities need some time on startup to initialize Apache Spark, preload the models, and establish the connection between the Python interpreter and Spark via pyspark. So you really do not want to restart a script like the one above tens or hundreds of times; instead, load everything once and then just process text by calling predict. In my case, I initialize the pipelines I need when the Celery workers start:
import logging

import nlu
from celery.signals import worker_process_init

# Preload the pipelines once per worker process
pipeline_registry = PipelineRegistry()

def get_pipeline_registry():
    pipeline_registry.register('sentiment', nlu.load('en.sentiment'))
    pipeline_registry.register('ner', nlu.load('ner'))
    pipeline_registry.register('stopwords', nlu.load('stopwords'))
    pipeline_registry.register('stemmer', nlu.load('stemm'))
    pipeline_registry.register('emotion', nlu.load('emotion'))
    return pipeline_registry

@worker_process_init.connect
def init_worker(**kwargs):
    logging.info("Initializing pipeline_factory...")
    get_pipeline_registry()
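PipelineRegistry here is not part of NLU or Celery; it is a small homegrown container for the loaded pipelines. A minimal sketch of what such a class could look like:

```python
class PipelineRegistry:
    """Keeps preloaded NLU pipelines addressable by name."""

    def __init__(self):
        self._pipelines = {}

    def register(self, name, pipeline):
        self._pipelines[name] = pipeline

    def get(self, name):
        try:
            return self._pipelines[name]
        except KeyError:
            raise KeyError(f"Pipeline '{name}' is not loaded") from None

# Usage: register once at startup, fetch by name in task handlers.
registry = PipelineRegistry()
fake_pipeline = object()  # stand-in for nlu.load('ner') in this sketch
registry.register('ner', fake_pipeline)
assert registry.get('ner') is fake_pipeline
```

A plain dictionary would also work; the class only adds a clearer error message when a task asks for a pipeline that was never loaded.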
This way, you can solve NLP tasks without headaches and with a minimum of effort.