Everything is possible: solving NLP problems with SpaCy





Natural language processing is now ubiquitous: voice interfaces and chatbots are spreading rapidly, new models for processing large volumes of text appear regularly, and machine translation keeps improving.



In this article, we will look at the relatively new SpaCy library, which is currently one of the most popular and convenient solutions for text processing in Python. Its functionality allows you to solve a very wide range of tasks: from identifying parts of speech and extracting named entities to creating your own models for analysis.



To begin with, let's look at how SpaCy organizes data processing. Text submitted for processing passes sequentially through the pipeline components and is stored as an instance of the Doc object:







Doc is the central data structure in SpaCy: it stores the sequence of words of the text, also known as tokens. Two other object types live inside the Doc object: Token and Span. A Token is a reference to an individual word of a document, and a Span is a reference to a sequence of several words (you can also create Spans yourself):
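For example, here is a minimal sketch (using the English language class purely so the snippet is self-contained) showing a Token obtained by indexing, a Span obtained by slicing, and the same Span created explicitly:

from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("SpaCy makes text processing easy")

token = doc[0]                  # Token: a single word of the document
span = doc[1:4]                 # Span: a slice of several tokens
manual_span = Span(doc, 1, 4)   # the same Span, created explicitly

print(token.text, "|", span.text, "|", manual_span.text)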







Another important data structure is the Vocab object, which stores a set of reference tables common to all documents. This saves memory and provides a single source of information for all processed documents.



Document tokens are connected to the Vocab object through hashes, which can be used to look up the base forms of words and other lexical attributes of tokens:
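A minimal sketch of this mechanism (again using the English language class for illustration): a string can be turned into its hash and back through the shared string store, and each token carries such hashes as attributes:

from spacy.lang.en import English

nlp = English()
doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]   # string -> hash
print(coffee_hash)
print(nlp.vocab.strings[coffee_hash])       # hash -> "coffee"
print(doc[2].orth, doc[2].orth_)            # the token's hash and the text it resolves to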







Now we know how SpaCy stores and processes data. How do we take advantage of what it offers? Let's go through the operations that can be applied to text, one by one.



1. Basic operations



Before you start working with text, you should import the language model. For the Russian language, there is an official model from SpaCy that supports tokenization (splitting text into separate tokens) and a number of other basic operations:



from spacy.lang.ru import Russian
      
      





After importing and instantiating the language model, you can start processing the text. To do this, you just need to pass the text to the created instance:



nlp = Russian()
doc = nlp("Бсгодня ΠΌΡ‹ ΠΈΠ·ΡƒΡ‡Π°Π΅ΠΌ ΠΎΠ±Ρ€Π°Π±ΠΎΡ‚ΠΊΡƒ тСкста, это интСрСсно.")  # any Russian sentence works here
      
      





Working with the resulting Doc object is very similar to working with lists: you can access the desired token by index or make slices from several tokens. And to get the text of a token or slice, you can use the text attribute:



token = doc[0]
print(token.text)

span = doc[3:6]
print(span.text)


Бсгодня
ΠΎΠ±Ρ€Π°Π±ΠΎΡ‚ΠΊΡƒ тСкста,





To learn more about what kind of content a token holds, the following attributes can be used:



  1. is_alpha - checks whether the token consists only of alphabetic characters
  2. is_punct - checks whether the token is a punctuation mark
  3. like_num - checks whether the token resembles a number


print("is_alpha:    ", [token.is_alpha for token in doc])
print("is_punct:    ", [token.is_punct for token in doc])
print("like_num:    ", [token.like_num for token in doc])

      
      





Let's consider another example, in which every token that precedes a period is printed. To get this result, while iterating over the tokens we look at the next token, whose index is obtained from the token.i attribute:



for token in doc:
    if token.i+1 < len(doc):
        next_token = doc[token.i+1]
        if next_token.text == ".":
            print(token.text)


      
      





2. Operations with syntax



For more complex text processing tasks, other models are used. These are specially trained for tasks related to syntax, named entity extraction, and word-meaning similarity. For English, for example, there are three official models of different sizes. For Russian, an official model has not yet been trained, but there is already a third-party ru2 model that can handle syntax.



At the end of this article, we will discuss how to create your own models or additionally train existing ones so that they work better for specific tasks.



To fully illustrate SpaCy's capabilities, we will use English models in this article. Let's install the small en_core_web_sm model, which is great for demonstration purposes. To install it, run the following on the command line:



python -m spacy download en_core_web_sm
      
      





Using this model, for each token we can get its part of speech, its syntactic role in the sentence, and the token it depends on:



import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("New Apple MacBook set to launch tomorrow")

for token in doc:
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    token_head = token.head.text
    print(f"{token_text:<12}{token_pos:<10}"
          f"{token_dep:<10}{token_head:<12}")
      
      





New         PROPN     compound  MacBook     
Apple       PROPN     compound  MacBook     
MacBook     PROPN     nsubj     set         
set         VERB      ROOT      set         
to          PART      aux       launch      
launch      VERB      xcomp     set         
tomorrow    NOUN      npadvmod  launch 
      
      





By far the best way to see the dependencies is not to read text output but to build a syntax tree. The displacy visualizer can help with this: you just need to pass it the document:



from spacy import displacy
displacy.render(doc, style='dep', jupyter=True)
      
      





Running this code produces a tree on which all the syntactic information about the sentence is laid out:







To decode the tag names, you can use the spacy.explain function:



print(spacy.explain("aux"))
print(spacy.explain("PROPN"))

auxiliary
proper noun
      
      





The expansions printed here tell us that aux stands for an auxiliary and PROPN for a proper noun.



SpaCy can also determine the base form (lemma) of any token (for pronouns, -PRON- is used):



import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I saw a movie yesterday")
print(' '.join([token.lemma_ for token in doc]))

'-PRON- see a movie yesterday'
      
      





3. Highlighting named entities



Often, when working with text, you need to extract the entities it mentions. The doc.ents attribute lists the named entities found in the document, and the ent.label_ attribute gives the label of each entity:



import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for 1$ billion")
for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
U.K. GPE
1$ billion MONEY
      
      





Here, too, you can use spacy.explain to decode the named entity labels:



print(spacy.explain("GPE"))

      
      





Countries, cities, states

And displacy will help you visualize the found entities right in the text:



from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)







4. Creating your own patterns for text search



The SpaCy module contains a very useful tool, the Matcher, that allows you to build your own token-based search patterns. In particular, you can search for words of a certain part of speech, for all forms of a word by its base form, or check the type of content in a token. The main pattern attributes include the exact text of the token (ORTH, TEXT), its lowercase form (LOWER), its lemma (LEMMA), its part of speech (POS), boolean flags such as IS_ALPHA, IS_PUNCT and LIKE_NUM, and the OP key, which controls how many times the token may occur:







Let's try to create our own pattern for recognizing a sequence of tokens. Suppose we want to extract from text lines about the FIFA or ICC Cricket World Cup that mention the year:



import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [
    {"IS_DIGIT": True}, 
    {"LOWER": {"REGEX": "(fifa|icc)"}},
    {"LOWER": "cricket", "OP": "?"},
    {"LOWER": "world"},
    {"LOWER": "cup"}
]
matcher.add("fifa_pattern", None, pattern)
doc = nlp("2018 ICC Cricket World Cup: Afghanistan won!")
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span)
      
      





2018 ICC Cricket World Cup
      
      





So, in this block of code we imported the special Matcher object that stores a set of custom patterns. After initializing it, we created a pattern describing the sequence of tokens. Note that we used a regular expression to choose between ICC and FIFA, and for the Cricket token we added the OP key, marking that token as optional.



After creating a pattern, you need to add it to the matcher with the add function, specifying a unique pattern ID in its parameters. The search results are returned as a list of tuples, each consisting of the match ID and the start and end indices of the matched slice of the document.
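Incidentally, the match ID is itself a hash stored in the shared vocabulary, so it can be resolved back to the pattern name. A minimal sketch, continuing the example above:

for match_id, start, end in matches:
    # look up the pattern name ("fifa_pattern") behind the match ID hash
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)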



5. Determining semantic proximity



Two words can be very similar in meaning, but how do we measure their closeness? Semantic vectors can come to the rescue here. If two words or multi-word expressions are similar in meaning, their vectors will lie close to each other.



Calculating the semantic proximity of vectors in SpaCy is not difficult if the language model has been trained to solve such problems. The result is highly dependent on the size of the model, so let's take a larger model for this task:



import spacy
      
      





nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like burgers")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))
      
      





0.9244169833828932
      
      





The value can range from zero to one: the closer to one, the greater the similarity. In the example above we compared two documents, but you can compare individual tokens and spans in the same way.
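For instance, here is a minimal sketch (assuming the en_core_web_md model loaded above) that compares individual tokens and a span:

doc = nlp("I like burgers and pizza")

# similarity between two tokens
print(doc[2].similarity(doc[4]))      # "burgers" vs "pizza"

# similarity between a span and a token
print(doc[1:3].similarity(doc[4]))    # "like burgers" vs "pizza"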



Semantic proximity assessment can be useful for solving many problems. For example, you can use it to set up a recommendation system so that it offers the user similar texts based on the ones already read.



It is important to remember that semantic proximity is quite subjective and always depends on the context of the task. For example, the phrases β€œI love dogs” and β€œI hate dogs” are similar in that both express an opinion about dogs, yet they differ greatly in sentiment. In some cases, you will have to additionally train the language models so that the results fit the context of your problem.



6. Creating your own processing components



SpaCy ships with a number of built-in components (the tokenizer, named entity recognition and so on), but it also lets you define your own. Components are simply functions that are called in sequence: each takes a document as input, modifies it and returns it. New components can be added with the add_pipe method:



import spacy

def length_component(doc):
    doc_length = len(doc)
    print(f"This document is {doc_length} tokens long.")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(length_component, first=True)
print(nlp.pipe_names)
doc = nlp("This is a sentence.")
      
      





['length_component', 'tagger', 'parser', 'ner']
This document is 5 tokens long.
      
      





In the example above, we created and added our own function that prints the number of tokens in the processed document. Using the nlp.pipe_names attribute, we obtained the order in which the components are executed: as we can see, the created component is first in the list. You can control where a new component is added with the parameters of add_pipe: first=True or last=True place it at the beginning or the end of the pipeline, while before="name" and after="name" place it relative to an existing component.
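Here is a minimal sketch of these placement options; custom_component is a trivial illustrative function, and each call is shown separately because a component name may occur in the pipeline only once:

import spacy

def custom_component(doc):
    # a pass-through component used only to illustrate placement
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(custom_component, first=True)        # run before all other components
# nlp.add_pipe(custom_component, last=True)       # run after all other components
# nlp.add_pipe(custom_component, before="ner")    # run just before the "ner" component
# nlp.add_pipe(custom_component, after="tagger")  # run just after the "tagger" component
print(nlp.pipe_names)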







The ability to add custom components is a very powerful tool for tailoring processing to your needs.



7. Training and updating models



Statistical models make predictions based on the examples they were trained on. As a rule, the accuracy of such models can be improved by additionally training them on examples specific to your task. Additional training of existing models can be very useful (for example, for named entity recognition or parsing).



Additional training examples can be passed directly through the SpaCy interface. Each example should consist of the text and a list of annotations for that text on which the model will be trained.



As an illustration, consider updating a model that extracts named entities. To update such a model, you need to pass it many examples containing text, the positions of the entities and their classes. The examples must use whole sentences, since the model relies heavily on sentence context when extracting entities. It is also very important to include examples without entities, so that the model learns to recognize tokens that are not entities.



For instance:



("What to expect at Apple's 10 November event", {"entities": [(18,23,"COMPANY")]})
("Is that apple pie I smell?", {"entities": []})
      
      





In the first example, a company is mentioned: for training, we mark the positions where its name begins and ends and attach the label saying that this entity is a company. In the second example the word refers to a fruit, so there are no entities.



The training data is usually annotated by humans, but this work can be partially automated using SpaCy's own search patterns or specialized annotation tools (for example, Prodigy).



After the examples have been prepared, you can proceed to training the model itself. For training to be effective, it has to be run over several iterations, and on each iteration the model adjusts the weights of its parameters. SpaCy models are trained with stochastic gradient descent, so it is a good idea to shuffle the examples before each iteration and to pass them in small batches. This improves the reliability of the gradient estimates.







import spacy
import random
from spacy.lang.en import English

TRAINING_DATA = [
    ("What to expect at Apple's 10 November event",
     {"entities": [(18, 23, "COMPANY")]}),
    #  ...
]

nlp = English()
# Note: for the update to have an effect, the pipeline must contain
# the component being trained (an entity recognizer is added in the next example).

for i in range(10):
    random.shuffle(TRAINING_DATA)
    for batch in spacy.util.minibatch(TRAINING_DATA):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        nlp.update(texts, annotations)

nlp.to_disk("model")
      
      





In the example above, training runs for 10 iterations. Once training is finished, the model is saved to disk in the model folder.
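Such a saved model can later be loaded back from that folder in the usual way, for example:

import spacy

nlp = spacy.load("model")   # load the model previously saved with nlp.to_disk("model")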



When you need not just to update an existing model but to create a new one from scratch, a few preparatory steps are required before training can start.



Consider the process of creating a new model to highlight named entities:



nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("COMPANY")
nlp.begin_training()
      
      





First, we create an empty model with the spacy.blank("en") function; it contains only the language data and tokenization rules. Then we add the ner component responsible for named entity recognition and, via the add_label method, register the entity labels. Finally, nlp.begin_training() initializes the model with randomly distributed weights. After that, it is enough to train the model as shown in the previous example.


