Today we are going to write our own full-text search (FTS) engine. By the end of this article, it will be able to search millions of documents in less than a millisecond. We'll start with simple search queries such as "return all documents that contain the word cat" and then extend the engine to support more complex logical queries.
Note: The best-known full-text search engine is Lucene (along with Elasticsearch and Solr, which are built on top of it).
Why do you need FTS
Before you write any code, you might ask: "Couldn't you just use grep or a loop that checks each document for the search word?" Yes, you can. But that's not always the best idea.
The corpus
We will search the abstracts from English-language Wikipedia. The latest dump is available at dumps.wikimedia.org. As of today, the unpacked file size is 913 MB. The XML file contains more than 600,000 documents.
Sample document:
<title>Wikipedia: Kit-Cat Klock</title>
<url>https://en.wikipedia.org/wiki/Kit-Cat_Klock</url>
<abstract>The Kit-Cat Klock is an art deco novelty wall clock shaped like a grinning cat with cartoon eyes that swivel in time with its pendulum tail.</abstract>
Loading documents
First, we need to load all the documents from the dump. The built-in encoding/xml package is very handy here:
import (
    "encoding/xml"
    "os"
)

// document represents a single Wikipedia abstract from the dump.
type document struct {
    Title string `xml:"title"`
    URL   string `xml:"url"`
    Text  string `xml:"abstract"`
    ID    int
}

// loadDocuments parses the dump and assigns each document a sequential ID.
func loadDocuments(path string) ([]document, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    dec := xml.NewDecoder(f)
    dump := struct {
        Documents []document `xml:"doc"`
    }{}
    if err := dec.Decode(&dump); err != nil {
        return nil, err
    }

    docs := dump.Documents
    for i := range docs {
        docs[i].ID = i
    }
    return docs, nil
}
Each document is assigned a unique ID. For simplicity, the first loaded document is assigned ID = 0, the second ID = 1, and so on.
First try
Content search
Now that all the documents are loaded into memory, let's try to find the ones that mention cats. First, we'll go through all the documents and check whether they contain the substring cat:
func search(docs []document, term string) []document {
    var r []document
    for _, doc := range docs {
        if strings.Contains(doc.Text, term) {
            r = append(r, doc)
        }
    }
    return r
}
On my laptop, the search takes 103ms, which isn't bad. But if you spot-check a few documents from the results, you'll notice that the function matches caterpillar and category, but not Cat with a capital C. That's not quite what we're looking for.
There are two things to fix before proceeding:
- Make the search case-insensitive (so that Cat also matches).
- Match on word boundaries rather than substrings (so that words like caterpillar and communication don't match).
Search with regular expressions
One obvious solution that solves both problems is regular expressions. In this case, we need (?i)\bcat\b:
- (?i) makes the regex case-insensitive.
- \b matches a word boundary (a position where there is a word character on one side and not on the other).
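A regex-based version of search could look roughly like the sketch below (the searchRegex name is just illustrative; it uses the standard regexp package, and regexp.QuoteMeta escapes the term in case it contains metacharacters):

func searchRegex(docs []document, term string) []document {
    // (?i) makes the match case-insensitive; \b anchors it to word boundaries.
    re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
    var r []document
    for _, doc := range docs {
        if re.MatchString(doc.Text) {
            r = append(r, doc)
        }
    }
    return r
}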
But now the search takes more than two seconds. As you can see, the engine started to slow down even on a modest corpus of 600,000 documents. While this approach is easy to implement, it doesn't scale well: as the dataset grows, more and more documents need to be scanned. The time complexity of this algorithm is linear, i.e. the number of documents to scan equals the total number of documents. If we had 6 million documents instead of 600,000, the search would take 20 seconds. We'll have to come up with something better.
Inverted index
To speed up search queries, we'll preprocess the text and build an index.
The core of FTS is a data structure called an inverted index . It links each word to documents containing that word.
Example:
documents = {
    1: "a donut on a glass plate",
    2: "only the donut",
    3: "listen to the drum machine",
}

index = {
    "a": [1],
    "donut": [1, 2],
    "on": [1],
    "glass": [1],
    "plate": [1],
    "only": [2],
    "the": [2, 3],
    "listen": [3],
    "to": [3],
    "drum": [3],
    "machine": [3],
}
A real-world example of an inverted index is the index at the back of a book, where each term is followed by the page numbers it appears on.
Text analysis
Before starting to build the index, you need to split the source text into a list of words (tokens) suitable for indexing and searching.
The text analyzer consists of a tokenizer and several filters.
Tokenizer
The tokenizer is the first step in text parsing. Its task is to convert the text into a list of tokens. Our implementation splits the text at word boundaries and removes punctuation marks:
func tokenize(text string) []string {
    return strings.FieldsFunc(text, func(r rune) bool {
        // Split on any character that is not a letter or a number.
        return !unicode.IsLetter(r) && !unicode.IsNumber(r)
    })
}
> tokenize("A donut on a glass plate. Only the donuts.")
["A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"]
Filters
In most cases, simply converting text to a list of tokens is not enough. Additional normalization will be required to facilitate indexing and searching.
Lower case
To make the search case-insensitive, the lowercase filter converts tokens to lowercase. The words cAt, Cat, and caT are normalized to cat. Later, when querying the index, we will also lowercase the search terms, so that the query cAt finds the word Cat.
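The analyze function shown later relies on a lowercaseFilter; a minimal version built on strings.ToLower could look like this:

func lowercaseFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = strings.ToLower(token)
    }
    return r
}

> lowercaseFilter([]string{"A", "donut", "on", "a", "glass", "plate", "Only", "the", "donuts"})
["a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"]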
Removing common words
Almost every English text contains common words such as a, I, the, or be. They are called stop words; since they are present in almost all documents, they should be removed.
There is no "official" list of stop words. Let's drop the top 10 from the OEC list. Feel free to add more:
var stopwords = map[string]struct{}{ // I wish Go had built-in sets.
    "a": {}, "and": {}, "be": {}, "have": {}, "i": {},
    "in": {}, "of": {}, "that": {}, "the": {}, "to": {},
}

func stopwordFilter(tokens []string) []string {
    r := make([]string, 0, len(tokens))
    for _, token := range tokens {
        if _, ok := stopwords[token]; !ok {
            r = append(r, token)
        }
    }
    return r
}
> stopwordFilter([]string{"a", "donut", "on", "a", "glass", "plate", "only", "the", "donuts"})
["donut", "on", "glass", "plate", "only", "donuts"]
Stemming
Because of grammar rules, documents contain different forms of the same word. Stemming reduces them to their base form. For example, fishing, fished, and fisher are all reduced to the base form fish.
Implementing a stemmer is not a trivial task and is beyond the scope of this article, so we'll use an existing module:
import snowballeng "github.com/kljensen/snowball/english"

func stemmerFilter(tokens []string) []string {
    r := make([]string, len(tokens))
    for i, token := range tokens {
        r[i] = snowballeng.Stem(token, false)
    }
    return r
}
> stemmerFilter([]string{"donut", "on", "glass", "plate", "only", "donuts"})
["donut", "on", "glass", "plate", "only", "donut"]
Note: Stemmers don't always work correctly. For example, some may shorten airline to airlin .
Assembling the analyzer
func analyze(text string) []string {
    tokens := tokenize(text)
    tokens = lowercaseFilter(tokens)
    tokens = stopwordFilter(tokens)
    tokens = stemmerFilter(tokens)
    return tokens
}
The tokenizer and filters convert the sentences into a list of tokens:
> analyze("A donut on a glass plate. Only the donuts.")
["donut", "on", "glass", "plate", "only", "donut"]
Tokens are ready for indexing.
Building an index
Let's go back to the inverted index. It maps every word to document IDs. Go's built-in map type works well for storing it. The key will be a token (string), and the value will be a list of document IDs:
type index map[string][]int
In the process of building the index, documents are analyzed and their identifiers are added to the map:
func (idx index) add(docs []document) {
    for _, doc := range docs {
        for _, token := range analyze(doc.Text) {
            ids := idx[token]
            if ids != nil && ids[len(ids)-1] == doc.ID {
                // Don't add same ID twice.
                continue
            }
            idx[token] = append(ids, doc.ID)
        }
    }
}

func main() {
    idx := make(index)
    idx.add([]document{{ID: 1, Text: "A donut on a glass plate. Only the donuts."}})
    idx.add([]document{{ID: 2, Text: "donut is a donut"}})
    fmt.Println(idx)
}
Everything works! Each token in the map refers to the IDs of the documents containing that token:
map[donut:[1 2] glass:[1] is:[2] on:[1] only:[1] plate:[1]]
Querying
For queries to the index, we will apply the same tokenizer and filters that we used for indexing:
func (idx index) search(text string) [][]int {
    var r [][]int
    for _, token := range analyze(text) {
        if ids, ok := idx[token]; ok {
            r = append(r, ids)
        }
    }
    return r
}
> idx.search("Small wild cat")
[[24, 173, 303, ...], [98, 173, 765, ...], [24, 51, 173, ...]]
And now, finally, we can find all the documents that mention cats. Searching 600,000 documents took less than a millisecond (18µs)!
With an inverted index, the time complexity of a search query is linear in the number of search tokens. In the query example above, apart from analyzing the input text, only three map lookups are performed.
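If you want to reproduce this measurement yourself, a standard Go benchmark is one way to do it. A rough sketch (the dump path is only a placeholder; put it in a _test.go file and run go test -bench=Search):

import "testing"

func BenchmarkSearch(b *testing.B) {
    docs, err := loadDocuments("path/to/enwiki-abstract.xml") // placeholder path
    if err != nil {
        b.Fatal(err)
    }
    idx := make(index)
    idx.add(docs)

    b.ResetTimer() // don't count loading and indexing time
    for i := 0; i < b.N; i++ {
        idx.search("Small wild cat")
    }
}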
Logical queries
The previous query returned a separate list of documents for each token. But we usually expect a search for small wild cat to return a list of results that contain small, wild and cat at the same time. The next step is to compute the intersection between the lists. That way we'll get the documents matching all of the tokens.
Fortunately, IDs are inserted into our inverted index in ascending order. Since the IDs are sorted, the intersection of two lists can be computed in linear time. The intersection function iterates over both lists simultaneously and collects the IDs present in both:
func intersection(a []int, b []int) []int {
    maxLen := len(a)
    if len(b) > maxLen {
        maxLen = len(b)
    }
    r := make([]int, 0, maxLen)
    var i, j int
    for i < len(a) && j < len(b) {
        if a[i] < b[j] {
            i++
        } else if a[i] > b[j] {
            j++
        } else {
            r = append(r, a[i])
            i++
            j++
        }
    }
    return r
}
The updated search parses the query text, looks up the tokens in the index, and computes the intersection of the ID lists:
func (idx index) search(text string) []int {
    var r []int
    for _, token := range analyze(text) {
        if ids, ok := idx[token]; ok {
            if r == nil {
                r = ids
            } else {
                r = intersection(r, ids)
            }
        } else {
            // Token doesn't exist.
            return nil
        }
    }
    return r
}
The Wikipedia dump contains only two documents that contain the words small, wild and cat at the same time:
> idx.search("Small wild cat")
130764 The wildcat is a species complex comprising two small wild cat species, the European wildcat (Felis silvestris) and the African wildcat (F. lybica).
131692 Catopuma is a genus containing two Asian small wild cat species, the Asian golden cat (C. temminckii) and the bay cat.
Search works as expected!
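To produce output like the above, the returned IDs can be resolved back to documents, for example inside main (a sketch, assuming docs is the slice returned by loadDocuments):

matched := idx.search("Small wild cat")
for _, id := range matched {
    // IDs were assigned sequentially in loadDocuments, so they double as slice indices.
    doc := docs[id]
    fmt.Println(doc.ID, doc.Text)
}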
By the way, this was the first time I'd heard of catopumas.
Conclusions
So we've built a full-text search engine. Despite its simplicity, it can be a solid foundation for more advanced projects.
I have not mentioned many aspects that can significantly improve performance and make searching more convenient. Here are some ideas for further improvements:
- Add the logical operators OR and NOT.
- Store the index on disk (see the encoding/gob sketch after this list):
  - Rebuilding the index on every application restart takes some time.
  - Large indexes may not fit in memory.
- Experiment with memory- and CPU-efficient data formats for storing sets of document IDs. Take a look at Roaring Bitmaps.
- Index multiple document fields.
- Sort results by relevance.
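As an illustration of the on-disk idea, the index could be serialized with the standard encoding/gob package. A minimal sketch (the save and loadIndex names, and the file path, are only examples):

import (
    "encoding/gob"
    "os"
)

// save writes the index to disk as a gob-encoded file.
func (idx index) save(path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(idx)
}

// loadIndex reads a previously saved index back into memory.
func loadIndex(path string) (index, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    idx := make(index)
    err = gob.NewDecoder(f).Decode(&idx)
    return idx, err
}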
All source code is published on GitHub .