Not a button accordion: looking for duplicate images based on Milvus with the FAISS index inside

In projects built on user-generated content, duplicates are a constant problem. It matters especially to us, because the main content of the iFunny mobile application is images, and tens of thousands of them are posted every day. To make finding repeats easier and save a lot of time, we wrote a separate system.

Below, we first look at the tools used and then move on to an example implementation.

Convolutional neural network (CNN)

There are a huge number of duplicate search algorithms, each with its own pros and cons. One of them is to search for the most similar (closest) vectors obtained from a CNN.

When an image is run through a classification CNN, the output is a vector of "what the network saw in the image." In theory, this method should be less sensitive to cropping, but it will produce more false matches than more exact methods.

There is another drawback as well. The network outputs a large vector (2048 floats for resnet152), which has to be stored somewhere, and for any query vector we need to find its N most similar vectors within a reasonable time, which in itself is not easy.
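To put rough numbers on this, here is a minimal sketch in plain Python. It assumes 4-byte floats (float32) and counts only the raw vectors, without ids or index overhead. The function names are made up for the example. The brute-force search shows the baseline that an index is meant to avoid: every query touches every stored vector.

```python
import math

DIM = 2048           # resnet152 vector size, as stated in the text
BYTES_PER_FLOAT = 4  # assumption: vectors are stored as float32


def storage_bytes(num_images: int) -> int:
    """Raw size of the vectors alone, without ids or index overhead."""
    return num_images * DIM * BYTES_PER_FLOAT


def l2_distance(a, b) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_top_n(query, vectors, n):
    """Return the n (index, distance) pairs closest to query, nearest first."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:n]
```

Under these assumptions, one million images need `storage_bytes(1_000_000)` = 8,192,000,000 bytes, about 8.2 GB, and the brute-force search cost grows linearly with the number of stored vectors.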

FAISS

Finding the closest vectors is a common task, and there are already excellent tools for it. The FAISS library from Facebook is considered the leader here. It uses efficient vector clustering, which makes it possible to search even over vector sets that do not fit in RAM.

But working with FAISS directly is not very convenient. It is not a database: you can't just save a vector there and then query for similar ones (besides, once an index has been created, it can only be recreated). So for production use you have to build your own harness around the indexing system.
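The clustering idea can be illustrated with a toy inverted-file (IVF) index in plain Python. This is a sketch of the general technique, not FAISS code: the centroids are hand-picked here (a real index would learn them, typically with k-means), and all function names are made up for the example. The `nprobe` argument controls how many clusters are scanned per query.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_inverted_lists(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, top_k):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vec_id for c in order[:nprobe] for vec_id in lists[c]]
    scored = sorted((l2(query, vectors[i]), i) for i in candidates)
    return [(i, d) for d, i in scored[:top_k]]


# Two obvious clusters; centroids are chosen by hand for the illustration.
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids = [[0, 0.5], [10, 10.5]]
lists = build_inverted_lists(vectors, centroids)
hits = ivf_search([0, 0], vectors, centroids, lists, nprobe=1, top_k=2)
```

With `nprobe=1` only the first cluster is scanned, so two of the four vectors are never compared with the query. A larger `nprobe` trades speed for a lower chance of missing a neighbour that landed in an adjacent cluster.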

Milvus

This is where Milvus comes in, a very promising project whose design strongly resembles Elasticsearch. The key difference is that Elasticsearch is built around the Lucene index, while the whole Milvus architecture is built around the FAISS index.

The structure of collections is also similar:

For each collection, you can create several partitions, which can later be used to limit the scope of a search. A partition consists of segments, each of which is a simple set of files with ids, initial indexes, and service data.

Information about collections, partitions and segments is stored in a separate SQL database. A standalone launch uses the built-in SQLite, and an external MySQL database can be used instead.

The Milvus project is in active development (the current version is 0.11.0). So far it has no data replication and cannot use other SQL (or NoSQL) databases as metadata storage. For now, the only HA option is a scheme with two instances on shared storage: one is running and the other is "sleeping". Mishards can be used for scaling, but it is broken in 0.11.0.

In addition, 0.11.0 made it possible to store additional data in a collection along with the vector and its id. For now these extra fields have no indexes of their own, but they can be searched.

From a usage point of view, Milvus looks like a regular external database. There is an API (a gRPC client and a set of HTTP methods) for storing and searching vectors, managing collections and indexes, and getting information about all entities.

When creating a collection, you can specify the maximum number of entries in a segment (segment_row_limit). Once this limit is exceeded, Milvus starts building the FAISS index. This is tied to one of Milvus's quirks: vectors that have been added but not yet indexed are searched by brute force. So with a large segment_row_limit there will be many records without an index. The value also affects how many segments the collection ends up with: a search has to visit every segment, and the more segments there are, the longer it takes.

Note that a newly created segment is not filled up to the limit as new records arrive. Instead, segments are gradually merged on the principle of the game 2048 (until the size exceeds the limit). So if you set a large segment_row_limit, you may end up with many smaller segments that have no index, which means searching them will be slow.
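The effect of segment_row_limit can be illustrated with a toy simulation. This is only a model of the "2048" idea, not Milvus's real merge policy. It assumes that every insert batch creates a new segment, that two segments of equal size are merged while the result stays within the limit, and that a segment gets an index once it reaches the limit. All function names are hypothetical.

```python
def insert_batch(segments, batch_size, segment_row_limit):
    """Add a new segment, then merge equal-sized segments 2048-style.

    Toy model only: NOT the actual Milvus merge algorithm.
    """
    segments = segments + [batch_size]
    merged = True
    while merged:
        merged = False
        for size in sorted(set(segments)):
            if segments.count(size) >= 2 and size * 2 <= segment_row_limit:
                # Replace two equal segments with one of double size.
                segments.remove(size)
                segments.remove(size)
                segments.append(size * 2)
                merged = True
                break
    return sorted(segments, reverse=True)


def simulate(num_batches, batch_size, segment_row_limit):
    """Return the segment sizes left after inserting num_batches batches."""
    segments = []
    for _ in range(num_batches):
        segments = insert_batch(segments, batch_size, segment_row_limit)
    return segments


def unindexed_rows(segments, segment_row_limit):
    """Rows in segments that have not reached the limit (searched by brute force)."""
    return sum(size for size in segments if size < segment_row_limit)
```

In this model, seven batches of 1000 rows with a limit of 4000 leave segments of 4000, 2000 and 1000 rows, so 3000 rows are searched by brute force. With a limit of 100000 the same inserts leave all 7000 rows without an index.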

Despite all these quirks, vector search is fast. The architecture of FAISS indexes and of Milvus itself lets you search for several query vectors in one request. In practice, searching for two vectors one after another is significantly slower than searching for both in a single call.

Duplicate search implementation

Milvus comes in CPU and GPU versions. The CPU version works best on processors that support the AVX512 instruction set. To run it, just start the container:

docker run -d --rm --name milvusdb -p 19530:19530 -p 19121:19121 \
     milvusdb/milvus:0.11.0-cpu-d101620-4c44c0

Here 19530 is the port for the gRPC client and 19121 is the port for the HTTP API.

For the CNN you can take any pretrained network (or train your own); the results may differ slightly. In this example we use the pretrained resnet152:

model = models.resnet152(pretrained=True)

The vector will be taken from the `avgpool` layer:

layer = model._modules.get('avgpool')

We capture the vector itself with a forward hook:

vector = torch.zeros(2048)
def copy_data(m, i, o):
    vector.copy_(torch.reshape(o.data, (1, 2048))[0])

hook = layer.register_forward_hook(copy_data)
model(prepared_image)
hook.remove()

The complete code for obtaining a vector looks like this:

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from torch.autograd import Variable
from PIL import Image

model = models.resnet152(pretrained=True)
layer = model._modules.get('avgpool')
model.eval()

pipeline = [
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
]


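def _prepare_image(img: Image.Image) -> Variable:
    # Apply resize, tensor conversion and ImageNet normalization in order
    raw = img
    for action in pipeline:
        raw = action(raw)

    # Add the batch dimension expected by the model
    return Variable(raw.unsqueeze(0))


def image_vectorization(image_path: str) -> np.ndarray:
    # Force three channels: grayscale or RGBA files would break Normalize
    img = Image.open(image_path).convert("RGB")

    prepared_image = _prepare_image(img)

    vector = torch.zeros(2048)

    def copy_data(m, i, o):
        # Copy the avgpool output into the flat 2048-element buffer
        vector.copy_(torch.reshape(o.data, (1, 2048))[0])

    hook = layer.register_forward_hook(copy_data)
    with torch.no_grad():  # inference only, no gradients needed
        model(prepared_image)
    hook.remove()

    # vector normalization
    norm_vector = vector / torch.norm(vector)

    return np.array(norm_vector)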

Now you need a client to work with Milvus. You can take any of the supported ones (Python, Java, Go, REST, C++). Let's take the Java client and write the example in Kotlin. Why? Why not.

Add the Milvus SDK dependency:

implementation("io.milvus:milvus-sdk-java:0.9.0")

Create a connection to Milvus:

val connectParam = ConnectParam.Builder()
        .withHost("localhost")
        .withPort(19530)
        .build()

val client = MilvusGrpcClient(connectParam)

Create a collection for 2048-dimensional vectors:

val collectionMapping = CollectionMapping.create(collectionName)
    .addVectorField("float_vec", DataType.VECTOR_FLOAT, 2048)
    // ids are supplied by the caller, so auto_id is turned off
    .setParamsInJson(
        JsonBuilder()
            .param("auto_id", false)
            .param("segment_row_limit", segmentRowLimit)
            .build()
    )

client.createCollection(collectionMapping)

Create an IVF_SQ8 index:

val index = Index.create(collectionName, "float_vec")
    .setIndexType(IndexType.IVF_SQ8)
    .setMetricType(MetricType.L2)
    .setParamsInJson(
        JsonBuilder()
            .param("nlist", 16384)
            .build()
    )

client.createIndex(index)

Save a few vectors to the collection:

val insertParam = InsertParam.create(collectionName)
    .setEntityIds(listOf(1L, 2L))
    .addVectorField("float_vec", DataType.VECTOR_FLOAT, listOf(vector1, vector2))

client.insert(insertParam)
client.flush(collectionName)  // persist the inserted data before searching

Search for a previously saved vector:

val dsl = JsonBuilder().param(
            "bool", mapOf(
                "must" to listOf(
                    mapOf(
                        "vector" to mapOf(
                            "float_vec" to
                                mapOf(
                                    "topk" to 10,
                                    "metric_type" to MetricType.L2,
                                    "type" to "float",
                                    "query" to listOf(vector1),
                                    "params" to mapOf("nprobe" to 50)
                                )
                        )
                    )
                )
            )
        ).build()

val searchParam = SearchParam.create(collectionName)
    .setDsl(dsl)

val result = client.search(searchParam)
println(result.queryResultsList[0].map { it.entityId to it.distance })

If everything is configured correctly, the result will look something like this:

[(1, 0.0), (2, 0.2)]

For the first vector, the L2 distance to itself is 0; to the other vector it is greater than 0.
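Because the vectors are normalized before being stored, the L2 distance is directly tied to cosine similarity: for unit vectors, the squared L2 distance equals 2 - 2·cos. The sketch below checks this identity in plain Python and shows one possible way to turn search hits into duplicate candidates. The `max_distance` threshold and the function names are hypothetical; a real cutoff has to be tuned on your own data, and it is worth checking whether the engine reports squared or plain distances.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; this plain dot product is valid for unit vectors only."""
    return sum(x * y for x, y in zip(a, b))


def likely_duplicates(results, query_id, max_distance):
    """Keep hits no farther than max_distance, skipping the query's own id."""
    return [(i, d) for i, d in results if i != query_id and d <= max_distance]


a = normalize([3, 4])
b = normalize([4, 3])
# For unit vectors: squared_l2(a, b) == 2 - 2 * cosine(a, b), up to rounding.
identity_holds = abs(squared_l2(a, b) - (2 - 2 * cosine(a, b))) < 1e-12
```

Applied to a result list such as `[(1, 0.0), (2, 0.2)]` with `query_id=1` and an illustrative `max_distance=0.3`, the filter keeps only the entry with id 2.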

All of the above is, of course, just a sketch, but it is enough to try building a Python service that runs the classification and returns the vector. You can then either add an API for saving and searching vectors to that service, or do it in a separate service (for example, in Kotlin) that receives the vector and stores it in Milvus itself.

Thanks to everyone who read to the end. I hope you found something new for yourself. And if you are interested in the Milvus project, you can support it on GitHub.
