How to search through file dumps in 104 lines of Python code

Continuing the series of short, useful scripts, I would like to show readers how to build a search over the contents of files and images in 104 lines. It is certainly not a mind-blowing solution, but it will do for simple needs. The article also invents nothing new: all the packages used are open source.



And yes, blank lines in the code are counted too. A short demonstration is given at the end of the article.



We need Python 3, a downloaded copy of Tesseract 5, and the distiluse-base-multilingual-cased model from the Sentence-Transformers package. Those who already see where this is going will not find the rest interesting.



In the meantime, everything we need looks like this:



First 18 lines
import numpy as np
import os, sys, glob

os.environ['PATH'] += os.pathsep + os.path.join(os.getcwd(), 'Tesseract-OCR')
extensions = [
    '.xlsx', '.docx', '.pptx',
    '.pdf', '.txt', '.md', '.htm', '.html',
    '.jpg', '.jpeg', '.png', '.gif'
]

import warnings; warnings.filterwarnings('ignore')
import torch, textract, pdfplumber
from cleantext import clean
from razdel import sentenize
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('./distillUSE')





Quite a lot is needed, as you can see. Everything seems to be ready-made, but some manual filing is still required. In particular, textract (not the paid one from Amazon) does not cope well with Russian PDFs, so pdfplumber is used for them instead. Splitting text into sentences is also a hard task, and razdel does an excellent job of it for Russian.



To those who have not heard of scikit-learn: I envy you. In short, its NearestNeighbors algorithm memorizes vectors and returns the nearest ones. Instead of scikit-learn you can use, for example, faiss, annoy, or even Elasticsearch.
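
To make the idea concrete, here is a minimal brute-force sketch in plain Python of what a cosine nearest-neighbour lookup does. It is not how scikit-learn implements it; the names `cosine_distance` and `nearest` and the toy 3-dimensional vectors are assumptions made purely for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(stored, query):
    """Return (index, distance) of the stored vector closest to query."""
    distances = [cosine_distance(vector, query) for vector in stored]
    index = min(range(len(stored)), key=distances.__getitem__)
    return index, distances[index]

# Toy "remembered" vectors (hypothetical data; real sentence
# embeddings have hundreds of dimensions).
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# The query points almost along the second axis, so index 1 wins.
index, distance = nearest(stored, [0.1, 0.9, 0.0])
print(index, round(distance, 3))
```

A brute-force scan like this is linear in the number of stored vectors, which is one reason libraries such as faiss or annoy are worth considering for large collections.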



The main thing is to turn the text of (any) file into vectors, which is what the following function does:



Next 36 lines of code
def processor(path, embedder):
    try:
        if path.lower().endswith('.pdf'):
            with pdfplumber.open(path) as pdf:
                if len(pdf.pages):
                    text = ' '.join([
                        page.extract_text() or '' for page in pdf.pages if page
                    ])
        elif path.lower().endswith('.md') or path.lower().endswith('.txt'):
            with open(path, 'r', encoding='UTF-8') as fd:
                text = fd.read()
        else:
            text = textract.process(path, language='rus+eng').decode('UTF-8')
        if path.lower()[-4:] in ['.jpg', 'jpeg', '.gif', '.png']:
            text = clean(
                text,
                fix_unicode=False, lang='ru', to_ascii=False, lower=False,
                no_line_breaks=True
            )
        else:
            text = clean(
                text,
                lang='ru', to_ascii=False, lower=False, no_line_breaks=True
            )
        sentences = list(map(lambda substring: substring.text, sentenize(text)))
    except Exception as exception:
        return None
    if not len(sentences):
        return None
    return {
        'filepath': [path] * len(sentences),
        'sentences': sentences,
        'vectors': [vector.astype(float).tolist() for vector in embedder.encode(
            sentences
        )]
    }
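
The dictionary returned by `processor` holds three parallel lists, so position i in each list refers to the same sentence. The sketch below illustrates that layout and how several such dictionaries can be flattened into one index. Here `fake_encode` is a hypothetical stand-in for `embedder.encode`, `make_record` is an invented helper, and the file names are made up.

```python
def fake_encode(sentences):
    """Hypothetical stand-in for embedder.encode: one tiny vector per sentence."""
    return [[float(len(s)), float(s.count(' '))] for s in sentences]

def make_record(path, sentences):
    """Build the same three parallel lists that processor returns."""
    return {
        'filepath': [path] * len(sentences),
        'sentences': sentences,
        'vectors': fake_encode(sentences),
    }

records = [
    make_record('a.txt', ['First sentence.', 'Second one.']),
    make_record('b.gif', ['Text read from an image.']),
]

# Flatten the per-file records into three aligned lists.
sentences, files, vectors = [], [], []
for record in records:
    sentences += record['sentences']
    files += record['filepath']
    vectors += record['vectors']

# Any position found by a nearest-neighbour search maps straight
# back to the sentence and the file it came from.
position = 2
print(sentences[position], files[position])
```

Repeating the path once per sentence costs a little memory, but it keeps the lookup after a search down to plain list indexing.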





The rest is a matter of technique: go through all the files, extract the vectors, and find the one closest to the query by cosine distance.



Remaining code
def indexer(files, embedder):
    for file in files:
        processed = processor(file, embedder)
        if processed is not None:
            yield processed

def counter(path):
    if not os.path.exists(path):
        return None
    for file in glob.iglob(path + '/**', recursive=True):
        extension = os.path.splitext(file)[1].lower()
        if extension in extensions:
            yield file

def search(engine, text, sentences, files):
    indices = engine.kneighbors(
        embedder.encode([text])[0].astype(float).reshape(1, -1),
        return_distance=True
    )

    distance = indices[0][0][0]
    position = indices[1][0][0]

    print(
        'Score: %.3f' % (1 - distance / 2),
        'Sentence: "%s", file: "%s"' % (sentences[position], files[position])
    )

print('Looking for files in "%s"' % sys.argv[1])
paths = list(counter(sys.argv[1]))

print('Indexing files in "%s"' % sys.argv[1])
db = list(indexer(paths, embedder))

sentences, files, vectors = [], [], []
for item in db:
    sentences += item['sentences']
    files += item['filepath']
    vectors += item['vectors']

engine = NearestNeighbors(n_neighbors=1, metric='cosine').fit(
    np.array(vectors).reshape(len(vectors), -1)
)

query = input('Query: ')
while query:
    search(engine, query, sentences, files)
    query = input('Query: ')
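
The value printed by `search` is `1 - distance / 2`. Cosine distance lies between 0 and 2, so this expression rescales it to a score between 0 and 1, where 1 means the query and the sentence vectors point the same way. A small worked sketch, with the helper name `score` chosen here only for illustration:

```python
def score(cosine_distance):
    """Map a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    return 1 - cosine_distance / 2

# Identical direction, partly similar, orthogonal, opposite.
for distance in (0.0, 0.5, 1.0, 2.0):
    print('%.1f -> %.3f' % (distance, score(distance)))
# prints:
# 0.0 -> 1.000
# 0.5 -> 0.750
# 1.0 -> 0.500
# 2.0 -> 0.000
```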





You can run all the code like this:



python3 app.py /path/to/your/files/


That is all there is to the code.



And here is the promised demo.



I took two news items from Lenta.ru, saved one as a GIF file using the notorious Paint, and the other simply as a text file.



First .gif file




Second .txt file



And here is a GIF animation of how it works. With a GPU, of course, everything runs faster.



Demonstration (better to click on the picture)






Thanks for reading! I hope this method proves useful to someone.


