The original inspiration for this work is this Towards Data Science article, which opened my eyes to what might be possible when writing search engines using word embeddings generated by modern models like transformers.
I've been interested in search engines for quite a while now and have built some rudimentary search engines from scratch in Python. This repo contains experiments in building search engines using more modern techniques.
The main documentation for Sentence Transformers covers the library in detail. One of the key capabilities it powers is semantic search.
The models described on the semantic search page differ based on their applications. Symmetric models perform well when the query text and the retrieved text are of similar length. Asymmetric models perform better when the query is much shorter than the text being retrieved. It is worth noting that many of these models came from work done by the Bing search engine team.
In all cases, these models can use GPU acceleration. Performance is quite good on my desktop GPU, an RTX 2080, with typical queries returning results in under 20 ms. I have not tried to test these models on machines that lack GPU support.
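Semantic search with these models boils down to comparing embedding vectors by cosine similarity (the same score nmslib uses later in this repo). A minimal pure-numpy sketch of the metric itself; in practice the vectors would come from `model.encode(...)`:

```python
# Cosine similarity: the score used to compare embedding vectors.
# Pure-numpy sketch; real 768-element embeddings come from model.encode(...).
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_sim(a, b))  # identical vectors score 1.0
print(cosine_sim(a, c))  # orthogonal vectors score 0.0
```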
The model currently used is msmarco-distilbert-base-v4. These models use the MS MARCO datasets, which started as a 100,000 question/answer dataset from Bing and have since been expanded to 1,000,000 question/answer pairs. These datasets have been used extensively in academic research, and the linked page above contains leaderboards tracking progress on different text retrieval tasks.
- This is a search engine for wine reviews built using a database of 100K reviews.
- It takes each review and, using a transformer model, computes a vector that represents that review. Each vector contains 768 elements (numbers).
- It uses Hugging Face and a pre-trained transformer model, msmarco-distilbert-base-v4, to do so. Here is a description of it: sentence-transformers/msmarco-distilbert-base-v4 · Hugging Face
- It takes only two lines of code to download and use the pre-trained model via the Hugging Face library (see the cache-model.py file). These two lines of code will download and cache the pre-trained model in the Docker container image file.
```python
model = SentenceTransformer(model_name)
model.save(model_path)
```
- The prepare.py script takes the list of reviews and converts them into embeddings. It reads the reviews into a pandas dataframe and passes the column containing the review text to the model to encode. Computing the embeddings for 100K reviews takes approximately 5 minutes on an RTX 2080 GPU.
- It uses another library, nmslib (non-metric space library), to store the computed embeddings and to search them for nearest-neighbor matches using the cosine similarity metric.
- The search.ipynb notebook lets the user enter a query and call the search function. The search function uses the msmarco-distilbert-base-v4 model to compute an embedding for the user's query, then passes the embedding vector to the nmslib library to return the top 20 results matching that query. Those results are indexes into the original dataframe, which are used to retrieve the original review text and title to show the user.