The original inspiration for this work is this Towards Data Science article, which opened my eyes to what might be possible when writing search engines using word embeddings generated by modern models like transformers.
I've been interested in search engines for quite a while now and have built some rudimentary search engines from scratch in Python. This repo contains experiments in building search engines using more modern techniques.
The main documentation for Sentence Transformers covers the library in detail. One of the key capabilities it powers is semantic search.
The models described on the semantic search page differ based on their applications. Symmetric models perform well when the query text and the retrieved text are of similar length. Asymmetric models perform better when the query is much shorter than the text being retrieved. It is worth noting that many of these models came from work done by the Bing search engine team.
In all cases, these models can use GPU acceleration. Performance is quite good on my desktop GPU, an RTX 2080, with typical queries returning results in under 20 ms. I have not tried to test these models on machines that lack GPU support.
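Semantic search with these models boils down to comparing embedding vectors by cosine similarity (the same score nmslib uses later in this repo). A minimal pure-numpy sketch of the metric itself; in practice the vectors would come from `model.encode(...)`:

```python
# Cosine similarity: the score used to compare embedding vectors.
# Pure-numpy sketch; real 768-element embeddings come from model.encode(...).
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

print(cosine_sim(a, b))  # identical vectors score 1.0
print(cosine_sim(a, c))  # orthogonal vectors score 0.0
```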
The model currently used is msmarco-distilbert-base-v4. These models use the MS MARCO datasets, which started as a 100,000 question/answer dataset from Bing and have since been expanded to 1,000,000 question/answer pairs. These datasets have been used extensively in academic research, and the linked page above contains leaderboards tracking progress on different text retrieval tasks.
- This is a search engine for wine reviews built using a database of 100K reviews.
- It takes each review and, using a transformer model, computes a vector that represents that review. Each vector contains 768 elements (numbers).
- It uses Hugging Face and a pre-trained transformer model, msmarco-distilbert-base-v4, to do so. Here is a description of it: sentence-transformers/msmarco-distilbert-base-v4 · Hugging Face
- It takes only two lines of code to download and use the pre-trained model via the Hugging Face library (see the cache-model.py file). These two lines of code will download and cache the pre-trained model in the Docker container image file.
```python
model = SentenceTransformer(model_name)
model.save(model_path)
```
- The prepare.py script takes the list of reviews and converts them into embeddings. It reads the reviews into a pandas dataframe and passes the column containing the review text to the model to encode. Computing the embeddings for 100K reviews takes approximately 5 minutes on an RTX 2080 GPU.
- It uses another library, nmslib (non-metric space library), to store the computed embeddings and to search them for nearest-neighbor matches using the cosine similarity metric.
- The search.ipynb notebook lets the user enter a query and call the search function. The search function uses the msmarco-distilbert-base-v4 model to compute an embedding for the user's query, then passes the embedding vector to the nmslib library to return the top 20 results matching that query. Those results are indexes into the original dataframe, which are used to retrieve the original review text and title to show the user.