wine

Experiments in building personal search engines

The original inspiration for this work is a Towards Data Science article. It opened my eyes to what might be possible when writing search engines with word embeddings generated by modern models such as transformers.

I've been interested in search engines for quite a while now and have built some rudimentary search engines from scratch in Python. This repo contains experiments in building search engines using more modern techniques.

Sentence-Transformers

The Sentence-Transformers documentation is the main reference for the library. One of the key things the library powers is semantic search.

The models described on the semantic search page differ by application. In symmetric cases, where the query text and the retrieved text are of similar length, one set of models performs well.

In asymmetric cases, where the length of the query differs from the length of the retrieved text, a different set of models performs better. It is worth noting that many of the models came from work done by the Bing search engine team.

In all cases, these models use GPU acceleration. Performance is quite good on my desktop GPU, an RTX 2080, with typical query results returned in under 20 ms. I have not tested these models on machines that lack GPU support.

The model currently in use is msmarco-distilbert-base-v4. The msmarco models are trained on the MS MARCO datasets, which started as a dataset of 100,000 questions and answers from Bing and have since been expanded to 1,000,000. These datasets have been used extensively in academic research, and the linked page above contains leaderboards for progress in different text retrieval tasks.

Description of the code

  1. This is a search engine for wine reviews built using a database of 100K reviews.
  2. It takes each review and uses a transformer model to compute a vector that represents that review. Each vector has 768 elements (numbers).
  3. It uses Hugging Face and a pre-trained transformer model to do so: msmarco-distilbert-base-v4. Here is a description of it: sentence-transformers/msmarco-distilbert-base-v4 · Hugging Face
  4. It takes only two lines of code to download and use the pre-trained model with the Hugging Face library (see the cache-model.py file). These two lines download the pre-trained model and cache it in the Docker container image:
model = SentenceTransformer(model_name)
model.save(model_path)
  5. The prepare.py script converts the reviews into embeddings. It reads the list of reviews into a pandas dataframe and passes the column containing the review text to the model to encode (no training takes place; the model is already pre-trained). Computing the embeddings for 100K reviews takes approximately 5 minutes on an RTX 2080 GPU.
  6. It uses another library, nmslib (Non-Metric Space Library), to store the computed embeddings and to search them for the nearest-neighbor matches, using cosine similarity as the measure of closeness.
  7. The search.ipynb notebook lets the user enter a query and call the search function. The search function uses the msmarco-distilbert-base-v4 model to compute an embedding for the user's query, and then passes the embedding vector to the nmslib library, which returns the top 20 results that match that query. Those results are indexes into the original dataframe, and we use the indexes to retrieve the original review text and title to show the user.
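
To make the lookup in the last two steps concrete, here is a minimal sketch of a cosine-similarity top-k search in plain Python. It is not the code from this repo: it uses a brute-force scan as a stand-in for the nmslib index, made-up 3-element vectors in place of the 768-element embeddings, and hypothetical names (`cosine_similarity`, `top_k`, `titles`, `doc_vecs`, `query_vec`). The point it illustrates is that the search returns row indexes, which are then used to look up the original records.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=20):
    """Return up to k (row_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (assumption): hypothetical titles and hand-written vectors.
# In the real pipeline these vectors would come from the transformer model.
titles = ["Bold Cabernet", "Crisp Riesling", "Smoky Syrah"]
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user query

results = top_k(query_vec, doc_vecs, k=2)
for row, score in results:
    # The row index maps back to the original record, as in step 7.
    print(f"{titles[row]}: {score:.3f}")
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of rows but grows linearly with the collection. An index library such as nmslib avoids scanning everything, which is why it suits 100K reviews.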