CDRC Semantic Search System

Overview

The CDRC Semantic Search System is a project designed to enhance the search capabilities of the Consumer Data Research Centre (CDRC) data catalogue. The goal is to implement a semantic search approach that goes beyond traditional keyword-based search, giving users more accurate and relevant results.

Features

  • Semantic Search: Documents are embedded using OpenAI and the embeddings are stored in Pinecone, allowing semantic querying by cosine similarity.

  • Retrieval Augmented Generation: Responses are generated using GPT-3.5 Turbo to explain the relevance of the retrieved datasets.
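As a rough illustration of the cosine-similarity ranking mentioned above, the following sketch scores a query vector against a handful of document vectors in plain Python. The vectors and dataset names are made up for the example, and the real system delegates embedding to OpenAI and similarity search to Pinecone rather than computing scores locally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
doc_vecs = {
    "health_dataset": [0.9, 0.1, 0.0],
    "retail_dataset": [0.1, 0.9, 0.2],
    "transport_dataset": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

top = rank_documents(query_vec, doc_vecs)
for doc_id, score in top:
    print(f"{doc_id}: {score:.3f}")
```

Real embeddings have hundreds or thousands of dimensions, but the ranking logic is the same: the documents whose vectors point in the most similar direction to the query are returned first.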

System Architecture

The CDRC Semantic Search System follows a standard Retrieval Augmented Generation (RAG) architecture:

Credit to Heiko Hotz (https://towardsdatascience.com/rag-vs-finetuning-which-is-the-best-tool-to-boost-your-llm-application-94654b1eaba7)

Installation

To get started with the CDRC Semantic Search System, follow these steps:

  1. Clone the repository:

    git clone https://github.com/cjber/cdrc-semantic-search.git
  2. Install dependencies:

    With pip:

    cd cdrc-semantic-search
    pip install -r requirements.txt

    With pdm:

    cd cdrc-semantic-search
    pdm install

  3. Configure the system:

    Edit the config/config.toml file to customize settings such as API keys or model settings.

  4. Run the system using a DVC pipeline:

    dvc repro

NOTE: Running the pipeline requires a Pinecone database and access to the CDRC catalogue.

cdrc-semantic-search's Issues

Explainability of semantic search results

Ideally, we want the model to justify the choices made by the semantic search system. There are several approaches:

  1. A RAG QA system with a prompt that queries 'Why are the following {documents} relevant to {query}?'

    • e.g. a query for 'diabetes' returns health-related documents; why are these relevant? Can specific columns be named by the LLM?
  2. Document Store split into small chunks; returned chunks can be highlighted in the main document as relevant to the query.

    • Unsure whether this negatively impacts retrieval performance.
    • Can be combined with RAG.
  3. Extractive QA - returns a string taken directly from the document.

  4. Summariser - prompt: 'summarise the returned documents'
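Approaches 1 and 2 above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `[[ ]]` highlight markers, and the fixed-size character chunking are hypothetical choices, not the project's implementation, and the prompt would still need to be sent to an LLM.

```python
PROMPT_TEMPLATE = (
    "Why are the following documents relevant to the query '{query}'?\n\n"
    "{documents}"
)


def build_relevance_prompt(query, documents):
    """Fill the prompt with a numbered list of (title, description) pairs."""
    lines = [
        f"{i}. {title}: {description}"
        for i, (title, description) in enumerate(documents, start=1)
    ]
    return PROMPT_TEMPLATE.format(query=query, documents="\n".join(lines))


def chunk_with_offsets(text, chunk_size):
    """Split text into fixed-size chunks, keeping (start, end, chunk) offsets."""
    return [
        (start, min(start + chunk_size, len(text)), text[start:start + chunk_size])
        for start in range(0, len(text), chunk_size)
    ]


def highlight(text, start, end):
    """Mark the span [start, end) of the full document with [[ ]] markers."""
    return text[:start] + "[[" + text[start:end] + "]]" + text[end:]


# Approach 1: build the explanation prompt for retrieved documents.
prompt = build_relevance_prompt(
    "diabetes",
    [("Example health dataset", "Hypothetical description of health indicators.")],
)
print(prompt)

# Approach 2: a retrieved chunk is mapped back onto the full document.
document = "abcdefghij"
chunks = chunk_with_offsets(document, 4)
start, end, _ = chunks[1]
print(highlight(document, start, end))
```

Keeping the character offsets alongside each chunk is what makes highlighting possible: the retriever returns a chunk, and the offsets locate it in the original document.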

Evaluate model performance using existing queries

There is a list of existing queries that use the CDRC keyword-based search. Evaluation could make use of these.

  1. Find most common searches and compare semantic search results to website results.
    • List retrieved datasets
  2. Should semantic search encourage longer form questioning? e.g. 'which datasets could help study diabetes?' rather than 'diabetes'.

It is worth noting that the most frequent queries tend to search with a known dataset in mind, e.g. 'imd', 'ahah'. Semantic search would be more focussed on dataset discovery.
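One possible way to set up this comparison is sketched below: count the most frequent queries in a search log, then measure how much the keyword and semantic result lists overlap for a given query. The query log, the dataset identifiers, and the choice of Jaccard overlap as the metric are all assumptions for illustration; the issue does not prescribe a metric.

```python
from collections import Counter


def most_common_queries(query_log, n=3):
    """Return the n most frequent queries after trimming and lower-casing."""
    normalised = [q.strip().lower() for q in query_log if q.strip()]
    return Counter(normalised).most_common(n)


def jaccard_overlap(keyword_results, semantic_results):
    """Share of datasets common to both result lists (1.0 = identical sets)."""
    a, b = set(keyword_results), set(semantic_results)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical search log and result lists.
query_log = ["imd", "IMD", "ahah", "diabetes", "imd ", "ahah"]
top_queries = most_common_queries(query_log, n=2)
print(top_queries)

keyword_results = ["dataset_a", "dataset_b", "dataset_c"]
semantic_results = ["dataset_b", "dataset_c", "dataset_d"]
overlap = jaccard_overlap(keyword_results, semantic_results)
print(f"overlap: {overlap:.2f}")
```

A low overlap is not necessarily a bad sign here: if semantic search is aimed at dataset discovery, it may surface relevant datasets that a keyword match on a known name would never return, so the retrieved lists still need manual inspection.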
