cdrc-semantic-search's Introduction

CDRC Semantic Search System

Overview

The CDRC Semantic Search System is a project designed to enhance the search capabilities of the Consumer Data Research Centre (CDRC) data catalogue. The goal is to implement a semantic search approach that goes beyond traditional keyword-based search, providing users with more accurate and relevant results.

Features

Semantic Search: Documents are embedded using OpenAI and the embeddings are stored in Pinecone, allowing semantic querying by cosine similarity.
Retrieval Augmented Generation: Responses are generated using GPT-3.5 Turbo to explain the relevance of the retrieved datasets.
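
As a rough illustration of how ranking by cosine similarity works, the sketch below scores a query vector against a few document vectors in plain Python. It does not call OpenAI or Pinecone; the three-dimensional vectors and the dataset identifiers are made-up placeholders, and the function names are assumptions rather than part of this repository.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real embeddings have many more dimensions.
doc_vecs = {
    "health-survey": [0.9, 0.1, 0.0],
    "retail-footfall": [0.0, 0.2, 0.9],
    "access-to-healthy-assets": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

results = rank_documents(query_vec, doc_vecs)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In the actual system a vector database performs this nearest-neighbour lookup at scale; the sketch only shows the scoring idea.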

System Architecture

The CDRC Semantic Search System follows a standard Retrieval Augmented Generation (RAG) architecture.

[Figure: RAG architecture diagram. Credit to Heiko Hotz (https://towardsdatascience.com/rag-vs-finetuning-which-is-the-best-tool-to-boost-your-llm-application-94654b1eaba7)]

Installation

To get started with the CDRC Semantic Search System, follow these steps:

Clone the repository:

git clone https://github.com/cjber/cdrc-semantic-search.git

Install dependencies:

With pip:

cd cdrc-semantic-search
pip install -r requirements.txt

With pdm:

cd cdrc-semantic-search
pdm install

Configure the system:

Edit the config/config.toml file to customize settings such as API keys or model settings.

Run the system using a DVC pipeline:
```
dvc repro
```

NOTE: This requires a Pinecone database and access to the CDRC catalogue.

cdrc-semantic-search's Issues

Explainability of semantic search results

Ideally, we want the model to justify the choices made by the semantic search system. There are several approaches:

A RAG QA system with a prompt that queries 'Why are the following {documents} relevant to {query}?'
- e.g. A query for 'diabetes' returns health related documents; why are these relevant? Can specific columns be named by the LLM?
Document Store split into small chunks; returned chunks can be highlighted in the main document as relevant to the query.
- Unsure whether this negatively impacts retrieval performance.
- Can be combined with RAG.
Extractive QA - returns a string taken directly from the document.
Summariser - prompt: 'summarise the returned documents'
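
The first approach (a RAG prompt asking why documents are relevant) could be sketched as below. This is only a minimal illustration of filling the prompt template; the `title` and `description` fields, the example documents, and the function name are assumptions, and no LLM is called.

```python
def build_relevance_prompt(query, documents):
    """Fill the 'why are these relevant' template for a set of retrieved documents.

    `documents` is a list of dicts with hypothetical 'title' and
    'description' keys; the real catalogue metadata may differ.
    """
    doc_lines = "\n".join(
        f"- {doc['title']}: {doc['description']}" for doc in documents
    )
    return (
        f"Why are the following documents relevant to the query '{query}'?\n"
        f"Documents:\n{doc_lines}\n"
        "Where possible, name the specific columns that make each document relevant."
    )


# Made-up retrieved documents for the 'diabetes' example.
retrieved = [
    {"title": "Health survey", "description": "Self-reported health conditions by area."},
    {"title": "Healthy assets index", "description": "Accessibility of health services."},
]

prompt = build_relevance_prompt("diabetes", retrieved)
print(prompt)
```

The resulting string would then be sent to the chat model together with any system instructions.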

Evaluate model performance using existing queries

There is a list of existing queries that use the CDRC keyword-based search. Evaluation could make use of these.

Find most common searches and compare semantic search results to website results.
- List retrieved datasets
Should semantic search encourage longer form questioning? e.g. 'which datasets could help study diabetes?' rather than 'diabetes'.

It is worth noting that the most frequent queries tend to search with a known dataset in mind, e.g. 'imd', 'ahah'. Semantic search would be more focussed on dataset discovery.
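
One possible way to compare semantic search results with the existing website results is to measure the overlap between the two lists of retrieved datasets for each query. The sketch below uses Jaccard overlap as an assumed metric, which the issue itself does not specify; the search functions are stand-ins backed by made-up results, not the real search backends.

```python
def jaccard_overlap(semantic_results, keyword_results):
    """Share of datasets returned by both systems out of all datasets returned."""
    a, b = set(semantic_results), set(keyword_results)
    if not a and not b:
        # Both systems returned nothing, so they agree trivially.
        return 1.0
    return len(a & b) / len(a | b)


def compare_queries(queries, semantic_search, keyword_search, top_k=3):
    """Return {query: overlap} using the top_k results from each system."""
    return {
        query: jaccard_overlap(
            semantic_search(query)[:top_k], keyword_search(query)[:top_k]
        )
        for query in queries
    }


# Hypothetical results keyed by query; dataset identifiers are placeholders.
SEMANTIC = {
    "imd": ["imd-2019", "ahah", "access"],
    "diabetes": ["ahah", "health-survey"],
}
KEYWORD = {
    "imd": ["imd-2019", "imd-2015"],
    "diabetes": [],
}

overlaps = compare_queries(
    ["imd", "diabetes"],
    semantic_search=lambda q: SEMANTIC.get(q, []),
    keyword_search=lambda q: KEYWORD.get(q, []),
)
print(overlaps)
```

A low overlap is not necessarily a failure: for discovery-style queries such as 'diabetes', keyword search may return nothing while semantic search surfaces related datasets, so the retrieved lists still need manual inspection.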
