
bco-rag's Introduction

Biocompute Object Retrieval-Augmented Generation Assistant

Full documentation can be found here.

Sean Kim

Background

The BioCompute Object (BCO) project is a community-driven open standards framework for standardizing and sharing computations and analyses. With the exponential increase in both the quantity and complexity of biological data and the workflows used to analyze and transform the data, the need for standardization in documentation is imperative for experimental preservation, transparency, accuracy, and reproducibility.

As with any documentation standard, the main hurdles to continued adoption are the overhead required to maintain the quality and accuracy of a BCO as the research evolves over time, and the difficulty of retroactively documenting pre-existing research. With the recent improvements in large language models (LLMs), an automated BCO creation assistant has become a feasible and intriguing use case.

Goal

The goal of this project is to reduce the overhead required to retroactively document pre-existing research. Using the BioCompute RAG assistant, you can seamlessly convert publications describing existing research into BCO-compliant documentation.

bco-rag's People

Contributors

seankim658

Watchers

Stian Soiland-Reyes, Jeremy Goecks, Jonathon Keeney, Hadley King, Tianyi Wang

bco-rag's Issues

Enforce domain generation ordering for dependent domains

Some domains are dependent on other domains. The big one is that the parametric domain has to be generated after the description domain, because the step numbers have to match up.

Will have to hold the description domain in memory and pass it in as part of the query for the parametric domain.
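One way to enforce this ordering is a small dependency map plus a topological walk, passing each domain's dependencies into its query. This is a hedged sketch; the domain names, `resolve_order`, and `generate_all` are illustrative, not the repo's actual API:

```python
# Which domains must be generated before which (illustrative names).
DOMAIN_DEPENDENCIES = {
    "usability": [],
    "io": [],
    "description": [],
    "execution": [],
    "parametric": ["description"],  # step numbers must match the description domain
    "error": [],
}

def resolve_order(deps):
    """Return a generation order where every domain follows its dependencies."""
    order, seen = [], set()

    def visit(domain):
        if domain in seen:
            return
        for dep in deps[domain]:
            visit(dep)
        seen.add(domain)
        order.append(domain)

    for domain in deps:
        visit(domain)
    return order

def generate_all(deps, generate):
    """Generate domains in dependency order, holding prior outputs in memory
    and passing them in as context for dependent domains."""
    outputs = {}
    for domain in resolve_order(deps):
        context = {dep: outputs[dep] for dep in deps[domain]}
        outputs[domain] = generate(domain, context)
    return outputs
```

With this shape, the description domain's output is still in memory when the parametric domain's query is built, so it can be injected directly into that query.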

Update documentation

Need to update some documentation for the migration and new features that have been added.

MongoDB Backend?

Should probably remove the local evaluation data. That was meant for the proof of concept, and a true backend should probably be set up, potentially in the form of MongoDB and a simple Flask API. The domain generations (for now) can probably be kept in the repository.

Separate query

The standardized queries contain two parts:

  • The query part, which describes what is in the domain and what the domain represents.
  • The schema part, which contains the output formatting for the return response.

The first part of the query is obviously important for the semantic retrieval process, but I have a hunch that including the second part in the semantic retrieval is polluting the results. Going to try splitting the query and re-injecting the data schema part before the data is sent to the LLM.
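The split-and-re-inject idea could be sketched like this. The `SCHEMA_MARKER` string and the `retrieve`/`complete` callables are assumptions standing in for the real standardized query format and the RAG pipeline:

```python
# Assumed delimiter between the semantic query and the schema instructions.
SCHEMA_MARKER = "Return your response in the following JSON format:"

def split_query(standardized_query):
    """Split a standardized query into (semantic query, schema instructions)."""
    if SCHEMA_MARKER in standardized_query:
        query_part, schema_part = standardized_query.split(SCHEMA_MARKER, 1)
        return query_part.strip(), SCHEMA_MARKER + schema_part
    return standardized_query.strip(), ""

def answer(standardized_query, retrieve, complete):
    """Retrieve with the query part only, then re-inject the schema for the LLM."""
    query_part, schema_part = split_query(standardized_query)
    chunks = retrieve(query_part)  # schema text is kept out of retrieval
    prompt = "\n\n".join([*chunks, query_part, schema_part])
    return complete(prompt)
```

This keeps the formatting boilerplate out of the embedding/similarity step while the LLM still sees the full output contract.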

Update logging

Handle the logging directly in the objects rather than in the main.py entrypoint.
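A minimal sketch of the object-owned logging pattern; the class and method names are illustrative, not the repo's actual API:

```python
import logging

class DomainGenerator:
    """Illustrative object that owns its logger instead of relying on main.py."""

    def __init__(self):
        # Module-qualified logger; configuration lives with the object.
        self.logger = logging.getLogger(f"{__name__}.{type(self).__name__}")

    def generate(self, domain):
        self.logger.info("Generating domain: %s", domain)
        return f"{domain} output"
```

The entrypoint then only needs `logging.basicConfig(...)` (or no logging setup at all), and each object's log records carry its own qualified logger name.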

Re-work default eval check

Right now the default score check prevents erroneously saving default evaluations. Should re-work the logic to exclude some fields such as:

  • Score (being implemented here: #9)
  • JSON format error (which I want to set automatically when falling back to the raw txt file)
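The reworked check might look like the sketch below. The field names and default values are assumptions for illustration; only the exclusion logic is the point:

```python
# Assumed default evaluation values (illustrative field names).
DEFAULT_EVAL = {
    "score": 0,
    "json_format_error": False,
    "notes": "",
    "quality_rating": 0,
}

# Fields that may legitimately stay at (or be auto-set to) their defaults.
EXCLUDED_FIELDS = {"score", "json_format_error"}

def is_default_eval(evaluation, defaults=DEFAULT_EVAL, excluded=EXCLUDED_FIELDS):
    """True if every non-excluded field still holds its default value."""
    return all(
        evaluation.get(field) == default
        for field, default in defaults.items()
        if field not in excluded
    )
```

An evaluation is then only blocked from saving when the reviewer has touched nothing outside the excluded fields.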

Figure out evaluations.json file output

Thinking about creating separate scripts solely for identifying potentially erroneous eval saves, and then separately building the evaluations file from "high quality reviews" only (will have to determine those criteria).

Semantic Chunking Chunk Size Bug

LlamaIndex's SemanticSplitterNodeParser can sometimes produce chunks that are too large for the embedding model. Unfortunately, there is no max-length option for semantic chunking that avoids this issue.

Will have to eventually subclass the SemanticSplitterNodeParser and create a two-level safety net that naively splits large chunks into sub-chunks in order to stay under the embedding model's input token limit.

Reference:
run-llama/llama_index#12270
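The naive second-level fallback could look like this sketch. Whitespace tokenization stands in for the embedding model's real tokenizer, and the function name is an assumption; in practice this logic would live in the SemanticSplitterNodeParser subclass:

```python
def enforce_max_chunk_size(chunks, max_tokens):
    """Naively split any chunk exceeding max_tokens into sub-chunks.

    Safety net for semantic chunks that would overflow the embedding
    model's input token limit.
    """
    safe_chunks = []
    for chunk in chunks:
        tokens = chunk.split()  # stand-in for the real tokenizer
        if len(tokens) <= max_tokens:
            safe_chunks.append(chunk)
            continue
        # Hard-split on token boundaries, ignoring semantics.
        for start in range(0, len(tokens), max_tokens):
            safe_chunks.append(" ".join(tokens[start:start + max_tokens]))
    return safe_chunks
```

The semantic splitter still does the first pass; this only kicks in for the rare oversized chunk, trading semantic coherence for a guaranteed size bound.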

Add score to ScoreEval

Should store the score along with the score eval. Because the scoring will iterate/change, need to keep track of which score the score eval was submitted against.
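A small sketch of what storing the score alongside the eval could look like; the field names and version string are assumptions, not the repo's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ScoreEval:
    """Illustrative score evaluation that records the score it was made against."""
    eval_agree: bool    # does the reviewer agree with the automated score?
    notes: str
    score: float        # the score this eval was submitted against
    score_version: str  # scoring iteration, so old evals stay interpretable

eval_record = ScoreEval(eval_agree=False, notes="score seems high", score=8.5, score_version="v1")
```

Once the scoring logic iterates, each stored eval still pins down exactly which score (and which scoring version) it was judging.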
