
bco-rag's Introduction

Biocompute Object Retrieval-Augmented Generation Assistant

Full documentation can be found here.

Sean Kim

Background

The BioCompute Object (BCO) project is a community-driven open standards framework for standardizing and sharing computations and analyses. With the exponential increase in both the quantity and complexity of biological data and the workflows used to analyze and transform the data, the need for standardization in documentation is imperative for experimental preservation, transparency, accuracy, and reproducibility.

As with any documentation standard, the main hurdles to continued adoption are the overhead required to maintain the quality and accuracy of a BCO as the research evolves over time, and the difficulty of retroactively documenting pre-existing research. With the recent improvements in large language models (LLMs), an automated BCO creation assistant has become a feasible and intriguing use case.

Goal

The goal of this project is to reduce the overhead required to retroactively document pre-existing research. Using the BioCompute RAG assistant, you can seamlessly convert publications describing existing research into BCO-compliant documentation.

bco-rag's People

Contributors

seankim658

Watchers

Stian Soiland-Reyes, Jeremy Goecks, Jonathon Keeney, Hadley King, Tianyi Wang

bco-rag's Issues

Enforce domain generation ordering for dependent domains

Some domains are dependent on other domains. The big one is that the parametric domain has to be generated after the description domain, because the step numbers have to match up.

Will have to hold the description domain in memory and pass it in as part of the query for the parametric domain.
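One way to enforce this ordering is a small dependency map plus a topological walk, passing each domain's dependencies into its query. This is a hedged sketch; the domain names, `resolve_order`, and `generate_all` are illustrative, not the repo's actual API:

```python
# Which domains must be generated before which (illustrative names).
DOMAIN_DEPENDENCIES = {
    "usability": [],
    "io": [],
    "description": [],
    "execution": [],
    "parametric": ["description"],  # step numbers must match the description domain
    "error": [],
}

def resolve_order(deps):
    """Return a generation order where every domain follows its dependencies."""
    order, seen = [], set()

    def visit(domain):
        if domain in seen:
            return
        for dep in deps[domain]:
            visit(dep)
        seen.add(domain)
        order.append(domain)

    for domain in deps:
        visit(domain)
    return order

def generate_all(deps, generate):
    """Generate domains in dependency order, holding prior outputs in memory
    and passing them in as context for dependent domains."""
    outputs = {}
    for domain in resolve_order(deps):
        context = {dep: outputs[dep] for dep in deps[domain]}
        outputs[domain] = generate(domain, context)
    return outputs
```

With this shape, the description domain's output is still in memory when the parametric domain's query is built, so it can be injected directly into that query.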

Update documentation

Need to update some documentation for the migration and new features that have been added.

MongoDB Backend?

Should probably remove the local evaluation data. That was meant for the proof of concept, and a true backend should probably be set up, potentially in the form of MongoDB and a simple Flask API. The domain generations (for now) can probably be kept in the repository.

Separate query

The standardized queries contain two parts:

  • The query part, which describes what is in the domain and what the domain represents.
  • The schema part, which contains the output formatting for the return response.

The first part of the query is obviously important for the semantic retrieval process, but I have a hunch that including the second part in the semantic retrieval is polluting the results. Going to try splitting the query and re-injecting the data schema part before the data is sent to the LLM.
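The split-and-re-inject idea could be sketched like this. The `SCHEMA_MARKER` string and the `retrieve`/`complete` callables are assumptions standing in for the real standardized query format and the RAG pipeline:

```python
# Assumed delimiter between the semantic query and the schema instructions.
SCHEMA_MARKER = "Return your response in the following JSON format:"

def split_query(standardized_query):
    """Split a standardized query into (semantic query, schema instructions)."""
    if SCHEMA_MARKER in standardized_query:
        query_part, schema_part = standardized_query.split(SCHEMA_MARKER, 1)
        return query_part.strip(), SCHEMA_MARKER + schema_part
    return standardized_query.strip(), ""

def answer(standardized_query, retrieve, complete):
    """Retrieve with the query part only, then re-inject the schema for the LLM."""
    query_part, schema_part = split_query(standardized_query)
    chunks = retrieve(query_part)  # schema text is kept out of retrieval
    prompt = "\n\n".join([*chunks, query_part, schema_part])
    return complete(prompt)
```

This keeps the formatting boilerplate out of the embedding/similarity step while the LLM still sees the full output contract.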

Update logging

Handle the logging directly in the objects rather than in the main.py entrypoint.
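A minimal sketch of the object-owned logging pattern; the class and method names are illustrative, not the repo's actual API:

```python
import logging

class DomainGenerator:
    """Illustrative object that owns its logger instead of relying on main.py."""

    def __init__(self):
        # Module-qualified logger; configuration lives with the object.
        self.logger = logging.getLogger(f"{__name__}.{type(self).__name__}")

    def generate(self, domain):
        self.logger.info("Generating domain: %s", domain)
        return f"{domain} output"
```

The entrypoint then only needs `logging.basicConfig(...)` (or no logging setup at all), and each object's log records carry its own qualified logger name.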

Re-work default eval check

Right now the default score check prevents erroneously saving default evaluations. Should re-work the logic to exclude some fields such as:

  • Score (being implemented here: #9)
  • JSON format error (which I want to set automatically when falling back to the raw txt file)
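The reworked check might look like the sketch below. The field names and default values are assumptions for illustration; only the exclusion logic is the point:

```python
# Assumed default evaluation values (illustrative field names).
DEFAULT_EVAL = {
    "score": 0,
    "json_format_error": False,
    "notes": "",
    "quality_rating": 0,
}

# Fields that may legitimately stay at (or be auto-set to) their defaults.
EXCLUDED_FIELDS = {"score", "json_format_error"}

def is_default_eval(evaluation, defaults=DEFAULT_EVAL, excluded=EXCLUDED_FIELDS):
    """True if every non-excluded field still holds its default value."""
    return all(
        evaluation.get(field) == default
        for field, default in defaults.items()
        if field not in excluded
    )
```

An evaluation is then only blocked from saving when the reviewer has touched nothing outside the excluded fields.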

Figure out evaluations.json file output

Thinking about creating separate scripts solely for identifying potentially erroneous eval saves, and then separately building the evaluations file from "high quality reviews" only (will have to determine those criteria).

Semantic Chunking Chunk Size Bug

LlamaIndex's SemanticSplitterNodeParser can sometimes produce chunks that are too large for the embedding model. Unfortunately, there is no max-length option for semantic chunking that avoids this issue.

Will have to eventually subclass the SemanticSplitterNodeParser and create a two-level safety net that naively splits large chunks into sub-chunks in order to stay under the embedding model's input token limit.

Reference:
run-llama/llama_index#12270
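The naive second-level fallback could look like this sketch. Whitespace tokenization stands in for the embedding model's real tokenizer, and the function name is an assumption; in practice this logic would live in the SemanticSplitterNodeParser subclass:

```python
def enforce_max_chunk_size(chunks, max_tokens):
    """Naively split any chunk exceeding max_tokens into sub-chunks.

    Safety net for semantic chunks that would overflow the embedding
    model's input token limit.
    """
    safe_chunks = []
    for chunk in chunks:
        tokens = chunk.split()  # stand-in for the real tokenizer
        if len(tokens) <= max_tokens:
            safe_chunks.append(chunk)
            continue
        # Hard-split on token boundaries, ignoring semantics.
        for start in range(0, len(tokens), max_tokens):
            safe_chunks.append(" ".join(tokens[start:start + max_tokens]))
    return safe_chunks
```

The semantic splitter still does the first pass; this only kicks in for the rare oversized chunk, trading semantic coherence for a guaranteed size bound.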

Add score to ScoreEval

Should store the score along with the score eval. Because the scoring will iterate/change, need to keep track of which score the score eval was submitted against.
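A small sketch of what storing the score alongside the eval could look like; the field names and version string are assumptions, not the repo's actual schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ScoreEval:
    """Illustrative score evaluation that records the score it was made against."""
    eval_agree: bool    # does the reviewer agree with the automated score?
    notes: str
    score: float        # the score this eval was submitted against
    score_version: str  # scoring iteration, so old evals stay interpretable

eval_record = ScoreEval(eval_agree=False, notes="score seems high", score=8.5, score_version="v1")
```

Once the scoring logic iterates, each stored eval still pins down exactly which score (and which scoring version) it was judging.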
