Orca

Orca is an LLM Orchestration Framework written in Rust. It is designed to be a simple, easy-to-use, and easy-to-extend framework for building LLM orchestrations. It is currently in development, so it may contain bugs and its functionality is limited.

About Orca

Orca is currently in development. It is hard to say what the future of Orca looks like, as I am still learning about LLM orchestration and its many applications. These are some ideas I want to explore. Suggestions are welcome!

  • WebAssembly to create simple, portable, yet powerful LLM applications that can run serverless across platforms.
  • Taking advantage of Rust for fast, memory-safe, distributed LLM applications.
  • Deploying LLMs to the edge (think IoT devices, mobile devices, etc.).

Setup

To set up Orca, you will need to install Rust; you can do this by following the official Rust installation instructions. Once you have Rust installed, add Orca to your Cargo.toml as a dependency:

[dependencies]
orca = { git = "https://github.com/scrippt-tech/orca", package = "orca-core" }

Features

  • Prompt templating using handlebars-like syntax (see example below)
  • Loading records (documents)
    • HTML from URLs or local files
    • PDF from bytes or local files
  • Vector store support with Qdrant
  • Current LLM support:
    • OpenAI Chat
    • Limited Bert support using the Candle ML framework
  • Pipelines:
    • Simple pipelines
    • Sequential pipelines

Examples

Orca supports simple and sequential LLM pipelines, as well as reading PDF and HTML records (documents); a record-loading sketch follows the chat example below.

OpenAI Chat

use orca::pipeline::simple::LLMPipeline;
use orca::pipeline::Pipeline;
use orca::llm::openai::OpenAI;
use orca::prompt::context::Context;
use serde::Serialize;

#[derive(Serialize)]
pub struct Data {
    country1: String,
    country2: String,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
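    // Create the OpenAI client; the API key is typically read from the
    // OPENAI_API_KEY environment variable (assumed; check the crate docs).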
    let client = OpenAI::new();
    let prompt = r#"
            {{#chat}}
            {{#user}}
            What is the capital of {{country1}}?
            {{/user}}
            {{#assistant}}
            Paris
            {{/assistant}}
            {{#user}}
            What is the capital of {{country2}}?
            {{/user}}
            {{/chat}}
            "#;
    let pipeline = LLMPipeline::new(&client)
        .load_template("capitals", prompt)?
        .load_context(&Context::new(Data {
            country1: "France".to_string(),
            country2: "Germany".to_string(),
        })?)?;
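    // Render the "capitals" template with the loaded context and query the model.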
    let res = pipeline.execute("capitals").await?.content();

    assert!(res.contains("Berlin") || res.contains("berlin"));
    Ok(())
}
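
Loading records

Records can also be loaded from PDF or HTML sources. Below is a minimal sketch of loading a PDF from a local file; the module paths, the boolean flag to from_file, and the spin() conversion are assumptions based on the crate's examples and may have changed, so treat this as illustrative rather than authoritative.

use orca::record::pdf::Pdf;
use orca::record::Spin;

fn main() -> anyhow::Result<()> {
    // Parse a local PDF into a record (assumed API: the boolean flag and the
    // `spin` conversion follow the crate's examples and may differ today).
    let record = Pdf::from_file("./docs/sample.pdf", false).spin()?;
    // The record can then be split into chunks or fed into a pipeline.
    let _ = record;
    Ok(())
}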

Contributing

Contributors are welcome! If you would like to contribute, please open an issue or a pull request. If you would like to add a new feature, please open an issue first so we can discuss it.

Running locally

We use cargo-make (https://github.com/sagiegurari/cargo-make) to run Orca locally. To install it, run:

cargo install cargo-make

Once you have cargo-make installed, you can build or test Orca by running:

$ makers build # Build Orca
$ makers test # Test Orca


orca's Issues

Add more models to Orca

Right now we only support a small number of models. Let's expand this! A good start would be to look into https://github.com/huggingface/candle/tree/main/candle-examples/examples and port those examples over to Orca.

Note: Copy-pasting is not enough; look at how I did it for Bert and Quantized for inspiration. Those could also be improved to expose a cleaner interface.

Edit: I am refactoring to have a models API in a separate crate, orca-models. This crate's goal is to provide an API for hosted models such as OpenAI's, as well as an easy way to use Candle transformer models. It should become the main point of development and replace orca-core's model implementations.

[suggestion] Share an example / use case as a blogpost

Completely up to you, but it could be a nice way to increase visibility and gather feedback. If you have some nice examples of how this could be useful, it would be interesting to feature them as a Community Blogpost on HF. What are your thoughts?

Refactor Prompt polymorphism

Currently, prompts implement the Prompt trait, and we use dynamic dispatch to handle them. This is not a clean implementation. The goal is to let every user implement their own prompt types and use them as they wish. Unfortunately, my mistake was hardcoding the use of ChatPrompt and String prompts into the codebase. We need to refactor this to handle generic prompts. I am thinking of replacing the Prompt trait with plain ToString and handling this accordingly; another option would be to leverage JSON. Ideas are welcome; a sketch of the ToString direction follows.
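
As a rough illustration of that direction, a pipeline method could accept any type implementing ToString, so user-defined prompt types work without trait objects or boxing. The names below are illustrative, not Orca's actual API.

struct MyPipeline;

impl MyPipeline {
    // Accept any prompt type that can render itself to a string.
    fn run<P: ToString>(&self, prompt: P) -> String {
        let rendered = prompt.to_string();
        // ... hand `rendered` to the model here ...
        rendered
    }
}

struct ChatTurn {
    role: &'static str,
    content: &'static str,
}

impl std::fmt::Display for ChatTurn {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}: {}", self.role, self.content)
    }
}

fn main() {
    let p = MyPipeline;
    // Plain strings and custom prompt types both work, with no boxing.
    println!("{}", p.run("What is the capital of Germany?"));
    println!("{}", p.run(ChatTurn { role: "user", content: "Hi" }));
}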

General solution to generate embeddings concurrently

Concurrent processing is a key strategy for speeding up embedding generation, especially across a series of texts. However, a general issue with concurrency is maintaining the correct order of the outputs. Because tasks complete asynchronously, the order of the output embeddings (Vec<Vec<f32>>) often doesn't align with the order of the input prompts (Vec<Box<dyn Prompt>>). This misalignment causes problems such as text-to-embedding mismatches when inserting into a vector database such as Qdrant.

Concurrency:

  • Rayon crate: Experiments with the Rayon crate have shown drastic speed improvements (80% faster) for Bert embeddings, but have fallen short of solving the ordering issue.
  • Tokio: Using tokio::spawn to start multiple asynchronous embedding tasks is another potential solution under exploration for async embedding contexts (such as OpenAI embedding API calls).

The goal here is to identify a solution that not only enhances processing speed but also ensures data integrity by preserving the correct order of embeddings; a sketch of one approach follows. Any insights or experiences that could help resolve this would be highly valuable.
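
One way to keep outputs aligned is to tag each input with its index before spawning, then reassemble by index. The sketch below uses tokio and a placeholder embed function standing in for the real Bert or OpenAI calls; it illustrates the pattern and is not Orca's implementation.

use anyhow::Result;

// Placeholder standing in for a real embedding call (Bert, OpenAI, etc.).
async fn embed(text: String) -> Result<Vec<f32>> {
    let _ = text;
    Ok(vec![0.0; 384])
}

// Spawn one task per input, tag each result with its input index, and
// reassemble in order so embeddings stay aligned with their prompts.
async fn embed_in_order(texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
    let handles: Vec<_> = texts
        .into_iter()
        .enumerate()
        .map(|(i, t)| tokio::spawn(async move { (i, embed(t).await) }))
        .collect();

    let mut indexed = Vec::with_capacity(handles.len());
    for handle in handles {
        let (i, res) = handle.await?;
        indexed.push((i, res?));
    }
    // Awaiting in spawn order already preserves order; the explicit sort
    // documents the invariant and survives any future reordering.
    indexed.sort_by_key(|&(i, _)| i);
    Ok(indexed.into_iter().map(|(_, v)| v).collect())
}

#[tokio::main]
async fn main() -> Result<()> {
    let embeddings = embed_in_order(vec!["a".into(), "b".into()]).await?;
    assert_eq!(embeddings.len(), 2);
    Ok(())
}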

Split OpenAI model into directory with multiple modules

It might be a good idea to split the OpenAI model into separate modules for readability and scalability (e.g., ChatCompletion and Embeddings in separate modules); one possible layout is sketched below. This would also provide a cleaner interface for users.
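
A possible layout, purely as a sketch (file and module names are illustrative):

src/llm/openai/
    mod.rs         shared client, configuration, and re-exports
    chat.rs        ChatCompletion request/response types and calls
    embeddings.rs  embedding request/response types and calls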

Taking templates to the next level

I think there is a lot of potential for templates in Orca. They could become the main point of entry and the primary interface for users to develop their LLM applications. Maybe think of them as a language for writing LLM applications(?)

I keep thinking of ideas to extend this interface. What other ways are there for the user to communicate with an LLM? For example, I think it would be possible to specify functions for the LLM to use within a template, as specified in #11 (see the sketch below).

Prompt engineering is basically pseudocoding in and of itself, so why not formalize it?
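
As a purely hypothetical illustration (no such block exists today), a function trigger could be declared with the same handlebars-like syntax used for chat roles:

{{#chat}}
{{#user}}
What is the weather in {{city}}?
{{/user}}
{{#function name="get_weather" args="city"}}{{/function}}
{{/chat}}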

Any ideas, thoughts, or comments are more than welcome.

Add sqlite-vss wrapper

I'm thinking of using this project, but I'd like to use SQLite instead of Qdrant. Would you accept such a PR?

Integrate functions for LLMs to use

I want to integrate the use of functions for LLMs. Inspired by the ReAct framework, this would allow users to pass in a function they would like the LLM to use when a trigger happens (i.e., the LLM decides it needs to use it). I am holding off on implementing this for now, since I want a clear idea of a function's lifetime in the context of Orca.

Even cooler would be the ability to pass the function's trigger through a template: similar to how you load records or context, you would load the function for the LLM to use, and it would be interfaced through the template.
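
A rough sketch of what a function registry might look like; every name here is hypothetical, and the real design would need to settle the lifetime questions above first.

use std::collections::HashMap;

// A tool is just a named function the LLM may trigger.
type Tool = fn(&str) -> String;

struct ToolRegistry {
    tools: HashMap<String, Tool>,
}

impl ToolRegistry {
    fn new() -> Self {
        Self { tools: HashMap::new() }
    }

    // Register a function the LLM may trigger by name.
    fn register(&mut self, name: &str, tool: Tool) {
        self.tools.insert(name.to_string(), tool);
    }

    // Invoked when the model's output signals a trigger, e.g. get_weather("Berlin").
    fn call(&self, name: &str, arg: &str) -> Option<String> {
        self.tools.get(name).map(|tool| tool(arg))
    }
}

fn main() {
    let mut registry = ToolRegistry::new();
    registry.register("get_weather", |city| format!("It is sunny in {city}"));
    assert_eq!(
        registry.call("get_weather", "Berlin").as_deref(),
        Some("It is sunny in Berlin")
    );
}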

Any thoughts, advice, or ideas are more than welcome.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.