edgenai / edgen

⚡ Edgen: Local, private GenAI server alternative to OpenAI. No GPU required. Run AI models locally: LLMs (Llama2, Mistral, Mixtral...), Speech-to-text (whisper) and many others.

Home Page: https://docs.edgen.co/

License: Apache License 2.0


edgen's People

Contributors

deadstrobe5, dependabot[bot], francis2tm, migaloco, opeolluwa, pedro-devv, toschoo

edgen's Issues

feat: Chat Completions Status

Description

Every AI endpoint shall have a "status" that shows state information concerning this endpoint. State information shall contain:

  • The active model
  • Download progress if ongoing (prio!)
  • Completions state
  • State of last activity
  • Last errors

Scope

This issue is about getting started with the chat/completions endpoint.

Minimal acceptance criterion is to have the download progress working.
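
As a rough illustration (not a committed API), the status payload for this endpoint could be a serializable struct along these lines; all field names here are hypothetical:

  use serde::Serialize;

  /// Illustrative status payload for the chat/completions endpoint.
  /// All field names are hypothetical, not the final API.
  #[derive(Serialize)]
  struct ChatCompletionsStatus {
      /// The currently active model.
      active_model: String,
      /// Download progress in percent, if a download is ongoing.
      download_progress: Option<f32>,
      /// Whether a completion is currently being generated.
      completions_ongoing: bool,
      /// UNIX timestamp of the last activity on this endpoint.
      last_activity: Option<u64>,
      /// The most recent errors, newest last.
      last_errors: Vec<String>,
  }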

feat: support model in request

This already happens with the default model, i.e. if there's no model present in an endpoint, edgen auto-downloads the default model.
The same behaviour should apply to the requested model (i.e. the value of the model attribute in the request).

The format of model should be: <hf_repo_owner>/<hf_repo>/<model_name>

Example:

TheBloke/deepseek-coder-6.7B-instruct-GGUF/deepseek-coder-6.7b-instruct.Q5_K_M.gguf

If the requested model is not valid, return an error.
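
For illustration, a minimal sketch of how such a model string could be validated and split; the helper name and error type are made up:

  /// Hypothetical helper: split a "<hf_repo_owner>/<hf_repo>/<model_name>"
  /// string into the Hugging Face repo and the file name, rejecting
  /// anything that does not have all three parts.
  fn parse_model(model: &str) -> Result<(String, String), String> {
      let parts: Vec<&str> = model.splitn(3, '/').collect();
      match parts.as_slice() {
          [owner, repo, file]
              if !owner.is_empty() && !repo.is_empty() && !file.is_empty() =>
          {
              Ok((format!("{owner}/{repo}"), file.to_string()))
          }
          _ => Err(format!("invalid model specifier: {model}")),
      }
  }

With the example above, this would yield the repo "TheBloke/deepseek-coder-6.7B-instruct-GGUF" and the file "deepseek-coder-6.7b-instruct.Q5_K_M.gguf".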

Add one-shot LLM requests

Some requests don't require a session to be kept. A new optional parameter should be added to chat completions that will keep the new LLM context from being saved.
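
A minimal sketch of what the request change could look like; the field name one_shot is hypothetical:

  /// Illustrative request fragment: an opt-in flag that prevents the
  /// context created for this request from being saved as a session.
  #[derive(serde::Deserialize)]
  struct ChatCompletionsRequest {
      model: String,
      // ... existing OpenAI-compatible fields ...
      /// Hypothetical optional parameter, defaulting to false.
      #[serde(default)]
      one_shot: bool,
  }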

feat(GUI): add log window

Add a "Open Log Window" so that we can see what is happening. This is useful when to know what is happening. For example it may be taking time to load the model so being able to see this logs would be great.

feat: model_present in status

Problem

Client applications want or even need to know if the current default model (from the config) is present or not, i.e. if a download will be necessary when the model is used. Currently, client applications abuse the progress field from the status, but that is not reliable.

Solution

Provide a "model_present" flag in the status. This flag is set when edgen starts, when the config changes, and after a download. To test whether the model is present, the hf-hub client should be used.
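
A minimal sketch of the check itself, using a plain file-system lookup as an illustration; the real implementation should go through the hf-hub client as stated above:

  use std::path::Path;

  /// Hypothetical helper: true if the configured default model file is
  /// already on disk, i.e. no download would be needed when it is used.
  fn model_present(models_dir: &Path, model_file: &str) -> bool {
      models_dir.join(model_file).is_file()
  }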

Windows release not launching server

On Windows, when the release build is installed and executed, the server isn't properly launched. (There is no process listening on the designated server port.)

When running through cargo run, everything works correctly.

Tokio runtime panicking due to `llama_cpp::LlamaSession::context_size` using `block_on`

Hi maintainers, I am not entirely sure if this is edgen's, llama_cpp-rs or my problem so I apologize in advance if I miss something obvious here.

I cloned from main (936a45afedbb208a177038c7379341a52b911786) to build from source, then did a release run to serve. Axum started listening correctly, and everything looks good:

RUST_BACKTRACE=full CUDA_ROOT=/usr/include/_remapped cargo run --features llama_cuda --release -- serve -g -b http://my_host:54321

However, if I ping the v1/chat/completions endpoint as per the example, the following panic occurs:

thread 'tokio-runtime-worker' panicked at /home/my_username/.cargo/git/checkouts/llama_cpp-rs-b9d51cabb4b43824/1141010/crates/llama_cpp/src/lib.rs:887:27:
Cannot block the current thread from within a runtime. This happens because a function attempted to block the current thread while the thread is being used to drive asynchronous tasks.

The relevant stack backtraces are:

  ...
  18:     0x5593475ee443 - core::option::expect_failed::h0d6627132effeebe
                               at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/core/src/option.rs:1985:5
  19:     0x559347ff727d - tokio::future::block_on::block_on::h57b11492fc0329f5
  20:     0x559347ff2d21 - llama_cpp::LlamaSession::context_size::h7a5ce26c608b13f9
  21:     0x559347a3d8b4 - llama_cpp::LlamaSession::start_completing_with::h9d25e453a5e19b0a
  22:     0x5593479b3576 - <core::pin::Pin<P> as core::future::future::Future>::poll::hf7d715e681ad5cb3
  23:     0x559347a0585f - <F as axum::handler::Handler<(M,T1),S>>::call::{{closure}}::h650dac4864c27520
  ...
  39:     0x559347a463af - tokio::runtime::task::harness::Harness<T,S>::poll::h3ae60ec675d41951
  40:     0x5593481ecd93 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h5f54c3255e052ad4
  41:     0x5593481e6c14 - tokio::runtime::context::scoped::Scoped<T>::set::h4261ff6b5d63c680
  42:     0x5593481af564 - tokio::runtime::context::runtime::enter_runtime::hb3189b7b6d405fc5
  43:     0x5593481ecaac - tokio::runtime::scheduler::multi_thread::worker::run::h11ac423a0e22051e
  ...

This occurs with or without --features llama_cuda.

This does not occur, however, if I checkout v0.1.2 instead, which produces a token-by-token output:

data: {"id":"856034ed-0e9e-4a56-94de-c9cb0be6cc90","choices":[{"delta":{"content":"Hello","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}

data: {"id":"95a36b4b-4c7b-42fd-aa90-9fbfca770057","choices":[{"delta":{"content":"!","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}

data: {"id":"24483190-b48d-48cc-ad0b-7459be014740","choices":[{"delta":{"content":" How","role":null},"finish_reason":null,"index":0}],"created":1708805524,"model":"main","system_fingerprint":"edgen-0.1.2","object":"text_completion"}

...

And so it goes on, generating the sentence "Hello! How can I assist you today?", which I assume is the expected behaviour.

Looking at the exception, it seems like it came from the fact that llama_cpp::LlamaSession::context_size internally started calling block_on last week, which the existing tokio::Runtime didn't appreciate.

Is this a bug, and if not, can anyone point me in the right direction here? Many thanks in advance.

System

Debian GNU/Linux 11 (bullseye) x86_64
rustc 1.75.0-beta.3 (b66b7951b 2023-11-20) as per rust-toolchain.toml
(same happens to 1.76 stable anyway)
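
For context, the usual way to avoid "cannot block the current thread from within a runtime" is to move the blocking call off the async worker; whether that is the right fix here depends on where the block_on call lives (inside llama_cpp-rs rather than edgen). A generic sketch, not a proposed patch:

  /// Run a blocking closure from async code without blocking a Tokio
  /// worker thread, by handing it to the blocking thread pool.
  async fn call_blocking<T, F>(f: F) -> T
  where
      F: FnOnce() -> T + Send + 'static,
      T: Send + 'static,
  {
      tokio::task::spawn_blocking(f)
          .await
          .expect("blocking task panicked")
  }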

epic: candle integration - image generation

To get to a fully functional Image Generation endpoint, the following is required:

  1. Design the API
    • Design a new Image Generation endpoint interface, since OpenAI does not have one we can mimic. This should be the first step, as everything else depends on this.
  2. Implement a backend
  3. Edgen implementation
    • Define an Image Generation endpoint that follows the API directly and translates the API calls to backend calls. This may be implemented along the lines of the already existing LLM and Whisper traits in edgen_core (a rough sketch follows this list).
    • Implement the model endpoint according to the LLM and Whisper endpoints in edgen_server.
    • Implement the loading and downloading logic for the new model kind in edgen_server/src/model.rs.
    • Implement the new endpoint using the backend. edgen_server/src/openai_shims.rs may provide some inspiration.
    • Implement new server routes for the API and connect it to the endpoint (in edgen_server/src/routes.rs).
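
A rough sketch of what the edgen_core trait for step 3 could look like, modeled loosely on the description above; the trait name, method and error type are all hypothetical:

  use std::path::Path;

  /// Hypothetical trait for an image generation backend in edgen_core,
  /// analogous to the existing LLM and Whisper traits.
  #[async_trait::async_trait]
  trait ImageGenerationBackend {
      type Error: std::error::Error + Send + Sync;

      /// Generate `n` images for `prompt` using the model at `model_path`,
      /// returning the raw encoded images (e.g. PNG bytes).
      async fn generate(
          &self,
          model_path: &Path,
          prompt: &str,
          n: u32,
      ) -> Result<Vec<Vec<u8>>, Self::Error>;
  }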

RAM and VRAM monitoring

Edgen should be capable of monitoring the current RAM and VRAM usage. This will allow further functionality, such as avoiding program crashes due to OOM allocations.
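
A minimal sketch of the RAM side, assuming a recent version of the sysinfo crate; VRAM would need a backend- or vendor-specific API and is left out here:

  use sysinfo::System;

  /// Hypothetical helper returning (used, total) system RAM in bytes.
  fn ram_usage() -> (u64, u64) {
      let mut sys = System::new();
      sys.refresh_memory();
      (sys.used_memory(), sys.total_memory())
  }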

chore: improve error handling

Problem

In the code we often use unwrap and expect to handle errors. The advantage is that it is easy and provides detailed error information for us to identify and solve problems. However, it terminates the thread in which the error occurred, thereby reducing the functionality of the process, probably without the user being aware of it. We should use this strategy only if there is no better way to handle the error (e.g. we would lose error information) or when the thread is disposable (e.g. a thread serving a connection).

Solution

We should first distinguish different classes of threads:

  • The main thread: if the main thread unwraps, the program would exit. This is sometimes what we want, namely, in the case of an unrecoverable error. If the error is recoverable, however, we want to recover and continue.
  • Threads that are started directly from the main thread, like the listeners: if this kind of thread terminates, the process would continue but we lose a capability, for instance, listening on a URI. In this case, we should terminate the program.
  • Temporary threads that serve a specific purpose, like serving a connection to an endpoint: if this kind of thread terminates, the connection closes. But nothing severe happens; the next connection will start a new thread.
  • Handlers and callbacks (like the update handler for the config file): if this kind of thread terminates, we lose a capability, for instance, the refresh of settings.

Main Thread

  • Check error handling in the main thread;
  • For every unwrap-style error handling, think of a better solution and, if found, implement it. Otherwise, provide a comment in the code explaining why this is the best solution.

Listeners

  • If a listener unwraps, terminate the main thread.

Connection Handlers

  • If the connection is open, send an error response to the client (usually 500) with an adequate error description in the body (see the sketch below). If the error is technical, issue an intelligible warning.
  • Otherwise, terminate the thread with an intelligible warning.
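
A minimal sketch of the "respond with 500 instead of unwrapping" rule for an axum handler; the handler and do_work are illustrative stand-ins for the real endpoint logic:

  use axum::{http::StatusCode, Json};

  /// Illustrative handler: map an internal error to a 500 response with a
  /// readable body instead of unwrapping and killing the connection task.
  async fn handler() -> Result<Json<String>, (StatusCode, String)> {
      let result = do_work().map_err(|e| {
          tracing::warn!("request failed: {e}");
          (StatusCode::INTERNAL_SERVER_ERROR, format!("internal error: {e}"))
      })?;
      Ok(Json(result))
  }

  /// Hypothetical fallible operation standing in for the real endpoint logic.
  fn do_work() -> Result<String, std::io::Error> {
      Ok("ok".to_string())
  }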

Other Handlers and Callbacks

  • We should avoid terminating the thread. In some cases, that may be impossible. In those cases issue an error log.
  • Be aware that, in the case of the config, user activity may be involved. A user may edit the config file by hand and save an inconsistent version (e.g. a missing attribute). We should give the user enough time to solve this issue, e.g. ignore the issue and wait for the next event. Together with the rule on the main thread, this should lead to consistent handling: if the config remains unreadable, we don't read it. When the program is terminated and, later, started again, the settings will be read from the main thread. It is then safe to just abandon.

WIP

The issue of error handling does not go away. There is, unfortunately, no automatic mechanism that would always apply the appropriate handling. We need to think about error handling while coding and, inevitably, bad handling will be introduced again. (Remark: this is why the designers of Go decided not to add an easy fallback solution like unwrap. They want programmers to think about error handling every time.) In the long run, some reflections on error handling should be added to a future coding guide.

windows config uses / and \ for paths

It works, but it would be good for the path separators to be consistent when the default config is created.

chat_completions_models_dir: C:\Users\prshrest\AppData\Roaming\EdgenAI\Edgen\data\models/chat/completions
audio_transcriptions_models_dir: C:\Users\prshrest\AppData\Roaming\EdgenAI\Edgen\data\models/audio/transcriptions
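
A minimal sketch of how the defaults could be built so every separator comes from the platform, assuming the data directory is already known; names are illustrative:

  use std::path::{Path, PathBuf};

  /// Hypothetical helper: build the default chat completions model
  /// directory via PathBuf::join so the platform separator is used
  /// consistently on Windows.
  fn default_chat_models_dir(data_dir: &Path) -> PathBuf {
      data_dir.join("models").join("chat").join("completions")
  }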

refactor: agnostic to ML backend

Overview

In the future, we want to support multiple ML backends for each endpoint (a rough sketch of the abstraction follows the lists below).

Example:
chat/completions can use:

  • llama.cpp
  • candle
  • tensorrt

audio/transcriptions can use:

  • whisper.cpp
  • candle

images/generations can use:

  • candle
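
A rough sketch of one way to decouple an endpoint from its backend; the trait, error type and backend names below are illustrative, not a committed design:

  /// Hypothetical abstraction: a chat completions endpoint only talks to
  /// this trait; llama.cpp, candle and tensorrt would each provide an impl.
  #[async_trait::async_trait]
  trait ChatCompletionsBackend: Send + Sync {
      async fn complete(&self, prompt: &str) -> Result<String, BackendError>;
  }

  /// Illustrative error type shared across backends.
  #[derive(Debug, thiserror::Error)]
  #[error("backend error: {0}")]
  struct BackendError(String);

  /// The server would pick a backend at runtime, e.g. based on the config.
  fn select_backend(name: &str) -> Option<Box<dyn ChatCompletionsBackend>> {
      match name {
          // "llama_cpp" => Some(Box::new(LlamaCppBackend::default())), // hypothetical impl
          // "candle" => Some(Box::new(CandleBackend::default())),      // hypothetical impl
          _ => None,
      }
  }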

Failure to build on macOS due to missing `objcopy`.

Despite binutils being installed, the compilation on macOS fails with objcopy not found. Setting the path explicitly as suggested doesn't seem to help either.

  Looking for "objcopy" or an equivalent tool

  --- stderr
  thread 'main' panicked at /Users/evariste/.cargo/git/checkouts/llama_cpp-rs-b9d51cabb4b43824/bc38e77/crates/llama_cpp_sys/build.rs:521:54:
  No suitable tool equivalent to "objcopy" has been found in PATH, if one is already installed, either add it to PATH or set OBJCOPY_PATH to its full path
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

chore: more unit tests

Problem

There are still not enough unit tests to cover our requirements. For example, we shall test (an illustrative test follows the list):

  • that all parameters that are passed to endpoints are applied correctly,
  • that error handling is reasonable,
  • that the model abstraction works (edgen_server/models),
  • that all endpoints work (here, for unit testing, mocks would be a nice-to-have)
  • that the cli is processed correctly,
  • etc.
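
As an illustration of the style of test we want more of, a hypothetical parameter test; parse_model is the hypothetical helper sketched under "feat: support model in request" above:

  #[cfg(test)]
  mod tests {
      use super::*;

      /// Illustrative only: malformed model specifiers must be rejected.
      #[test]
      fn rejects_model_without_repo_owner() {
          assert!(parse_model("just-a-file.gguf").is_err());
      }
  }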

bug: audio transcriptions fails with "failed to initialize the whisper context"

curl http://localhost:33322/v1/audio/transcriptions   -H "Authorization: Bearer no-key-required"   -H "Content-Type: multipart/form-data"   -F file="@/Users/prabirshrestha/Downloads/frost.wav"   -F model="default"

log:

2024-02-12 23:43:51.354 Edgen[19183:19652515] WARNING: Secure coding is not enabled for restorable state! Enable secure coding by implementing NSApplicationDelegate.applicationSupportsSecureRestorableState: and returning YES.
2024-02-13T07:44:02.282828Z ERROR whisper_cpp::internal: ggml: whisper_model_load: tensor 'encoder.conv1.weight' has wrong shape in model file: got [80, 768, 1], expected [3, 80, 768]
2024-02-13T07:44:02.282846Z ERROR whisper_cpp::internal: ggml: whisper_init_with_params_no_state: failed to load model

config:

audio_transcriptions_models_dir: /Users/prabirshrestha/code/llm/audio_transcriptions
audio_transcriptions_model_name: ggml-distil-small.en.bin
audio_transcriptions_model_repo: distil-whisper/distil-small.en

OS: MacOS M3, Sonoma 14.2.1

feat(GUI): redirect users to EdgenChat

Introduction

Right now, the GUI is hidden, and many users who run Edgen think it didn't load, since only the system tray icon pops up.
Let's make the default behavior to show the GUI and leave a button as a call-to-action for users to try out EdgenChat.

feat: persistent logs

In order to give users runtime logs before we figure out how to do that via the GUI, let's allow users to easily see logs via a file in DATA_DIR or CACHE_DIR.
This issue also solves #38.
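
A minimal sketch, assuming the tracing-appender crate and a daily-rotated log file under the data dir; the path and file name are illustrative:

  use std::path::Path;

  /// Hypothetical setup: write logs to <data_dir>/logs/edgen.log with daily
  /// rotation. The returned guard must be kept alive for the lifetime of the
  /// program, otherwise buffered log lines may be lost.
  fn init_file_logging(data_dir: &Path) -> tracing_appender::non_blocking::WorkerGuard {
      let file_appender = tracing_appender::rolling::daily(data_dir.join("logs"), "edgen.log");
      let (writer, guard) = tracing_appender::non_blocking(file_appender);
      tracing_subscriber::fmt().with_writer(writer).init();
      guard
  }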

bug: thread 'notify-rs poll loop' panicked

Introduction

After a fresh install, edgen sometimes crashes before or after downloading the default chat/completions model.

Using nix env.

Edit: It is not deterministic.

Log

thread 'notify-rs poll loop' panicked at crates/edgen_rt_llama_cpp/src/lib.rs:123:30:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
stack backtrace:
   0: rust_begin_unwind
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/core/src/panicking.rs:72:14
   2: tokio::task::spawn::spawn_inner::panic_cold_display
   3: tokio::task::spawn::spawn
   4: <edgen_rt_llama_cpp::LlamaCppEndpoint as core::default::Default>::default
   5: once_cell::imp::OnceCell<T>::initialize::{{closure}}
   6: once_cell::imp::initialize_or_wait
   7: once_cell::imp::OnceCell<T>::initialize
   8: futures_executor::local_pool::block_on
   9: edgen_server::run_server::{{closure}}::{{closure}}
  10: <edgen_core::settings::UpdateHandler as notify::EventHandler>::handle_event
  11: notify::poll::data::WatchData::rescan
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
zephyr-7b-beta.Q4_K_M.gguf [00:03:39] [████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████] 4.07 GiB/4.07 GiB 18.97 MiB/s (0s)
thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/once_cell-1.19.0/src/lib.rs:1311:25:
Lazy instance has previously been poisoned
stack backtrace:
   0: rust_begin_unwind
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/core/src/panicking.rs:72:14
   2: once_cell::imp::OnceCell<T>::initialize::{{closure}}
   3: once_cell::imp::initialize_or_wait
   4: once_cell::imp::OnceCell<T>::initialize
   5: <F as axum::handler::Handler<(M,T1),S>>::call::{{closure}}
   6: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   7: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   8: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
   9: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
  10: <tower_http::cors::ResponseFuture<F> as core::future::future::Future>::poll
  11: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  12: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  13: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  14: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
  15: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
  16: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
  17: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
  18: <hyper::server::conn::http1::UpgradeableConnection<I,S> as core::future::future::Future>::poll
  19: <hyper_util::server::conn::auto::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll
  20: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
  21: std::panicking::try
  22: tokio::runtime::task::harness::Harness<T,S>::poll
  23: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  24: tokio::runtime::context::scoped::Scoped<T>::set
  25: tokio::runtime::context::runtime::enter_runtime
  26: tokio::runtime::scheduler::multi_thread::worker::run
  27: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
  28: tokio::runtime::task::core::Core<T,S>::poll
  29: tokio::runtime::task::harness::Harness<T,S>::poll
  30: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
thread 'tokio-runtime-worker' panicked at ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/once_cell-1.19.0/src/lib.rs:1311:25:
Lazy instance has previously been poisoned
stack backtrace:
   0: rust_begin_unwind
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/core/src/panicking.rs:72:14
   2: once_cell::imp::OnceCell<T>::initialize::{{closure}}
   3: once_cell::imp::initialize_or_wait
   4: once_cell::imp::OnceCell<T>::initialize
   5: <F as axum::handler::Handler<(M,T1),S>>::call::{{closure}}
   6: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   7: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   8: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
   9: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
  10: <tower_http::cors::ResponseFuture<F> as core::future::future::Future>::poll
  11: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  12: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  13: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  14: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
  15: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
  16: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
  17: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
  18: <hyper::server::conn::http1::UpgradeableConnection<I,S> as core::future::future::Future>::poll
  19: <hyper_util::server::conn::auto::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll
  20: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
  21: std::panicking::try
  22: tokio::runtime::task::harness::Harness<T,S>::poll
  23: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  24: tokio::runtime::context::scoped::Scoped<T>::set
  25: tokio::runtime::context::runtime::enter_runtime
  26: tokio::runtime::scheduler::multi_thread::worker::run
  27: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
  28: tokio::runtime::task::core::Core<T,S>::poll
  29: tokio::runtime::task::harness::Harness<T,S>::poll
  30: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Additionally, this bug also happens when resetting the config through Tauri. Sometimes it works with no problem; other times, it does not:

2024-01-30T23:24:29.127498Z  INFO edgen_core::settings: Creating new settings file: ~/.config/edgen/edgen.conf.yaml
2024-01-30T23:24:29.999864Z  INFO edgen_server: Thread has exited
2024-01-30T23:24:29.999875Z  INFO edgen_server: All threads have exited; exiting normally
thread 'notify-rs poll loop' panicked at crates/edgen_rt_llama_cpp/src/lib.rs:123:30:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
stack backtrace:
   0: rust_begin_unwind
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/b66b7951b9b4258fc433f2919e72598fbcc1816e/library/core/src/panicking.rs:72:14
   2: tokio::task::spawn::spawn_inner::panic_cold_display
   3: tokio::task::spawn::spawn
   4: <edgen_rt_llama_cpp::LlamaCppEndpoint as core::default::Default>::default
   5: once_cell::imp::OnceCell<T>::initialize::{{closure}}
   6: once_cell::imp::initialize_or_wait
   7: once_cell::imp::OnceCell<T>::initialize
   8: futures_executor::local_pool::block_on
   9: edgen_server::run_server::{{closure}}::{{closure}}
  10: <edgen_core::settings::UpdateHandler as notify::EventHandler>::handle_event
  11: notify::poll::data::WatchData::rescan
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-01-30T23:24:30.036796Z  INFO edgen_server: Settings have been updated, resetting environment
2024-01-30T23:24:30.036845Z  INFO edgen_server: Using default URI
2024-01-30T23:24:30.036982Z  INFO edgen_server: Listening in on: http://127.0.0.1:33322

Full featured chat endpoint

At the moment, several parameters are ignored in the chat endpoint; they need to be forwarded to the endpoint implementation.
Additionally, the responses return data such as usage without actually acquiring that data; the endpoint implementations should return it.

panic on model loading in edgen_rt_llama_cpp

Description

This line, very sporadically, causes a panic.

Solution

I guess that the problem is the lazy implementation of UnloadingModel, but I couldn't prove it yet. The bug is simply too rare. If this is the problem, however, a retry should solve the issue.

Remark

The code in Whisper is similar and, whatever solution is found for LLM, it should also be applied there.

how do I build edgen locally on Mac

What is the correct way to build edgen locally on a Mac with Metal?

git clone https://github.com/edgenai/edgen.git
cd edgen/edgen
npm run tauri build

This seems to always crash with a segfault, with or without the llama_metal feature. It used to work before but has been failing recently.

cargo run --release --features llama_metal -- serve
   Compiling edgen v0.1.3 (/Users/username/code/tmp/edgen/edgen/src-tauri)
    Finished release [optimized] target(s) in 3.10s
     Running `/Users/username/code/tmp/edgen/target/release/edgen serve`
Segmentation fault: 11
curl http://localhost:33322/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key-required" -d '{
  "model": "default",
  "messages": [
    {
      "role": "system",
      "content": "You are EdgenChat, a helpful AI assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ]
}'

I'm using the default config and have reset it too.

audio/transcriptions may specify language

I see that both edgen_rt_whisper_cpp::TranscriptionArgs and whisper_cpp::WhisperParams offer a language field, which I hope means that specifying the desired language should work when using /v1/audio/transcriptions.

feat: multimodal chat completions

Introduction

Llama.cpp already supports multiple multimodal models. We need to adapt the current chat completions endpoint to also be able to receive images and pass them on to a multimodal model.

Pass stop words directly to backend

Right now, if a stop word is reached, the stop word itself is still added to the context and to the SessionId, because StoppingStream is a wrapper over the completion stream. This causes the following prompt for the same session to need a whole new session to be created, because the original one (containing the stop word) no longer matches, and it may cause issues with inference due to the extra token that should be getting ignored.

Feat: config file parameter

I propose that we add a command line parameter --config-file that changes the location where edgen looks for a config file. That would ease integration testing and make edgen more flexible in general.
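
A minimal sketch of the flag with clap's derive API; the flag name is as proposed, everything else is illustrative:

  use clap::Parser;
  use std::path::PathBuf;

  /// Illustrative CLI fragment: an optional override for the config file
  /// location; when absent, edgen falls back to the default path.
  #[derive(Parser)]
  struct Cli {
      /// Path to an alternative config file.
      #[arg(long = "config-file")]
      config_file: Option<PathBuf>,
  }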

Refactor openai_shim.rs

This file is needlessly bloated and complex; its contents should be split across several smaller files.

Allow using ~/ in models dir

It seems like right now it needs to be an absolute path. It would be great if ~/ were supported so we don't need to hardcode the username in the path.
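
A minimal sketch of the expansion, assuming the dirs crate (or anything equivalent) is available:

  use std::path::PathBuf;

  /// Hypothetical helper: expand a leading "~/" to the user's home
  /// directory; other paths are returned unchanged.
  fn expand_tilde(path: &str) -> PathBuf {
      match path.strip_prefix("~/") {
          Some(rest) => dirs::home_dir()
              .map(|home| home.join(rest))
              .unwrap_or_else(|| PathBuf::from(path)),
          None => PathBuf::from(path),
      }
  }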

chore: integration tests for settings

Problem

Settings and the handling of model directories, together with other features of edgen, became quite complex. Several bugs were introduced during the quick progress towards MVP1. Testing this behaviour is, on the other hand, time-consuming and error-prone. It is therefore urgent to provide integration tests for the different scenarios of settings and model directory management.

Scenarios

The following scenarios shall be tested:

  • edgen starts without config directory and model directories
  • edgen starts with config, but without model directories
  • edgen starts without config, but with model directories
  • model directories are removed while edgen is running
  • model directories are changed while edgen is running
  • config file is removed while edgen is running
  • config file is changed while edgen is running
  • edgen starts after config reset
  • edgen works without model file present
  • edgen works with model file present
  • edgen works with model file from huggingface

Verdicts

In all these scenarios, the following conditions shall hold:

  • /misc/version endpoint is working
  • /chat/completions endpoint is working
  • /audio/transcriptions endpoint is working
  • /chat/completions/status endpoint is working
  • /audio/transcriptions/status endpoint is working
  • Download progress is correctly reported for completions and transcriptions
  • The active model in the status endpoints is correctly reported
  • config updates are correctly reflected

Local Environment

A specific challenge for this task is that all these operations impact the local environment: they remove or change the config file, and they remove and add models. This is annoying. The test environment shall, therefore, back up the environment before the tests start and restore it afterwards. This affects:

  • ~/.config/edgen
  • ~/.local/edgen/models

User-defined model directories do not need to be considered. Instead, the tests shall start with a fresh configuration that uses the default model directories.
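
A rough sketch of the backup/restore guard described above; the struct and the backup naming scheme are illustrative:

  use std::path::PathBuf;

  /// Hypothetical test fixture: move the real config and model directories
  /// aside before the tests and restore them afterwards (on Drop).
  struct EnvBackup {
      moved: Vec<(PathBuf, PathBuf)>, // (original location, backup location)
  }

  impl EnvBackup {
      fn new(dirs: &[PathBuf]) -> std::io::Result<Self> {
          let mut moved = Vec::new();
          for dir in dirs {
              if dir.exists() {
                  let backup = dir.with_extension("edgen-test-backup");
                  std::fs::rename(dir, &backup)?;
                  moved.push((dir.clone(), backup));
              }
          }
          Ok(Self { moved })
      }
  }

  impl Drop for EnvBackup {
      fn drop(&mut self) {
          for (original, backup) in self.moved.drain(..) {
              let _ = std::fs::remove_dir_all(&original); // discard test artifacts
              let _ = std::fs::rename(backup, original);
          }
      }
  }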
