
Comments (10)

nvms commented on June 27, 2024

I made an experimental Koboldcpp provider: https://github.com/synw/wingman/blob/main/src/providers/koboldcpp.ts. I did it to be able to run inference from Wingman queries on my phone, which has 8 GB of RAM, using small models (Koboldcpp is the only thing that runs on my phone).

This is great! I also have koboldcpp running locally occasionally, so I would be happy to test this out.

It would be nice to be able to switch providers depending on your prompt commands. I have different local servers running different backends: Goinfer on Linux, Koboldcpp on Android. I would like to be able to submit a query to one or the other depending on the prompt. It would be nice if the command could specify the provider.

Yes, absolutely! I have plans to support selecting a provider (and maybe other things like template format) from the UI/command panel. The more providers and template formats that wingman supports, the more unreasonable it becomes to force the user to define these as a per-command configuration. These things should probably be selected from the UI.

This will be a pretty fundamental change to the extension itself. Not a big deal, but a lot of things will change related to how commands are defined, how the UI is built, etc.

By the way, what's the plan for this extension: do you want to develop it further and maintain it, or not really? I am wondering because I am suggesting many changes and improvements, and they might not fit your plans.

Updates have been slow lately (I have a day job), but I'm still planning on maintaining it. I will likely make the changes I mentioned above soon. I have the UI somewhat prototyped already, but nothing is final.

As you make changes in your branch that you believe are ready, I'd be happy to review any PR you want to open. This way, your changes will be included in the refactor I mentioned above.

nvms commented on June 27, 2024

Just an update: this refactor is well on its way. Here are some screenshots of the somewhat rough implementation of the third panel, which is primarily used for configuring the provider, completion params, and other relevant things like the template format and API URL:

(Screenshots of the configuration panel, taken 2023-08-29.)

synw commented on June 27, 2024

Nice: a way to select the provider was missing (I just set a provider as default in my branch to test my new ones). The question of prompt templates is also crucial if you want any chance of getting decent results from one model or another, so it's good that we will have support for this.

In my branch I added a default template and a default context size setting because I really need these features, but I am going to remove them and prepare a PR that can merge into the current code. Since we are discussing refactoring and improvements, here is a list of the most important features I would need in order to use a local model:

  • Model config: support for the context window size is important: this way we can tokenize the prompt, calculate the number of tokens to predict (and remove the hardcoded 1000), and send this to the API ( #17 ). I will provide a utility to do this in my branch: given the prompt and the context window size, it returns the number of tokens to predict (see the sketch after this list). [Edit]: we may also sometimes need important model config params, like rope scaling for some models
  • Prompt templating support: I see that this is on the way, excellent. We should have the ability to add custom templates, for example when using exotic models that require a specific format
  • Server config: the ability to associate a server config with a prompt. A server config would be a URL plus a provider. This way prompts could be dispatched to different servers and models. For example, I could have one server running Code Llama Python for the Python prompts, and another for more general code or whatever
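
As a rough sketch of what that utility could look like (assuming llama-tokenizer-js for the token count; the function name and the reserve margin are illustrative, not the actual branch code):

import llamaTokenizer from "llama-tokenizer-js";

// Given the prompt and the model's context window size, return how many
// tokens are left for the model to predict.
function tokensToPredict(prompt: string, ctxSize: number, reserve = 8): number {
  const promptTokens = llamaTokenizer.encode(prompt).length;
  // Keep a small reserve so the prompt plus special tokens never overflow the context.
  return Math.max(ctxSize - promptTokens - reserve, 0);
}

// Example: with a 2048-token context window and a 600-token prompt,
// this yields about 1440 tokens to predict instead of the hardcoded 1000.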

synw commented on June 27, 2024

Here is the PR #22: please review the code. It will be usable once we have the setting to select the default provider that you are working on.

nvms commented on June 27, 2024

Looking at the PR now.

Model config: support for the context window size is important: this way we can tokenize the prompt, calculate the number of tokens to predict (and remove the hardcoded 1000), and send this to the API ( #17 ). I will provide a utility to do this in my branch: given the prompt and the context window size, it returns the number of tokens to predict

I agree this is very useful, but: how will we infer which tokenizer we need to use at runtime? AFAIK js-tiktoken works fine for estimating token count for ChatGPT, but is not accurate for llama models, which is where llama-tokenizer-js comes in. Do we need to allow the user to choose a tokenizer in the UI? Maybe the Provider interface should include something like a tokenizePrompt(prompt: string): number method, forcing a provider implementation to specify a tokenizer? This is something I have not given much thought, but I'm leaning towards the second idea.
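
Roughly, the second idea would mean something like the sketch below (a hypothetical shape only; the real Provider interface has other members, and the llama-tokenizer-js call is just one possible backing):

// Hypothetical: each provider is forced to declare how it counts prompt tokens.
interface Provider {
  // ...existing provider members omitted...
  tokenizePrompt(prompt: string): number;
}

// A llama-backed provider could satisfy it with llama-tokenizer-js, e.g.:
//   tokenizePrompt(prompt) { return llamaTokenizer.encode(prompt).length; }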

Prompt templating support: I see that this is on the way, excellent. We should have the ability to add custom templates, for example when using exotic models that require a specific format

I agree. This is in progress, but not yet functional. I wanted to get your feedback on this approach I'm trying.

  "ChatML": {
    system: "<|im_start|>system\n{system_message}<|im_end|>",
    user: "<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n",
    first: "{system}\n{user}",
    stops: ["<|im_end|>"],
  },
  "Llama 2": {
    system: "<<SYS>>\n{system_message}\n<</SYS>>",
    user: "<s>[INST] {user_message} [/INST]",
    first: "<s>[INST] {system}\n\n{user_message} [/INST]",
    stops: ["</s>"],
  },

The above probably makes sense to you, but I will explain my thinking anyway. Tell me if I'm wrong or if this is a bad approach.

  1. Oftentimes the first part of the conversation is different enough from all subsequent messages that it makes sense to have a specific format for just the first message, which almost certainly includes the system message and the user message. So in the ChatML example above, where the first message is defined as {system}\n{user}:
  • {system} is replaced with ChatML.system and {user} is replaced with ChatML.user.
  • All subsequent messages are just formatted with ChatML.user, with the conversation history prepended (as most of the provider implementations handle it right now). A rough sketch of this formatting logic follows the list below.

I think the usefulness of this type of format configuration becomes apparent when examining the Llama 2 format specifically.

  2. Koboldcpp (and maybe others I'm not aware of) supports a stop_sequence: string[], which is the purpose of stops.

  3. This approach is easy for a user to extend (it's just JSON), and I can populate the UI with whatever additional formats the user defines.
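
To make the first/followup split concrete, here is a rough sketch of how a formatter might apply one of these template entries (purely illustrative; the names, signatures, and TemplateFormat type here are not the actual extension code):

interface TemplateFormat {
  system: string;
  user: string;
  first: string;
  stops: string[];
}

// First message: expand {system}/{user} inside `first`, then fill in the actual messages.
function formatFirst(fmt: TemplateFormat, systemMessage: string, userMessage: string): string {
  return fmt.first
    .replace("{system}", fmt.system)
    .replace("{user}", fmt.user)
    .replace("{system_message}", systemMessage)
    .replace("{user_message}", userMessage);
}

// Followup messages: only the `user` part is needed; conversation history is prepended by the provider.
function formatFollowup(fmt: TemplateFormat, userMessage: string): string {
  return fmt.user.replace("{user_message}", userMessage);
}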

nvms commented on June 27, 2024

Maybe the Provider interface should include something like a tokenizePrompt(prompt: string): number method, forcing a provider implementation to specify a tokenizer? This is something I have not given much thought, but I'm leaning towards the second idea.

Responding to my own comment here: I don't actually think this approach works. The tokenization approach really depends more on the model being used than on the provider, and since a provider can potentially support many different models (e.g. the OpenAI provider speaking to Goinfer running some llama or llama 2 model), it doesn't make sense for the Provider to know how to tokenize the prompt.

synw commented on June 27, 2024

Do we need to allow the user to choose a tokenizer in the UI?
The tokenization approach really depends more on the model being used than on the provider

Yes, it depends on the model family. We could have a mixed approach: define a default tokenizer for each provider, and have a tokenizer fallback setting and command param to let the user choose when needed. For the local model providers, the llama tokenizer would be fine in most cases. For OpenAI, the llama-tokenizer-js lib recommends the gpt-tokenizer lib, or we can use the official Tiktoken one. I don't know about Anthropic.
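
Something like this, as a sketch (the provider keys and function names are just examples; it assumes llama-tokenizer-js's encode and js-tiktoken's encodingForModel):

import llamaTokenizer from "llama-tokenizer-js";
import { encodingForModel } from "js-tiktoken";

type TokenCounter = (text: string) => number;

const llamaCount: TokenCounter = (text) => llamaTokenizer.encode(text).length;
const gptCount: TokenCounter = (text) => encodingForModel("gpt-4").encode(text).length;

// Per-provider defaults; a settings entry or command param could override them.
const defaultTokenizers: Record<string, TokenCounter> = {
  goinfer: llamaCount,
  koboldcpp: llamaCount,
  openai: gptCount,
};

function countPromptTokens(provider: string, prompt: string, override?: TokenCounter): number {
  const count = override ?? defaultTokenizers[provider] ?? llamaCount;
  return count(prompt);
}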

About the templating: it looks good, but I don't really get the first abstraction thing; I'll have to read the code to get a better idea. The supported template variables should include the conversation history. I like the Orca mini format abstraction, it's pretty simple and clear:

### System:
{system}

### User:
{prompt}

### Input:
{input}

### Response:

Associating a stop sequence with a template might be a good idea, but in the API it is an inference parameter: please have a look at this data structure for reference: https://github.com/synw/infergui/blob/main/src/interfaces.ts#L29 (it's the Llama.cpp API as implemented in Goinfer, but Koboldcpp is pretty similar).

By the way, I have a question: would it be possible to edit the commands in a more convenient format than JSON, like human-readable YAML files for example? For complex templates like few-shot ones it is much more convenient. This is what I did in Goinfer with my concept of tasks. Example "command" in YAML: https://github.com/synw/goinfer/blob/main/examples/tasks/code/json/fix.yml (a one-shot prompt)

nvms commented on June 27, 2024

We could have a mixed approach: define a default tokenizer for each provider, and have a tokenizer fallback setting and command param to let the user choose when needed.

Yeah, this might be the only approach that makes sense for wingman. I'll work on this in the refactor (I should publish the branch maybe today when it's in a good spot).

I don't really get the first abstraction thing

Maybe it's not needed, but my understanding is that the first message in a conversation is usually formatted differently from all followup messages, like in the current Anthropic provider (simplified for example purposes):

if (!isFollowup) {
  prompt = `${system}${user} ${user_message}${assistant}`
} else { 
  prompt = `${history}${user} ${user_message}${assistant}`
}

Becomes something like:

if (!isFollowup) {
  // formats with `format.first` as guide, e.g. `<s>[INST] {system}\n\n{user_message} [/INST]`
  prompt = formatFirst(command, userMessage, llama2Template); 
} else {
  // formats with `format.user` as guide, e.g. `<s>[INST] {user_message} [/INST]`
  prompt = `${history}${format(command, userMessage, llama2Template)}`;
}

Since the first message is not always just {system}{user}: sometimes it is {system}\n{user} or {system}\n\n{user}, or, in the weird case of Llama 2, <s>[INST] {system}\n\n{user} [/INST].

Is this not necessary?

Associating a stop sequence with a template might be a good idea, but in the API it is an inference parameter

Ah, I overlooked this. Thank you for pointing this out to me. It does still feel right to define it on the template, though. I'll have to think about this one some more.
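
One way to have both (just a sketch of the idea, not a settled design): keep stops on the template, and fold them into the inference parameters when the provider builds its request, e.g. into the Koboldcpp stop_sequence field mentioned above:

// Stops are defined on the template, but sent to the API as an inference
// parameter at request time. The stop_sequence field name matches Koboldcpp;
// everything else here is illustrative.
function buildRequestBody(prompt: string, templateStops: string[], params: Record<string, unknown>) {
  return {
    ...params,
    prompt,
    stop_sequence: templateStops,
  };
}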

would it be possible to edit the commands in a more convenient format than json?

Open to suggestions that are vscode-friendly. AFAIK the recommended method is to use settings.json for this type of thing. I understand that the way we define custom commands at the moment is somewhat cumbersome. I'd like to improve it somehow if at all possible.

capdevc commented on June 27, 2024

Mostly lurking at the moment (also due to day job stuff), but I hope to be able to contribute some more soon as well.

Re: prompt config, having it in the JSON settings is a huge benefit because the config is automatically synced and always available. There are some other VS Code extensions out there that let you customize prompts (flexigpt and continue, for example) as JS or Python objects in separate config files, but then it's up to the user to manage that JS/Python file and make sure that the path to it is correct and all that. I'm just one user here, but I use VS Code on remote machines 90% of the time via the k8s or ssh remote extensions, and wingman just works in that scenario. The other extensions require me to copy config/prompt files around.

Personally, I'd prefer for common prompting formats + tokenizer selection to be built in (keyed by model name, for example), with maybe an option to add a custom one to settings.json, even if it requires some more-or-less complicated string escaping/templating into JSON. I think that's the approach that keeps it easiest for the 90+% of users who are just going to use GPT-4 or another common model, while still keeping it possible to do more custom things for other users.

FWIW I'm mainly a Python programmer, so when I was initially setting up a bunch of prompts I made a tiny script that let me define the prompt settings as a Python object and then JSON-encode/escape it to copy/paste into settings.json. Maybe some tiny UI for prompt creation that does something similar would be useful.
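
Roughly the same idea, sketched here in TypeScript rather than Python (the fields below are made up for illustration, not wingman's actual settings schema):

// Define the prompt as a plain object, then print the escaped JSON to paste into settings.json.
const myCommand = {
  label: "Explain selection",
  template: "### System:\n{system}\n\n### User:\n{prompt}\n\n### Response:\n",
};

console.log(JSON.stringify(myCommand, null, 2));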

nvms commented on June 27, 2024

Personally, I'd prefer for common prompting formats + tokenizer selection to be built in (keyed by model name, for example), with maybe an option to add a custom one to settings.json, even if it requires some more-or-less complicated string escaping/templating into JSON. I think that's the approach that keeps it easiest for the 90+% of users who are just going to use GPT-4 or another common model, while still keeping it possible to do more custom things for other users.

Completely agree. A new user with a fresh install should be able to set their API key and just start using it, ignoring the configuration panel entirely. Likely more than 90% of users will fall into that bucket, as you mentioned. For the remaining few, it should be configurable enough that they can use whatever provider/LLM/format/etc. they want, and it should be easy to do so. So far, version 2 captures this idea, but there is still a bit more work to be done.

Maybe some tiny UI for prompt creation that does something similar would be useful

Yeah, since this is such a core feature of the extension it might make sense to have some small UI like this. I have no good ideas at the moment though.
