liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching

⚠️ DEPRECATION WARNING: LiteLLM is our new home. You can find the LiteLLM Proxy there. Thank you for checking us out! ❤️

Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models

Usage

Step 1: Put your API keys in .env

Copy the .env.template and put in the relevant keys (e.g. OPENAI_API_KEY="sk-..")

Step 2: Test your proxy

Start your proxy server

$ cd litellm-proxy && python3 main.py

Make your first call

# Uses the pre-1.0 openai SDK interface (openai.ChatCompletion was removed in openai>=1.0)
import openai

openai.api_key = "sk-litellm-master-key"
openai.api_base = "http://0.0.0.0:8080"

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])

print(response)

What does liteLLM proxy do

Make /chat/completions requests for 50+ LLM models: Azure, OpenAI, Replicate, Anthropic, Hugging Face

Example: for model, use claude-2, gpt-3.5, gpt-4, command-nightly, or stabilityai/stablecode-completion-alpha-3b-4k

{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}

Consistent Input/Output Format
- Call all models using the OpenAI format - completion(model, messages)
- Text responses will always be available at ['choices'][0]['message']['content']

Error Handling - Use model fallbacks (if GPT-4 fails, try llama2)

Logging - Log requests, responses and errors to Supabase, Posthog, Mixpanel, Sentry, LLMonitor, Traceloop, Helicone (any of the supported providers here: https://docs.litellm.ai/docs/)

Example: Logs sent to Supabase

Token Usage & Spend - Track input + completion tokens used + spend per model

Caching - Implementation of semantic caching

Streaming & Async Support - Return generators to stream text responses
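The fallback behaviour described above (if GPT-4 fails, try llama2) can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual implementation: `complete_with_fallbacks` and the `call_model` callable are hypothetical names, and `call_model` stands in for whatever function sends an OpenAI-format request to one model.

```python
from typing import Callable, Dict, List, Sequence

Message = Dict[str, str]


def complete_with_fallbacks(
    call_model: Callable[[str, List[Message]], dict],  # hypothetical: sends one request
    models: Sequence[str],
    messages: List[Message],
) -> dict:
    """Try each model in order and return the first successful response."""
    if not models:
        raise ValueError("models must contain at least one model name")
    errors = []
    for model in models:
        try:
            return call_model(model, messages)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Passing the request function in as an argument keeps the retry logic independent of any one provider, so it can be exercised with a stub before pointing it at real models.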

API Endpoints

`/chat/completions` (POST)

This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.

Input

This API endpoint accepts all inputs in raw JSON and expects the following inputs:

- model (string, required): ID of the model to use for chat completions, e.g. gpt-3.5-turbo, gpt-4, claude-2, command-nightly, stabilityai/stablecode-completion-alpha-3b-4k. See all supported models here: https://docs.litellm.ai/docs/
- messages (array, required): A list of messages representing the conversation context. Each message should have a role (system, user, assistant, or function), content (message text), and name (for function role).
- Additional optional parameters: temperature, functions, function_call, top_p, n, stream. See the full list of supported inputs here: https://docs.litellm.ai/docs/
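As a rough sketch of the input rules above, the helper below assembles a JSON body and checks the two required fields before anything is sent. `build_chat_payload` is a hypothetical name used only for illustration; the proxy does not necessarily validate requests this way, and optional parameters are passed through unchecked.

```python
import json
from typing import Any, Dict, List

ALLOWED_ROLES = {"system", "user", "assistant", "function"}


def build_chat_payload(model: str, messages: List[Dict[str, Any]], **optional: Any) -> str:
    """Return a JSON request body for /chat/completions (hypothetical helper)."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required and must not be empty")
    for message in messages:
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        if role == "function" and "name" not in message:
            raise ValueError("messages with the function role need a name")
    body: Dict[str, Any] = {"model": model, "messages": messages}
    body.update(optional)  # e.g. temperature, top_p, n, stream
    return json.dumps(body)


payload = build_chat_payload(
    "claude-2",
    [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}],
    temperature=0.2,
)
```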

Example JSON body

For claude-2

{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}

Making an API request to the Proxy Server

import requests
import json

# TODO: use your URL
url = "http://localhost:5000/chat/completions"

payload = json.dumps({
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
})
headers = {
  'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)

Output [Response Format]

All responses from the server are returned in the following format, regardless of which LLM model handled the request. More info on output here: https://docs.litellm.ai/docs/

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
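Because every model returns the same shape, reading the reply and the token counts takes only a few lines. The sketch below assumes the response has already been parsed into a Python dict (for example with `response.json()`); `extract_text` and `total_tokens` are hypothetical helper names, and the sample dict is a shortened stand-in for the response above.

```python
from typing import Any, Dict


def extract_text(response: Dict[str, Any]) -> str:
    """Return the assistant's reply from an OpenAI-format response."""
    return response["choices"][0]["message"]["content"]


def total_tokens(response: Dict[str, Any]) -> int:
    """Return total token usage, falling back to prompt + completion counts."""
    usage = response.get("usage", {})
    if "total_tokens" in usage:
        return usage["total_tokens"]
    return usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0)


# Shortened stand-in for the sample response; the content string is a placeholder.
sample = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": "placeholder reply", "role": "assistant"},
        }
    ],
    "usage": {"completion_tokens": 41, "prompt_tokens": 16, "total_tokens": 57},
}

print(extract_text(sample))
print(total_tokens(sample))
```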

Installation & Usage

Running Locally

Clone the liteLLM proxy repository to your local machine:

git clone https://github.com/BerriAI/liteLLM-proxy

Install the required dependencies using pip:
```
pip install -r requirements.txt
```

(Optional) Set your LiteLLM proxy master key

os.environ['LITELLM_PROXY_MASTER_KEY'] = "YOUR_LITELLM_PROXY_MASTER_KEY"
or
set LITELLM_PROXY_MASTER_KEY in your .env file

Set your LLM API keys

os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
or
set OPENAI_API_KEY in your .env file
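Before starting the server, it can help to confirm that the keys are actually visible to the process. The snippet below is a small sketch of such a check and is not part of the proxy itself: `missing_keys` is a hypothetical helper, and the list of required names is an assumption that should match the providers you plan to call.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names from `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Assumed set of keys; adjust to the providers you use.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```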

Run the server:
```
python main.py
```

Deploying

Quick Start: Deploy on Railway
GCP, AWS, Azure: This project includes a Dockerfile, allowing you to build and deploy a Docker image on your provider of choice

Support / Talk with founders

Our calendar 👋
Community Discord 💭
Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
Our emails ✉️ [email protected] / [email protected]

Roadmap

Support hosted db (e.g. Supabase)
Easily send data to places like posthog and sentry.
Add a hot-cache for project spend logs - enables fast checks for user + project limits
Implement user-based rate-limiting
Spending controls per project - expose key creation endpoint
Need to store a keys db -> mapping created keys to their alias (i.e. project name)
Easily add new models as backups / as the entry-point (add this to the available model list)
