
worker-vllm's Introduction

OpenAI-Compatible vLLM Serverless Endpoint Worker

Deploy OpenAI-Compatible Blazing-Fast LLM Endpoints powered by the vLLM Inference Engine on RunPod Serverless with just a few clicks.

News:

1. UI for Deploying vLLM Worker on RunPod console:

Demo of Deploying vLLM Worker on RunPod console with new UI

2. Worker vLLM v1.2.0 with vLLM 0.5.4 now available under stable tags

Update v1.2.0 is now available; use the image tag runpod/worker-v1-vllm:v1.2.0stable-cuda12.1.0.

3. OpenAI-Compatible Embedding Worker Released

Deploy your own OpenAI-compatible Serverless Endpoint on RunPod with multiple embedding models and fast inference for RAG and more!

4. Caching Across RunPod Machines

Worker vLLM is now cached on all RunPod machines, resulting in near-instant deployment! Previously, downloading and extracting the image took 3-5 minutes on average.

Table of Contents

Setting up the Serverless Worker

Option 1: Deploy Any Model Using Pre-Built Docker Image [Recommended]

Note

You can now deploy from the dedicated UI on the RunPod console with all of the settings and choices listed. Try it now from the Explore or Serverless pages on the RunPod console!

We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:


RunPod Worker Images

Below is a summary of the available RunPod Worker images, categorized by image stability and CUDA version compatibility.

CUDA Version Stable Image Tag Development Image Tag Note
12.1.0 runpod/worker-v1-vllm:stable-cuda12.1.0 runpod/worker-v1-vllm:dev-cuda12.1.0 When creating an Endpoint, select CUDA Version 12.3, 12.2 and 12.1 in the filter.

Prerequisites

  • RunPod Account

Environment Variables/Settings

Note: 0 is equivalent to False and 1 is equivalent to True for boolean as int values.

Name Default Type/Choices Description
MODEL_NAME 'facebook/opt-125m' str Name or path of the Hugging Face model to use.
TOKENIZER None str Name or path of the Hugging Face tokenizer to use.
SKIP_TOKENIZER_INIT False bool Skip initialization of tokenizer and detokenizer.
TOKENIZER_MODE 'auto' ['auto', 'slow'] The tokenizer mode.
TRUST_REMOTE_CODE False bool Trust remote code from Hugging Face.
DOWNLOAD_DIR None str Directory to download and load the weights.
LOAD_FORMAT 'auto' str The format of the model weights to load.
HF_TOKEN - str Hugging Face token for private and gated models.
DTYPE 'auto' ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'] Data type for model weights and activations.
KV_CACHE_DTYPE 'auto' ['auto', 'fp8'] Data type for KV cache storage.
QUANTIZATION_PARAM_PATH None str Path to the JSON file containing the KV cache scaling factors.
MAX_MODEL_LEN None int Model context length.
GUIDED_DECODING_BACKEND 'outlines' ['outlines', 'lm-format-enforcer'] Which engine will be used for guided decoding by default.
DISTRIBUTED_EXECUTOR_BACKEND None ['ray', 'mp'] Backend to use for distributed serving.
WORKER_USE_RAY False bool Deprecated, use --distributed-executor-backend=ray.
PIPELINE_PARALLEL_SIZE 1 int Number of pipeline stages.
TENSOR_PARALLEL_SIZE 1 int Number of tensor parallel replicas.
MAX_PARALLEL_LOADING_WORKERS None int Load model sequentially in multiple batches.
RAY_WORKERS_USE_NSIGHT False bool If specified, use nsight to profile Ray workers.
ENABLE_PREFIX_CACHING False bool Enables automatic prefix caching.
DISABLE_SLIDING_WINDOW False bool Disables sliding window, capping to sliding window size.
USE_V2_BLOCK_MANAGER False bool Use BlockSpaceManagerV2.
NUM_LOOKAHEAD_SLOTS 0 int Experimental scheduling config necessary for speculative decoding.
SEED 0 int Random seed for operations.
NUM_GPU_BLOCKS_OVERRIDE None int If specified, ignore GPU profiling result and use this number of GPU blocks.
MAX_NUM_BATCHED_TOKENS None int Maximum number of batched tokens per iteration.
MAX_NUM_SEQS 256 int Maximum number of sequences per iteration.
MAX_LOGPROBS 20 int Max number of log probs to return when logprobs is specified in SamplingParams.
DISABLE_LOG_STATS False bool Disable logging statistics.
QUANTIZATION None ['awq', 'squeezellm', 'gptq'] Method used to quantize the weights.
ROPE_SCALING None dict RoPE scaling configuration in JSON format.
ROPE_THETA None float RoPE theta. Use with rope_scaling.
TOKENIZER_POOL_SIZE 0 int Size of tokenizer pool to use for asynchronous tokenization.
TOKENIZER_POOL_TYPE 'ray' str Type of tokenizer pool to use for asynchronous tokenization.
TOKENIZER_POOL_EXTRA_CONFIG None dict Extra config for tokenizer pool.
ENABLE_LORA False bool If True, enable handling of LoRA adapters.
MAX_LORAS 1 int Max number of LoRAs in a single batch.
MAX_LORA_RANK 16 int Max LoRA rank.
LORA_EXTRA_VOCAB_SIZE 256 int Maximum size of extra vocabulary for LoRA adapters.
LORA_DTYPE 'auto' ['auto', 'float16', 'bfloat16', 'float32'] Data type for LoRA.
LONG_LORA_SCALING_FACTORS None tuple Specify multiple scaling factors for LoRA adapters.
MAX_CPU_LORAS None int Maximum number of LoRAs to store in CPU memory.
FULLY_SHARDED_LORAS False bool Enable fully sharded LoRA layers.
SCHEDULER_DELAY_FACTOR 0.0 float Apply a delay before scheduling next prompt.
ENABLE_CHUNKED_PREFILL False bool Enable chunked prefill requests.
SPECULATIVE_MODEL None str The name of the draft model to be used in speculative decoding.
NUM_SPECULATIVE_TOKENS None int The number of speculative tokens to sample from the draft model.
SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE None int Number of tensor parallel replicas for the draft model.
SPECULATIVE_MAX_MODEL_LEN None int The maximum sequence length supported by the draft model.
SPECULATIVE_DISABLE_BY_BATCH_SIZE None int Disable speculative decoding if the number of enqueued requests is larger than this value.
NGRAM_PROMPT_LOOKUP_MAX None int Max size of window for ngram prompt lookup in speculative decoding.
NGRAM_PROMPT_LOOKUP_MIN None int Min size of window for ngram prompt lookup in speculative decoding.
SPEC_DECODING_ACCEPTANCE_METHOD 'rejection_sampler' ['rejection_sampler', 'typical_acceptance_sampler'] Specify the acceptance method for draft token verification in speculative decoding.
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD None float Set the lower bound threshold for the posterior probability of a token to be accepted.
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA None float A scaling factor for the entropy-based threshold for token acceptance.
MODEL_LOADER_EXTRA_CONFIG None dict Extra config for model loader.
PREEMPTION_MODE None str If 'recompute', the engine performs preemption-aware recomputation. If 'save', the engine saves activations into the CPU memory as preemption happens.
PREEMPTION_CHECK_PERIOD 1.0 float How frequently the engine checks if a preemption happens.
PREEMPTION_CPU_CAPACITY 2 float The percentage of CPU memory used for the saved activations.
DISABLE_LOGGING_REQUEST False bool Disable logging requests.
MAX_LOG_LEN None int Max number of prompt characters or prompt ID numbers being printed in log.
Tokenizer Settings
TOKENIZER_NAME None str Tokenizer repository to use a different tokenizer than the model's default.
TOKENIZER_REVISION None str Tokenizer revision to load.
CUSTOM_CHAT_TEMPLATE None str of single-line jinja template Custom chat jinja template. More Info
System, GPU, and Tensor Parallelism(Multi-GPU) Settings
GPU_MEMORY_UTILIZATION 0.95 float Sets GPU VRAM utilization.
MAX_PARALLEL_LOADING_WORKERS None int Load model sequentially in multiple batches, to avoid RAM OOM when using tensor parallel and large models.
BLOCK_SIZE 16 8, 16, 32 Token block size for contiguous chunks of tokens.
SWAP_SPACE 4 int CPU swap space size (GiB) per GPU.
ENFORCE_EAGER False bool Always use eager-mode PyTorch. If False (0), eager mode and CUDA graphs are used in hybrid for maximal performance and flexibility.
MAX_SEQ_LEN_TO_CAPTURE 8192 int Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode.
DISABLE_CUSTOM_ALL_REDUCE 0 int Enables or disables custom all reduce.
Streaming Batch Size Settings:
DEFAULT_BATCH_SIZE 50 int Default and Maximum batch size for token streaming to reduce HTTP calls.
DEFAULT_MIN_BATCH_SIZE 1 int Batch size for the first request, which will be multiplied by the growth factor every subsequent request.
DEFAULT_BATCH_SIZE_GROWTH_FACTOR 3 float Growth factor for dynamic batch size.
The way this works is that the first request will have a batch size of DEFAULT_MIN_BATCH_SIZE, and each subsequent request will have a batch size of previous_batch_size * DEFAULT_BATCH_SIZE_GROWTH_FACTOR. This continues until the batch size reaches DEFAULT_BATCH_SIZE. For example, with the default values the batch sizes will be 1, 3, 9, 27, 50, 50, 50, ... (see the short sketch after this table). You can also specify this per request with the inputs max_batch_size, min_batch_size, and batch_size_growth_factor. This has nothing to do with vLLM's internal batching; it is only the number of tokens sent in each HTTP request from the worker.
OpenAI Settings
RAW_OPENAI_OUTPUT 1 boolean as int Enables raw OpenAI SSE format string output when streaming. Required to be enabled (which it is by default) for OpenAI compatibility.
OPENAI_SERVED_MODEL_NAME_OVERRIDE None str Overrides the name of the served model from the model repo/path to the specified name, which you can then use as the value of the model parameter when making OpenAI requests.
OPENAI_RESPONSE_ROLE assistant str Role of the LLM's Response in OpenAI Chat Completions.
Serverless Settings
MAX_CONCURRENCY 300 int Max concurrent requests per worker. vLLM has an internal queue, so you don't have to worry about limiting by VRAM; this setting is for improving scaling/load-balancing efficiency.
DISABLE_LOG_STATS False bool Disables vLLM stats logging.
DISABLE_LOG_REQUESTS False bool Disables vLLM request logging.
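
To illustrate how the streaming batch size grows, here is a minimal Python sketch (not part of the worker's code) that reproduces the schedule for the default values of DEFAULT_MIN_BATCH_SIZE, DEFAULT_BATCH_SIZE_GROWTH_FACTOR, and DEFAULT_BATCH_SIZE:

# Illustrative only: reproduces the streaming batch-size schedule described above.
def batch_size_schedule(min_batch_size=1, growth_factor=3.0, max_batch_size=50, steps=7):
    sizes = []
    current = float(min_batch_size)
    for _ in range(steps):
        sizes.append(min(int(current), max_batch_size))
        current *= growth_factor
    return sizes

print(batch_size_schedule())  # [1, 3, 9, 27, 50, 50, 50]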

Tip

If you are facing issues when using Mixtral 8x7B, Quantized models, or handling unusual models/architectures, try setting TRUST_REMOTE_CODE to 1.
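
Putting the most common settings together, a minimal Option 1 configuration might look like the following environment variables set on the endpoint (illustrative values only; MODEL_NAME is the main one you will want to override, since it defaults to facebook/opt-125m):

MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.1
MAX_MODEL_LEN=8192
GPU_MEMORY_UTILIZATION=0.95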

Option 2: Build Docker Image with Model Inside

To build an image with the model baked in, you must specify the following docker arguments when building the image.

Prerequisites

  • RunPod Account
  • Docker

Arguments:

  • Required
    • MODEL_NAME
  • Optional
    • MODEL_REVISION: Model revision to load (default: main).
    • BASE_PATH: Storage directory where huggingface cache and model will be located. (default: /runpod-volume, which will utilize network storage if you attach it or create a local directory within the image if you don't. If your intention is to bake the model into the image, you should set this to something like /models to make sure there are no issues if you were to accidentally attach network storage.)
    • QUANTIZATION
    • WORKER_CUDA_VERSION: 12.1.0 (12.1.0 is recommended for optimal performance).
    • TOKENIZER_NAME: Tokenizer repository if you would like to use a different tokenizer than the one that comes with the model. (default: None, which uses the model's tokenizer)
    • TOKENIZER_REVISION: Tokenizer revision to load (default: main).

For the remaining settings, you may apply them as environment variables when running the container. Supported environment variables are listed in the Environment Variables section.

Example: Building an image with OpenChat-3.5

sudo docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg BASE_PATH="/models" .
(Optional) Including a Hugging Face Token

If the model you would like to deploy is private or gated, you will need to include it during build time as a Docker secret, which will protect it from being exposed in the image and on DockerHub.

  1. Enable Docker BuildKit (required for secrets).
export DOCKER_BUILDKIT=1
  2. Export your Hugging Face token as an environment variable.
export HF_TOKEN="your_token_here"
  3. Add the token as a secret when building.
docker build -t username/image:tag --secret id=HF_TOKEN --build-arg MODEL_NAME="openchat/openchat_3.5" .

Compatible Model Architectures

Below are all supported model architectures (and examples of each) that you can deploy using the vLLM Worker. You can deploy any model on HuggingFace, as long as its base architecture is one of the following:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Command-R (CohereForAI/c4ai-command-r-v01, etc.)
  • DBRX (databricks/dbrx-base, databricks/dbrx-instruct etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • Gemma (google/gemma-2b, google/gemma-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
  • Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
  • LLaMA, Llama 2, and Meta Llama 3 (meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OLMo (allenai/OLMo-1B-hf, allenai/OLMo-7B-hf, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Phi-3 (microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Qwen2 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
  • Qwen2MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
  • StableLM(stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
  • Starcoder2(bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
  • Xverse (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Usage: OpenAI Compatibility

The vLLM Worker is fully compatible with OpenAI's API, and you can use it with any OpenAI Codebase by changing only 3 lines in total. The supported routes are Chat Completions and Models - with both streaming and non-streaming.

Modifying your OpenAI Codebase to use your deployed vLLM Worker

Python (similar to Node.js, etc.):

  1. When initializing the OpenAI Client in your code, change the api_key to your RunPod API Key and the base_url to your RunPod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1, filling in your deployed endpoint ID. For example, if your Endpoint ID is abc1234, the URL would be https://api.runpod.ai/v2/abc1234/openai/v1.

    • Before:
    from openai import OpenAI
    
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    • After:
    from openai import OpenAI
    
    client = OpenAI(
        api_key=os.environ.get("RUNPOD_API_KEY"),
        base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
    )
  2. Change the model parameter to your deployed model's name whenever using Completions or Chat Completions.

    • Before:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
        temperature=0,
        max_tokens=100,
    )
    • After:
    response = client.chat.completions.create(
        model="<YOUR DEPLOYED MODEL REPO/NAME>",
        messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
        temperature=0,
        max_tokens=100,
    )

Using http requests:

  1. Change the Authorization header to your RunPod API Key and the url to your RunPod Serverless Endpoint URL in the following format: https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1
    • Before:
    curl https://api.openai.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Why is RunPod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
    }'
    • After:
    curl https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <YOUR OPENAI API KEY>" \
    -d '{
    "model": "<YOUR DEPLOYED MODEL REPO/NAME>",
    "messages": [
      {
        "role": "user",
        "content": "Why is RunPod the best platform?"
      }
    ],
    "temperature": 0,
    "max_tokens": 100
    }'

OpenAI Request Input Parameters:

When using the chat completion feature of the vLLM Serverless Endpoint Worker, you can customize your requests with the following parameters:

Chat Completions [RECOMMENDED]

Supported Chat Completions Inputs and Descriptions
Parameter Type Default Value Description
messages Union[str, List[Dict[str, str]]] List of messages, where each message is a dictionary with a role and content. The model's chat template will be applied to the messages automatically, so the model must have one, or a template must be provided via the CUSTOM_CHAT_TEMPLATE env var.
model str The model repo that you've deployed on your RunPod Serverless Endpoint. If you are unsure what the name is, or you are baking the model into the image, use the guide in the Examples: Using your RunPod endpoint with OpenAI section to get the list of available models.
temperature Optional[float] 0.7 Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
top_p Optional[float] 1.0 Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
n Optional[int] 1 Number of output sequences to return for the given prompt.
max_tokens Optional[int] None Maximum number of tokens to generate per output sequence.
seed Optional[int] None Random seed to use for the generation.
stop Optional[Union[str, List[str]]] list List of strings that stop the generation when they are generated. The returned output will not contain the stop strings.
stream Optional[bool] False Whether to stream or not
presence_penalty Optional[float] 0.0 Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
frequency_penalty Optional[float] 0.0 Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
logit_bias Optional[Dict[str, float]] None Unsupported by vLLM
user Optional[str] None Unsupported by vLLM
Additional parameters supported by vLLM:
best_of Optional[int] None Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.
top_k Optional[int] -1 Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
ignore_eos Optional[bool] False Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
use_beam_search Optional[bool] False Whether to use beam search instead of sampling.
stop_token_ids Optional[List[int]] list List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.
skip_special_tokens Optional[bool] True Whether to skip special tokens in the output.
spaces_between_special_tokens Optional[bool] True Whether to add spaces between special tokens in the output. Defaults to True.
add_generation_prompt Optional[bool] True Read more here
echo Optional[bool] False Echo back the prompt in addition to the completion
repetition_penalty Optional[float] 1.0 Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
min_p Optional[float] 0.0 Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable.
length_penalty Optional[float] 1.0 Float that penalizes sequences based on their length. Used in beam search.
include_stop_str_in_output Optional[bool] False Whether to include the stop strings in output text. Defaults to False.
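
The parameters listed under "Additional parameters supported by vLLM" are not part of the standard OpenAI request schema. With the official OpenAI Python client they can be passed through the client's extra_body argument; the snippet below is a sketch that assumes the worker accepts these extra fields as the table above suggests, and that client is initialized as shown in the Examples section below.

# Sketch: passing vLLM-specific parameters alongside standard OpenAI ones.
# Assumes `client` is the OpenAI client configured for your RunPod endpoint
# and that the worker accepts these extra fields (per the table above).
response = client.chat.completions.create(
    model="<YOUR DEPLOYED MODEL REPO/NAME>",
    messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
    temperature=0,
    max_tokens=100,
    extra_body={
        "top_k": 40,                # consider only the 40 most likely tokens
        "repetition_penalty": 1.1,  # mildly discourage repetition
        "min_p": 0.05,              # drop tokens far less likely than the top token
    },
)
print(response.choices[0].message.content)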

Examples: Using your RunPod endpoint with OpenAI

First, initialize the OpenAI Client with your RunPod API Key and Endpoint URL:

from openai import OpenAI
import os

# Initialize the OpenAI Client with your RunPod API Key and Endpoint URL
client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1",
)

Chat Completions:

This is the format used for GPT-4 and is focused on instruction-following and chat. Examples of open-source chat/instruct models include meta-llama/Llama-2-7b-chat-hf, mistralai/Mixtral-8x7B-Instruct-v0.1, openchat/openchat-3.5-0106, NousResearch/Nous-Hermes-2-Mistral-7B-DPO, and more. However, if your model is a completion-style model with no chat/instruct fine-tune and/or does not have a chat template, you can still use this if you provide a chat template with the environment variable CUSTOM_CHAT_TEMPLATE.

  • Streaming:
    # Create a chat completion stream
    response_stream = client.chat.completions.create(
        model="<YOUR DEPLOYED MODEL REPO/NAME>",
        messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
        temperature=0,
        max_tokens=100,
        stream=True,
    )
    # Stream the response
    for chunk in response_stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
  • Non-Streaming:
    # Create a chat completion
    response = client.chat.completions.create(
        model="<YOUR DEPLOYED MODEL REPO/NAME>",
        messages=[{"role": "user", "content": "Why is RunPod the best platform?"}],
        temperature=0,
        max_tokens=100,
    )
    # Print the response
    print(response.choices[0].message.content)

Getting a list of names for available models:

When the model is baked into the image, the repo name may sometimes not be accepted as the model in the request. In that case, you can list the available models as shown below and use one of those names.

models_response = client.models.list()
list_of_models = [model.id for model in models_response]
print(list_of_models)

Usage: Standard (Non-OpenAI)

Request Input Parameters

Click to expand table

You may either use a prompt or a list of messages as input. If you use messages, the model's chat template will be applied to the messages automatically, so the model must have one. If you use prompt, you may optionally apply the model's chat template to the prompt by setting apply_chat_template to true.

Argument Type Default Description
prompt str Prompt string to generate text based on.
messages list[dict[str, str]] List of messages, which will automatically have the model's chat template applied. Overrides prompt.
apply_chat_template bool False Whether to apply the model's chat template to the prompt.
sampling_params dict {} Sampling parameters to control the generation, like temperature, top_p, etc. You can find all available parameters in the Sampling Parameters section below.
stream bool False Whether to enable streaming of output. If True, responses are streamed as they are generated.
max_batch_size int env var DEFAULT_BATCH_SIZE The maximum number of tokens to stream every HTTP POST call.
min_batch_size int env var DEFAULT_MIN_BATCH_SIZE The minimum number of tokens to stream every HTTP POST call.
batch_size_growth_factor int env var DEFAULT_BATCH_SIZE_GROWTH_FACTOR The growth factor by which min_batch_size will be multiplied for each call until max_batch_size is reached.

Sampling Parameters

Below are all available sampling parameters that you can specify in the sampling_params dictionary. If you do not specify any of these parameters, the default values will be used.

Click to expand table
Argument Type Default Description
n int 1 Number of output sequences generated from the prompt. The top n sequences are returned.
best_of Optional[int] n Number of output sequences generated from the prompt. The top n sequences are returned from these best_of sequences. Must be ≥ n. Treated as beam width in beam search. Default is n.
presence_penalty float 0.0 Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition.
frequency_penalty float 0.0 Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition.
repetition_penalty float 1.0 Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens, values < 1 encourage repetition.
temperature float 1.0 Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling.
top_p float 1.0 Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
top_k int -1 Controls the number of top tokens to consider. Set to -1 to consider all tokens.
min_p float 0.0 Represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable.
use_beam_search bool False Whether to use beam search instead of sampling.
length_penalty float 1.0 Penalizes sequences based on their length. Used in beam search.
early_stopping Union[bool, str] False Controls stopping condition in beam search. Can be True, False, or "never".
stop Union[None, str, List[str]] None List of strings that stop generation when produced. The output will not contain these strings.
stop_token_ids Optional[List[int]] None List of token IDs that stop generation when produced. Output contains these tokens unless they are special tokens.
ignore_eos bool False Whether to ignore the End-Of-Sequence token and continue generating tokens after its generation.
max_tokens int 16 Maximum number of tokens to generate per output sequence.
skip_special_tokens bool True Whether to skip special tokens in the output.
spaces_between_special_tokens bool True Whether to add spaces between special tokens in the output.
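
Note that max_tokens defaults to 16, so you will usually want to raise it explicitly. For example, a sampling_params dictionary could look like this (illustrative values only):

"sampling_params": {
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512,
    "repetition_penalty": 1.05
}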

Text Input Formats

You may either use a prompt or a list of messages as input.

  1. prompt The prompt string can be any string, and the model's chat template will not be applied to it unless apply_chat_template is set to true, in which case it will be treated as a user message.

    Example:

    "prompt": "..."
  2. messages Your list can contain any number of messages, and each message can usually have any role from the following list:

    • user
    • assistant
    • system

    However, some models may have different roles, so you should check the model's chat template to see which roles are required.

    The model's chat template will be applied to the messages automatically, so the model must have one.

    Example:

    "messages": [
        {
          "role": "system",
          "content": "..."
        },
        {
          "role": "user",
          "content": "..."
        },
        {
          "role": "assistant",
          "content": "..."
        }
      ]
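
Putting it together, a request to the standard handler might look like the following sketch, which uses RunPod's synchronous /runsync route. <ENDPOINT_ID> is a placeholder for your deployed endpoint ID, and the payload values are illustrative.

import os
import requests

# Sketch: calling the standard (non-OpenAI) handler through RunPod's /runsync route.
url = "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync"
headers = {
    "Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}",
    "Content-Type": "application/json",
}
payload = {
    "input": {
        "messages": [{"role": "user", "content": "Why is RunPod the best platform?"}],
        "sampling_params": {"temperature": 0.7, "max_tokens": 256},
    }
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())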

worker-vllm's People

Contributors

alpayariyak, casper-hansen, jorghi12, justinmerrell, mikljohansson, pandyamarut, rachfop, vladmihaisima, willsamu


worker-vllm's Issues

MODEL_REVISION not read

Hi,

When running runpod/worker-vllm:0.3.0-cuda12.1.0 in serverless and setting the env variable MODEL_REVISION, the value is not read and the worker keeps downloading the main branch.

Any suggestions?
Thank you

trust_remote_code not recognized

2024-02-08T08:40:56.016480879Z engine.py           :43   2024-02-08 08:40:56,015 vLLM config: {'model': 'TheBloke/Nous-Capybara-34B-AWQ', 'download_dir': '/runpod-volume/huggingface-cache/hub', 'quantization': 'awq', 'load_format': 'auto', 'dtype': 'half', 'tokenizer': None, 'disable_log_stats': True, 'disable_log_requests': True, 'trust_remote_code': True, 'gpu_memory_utilization': 0.95, 'max_parallel_loading_workers': 48, 'max_model_len': 32000, 'tensor_parallel_size': 1}
2024-02-08T08:40:56.161618385Z Traceback (most recent call last):
2024-02-08T08:40:56.161651438Z   File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 598, in resolve_trust_remote_code
2024-02-08T08:40:56.161703147Z The repository for TheBloke/Nous-Capybara-34B-AWQ contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/TheBloke/Nous-Capybara-34B-AWQ.
2024-02-08T08:40:56.161725757Z You can avoid this prompt in future by passing the argument `trust_remote_code=True`.
2024-02-08T08:40:56.161730573Z 
2024-02-08T08:40:56.162176340Z     answer = input(
2024-02-08T08:40:56.162202300Z EOFError: EOF when reading a line
2024-02-08T08:40:56.162207723Z 
2024-02-08T08:40:56.162212176Z During handling of the above exception, another exception occurred:
2024-02-08T08:40:56.162219295Z 
2024-02-08T08:40:56.162223415Z Traceback (most recent call last):
2024-02-08T08:40:56.162227640Z   File "/src/handler.py", line 5, in <module>
2024-02-08T08:40:56.162232250Z     vllm_engine = vLLMEngine()
2024-02-08T08:40:56.162237292Z   File "/src/engine.py", line 44, in __init__
2024-02-08T08:40:56.162241842Z     self.tokenizer = Tokenizer(os.getenv("TOKENIZER_NAME", os.getenv("MODEL_NAME")))
2024-02-08T08:40:56.162246872Z   File "/src/engine.py", line 17, in __init__
2024-02-08T08:40:56.162251149Z     self.tokenizer = AutoTokenizer.from_pretrained(model_name)
2024-02-08T08:40:56.162256589Z   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 788, in from_pretrained
2024-02-08T08:40:56.162569437Z     trust_remote_code = resolve_trust_remote_code(
2024-02-08T08:40:56.162595802Z   File "/usr/local/lib/python3.10/dist-packages/transformers/dynamic_module_utils.py", line 611, in resolve_trust_remote_code
2024-02-08T08:40:56.162620665Z     raise ValueError(
2024-02-08T08:40:56.162626086Z ValueError: The repository for TheBloke/Nous-Capybara-34B-AWQ contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/TheBloke/Nous-Capybara-34B-AWQ.
2024-02-08T08:40:56.162635356Z Please pass the argument `trust_remote_code=True` to allow custom code to be run.
2024-02-08T08:40:56.903281462Z Do you wish to run the custom code? [y/N] 

The first line clearly shows trust_remote_code=True in my engine args. I passed it to the worker as an environment variable on the template, with TRUST_REMOTE_CODE set to 1.

Huggingface is down and my worker is looping

I already have the model stored in my network volume, but I guess the worker checks Huggingface anyway? It's looping on a gateway error.

"message":"huggingface_hub.utils._errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/models/TheBloke/Nous-Capybara-34B-GPTQ"

Edit: Using worker-vllm:0.2.3

Could not build wheels for vllm

  Obtaining dependency information for fastapi from https://files.pythonhosted.org/packages/76/e5/ca411b260caa4e72f9ac5482f331fe74fd4eb5b97aa74d1d2806ccf07e2c/fastapi-0.103.1-py3-none-any.whl.metadata
  Downloading fastapi-0.103.1-py3-none-any.whl.metadata (24 kB)
 INFO: pip is looking at multiple versions of fastapi[all] to determine which version is compatible with other requirements. This could take a while.
 ERROR: Cannot install -r /requirements.txt (line 4), fastapi and fastapi[all]==0.103.2 because these package versions have conflicting dependencies.
#0 65.25
#0 65.25 The conflict is caused by:
#0 65.25     The user requested fastapi
#0 65.25     vllm 0.1.7 depends on fastapi
#0 65.25     fastapi[all] 0.103.2 depends on fastapi 0.103.2 (from https://files.pythonhosted.org/packages/4d/d2/3ad038a2365fefbac19d9a046cab7ce45f4c7bfa81d877cbece9707de9ce/fastapi-0.103.2-py3-none-any.whl (from https://pypi.org/simple/fastapi/) (requires-python:>=3.7))

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

I removed the dependency versions from requirements.txt and now it gives me this:

#0 1903.7 Failed to build vllm
#0 1903.7 ERROR: Could not build wheels for vllm, which is required to install pyproject.toml-based projects

COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --upgrade pip && \
pip install --upgrade -r /requirements.txt --no-cache-dir && \
rm /requirements.txt
ERROR: failed to solve: process "/bin/bash -o pipefail -c pip install --upgrade pip &&     pip install --upgrade -r /requirements.txt --no-cache-dir &&     rm /requirements.txt" did not complete successfully: exit code: 1

Runpod serverless vLLM with Llama 3 70B on 40GB GPU

I'm running a RunPod serverless vLLM template with Llama 3 70B on a 40 GB GPU. One of the requests failed and I'm not completely sure what happened, but the message asked me to open a GitHub issue, so I'll just leave it here in case it's of any help to anyone.

{
  "delayTime": 164,
  "error": "handler: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
traceback: Traceback (most recent call last):
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 29, in _raise_exception_on_finish
    task.result()
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 414, in run_engine_loop
    has_requests_in_progress = await self.engine_step()
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 393, in engine_step
    request_outputs = await self.engine.step_async()
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 189, in step_async
    all_outputs = await self._run_workers_async(
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 276, in _run_workers_async
    all_outputs = await asyncio.gather(*coros)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-installation/vllm/worker/worker.py", line 223, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-installation/vllm/worker/model_runner.py", line 582, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 337, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 267, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 226, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/models/llama.py", line 77, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-installation/vllm/model_executor/layers/linear.py", line 215, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/vllm-installation/vllm/model_executor/layers/quantization/awq.py", line 158, in apply_weights
    out = ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 493.75 MiB is free. Process 2531074 has 43.85 GiB memory in use. Of the allocated memory 39.71 GiB is allocated by PyTorch, and 1.16 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 194, in run_job_generator
    async for output_partial in job_output:
  File "/src/handler.py", line 13, in handler
    async for batch in results_generator:
  File "/src/engine.py", line 132, in generate
    async for response in self._handle_chat_or_completion_request(openai_request):
  File "/src/engine.py", line 166, in _handle_chat_or_completion_request
    async for chunk_str in response_generator:
  File "/vllm-installation/vllm/entrypoints/openai/serving_chat.py", line 148, in chat_completion_stream_generator
    async for res in result_generator:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 577, in generate
    raise e
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 571, in generate
    async for request_output in stream:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/runpod/serverless/modules/rp_job.py", line 194, in run_job_generator
    async for output_partial in job_output:
  File "/src/handler.py", line 13, in handler
    async for batch in results_generator:
  File "/src/engine.py", line 132, in generate
    async for response in self._handle_chat_or_completion_request(openai_request):
  File "/src/engine.py", line 166, in _handle_chat_or_completion_request
    async for chunk_str in response_generator:
  File "/vllm-installation/vllm/entrypoints/openai/serving_chat.py", line 148, in chat_completion_stream_generator
    async for res in result_generator:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 577, in generate
    raise e
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 571, in generate
    async for request_output in stream:
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 69, in __anext__
    raise result
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    raise exc
  File "/vllm-installation/vllm/engine/async_llm_engine.py", line 33, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
",
  "executionTime": 70626,
  "id": "b28e9cd8-88e2-4485-aa00-7115aba3457c-e1",
  "status": "FAILED"
}

Only generates 16 tokens

I tried phi-2 and llama-3-instruct and they both only generate 16 tokens. How can I change this?

Build not possible

I tried to build the Docker image from scratch but also get an error (using CUDA 11.8, runpod/base:0.4.4):

RUN python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm 150.2s

[builder 5/7] RUN python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm:
0.734 Obtaining vllm from git+https://github.com/runpod/vllm-fork-for-sls-worker.git@cuda-11.8#egg=vllm
0.734 Cloning https://github.com/runpod/vllm-fork-for-sls-worker.git (to revision cuda-11.8) to /src/vllm
0.741 Running command git clone --filter=blob:none --quiet https://github.com/runpod/vllm-fork-for-sls-worker.git /src/vllm
1.802 Running command git checkout -b cuda-11.8 --track origin/cuda-11.8
2.117 Switched to a new branch 'cuda-11.8'
2.117 Branch 'cuda-11.8' set up to track remote branch 'cuda-11.8' from 'origin'.
2.118 Resolved https://github.com/runpod/vllm-fork-for-sls-worker.git to commit 1de0211a7c2b0f736f1afd21f02740ec542c2e55
2.118 Preparing metadata (setup.py): started
3.241 Preparing metadata (setup.py): finished with status 'done'
3.378 Collecting ninja (from vllm)
3.703 Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
3.775 Collecting psutil (from vllm)
3.870 Downloading psutil-5.9.7-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
3.953 Collecting ray>=2.5.1 (from vllm)
4.048 Downloading ray-2.8.1-cp311-cp311-manylinux2014_x86_64.whl.metadata (13 kB)
4.141 Collecting pandas (from vllm)
4.236 Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
4.300 Collecting pyarrow (from vllm)
4.395 Downloading pyarrow-14.0.2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
4.436 Collecting sentencepiece (from vllm)
4.535 Downloading sentencepiece-0.1.99-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
4.978 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 3.0 MB/s eta 0:00:00
4.980 Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from vllm) (1.24.1)
4.981 Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from vllm) (23.2)
5.013 Collecting transformers>=4.36.0 (from vllm)
5.107 Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
5.114 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.8/126.8 kB 22.7 MB/s eta 0:00:00
5.121 Requirement already satisfied: fastapi in /usr/local/lib/python3.11/dist-packages (from vllm) (0.105.0)
5.122 Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.11/dist-packages (from vllm) (0.24.0.post1)
5.190 Collecting pydantic==1.10.13 (from vllm)
5.285 Downloading pydantic-1.10.13-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (149 kB)
5.293 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 149.6/149.6 kB 20.8 MB/s eta 0:00:00
5.419 Collecting aioprometheus[starlette] (from vllm)
5.516 Downloading aioprometheus-23.3.0-py3-none-any.whl (31 kB)
5.522 Requirement already satisfied: typing-extensions>=4.2.0 in /usr/local/lib/python3.11/dist-packages (from pydantic==1.10.13->vllm) (4.9.0)
5.590 Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.11/dist-packages (from ray>=2.5.1->vllm) (8.1.7)
5.590 Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from ray>=2.5.1->vllm) (3.9.0)
5.617 Collecting jsonschema (from ray>=2.5.1->vllm)
5.711 Downloading jsonschema-4.20.0-py3-none-any.whl.metadata (8.1 kB)
5.751 Collecting msgpack<2.0.0,>=1.0.0 (from ray>=2.5.1->vllm)
5.846 Downloading msgpack-1.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.1 kB)
5.950 Collecting protobuf!=3.19.5,>=3.15.3 (from ray>=2.5.1->vllm)
6.047 Downloading protobuf-4.25.1-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
6.051 Requirement already satisfied: pyyaml in /usr/local/lib/python3.11/dist-packages (from ray>=2.5.1->vllm) (6.0.1)
6.051 Requirement already satisfied: aiosignal in /usr/local/lib/python3.11/dist-packages (from ray>=2.5.1->vllm) (1.3.1)
6.052 Requirement already satisfied: frozenlist in /usr/local/lib/python3.11/dist-packages (from ray>=2.5.1->vllm) (1.4.1)
6.053 Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from ray>=2.5.1->vllm) (2.31.0)
6.182 Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.11/dist-packages (from transformers>=4.36.0->vllm) (0.20.1)
6.389 Collecting regex!=2019.12.17 (from transformers>=4.36.0->vllm)
6.484 Downloading regex-2023.10.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
6.490 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.9/40.9 kB 6.1 MB/s eta 0:00:00
6.580 Collecting tokenizers<0.19,>=0.14 (from transformers>=4.36.0->vllm)
6.674 Downloading tokenizers-0.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
6.731 Collecting safetensors>=0.3.1 (from transformers>=4.36.0->vllm)
6.831 Downloading safetensors-0.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
6.837 Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.11/dist-packages (from transformers>=4.36.0->vllm) (4.66.1)
6.843 Requirement already satisfied: orjson in /usr/local/lib/python3.11/dist-packages (from aioprometheus[starlette]->vllm) (3.9.10)
6.966 Collecting quantile-python>=1.1 (from aioprometheus[starlette]->vllm)
7.061 Downloading quantile-python-1.1.tar.gz (2.9 kB)
7.066 Preparing metadata (setup.py): started
7.147 Preparing metadata (setup.py): finished with status 'done'
7.149 Requirement already satisfied: starlette>=0.14.2 in /usr/local/lib/python3.11/dist-packages (from aioprometheus[starlette]->vllm) (0.27.0)
7.156 Requirement already satisfied: anyio<4.0.0,>=3.7.1 in /usr/local/lib/python3.11/dist-packages (from fastapi->vllm) (3.7.1)
7.227 Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->vllm) (2.8.2)
7.266 Collecting pytz>=2020.1 (from pandas->vllm)
7.361 Downloading pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
7.383 Collecting tzdata>=2022.1 (from pandas->vllm)
7.477 Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
7.491 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.8/341.8 kB 25.8 MB/s eta 0:00:00
7.512 Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (0.14.0)
7.513 Requirement already satisfied: httptools>=0.5.0 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (0.6.1)
7.514 Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (1.0.0)
7.516 Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (0.19.0)
7.517 Requirement already satisfied: watchfiles>=0.13 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (0.21.0)
7.518 Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (12.0)
7.527 Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.11/dist-packages (from anyio<4.0.0,>=3.7.1->fastapi->vllm) (3.6)
7.527 Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio<4.0.0,>=3.7.1->fastapi->vllm) (1.3.0)
7.568 Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers>=4.36.0->vllm) (2023.12.2)
7.576 Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas->vllm) (1.16.0)
7.656 Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.11/dist-packages (from jsonschema->ray>=2.5.1->vllm) (23.1.0)
7.673 Collecting jsonschema-specifications>=2023.03.6 (from jsonschema->ray>=2.5.1->vllm)
7.769 Downloading jsonschema_specifications-2023.11.2-py3-none-any.whl.metadata (3.0 kB)
7.798 Collecting referencing>=0.28.4 (from jsonschema->ray>=2.5.1->vllm)
7.892 Downloading referencing-0.32.0-py3-none-any.whl.metadata (2.7 kB)
8.007 Collecting rpds-py>=0.7.1 (from jsonschema->ray>=2.5.1->vllm)
8.102 Downloading rpds_py-0.15.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
8.116 Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->ray>=2.5.1->vllm) (3.3.2)
8.118 Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->ray>=2.5.1->vllm) (2.0.7)
8.118 Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->ray>=2.5.1->vllm) (2023.11.17)
8.292 Downloading pydantic-1.10.13-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
8.441 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 20.7 MB/s eta 0:00:00
8.539 Downloading ray-2.8.1-cp311-cp311-manylinux2014_x86_64.whl (63.1 MB)
12.06 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.1/63.1 MB 11.6 MB/s eta 0:00:00
12.16 Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
12.25 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.2/8.2 MB 100.0 MB/s eta 0:00:00
12.34 Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
12.35 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 307.2/307.2 kB 70.1 MB/s eta 0:00:00
12.45 Downloading pandas-2.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
13.01 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.2/12.2 MB 18.7 MB/s eta 0:00:00
13.11 Downloading psutil-5.9.7-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (285 kB)
13.12 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 285.5/285.5 kB 34.1 MB/s eta 0:00:00
13.21 Downloading pyarrow-14.0.2-cp311-cp311-manylinux_2_28_x86_64.whl (38.0 MB)
14.05 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.0/38.0 MB 43.3 MB/s eta 0:00:00
14.14 Downloading msgpack-1.0.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (557 kB)
14.16 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 558.0/558.0 kB 54.1 MB/s eta 0:00:00
14.25 Downloading protobuf-4.25.1-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
14.26 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.6/294.6 kB 49.4 MB/s eta 0:00:00
14.35 Downloading pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
14.36 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 502.5/502.5 kB 54.5 MB/s eta 0:00:00
14.46 Downloading regex-2023.10.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (785 kB)
14.48 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 785.1/785.1 kB 52.0 MB/s eta 0:00:00
14.57 Downloading safetensors-0.4.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
14.60 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 53.5 MB/s eta 0:00:00
14.70 Downloading tokenizers-0.15.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
14.77 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.8/3.8 MB 57.7 MB/s eta 0:00:00
14.86 Downloading jsonschema-4.20.0-py3-none-any.whl (84 kB)
14.86 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.7/84.7 kB 25.3 MB/s eta 0:00:00
14.96 Downloading jsonschema_specifications-2023.11.2-py3-none-any.whl (17 kB)
15.06 Downloading referencing-0.32.0-py3-none-any.whl (26 kB)
15.16 Downloading rpds_py-0.15.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
15.18 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 56.1 MB/s eta 0:00:00
15.29 Building wheels for collected packages: quantile-python
15.29 Building wheel for quantile-python (setup.py): started
15.42 Building wheel for quantile-python (setup.py): finished with status 'done'
15.42 Created wheel for quantile-python: filename=quantile_python-1.1-py3-none-any.whl size=3444 sha256=69cee18b27427564f5b3fe833b047425f0cbdd09c7b26201b727b749af7bbf68
15.42 Stored in directory: /runpod-volume/.cache/pip/wheels/67/a2/17/29e7169adf03a7e44b922abb6a42c2c1b0fda11f7bfbdb24a2
15.42 Successfully built quantile-python
16.00 Installing collected packages: sentencepiece, quantile-python, pytz, ninja, tzdata, safetensors, rpds-py, regex, pydantic, pyarrow, psutil, protobuf, msgpack, aioprometheus, referencing, pandas, tokenizers, jsonschema-specifications, transformers, jsonschema, ray, vllm
16.21 Attempting uninstall: pydantic
16.21 Found existing installation: pydantic 2.5.2
16.22 Uninstalling pydantic-2.5.2:
16.23 Successfully uninstalled pydantic-2.5.2
23.25 Running setup.py develop for vllm
147.9 error: subprocess-exited-with-error
147.9
147.9 × python setup.py develop did not run successfully.
147.9 │ exit code: 1
147.9 ╰─> [221 lines of output]
147.9 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
147.9 running develop

`MAX_CONCURRENCY` parameter doesn't work

Current behaviour:
When sending multiple requests with a short interval (e.g. 1 second) to the endpoint with 1 worker enabled, all the requests skip the queue and are passed to the worker (the Queued amount is always 0). This results in a very long execution time.

Screenshot: (screenshot from 2024-01-13 attached in the original issue)

Steps to reproduce:

  1. set MAX_CONCURRENCY to 1
  2. send multiple requests with a short interval (e.g. 1 second)

Expected behaviour:
Only 1 request should be processed at a time, all the subsequent requests should wait in the queue.

This is especially important when using awq models since only a small number of concurrent requests can be efficiently processed in this case.

Support custom vLLM build

Since vLLM is developing really fast, it would be useful to have a command-line option (--build-arg) to point to a locally built vLLM package instead of a specific version. It would allow using local builds from main or PR branches, as well as modified versions of vLLM.

test_input.json is required

When I try to run the worker, I'm getting the following error:
WARN | test_input.json not found, exiting.

This prevents me from using the RunPod worker.

Support for mistralai/Mixtral-8x7B-Instruct-v0.1

The README says mistralai/Mixtral-8x7B-Instruct-v0.1 is supported, but the RunPod UI doesn't allow more than 1 GPU per worker for 80 GB GPU cards. Hence, the model can't be served. Is there any other way to serve the original model without using a quantized version?

python setup.py develop did not run successfully

I ran the following command:

docker build -t <<IMAGE_NAME>> --build-arg MODEL_NAME="TheBloke_vicuna-7B-1.1-GPTQ" --build-arg MODEL_BASE_PATH="/models" .

I ran this on my Windows system using WSL. My computer has an NVIDIA GPU; however, when I try to build the Docker image using the command above, I get the following error:

207.2   Running setup.py develop for vllm
466.3     error: subprocess-exited-with-error
466.3
466.3     × python setup.py develop did not run successfully.
466.3     │ exit code: 1
466.3     ╰─> [180 lines of output]
466.3         No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
466.3         running develop
466.3         /usr/local/lib/python3.11/dist-packages/setuptools/command/develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.
466.3         !!
466.3
466.3                 ********************************************************************************
466.3                 Please avoid running ``setup.py`` and ``easy_install``.
466.3                 Instead, use pypa/build, pypa/installer or other
466.3                 standards-based tools.
466.3
466.3                 See https://github.com/pypa/setuptools/issues/917 for details.
466.3                 ********************************************************************************
466.3
466.3         !!
466.3           easy_install.initialize_options(self)
466.3         /usr/local/lib/python3.11/dist-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
466.3         !!
466.3
466.3                 ********************************************************************************
466.3                 Please avoid running ``setup.py`` directly.
466.3                 Instead, use pypa/build, pypa/installer or other
466.3                 standards-based tools.
466.3
466.3                 See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
466.3                 ********************************************************************************
466.3
466.3         !!
466.3           self.initialize_options()
466.3         running egg_info
466.3         creating vllm.egg-info
466.3         writing vllm.egg-info/PKG-INFO
466.3         writing dependency_links to vllm.egg-info/dependency_links.txt
466.3         writing requirements to vllm.egg-info/requires.txt
466.3         writing top-level names to vllm.egg-info/top_level.txt
466.3         writing manifest file 'vllm.egg-info/SOURCES.txt'
466.3         reading manifest file 'vllm.egg-info/SOURCES.txt'
466.3         reading manifest template 'MANIFEST.in'
466.3         adding license file 'LICENSE'
466.3         writing manifest file 'vllm.egg-info/SOURCES.txt'
466.3         running build_ext
466.3         /usr/local/lib/python3.11/dist-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 11.8
466.3           warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
466.3         building 'vllm._C' extension
466.3         creating /src/vllm/build
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311/csrc
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311/csrc/attention
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311/csrc/quantization
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311/csrc/quantization/awq
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311/csrc/quantization/gptq
466.3         creating /src/vllm/build/temp.linux-x86_64-cpython-311/csrc/quantization/squeezellm
466.3         Emitting ninja build file /src/vllm/build/temp.linux-x86_64-cpython-311/build.ninja...
466.3         Compiling objects...

This is the full log output:
logs.txt

Any help would be greatly appreciated!!!

Building Docker with model built in

Hi there,

The current version of the download_model.py script does not work due to the empty TENSORIZE_MODEL env check on line 50.

Once that is fixed, the weight_utils file in the vllm-base image does not exist - it seems there is some version mismatch going on with the vllm submodule and the new 1.0.0preview image.

Could you take a look?
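
For context, the empty-string pitfall mentioned above is the classic failure mode of boolean environment checks. A minimal sketch of a more forgiving check, reusing the TENSORIZE_MODEL name from this issue (the surrounding code is illustrative, not the actual download_model.py):

import os

def env_flag(name: str, default: bool = False) -> bool:
    # Treat unset, "", "0", "false" and "no" as False; anything else as True.
    value = os.environ.get(name)
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() not in ("0", "false", "no")

if env_flag("TENSORIZE_MODEL"):
    print("Tensorizing model...")  # placeholder for the real tensorization step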

weird output when using a custom model and ChatAPI does not work

Hello, I am using a pre-trained model which is CodeLlama-based: https://huggingface.co/defog/sqlcoder-7b-2. I have further fine-tuned it and published it in my own repo.
These are the logs: (log screenshot attached in the original issue)

Response when a request is made

{
  "delayTime": 1203,
  "executionTime": 1601,
  "id": "b7d69dec-5bd4-4879-95c7-837b7e62ed2c-e1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            "6\" which is available in the Play store to try if you want to copy"
          ]
        }
      ],
      "usage": {
        "input": 3,
        "output": 16
      }
    }
  ],
  "status": "COMPLETED"
}

What could be the issue? Also, the OpenAI-compatible API is not working.

"n" parameter does not return multiple responses

I'm encountering an issue with the "n" parameter. It does not appear to be functioning as expected, since it does not return multiple responses. According to the documentation, the "n" parameter corresponds to the number of output sequences to return for the given prompt.

{
    "input": {
        "prompt": "A prompt...",
        "sampling_params": {
            "max_tokens": 300,
            "temperature": 0.85,
            "top_k": 40,
            "n": 3,
            "presence_penalty": 1.2
        }
    }
}

Expected:

The API should return three (3) distinct answers.

Actual:

Despite setting "n" to 3, only one response is returned.

{
    "delayTime": 161,
    "executionTime": 4182,
    "id": "sync-302d9141-abf8-4335-9499-0caf70777c81-u1",
    "output": [
        [
            {
                "text": " A response",
                "usage": {
                    "input": 274,
                    "output": 26
                }
            }
        ]
    ],
    "status": "COMPLETED"
}

I've also encountered another confusing outcome when I attempted to use the "stream" parameter.

{
    "input": {
        "prompt": "### Instruction:\nSimply output the answer to 1+1\n\n### Response:\nSure, 1+1=",
        "stream": true,
        "sampling_params": {
            "max_tokens": 50,
            "temperature": 0.85,
            "top_k": 40,
            "n": 3,
            "presence_penalty": 1.2
        }
    }
}

Expected:

Since the 'stream' option is set to 'true', I expected the stream to carry the 3 responses requested via 'n'.

Actual:

Results are unexpected; it appears multiple responses are received but only the first contains text while the others are empty or single characters.

{
    "delayTime": 177,
    "executionTime": 1512,
    "id": "sync-0bc6241b-4350-4664-9d09-cd5d2ef6e59a-u1",
    "output": [
        [
            {
                "text": "2",
                "usage": {
                    "input": 32,
                    "output": 1
                }
            },
            {
                "text": "",
                "usage": {
                    "input": 32,
                    "output": 1
                }
            },
            {
                "text": "",
                "usage": {
                    "input": 32,
                    "output": 1
                }
            },
            {
                "text": "",
                "usage": {
                    "input": 32,
                    "output": 2
                }
            },
            {
                "text": "",
                "usage": {
                    "input": 32,
                    "output": 2
                }
            },
            {
                "text": ".",
                "usage": {
                    "input": 32,
                    "output": 2
                }
            },
            {
                "text": "",
                "usage": {
                    "input": 32,
                    "output": 2
                }
            },
            {
                "text": "",
                "usage": {
                    "input": 32,
                    "output": 2
                }
            },
            {
                "text": ".",
                "usage": {
                    "input": 32,
                    "output": 3
                }
            }
        ]
    ],
    "status": "COMPLETED"
}

Unclear whether this is correct or a bug. Any suggestions or potential solutions would be appreciated.

Thanks! 🤗

Slow streaming

Streaming is extremely slow. The intended effect is to have it look like it's typing, of course, but instead it just loads in laggy chunks. A GPU pod works fine; it's only the serverless endpoint that causes this. Unfortunately, until this is better we're forced to use Hugging Face serverless.

PSA: Streaming is slow

When using STREAMING=True, I got an 8-10x slower response time (time to full response) than when using STREAMING=False.
It would be great to investigate this discrepancy and to document the regression in the meantime.

Multi-LoRA

Any update on when this feature will be available? Thanks

Cannot run Mixtral 8x7B Instruct AWQ

I have successfully been able to run mistralai/Mistral-7B-Instruct in both its original and quantized (AWQ) form on RunPod serverless using this repo. However, when I try to run Mixtral AWQ, I simply get no text output from the chat response and no errors. I have tried many times with various configurations, including:

  • Running on 80gb GPUs instead of 48GB
  • Supplying the Mixtral template format in the prompt e.g. <s>[INST] MESSAGE [/INST]
  • Using messages instead of prompt
  • Rebuilding the image to double check parameters
  • Increasing the number of output tokens
  • Toggling on and off the chat template

Nothing so far has worked and I would appreciate any new ideas. I really love the concept of runpod and serverless GPUs for LLMs but this issue has me stumped.

Please see the build configuration, example input, and incorrect outputs below:

Build command:
docker build --network=host -t danielallium/mixtral-instruct-awq:v0.4 --build-arg MODEL_NAME="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ" --build-arg BASE_PATH="/models" --build-arg WORKER_CUDA_VERSION="12.1.0" --build-arg QUANTIZATION=awq .

Input:
{
  "input": {
    "prompt": "Hello World"
  }
}

Output without errors:
{
  "delayTime": 81440,
  "executionTime": 1561,
  "id": "334efe4a-9039-47e6-8dec-ec772df94370-u1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            ""
          ]
        }
      ],
      "usage": {
        "input": 3,
        "output": 16
      }
    }
  ],
  "status": "COMPLETED"
}

Note that "tokens" using any other LLM will return a non-empty string. You can also test my docker image directly as it is public at: anielallium/mixtral-instruct-awq:v0.4

This build works just fine with the above prompt though:
docker build --network=host -t danielallium/mistral-7b-instruct-awq:v0.2 --build-arg MODEL_NAME="TheBloke/Mistral-7B-Instruct-v0.2-AWQ" --build-arg BASE_PATH="/models" --build-arg WORKER_CUDA_VERSION="12.1.0" --build-arg QUANTIZATION=awq .

Best way to record data

I am looking to record the inputs and outputs of the vLLM worker. I could put an HTTP proxy in front and capture the traffic, or modify your handler.

Rather than making changes to the code, I was wondering if you might have a better way to do this?
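
One low-touch option, sketched below, is to wrap the existing handler rather than modify it. This assumes the handler is an async generator registered via runpod.serverless.start and that a network volume is mounted at /runpod-volume; both are assumptions to adjust to your setup:

import functools
import json
import time

def record_io(inner_handler, log_path="/runpod-volume/requests.jsonl"):
    # Wrap an async-generator handler so every request/response pair is appended to a JSONL file.
    @functools.wraps(inner_handler)
    async def wrapper(job):
        record = {"ts": time.time(), "id": job.get("id"), "input": job.get("input"), "output": []}
        async for chunk in inner_handler(job):
            record["output"].append(chunk)
            yield chunk
        with open(log_path, "a") as f:
            f.write(json.dumps(record, default=str) + "\n")
    return wrapper

# Usage: runpod.serverless.start({"handler": record_io(handler), ...})

An HTTP proxy in front of the endpoint works too, but a wrapper like this keeps the capture on the worker side without touching the handler's logic.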

HF Model Download get stuck

At around 1-3%, the model download gets stuck while building the Docker image and doesn't move forward. This happens with different models too and wasn't happening earlier. Outside of the Docker image build, I am able to download the models without issue.

sudo docker build -t username/image:tag --build-arg MODEL_NAME="openchat/openchat_3.5" --build-arg MODEL_BASE_PATH="/models" .

How can I update to vLLM v0.4.1 for Llama 3 support?

Hello everyone,

I would like to update the vLLM version to v0.4.1 in order to get access to Llama 3, but I don't know how to modify the fork runpod/vllm-fork-for-sls-worker. Could you please guide me? Happy to help in any way!

GGUF compatibility

I've used the runpod/worker-vllm:0.3.0-cuda11.8.0 container for several different LLMs and it has worked fine so far.

I've just been given a requirement to test a GGUF model (specifically https://huggingface.co/impactframes/llama3_if_ai_sdpromptmkr_q4km) and it keeps generating errors:

Entry Not Found for url: https://huggingface.co/impactframes/llama3_if_ai_sdpromptmkr_q4km/resolve/main/config.json.

Is this an issue with the model, or the worker? Is there a known workaround?

Thanks

Errors when building the image [Building on MACOS]

I'm building the image with WORKER_CUDA_VERSION=12.1 on an M1 Mac using command docker buildx build -t antonioglass/worker-vllm-new:1.0.0 . --platform linux/amd64 and getting errors. See below.

961.7 Building wheels for collected packages: vllm, quantile-python
961.7   Building editable for vllm (pyproject.toml): started
1311.7   Building editable for vllm (pyproject.toml): still running...
1639.3   Building editable for vllm (pyproject.toml): still running...
1807.1   Building editable for vllm (pyproject.toml): still running...
1876.1   Building editable for vllm (pyproject.toml): still running...
2254.9   Building editable for vllm (pyproject.toml): still running...
2589.2   Building editable for vllm (pyproject.toml): still running...
2626.7   Building editable for vllm (pyproject.toml): finished with status 'error'
2626.9   error: subprocess-exited-with-error
2626.9   
2626.9   × Building editable for vllm (pyproject.toml) did not run successfully.
2626.9   │ exit code: -9
2626.9   ╰─> [87 lines of output]
2626.9       /tmp/pip-build-env-0m__bivn/overlay/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
2626.9         device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
2626.9       No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2626.9       running editable_wheel
2626.9       creating /tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info
2626.9       writing /tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/PKG-INFO
2626.9       writing dependency_links to /tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/dependency_links.txt
2626.9       writing requirements to /tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/requires.txt
2626.9       writing top-level names to /tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/top_level.txt
2626.9       writing manifest file '/tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/SOURCES.txt'
2626.9       reading manifest file '/tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/SOURCES.txt'
2626.9       reading manifest template 'MANIFEST.in'
2626.9       adding license file 'LICENSE'
2626.9       writing manifest file '/tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm.egg-info/SOURCES.txt'
2626.9       creating '/tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm-0.2.6.dist-info'
2626.9       creating /tmp/pip-wheel-jz3moxtu/.tmp-h6a0jx2b/vllm-0.2.6.dist-info/WHEEL
2626.9       running build_py
2626.9       running build_ext
2626.9       /tmp/pip-build-env-0m__bivn/overlay/local/lib/python3.11/dist-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.1
2626.9         warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')

Runpod

I tried to load the Docker image and use it on RunPod, but I got the following message:

2023-12-19T21:52:21Z Traceback (most recent call last):
  File "/handler.py", line 3, in <module>
    import runpod
  File "/usr/local/lib/python3.11/dist-packages/runpod/__init__.py", line 6, in <module>
    from . import serverless
  File "/usr/local/lib/python3.11/dist-packages/runpod/serverless/__init__.py", line 16, in <module>
    from .modules import rp_fastapi
  File "/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_fastapi.py", line 10, in <module>
    from fastapi import FastAPI, APIRouter
  File "/usr/local/lib/python3.11/dist-packages/fastapi/__init__.py", line 7, in <module>
    from .applications import FastAPI as FastAPI
  File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 16, in <module>
    from fastapi import routing
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 22, in <module>
    from fastapi import params
  File "/usr/local/lib/python3.11/dist-packages/fastapi/params.py", line 5, in <module>
    from fastapi.openapi.models import Example
  File "/usr/local/lib/python3.11/dist-packages/fastapi/openapi/models.py", line 4, in <module>
    from fastapi._compat import (
  File "/usr/local/lib/python3.11/dist-packages/fastapi/_compat.py", line 20, in <module>
    from fastapi.exceptions import RequestErrorModel
  File "/usr/local/lib/python3.11/dist-packages/fastapi/exceptions.py", line 3, in <module>
    from pydantic import BaseModel, create_model
  File "/usr/local/lib/python3.11/dist-packages/pydantic/__init__.py", line 372, in __getattr__
    module = import_module(module_name, package=package)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pydantic/main.py", line 11, in <module>
    import pydantic_core
  File "/usr/local/lib/python3.11/dist-packages/pydantic_core/__init__.py", line 30, in <module>
    from .core_schema import CoreConfig, CoreSchema, CoreSchemaType, ErrorType
  File "/usr/local/lib/python3.11/dist-packages/pydantic_core/core_schema.py", line 15, in <module>
    from typing_extensions import deprecated
ImportError: cannot import name 'deprecated' from 'typing_extensions' (/usr/local/lib/python3.11/dist-packages/typing_extensions.py)

ImportError prepare_hf_model_weights method

vLLM 0.4.1 introduced a new model_loader module that no longer provides the prepare_hf_model_weights function, so the model downloader module fails to import it during the Docker build.

support for quantization?

I am wondering if there is a way to load a model with quantization. I can load my model with AWQ quantization using the vLLM api_server, but I am not seeing support for it in serverless endpoints.

Thanks!

Error after tokenizer commit

2b5b8df

After this commit I can't build my image:


docker build -t instructkr/qwen:1.5_72b_chat --build-arg MODEL_NAME="Qwen/Qwen1.5-72B-Chat-AWQ" --build-arg QUANTIZATION="awq" --build-arg MAX_MODEL_LENGTH="2048" --build-arg MODEL_BASE_PATH="/models" .
[+] Building 784.2s (11/12)                                                                  docker:default
 => [internal] load .dockerignore                                                                      0.0s
 => => transferring context: 2B                                                                        0.0s
 => [internal] load build definition from Dockerfile                                                   0.0s
 => => transferring dockerfile: 1.48kB                                                                 0.0s
 => [internal] load metadata for docker.io/runpod/worker-vllm:base-0.2.2-cuda11.8.0                    1.6s
 => [auth] runpod/worker-vllm:pull token for registry-1.docker.io                                      0.0s
 => [vllm-base 1/7] FROM docker.io/runpod/worker-vllm:base-0.2.2-cuda11.8.0@sha256:645d42b84d914a8daa  0.0s
 => [internal] load build context                                                                      0.0s
 => => transferring context: 275B                                                                      0.0s
 => CACHED [vllm-base 2/7] RUN apt-get update -y     && apt-get install -y python3-pip                 0.0s
 => CACHED [vllm-base 3/7] COPY builder/requirements.txt /requirements.txt                             0.0s
 => CACHED [vllm-base 4/7] RUN --mount=type=cache,target=/root/.cache/pip     python3 -m pip install   0.0s
 => CACHED [vllm-base 5/7] COPY builder/download_model.py /download_model.py                           0.0s
 => ERROR [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false     if [ -f /run/secre  782.5s
------
 > [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false     if [ -f /run/secrets/HF_TOKEN ]; then         export HF_TOKEN=$(cat /run/secrets/HF_TOKEN);     fi &&     if [ -n "Qwen/Qwen1.5-72B-Chat-AWQ" ]; then         python3 /download_model.py;     fi:
2.724 INFO 02-07 07:13:43 weight_utils.py:164] Using model weights format ['*.safetensors']
model-00008-of-00011.safetensors: 100%|██████████| 3.98G/3.98G [09:14<00:00, 7.17MB/s]
model-00007-of-00011.safetensors: 100%|██████████| 3.94G/3.94G [09:30<00:00, 6.90MB/s]
model-00006-of-00011.safetensors: 100%|██████████| 3.98G/3.98G [09:45<00:00, 6.79MB/s]
model-00003-of-00011.safetensors: 100%|██████████| 3.94G/3.94G [09:55<00:00, 6.62MB/s]
model-00004-of-00011.safetensors: 100%|██████████| 3.98G/3.98G [09:57<00:00, 6.65MB/s]
model-00001-of-00011.safetensors: 100%|██████████| 3.99G/3.99G [10:02<00:00, 6.63MB/s]
model-00005-of-00011.safetensors: 100%|██████████| 3.98G/3.98G [10:18<00:00, 6.42MB/s]
model-00002-of-00011.safetensors: 100%|██████████| 3.98G/3.98G [10:35<00:00, 6.25MB/s]
model-00011-of-00011.safetensors: 100%|██████████| 2.49G/2.49G [02:24<00:00, 17.3MB/s]
model-00010-of-00011.safetensors: 100%|██████████| 3.03G/3.03G [02:57<00:00, 17.1MB/s]
model-00009-of-00011.safetensors: 100%|██████████| 3.98G/3.98G [03:38<00:00, 18.2MB/s]
config.json: 100%|██████████| 841/841 [00:00<00:00, 6.97MB/s]
generation_config.json: 100%|██████████| 217/217 [00:00<00:00, 1.53MB/s]
quant_config.json: 100%|██████████| 126/126 [00:00<00:00, 826kB/s]
model.safetensors.index.json: 100%|██████████| 179k/179k [00:00<00:00, 484kB/s]
tokenizer_config.json: 100%|██████████| 1.41k/1.41k [00:00<00:00, 8.28MB/s]
vocab.json: 100%|██████████| 2.78M/2.78M [00:00<00:00, 6.78MB/s]
tokenizer.json: 100%|██████████| 7.03M/7.03M [00:01<00:00, 5.28MB/s]
781.7 Traceback (most recent call last):
781.7   File "/download_model.py", line 48, in <module>
781.7     tokenizer_folder = download_extras_or_tokenizer(tokenizer, download_dir, revisions["tokenizer"])
781.7   File "/download_model.py", line 10, in download_extras_or_tokenizer
781.7     folder = snapshot_download(
781.7   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
781.7     validate_repo_id(arg_value)
781.7   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 164, in validate_repo_id
781.7     raise HFValidationError(
781.7 huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: ''.
------
Dockerfile:35
--------------------
  34 |     COPY builder/download_model.py /download_model.py
  35 | >>> RUN --mount=type=secret,id=HF_TOKEN,required=false \
  36 | >>>     if [ -f /run/secrets/HF_TOKEN ]; then \
  37 | >>>         export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
  38 | >>>     fi && \
  39 | >>>     if [ -n "$MODEL_NAME" ]; then \
  40 | >>>         python3 /download_model.py; \
  41 | >>>     fi
  42 |
--------------------
ERROR: failed to solve: process "/bin/sh -c if [ -f /run/secrets/HF_TOKEN ]; then         export HF_TOKEN=$(cat /run/secrets/HF_TOKEN);     fi &&     if [ -n \"$MODEL_NAME\" ]; then         python3 /download_model.py;     fi" did not complete successfully: exit code: 1

Got some deprecation notice, might update these

2024-05-23T09:58:01.432712734Z CUDA Version 12.1.0
2024-05-23T09:58:01.433425080Z
2024-05-23T09:58:01.433427258Z Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-05-23T09:58:01.434084437Z
2024-05-23T09:58:01.434087212Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-05-23T09:58:01.434089552Z By pulling and using the container, you accept the terms and conditions of this license:
2024-05-23T09:58:01.434091167Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-05-23T09:58:01.434092825Z
2024-05-23T09:58:01.434094209Z A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-05-23T09:58:01.442413442Z
2024-05-23T09:58:01.442446959Z *************************
2024-05-23T09:58:01.442461039Z ** DEPRECATION NOTICE! **
2024-05-23T09:58:01.442644921Z *************************

2024-05-23T09:58:01.442672640Z THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
2024-05-23T09:58:01.442684090Z https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
2024-05-23T09:58:01.442696513Z
2024-05-23T09:58:05.243818005Z /usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.

Update documentation to note support for extra parameters

Greetings!

I just wanted to make a quick note that the documentation for worker-vllm and RunPod both don't seem to mention anything about vLLM supporting guided generation via JSON schemas or regex/grammar patterns, but it does in fact support it, since vLLM itself supports it.

It's a great feature and more people should consider using it for sure. In case you're not familiar, check out the vLLM docs for details about the "extra" parameters on the OpenAI completions endpoints:

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api
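
For example, a guided-generation request can be sent through the standard OpenAI client by passing vLLM's extra parameters via extra_body. A minimal sketch, in which the endpoint ID, API key, model name and schema are placeholders, and the exact extra-parameter names should be checked against the vLLM docs linked above:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
)

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="YOUR_MODEL_NAME",
    messages=[{"role": "user", "content": "Give me a person as JSON."}],
    # vLLM-specific extra parameter for JSON-schema guided decoding.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)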

Error while returning job result: Object of type coroutine is not JSON serializable

Hi, I'm trying to run vLLM inference on serverless.
Non-streaming mode works great, but with streaming I had problems, which is how I found your repository. I set everything up according to the README.md and get the following log; all requests hang endlessly in the IN_PROGRESS status and never output a result.

 Received Job | {'id': '0c0fd95b-b9ee-4b24-bbb2-c34787d7158a', 'input': {'prompt': 'Hello, how are you?', 'streaming': False}, 'status': 'IN_QUEUE'}
2023-07-30T07:06:27.317761168Z DEBUG  | 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a | Job Confirmed
2023-07-30T07:06:27.318091082Z DEBUG  | 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a | Set Job ID
2023-07-30T07:06:27.318373775Z INFO   | 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a | Started
2023-07-30T07:06:27.318686098Z DEBUG  | 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a | Handler output: <coroutine object handler at 0x7f94f3d72c00>
2023-07-30T07:06:27.319019081Z DEBUG  | 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a | run_job return: {'output': <coroutine object handler at 0x7f94f3d72c00>}
2023-07-30T07:06:27.319536687Z /home/vllm/.local/lib/python3.10/site-packages/runpod/serverless/work_loop.py:73: RuntimeWarning: coroutine 'handler' was never awaited
2023-07-30T07:06:27.319551007Z   job_result = run_job(config["handler"], job)
2023-07-30T07:06:27.319644579Z RuntimeWarning: Enable tracemalloc to get the object allocation traceback
2023-07-30T07:06:27.320164124Z DEBUG  | rp_debugger | Flag not set, skipping debugger output.
2023-07-30T07:06:27.320545108Z ERROR  | Error while returning job result 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a: Object of type coroutine is not JSON serializable
2023-07-30T07:06:27.320859472Z INFO   | 0c0fd95b-b9ee-4b24-bbb2-c34787d7158a | Finished
2023-07-30T07:06:29.920603402Z DEBUG  | Heartbeat Sent | URL: https://api.runpod.ai/v2/***/ping/t59wh7i6oie988?gpu=NVIDIA+RTX+A4000 | Status: 200
2023-07-30T07:06:29.920652813Z DEBUG  | Heartbeat | Interval: 4000ms | Params: None
2023-07-30T07:06:34.061697951Z DEBUG  | Heartbeat Sent | URL: https://api.runpod.ai/v2/***/ping/t59wh7i6oie988?gpu=NVIDIA+RTX+A4000 | Status: 200
2023-07-30T07:06:34.061755502Z DEBUG  | Heartbeat | Interval: 4000ms | Params: None
2023-07-30T07:06:38.200623037Z DEBUG  | Heartbeat Sent | URL: https://api.runpod.ai/v2/***/ping/t59wh7i6oie988?gpu=NVIDIA+RTX+A4000 | Status: 200
2023-07-30T07:06:38.200694709Z DEBUG  | Heartbeat | Interval: 4000ms | Params: None
2023-07-30T07:06:42.336106667Z DEBUG  | Heartbeat Sent | URL: https://api.runpod.ai/v2/***/ping/t59wh7i6oie988?gpu=NVIDIA+RTX+A4000 | Status: 200
2023-07-30T07:06:42.336142857Z DEBUG  | Heartbeat | Interval: 4000ms | Params: None
2023-07-30T07:06:46.472655737Z DEBUG  | Heartbeat Sent | URL: https://api.runpod.ai/v2/***/ping/t59wh7i6oie988?gpu=NVIDIA+RTX+A4000 | Status: 200
2023-07-30T07:06:46.472694128Z DEBUG  | Heartbeat | Interval: 4000ms | Params: None
2023-07-30T07:06:50.608441169Z DEBUG  | Heartbeat Sent | URL: https://api.runpod.ai/v2/***/ping/t59wh7i6oie988?gpu=NVIDIA+RTX+A4000 | Status: 200
2023-07-30T07:06:50.608482890Z DEBUG  | Heartbeat | Interval: 4000ms | Params: None
2023-07-30T07:06:51.275274822Z DEBUG  | Received Job | {'id': 'dfda2cbe-3f67-4f22-b31e-06aad41fd85b', 'input': {'prompt': 'Hello, how are you?', 'streaming': True}, 'status': 'IN_QUEUE'}
2023-07-30T07:06:51.275311002Z DEBUG  | dfda2cbe-3f67-4f22-b31e-06aad41fd85b | Job Confirmed
2023-07-30T07:06:51.275313502Z DEBUG  | dfda2cbe-3f67-4f22-b31e-06aad41fd85b | Set Job ID
2023-07-30T07:06:51.275315313Z INFO   | dfda2cbe-3f67-4f22-b31e-06aad41fd85b | Started
2023-07-30T07:06:51.275317053Z DEBUG  | dfda2cbe-3f67-4f22-b31e-06aad41fd85b | Handler output: <coroutine object handler at 0x7f94f3d723b0>
2023-07-30T07:06:51.275319402Z DEBUG  | dfda2cbe-3f67-4f22-b31e-06aad41fd85b | run_job return: {'output': <coroutine object handler at 0x7f94f3d723b0>}
2023-07-30T07:06:51.275468855Z DEBUG  | rp_debugger | Flag not set, skipping debugger output.
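
For context, the "Object of type coroutine is not JSON serializable" error above is what the RunPod SDK reports when an async handler's return value is handed back without being awaited or iterated. Streaming handlers are normally written as async generators and registered with return_aggregate_stream so that non-streaming calls still receive a complete result. A minimal sketch, assuming a recent runpod release (the handler body is an illustrative placeholder, not this repo's actual code):

import runpod

async def handler(job):
    prompt = job["input"]["prompt"]
    # Illustrative placeholder: a real handler would stream tokens from the vLLM engine here.
    for token in prompt.split():
        yield {"text": token}

runpod.serverless.start({
    "handler": handler,
    # Aggregate streamed chunks into a single result for non-streaming /run and /runsync calls.
    "return_aggregate_stream": True,
})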

OOM on second request

I'm using MODEL_NAME=TheBloke/dolphin-2.2.1-mistral-7B-AWQ and QUANTIZATION=awq on a runpod serverless instance with a network drive, RTX 4090 which should be plenty of VRAM for this, and docker image runpod/worker-vllm:stable-cuda12.1.0

My first request completes successfully, but the second request to the same worker (sent after the first has completed) always crashes with OOM. If I log into the web terminal, nvidia-smi says all the vram is taken but lists no process as responsible.

Here's the code I'm using. I just run this once, wait for it to complete, and then run it again.

from openai import OpenAI

api_key = "****"
endpoint_id = "*****"
model_name = "TheBloke/dolphin-2.2.1-mistral-7B-AWQ"

# Initialize the OpenAI client with your RunPod API key and endpoint URL
client = OpenAI(
    api_key=api_key,
    base_url=f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1",
)

completion = "In the world of oncology, Pik3CA is"

print(f"Non-streaming completion of prompt: {completion}")
response = client.completions.create(
    model=model_name,
    prompt=completion,
    temperature=0,
    max_tokens=100,
)
# Print the response
print(response.choices[0].text)

Additional information: (two screenshots dated 2024-06-24 attached in the original issue)

Incorrect path_or_model_id

Hi!

In the last few hours I've been getting this error while pulling any model from Hugging Face:

OSError: Incorrect path_or_model_id: ''. Please provide either the path to a local folder or the repo_id of a model on the Hub.

I am using the standard vllm template from runpod serverless.

Thank you!

BadRequestError on runsync route, or what is the correct method to hit handler.py's locally run API?

I'm getting a BadRequestError when I try to test the vllm worker locally.

I'm running my handler locally for testing, using MODEL_NAME=/models/stablelm-3b-4e1t python3 -u /src/handler.py --rp_serve_api --rp_api_port 8000 --rp_api_host 0.0.0.0, in a docker image built using the instructions found at https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside, and I'm trying to send test requests to the runsync route based on what is described here:

https://blog.runpod.io/workers-local-api-server-introduced-with-runpod-python-0-9-13/

I've tried using the API test forms on the http://localhost:8000/docs page, and I've also tried with curl:

curl -H 'content-type: application/json' -d '{"input":{"message":"blah de blah"}}' http://localhost:8000/runsync

However, I always get this response:

{
  "id": "test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a",
  "status": "COMPLETED",
  "output": [
    {
      "error": {
        "object": "error",
        "message": "",
        "type": "BadRequestError",
        "param": null,
        "code": 400
      }
    }
  ]
}

I also tried the {"input": {"number":123}} body shown in the blog post, same result.

What am I doing wrong?
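
One thing worth checking: the vLLM worker validates the request body, and a payload with only a message (or number) key does not match the prompt/messages input shape shown elsewhere in this document, which could explain the empty BadRequestError. A minimal sketch of a request that at least matches that shape, using Python instead of curl (the prompt is just an example):

import requests

payload = {"input": {"prompt": "Hello, how are you?"}}

resp = requests.post(
    "http://localhost:8000/runsync",
    json=payload,
    headers={"content-type": "application/json"},
    timeout=120,
)
print(resp.json())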

Here's the full output from handler.py:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-12 23:19:33 llm_engine.py:87] Initializing an LLM engine with config: model='/models/stablelm-3b-4e1t', tokenizer='/models/stablelm-3b-4e1t', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir='/models/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-12 23:19:35 weight_utils.py:257] Loading safetensors took 1.01s
INFO 04-12 23:19:37 llm_engine.py:357] # GPU blocks: 1111, # CPU blocks: 819
WARNING 04-12 23:19:37 cache_engine.py:103] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 04-12 23:19:37 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-12 23:19:37 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-12 23:19:43 model_runner.py:756] Graph capturing finished in 7 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 04-12 23:19:44 serving_chat.py:306] No chat template provided. Chat API will not work.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--- Starting Serverless Worker |  Version 1.6.2 ---
INFO   | Starting API server.
DEBUG  | Not deployed on RunPod serverless, pings will not be sent.
INFO:     Started server process [252]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
DEBUG  | test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a | Using Async Generator
DEBUG  | test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a | Async Generator output: {'error': {'object': 'error', 'message': '', 'type': 'BadRequestError', 'param': None, 'code': 400}}
INFO   | test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a | Finished running generator.

Errors cause the instance to run indefinitely

Any errors caused by the payload cause the instance to hang in an error state indefinitely. You have to manually terminate the instance or you'll rack up a hefty bill should you have several running that have an error.

enforce_eager flag

2024-02-02 21:44:03.976
[b2k2ml81pl56tl]
[info]
engine.py :190 2024-02-03 03:44:03,976 Error initializing vLLM engine: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 44.35 GiB of which 53.38 MiB is free. Process 3663367 has 44.28 GiB memory in use. Of the allocated memory 37.38 GiB is allocated by PyTorch, and 275.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-02-02 21:43:49.554
[b2k2ml81pl56tl]
[info]
INFO 02-03 03:43:49 model_runner.py:629] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-02-02 21:43:49.554
[b2k2ml81pl56tl]
[info]
INFO 02-03 03:43:49 model_runner.py:625] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-02-02 21:43:48.339
[b2k2ml81pl56tl]
[info]
INFO 02-03 03:43:48 llm_engine.py:316] # GPU blocks: 7754, # CPU blocks: 2048
2024-02-02 21:43:24.749
[b2k2ml81pl56tl]
[info]
INFO 02-03 03:43:17 weight_utils.py:164] Using model weights format ['*.safetensors']

I get this error when trying to run GPTQ models. I've built the worker myself with enforce_eager set to True and it works. Could this be exposed as an environment variable? Or could there be something else preventing this model from using the proper amount of VRAM?
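
For anyone patching this locally in the meantime, enforce_eager is a standard vLLM engine argument rather than something specific to this worker. A minimal sketch of where it is set when constructing the engine directly (illustrative wiring with a placeholder model name, not this worker's actual code, and argument names can vary between vLLM versions):

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="TheBloke/some-GPTQ-model",  # placeholder model name
    quantization="gptq",
    enforce_eager=True,                # skip CUDA graph capture and its extra 1-3 GiB of VRAM
)
engine = AsyncLLMEngine.from_engine_args(engine_args)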

Using mistral 0.3

Hi,

when launching it with "mistralai/Mistral-7B-Instruct-v0.3" I get the following error
KeyError: 'layers.0.attention.wk.weight'

Do you know how to fix it?

Thank you

Do the new images work?

I based my custom worker on the vllm-base image base-0.3.0-cuda12.1.0, but if I try to run it with multiple GPUs I get this error:

"ImportError: NCCLBackend is not available. Please install cupy."

It works fine if I'm only using one GPU. I saw this comment in the sls-worker repo's Dockerfile:

# We used base cuda image because pytorch installs its own cuda libraries.
# However cupy depends on cuda libraries so we had to switch to the runtime image
# In the future it would be nice to get a container with pytorch and cuda without duplicating cuda
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS vllm-base

Was this Dockerfile used to create the base image?

I can see cupy listed in the requirements.txt too.
