embeddedllm / vllm-rocm

This project is forked from vllm-project/vllm.

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://vllm.readthedocs.io

License: Apache License 2.0

Shell 0.51% C++ 3.94% Python 80.26% C 0.11% Cuda 14.12% Dockerfile 0.18% Jinja 0.07% CMake 0.81%
amdgpu gpt inference llm llm-inference model-serving pytorch rocm transformer

vllm-rocm's Introduction

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


vLLM GPT-4V API Alpha Branch

Welcome to the vLLM GPT-4V API Alpha Branch, an experimental branch designed to enhance support for Vision Language Models (VLMs) and introduce image input support in the OpenAI Chat Completions API. It has been rigorously tested for compatibility with both AMD and NVIDIA GPUs.

Purpose of This Branch

This branch serves as a testing ground for our team and the broader community to evaluate the current support of VLMs, ensuring they meet all production demands, particularly on AMD GPUs. Your feedback and testing will help us refine and optimize the integration of VLMs.

Key Features

  • Enhanced VLM Support
  • AutoImageProcessor Integration: Utilizes HuggingFace's AutoImageProcessor for image pre-processing, configurable via config.json.
  • Pre-built Templates: Includes a ready-to-use chat template for models like llava-hf/llava-1.5-7b-hf, simplifying setup.
  • Image Input Support in OpenAI API (a request sketch follows this list)
  • Model Compatibility: Works with all LlavaForConditionalGeneration models, especially llava-hf/llava-1.5-7b-hf.
  • Flexible Image Sources: Accepts image data and web URLs.
  • Comprehensive Format Support: Supports PNG, JPEG, WEBP, GIF, BMP, and TIFF formats.
  • Preprocessing: Images are automatically adjusted to the required dimensions and format as per model specifications.
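
Below is a minimal client-side sketch of sending an image to the OpenAI-compatible server exposed by this branch, assuming the GPT-4V-style "content parts" request format that the branch name implies; the server address, API key, image URL, and token limit are placeholders rather than values taken from this repository.

# Hypothetical request against a locally running server serving
# llava-hf/llava-1.5-7b-hf; adjust base_url and the image URL as needed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM server address
    api_key="EMPTY",                      # the vLLM server does not check the key by default
)

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)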

Documentation

New VLM Usage Guide: Detailed documentation available under the "Models" section to help you get started with VLMs.

System Requirements

Compatible with AMD and NVIDIA GPUs.

Related Contributions

This PR directly resolves issues #2058 and #3873. It also supersedes PR #3467. PR #3042 added vision language model support.

Upstream Contributions

This fork is actively maintained, and we are committed to upstreaming these enhancements in PR #3978. Stay tuned for more updates!

For any issues or contributions, please refer to the issues section or submit a pull request.

Latest News 🔥

  • [2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
  • [2024/01] We hosted the second vLLM meetup in SF! Please find the meetup slides here.
  • [2024/01] Added ROCm 6.0 support to vLLM.
  • [2023/12] Added ROCm 5.7 support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache (a loading sketch follows this list)
  • Optimized CUDA kernels
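
As a quick illustration of the quantization support listed above, the following hedged sketch loads a pre-quantized AWQ checkpoint; the model name is only an example, and the quantization argument must match how the weights were produced.

# Minimal sketch: run a pre-quantized AWQ checkpoint with vLLM.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)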

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (a short sketch follows this list)
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs and AMD GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-lora support
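
The decoding options above are selected per request through SamplingParams. The sketch below contrasts parallel sampling with beam search; the field names mirror the SamplingParams values visible in the logs further down this page, and the model name is a small placeholder.

# Minimal sketch: parallel sampling vs. beam search via SamplingParams.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Parallel sampling: return 4 independently sampled completions per prompt.
parallel = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=32)

# Beam search: best_of acts as the beam width; sampling must be disabled (temperature=0).
beam = SamplingParams(n=1, best_of=4, use_beam_search=True, temperature=0.0, max_tokens=32)

for params in (parallel, beam):
    outputs = llm.generate(["The capital of France is"], params)
    for completion in outputs[0].outputs:
        print(completion.text)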

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Command-R (CohereForAI/c4ai-command-r-v01, etc.)
  • DBRX (databricks/dbrx-base, databricks/dbrx-instruct etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • Gemma (google/gemma-2b, google/gemma-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
  • Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • LLaVA-1.5 (llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, etc.)
  • MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OLMo (allenai/OLMo-1B, allenai/OLMo-7B, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Qwen2 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
  • Qwen2MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
  • StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
  • Starcoder2 (bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
  • Xverse (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.
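
For a first run, the standard offline-inference quickstart looks roughly like the sketch below; the model name is a small placeholder.

# Minimal offline batched-inference sketch, following the usual vLLM quickstart.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> Generated: {output.outputs[0].text!r}")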

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

vllm-rocm's People

Contributors

woosukkwon, zhuohan123, simon-mo, yard1, youkaichao, esmeetu, njhill, pcmoritz, cadedaniel, rkooo567, beginlner, ywang96, liuxiaoxuanpku, hongxiayang, ronensc, hermitsun, chenxu2048, allendou, robertgshaw2-neuralmagic, hmellor, zspo, zhaoyang-star, sighingnow, gesanqiu, sanster, mspronesti, wrran, hanzhi713, mgoin, twaka

vllm-rocm's Issues

benchmark-latency test bug???

env:
embeddedllminfo/vllm-rocm:vllm-v0.2.1.post1
paths:
/app/vllm-rocm/benchmarks
scripts:
python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100

screen results:
root@a7:/app/vllm-rocm/benchmarks# python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100
Namespace(model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=32, batch_size=1, n=1, use_beam_search=False, num_iters=100, trust_remote_code=False, dtype='auto')
INFO 11-14 04:33:49 llm_engine.py:72] Initializing an LLM engine with config: model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 11-14 04:33:49 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
WARNING[XFORMERS]: Need to compile C++ extensions to use all xFormers features.
Please install xformers properly (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
INFO 11-14 04:33:58 llm_engine.py:207] # GPU blocks: 5756, # CPU blocks: 512
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], ignore_eos=True, max_tokens=32, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|██████████| 100/100 [00:07<00:00, 12.71it/s]
Avg latency: 0.07859424050606321 seconds

The latency seems abnormally low. I tried modifying the output length but got nearly identical results. I wonder if only one token is being generated?

[Feature]: vllm 0.4.1 in ROCm

🚀 The feature, motivation and pitch

Hello, I am using the vllm 0.2.6 image. But when I tried to install a newer version of vllm myself, such as 0.4.1, it failed (I was using an MI250X). Do you have any plans to update the images on Docker Hub?

Alternatives

No response

Additional context

No response

Compatible GPU architectures

Hi, awesome work!
I have a question about supported GPU architectures, and I couldn't find anything about it in the repo.
All your tests seem to be done on the MI210, which is a CDNA2 card.
Does your vLLM ROCm port also work on different architectures, like RDNA 3 and RDNA 2, which are now supported by ROCm 5.7?

AssertionError assert output == other_output

Hello,

I'm using the base Docker image rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1. It works fine with TP=1 or when the number of prompts is small, but when I use 2 GPUs I get this error:

  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 157, in generate
    return self._run_engine(use_tqdm)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 177, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 562, in step
    output = self._run_workers(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 712, in _run_workers
    assert output == other_output
AssertionError

I saw the same problem was fixed in the original repo (vllm-project#1389); will it be fixed here?

Unable to load models on RX 6800

On my RX 6800 I get RuntimeError: FlashAttention only supports AMD MI200 GPUs or newer. for some reason. I Googled that GPU and it seems to be RDNA2 like mine, but for enterprise. Is this not supported on consumer AMD cards? Will there be support for loading models without FlashAttention? I am using the Docker image on NixOS.

Merging with vLLM main branch

Hi EmbeddedLLM team,

We are the maintainers of the vLLM project. We just found this project and it's very exciting! Are you interested in contributing the fork to the main branch to add official ROCm support to vLLM? Feel free to reach out to me at zhuohan[at]berkeley.edu; I am happy to help in any way.

Thanks,
Zhuohan

vLLM >= 0.2.4 with ROCm 5.6.1

Hi,

I'm working in a cluster environment that has ROCm 5.6.1. I have successfully pulled and used your vLLM 0.2.3 Docker image; thanks a lot for that! I'm aware that vLLM >= 0.2.4 is meant to work with ROCm 5.7, but is it possible to use ROCm 5.6.1 with those vLLM versions? Have you tried that? Thank you!

Model architectures ['MixtralForCausalLM'] are not supported for now

code snippet:

from vllm import LLM, SamplingParams
from time import time
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.

llm = LLM(model="DiscoResearch/DiscoLM-mixtral-8x7b-v2", tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
t = time()
outputs = llm.generate(prompts, sampling_params)
print(f"Finish all  prompts in total {time()-t} s")
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The latest version of transformers (4.36.0) is installed, with CUDA Version 12.2 and 4x NVIDIA A100-SXM4-40GB GPUs, but I am getting the following error. Could you please help?

error:

2023-12-11 18:23:58,175 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 10.0.1.105:6379...
2023-12-11 18:23:58,182 INFO worker.py:1673 -- Connected to Ray cluster.
INFO 12-11 18:23:58 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
Traceback (most recent call last):
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mix.py", line 15, in
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/entrypoints/llm.py", line 93, in init
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
engine = cls(*engine_configs,
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 107, in init
self._init_workers_ray(placement_group)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 194, in _init_workers_ray
self._run_workers(
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
self._run_workers_in_batch(workers, method, *args, **kwargs))
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 727, in _run_workers_in_batch
all_outputs = ray.get(all_outputs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=3668385, ip=10.0.1.105, actor_id=e107b40e2809d80330a4550406000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x15244194f0d0>)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/ray_utils.py", line 31, in execute_method
return executor(*args, **kwargs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/worker/worker.py", line 72, in load_model
self.model_runner.load_model()
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/worker/model_runner.py", line 36, in load_model
self.model = get_model(self.model_config)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/model_executor/model_loader.py", line 62, in get_model
model_class = _get_model_architecture(model_config.hf_config)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/model_executor/model_loader.py", line 56, in _get_model_architecture
raise ValueError(
ValueError: Model architectures ['MixtralForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'PhiForCausalLM', 'QWenLMHeadModel', 'RWForCausalLM', 'YiForCausalLM']

Conflicting versions of PyTorch on ROCm

vllm-rocm depends on flash_attention and relies on PyTorch for ROCm 5.7, while flash_attention depends on PyTorch for ROCm 5.4. How should I proceed to ensure vllm runs smoothly? The AMD ROCm support in flash_attention isn't very clear; it only mentions how flash_attention can be run inside Docker. Could you provide a tutorial for installing a version of PyTorch that is compatible with both vllm and flash_attention? I have run into many problems because of these conflicting PyTorch-on-ROCm versions.

[Installation]: Is Branch v0.4.0.post1-rocm available for rocm-5.7?

Your current environment

[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==2.2.0
[pip3] torch==2.2.2+rocm5.7
[pip3] torchaudio==2.2.2+rocm5.7
[pip3] torchvision==0.17.2+rocm5.7
[conda] mkl                       2024.1.0                 pypi_0    pypi
[conda] mkl-include               2024.1.0                 pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch-triton-rocm       2.2.0                    pypi_0    pypi
[conda] torch                     2.2.2+rocm5.7            pypi_0    pypi
[conda] torchaudio                2.2.2+rocm5.7            pypi_0    pypi
[conda] torchvision               0.17.2+rocm5.7           pypi_0    pypi
ROCm Version: 5.7.31921-d1770ee1b
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

How you are installing vllm

python setup.py install

vllm 0.1.4 with ROCm 5.6

Hi,

Motivated by your great blog post, I attempted to build vllm 0.1.4 with ROCm 5.6 on MI250x. I'm getting:

In file included from /users/kazakose/vllm-rocm/hipsrc/attention/attention_kernels_hip.hip:23:
In file included from /users/kazakose/vllm-rocm/hipsrc/attention/attention_dtypes.h:6:
In file included from /users/kazakose/vllm-rocm/hipsrc/attention/dtype_bfloat16.hip:26:
/opt/rocm-5.6.1/include/hip/hip_bf16.h:30:10: fatal error: 'hip/amd_detail/amd_hip_bf16.h' file not found
#include <hip/amd_detail/amd_hip_bf16.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx90a.

Did you face this issue? Is this file supposed to exist in ROCm 5.6? Thank you!

Roadmap

  1. Port vllm/main features to ROCm
  • Support Llama/Llama-2 models for v0.2.x
  • Support SqueezeLLM
  • Support YARN
  • Merge into upstream vllm (vllm-project#1836)
  • Look into supporting multi-LoRA on ROCm (vllm-project#1804)
  2. Benchmark
  • Long-input-long-output benchmarking.
