embeddedllm / vllm-rocm

This project is forked from vllm-project/vllm.

vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://vllm.readthedocs.io

License: Apache License 2.0

Shell 0.51% C++ 3.94% Python 80.26% C 0.11% Cuda 14.12% Dockerfile 0.18% Jinja 0.07% CMake 0.81%
amdgpu gpt inference llm llm-inference model-serving pytorch rocm transformer

vllm-rocm's Introduction

vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |


vLLM GPT-4V API Alpha Branch

Welcome to the vLLM GPT-4V API Alpha Branch, an experimental branch designed to enhance support for Vision Language Models (VLMs) and introduce image input support in the OpenAI Chat Completions API. It has been rigorously tested for compatibility with both AMD and NVIDIA GPUs.

Purpose of This Branch

This branch serves as a testing ground for our team and the broader community to evaluate the current support of VLMs, ensuring they meet all production demands, particularly on AMD GPUs. Your feedback and testing will help us refine and optimize the integration of VLMs.

Key Features

  • Enhanced VLM Support
  • AutoImageProcessor Integration: Utilizes HuggingFace's AutoImageProcessor for image pre-processing, configurable via config.json.
  • Pre-built Templates: Includes a ready-to-use chat template for models like llava-hf/llava-1.5-7b-hf, simplifying setup.
  • Image Input Support in OpenAI API (a request sketch follows this list)
  • Model Compatibility: Works with all LlavaForConditionalGeneration models, especially llava-hf/llava-1.5-7b-hf.
  • Flexible Image Sources: Accepts image data and web URLs.
  • Comprehensive Format Support: Supports PNG, JPEG, WEBP, GIF, BMP, and TIFF formats.
  • Preprocessing: Images are automatically adjusted to the required dimensions and format as per model specifications.
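
Below is a minimal client-side sketch of sending an image to the OpenAI-compatible server exposed by this branch, assuming the GPT-4V-style "content parts" request format that the branch name implies; the server address, API key, image URL, and token limit are placeholders rather than values taken from this repository.

# Hypothetical request against a locally running server serving
# llava-hf/llava-1.5-7b-hf; adjust base_url and the image URL as needed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM server address
    api_key="EMPTY",                      # the vLLM server does not check the key by default
)

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)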

Documentation

New VLM Usage Guide: Detailed documentation available under the "Models" section to help you get started with VLMs.

System Requirements

Compatible with AMD and NVIDIA GPUs.

Related Contributions

This PR directly resolves issues #2058 and #3873. It also supersedes PR #3467. PR #3042 added vision language model support.

Upstream Contributions

This fork is actively maintained, and we are committed to upstreaming these enhancements in PR #3978. Stay tuned for more updates!

For any issues or contributions, please refer to the issues section or submit a pull request.

Latest News 🔥

  • [2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
  • [2024/01] We hosted the second vLLM meetup in SF! Please find the meetup slides here.
  • [2024/01] Added ROCm 6.0 support to vLLM.
  • [2023/12] Added ROCm 5.7 support to vLLM.
  • [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
  • [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
  • [2023/09] We released our PagedAttention paper on arXiv!
  • [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
  • [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
  • [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
  • [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache (a loading sketch follows this list)
  • Optimized CUDA kernels
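
As a quick illustration of the quantization support listed above, the following hedged sketch loads a pre-quantized AWQ checkpoint; the model name is only an example, and the quantization argument must match how the weights were produced.

# Minimal sketch: run a pre-quantized AWQ checkpoint with vLLM.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)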

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more (a short sketch follows this list)
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs and AMD GPUs
  • (Experimental) Prefix caching support
  • (Experimental) Multi-lora support
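
The decoding options above are selected per request through SamplingParams. The sketch below contrasts parallel sampling with beam search; the field names mirror the SamplingParams values visible in the logs further down this page, and the model name is a small placeholder.

# Minimal sketch: parallel sampling vs. beam search via SamplingParams.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Parallel sampling: return 4 independently sampled completions per prompt.
parallel = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=32)

# Beam search: best_of acts as the beam width; sampling must be disabled (temperature=0).
beam = SamplingParams(n=1, best_of=4, use_beam_search=True, temperature=0.0, max_tokens=32)

for params in (parallel, beam):
    outputs = llm.generate(["The capital of France is"], params)
    for completion in outputs[0].outputs:
        print(completion.text)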

vLLM seamlessly supports many Hugging Face models, including the following architectures:

  • Aquila & Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
  • Baichuan & Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc.)
  • BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
  • ChatGLM (THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc.)
  • Command-R (CohereForAI/c4ai-command-r-v01, etc.)
  • DBRX (databricks/dbrx-base, databricks/dbrx-instruct etc.)
  • DeciLM (Deci/DeciLM-7B, Deci/DeciLM-7B-instruct, etc.)
  • Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
  • Gemma (google/gemma-2b, google/gemma-7b, etc.)
  • GPT-2 (gpt2, gpt2-xl, etc.)
  • GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
  • GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
  • GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
  • InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
  • InternLM2 (internlm/internlm2-7b, internlm/internlm2-chat-7b, etc.)
  • Jais (core42/jais-13b, core42/jais-13b-chat, core42/jais-30b-v3, core42/jais-30b-chat-v3, etc.)
  • LLaMA & LLaMA-2 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
  • LLaVA-1.5 (llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, etc.)
  • MiniCPM (openbmb/MiniCPM-2B-sft-bf16, openbmb/MiniCPM-2B-dpo-bf16, etc.)
  • Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
  • Mixtral (mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc.)
  • MPT (mosaicml/mpt-7b, mosaicml/mpt-30b, etc.)
  • OLMo (allenai/OLMo-1B, allenai/OLMo-7B, etc.)
  • OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
  • Orion (OrionStarAI/Orion-14B-Base, OrionStarAI/Orion-14B-Chat, etc.)
  • Phi (microsoft/phi-1_5, microsoft/phi-2, etc.)
  • Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc.)
  • Qwen2 (Qwen/Qwen1.5-7B, Qwen/Qwen1.5-7B-Chat, etc.)
  • Qwen2MoE (Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc.)
  • StableLM (stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc.)
  • Starcoder2 (bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc.)
  • Xverse (xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc.)
  • Yi (01-ai/Yi-6B, 01-ai/Yi-34B, etc.)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.
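
For a first run, the standard offline-inference quickstart looks roughly like the sketch below; the model name is a small placeholder.

# Minimal offline batched-inference sketch, following the usual vLLM quickstart.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> Generated: {output.outputs[0].text!r}")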

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

vllm-rocm's People

Contributors

woosukkwon, zhuohan123, simon-mo, yard1, youkaichao, esmeetu, njhill, pcmoritz, cadedaniel, rkooo567, beginlner, ywang96, liuxiaoxuanpku, hongxiayang, ronensc, hermitsun, chenxu2048, allendou, robertgshaw2-neuralmagic, hmellor, zspo, zhaoyang-star, sighingnow, gesanqiu, sanster, mspronesti, wrran, hanzhi713, mgoin, twaka

vllm-rocm's Issues

benchmark-latency test bug???

env:
embeddedllminfo/vllm-rocm:vllm-v0.2.1.post1
paths:
/app/vllm-rocm/benchmarks
scripts:
python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100

screen results:
root@a7:/app/vllm-rocm/benchmarks# python benchmark_latency.py --model /var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/ --input-len 512 --output-len 32 --batch-size 1 --n 1 --num-iters 100
Namespace(model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer=None, quantization=None, tensor_parallel_size=1, input_len=512, output_len=32, batch_size=1, n=1, use_beam_search=False, num_iters=100, trust_remote_code=False, dtype='auto')
INFO 11-14 04:33:49 llm_engine.py:72] Initializing an LLM engine with config: model='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer='/var/lib/jenkins/sa/AMD_MI210/llama2/hf_7B/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 11-14 04:33:49 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
WARNING[XFORMERS]: Need to compile C++ extensions to use all xFormers features.
Please install xformers properly (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
INFO 11-14 04:33:58 llm_engine.py:207] # GPU blocks: 5756, # CPU blocks: 512
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], ignore_eos=True, max_tokens=32, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True)
Warming up...
Profiling iterations: 100%|██████████| 100/100 [00:07<00:00, 12.71it/s]
Avg latency: 0.07859424050606321 seconds

The latency seems abnormally low. I tried modifying the output length but got nearly identical results. I wonder if only one token is being generated?

[Feature]: vllm 0.4.1 in ROCm

🚀 The feature, motivation and pitch

Hello, I am using the vllm 0.2.6 image. But when I tried to install a newer version of vllm myself, such as 0.4.1, it failed (I was using an MI250X). Do you have any plans to update the images on Docker Hub?

Alternatives

No response

Additional context

No response

Compatible GPU architectures

Hi, awesome work!
I have a question about supported GPU architectures, and I couldn't find anything about it in the repo.
All your tests seem to be done on the MI210, which is a CDNA2 card.
Does your vLLM ROCm port also work on different architectures, like RDNA 3 and RDNA 2, which are now supported by ROCm 5.7?

AssertionError assert output == other_output

Hello,

I'm using the base Docker image rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1. It works fine with TP=1 or when the number of prompts is small, but when I use 2 GPUs I get this error:

  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 157, in generate
    return self._run_engine(use_tqdm)
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 177, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 562, in step
    output = self._run_workers(
  File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm-0.2.1-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 712, in _run_workers
    assert output == other_output
AssertionError

I saw the same problem was fixed in the original repo (vllm-project#1389); will it be fixed here?

Unable to load models on RX 6800

On my RX 6800 I get RuntimeError: FlashAttention only supports AMD MI200 GPUs or newer. for some reason. I Googled that GPU and it seems to be RDNA2 like mine, but for enterprise. Is this not supported on consumer AMD cards? Will there be support for loading models without FlashAttention? I am using the Docker image on NixOS.

Merging with vLLM main branch

Hi EmbeddedLLM team,

We are the maintainers of the vLLM project. We just found this project and it's very exciting! Are you interested in contributing the fork to the main branch to add official ROCm support to vLLM? Feel free to reach out to me at zhuohan[at]berkeley.edu; I am happy to help in any way.

Thanks,
Zhuohan

vLLM >= 0.2.4 with ROCm 5.6.1

Hi,

I'm working in a cluster environment that has ROCm 5.6.1. I have successfully pulled and used your vLLM 0.2.3 Docker image; thanks a lot for that! I'm aware that vLLM >= 0.2.4 is meant to work with ROCm 5.7, but is it possible to use ROCm 5.6.1 with those vLLM versions? Have you tried that? Thank you!

Model architectures ['MixtralForCausalLM'] are not supported for now

code snippet:

from vllm import LLM, SamplingParams
from time import time
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.

llm = LLM(model="DiscoResearch/DiscoLM-mixtral-8x7b-v2", tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
t = time()
outputs = llm.generate(prompts, sampling_params)
print(f"Finish all  prompts in total {time()-t} s")
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The latest version of transformers (4.36.0) is installed, with CUDA Version 12.2 and 4x NVIDIA A100-SXM4-40GB GPUs, but I am getting the following error. Could you please help?

error:

2023-12-11 18:23:58,175 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 10.0.1.105:6379...
2023-12-11 18:23:58,182 INFO worker.py:1673 -- Connected to Ray cluster.
INFO 12-11 18:23:58 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
Traceback (most recent call last):
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mix.py", line 15, in
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/entrypoints/llm.py", line 93, in init
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
engine = cls(*engine_configs,
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 107, in init
self._init_workers_ray(placement_group)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 194, in _init_workers_ray
self._run_workers(
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
self._run_workers_in_batch(workers, method, *args, **kwargs))
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/llm_engine.py", line 727, in _run_workers_in_batch
all_outputs = ray.get(all_outputs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/ray/_private/worker.py", line 2563, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerVllm.execute_method() (pid=3668385, ip=10.0.1.105, actor_id=e107b40e2809d80330a4550406000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x15244194f0d0>)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/engine/ray_utils.py", line 31, in execute_method
return executor(*args, **kwargs)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/worker/worker.py", line 72, in load_model
self.model_runner.load_model()
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/worker/model_runner.py", line 36, in load_model
self.model = get_model(self.model_config)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/model_executor/model_loader.py", line 62, in get_model
model_class = _get_model_architecture(model_config.hf_config)
File "/hkfs/home/project/hk-project-test-socialgroups/st_ac141953/mistral/lib64/python3.9/site-packages/vllm/model_executor/model_loader.py", line 56, in _get_model_architecture
raise ValueError(
ValueError: Model architectures ['MixtralForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'FalconForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'OPTForCausalLM', 'PhiForCausalLM', 'QWenLMHeadModel', 'RWForCausalLM', 'YiForCausalLM']

Conflicting versions of PyTorch on ROCm

vllm-rocm depends on flash_attention and relies on PyTorch for ROCm 5.7, while flash_attention depends on PyTorch for ROCm 5.4. How should I proceed to ensure vllm runs smoothly? The AMD ROCm support in flash_attention isn't very clear; it only mentions how flash_attention can be run inside Docker. Could you provide a tutorial for installing a version of PyTorch that is compatible with both vllm and flash_attention? I have run into many problems because of these conflicting PyTorch-on-ROCm versions.

[Installation]: Is Branch v0.4.0.post1-rocm available for rocm-5.7?

Your current environment

[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==2.2.0
[pip3] torch==2.2.2+rocm5.7
[pip3] torchaudio==2.2.2+rocm5.7
[pip3] torchvision==0.17.2+rocm5.7
[conda] mkl                       2024.1.0                 pypi_0    pypi
[conda] mkl-include               2024.1.0                 pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch-triton-rocm       2.2.0                    pypi_0    pypi
[conda] torch                     2.2.2+rocm5.7            pypi_0    pypi
[conda] torchaudio                2.2.2+rocm5.7            pypi_0    pypi
[conda] torchvision               0.17.2+rocm5.7           pypi_0    pypi
ROCm Version: 5.7.31921-d1770ee1b
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

How you are installing vllm

python setup.py install

vllm 0.1.4 with ROCm 5.6

Hi,

Motivated by your great blog post, I attempted to build vllm 0.1.4 with ROCm 5.6 on MI250x. I'm getting:

In file included from /users/kazakose/vllm-rocm/hipsrc/attention/attention_kernels_hip.hip:23:
In file included from /users/kazakose/vllm-rocm/hipsrc/attention/attention_dtypes.h:6:
In file included from /users/kazakose/vllm-rocm/hipsrc/attention/dtype_bfloat16.hip:26:
/opt/rocm-5.6.1/include/hip/hip_bf16.h:30:10: fatal error: 'hip/amd_detail/amd_hip_bf16.h' file not found
#include <hip/amd_detail/amd_hip_bf16.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx90a.

Did you face this issue? Is this file supposed to exist in ROCm 5.6? Thank you!

Roadmap

  1. Port vllm/main features to ROCm
  • Support Llama/Llama-2 models for v0.2.x
  • Support SqueezeLLM
  • Support YARN
  • Merge into upstream vllm (vllm-project#1836)
  • Look into supporting multi-LoRA on ROCm (vllm-project#1804)
  2. Benchmark
  • Long-input-long-output benchmarking.
