punica-ai / punica


Serving multiple LoRA finetuned LLM as one

Home Page: https://arxiv.org/abs/2310.18547

License: Apache License 2.0

Python 58.46% CMake 0.84% C++ 14.74% Cuda 24.78% Shell 1.18%

large-language-models llm lora

punica's Introduction

Punica: Serving multiple LoRA finetuned LLM as one

(paper)

Demo

punica-tui-demo-vp9.webm

python examples/tui-multi-lora.py

Overview

Low-rank adaptation (LoRA) is a parameter-efficient way to add new knowledge to a pretrained LLM. Although the pretrained LLM takes hundreds of GB of storage, a LoRA finetuned model adds only 1% storage and memory overhead. Punica enables running multiple LoRA finetuned models at the cost of running one.

How?

Assuming W of shape [H1, H2] is the weight of the pretrained model, LoRA adds two small matrices: A of shape [H1, r] and B of shape [r, H2]. Running an input x on the finetuned model would be y := x @ (W + A@B), which is the same as y := x@W + x@A@B.
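The identity above can be checked numerically. The following is a minimal NumPy sketch, not Punica code; the sizes H1, H2, r and the random data are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
H1, H2, r = 64, 48, 4

W = rng.standard_normal((H1, H2))  # pretrained weight
A = rng.standard_normal((H1, r))   # LoRA "shrink" matrix
B = rng.standard_normal((r, H2))   # LoRA "expand" matrix
x = rng.standard_normal((1, H1))   # one input row

# Merged form: materialize the finetuned weight, then multiply.
y_merged = x @ (W + A @ B)

# Split form: base model output plus a cheap low-rank correction.
y_split = x @ W + (x @ A) @ B

assert np.allclose(y_merged, y_split)

# Extra parameters introduced by LoRA, relative to W.
extra = A.size + B.size
ratio = extra / W.size
```

The split form never builds the [H1, H2] matrix A@B, which is why the same W can be shared by many adapters.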

When there are n LoRA models, there will be A1, B1, A2, B2, ..., An, Bn. Given an input batch X := (x1,x2,...,xn) in which each xi maps to its own LoRA model, the output is Y := X@W + (x1@A1@B1, x2@A2@B2, ..., xn@An@Bn). The first term, X@W, runs the input batch through the pretrained model. It is quite efficient: thanks to the strong batching effect, the latency is almost the same as for a single input.
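As a reference for the batched formula, here is a small NumPy sketch that computes the shared term once and the per-request LoRA term in a plain loop. It only illustrates the arithmetic; it is not how Punica computes the addon, and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n requests, each with its own adapter.
n, H1, H2, r = 4, 16, 12, 2

W = rng.standard_normal((H1, H2))     # shared pretrained weight
A = rng.standard_normal((n, H1, r))   # A[i] belongs to adapter i
B = rng.standard_normal((n, r, H2))   # B[i] belongs to adapter i
X = rng.standard_normal((n, H1))      # row i is routed to adapter i

# First term: one batched matmul shared by every request.
base = X @ W

# Second term: per-request LoRA addon (naive loop, for clarity only).
addon = np.stack([X[i] @ A[i] @ B[i] for i in range(n)])

Y = base + addon

# Cross-check against running each finetuned model separately.
Y_ref = np.stack([X[i] @ (W + A[i] @ B[i]) for i in range(n)])
assert np.allclose(Y, Y_ref)
```

The loop over adapters is the part that does not batch well when written naively, since each iteration is a tiny matrix product.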

We figured out an efficient way to compute the second term (the LoRA addon). We encapsulate this operation in a CUDA kernel, called Segmented Gather Matrix-Vector multiplication (SGMV), as illustrated below.

In the following microbenchmark figure, we can observe the strong batching effect of the pretrained model. A naive implementation of LoRA is slow, as depicted by the orange line. LoRA implemented via SGMV is efficient and preserves the strong batching effect.

The following figure compares the text generation throughput of Punica with other systems, including HuggingFace Transformers, DeepSpeed, FasterTransformer, and vLLM. The benchmark considers different settings of LoRA model popularity. Distinct means that each request is for a different LoRA model. Identical means that all requests are for the same LoRA model. Uniform and Skewed are in between. Punica achieves 12x the throughput of state-of-the-art systems.

Read our paper to understand more: Punica: Multi-Tenant LoRA Serving.

Installation

You can install Punica from a binary package or build it from source.

Install from binary package

Pros: No need to compile. Fast to install.
Cons: Might not match your CUDA version, CUDA architecture, PyTorch version, or Python version.
Current precompiled versions:
- CUDA: 11.8, 12.1
- Python: 3.10, 3.11
- TORCH_CUDA_ARCH_LIST: 8.0 8.6 8.9+PTX

pip install ninja torch
pip install punica -i https://punica-ai.github.io/whl/cu121/ --extra-index-url https://pypi.org/simple
# Note: Change cu121 to your CUDA version.
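To pick the index URL for the note above, a small helper like the following could map a CUDA version string (for example the value of `torch.version.cuda`) to the wheel index. This is only a sketch: `wheel_index_url` is a hypothetical name, and the `cu118` URL is an assumption that the 11.8 wheels follow the same pattern as the `cu121` URL shown above.

```python
# URL pattern taken from the install command above; the tag is filled in.
PUNICA_WHEEL_INDEX = "https://punica-ai.github.io/whl/{tag}/"

# CUDA versions listed under "Current precompiled versions".
PREBUILT_CUDA = {"11.8", "12.1"}


def wheel_index_url(cuda_version: str) -> str:
    """Return the wheel index URL for a CUDA version such as "12.1"."""
    major_minor = ".".join(cuda_version.split(".")[:2])
    if major_minor not in PREBUILT_CUDA:
        # No precompiled wheel is listed; building from source is the fallback.
        raise ValueError(f"no precompiled wheel listed for CUDA {major_minor}")
    tag = "cu" + major_minor.replace(".", "")
    return PUNICA_WHEEL_INDEX.format(tag=tag)


print(wheel_index_url("12.1"))  # https://punica-ai.github.io/whl/cu121/
```

The returned URL would be passed to `pip install punica -i <url> --extra-index-url https://pypi.org/simple`.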

Build from source

# Please install torch before punica
pip install ninja numpy torch

# Clone punica
git clone https://github.com/punica-ai/punica.git
cd punica
git submodule sync
git submodule update --init

# If you encounter problems during compilation, set TORCH_CUDA_ARCH_LIST to your CUDA architecture.
# export TORCH_CUDA_ARCH_LIST="8.0"

# Build and install punica
pip install -v --no-build-isolation .

Examples

Serving multiple LoRA models

See the demo above.

Finetune & convert to Punica format & serve with Punica

See examples/finetune/

Benchmark text generation

python -m benchmarks.bench_textgen_lora --system punica --batch-size 32

Citation

@misc{punica,
    title={Punica: Multi-Tenant LoRA Serving},
    author={Lequn Chen and Zihao Ye and Yongji Wu and Danyang Zhuo and Luis Ceze and Arvind Krishnamurthy},
    year={2023},
    eprint={2310.18547},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}


punica's Issues

sgmv_cutlass calculate wrong output

I ran the following code and got incorrect output. I initialize x and w to all ones, so every element of the output y should equal h1 = 4096.

It does not: half of the output is 4096 and the other half is 2528.
The wrong result appears when h2 >= 32 for the shrink case.

The following code is adapted from benchmarks/bench_sgmv_cutlass.py

import torch
import punica.ops

bs = 4
h1 = 4096
h2 = 32
num_layers = 1
dtype = torch.float16
device = torch.device("cuda:0")
problem_sizes = [2, 2]

w = [
    torch.ones((num_layers, h1, h2), dtype=dtype, device=device)
    for _ in range(len(problem_sizes))
]
w_ptr = torch.tensor([t.data_ptr() for t in w],
                     dtype=torch.int64,
                     device=device)
s = torch.cumsum(
    torch.tensor([0] + problem_sizes, device=device),
    dim=0,
    dtype=torch.int32)
x = torch.ones((s[-1], h1), dtype=dtype, device=device)
y = torch.zeros((s[-1], h2), dtype=dtype, device=device)
punica.ops.sgmv_cutlass(y, x, w_ptr, s, layer_idx=0)

print(y)
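For comparison, the expected result can be computed on the CPU. The sketch below is a NumPy reference under the assumption that the operation is meant to compute y[s[i]:s[i+1]] += x[s[i]:s[i+1]] @ w[i] for each segment i; that reading is inferred from the README formula and the expectation stated above, not from the kernel source. It uses float32 so the sum is exact.

```python
import numpy as np

h1, h2 = 4096, 32
problem_sizes = [2, 2]

# Segment offsets, same construction as `s` in the repro: [0, 2, 4].
s = np.cumsum([0] + problem_sizes)

# One all-ones weight per segment (layer dimension dropped for brevity).
w = [np.ones((h1, h2), dtype=np.float32) for _ in problem_sizes]
x = np.ones((s[-1], h1), dtype=np.float32)
y = np.zeros((s[-1], h2), dtype=np.float32)

# Assumed semantics: rows of segment i are multiplied by weight i.
for i in range(len(problem_sizes)):
    lo, hi = s[i], s[i + 1]
    y[lo:hi] += x[lo:hi] @ w[i]

# Every output element is a sum of h1 ones.
assert np.all(y == h1)

# 4096 is exactly representable in float16, so rounding of the final
# value alone would not explain a result of 2528.
assert np.float16(h1) == h1
```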

pip install failed with cuda 12.2

I followed the "Build from source" instructions with this environment:

gcc --version
9.4.0

/usr/local/cuda/bin/nvcc --version
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

ninja --version
1.11.1.git.kitware.jobserver-1

and

import torch
print(torch.__version__)
2.2.1+cu121

~/punica master ❯ pip install -v --no-build-isolation .

Using pip 23.0.1 from /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/pip (python 3.10)
Processing /home/hayley/punica
  Running command Preparing metadata (pyproject.toml)
  /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
    warnings.warn(msg, _BetaConfiguration)
  running dist_info
  creating /tmp/pip-modern-metadata-_jux6hqk/punica.egg-info
  writing /tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/requires.txt
  writing top-level names to /tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/top_level.txt
  writing manifest file '/tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/SOURCES.txt'
  reading manifest file '/tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  no previously-included directories found matching 'benchmarks'
  no previously-included directories found matching '*/__pycache__'
  warning: no previously-included files matching '*.so' found anywhere in distribution
  adding license file 'LICENSE'
  writing manifest file '/tmp/pip-modern-metadata-_jux6hqk/punica.egg-info/SOURCES.txt'
  creating '/tmp/pip-modern-metadata-_jux6hqk/punica-1.1.0+c119.d20240308.591b598.dist-info'
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: torch in ./.venvs/venv310/lib/python3.10/site-packages (from punica==1.1.0+c119.d20240308.591b598) (2.2.1)
Collecting transformers
  Downloading transformers-4.38.2-py3-none-any.whl (8.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.5/8.5 MB 70.5 MB/s eta 0:00:00
Requirement already satisfied: numpy in ./.venvs/venv310/lib/python3.10/site-packages (from punica==1.1.0+c119.d20240308.591b598) (1.26.4)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (12.1.3.1)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (12.1.0.106)
Requirement already satisfied: networkx in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (3.2.1)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (12.1.105)
Requirement already satisfied: jinja2 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (3.1.3)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (12.1.105)
Requirement already satisfied: filelock in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (3.13.1)
Requirement already satisfied: sympy in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (1.12)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (8.9.2.26)
Requirement already satisfied: triton==2.2.0 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (2.2.0)
Requirement already satisfied: nvidia-nccl-cu12==2.19.3 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (2.19.3)
Requirement already satisfied: typing-extensions>=4.8.0 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (4.10.0)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (10.3.2.106)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (11.0.2.54)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (11.4.5.107)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (12.1.105)
Requirement already satisfied: fsspec in ./.venvs/venv310/lib/python3.10/site-packages (from torch->punica==1.1.0+c119.d20240308.591b598) (2024.2.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in ./.venvs/venv310/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch->punica==1.1.0+c119.d20240308.591b598) (12.4.99)
Collecting tokenizers<0.19,>=0.14
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 46.7 MB/s eta 0:00:00
Collecting tqdm>=4.27
  Using cached tqdm-4.66.2-py3-none-any.whl (78 kB)
Collecting requests
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting safetensors>=0.4.1
  Downloading safetensors-0.4.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 26.3 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (705 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 705.5/705.5 kB 16.2 MB/s eta 0:00:00
Collecting regex!=2019.12.17
  Downloading regex-2023.12.25-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 774.0/774.0 kB 18.2 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.19.3
  Downloading huggingface_hub-0.21.4-py3-none-any.whl (346 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 346.4/346.4 kB 11.4 MB/s eta 0:00:00
Collecting packaging>=20.0
  Using cached packaging-23.2-py3-none-any.whl (53 kB)
Requirement already satisfied: MarkupSafe>=2.0 in ./.venvs/venv310/lib/python3.10/site-packages (from jinja2->torch->punica==1.1.0+c119.d20240308.591b598) (2.1.5)
Collecting certifi>=2017.4.17
  Using cached certifi-2024.2.2-py3-none-any.whl (163 kB)
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 142.1/142.1 kB 4.4 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Using cached urllib3-2.2.1-py3-none-any.whl (121 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.6-py3-none-any.whl (61 kB)
Requirement already satisfied: mpmath>=0.19 in ./.venvs/venv310/lib/python3.10/site-packages (from sympy->torch->punica==1.1.0+c119.d20240308.591b598) (1.3.0)
Building wheels for collected packages: punica
  Running command Building wheel for punica (pyproject.toml)
  /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
    warnings.warn(msg, _BetaConfiguration)
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-cpython-310
  creating build/lib.linux-x86_64-cpython-310/punica
  copying src/punica/__init__.py -> build/lib.linux-x86_64-cpython-310/punica
  copying src/punica/_build_meta.py -> build/lib.linux-x86_64-cpython-310/punica
  creating build/lib.linux-x86_64-cpython-310/punica/models
  copying src/punica/models/__init__.py -> build/lib.linux-x86_64-cpython-310/punica/models
  copying src/punica/models/llama.py -> build/lib.linux-x86_64-cpython-310/punica/models
  copying src/punica/models/llama_lora.py -> build/lib.linux-x86_64-cpython-310/punica/models
  creating build/lib.linux-x86_64-cpython-310/punica/ops
  copying src/punica/ops/__init__.py -> build/lib.linux-x86_64-cpython-310/punica/ops
  creating build/lib.linux-x86_64-cpython-310/punica/utils
  copying src/punica/utils/__init__.py -> build/lib.linux-x86_64-cpython-310/punica/utils
  copying src/punica/utils/cat_tensor.py -> build/lib.linux-x86_64-cpython-310/punica/utils
  copying src/punica/utils/convert_lora_weight.py -> build/lib.linux-x86_64-cpython-310/punica/utils
  copying src/punica/utils/kvcache.py -> build/lib.linux-x86_64-cpython-310/punica/utils
  copying src/punica/utils/lora.py -> build/lib.linux-x86_64-cpython-310/punica/utils
  running egg_info
  creating src/punica.egg-info
  writing src/punica.egg-info/PKG-INFO
  writing dependency_links to src/punica.egg-info/dependency_links.txt
  writing requirements to src/punica.egg-info/requires.txt
  writing top-level names to src/punica.egg-info/top_level.txt
  writing manifest file 'src/punica.egg-info/SOURCES.txt'
  reading manifest file 'src/punica.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  no previously-included directories found matching 'benchmarks'
  no previously-included directories found matching '*/__pycache__'
  warning: no previously-included files matching '*.so' found anywhere in distribution
  adding license file 'LICENSE'
  writing manifest file 'src/punica.egg-info/SOURCES.txt'
  running build_ext
  /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py:415: UserWarning: The detected CUDA version (12.2) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
    warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
  /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py:425: UserWarning: There are no x86_64-linux-gnu-g++ version bounds defined for CUDA version 12.2
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'punica.ops._kernels' extension
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/bgmv
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/rms_norm
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv
  creating /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv_flashinfer
  Emitting ninja build file /home/hayley/punica/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/rms_norm/rms_norm_cutlass.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/rms_norm/rms_norm_cutlass.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/rms_norm/rms_norm_cutlass.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [2/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/flashinfer_all.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/flashinfer_all.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/flashinfer_all.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [3/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g1_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g1_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g1_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [4/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g1_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g1_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g1_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [5/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g2_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g2_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g2_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [6/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g2_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g2_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g2_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [7/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g4_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g4_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g4_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [8/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g4_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g4_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g4_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [9/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g8_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g8_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g8_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [10/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g8_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_decode_p16_g8_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_decode_p16_g8_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [11/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g4_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g4_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g4_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [12/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g8_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g8_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g8_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [13/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g4_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g4_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g4_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [14/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g8_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g8_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g8_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [15/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g2_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g2_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g2_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [16/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g1_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g1_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g1_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [17/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g1_h128_bf16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g1_h128_bf16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g1_h128_bf16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [18/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g2_h128_fp16.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/flashinfer_adapter/generated/batch_prefill_p16_g2_h128_fp16.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/generated/batch_prefill_p16_g2_h128_fp16.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [19/22] c++ -MMD -MF /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/punica_ops.cc -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  FAILED: /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o
  c++ -MMD -MF /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/punica_ops.cc -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  In file included from /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/Device.h:4,
                   from /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include/torch/python.h:8,
                   from /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/extension.h:9,
                   from /home/hayley/punica/csrc/punica_ops.cc:4:
  /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/python_headers.h:12:10: fatal error: Python.h: No such file or directory
     12 | #include <Python.h>
        |          ^~~~~~~~~~
  compilation terminated.
  [20/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv_flashinfer/sgmv_all.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/sgmv_flashinfer/sgmv_all.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv_flashinfer/sgmv_all.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [21/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv/sgmv_cutlass.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/sgmv/sgmv_cutlass.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv/sgmv_cutlass.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [22/22] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/bgmv/bgmv_all.o.d -I/home/hayley/punica/third_party/cutlass/include -I/home/hayley/punica/third_party/flashinfer/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/TH -I/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/hayley/punica/.venvs/venv310/include -I/usr/include/python3.10 -c -c /home/hayley/punica/csrc/bgmv/bgmv_all.cu -o /home/hayley/punica/build/temp.linux-x86_64-cpython-310/csrc/bgmv/bgmv_all.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O3 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 -std=c++17
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
      subprocess.run(
    File "/usr/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
      return _build_backend().build_wheel(wheel_directory, config_settings,
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/build_meta.py", line 412, in build_wheel
      return self._build_with_temp_dir(['bdist_wheel'], '.whl',
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/build_meta.py", line 397, in _build_with_temp_dir
      self.run_setup()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/build_meta.py", line 335, in run_setup
      exec(code, locals())
    File "<string>", line 158, in <module>
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
      self.run_command(cmd)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
      super().run_command(command)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
      cmd_obj.run()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/wheel/bdist_wheel.py", line 368, in run
      self.run_command("build")
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
      self.distribution.run_command(command)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
      super().run_command(command)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
      cmd_obj.run()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
      self.run_command(cmd_name)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
      self.distribution.run_command(command)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
      super().run_command(command)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
      cmd_obj.run()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
      self.build_extensions()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 871, in build_extensions
      build_ext.build_extensions(self)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
      self._build_extensions_serial()
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
      self.build_extension(ext)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
      _build_ext.build_extension(self, ext)
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
      objects = self.compiler.compile(
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 684, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  error: subprocess-exited-with-error

  × Building wheel for punica (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /home/hayley/punica/.venvs/venv310/bin/python3.10 /home/hayley/punica/.venvs/venv310/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmpjlrn8pkd
  cwd: /home/hayley/punica
  Building wheel for punica (pyproject.toml) ... error
  ERROR: Failed building wheel for punica
Failed to build punica
ERROR: Could not build wheels for punica, which is required to install pyproject.toml-based projects

[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
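
The fatal error in the log above is `Python.h: No such file or directory`, even though `-I/usr/include/python3.10` is on the include path. That usually means the CPython development headers are not installed for the interpreter in use. On Debian/Ubuntu they typically come from a package such as `python3.10-dev`; that package name is an assumption about the reporter's system, not something the log states. A minimal sketch for checking this before rebuilding follows; `find_python_header` is a hypothetical helper and not part of Punica.

```python
import os
import sysconfig


def find_python_header(include_dirs=None):
    """Return the path to Python.h, or None if no directory contains it.

    By default this checks the include directories that the running
    interpreter reports via sysconfig.
    """
    if include_dirs is None:
        paths = sysconfig.get_paths()
        include_dirs = [paths.get("include"), paths.get("platinclude")]
    for directory in include_dirs:
        if not directory:
            continue
        candidate = os.path.join(directory, "Python.h")
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    header = find_python_header()
    if header is None:
        print("Python.h not found; install the CPython development headers")
    else:
        print("Found", header)
```

Run it with the same interpreter that runs `pip install`, so that the reported include directories match the ones passed to the compiler.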

Error installing package

When I installed the package, the following error occurred.

Reasons for switching to CUTLASS-based kernel instead of custom kernel

Hey folks, awesome and really impactful work with the repo and the paper.

I was wondering what was the reason for switching from the original bgmv kernel to a CUTLASS-based sgmv one. I understand that one advantage of sgmv is that it doesn't require the LoRA tensors to be in a single contiguous block of memory, but aside from that, are there any performance considerations that made you switch?

I can also see that there is a custom sgmv shrink kernel implementation but the expand version is WIP. Is that something you are planning to work on in the near future?

Furthermore, do the performance results in the paper concern the CUTLASS kernel or the custom kernel? From the description of the implementation I inferred the latter, but I was confused by the lack of the custom expand kernel in the repo.

Thanks, and great work!
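
For readers unfamiliar with the two kernels, the difference in calling convention described in this question can be sketched in plain Python. This is a reference for the semantics only, under the assumptions stated in the question (bgmv indexes into one stacked weight collection, sgmv takes per-segment weights that can live anywhere). The function names and argument layouts are hypothetical and do not mirror the actual CUDA kernel signatures.

```python
def matvec(x, w):
    """Multiply row vector x (length h_in) by matrix w (h_in rows, h_out columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def bgmv_reference(xs, stacked_w, indices):
    """Each input row selects its weight by index into one stacked collection."""
    return [matvec(x, stacked_w[i]) for x, i in zip(xs, indices)]


def sgmv_reference(xs, weights, seg):
    """Rows seg[k]:seg[k+1] all use weights[k]; weights need not be stacked."""
    out = []
    for k, w in enumerate(weights):
        for x in xs[seg[k]:seg[k + 1]]:
            out.append(matvec(x, w))
    return out


# Three requests: the first two use adapter 0, the third uses adapter 1.
xs = [[1, 2], [3, 4], [5, 6]]
w0 = [[1, 0], [0, 1]]
w1 = [[2, 0], [0, 2]]

by_index = bgmv_reference(xs, [w0, w1], indices=[0, 0, 1])
by_segment = sgmv_reference(xs, [w0, w1], seg=[0, 2, 3])
print(by_index == by_segment)
```

Both functions compute the same per-request products; they differ only in how the adapter for each row is located, which is the property the question asks about.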

Question about performance

Hi guys, @abcdabcd987 @yzh119
Thanks again for this great project.

I observed that the profiled prediction time with a LoRA adapter is about 60% longer than with the bare base model (without LoRA adapters).

Runtime info:

Llama-34B (Yi-34B)
LoRA.
- rank: 32
- lora_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - up_proj
  - gate_proj
  - down_proj

The following is the profiling info; each decoding task is composed of 5 decoding steps.

LoRA Inference

INFO:root:Time taken is 0.18477 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16798 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16362 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16338 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.16323 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.

Bare Model Inference

INFO:root:Time taken is 0.12258 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10321 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10180 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10160 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10191 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10187 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10196 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.
INFO:root:Time taken is 0.10186 seconds. 1 decoding tasks, 0 prefill tasks, 0 delayed prefill tasks.

It's about 60% slower when equipped with this LoRA adapter. I'm curious: is this expected? :)
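
The 60% figure can be checked directly from the timings logged above. The sketch below uses the median of each run so that the slower first iteration does not skew the result; choosing the median is an assumption of this example, not something the report specifies.

```python
from statistics import median

# Per-task times in seconds, copied from the two logs above.
lora_seconds = [0.18477, 0.16798, 0.16362, 0.16338, 0.16323]
bare_seconds = [
    0.12258, 0.10321, 0.10180, 0.10160,
    0.10191, 0.10187, 0.10196, 0.10186,
]


def relative_overhead(with_lora, without_lora):
    """Return the fractional slowdown of the LoRA run versus the bare run."""
    return median(with_lora) / median(without_lora) - 1.0


overhead = relative_overhead(lora_seconds, bare_seconds)
print(f"LoRA overhead: {overhead:.1%}")
```

The result lands at roughly 60%, which matches the reporter's estimate.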

Support for GPT-NEOX models

Hello! Thank you for this awesome work. I am testing Punica for serving my custom models, which use a GPT-NEOX model as the base model. Does Punica currently support models other than Llama?
If it doesn't require much modification, I would like to request a guide on adding new models.
Thank you! :)

Error when inferencing from the Punica format

Hello, I followed your tutorial on converting the fine-tuned weights to the Punica format [reference] using:

python -m punica.utils.convert_lora_weight model/table-peft/adapter_model.bin model/table-peft.punica.pt

Then, while running:

python punica/examples/textgen_lora.py --lora-weight model/table-peft.punica.pt --prompt "Tell me about yourself"

I am getting this error:

Traceback (most recent call last):
  File "/home/bibekyess/punica_sandbox/punica/examples/textgen_lora.py", line 180, in <module>
    main()
  File "/home/bibekyess/anaconda3/envs/punica-12.1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/bibekyess/punica_sandbox/punica/examples/textgen_lora.py", line 117, in main
    lora_weight.copy_from_tensors(tmp)
  File "/home/bibekyess/anaconda3/envs/punica-12.1/lib/python3.10/site-packages/punica/models/llama_lora.py", line 87, in copy_from_tensors
    self.q.copy_from_tensor(ts["q.A"], ts["q.B"])
  File "/home/bibekyess/anaconda3/envs/punica-12.1/lib/python3.10/site-packages/punica/utils/lora.py", line 33, in copy_from_tensor
    self.wa.copy_(a.to(self.wa.device).to(self.wa.dtype))
RuntimeError: The size of tensor a (4096) must match the size of tensor b (5120) at non-singleton dimension 2

What may be the reason? I didn't use your suggested script for finetuning; I used the Hugging Face libraries for that and converted the result to the Punica format.
I noticed that convert_lora_weight.py only takes the LoRA weight file, so do I also need to pass the LoRA adapter config file?
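
The RuntimeError above reports a 4096 versus 5120 mismatch at dimension 2 while copying `q.A`. One possible explanation, stated here as an assumption rather than a diagnosis, is that the adapter was trained against a base model with a different hidden size than the one the example script loaded (4096 and 5120 are the hidden sizes of the 7B and 13B Llama models). A minimal sketch of a shape check that could be run before `copy_from_tensors` follows. The helper name and the example shapes are hypothetical; with PyTorch the shapes would come from `tuple(tensor.shape)` for each tensor in the converted file.

```python
def mismatched_tensors(shapes, dim, expected):
    """Return {name: shape} for entries whose size at `dim` is not `expected`."""
    return {
        name: shape
        for name, shape in shapes.items()
        if len(shape) <= dim or shape[dim] != expected
    }


# Illustrative shapes only: a hypothetical adapter whose "A" matrices were
# built for a hidden size of 5120.
adapter_shapes = {
    "q.A": (40, 16, 5120),
    "k.A": (40, 16, 5120),
}

# The traceback shows the target buffer expects 4096 at dimension 2.
bad = mismatched_tensors(adapter_shapes, dim=2, expected=4096)
for name, shape in bad.items():
    print(f"{name}: adapter shape {shape} does not match expected size 4096")
```

If every listed tensor disagrees in the same dimension, the adapter and the base model most likely come from different model sizes.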

Thank you for your help! :)

pip install -v --no-build-isolation . is giving errors

          instantiation of "void flashinfer::vec_t<float, vec_size>::cast_load(const T *) [with vec_size=8UL, T=nv_bfloat16]" at line 492 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/../flashinfer/decode.cuh
          instantiation of "void flashinfer::BatchDecodeWithPagedKVCacheKernel<cooperative,rotary_mode,norm_on_the_fly,num_stages_smem,vec_size,bdx,bdy,bdz,DTypeIn,DTypeOut,IdType>(DTypeIn *, flashinfer::paged_kv_t<DTypeIn, IdType>, DTypeOut *, float *, float, float, float) [with cooperative=true, rotary_mode=flashinfer::RotaryMode::kNone, norm_on_the_fly=false, num_stages_smem=2U, vec_size=8U, bdx=8U, bdy=1U, bdz=16U, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" at line 1058 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/../flashinfer/decode.cuh
          instantiation of "cudaError_t flashinfer::BatchDecodeWithPagedKVCache(DTypeIn *, flashinfer::paged_kv_t<DTypeIn, IdType>, DTypeOut *, float *, uint32_t, flashinfer::RotaryMode, float, float, cudaStream_t, uint32_t) [with DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" at line 20 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/flashinfer_all.cu
          instantiation of "void FlashInferBatchDecodeKernel(T *, T *, T *, int32_t *, int32_t *, int32_t *, int, int, int, int, int, int, int) [with T=nv_bfloat16]" at line 71 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/flashinfer_all.cu

/home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/../flashinfer/vec_dtypes.cuh(1287): error: identifier "__float22bfloat162_rn" is undefined
__float22bfloat162_rn(((float2*)(&src.data))[i]);
^
detected during:
instantiation of "void flashinfer::vec_t<nv_bfloat16, vec_size>::cast_from(const flashinfer::vec_t<T, vec_size> &) [with vec_size=8UL, T=float]" at line 78
instantiation of "void flashinfer::cast_store_impl(tgt_float_t *, const flashinfer::vec_t<src_float_t, vec_size> &) [with src_float_t=float, tgt_float_t=nv_bfloat16, vec_size=8UL]" at line 1184
instantiation of "void flashinfer::vec_t<float, vec_size>::cast_store(T *) const [with vec_size=8UL, T=nv_bfloat16]" at line 648 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/../flashinfer/decode.cuh
instantiation of "void flashinfer::BatchDecodeWithPagedKVCacheKernel<cooperative,rotary_mode,norm_on_the_fly,num_stages_smem,vec_size,bdx,bdy,bdz,DTypeIn,DTypeOut,IdType>(DTypeIn *, flashinfer::paged_kv_t<DTypeIn, IdType>, DTypeOut *, float *, float, float, float) [with cooperative=true, rotary_mode=flashinfer::RotaryMode::kNone, norm_on_the_fly=false, num_stages_smem=2U, vec_size=8U, bdx=8U, bdy=1U, bdz=16U, DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" at line 1058 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/../flashinfer/decode.cuh
instantiation of "cudaError_t flashinfer::BatchDecodeWithPagedKVCache(DTypeIn *, flashinfer::paged_kv_t<DTypeIn, IdType>, DTypeOut *, float *, uint32_t, flashinfer::RotaryMode, float, float, cudaStream_t, uint32_t) [with DTypeIn=nv_bfloat16, DTypeOut=nv_bfloat16, IdType=int32_t]" at line 20 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/flashinfer_all.cu
instantiation of "void FlashInferBatchDecodeKernel(T *, T *, T *, int32_t *, int32_t *, int32_t *, int, int, int, int, int, int, int) [with T=nv_bfloat16]" at line 71 of /home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/flashinfer_all.cu

4 errors detected in the compilation of "/home/ubuntu/multi-tenant-test/punica/csrc/flashinfer_adapter/flashinfer_all.cu".
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
subprocess.run(
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
main()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 251, in build_wheel
return _build_backend().build_wheel(wheel_directory, config_settings,
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/build_meta.py", line 416, in build_wheel
return self._build_with_temp_dir(['bdist_wheel'], '.whl',
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/build_meta.py", line 401, in _build_with_temp_dir
self.run_setup()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "<string>", line 51, in <module>
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/__init__.py", line 107, in setup
return distutils.core.setup(**attrs)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 185, in setup
return run_commands(dist)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
dist.run_commands()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
self.run_command(cmd)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/dist.py", line 1234, in run_command
super().run_command(command)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 364, in run
self.run_command("build")
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/dist.py", line 1234, in run_command
super().run_command(command)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/command/build.py", line 131, in run
self.run_command(cmd_name)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/dist.py", line 1234, in run_command
super().run_command(command)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
cmd_obj.run()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
self.build_extensions()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
build_ext.build_extensions(self)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 467, in build_extensions
self._build_extensions_serial()
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 493, in _build_extensions_serial
self.build_extension(ext)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
_build_ext.build_extension(self, ext)
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/setuptools/_distutils/command/build_ext.py", line 548, in build_extension
objects = self.compiler.compile(
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
error: subprocess-exited-with-error

× Building wheel for punica (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /home/ubuntu/miniconda3/envs/muti-tenant-test-1/bin/python /home/ubuntu/miniconda3/envs/muti-tenant-test-1/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py build_wheel /tmp/tmpxzh48d2c
cwd: /home/ubuntu/multi-tenant-test/punica
Building wheel for punica (pyproject.toml) ... error
ERROR: Failed building wheel for punica
Failed to build punica
ERROR: Could not build wheels for punica, which is required to install pyproject.toml-based projects

Question about the work

Nice work. Your work focuses on LLM inference and optimizing the inference speed.
Do you support backprop?
If not, how difficult would it be to make backprop work with the custom CUDA kernels?
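
For readers following this question: the thread does not say whether the kernels support a backward pass. The sketch below only writes out the gradients that a backward pass for the LoRA addon `x @ A @ B` would need, using the shapes from the README. It is plain Python on nested lists; the helper names (`matmul`, `transpose`, `lora_addon_backward`) are hypothetical and not part of Punica.

```python
def matmul(a, b):
    """Multiply matrix a [n][k] by matrix b [k][m], both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Transpose a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def lora_addon_backward(x, A, B, grad_y):
    """Gradients of y_lora = x @ A @ B with respect to A, B and x.

    x: [n, H1], A: [H1, r], B: [r, H2], grad_y: [n, H2] (dL/dy).
    The frozen base weight W gets no gradient and is not involved here.
    """
    v = matmul(x, A)                        # [n, r]  intermediate of the "shrink" step
    grad_v = matmul(grad_y, transpose(B))   # [n, r]
    grad_A = matmul(transpose(x), grad_v)   # [H1, r]
    grad_B = matmul(transpose(v), grad_y)   # [r, H2]
    grad_x = matmul(grad_v, transpose(A))   # [n, H1] contribution of the LoRA path only
    return grad_A, grad_B, grad_x
```

Each gradient is again a pair of small matrix products with a rank-`r` inner dimension, so in principle the same shrink/expand structure applies. Whether the existing kernels can be reused for that is not something this sketch establishes.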

Confusion about offset += 8

Thanks for your great work!

When I looked at the implementation of the sgmv_flashinfer version, I was very confused about offset += 8. In my understanding, should it be offset += 4?

https://github.com/punica-ai/punica/blob/master/csrc/sgmv_flashinfer/sgmv_flashinfer.cuh#L69
https://github.com/punica-ai/punica/blob/master/csrc/sgmv_flashinfer/sgmv_flashinfer.cuh#L116

Looking forward to your answer.

improve bgmv expand kernel performance

Hi, sorry to bother you again.
I'm using bgmv in our LLM serving system since the sgmv kernels are not ready yet. I profiled the bgmv kernels and found that the expand kernel performs worse than the shrink kernel.
For example, take batch_size = 4096, hidden_dim = 4096, and lora_rank = 8. The performance profiled with ncu is shown below:

It seems that the expand kernel's memory throughput is lower than the shrink kernel's. That seems reasonable, because all memory operations in the expand kernel go to global memory.
My question is: why doesn't the expand kernel use pipelined asynchronous memory copies from global to shared memory?
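
As a rough aid for the numbers quoted in this post, here is a back-of-the-envelope estimate of the global-memory bytes each kernel moves per input row. It rests on assumptions that are not stated in the thread: fp16 elements (2 bytes), both kernels accumulate into their output (so the output is read and written), and no reuse through caches. The function name is hypothetical.

```python
def bgmv_bytes_per_row(in_dim, out_dim, elem_bytes=2):
    """Estimated global-memory traffic for one row of y += x @ W.

    Assumes fp16 elements and an accumulating output; ignores cache reuse.
    """
    read_input = in_dim              # the input row
    read_weight = in_dim * out_dim   # the gathered LoRA matrix for this row
    read_output = out_dim            # old output value (accumulation, assumed)
    write_output = out_dim           # new output value
    return elem_bytes * (read_input + read_weight + read_output + write_output)


hidden_dim, lora_rank = 4096, 8  # values from the post above
shrink_bytes = bgmv_bytes_per_row(hidden_dim, lora_rank)   # [hidden] -> [rank]
expand_bytes = bgmv_bytes_per_row(lora_rank, hidden_dim)   # [rank] -> [hidden]
weight_bytes = 2 * hidden_dim * lora_rank
print(shrink_bytes, expand_bytes, weight_bytes)
```

Under these assumptions both kernels are dominated by the same weight read, which suggests that a gap in achieved throughput comes from the access pattern, not from the volume of data. The profile itself would be needed to confirm that.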

License to use this library?

Do you plan to add an OSS license? I'd like to use this library in my project, but my project has to be compatible with this library's license.

Support qwen?

Thanks!

[Feature Request] Add support for SM75

Any plans to add support for SM75 like V100 GPUs? Thank you!

BGMV performs better than SGMV?

I benchmarked various kernels on the A100 using the benchmark script, and it seems that the BGMV kernel outperforms the SGMV kernels for individual requests (the BGMV scenario). Is this expected?
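
To make the "BGMV scenario" concrete, the sketch below gives reference semantics for the two operations in plain Python. These are illustrative functions with hypothetical names and argument layouts, not Punica's kernel API. BGMV gathers one weight matrix per row; SGMV gathers one weight matrix per segment of consecutive rows that share a LoRA model.

```python
def vecmat(x, w):
    """Row vector x [h_in] times matrix w [h_in][h_out]."""
    return [sum(x[k] * w[k][j] for k in range(len(x)))
            for j in range(len(w[0]))]


def bgmv_reference(xs, weights, indices):
    """Batched gather: row i uses weights[indices[i]]."""
    return [vecmat(x, weights[idx]) for x, idx in zip(xs, indices)]


def sgmv_reference(xs, weights, seg_starts, seg_lora):
    """Segmented gather: rows seg_starts[j] .. seg_starts[j+1]-1 share weights[seg_lora[j]]."""
    out = []
    for j, lora in enumerate(seg_lora):
        w = weights[lora]  # gathered once per segment
        for i in range(seg_starts[j], seg_starts[j + 1]):
            out.append(vecmat(xs[i], w))
    return out


xs = [[1, 0], [0, 1], [1, 1]]
weights = [[[1], [2]], [[3], [4]]]           # two LoRA matrices of shape [2, 1]
per_row = bgmv_reference(xs, weights, [0, 1, 1])
grouped = sgmv_reference(xs, weights, [0, 1, 3], [0, 1])
singles = sgmv_reference(xs, weights, [0, 1, 2, 3], [0, 1, 1])
```

When every segment has length 1, the segmented form computes exactly what the per-row form computes, so any benefit from sharing a gathered weight across rows disappears. Whether that fully accounts for the measured difference is a question for the maintainers.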

Is the smallest supported rank 16?

Hi, grateful for this nice work.

I'm confused: it seems that only ranks 16, 32, and 64 are supported right now? (ref)

Multi GPU and Multi Node solution

I would like to know how to use multi-GPU and multi-node setups with the current Punica code.
I would also like to know about the runner and scheduler code mentioned in the paper. If it is implemented, could you point me to it?

Choosing adapters on inference

Hi, thanks for the great work.
Is it possible to choose specific adapters at inference time?

Support for H100 GPUs?

When I set TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0", I got compiling errors. And then I found:

punica/.github/workflows/release_wheel.yml

Line 15 in 591b598

TORCH_CUDA_ARCH_LIST: "8.0 8.6 8.9+PTX" # Need fix for 9.0

Is there something that is not supported yet? Thank you in advance!

Update: Adding +PTX works. But it would be great to have 9.0 supported in the future.

ModuleNotFoundError: No module named 'punica.ops._kernels'

Hello, I am trying to run punica with CUDA Toolkit 11.8, but I get the error ModuleNotFoundError: No module named 'punica.ops._kernels' when running:
python -m benchmarks.bench_textgen_lora --system punica --batch-size 32.

The build seems successful except for one warning:

/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')

This is the detailed log when running env TORCH_CUDA_ARCH_LIST="8.0" pip install -v --no-build-isolation:
(I tried running inside the docker container and also outside. In both cases, I get the ModuleNotFoundError.)

Building wheels for collected packages: punica
  Running command Building wheel for punica (pyproject.toml)
  No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
  /root/miniconda3/envs/punica/lib/python3.10/site-packages/setuptools/config/pyprojecttoml.py:66: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
    config = read_configuration(filepath, True, ignore_option_errors, dist)
  running bdist_wheel
  running build
  running build_py
  running egg_info
  writing punica.egg-info/PKG-INFO
  writing dependency_links to punica.egg-info/dependency_links.txt
  writing requirements to punica.egg-info/requires.txt
  writing top-level names to punica.egg-info/top_level.txt
  reading manifest file 'punica.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  no previously-included directories found matching 'benchmarks'
  no previously-included directories found matching '*/__pycache__'
  warning: no previously-included files matching '*.so' found anywhere in distribution
  adding license file 'LICENSE'
  writing manifest file 'punica.egg-info/SOURCES.txt'
  running build_ext
  /root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 11.8
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'punica.ops._kernels' extension
  Emitting ninja build file /punica/build/temp.linux-x86_64-cpython-310/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/6] /usr/local/cuda/bin/nvcc  -I/punica/third_party/cutlass/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/punica/include/python3.10 -c -c /punica/csrc/rms_norm/rms_norm_cutlass.cu -o /punica/build/temp.linux-x86_64-cpython-310/csrc/rms_norm/rms_norm_cutlass.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [2/6] /usr/local/cuda/bin/nvcc  -I/punica/third_party/cutlass/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/punica/include/python3.10 -c -c /punica/csrc/sgmv_flashinfer/sgmv_all.cu -o /punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv_flashinfer/sgmv_all.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [3/6] /usr/local/cuda/bin/nvcc  -I/punica/third_party/cutlass/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/punica/include/python3.10 -c -c /punica/csrc/sgmv/sgmv_cutlass.cu -o /punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv/sgmv_cutlass.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [4/6] /usr/local/cuda/bin/nvcc  -I/punica/third_party/cutlass/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/punica/include/python3.10 -c -c /punica/csrc/flashinfer_adapter/flashinfer_all.cu -o /punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/flashinfer_all.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=sm_80 -std=c++17
  [5/6] c++ -MMD -MF /punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o.d -pthread -B /root/miniconda3/envs/punica/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/punica/include -fPIC -O2 -isystem /root/miniconda3/envs/punica/include -fPIC -I/punica/third_party/cutlass/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/punica/include/python3.10 -c -c /punica/csrc/punica_ops.cc -o /punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  [6/6] /usr/local/cuda/bin/nvcc  -I/punica/third_party/cutlass/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/TH -I/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/include/THC -I/usr/local/cuda/include -I/root/miniconda3/envs/punica/include/python3.10 -c -c /punica/csrc/bgmv/bgmv_all.cu -o /punica/build/temp.linux-x86_64-cpython-310/csrc/bgmv/bgmv_all.o --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_kernels -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_80,code=sm_80 -std=c++17
  g++ -pthread -B /root/miniconda3/envs/punica/compiler_compat -shared -Wl,-rpath,/root/miniconda3/envs/punica/lib -Wl,-rpath-link,/root/miniconda3/envs/punica/lib -L/root/miniconda3/envs/punica/lib -Wl,-rpath,/root/miniconda3/envs/punica/lib -Wl,-rpath-link,/root/miniconda3/envs/punica/lib -L/root/miniconda3/envs/punica/lib /punica/build/temp.linux-x86_64-cpython-310/csrc/bgmv/bgmv_all.o /punica/build/temp.linux-x86_64-cpython-310/csrc/flashinfer_adapter/flashinfer_all.o /punica/build/temp.linux-x86_64-cpython-310/csrc/punica_ops.o /punica/build/temp.linux-x86_64-cpython-310/csrc/rms_norm/rms_norm_cutlass.o /punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv/sgmv_cutlass.o /punica/build/temp.linux-x86_64-cpython-310/csrc/sgmv_flashinfer/sgmv_all.o -L/root/miniconda3/envs/punica/lib/python3.10/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/punica/ops/_kernels.cpython-310-x86_64-linux-gnu.so
  installing to build/bdist.linux-x86_64/wheel
  running install
  running install_lib
  creating build/bdist.linux-x86_64
  creating build/bdist.linux-x86_64/wheel
  creating build/bdist.linux-x86_64/wheel/punica
  creating build/bdist.linux-x86_64/wheel/punica/ops
  copying build/lib.linux-x86_64-cpython-310/punica/ops/_kernels.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel/punica/ops
  copying build/lib.linux-x86_64-cpython-310/punica/ops/__init__.py -> build/bdist.linux-x86_64/wheel/punica/ops
  copying build/lib.linux-x86_64-cpython-310/punica/__init__.py -> build/bdist.linux-x86_64/wheel/punica
  creating build/bdist.linux-x86_64/wheel/punica/models
  copying build/lib.linux-x86_64-cpython-310/punica/models/llama.py -> build/bdist.linux-x86_64/wheel/punica/models
  copying build/lib.linux-x86_64-cpython-310/punica/models/llama_lora.py -> build/bdist.linux-x86_64/wheel/punica/models
  copying build/lib.linux-x86_64-cpython-310/punica/models/__init__.py -> build/bdist.linux-x86_64/wheel/punica/models
  creating build/bdist.linux-x86_64/wheel/punica/utils
  copying build/lib.linux-x86_64-cpython-310/punica/utils/cat_tensor.py -> build/bdist.linux-x86_64/wheel/punica/utils
  copying build/lib.linux-x86_64-cpython-310/punica/utils/convert_lora_weight.py -> build/bdist.linux-x86_64/wheel/punica/utils
  copying build/lib.linux-x86_64-cpython-310/punica/utils/__init__.py -> build/bdist.linux-x86_64/wheel/punica/utils
  copying build/lib.linux-x86_64-cpython-310/punica/utils/kvcache.py -> build/bdist.linux-x86_64/wheel/punica/utils
  copying build/lib.linux-x86_64-cpython-310/punica/utils/lora.py -> build/bdist.linux-x86_64/wheel/punica/utils
  running install_egg_info
  Copying punica.egg-info to build/bdist.linux-x86_64/wheel/punica-0.0.1-py3.10.egg-info
  running install_scripts
  creating build/bdist.linux-x86_64/wheel/punica-0.0.1.dist-info/WHEEL
  creating '/tmp/pip-wheel-xt0fqdps/.tmp-jps4usf1/punica-0.0.1-cp310-cp310-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it
  adding 'punica/__init__.py'
  adding 'punica/models/__init__.py'
  adding 'punica/models/llama.py'
  adding 'punica/models/llama_lora.py'
  adding 'punica/ops/__init__.py'
  adding 'punica/ops/_kernels.cpython-310-x86_64-linux-gnu.so'
  adding 'punica/utils/__init__.py'
  adding 'punica/utils/cat_tensor.py'
  adding 'punica/utils/convert_lora_weight.py'
  adding 'punica/utils/kvcache.py'
  adding 'punica/utils/lora.py'
  adding 'punica-0.0.1.dist-info/LICENSE'
  adding 'punica-0.0.1.dist-info/METADATA'
  adding 'punica-0.0.1.dist-info/WHEEL'
  adding 'punica-0.0.1.dist-info/top_level.txt'
  adding 'punica-0.0.1.dist-info/RECORD'
  removing build/bdist.linux-x86_64/wheel
  Building wheel for punica (pyproject.toml) ... done
  Created wheel for punica: filename=punica-0.0.1-cp310-cp310-linux_x86_64.whl size=799747 sha256=f423816025988aa50102a5792e5dff1debdd9d6910ea4b74364ce1610a216684
  Stored in directory: /tmp/pip-ephem-wheel-cache-fkt205mc/wheels/0e/58/4b/992f075cedd202c2dc89c9ac8a7146ab9ff7495bc4741422bf
Successfully built punica
Installing collected packages: tqdm, safetensors, regex, packaging, fsspec, huggingface-hub, tokenizers, transformers, punica
  changing mode of /root/miniconda3/envs/punica/bin/tqdm to 755
  changing mode of /root/miniconda3/envs/punica/bin/huggingface-cli to 755
  changing mode of /root/miniconda3/envs/punica/bin/transformers-cli to 755
Successfully installed fsspec-2023.10.0 huggingface-hub-0.19.4 packaging-23.2 punica-0.0.1 regex-2023.10.3 safetensors-0.4.0 tokenizers-0.15.0 tqdm-4.66.1 transformers-4.35.2

Could you tell me the recommended CUDA toolkit version for building?
Thank you!
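
A diagnostic that may help with reports like this one. The log above shows `_kernels.cpython-310-x86_64-linux-gnu.so` being added to the wheel, so one possible cause (an assumption, not confirmed in this thread) is that Python imports a different `punica` directory than the installed one, for example the source checkout in the current working directory, which has no compiled extension. The sketch below uses only the standard library to print where each module would be loaded from; the helper name `locate` is hypothetical.

```python
import importlib.util


def locate(module_name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for name in ("punica", "punica.ops", "punica.ops._kernels"):
        print(name, "->", locate(name))
```

If `punica` resolves to a path inside the repository checkout instead of `site-packages`, running the same command from another directory would show whether shadowing is the problem.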

[Question] How to avoid matrices conflict

Assuming W of shape [H1, H2] is the weight of the pretrained model, LoRA adds two small matrices A of shape [H1, r] and B of [r, H2]. Running an input x on the finetuned model would be y := x @ (W + A@B), which is the same as y := x@W + x@A@B.
When there are n LoRA models, there will be A1, B1, A2, B2, ..., An, Bn. Given an input batch X := (x1,x2,...,xn) that maps to each LoRA model, the output is Y := X@W + (x1@A1@B1, x2@A2@B2, ..., xn@An@Bn). The left-hand-side computes the input batch on the pretrained model. It is quite efficient. The latency is almost the same as when there's only one input, thanks to the strong batching effect.

So the question is: if A1, A2, A3 conflict, for example by modifying the same parameters in the base model, will the result still be correct? Thanks!
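
A small worked example of the formula quoted above, in plain Python with integer matrices so the comparison is exact. It assumes the computation follows the formula as written: W is never modified, and each input row adds only its own `x@A@B` term. The helper names and the toy values are illustrative only.

```python
def matmul(a, b):
    """Multiply nested-list matrices a [n][k] and b [k][m]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, W, A, B):
    """y = x@W + x@A@B, leaving W untouched."""
    return add(matmul(x, W), matmul(matmul(x, A), B))


def merged_forward(x, W, A, B):
    """y = x@(W + A@B), using a private merged copy of W."""
    return matmul(x, add(W, matmul(A, B)))


W = [[1, 0], [0, 1]]
A1, B1 = [[1], [0]], [[0, 2]]   # adapter 1 changes the effective W[0][1]
A2, B2 = [[0], [1]], [[3, 0]]   # adapter 2 changes the effective W[1][0]
x = [[1, 2]]

y1 = lora_forward(x, W, A1, B1)
y2 = lora_forward(x, W, A2, B2)
```

Here `y1` and `y2` match what two separately merged models would produce, and `W` is unchanged afterwards. Under this formulation the adapters cannot overwrite each other, even if their deltas touch the same entries of W.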

Support Continuous Batching?

Hi. I've read the Punica paper. It says that "We put the batching dimension on the outmost to enable continuous batching" in Sec 5.4. Could you please tell me which part of the Punica code implements continuous batching? I didn't find it in this repo. Thanks a lot!

bug: sgmv_shrink does not support CUDA graph tracing

When attempting to run the sgmv_shrink kernel as part of a CUDA graph, an error will occur. Digging into it further, it seems that some of the operations performed in the kernel are not supported by CUDA graphs. I haven't dug too deeply into it, but going through the code, I wonder if it might be related to memcpy or other (presumably) blocking calls.

Example:

torch.cuda.synchronize(device)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph, pool=None):
    sgmv_shrink(...)

torch.cuda.synchronize(device)

Thanks for the great work on this project! We're huge fans at LoRAX.

A question on serving multiple-lora models in one server

Great project! It's very useful for industry applications.

In this example (https://github.com/punica-ai/punica/blob/master/examples/tui-multi-lora.py), all requests are the same for different LoRA models, so the computational efficiency gained via SGMV is easy to understand.

But how can "Distinct" requests, where each request is for a different LoRA model, still get a performance boost?

ImportError: cannot import name 'BatchedKvCache' from 'punica'

Everything goes well when I install punica from the binary package. However, it shows "ImportError: cannot import name 'BatchedKvCache' from 'punica'" when I run "python -m benchmarks.bench_textgen_lora --system punica --batch-size 32". How can I fix this?

About scheduler

Thank you for your great work! May I ask about some details on the scheduler?

In the paper, it is mentioned that "To minimize latency penalty, we limit the prefill batch size to 1 for each batch." So if multiple requests are at the prefill stage, they will either be scheduled to different Runners or wait in a first-come-first-served queue. Is this understanding correct? By the way, may I know whether this scheduling code (for section 5.1 Scheduling new request) is released?
In figure 2, may I know the difference between a runner and the LLMs under a runner?

Looking forward to hearing from you~
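
To illustrate the reading of the quoted sentence given in this post, here is a minimal sketch of a batch builder that admits at most one prefill request per batch and fills the rest with running decode requests. This is a hypothetical illustration of a first-come-first-served policy on a single runner. It is not Punica's scheduler, and the names `build_batch`, `prefill_queue`, and `max_batch_size` are made up for the example.

```python
from collections import deque


def build_batch(prefill_queue, decoding, max_batch_size):
    """Build one batch: at most one queued prefill, then running decode requests.

    prefill_queue: deque of waiting requests, oldest first (FCFS, assumed).
    decoding: list of requests already in the decode stage.
    """
    batch = []
    # Admit a new request only if the runner still has room for it.
    if prefill_queue and len(decoding) < max_batch_size:
        batch.append(("prefill", prefill_queue.popleft()))
    for req in decoding[: max_batch_size - len(batch)]:
        batch.append(("decode", req))
    return batch


queue = deque(["a", "b"])
first = build_batch(queue, ["x", "y"], max_batch_size=4)
second = build_batch(queue, ["x", "y", "a"], max_batch_size=4)
```

With two requests waiting, the second one stays in the queue for one extra step and is not prefilled together with the first.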

Why must the CUDA arch be >= 8.0?

Thanks for this nice work on serving multiple LoRAs.
The question is as the title says :)

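Question about pastlen not being updated in the benchmark code bench_lora.py

Hello, great work!

Here, at https://github.com/punica-ai/punica/blob/master/benchmarks/bench_lora.py#L132, every decode step uses the prompt length from the prefill stage as the request's pastlen (I did not see it updated anywhere afterwards), which looks like a problem.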
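
For clarity, the behavior this report seems to expect can be sketched as follows. This is a hypothetical helper, not code from `bench_lora.py`, and it assumes the KV cache grows by one token per decode step.

```python
def decode_pastlens(prompt_len, num_decode_steps):
    """Past length seen by each decode step when the cache grows by one token per step."""
    pastlen = prompt_len
    seen = []
    for _ in range(num_decode_steps):
        seen.append(pastlen)
        pastlen += 1  # the token just generated is appended to the cache
    return seen
```

If the benchmark keeps the length fixed, every step would see `prompt_len` and never the growing sequence shown here. Whether that is intentional in the benchmark is for the maintainers to say.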

Is Rotary_pos_emb missing?

I have a question about the Rotary_pos_emb function in llama. I could not find this function in the code, so I can only guess that it is implemented in _kernels.batch_decode(...) in the batch_decode function?
So if I want to find Rotary_pos_emb, do I need to check flashinfer's code?
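
As a reference for what such a fused step would have to compute, here is a plain-Python sketch of rotary position embedding for a single head vector. It uses the "rotate half" pairing (element `i` is paired with element `i + d/2`) and base 10000, as in common Llama implementations. Which pairing the kernel in question uses is not stated in this thread, so treat this as an assumption.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Apply rotary position embedding to one head vector of even length."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)  # rotation angle of pair i
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def dot(u, v):
    """Inner product of two vectors."""
    return sum(p * q for p, q in zip(u, v))
```

Because each pair is rotated by an angle proportional to the position, the inner product of a rotated query and a rotated key depends only on the distance between their positions. That property is what lets the rotation be applied on the fly inside an attention kernel.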

Why is Punica faster than the pretrained LLM?

I notice that in your figure, LoRA with SGMV is even faster than the pretrained LLM. It looks like Punica is faster than an LLM without LoRA. Can you explain the details of the experiment and evaluation?

flashinfer shrink vs cutlass

Hi, I really enjoyed learning about SGMV.

I was reading through the code and wanted to check my understanding. It seems that there are two implementations of SGMV: one based on CUTLASS Grouped GEMM and a handwritten one (using some utils from flashinfer). Just wondering, how do the two compare in performance?

Inquiry on cuda memory across processes

Hi,

Congratulations on the great work you have done! I am very interested in it. Specifically, I want to know how you allow multiple serving processes to share the same CUDA memory space (for the frozen parameters in the LoRA models).

Could you please point out the code? I want to study your implementation. Thanks!

BR//Zizhao

ModuleNotFoundError: No module named 'rich'

Amazing work on Punica. I read the research paper. @abcdabcd987
I am having an issue running python examples/tui-multi-lora.py

I am getting the following error:
(muti-tenant-test-2) ubuntu@ip-10-14-1-163:~/multi-tenant-test/punica$ python examples/tui-multi-lora.py
Traceback (most recent call last):
File "/home/ubuntu/multi-tenant-test/punica/examples/tui-multi-lora.py", line 11, in
from rich.containers import Lines
ModuleNotFoundError: No module named 'rich'

Are all the LoRAs trained starting from the same base model?

RuntimeError: output must be a cuda tensor

Hi!

I tried running the text generation benchmark:

python -m benchmarks.bench_textgen_lora --system punica --batch-size 32

but when I did, I got a runtime error stating that the output must be a CUDA tensor.
I am not sure whether this error comes from my side or from the code. This is the error I am given:

The cuda version I use is 12.4, python version 3.10.12, ninja version 1.11.1.git.kitware.jobserver-1, torch version 2.2.2.

Add support for running on Colab

I'm not able to install this library on Colab. I tried this

git clone https://github.com/punica-ai/punica
cd punica && pip install .

But this is failing with the following error

Processing /content/punica
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'
Cloning into 'punica'...
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
