squeezeailab / kvquant

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Home Page: https://arxiv.org/abs/2401.18079

Python 98.86% Makefile 0.01% Dockerfile 0.04% Jsonnet 0.01% Shell 0.12% Jupyter Notebook 0.36% C++ 0.05% Cuda 0.53% C 0.01% Cython 0.01%
compression efficient-inference efficient-model large-language-models llama llm localllama localllm mistral model-compression

kvquant's People

Contributors

chooper1


kvquant's Issues

Question about storage

Thanks for your great work and for open-sourcing the code!
I have some questions about how the sparse matrix is stored. Could you please provide the code to reproduce Table 10 from the ablation experiments in your paper?
Thanks a lot!
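For rough intuition about what this storage amounts to (an assumption about the layout on my part, not necessarily the exact setup behind Table 10): keeping a fraction s of the cache entries as fp16 outliers in CSR form costs roughly s·l·d·(16 + 32) bits for values plus column indices, plus (l + 1)·32 bits of row pointers, for an l-token, d-channel cache. A small back-of-the-envelope sketch:

def sparse_outlier_bytes(tokens, channels, sparsity=0.01, val_bits=16, idx_bits=32):
    # Approximate CSR storage for the outliers: values + column indices + row pointers.
    nnz = sparsity * tokens * channels
    values = nnz * val_bits / 8
    col_idx = nnz * idx_bits / 8
    row_ptr = (tokens + 1) * idx_bits / 8
    return values + col_idx + row_ptr

# e.g. a 4096-token, 4096-channel key cache with 1% outliers:
# sparse_outlier_bytes(4096, 4096) ≈ 1.0 MB on top of the dense low-bit cache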

CUDA error: an illegal memory access was encountered

Thank you for your excellent work!

Currently, I am trying to reproduce KVQuant but have encountered some errors. Your assistance would be appreciated.

1. Reproduce the bug

I followed the provided instructions and set up the environments for gradient/quant/deployment. The gradient and quantization steps worked well: I successfully computed the gradients and built the quantizer. However, when I tested the deployment code with the following commands, I encountered the error "CUDA error: an illegal memory access was encountered."

cp ../quant/quantizers.pickle .

CUDA_VISIBLE_DEVICES=1 python llama.py JackFram/llama-160m wikitext2 \
    --abits 4 \
    --include_sparse \
    --sparsity-threshold 0.99 \
    --quantizer-path quantizers.pickle \
    --benchmark 128 \
    --check

2. Error logs

The detailed error logs are shown as follows:

/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
splitting into 1 GPUs
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Load quantizers.
k:  model.layers.0.self_attn.k_proj
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:449: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_upper = torch.tensor(quantizer[0]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:450: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_lower = torch.tensor(quantizer[1]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:484: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lut_tmp = torch.tensor(self.lut)
k:  model.layers.0.self_attn.v_proj
k:  model.layers.1.self_attn.k_proj
k:  model.layers.1.self_attn.v_proj
k:  model.layers.2.self_attn.k_proj
k:  model.layers.2.self_attn.v_proj
k:  model.layers.3.self_attn.k_proj
k:  model.layers.3.self_attn.v_proj
k:  model.layers.4.self_attn.k_proj
k:  model.layers.4.self_attn.v_proj
k:  model.layers.5.self_attn.k_proj
k:  model.layers.5.self_attn.v_proj
k:  model.layers.6.self_attn.k_proj
k:  model.layers.6.self_attn.v_proj
k:  model.layers.7.self_attn.k_proj
k:  model.layers.7.self_attn.v_proj
k:  model.layers.8.self_attn.k_proj
k:  model.layers.8.self_attn.v_proj
k:  model.layers.9.self_attn.k_proj
k:  model.layers.9.self_attn.v_proj
k:  model.layers.10.self_attn.k_proj
k:  model.layers.10.self_attn.v_proj
k:  model.layers.11.self_attn.k_proj
k:  model.layers.11.self_attn.v_proj
Model type : llama
Benchmarking ...
Traceback (most recent call last):
  File "/root/KVQuant/deployment/llama.py", line 224, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/root/KVQuant/deployment/llama.py", line 82, in benchmark
    out = model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2683, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2565, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2250, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 1965, in forward
    attn_weights = self.kcache.forward_fused_sparse(query_states, key_states)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 710, in forward_fused_sparse
    outliers_rescaled = outliers_rescaled.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

From my understanding, the error appears to be related to the CUDA kernel implementation "vecquant4appendvecKsparse", which modifies the variable "outliers_rescaled".
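One hedged way to localize this kind of asynchronous CUDA failure (not part of the original report) is to force synchronous kernel launches and synchronize right after the suspect call, so that the illegal access is reported at the kernel that actually faults rather than at the next .cpu() transfer. A minimal sketch, assuming the deployment setup above:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before CUDA initializes

import torch

def checked(fn, *args, **kwargs):
    # Launch a CUDA op and synchronize immediately, so an illegal memory access
    # surfaces at this call site instead of at a later host-device copy.
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out

# Hypothetical usage inside the attention forward in modeling_llama.py:
# attn_weights = checked(self.kcache.forward_fused_sparse, query_states, key_states)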

3. Environment

  • OS: Ubuntu 20.04 LTS
  • GPU: Tesla P100-PCIE-16GB
  • Packages (pip list):
Package                  Version     Editable project location
------------------------ ----------- -------------------------------------
accelerate               0.29.3
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
certifi                  2024.2.2
charset-normalizer       3.3.2
datasets                 2.19.0
dill                     0.3.8
einops                   0.8.0
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
huggingface-hub          0.23.0
idna                     3.7
Jinja2                   3.1.3
kvquant                  0.1.0       /root/KVQuant/deployment
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.2.2
pip                      23.3.1
protobuf                 5.26.1
psutil                   5.9.8
pyarrow                  16.0.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
quant-cuda               0.0.0
regex                    2024.4.28
requests                 2.31.0
safetensors              0.4.3
sentencepiece            0.2.0
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
tokenizers               0.15.2
torch                    2.3.0
tqdm                     4.66.4
transformers             4.38.0.dev0 /root/KVQuant/deployment/transformers
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
urllib3                  2.2.1
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

Due to hardware constraints, I intend to run a quick test with the smaller model weights indicated above. I expect KVQuant to work properly, since the smaller model differs from Llama-7B only in weight size while sharing a similar architecture.
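One hedged sanity check, not part of the original report: the fused deployment kernels may assume Llama-7B-sized head dimensions, so it can be worth comparing the two model configurations before digging into the kernel itself. A minimal sketch (the 7B model name is illustrative):

from transformers import AutoConfig

for name in ["JackFram/llama-160m", "huggyllama/llama-7b"]:
    cfg = AutoConfig.from_pretrained(name)
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    print(f"{name}: hidden={cfg.hidden_size}, heads={cfg.num_attention_heads}, "
          f"head_dim={head_dim}, layers={cfg.num_hidden_layers}")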

4. Related solutions I have tried

As suggested in the discussion of this CUDA error at https://github.com/pytorch/pytorch/issues/21819, I have updated CUDA, torch, and other relevant components to the latest versions. However, I am still encountering the same error.

What could be causing this error, and how can I solve it?

Thanks in advance!

Coupled Channel-wise Quantization

This paper introduces a quantization method that couples contiguous channels and quantizes them jointly in this coupled form (using Fisher information). It seems intuitive to extend this method to KVQuant to enhance the quantization of key activations. Does the team have any plans to pursue such work?
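For context, KVQuant derives its non-uniform codebooks with sensitivity-weighted k-means (using Fisher information); the coupling proposed here would instead cluster small groups of contiguous channels jointly, i.e. in a higher-dimensional space. A rough, hypothetical sketch of that idea (not the repository's implementation; names are illustrative):

from sklearn.cluster import KMeans

def coupled_channel_codebooks(x, fisher, group=2, n_centroids=16):
    # x, fisher: arrays of shape (tokens, channels); adjacent channels are coupled.
    t, c = x.shape
    pts = x.reshape(t, c // group, group)             # couple `group` contiguous channels
    w = fisher.reshape(t, c // group, group).sum(-1)  # Fisher weight per coupled point
    codebooks = []
    for j in range(c // group):
        km = KMeans(n_clusters=n_centroids, n_init=10)
        km.fit(pts[:, j, :], sample_weight=w[:, j])   # sensitivity-weighted clustering
        codebooks.append(km.cluster_centers_)         # (n_centroids, group) centroids
    return codebooks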

Problem when reproducing experiments

Thanks for the great work!
I'm having a small problem reproducing the PPL results in the paper. I used the code snippet from the GPTQ repo to measure PPL and was able to reproduce the fp16 baselines for the Llama family reported in the paper, but I was unable to reproduce the fp16 baseline for Mistral-7B using the same test code:

import torch
import torch.nn as nn
from datasets import load_dataset
from tqdm import tqdm

# model, tokenizer, and input_len (the evaluation sequence length) are defined elsewhere
testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')["input_ids"]

nsamples = testenc.numel() // input_len
nlls = []

loss_fct = nn.CrossEntropyLoss()
for i in tqdm(range(nsamples)):
    batch = testenc[:, (i * input_len) : ((i + 1) * input_len)].to(model.device)
    with torch.no_grad():
        outputs = model.model(batch)
        hidden_states = outputs[0]
        logits = model.lm_head(hidden_states)
    # shift so that each token is predicted from the tokens before it
    shift_logits = logits[:, :-1, :]
    shift_labels = batch[:, 1:].to(model.lm_head.weight.device)
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    neg_log_likelihood = loss.float() * input_len
    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * input_len)).item()

Specifically, I used mistral-7b-v0.1 and tried seqlen=8000 as well as seqlen=8192; both gave results slightly lower than those in the paper, which has been a bit puzzling.

Would you be willing to release the code you used for measuring PPL?

self.include_sparse being 0 causes an assert(False) error

Excuse me: when executing cache-llama-activations.py in the deployment directory to generate activations.pickle, an assert(False) error is raised in the QuantK class's parallel_pack function in deployment/transformers/src/transformers/models/llama/modeling_llama.py, with self.include_sparse set to 0, as shown in the image. It seems there is an issue with the workflow.

The quantizers.pickle file has been generated successfully. Should the instructions in the README be adjusted so that activations.pickle can be generated successfully?
Label: bug

Reproducing the ablation results in Figure 1

Thanks for your great work!
I want to reproduce the ablation results presented in Figure 1 of the paper. According to Figure 1, Per-Channel Key Quantization + Pre-RoPE Key Quantization yields PPL = 6.34 for Llama-7B in the 3-bit setting. However, I got PPL = 6.71 by running the following command:
CUDA_VISIBLE_DEVICES=0 python llama_simquant.py <path-to-llama-7b-hf> --abits 4 --nsamples 16 --seqlen 2048 --quantize --quantizer_path quantizers.pickle ;
I can't figure out why. Could you please give me some advice? Thank you so much!

AttributeError: 'LlamaModel' object has no attribute 'split_gpus'

When I try:
CUDA_VISIBLE_DEVICES=0 python llama_simquant.py --abits 4 --nsamples 16 --seqlen 2048 --nuq --fisher --quantize --include_sparse --sparsity-threshold 0.99 --quantizer_path quantizers.pickle ;

I get this error:
AttributeError: 'LlamaModel' object has no attribute 'split_gpus'

What is the problem?

Where is the code for "ATOM-4bit" in the KVQuant codebase?

Thank you for your great work!

Now I want to reproduce the perplexity of LLaMA-7B on Wikitext-2 with the "ATOM-4bit" method, but I cannot find the corresponding code in KVQuant.
Should I clone the Atom repo and reproduce the perplexity there?
Looking forward to your reply. Thanks.

Pre-RoPE quantization during inference

Thanks for the great work! I am curious about the time complexity of pre-RoPE quantization.

In detail, I assume the operations occur in the following order with pre-RoPE quantization during inference: qkv_projection_matmul -> quantize_k -> write_cache_k -> load_cache_k -> dequantize_k -> rope_k -> transpose_k. However, in the decode phase the sequence length grows at every step, so rope_k has to be applied to all previous token features at each step. This is O(m*m) time complexity overall, where m is the sequence length.
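To make the pre-RoPE path concrete, here is a minimal sketch (hypothetical, not the repository's fused kernels) of one decode step against a pre-RoPE key cache: all m cached keys are dequantized and re-rotated before the score matmul, so the RoPE work is O(m) per step and O(m*m) summed over a generation:

import torch

def rope(x, positions, theta=10000.0):
    # Rotary embedding for x of shape (m, d); rotates each (even, odd) dimension pair.
    d = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = positions[:, None].float() * inv_freq[None, :]        # (m, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def decode_step_pre_rope(query, cached_keys_quant, dequantize):
    # Pre-RoPE cache: every decode step dequantizes and re-rotates all m cached keys.
    keys = dequantize(cached_keys_quant)                           # (m, d) pre-RoPE keys
    keys = rope(keys, torch.arange(keys.shape[0]))                 # O(m) RoPE work per step
    return query @ keys.T                                          # (1, m) attention scores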

This differs from the post-RoPE case, because there the cache holds post-RoPE quantized keys, so the time complexity is O(m).

One workaround is to save the RoPE result in another cache, making the time complexity O(m), but that costs much more storage. Another option, I suppose, is to overwrite the cache with the post-RoPE keys (bfloat16/float16), but that would conflict with the default cache dtype (INT4/INT2).

Please correct me if anything above is wrong. Looking forward to your reply. Thanks.
