squeezeailab / kvquant

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Home Page: https://arxiv.org/abs/2401.18079

Python 98.86% Makefile 0.01% Dockerfile 0.04% Jsonnet 0.01% Shell 0.12% Jupyter Notebook 0.36% C++ 0.05% Cuda 0.53% C 0.01% Cython 0.01%
compression efficient-inference efficient-model large-language-models llama llm localllama localllm mistral model-compression

kvquant's People

Contributors

chooper1


kvquant's Issues

Question about storage

Thanks for your great work and for open-sourcing the code!
I have some questions about how the sparse matrix is stored. Could you please provide the code to reproduce Table 10 from the ablation experiments in your paper?
Thanks a lot!
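For rough intuition about what this storage amounts to (an assumption about the layout on my part, not necessarily the exact setup behind Table 10): keeping a fraction s of the cache entries as fp16 outliers in CSR form costs roughly s·l·d·(16 + 32) bits for values plus column indices, plus (l + 1)·32 bits of row pointers, for an l-token, d-channel cache. A small back-of-the-envelope sketch:

def sparse_outlier_bytes(tokens, channels, sparsity=0.01, val_bits=16, idx_bits=32):
    # Approximate CSR storage for the outliers: values + column indices + row pointers.
    nnz = sparsity * tokens * channels
    values = nnz * val_bits / 8
    col_idx = nnz * idx_bits / 8
    row_ptr = (tokens + 1) * idx_bits / 8
    return values + col_idx + row_ptr

# e.g. a 4096-token, 4096-channel key cache with 1% outliers:
# sparse_outlier_bytes(4096, 4096) ≈ 1.0 MB on top of the dense low-bit cache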

CUDA error: an illegal memory access was encountered

Thank you for your excellent work!

Currently, I am trying to reproduce KVQuant but have encountered some errors. Your assistance would be appreciated.

1. Reproduce the bug

I followed the provided instructions and set up the environments for gradient/quant/deployment. The gradient and quantization steps worked well: I successfully computed the gradients and built the quantizer. However, when I tested the deployment code with the following commands, I encountered the error "CUDA error: an illegal memory access was encountered."

cp ../quant/quantizers.pickle .

CUDA_VISIBLE_DEVICES=1 python llama.py JackFram/llama-160m wikitext2 \
    --abits 4 \
    --include_sparse \
    --sparsity-threshold 0.99 \
    --quantizer-path quantizers.pickle \
    --benchmark 128 \
    --check

2. Error logs

The detailed error logs are shown as follows:

/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
splitting into 1 GPUs
/root/anaconda3/envs/deploy/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Load quantizers.
k:  model.layers.0.self_attn.k_proj
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:449: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_upper = torch.tensor(quantizer[0]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:450: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.outlier_threshold_lower = torch.tensor(quantizer[1]).cuda().half().flatten()
/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py:484: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  lut_tmp = torch.tensor(self.lut)
k:  model.layers.0.self_attn.v_proj
k:  model.layers.1.self_attn.k_proj
k:  model.layers.1.self_attn.v_proj
k:  model.layers.2.self_attn.k_proj
k:  model.layers.2.self_attn.v_proj
k:  model.layers.3.self_attn.k_proj
k:  model.layers.3.self_attn.v_proj
k:  model.layers.4.self_attn.k_proj
k:  model.layers.4.self_attn.v_proj
k:  model.layers.5.self_attn.k_proj
k:  model.layers.5.self_attn.v_proj
k:  model.layers.6.self_attn.k_proj
k:  model.layers.6.self_attn.v_proj
k:  model.layers.7.self_attn.k_proj
k:  model.layers.7.self_attn.v_proj
k:  model.layers.8.self_attn.k_proj
k:  model.layers.8.self_attn.v_proj
k:  model.layers.9.self_attn.k_proj
k:  model.layers.9.self_attn.v_proj
k:  model.layers.10.self_attn.k_proj
k:  model.layers.10.self_attn.v_proj
k:  model.layers.11.self_attn.k_proj
k:  model.layers.11.self_attn.v_proj
Model type : llama
Benchmarking ...
Traceback (most recent call last):
  File "/root/KVQuant/deployment/llama.py", line 224, in <module>
    benchmark(model, input_ids, check=args.check)
  File "/root/KVQuant/deployment/llama.py", line 82, in benchmark
    out = model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2683, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2565, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 2250, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 1965, in forward
    attn_weights = self.kcache.forward_fused_sparse(query_states, key_states)
  File "/root/KVQuant/deployment/transformers/src/transformers/models/llama/modeling_llama.py", line 710, in forward_fused_sparse
    outliers_rescaled = outliers_rescaled.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

From my understanding, the error appears to be related to the CUDA kernel implementation "vecquant4appendvecKsparse", which modifies the variable "outliers_rescaled".
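One hedged way to localize this kind of asynchronous CUDA failure (not part of the original report) is to force synchronous kernel launches and synchronize right after the suspect call, so that the illegal access is reported at the kernel that actually faults rather than at the next .cpu() transfer. A minimal sketch, assuming the deployment setup above:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before CUDA initializes

import torch

def checked(fn, *args, **kwargs):
    # Launch a CUDA op and synchronize immediately, so an illegal memory access
    # surfaces at this call site instead of at a later host-device copy.
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out

# Hypothetical usage inside the attention forward in modeling_llama.py:
# attn_weights = checked(self.kcache.forward_fused_sparse, query_states, key_states)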

3. Environment

  • OS: Ubuntu 20.04 LTS
  • GPU: Tesla P100-PCIE-16GB
  • Packages (pip list):
Package                  Version     Editable project location
------------------------ ----------- -------------------------------------
accelerate               0.29.3
aiohttp                  3.9.5
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    23.2.0
certifi                  2024.2.2
charset-normalizer       3.3.2
datasets                 2.19.0
dill                     0.3.8
einops                   0.8.0
filelock                 3.14.0
flash-attn               2.5.8
frozenlist               1.4.1
fsspec                   2024.3.1
huggingface-hub          0.23.0
idna                     3.7
Jinja2                   3.1.3
kvquant                  0.1.0       /root/KVQuant/deployment
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
networkx                 3.2.1
ninja                    1.11.1.1
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
packaging                24.0
pandas                   2.2.2
pip                      23.3.1
protobuf                 5.26.1
psutil                   5.9.8
pyarrow                  16.0.0
pyarrow-hotfix           0.6
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
quant-cuda               0.0.0
regex                    2024.4.28
requests                 2.31.0
safetensors              0.4.3
sentencepiece            0.2.0
setuptools               68.2.2
six                      1.16.0
sympy                    1.12
tokenizers               0.15.2
torch                    2.3.0
tqdm                     4.66.4
transformers             4.38.0.dev0 /root/KVQuant/deployment/transformers
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
urllib3                  2.2.1
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4

Due to hardware constraints, I intend to run a quick test with the smaller model weights indicated above. I expect KVQuant to work properly, since the smaller model differs from Llama-7B only in weight size while sharing a similar architecture.
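One hedged sanity check, not part of the original report: the fused deployment kernels may assume Llama-7B-sized head dimensions, so it can be worth comparing the two model configurations before digging into the kernel itself. A minimal sketch (the 7B model name is illustrative):

from transformers import AutoConfig

for name in ["JackFram/llama-160m", "huggyllama/llama-7b"]:
    cfg = AutoConfig.from_pretrained(name)
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    print(f"{name}: hidden={cfg.hidden_size}, heads={cfg.num_attention_heads}, "
          f"head_dim={head_dim}, layers={cfg.num_hidden_layers}")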

4. Related solutions I have tried

As suggested in the discussion of this CUDA error at https://github.com/pytorch/pytorch/issues/21819, I have updated CUDA, torch, and other relevant components to the latest versions. However, I am still encountering the same error.

What could be causing this error, and how can I solve it?

Thanks in advance!

Coupled Channel-wise Quantization

This paper introduces a quantization method that couples contiguous channels and quantizes them jointly in this coupled form (using Fisher information). It seems intuitive to extend this method to KVQuant to enhance the quantization of key activations. Does the team have any plans to pursue such work?
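For context, KVQuant derives its non-uniform codebooks with sensitivity-weighted k-means (using Fisher information); the coupling proposed here would instead cluster small groups of contiguous channels jointly, i.e. in a higher-dimensional space. A rough, hypothetical sketch of that idea (not the repository's implementation; names are illustrative):

from sklearn.cluster import KMeans

def coupled_channel_codebooks(x, fisher, group=2, n_centroids=16):
    # x, fisher: arrays of shape (tokens, channels); adjacent channels are coupled.
    t, c = x.shape
    pts = x.reshape(t, c // group, group)             # couple `group` contiguous channels
    w = fisher.reshape(t, c // group, group).sum(-1)  # Fisher weight per coupled point
    codebooks = []
    for j in range(c // group):
        km = KMeans(n_clusters=n_centroids, n_init=10)
        km.fit(pts[:, j, :], sample_weight=w[:, j])   # sensitivity-weighted clustering
        codebooks.append(km.cluster_centers_)         # (n_centroids, group) centroids
    return codebooks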

Problem when reproducing experiments

Thanks for the great work!
I'm having a small problem reproducing the PPL results in the paper. I used the code snippet from the GPTQ repo to measure PPL and was able to reproduce the fp16 baselines for the Llama family reported in the paper, but I was unable to reproduce the fp16 baseline for Mistral-7B using the same test code:

import torch
import torch.nn as nn
from datasets import load_dataset
from tqdm import tqdm

# model, tokenizer, and input_len (the evaluation sequence length) are defined elsewhere
testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')["input_ids"]

nsamples = testenc.numel() // input_len
nlls = []

loss_fct = nn.CrossEntropyLoss()
for i in tqdm(range(nsamples)):
    batch = testenc[:, (i * input_len) : ((i + 1) * input_len)].to(model.device)
    with torch.no_grad():
        outputs = model.model(batch)
        hidden_states = outputs[0]
        logits = model.lm_head(hidden_states)
    # shift so that each token is predicted from the tokens before it
    shift_logits = logits[:, :-1, :]
    shift_labels = batch[:, 1:].to(model.lm_head.weight.device)
    loss = loss_fct(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    neg_log_likelihood = loss.float() * input_len
    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * input_len)).item()

Specifically, I used mistral-7b-v0.1 and tried seqlen=8000 as well as seqlen=8192; both gave results slightly lower than those in the paper, which has been a bit puzzling.

Would you be willing to release the code you used for measuring PPL?

self.include_sparse being 0 causes an assert(False) error

Excuse me: when executing cache-llama-activations.py in the deployment directory to generate activations.pickle, an assert(False) error is raised in the QuantK class's parallel_pack function in deployment/transformers/src/transformers/models/llama/modeling_llama.py, with self.include_sparse set to 0, as shown in the image. It seems there is an issue with the workflow.

The quantizers.pickle file has been generated successfully. Should the instructions in the README be adjusted so that activations.pickle can be generated successfully?
Label: bug

Reproducing the ablation results in Figure 1

Thanks for your great work!
I want to reproduce the ablation results presented in Figure 1 of the paper. According to Figure 1, Per-Channel Key Quantization + Pre-RoPE Key Quantization yields PPL = 6.34 for Llama-7B in the 3-bit setting. However, I got PPL = 6.71 by running the following command:
CUDA_VISIBLE_DEVICES=0 python llama_simquant.py <path-to-llama-7b-hf> --abits 4 --nsamples 16 --seqlen 2048 --quantize --quantizer_path quantizers.pickle ;
I can't figure out why. Could you please give me some advice? Thank you so much!

AttributeError: 'LlamaModel' object has no attribute 'split_gpus'

When I try:
CUDA_VISIBLE_DEVICES=0 python llama_simquant.py --abits 4 --nsamples 16 --seqlen 2048 --nuq --fisher --quantize --include_sparse --sparsity-threshold 0.99 --quantizer_path quantizers.pickle ;

I get this error:
AttributeError: 'LlamaModel' object has no attribute 'split_gpus'

What is the problem?

Where is the code for "ATOM-4bit" in the KVQuant codebase?

Thank you for your great work!

Now I want to reproduce the perplexity of LLaMA-7B on Wikitext-2 with the "ATOM-4bit" method, but I cannot find the corresponding code in KVQuant.
Should I clone the Atom repo and reproduce the perplexity there?
Looking forward to your reply. Thanks.

Pre-RoPE quantization during inference

Thanks for the great work! I am curious about the time complexity of pre-RoPE quantization.

In detail, I assume the operations occur in the following order with pre-RoPE quantization during inference: qkv_projection_matmul -> quantize_k -> write_cache_k -> load_cache_k -> dequantize_k -> rope_k -> transpose_k. However, in the decode phase the sequence length grows at every step, so rope_k has to be applied to all previous token features at each step. This is O(m*m) time complexity overall, where m is the sequence length.
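To make the pre-RoPE path concrete, here is a minimal sketch (hypothetical, not the repository's fused kernels) of one decode step against a pre-RoPE key cache: all m cached keys are dequantized and re-rotated before the score matmul, so the RoPE work is O(m) per step and O(m*m) summed over a generation:

import torch

def rope(x, positions, theta=10000.0):
    # Rotary embedding for x of shape (m, d); rotates each (even, odd) dimension pair.
    d = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = positions[:, None].float() * inv_freq[None, :]        # (m, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def decode_step_pre_rope(query, cached_keys_quant, dequantize):
    # Pre-RoPE cache: every decode step dequantizes and re-rotates all m cached keys.
    keys = dequantize(cached_keys_quant)                           # (m, d) pre-RoPE keys
    keys = rope(keys, torch.arange(keys.shape[0]))                 # O(m) RoPE work per step
    return query @ keys.T                                          # (1, m) attention scores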

This differs from the post-RoPE case, because there the cache holds post-RoPE quantized keys, so the time complexity is O(m).

One workaround is to save the RoPE result in another cache, making the time complexity O(m), but that costs much more storage. Another option, I suppose, is to overwrite the cache with the post-RoPE keys (bfloat16/float16), but that would conflict with the default cache dtype (INT4/INT2).

Please correct me if anything above is wrong. Looking forward to your reply. Thanks.
