Comments (4)
Hi! I am not an author, but a contributor, and I have some familiarity with the issue.
As you correctly describe, AQLM does not do KV cache quantization itself; it relies on the standard transformers code.
If you plug in any data-free cache quantization, e.g. 8-bit KV compression, the impact will likely be the same as using KV quantization with any other model.
As for the 4-bit variant: I have not tried that specific one, but according to your description and code, it should be straightforward to implement with AQLM.
One way you can do this is by extending the transformers Cache class roughly as follows (a code sketch follows the list):
- at `__init__`, create a storage of quantized KVs, similarly to StaticCache
- during `update`, de-quantize the cached items and return the de-quantized cache to the user
- as a side effect of `update`, quantize the user's key_states and value_states and write them to your KV storage
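To make the recipe above concrete, here is a minimal sketch, assuming a recent transformers version where Cache subclasses implement `update(key_states, value_states, layer_idx, cache_kwargs)` and `get_seq_length`. The class and helper names (`QuantizedDynamicCache`, `_quantize`, `_dequantize`) are hypothetical, and the per-token absmax int8 scheme is purely illustrative, not the paper's method:

```python
import torch
from transformers.cache_utils import Cache


def _quantize(x: torch.Tensor):
    # Symmetric per-token absmax quantization to int8 over the head dimension.
    # Scales are computed in float32 to avoid fp16 underflow.
    scale = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-8) / 127.0
    q = (x.float() / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale


def _dequantize(q: torch.Tensor, scale: torch.Tensor, dtype: torch.dtype):
    return (q.float() * scale).to(dtype)


class QuantizedDynamicCache(Cache):
    """Hypothetical cache that stores KVs in int8 and de-quantizes on read."""

    def __init__(self):
        super().__init__()
        self._k, self._v = [], []  # per layer: (int8 tensor, float32 scale)

    def get_seq_length(self, layer_idx: int = 0) -> int:
        if len(self._k) <= layer_idx:
            return 0
        return self._k[layer_idx][0].shape[-2]

    def get_max_length(self):
        return None  # grows dynamically, like DynamicCache

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        # Side effect: quantize the incoming states into the int8 storage.
        k_q, k_s = _quantize(key_states)
        v_q, v_s = _quantize(value_states)
        if len(self._k) <= layer_idx:
            self._k.append((k_q, k_s))
            self._v.append((v_q, v_s))
        else:
            ok_q, ok_s = self._k[layer_idx]
            ov_q, ov_s = self._v[layer_idx]
            self._k[layer_idx] = (torch.cat([ok_q, k_q], dim=-2),
                                  torch.cat([ok_s, k_s], dim=-2))
            self._v[layer_idx] = (torch.cat([ov_q, v_q], dim=-2),
                                  torch.cat([ov_s, v_s], dim=-2))
        # Return value: de-quantize only this layer's cache, so at most one
        # layer is materialized in 16-bit at any time.
        k_q, k_s = self._k[layer_idx]
        v_q, v_s = self._v[layer_idx]
        return (_dequantize(k_q, k_s, key_states.dtype),
                _dequantize(v_q, v_s, value_states.dtype))
```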
This should give you the expected memory savings, since only one attention layer's cache is de-quantized at a time.
As for speed-ups, this is unlikely to run any faster than 16-bit attention. If you want speed-ups, you will need custom attention kernels that accept KV inputs in 4-bit or 8-bit precision.
Since the NeurIPS deadline is soon, it is unlikely that the paper authors will be able to implement this anytime soon. However, if you try it and share your observations, we'd be glad to take a look. In turn, if you run into any issues with AQLM while doing so, please let us know.
This issue is stale because it has been open for 30 days with no activity.
For anyone curious, Hugging Face has done it: https://huggingface.co/blog/kv-cache-quantization
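For reference, usage looks roughly like this, assuming a transformers release recent enough to ship the quantized-cache feature described in that post (the exact kwargs may differ between versions, and the model name is only an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("KV cache quantization lets you", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=32,
    cache_implementation="quantized",                # use a quantized KV cache
    cache_config={"backend": "quanto", "nbits": 4},  # 4-bit quanto backend
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```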