Comments (21)
How about HQQ ?
- No calibration needed
- Supports 8,4,3,2,1 bits
- Very fast quantization, instead of waiting for hours with GPTQ/AWQ
- The quality is on-par or better than GPTQ/AWQ, especially at low bits
- Bit-unpacking and dequantization CUDA kernels available for all bit widths
- Supports backprop for QLoRA training
- Works with FSDP for distributed QLoRA training
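To make the calibration-free, group-wise setup concrete, here is a rough sketch in plain PyTorch. This is only min/max affine quantization per group; the actual HQQ solver further refines the zero-point/scale with a half-quadratic objective, which is omitted here.

```python
import torch

def quantize_4bit_groupwise(W: torch.Tensor, group_size: int = 64):
    # Calibration-free, group-wise affine quantization (min/max based).
    # HQQ itself goes further and optimizes the zero-point/scale with a
    # half-quadratic solver; that refinement is omitted in this sketch.
    Wg = W.reshape(-1, group_size)
    w_min = Wg.min(dim=1, keepdim=True).values
    w_max = Wg.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4-bit -> 16 levels
    zero = (-w_min / scale).round()
    W_q = (Wg / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return W_q, scale, zero

def dequantize_4bit_groupwise(W_q, scale, zero, shape):
    return ((W_q.float() - zero) * scale).reshape(shape)

W = torch.randn(4096, 4096)                     # numel must be divisible by group_size
W_q, scale, zero = quantize_4bit_groupwise(W)
W_hat = dequantize_4bit_groupwise(W_q, scale, zero, W.shape)
```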
It would be nice to ensure that FSDP works out-of-the-box with quantized models -- we've explained the steps needed to make this work in these articles (and have provided a demonstration script, along with the needed modifications to HQQ and bitsandbytes):
- https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
- https://www.answer.ai/posts/2024-03-14-fsdp-qlora-deep-dive.html
I think one important thing to consider is having a standardized PyTorch way of bitpacking/unpacking that all quantization methods can use. Currently, each quantization technique implements its own bitpacking logic, which makes the CUDA kernels incompatible across different (linear quant) methods.
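For illustration, a minimal sketch of what a shared packing/unpacking helper could look like (4-bit values into uint8, pure PyTorch); the actual layouts used by the various kernels will of course differ.

```python
import torch

def pack_4bit(x: torch.Tensor) -> torch.Tensor:
    # x: uint8 values in [0, 15], even number of elements.
    # Packs two 4-bit values per byte: [a, b] -> (a << 4) | b.
    x = x.reshape(-1, 2)
    return (x[:, 0] << 4) | x[:, 1]

def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    hi = (packed >> 4) & 0xF
    lo = packed & 0xF
    return torch.stack([hi, lo], dim=1).reshape(-1)

vals = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_4bit(pack_4bit(vals)), vals)
```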
> How about HQQ ?
> - No calibration needed
> - Supports 8,4,3,2,1 bits
> - Very fast quantization, instead of waiting for hours with GPTQ/AWQ
> - The quality is on-par or better than GPTQ/AWQ, especially at low bits
> - Bit-unpacking and dequantization CUDA kernels available for all bit widths
> - Supports backprop for QLoRA training
> - Works with FSDP for distributed QLoRA training
@mobicham HQQ looks great, and it would be nice to add a PyTorch implementation of it to torchao. Supporting 3, 2, 1 bits is pretty neat, and the QLoRA support is useful for us too (we have support for NF4 tensor).
We're currently in the process of adding GPTQ to enable running quantized models on GPU, CPU and executorch (on-device).
Would you be interested in contributing? We have a lightweight API recommendation for new techniques like this one: https://github.com/pytorch-labs/ao/blob/main/torchao/quantization/quant_api.py#L50. We can also add the kernels into torchao so users can take full advantage of the e2e inference speedups.
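Purely for illustration (not torchao's actual API surface; the real entry points are in the quant_api.py linked above), plugging in a new technique can be as simple as a module-swapping helper:

```python
import torch.nn as nn

def apply_my_quant(model: nn.Module, quantize_linear) -> nn.Module:
    # Recursively swap nn.Linear modules for whatever quantized replacement
    # `quantize_linear` builds (e.g. an HQQ-quantized linear).
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, quantize_linear(child))
        else:
            apply_my_quant(child, quantize_linear)
    return model
```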
I'd be interested in hearing more details on how these changes will be implemented, and how extensible things will be. Will quantization support be added to triton? And will torch.compile be able to convert the python API to triton?
Will there be some fairly low-level python API that we could pass all the needed quant state to, so that new algorithms could be largely implemented in pure python?
> Would you be interested in contributing? We have a lightweight API recommendation for new techniques like this one: https://github.com/pytorch-labs/ao/blob/main/torchao/quantization/quant_api.py#L50. We can also add the kernels into torchao so users can take full advantage of the e2e inference speedups.
Great, thanks, I will take a look at it @supriyar!
Hi @jph00, thanks for your interest in the RFC! Let me try to answer your questions here
> Will quantization support be added to triton?
Triton already supports 8-bit, and it can also be used for 4-bit. We gave this a shot last year but saw significantly worse perf; CUTLASS might be a better option here (torch.compile should be able to pick the right kernels for us).
> And will torch.compile be able to convert the python API to triton?
Yes, here is an example that @cpuhrsch just added: https://github.com/pytorch-labs/ao/pull/60/files. The part where we do the quantization and call torch.compile is here.
> Will there be some fairly low-level python API that we could pass all the needed quant state to, so that new algorithms could be largely implemented in pure python?
Yes, that is the goal with the proposed API: users can implement their quantize function in Python, and when they torch.compile it, Triton code should be generated automatically as long as we have the underlying dtype support in core (like int8, int4). We'll be working on releasing more examples of how to add custom dtypes, what the basic interface to implement is, and how to get it all to work with FSDP and torch.compile.
cc @cpuhrsch @msaroufim in case you'd like to add anything else to this.
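To make the pure-Python path described above concrete, here is a rough sketch of an int8 weight-only linear written in plain PyTorch and wrapped in torch.compile. The scale/layout choices are arbitrary and it assumes a CUDA device; it's only meant to show where the compiler takes over.

```python
import torch
import torch.nn.functional as F

def dequant_int8(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Plain-PyTorch per-output-channel int8 weight-only dequantization.
    return w_q.to(torch.float16) * scale

@torch.compile
def int8_weight_only_linear(x, w_q, scale, bias=None):
    # torch.compile traces the Python dequant above and can generate Triton
    # code for it, ideally fusing it with the matmul.
    return F.linear(x, dequant_int8(w_q, scale), bias)

# Hypothetical setup: per-output-channel symmetric int8 weights.
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_q = (w / scale).round().clamp(-128, 127).to(torch.int8)
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
y = int8_weight_only_linear(x, w_q, scale)
```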
> Triton already supports 8-bit, and it can also be used for 4-bit. We gave this a shot last year but saw significantly worse perf; CUTLASS might be a better option here (torch.compile should be able to pick the right kernels for us).
@supriyar Initializing an empty tensor first like this is faster than using torch.cat to bit-unpack tensors. torch.compile still struggles with int32 bitpacking, which is mainly used for 3-bit. We have a bar chart comparing the inference speed of a quantized Llama2-7B model using torch.compile for the dequantization step vs. using CUDA kernels, for reference: https://github.com/mobiusml/hqq/tree/master?tab=readme-ov-file#backend
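For illustration, a rough sketch of the two unpacking strategies being compared (4-bit from uint8 here; the actual HQQ kernels, orderings and bit widths differ):

```python
import torch

def unpack_4bit_cat(packed: torch.Tensor) -> torch.Tensor:
    # torch.cat-based unpacking: builds two temporaries, then concatenates.
    return torch.cat([(packed >> 4) & 0xF, packed & 0xF], dim=0)

def unpack_4bit_prealloc(packed: torch.Tensor) -> torch.Tensor:
    # Preallocate the output once with torch.empty and write both halves into it.
    n = packed.numel()
    out = torch.empty(2 * n, dtype=packed.dtype, device=packed.device)
    out[:n] = (packed >> 4) & 0xF
    out[n:] = packed & 0xF
    return out

packed = torch.randint(0, 256, (1 << 20,), dtype=torch.uint8)
assert torch.equal(unpack_4bit_cat(packed), unpack_4bit_prealloc(packed))
```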
Hi @mklasby (we met last year at NeurIPS). Eventually, once things get proven out in torchao, they would get upstreamed to torch.ao, so the goal is for this to be a standalone repo with higher development velocity.
@mobicham are you on CUDA MODE? If not, is there an email you could share? Mine is [email protected]. We're quite excited to see an HQQ contribution, so we wanted to see where your head is at and how we could collaborate.
Thanks @msaroufim! Looking forward to contributing to the effort!
One more thing to add to the list might be data layouts. For example, for unstructured weight sparsity, storing activations as [batch, ..., features] is much less efficient than [..., batch] on GPUs, because the latter allows for coalesced access patterns. E.g., the Sputnik paper spares this only a single comment:
> To enable coalesced memory accesses into all input and output matrices, we store dense matrices in row-major layout and sparse matrices in CSR format
But this actually means that you might need to add many transpose operations if the sparse matmuls are interspersed with standard, non-pointwise operations.
Similarly, I think mixing, e.g., convolution and attention layers currently would be quite problematic, because one uses [batch, feature, seq] and the other [batch, seq, feature] storage order. And of course, there is the old NCHW vs. NHWC channels-first/channels-last issue.
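A small sketch of the layout mismatch described above, using a hypothetical conv + attention stack: every boundary between the two costs a transpose plus a contiguous copy.

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(64, 64, kernel_size=3, padding=1)                         # wants [batch, feature, seq]
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)  # wants [batch, seq, feature]

x = torch.randn(32, 64, 128)          # [batch, feature, seq]
y = conv(x)
y = y.transpose(1, 2).contiguous()    # -> [batch, seq, feature] for attention
y, _ = attn(y, y, y)
y = y.transpose(1, 2).contiguous()    # -> back to [batch, feature, seq] for the next conv
y = conv(y)
```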
@cpuhrsch Yes, that is essentially what I am envisioning. The functional pruners would be essentially a sophisticated topk function to score parameters based on the specific pruning algorithm and return the updated mask. Any state / buffers required to score the params can be passed from caller to the pruners.
I note that jaxpruner and the cerebras pruning library wrap or subclass the optimizer, respectively, for dynamic sparse training. This is a potential route to consider as well for the modules that track state, if we feel that having an additional sparsifier object is less than ideal.
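A rough sketch of what such a functional pruner could look like (magnitude scoring as a stand-in for an algorithm-specific score; the name and signature are made up for illustration):

```python
import torch
from typing import Optional

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float,
                         score: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Score parameters (by |w| here; callers can pass any algorithm-specific
    # score, e.g. one derived from optimizer state) and keep the top-k.
    score = weight.abs() if score is None else score
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = torch.topk(score.flatten(), k, largest=True).values.min()
    return (score >= threshold).to(weight.dtype)

w = torch.randn(256, 256)
mask = magnitude_prune_mask(w, sparsity=0.9)
w_pruned = w * mask   # the caller applies or registers the mask however it likes
```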
@msaroufim sure, would love to do that! I will send you an email for a follow-up!
What is the relationship between this project and torch.ao? Is torchao a separate project / development repo?
Happy to have found torchao in any case, lots of goodies...
Is there any interest in achieving better alignment of ao.pruning with the torch.nn.utils.prune functionality?
For example, torch.nn.utils.prune reparameterizes modules with <param_name>_orig and sets the original param name to the pruned tensor. In contrast, ao.pruning reparameterizes modules with parametrizations.<param_name>.original.
My assumption is that the sparsifier is intended for dynamic mask updates, which is somewhat hacky to perform on top of the torch.nn.utils.prune functions currently. However, I think aligning these modules, where possible, will lead to a more compelling and intuitive user experience.
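For reference, a small example contrasting the two reparameterization conventions (the Mask parametrization below is just a stand-in for what ao.pruning registers, not its actual class):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torch.nn.utils import parametrize

# torch.nn.utils.prune: the dense weight moves to `weight_orig` (plus a
# `weight_mask` buffer), and `weight` becomes the masked tensor.
lin = nn.Linear(8, 8)
prune.l1_unstructured(lin, name="weight", amount=0.5)
print(hasattr(lin, "weight_orig"), hasattr(lin, "weight_mask"))  # True True

# Parametrization-based reparameterization: the dense weight lives at
# `parametrizations.weight.original` instead.
class Mask(nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.register_buffer("mask", mask)
    def forward(self, w):
        return w * self.mask

lin2 = nn.Linear(8, 8)
parametrize.register_parametrization(lin2, "weight", Mask(torch.ones(8, 8)))
print(lin2.parametrizations.weight.original.shape)  # torch.Size([8, 8])
```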
@mklasby - I think a set of functional pruners (similar to torch.nn.functional) would be fairly universal. Then we can build modules that track the state needed for incremental pruning and such. Do you think that'd fit the requirements?
Hi, your work is fantastic. Do you have plans to support static quantization? That is to say, not computing the amax of activations when running inference, but instead using calibration to pre-compute the quantization scale to reduce the dynamic-scaling overhead?
And do you have plans to support more ops like conv2d?
Thanks!
Hi @zhexinli - Yes, we want to create a design that can separate calibration from quantization and that should include this as well. We can also add support for conv2d. We have limited support for 1-by-1 convolutions by swapping them for linears.
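As a sketch of what separating calibration from inference could look like (fake-quant only, purely illustrative, not torchao's planned API):

```python
import torch
import torch.nn as nn

class StaticQuantLinear(nn.Module):
    """Collect the activation amax during calibration, then freeze the scale
    so inference does no per-batch amax reduction (fake-quant for clarity)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.register_buffer("act_amax", torch.zeros(()))
        self.calibrating = True

    def forward(self, x):
        if self.calibrating:
            self.act_amax.copy_(torch.maximum(self.act_amax, x.abs().amax()))
            return self.linear(x)
        scale = self.act_amax / 127.0
        x_q = (x / scale).round().clamp(-128, 127)
        return self.linear(x_q * scale)  # a real backend would run an int8 matmul here

m = StaticQuantLinear(nn.Linear(16, 16))
for _ in range(8):
    m(torch.randn(4, 16))       # calibration passes
m.calibrating = False
y = m(torch.randn(4, 16))       # inference with a pre-computed, static scale
```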
> Hi, your work is fantastic. Do you have plans to support static quantization? That is to say, not computing the amax of activations when running inference, but instead using calibration to pre-compute the quantization scale to reduce the dynamic-scaling overhead? And do you have plans to support more ops like conv2d? Thanks!
Hi @zhexinli are you looking for static quant of specific ops like conv/linear or general graph based quantization support? And on what backends?
In addition to what @cpuhrsch said, we have a PT2 export based quantization flow in PyTorch that's based on full-graph capture that you can use to run models on x86 CPU and edge runtimes (https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_x86_inductor.html)
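Roughly following the linked tutorial, a sketch of the PT2E flow (exact entry points and module paths may differ slightly across PyTorch releases):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(4, 16),)

# Full-graph capture, observer insertion, calibration, then static conversion.
exported = capture_pre_autograd_graph(model, example_inputs)
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)                 # calibration pass(es)
quantized = convert_pt2e(prepared)
optimized = torch.compile(quantized)      # lower through inductor
```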
> Hi, your work is fantastic. Do you have plans to support static quantization? That is to say, not computing the amax of activations when running inference, but instead using calibration to pre-compute the quantization scale to reduce the dynamic-scaling overhead? And do you have plans to support more ops like conv2d? Thanks!
> Hi @zhexinli are you looking for static quant of specific ops like conv/linear or general graph based quantization support? And on what backends?
> In addition to what @cpuhrsch said, we have a PT2 export based quantization flow in PyTorch that's based on full-graph capture that you can use to run models on x86 CPU and edge runtimes (https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_x86_inductor.html)
Hi, thanks for the information! Currently I'm looking for a CUDA backend for static quantization.