Comments (21)
How about HQQ ?
- No calibration needed
- Supports 8,4,3,2,1 bits
- Very fast quantization, instead of waiting for hours with GPTQ/AWQ
- The quality is on-par or better than GPTQ/AWQ, especially at low bits
- Bit-unpacking and dequantization CUDA kernels available for all bit widths
- Supports backprop for QLoRA training
- Works with FSDP for distributed QLoRA training
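To make the calibration-free, group-wise setup concrete, here is a rough sketch in plain PyTorch. This is only min/max affine quantization per group; the actual HQQ solver further refines the zero-point/scale with a half-quadratic objective, which is omitted here.

```python
import torch

def quantize_4bit_groupwise(W: torch.Tensor, group_size: int = 64):
    # Calibration-free, group-wise affine quantization (min/max based).
    # HQQ itself goes further and optimizes the zero-point/scale with a
    # half-quadratic solver; that refinement is omitted in this sketch.
    Wg = W.reshape(-1, group_size)
    w_min = Wg.min(dim=1, keepdim=True).values
    w_max = Wg.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4-bit -> 16 levels
    zero = (-w_min / scale).round()
    W_q = (Wg / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return W_q, scale, zero

def dequantize_4bit_groupwise(W_q, scale, zero, shape):
    return ((W_q.float() - zero) * scale).reshape(shape)

W = torch.randn(4096, 4096)                     # numel must be divisible by group_size
W_q, scale, zero = quantize_4bit_groupwise(W)
W_hat = dequantize_4bit_groupwise(W_q, scale, zero, W.shape)
```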
It would be nice to ensure that FSDP works out-of-the-box with quantized models -- we've explained the steps needed to make this work in these articles (and have provided a demonstration script, along with the needed modifications to HQQ and bitsandbytes):
- https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
- https://www.answer.ai/posts/2024-03-14-fsdp-qlora-deep-dive.html
I think one important thing to consider is having a standardized PyTorch way of bitpacking/unpacking that all quantization methods can use. Currently, each quantization technique implements its own bitpacking logic, which makes the CUDA kernels incompatible across different (linear quant) methods.
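For illustration, a minimal sketch of what a shared packing/unpacking helper could look like (4-bit values into uint8, pure PyTorch); the actual layouts used by the various kernels will of course differ.

```python
import torch

def pack_4bit(x: torch.Tensor) -> torch.Tensor:
    # x: uint8 values in [0, 15], even number of elements.
    # Packs two 4-bit values per byte: [a, b] -> (a << 4) | b.
    x = x.reshape(-1, 2)
    return (x[:, 0] << 4) | x[:, 1]

def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    hi = (packed >> 4) & 0xF
    lo = packed & 0xF
    return torch.stack([hi, lo], dim=1).reshape(-1)

vals = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_4bit(pack_4bit(vals)), vals)
```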
> How about HQQ ?
> - No calibration needed
> - Supports 8,4,3,2,1 bits
> - Very fast quantization, instead of waiting for hours with GPTQ/AWQ
> - The quality is on-par or better than GPTQ/AWQ, especially at low bits
> - Bit-unpacking and dequantization CUDA kernels available for all bit widths
> - Supports backprop for QLoRA training
> - Works with FSDP for distributed QLoRA training
@mobicham HQQ looks great, and it would be nice to add a PyTorch implementation of it to torchao. Supporting 3, 2, 1 bits is pretty neat, and the QLoRA support is useful for us too (we have support for NF4 tensor).
We're currently in the process of adding GPTQ to enable running quantized models on GPU, CPU and executorch (on-device).
Would you be interested in contributing? We have a lightweight API recommendation for new techniques like this one: https://github.com/pytorch-labs/ao/blob/main/torchao/quantization/quant_api.py#L50. We can also add the kernels into torchao so users can take full advantage of the e2e inference speedups.
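Purely for illustration (not torchao's actual API surface; the real entry points are in the quant_api.py linked above), plugging in a new technique can be as simple as a module-swapping helper:

```python
import torch.nn as nn

def apply_my_quant(model: nn.Module, quantize_linear) -> nn.Module:
    # Recursively swap nn.Linear modules for whatever quantized replacement
    # `quantize_linear` builds (e.g. an HQQ-quantized linear).
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, quantize_linear(child))
        else:
            apply_my_quant(child, quantize_linear)
    return model
```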
I'd be interested in hearing more details on how these changes will be implemented, and how extensible things will be. Will quantization support be added to triton? And will torch.compile be able to convert the python API to triton?
Will there be some fairly low-level python API that we could pass all the needed quant state to, so that new algorithms could be largely implemented in pure python?
> Would you be interested in contributing? We have a lightweight API recommendation for new techniques like this one: https://github.com/pytorch-labs/ao/blob/main/torchao/quantization/quant_api.py#L50. We can also add the kernels into torchao so users can take full advantage of the e2e inference speedups.
Great, thanks, I will take a look at it @supriyar!
Hi @jph00, thanks for your interest in the RFC! Let me try to answer your questions here
> Will quantization support be added to triton?
Triton already supports 8-bit, and it can also be used for 4-bit. We gave this a shot last year but saw significantly worse perf; CUTLASS might be a better option here (torch.compile should be able to pick the right kernels for us).
> And will torch.compile be able to convert the python API to triton?
Yes, here is an example that @cpuhrsch just added: https://github.com/pytorch-labs/ao/pull/60/files. The part where we do the quantization and call torch.compile is here.
> Will there be some fairly low-level python API that we could pass all the needed quant state to, so that new algorithms could be largely implemented in pure python?
Yes, that is the goal with the proposed API: users can implement their quantize function in Python, and when they torch.compile it, Triton code should be generated automatically as long as we have the underlying dtype support in core (like int8, int4). We'll be working on releasing more examples of how to add custom dtypes, what the basic interface to implement is, and how to get it all to work with FSDP and torch.compile.
cc @cpuhrsch @msaroufim in case you'd like to add anything else to this.
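To make the pure-Python path described above concrete, here is a rough sketch of an int8 weight-only linear written in plain PyTorch and wrapped in torch.compile. The scale/layout choices are arbitrary and it assumes a CUDA device; it's only meant to show where the compiler takes over.

```python
import torch
import torch.nn.functional as F

def dequant_int8(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Plain-PyTorch per-output-channel int8 weight-only dequantization.
    return w_q.to(torch.float16) * scale

@torch.compile
def int8_weight_only_linear(x, w_q, scale, bias=None):
    # torch.compile traces the Python dequant above and can generate Triton
    # code for it, ideally fusing it with the matmul.
    return F.linear(x, dequant_int8(w_q, scale), bias)

# Hypothetical setup: per-output-channel symmetric int8 weights.
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_q = (w / scale).round().clamp(-128, 127).to(torch.int8)
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
y = int8_weight_only_linear(x, w_q, scale)
```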
> Triton already supports 8-bit, and it can also be used for 4-bit. We gave this a shot last year but saw significantly worse perf; CUTLASS might be a better option here (torch.compile should be able to pick the right kernels for us).
@supriyar Initializing an empty tensor first like this is faster than using torch.cat to bit-unpack tensors. torch.compile still struggles with int32 bitpacking, which is mainly used for 3-bit. We have a bar chart comparing the inference speed of a quantized Llama2-7B model using torch.compile for the dequantization step vs. using CUDA kernels, for reference: https://github.com/mobiusml/hqq/tree/master?tab=readme-ov-file#backend
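For illustration, a rough sketch of the two unpacking strategies being compared (4-bit from uint8 here; the actual HQQ kernels, orderings and bit widths differ):

```python
import torch

def unpack_4bit_cat(packed: torch.Tensor) -> torch.Tensor:
    # torch.cat-based unpacking: builds two temporaries, then concatenates.
    return torch.cat([(packed >> 4) & 0xF, packed & 0xF], dim=0)

def unpack_4bit_prealloc(packed: torch.Tensor) -> torch.Tensor:
    # Preallocate the output once with torch.empty and write both halves into it.
    n = packed.numel()
    out = torch.empty(2 * n, dtype=packed.dtype, device=packed.device)
    out[:n] = (packed >> 4) & 0xF
    out[n:] = packed & 0xF
    return out

packed = torch.randint(0, 256, (1 << 20,), dtype=torch.uint8)
assert torch.equal(unpack_4bit_cat(packed), unpack_4bit_prealloc(packed))
```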
Hi @mklasby (we met last year at NeurIPS). Eventually, once things get proven out in torchao, they would get upstreamed to torch.ao, so the goal is for this to be a standalone repo with higher development velocity.
@mobicham are you on CUDA MODE? If not, is there an email you could share? Mine is [email protected]. We're quite excited to see an HQQ contribution, so we wanted to see where your head is at and how we could collaborate.
Thanks @msaroufim! Looking forward to contributing to the effort!
One more thing to add to the list might be data layouts. For example, for unstructured weight sparsity, storing activations as [batch, ..., features] is much less efficient than [..., batch] on GPUs, because the latter allows for coalesced access patterns. E.g., the Sputnik paper spares this only a single comment:
> To enable coalesced memory accesses into all input and output matrices, we store dense matrices in row-major layout and sparse matrices in CSR format
But this actually means that you might need to add many transpose operations if the sparse matmuls are interspersed with standard, non-pointwise operations.
Similarly, I think mixing, e.g., convolution and attention layers currently would be quite problematic, because one uses [batch, feature, seq] and the other [batch, seq, feature] storage order. And of course, there is the old NCHW vs. NHWC channels-first/channels-last issue.
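A small sketch of the layout mismatch described above, using a hypothetical conv + attention stack: every boundary between the two costs a transpose plus a contiguous copy.

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(64, 64, kernel_size=3, padding=1)                         # wants [batch, feature, seq]
attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)  # wants [batch, seq, feature]

x = torch.randn(32, 64, 128)          # [batch, feature, seq]
y = conv(x)
y = y.transpose(1, 2).contiguous()    # -> [batch, seq, feature] for attention
y, _ = attn(y, y, y)
y = y.transpose(1, 2).contiguous()    # -> back to [batch, feature, seq] for the next conv
y = conv(y)
```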
@cpuhrsch Yes, that is essentially what I am envisioning. The functional pruners would be essentially a sophisticated topk function to score parameters based on the specific pruning algorithm and return the updated mask. Any state / buffers required to score the params can be passed from caller to the pruners.
I note that jaxpruner and the cerebras pruning library wrap or subclass the optimizer, respectively, for dynamic sparse training. This is a potential route to consider as well for the modules that track state, if we feel that having an additional sparsifier object is less than ideal.
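A rough sketch of what such a functional pruner could look like (magnitude scoring as a stand-in for an algorithm-specific score; the name and signature are made up for illustration):

```python
import torch
from typing import Optional

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float,
                         score: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Score parameters (by |w| here; callers can pass any algorithm-specific
    # score, e.g. one derived from optimizer state) and keep the top-k.
    score = weight.abs() if score is None else score
    k = max(1, int(weight.numel() * (1.0 - sparsity)))
    threshold = torch.topk(score.flatten(), k, largest=True).values.min()
    return (score >= threshold).to(weight.dtype)

w = torch.randn(256, 256)
mask = magnitude_prune_mask(w, sparsity=0.9)
w_pruned = w * mask   # the caller applies or registers the mask however it likes
```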
@msaroufim sure, would love to do that! I will send you an email for a follow-up!
What is the relationship between this project and torch.ao? Is torchao a separate project / development repo?
Happy to have found torchao in any case, lots of goodies...
Is there any interest in achieving better alignment of ao.pruning with the torch.nn.utils.prune functionality?
For example, torch.nn.utils.prune reparameterizes modules with <param_name>_orig and sets the original param name to the pruned tensor. In contrast, ao.pruning reparameterizes modules with parametrizations.<param_name>.original.
My assumption is that the sparsifier is intended for dynamic mask updates, which is somewhat hacky to perform on top of the torch.nn.utils.prune functions currently. However, I think aligning these modules, where possible, will lead to a more compelling and intuitive user experience.
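For reference, a small example contrasting the two reparameterization conventions (the Mask parametrization below is just a stand-in for what ao.pruning registers, not its actual class):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune
from torch.nn.utils import parametrize

# torch.nn.utils.prune: the dense weight moves to `weight_orig` (plus a
# `weight_mask` buffer), and `weight` becomes the masked tensor.
lin = nn.Linear(8, 8)
prune.l1_unstructured(lin, name="weight", amount=0.5)
print(hasattr(lin, "weight_orig"), hasattr(lin, "weight_mask"))  # True True

# Parametrization-based reparameterization: the dense weight lives at
# `parametrizations.weight.original` instead.
class Mask(nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.register_buffer("mask", mask)
    def forward(self, w):
        return w * self.mask

lin2 = nn.Linear(8, 8)
parametrize.register_parametrization(lin2, "weight", Mask(torch.ones(8, 8)))
print(lin2.parametrizations.weight.original.shape)  # torch.Size([8, 8])
```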
@mklasby - I think a set of functional pruners (similar to torch.nn.functional) would be fairly universal. Then we can build modules that track the state needed for incremental pruning and such. Do you think that'd fit the requirements?
Hi, your work is fantastic. Do you have plans to support static quantization? That is to say, not computing the amax of activations when running inference, but instead using calibration to pre-compute the quantization scale to reduce the dynamic-scaling overhead?
And do you have plans to support more ops like conv2d?
Thanks!
Hi @zhexinli - Yes, we want to create a design that can separate calibration from quantization and that should include this as well. We can also add support for conv2d. We have limited support for 1-by-1 convolutions by swapping them for linears.
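As a sketch of what separating calibration from inference could look like (fake-quant only, purely illustrative, not torchao's planned API):

```python
import torch
import torch.nn as nn

class StaticQuantLinear(nn.Module):
    """Collect the activation amax during calibration, then freeze the scale
    so inference does no per-batch amax reduction (fake-quant for clarity)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.register_buffer("act_amax", torch.zeros(()))
        self.calibrating = True

    def forward(self, x):
        if self.calibrating:
            self.act_amax.copy_(torch.maximum(self.act_amax, x.abs().amax()))
            return self.linear(x)
        scale = self.act_amax / 127.0
        x_q = (x / scale).round().clamp(-128, 127)
        return self.linear(x_q * scale)  # a real backend would run an int8 matmul here

m = StaticQuantLinear(nn.Linear(16, 16))
for _ in range(8):
    m(torch.randn(4, 16))       # calibration passes
m.calibrating = False
y = m(torch.randn(4, 16))       # inference with a pre-computed, static scale
```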
> Hi, your work is fantastic. Do you have plans to support static quantization? That is to say, not computing the amax of activations when running inference, but instead using calibration to pre-compute the quantization scale to reduce the dynamic-scaling overhead? And do you have plans to support more ops like conv2d? Thanks!
Hi @zhexinli are you looking for static quant of specific ops like conv/linear or general graph based quantization support? And on what backends?
In addition to what @cpuhrsch said, we have a PT2 export based quantization flow in PyTorch that's based on full-graph capture that you can use to run models on x86 CPU and edge runtimes (https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_x86_inductor.html)
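Roughly following the linked tutorial, a sketch of the PT2E flow (exact entry points and module paths may differ slightly across PyTorch releases):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(4, 16),)

# Full-graph capture, observer insertion, calibration, then static conversion.
exported = capture_pre_autograd_graph(model, example_inputs)
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)                 # calibration pass(es)
quantized = convert_pt2e(prepared)
optimized = torch.compile(quantized)      # lower through inductor
```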
> Hi, your work is fantastic. Do you have plans to support static quantization? That is to say, not computing the amax of activations when running inference, but instead using calibration to pre-compute the quantization scale to reduce the dynamic-scaling overhead? And do you have plans to support more ops like conv2d? Thanks!
> Hi @zhexinli are you looking for static quant of specific ops like conv/linear or general graph based quantization support? And on what backends?
> In addition to what @cpuhrsch said, we have a PT2 export based quantization flow in PyTorch that's based on full-graph capture that you can use to run models on x86 CPU and edge runtimes (https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_x86_inductor.html)
Hi, thanks for the information! Currently I'm looking for a CUDA backend for static quantization.