pytorch-labs / ao
Custom data types and layouts for training and inference
License: BSD 3-Clause "New" or "Revised" License
Traceback (most recent call last):
File "C:\code\foo\scripts\quantize.py", line 4, in <module>
from torchao.quantization.smoothquant import (
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\__init__.py", line 7, in <module>
from .smoothquant import * # noqa: F403
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\smoothquant.py", line 17, in <module>
import torchao.quantization.quant_api as quant_api
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\quant_api.py", line 18, in <module>
from .subclass import (
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\subclass.py", line 13, in <module>
from torch.utils._python_dispatch import return_and_correct_aliasing
ImportError: cannot import name 'return_and_correct_aliasing' from 'torch.utils._python_dispatch' (C:\code\py-envs\foo\lib\site-packages\torch\utils\_python_dispatch.py)
This project seems to rely on torch nightly, which exports return_and_correct_aliasing. It might be worthwhile to document this. I suppose one could argue it's obvious enough from this being an experimental repo, but it was surprising to me.
Nice work team, I'm looking forward to using this package.
https://github.com/pytorch/torchchat/actions/runs/9166937828/job/25203278945?pr=842
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
******** ET: a8w4dq INT4 group-wise quantized *******
Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
PyTorch version 2.4.0.dev20240507+cpu available.
Using device=cpu
Loading model...
Time to load model: 0.02 seconds
Quantizing the model with: {'linear:a8w4dq': {'groupsize': 32}}
Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]
Downloading builder script: 100%|██████████| 5.67k/5.67k [00:00<00:00, 27.6MB/s]
Traceback (most recent call last):
Time to quantize model: 3.01 seconds
File "/home/runner/work/torchchat/torchchat/export.py", line 119, in
main(args)
File "/home/runner/work/torchchat/torchchat/export.py", line 70, in main
model = _initialize_model(
File "/home/runner/work/torchchat/torchchat/build/builder.py", line 433, in initialize_model
quantize_model(model, builder_args.device, quantize, tokenizer)
File "/home/runner/work/torchchat/torchchat/quantize.py", line 58, in quantize_model
).quantized_model()
File "/home/runner/work/torchchat/torchchat/quantize.py", line 679, in quantized_model
return self.quantize(self.model)
File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/runner/work/torchchat/torchchat/quantize.py", line 629, in quantize
from torchao.quantization.quant_primitives import (
File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torchao/init.py", line 14, in
from . import _C
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
Error: Process completed with exit code 1.
Running
py pytorch/benchmarks/sparse/benchmark_semi_structured_sparsity.py --mode nvidia-fixed-k --dtype bf16 --backend cutlass
(from #174)
results in
RuntimeError: _sparse_semi_structured_linear: CUTLASS not supported
@jcaip believes it's an issue with Windows, and the best workaround would be to dual-boot Linux (which I'll try today!)
Full output:
PS C:\Users\phili\dev>
Started benchmark: nvidia-fixed-k | dtype: bf16
0%| | 0/18 [00:00<?, ?it/s]C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\semi_structured.py:111: UserWarning: The PyTorch API of SparseSemiStructuredTensor is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.sparse module for further information about the project.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\phili\dev\pytorch\benchmarks\sparse\benchmark_semi_structured_sparsity.py", line 247, in <module>
df = pd.DataFrame.from_records(results)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\frame.py", line 2450, in from_records
first_row = next(data)
^^^^^^^^^^
File "C:\Users\phili\dev\pytorch\benchmarks\sparse\benchmark_semi_structured_sparsity.py", line 220, in <genexpr>
eval_fn(mn, 10240, mn, dtype, args.contiguous, args.backend)
File "C:\Users\phili\dev\pytorch\benchmarks\sparse\benchmark_semi_structured_sparsity.py", line 123, in test_tensor
sparse_output = torch.mm(sA, B)
^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\semi_structured.py", line 199, in __torch_dispatch__
return cls.SPARSE_DISPATCH[func._overloadpacket](func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\_semi_structured_ops.py", line 115, in semi_sparse_mm
res = A._mm(B_padded)
^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\semi_structured.py", line 439, in _mm
res = torch._sparse_semi_structured_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: _sparse_semi_structured_linear: CUTLASS not supported
0%| | 0/18 [00:07<?, ?it/s]
Focus - benchmarking, documentation, tutorials, prototype to beta
Due date: June 13 2024
implements decorator for affine quantization dtype
If you compare peak memory usage for gpt-fast and torchao's quantization APIs, the torchao APIs have extremely high peak memory usage (though perf is the same).
With the new benchmark code we can see https://github.com/pytorch/ao/blob/main/torchao/_models/llama/benchmark_results.txt
Lines 5, 7 and 6 correspond to no, int8wo, and int4wo quant on llama2-7b, which show peak memory of 13.88, 14.50 and 15.92 GB respectively (perf: 105.02, 147.03, 199.81 tok/s).
meanwhile for gpt-fast if i run
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 64
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g64.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --compile
we see for no, int8wo and int4wo quant on llama2-7b peak mem: 13.88, 7.74, 4.48 GB (perf: 105.09, 150.58, 204 tok/s)
When I tried out scripts/hf_eval.py, I faced some problems (with a 16GB VRAM GPU):
Could we add a --quantize_on_cpu flag (or something similar)? Quantizing on CPU is slower than CUDA, so we probably don't want quantizing on CPU to be the default for AutoModelForCausalLM.from_pretrained().
hf_eval.py could expose more of the HFLM options, for example batch_size and max_length; a larger batch_size can help speed up inference for some tasks, like hellaswag.
Rename --task_list to --tasks to align with the lm_eval CLI; the --device flag is currently not used (but I don't think anyone will do LLM eval on CPU anyway?) (Line 5 in 42c2376)
https://arxiv.org/abs/2401.14112
I think you guys are really going to like this.
The DeepSpeed developers introduce an FP6 datatype on cards without FP8 support, while maintaining full tensor core support using a kernel they created called tc-fpX. Tests were done on an A100, and they achieved 1.69x-2.65x inference performance! I assume this can be transferred over to training (with the possible exception of the KV cache and embedding module). This is really exciting; it will breathe new life into the rapidly aging A100 given the introduction of the H100's FP8.
It was merged into deepspeed in this commit:
microsoft/DeepSpeed@ccfdb84
Getting this pushed into PyTorch as a dtype would be a major win. These are the benefits FP6 provides:
These kernels shouldn't be limited to the A100; they theoretically could work on any card with uint8_t and fp16 support. That said, these kernels were only written for the A100, so without modification they might only work on Ampere cards.
The tc-FPx kernel essentially takes 4 FP16 values, quantizes them to FP6 with some placeholders, and then pushes them into an array of 3x uint8_t.
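As a rough, hedged illustration of that packing idea only (not the actual tc-FPx bit layout; pack_fp6_codes is a made-up helper), four 6-bit codes fit exactly into three bytes:
import torch

def pack_fp6_codes(codes: torch.Tensor) -> torch.Tensor:
    # codes: uint8 tensor holding 6-bit values (< 64), last dim divisible by 4
    c = codes.reshape(*codes.shape[:-1], -1, 4).to(torch.int32)
    # concatenate four 6-bit fields into one 24-bit word
    word = (c[..., 0] << 18) | (c[..., 1] << 12) | (c[..., 2] << 6) | c[..., 3]
    # split the 24-bit word into three bytes
    b0 = (word >> 16) & 0xFF
    b1 = (word >> 8) & 0xFF
    b2 = word & 0xFF
    return torch.stack([b0, b1, b2], dim=-1).to(torch.uint8).flatten(-2)

codes = torch.randint(0, 64, (8,), dtype=torch.uint8)
packed = pack_fp6_codes(codes)  # eight 6-bit codes -> six bytes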
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel
Last year, we released pytorch-labs/torchao to provide acceleration of Generative AI models using native PyTorch techniques. Torchao added support for running quantization on GPUs, including int8 dynamic quantization (W8A8) and weight-only quantization (int8 and int4) that were composable with torch.compile. Combined, the APIs launched in torchao were able to power SOTA generative AI models across multiple modalities: Segment Anything, Stable Diffusion, and LLaMa.
The results were showcased in these blog posts -
https://pytorch.org/blog/accelerating-generative-ai/,
https://pytorch.org/blog/accelerating-generative-ai-2/,
https://pytorch.org/blog/accelerating-generative-ai-3/
Our investment in torchao is to accelerate Generative AI, using native PyTorch features, ensuring composability with torch.compile.
In 2024, we plan to adopt the following strategy for development of torchao
Let’s dive deeper into some of the coverage areas mentioned above.
Dtypes like NF4, MX4, and groupwise quantized int4 are used for implementing various optimization techniques in models. Last year, we posted a plan on how we wish to support these dtypes in PyTorch. In torchao, we will host tensor subclass based implementations of dtypes; existing examples include uint4 and NF4, which users can use for their own quantization techniques, or they can override the implementation to support other dtypes that might be useful.
Moreover, users don’t need to write triton or cuda kernels for their custom dtypes. The implementation can be in python and torch.compile will take care of generating performant kernels under the hood.
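As a hedged illustration of that point (the packed layout and helper below are made up, not a torchao API), a sub-byte dequant written with plain tensor ops can be handed to torch.compile, which fuses the unpack and rescale into a single kernel:
import torch

@torch.compile
def dequant_uint4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # two 4-bit values live in each uint8; unpack then rescale
    lo = (packed & 0xF).to(torch.float16)
    hi = (packed >> 4).to(torch.float16)
    return torch.stack([lo, hi], dim=-1).flatten(-2) * scale

packed = torch.randint(0, 255, (1024, 512), dtype=torch.uint8, device="cuda")
scale = torch.rand(1024, 1, device="cuda", dtype=torch.float16)
out = dequant_uint4(packed, scale)  # [1024, 1024] fp16, one fused kernel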
Quantization can be done on only weights or weights+activations. Typically LLM quantization techniques for BS 1 (memory BW bound) use weight-only quantization techniques. But for larger batch sizes, or longer context length cases or for general throughput bound models quantizing the activations is also beneficial. Quantization, however, impacts the model accuracy and researchers have published techniques to mitigate this accuracy impact which currently exist externally as one repository per technique.
In torchao, we will plan to support the following class of techniques using PyTorch, made available via a simple UX and following the one-file-per-technique principle.
LLM weight only quantization techniques
Post training quantization
The two most popular techniques externally are GPTQ and AWQ, available via AutoGPTQ and AutoAWQ, which include the technique as well as the performant kernels for faster quantization ops.
To that end, we will start by re-implementing the GPTQ and AWQ techniques into torchao using PyTorch via a simple/intuitive UX that supports saving/loading of quantized models, while realizing the memory savings on disk. Some open questions we need to address here include -
How much VRAM will be required for different quantization techniques
How do we convert to-from weights quantized for different backends (cpu and gpu today use different weight packing format)
In the future, as more interesting and cutting edge techniques are introduced, researchers can directly implement them in torchao or our team can re-implement them in PyTorch.
Weight and activation quantization techniques
Post training quantization
We’ve already implemented W8A8 quantization via the int_mm kernel in core. This has shown speedup on models like SAM, SDXL without any impact to model accuracy and can be turned on via a simple one-line UX implemented via module swap or tensor subclass.
However the challenge here is that some smaller layer shapes might not benefit from quantization due to the overhead in quantizing and dequantizing the activation tensors. Users can either statically ignore quantizing these layers or have a higher level API that figures out which layers are sensitive to quantization. We plan to provide a higher level API via the auto quantizer that applies this technique to the layers that stand to benefit the most to provide the benefits of quantization without having to worry too much about the configs to use.
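For intuition, here is a hedged sketch of what W8A8 dynamic quantization computes on top of the int8 matmul in core (torch._int_mm); the per-row/per-channel scaling and layout handling are simplified, and a real implementation takes care of _int_mm's shape and layout constraints more carefully:
import torch

def w8a8_linear(x, w_int8_t, w_scale):
    # dynamically quantize the activation per row (symmetric int8)
    x_scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-6)
    x_int8 = torch.clamp(torch.round(x / x_scale), -127, 127).to(torch.int8)
    # int8 @ int8 -> int32 accumulation, then rescale back to the original dtype
    acc = torch._int_mm(x_int8, w_int8_t)       # [M, N] int32
    return acc.to(x.dtype) * x_scale * w_scale  # broadcast the two scales

M, K, N = 32, 64, 128
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)
w_scale = w.abs().amax(dim=1) / 127.0  # per output channel
w_int8 = torch.clamp(torch.round(w / w_scale[:, None]), -127, 127).to(torch.int8)
y = w8a8_linear(x, w_int8.t().contiguous(), w_scale)  # y: [M, N] bf16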
Quantization aware training
Techniques here require access to fine-tuning, to tune the model to reduce accuracy impact of quantization. Recently, research like LLM-QAT is promising, showing that we can go down to W4A8 and 4-bit KV cache for LLMs. Moreover, newer lower bit techniques like AQLM, Quip# also include a component of fine-tuning to improve the model accuracy.
We will include the APIs and workflow to enable users to do QAT on LLMs, starting with implementing the LLM-QAT paper in torchao and further extending it to support other dtypes like MX4.
Kernels
Optimized kernels are key to making models run faster during inference. Today, in core we already have performant kernels like int_mm
and 4-bit weight quantization kernels for cpu (via intel) and gpu (via tinygemm). torchao will host performant kernels that will work with different backends, with a guide on how to plug these kernels into PyTorch models via the custom ops API. These kernels will compose with torch.compile, with the expectation that the user writes a meta kernel implementation for them. For executorch, the expectation is that if the user provides a kernel that works with executorch then it should also work in eager mode.
We will also directly engage with the community, to upstream their performant kernels into torchao.
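A hedged sketch of the custom ops mechanism mentioned above (assuming a recent PyTorch that provides torch.library.custom_op; the op name and Python body are made up and stand in for a real CUDA kernel):
import torch

@torch.library.custom_op("myao::scaled_add", mutates_args=())
def scaled_add(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    # placeholder for a real C++/CUDA kernel registered the same way
    return a + scale * b

# the fake ("meta") implementation only describes output metadata, which is
# what torch.compile needs in order to trace through the op
@scaled_add.register_fake
def _(a, b, scale):
    return torch.empty_like(a)

fn = torch.compile(lambda a, b: scaled_add(a, b, 2.0))
print(fn(torch.randn(4), torch.randn(4)))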
Autotuner
In order to use any CUDA kernel efficiently, we'll need to pick the right kernel hyperparameters. The same is true for an eager mode kernel. A kernel autotuner will help here. We expect that the auto quantizer, along with the kernel autotuner, will make int8 dynamic quantization and int8/int4 weight-only quantization more usable and performant. A WIP example of what this might look like can be found here.
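A toy sketch of the idea (benchmark each candidate config for a given input shape and cache the winner); this is illustrative only, not the WIP torchao autotuner:
import torch

_BEST_CONFIG = {}  # (kernel name, input shapes) -> best config

def autotune(kernel, configs, *args):
    key = (kernel.__name__, tuple(t.shape for t in args if torch.is_tensor(t)))
    if key not in _BEST_CONFIG:
        best, best_ms = None, float("inf")
        for cfg in configs:
            kernel(*args, **cfg)  # warmup / compile
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            kernel(*args, **cfg)
            end.record()
            torch.cuda.synchronize()
            ms = start.elapsed_time(end)
            if ms < best_ms:
                best, best_ms = cfg, ms
        _BEST_CONFIG[key] = best
    return kernel(*args, **_BEST_CONFIG[key])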
Release engineering
Shipping optimized, custom kernels requires extensibility mechanisms and release channels. We have custom operator support that integrates broadly, but our release mechanism might need to be optimized. It can be quite difficult to ship custom binaries across a broad range of operating systems and accelerators.
We can add a conversion util from popular model storage formats like gguf into PyTorch’s state_dict format. This will enable users to take a pre-existing quantized model from llama.cpp and have it run via PyTorch eager mode for desktop cpu/gpu and executorch for on-device cases. We’ll share more details here soon.
In addition to quantization, we’ve seen promising results with sparsity as well on GPUs. We will share more updates on what torchao will host for the space of sparsity/pruning in the near future.
We'd love to hear any feedback or questions from the OSS community on this RFC. Thank you!
cc @msaroufim @cpuhrsch @jerryzh168 @HDCharles @andrewor14 @jcaip @jisaacso
After some talks with Saroufim and the cuda mode team working on BitNet, we've outlined a strategy for implementing the BitNet 1.58 method in torch. This issue lays the groundwork for 2-bit ternary tensor quantization and the BitNet linear work for BitNet 1.58.
I've set up a staging repo (Staging) with a number of items:
This covers the initial groundwork for getting working ternary networks into torch.
When using the new SDPA, why doesn't it need to be trained now that new parameters have been added?
To reproduce
conda create -n test_ao python=3.10
conda activate test_ao
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install git+https://github.com/pytorch/ao -v
Outputs (the packages are already cached in this case)
Using pip 24.0 from /home/ubuntu/miniconda3/envs/test_ao/lib/python3.10/site-packages/pip (python 3.10)
Collecting git+https://github.com/pytorch/ao
Cloning https://github.com/pytorch/ao to /tmp/pip-req-build-bcuh0mqg
Running command git version
git version 2.34.1
Running command git clone --filter=blob:none https://github.com/pytorch/ao /tmp/pip-req-build-bcuh0mqg
Cloning into '/tmp/pip-req-build-bcuh0mqg'...
Running command git rev-parse HEAD
b91b6be24afd1220331790ff0866f5b091165cd5
Resolved https://github.com/pytorch/ao to commit b91b6be24afd1220331790ff0866f5b091165cd5
Running command git rev-parse HEAD
b91b6be24afd1220331790ff0866f5b091165cd5
Running command pip subprocess to install build dependencies
Collecting setuptools
Using cached setuptools-69.5.1-py3-none-any.whl.metadata (6.2 kB)
Collecting wheel
Using cached wheel-0.43.0-py3-none-any.whl.metadata (2.2 kB)
Collecting ninja
Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting torch
Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting filelock (from torch)
Using cached filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
Collecting typing-extensions>=4.8.0 (from torch)
Using cached typing_extensions-4.11.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch)
Using cached sympy-1.12-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
Using cached networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Collecting jinja2 (from torch)
Using cached jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
Collecting fsspec (from torch)
Using cached fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch)
Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch)
Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-nccl-cu12==2.20.5 (from torch)
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch)
Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)
Collecting triton==2.3.0 (from torch)
Using cached triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch)
Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
Using cached MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting mpmath>=0.19 (from sympy->torch)
Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Using cached setuptools-69.5.1-py3-none-any.whl (894 kB)
Using cached wheel-0.43.0-py3-none-any.whl (65 kB)
Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Using cached triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
Using cached typing_extensions-4.11.0-py3-none-any.whl (34 kB)
Using cached filelock-3.14.0-py3-none-any.whl (12 kB)
Using cached fsspec-2024.3.1-py3-none-any.whl (171 kB)
Using cached jinja2-3.1.4-py3-none-any.whl (133 kB)
Using cached networkx-3.3-py3-none-any.whl (1.7 MB)
Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Using cached MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
Installing collected packages: ninja, mpmath, wheel, typing-extensions, sympy, setuptools, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
Successfully installed MarkupSafe-2.1.5 filelock-3.14.0 fsspec-2024.3.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 ninja-1.11.1.1 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105 setuptools-69.5.1 sympy-1.12 torch-2.3.0 triton-2.3.0 typing-extensions-4.11.0 wheel-0.43.0
Installing build dependencies ... done
On my machine, it takes 30s just to install the cached packages. Note that this happens every time I re-install torchao. During development, it's quite annoying to have this extra 30s every time I need to re-compile CUDA/C++ code (pip install -e . only works for Python code).
If this is the first time installing torchao from source, there will be extra time downloading the packages (which are huge).
The culprit seems to be torch being in build-system.requires in pyproject.toml. Perhaps this is a limitation of pip not being able to recognize torch from conda? During the build process, it's also not clear whether it is using the existing torch (from conda) or pip-installed torch (which may cause issues if the two versions mismatch; I'm using the latest version, so issues may not arise).
2:4 sparsity is only supported on Ampere+. We've only run benchmarks with A100s, but Phil (@philipbutler) has access to consumer GPUs that could also take advantage of sparse acceleration.
import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
python benchmarks/sparse/benchmark_semi_structured_sparsity.py --mode nvidia-fixed-k --dtype bfloat16 --backend cutlass
python benchmarks/sparse/benchmark_semi_structured_sparsity.py --mode nvidia-fixed-mn --dtype bfloat16 --backend cutlass
Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
Referring this https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats/custom_cast.py
Although it was originally meant for MX dtypes only (FP4 E2M1, FP6 E2M3, FP6 E3M2), expanding its functionality to support any custom FPx dtype would be useful for developing and experimenting with custom FPx kernels.
Case in point, FP6-LLM upstream added support for FP5 E2M2 (https://github.com/usyd-fsalab/fp6_llm). This is what I need to write to support FP32->FP5 E2M2.
# define constants for F32 <-> F5_E2M2
# (assumes `import struct` and the F32 constants F32_EXP_BIAS / MBITS_F32 defined in custom_cast.py)
F5_E2M2_MAX = 7.0 # (2 ** (0b11 - 0b01)) * (1 + 0.5 + 0.25)
F5_E2M2_MIN_NORMAL = 1.0 # (2 ** (0b01 - 0b01))
EBITS_F5_E2M2 = 2
MBITS_F5_E2M2 = 2
F5_E2M2_EXP_BIAS = 0b01
F5_E2M2_MAX_INT = (1 << 4) - 1
SIGN_MASK_F5_E2M2 = 1 << 4
MAGIC_ADDER_F5_E2M2 = (1 << (MBITS_F32 - EBITS_F5_E2M2)) - 1
DENORM_F32TOF5_E2M2_EXP = (
# exp bias conversion between formats
(F32_EXP_BIAS - F5_E2M2_EXP_BIAS)
# mantissa length difference between formats
+ (MBITS_F32 - MBITS_F5_E2M2)
# add one to encoded exponent for denormalized numbers
+ 1
)
DENORM_F32TOF5_E2M2_MASK_INT = DENORM_F32TOF5_E2M2_EXP << MBITS_F32
# reinterpret int32 as float32 in Python
# see https://stackoverflow.com/a/34446112/1058521
DENORM_F32TOF5_E2M2_MASK_FLOAT = struct.unpack("!f", struct.pack("!I", DENORM_F32TOF5_E2M2_MASK_INT))[0]
def f32_to_f5_e2m2_unpacked(x: Tensor):
return _f32_to_f4_or_f6_unpacked(
x,
F5_E2M2_MAX,
F5_E2M2_MIN_NORMAL,
DENORM_F32TOF5_E2M2_MASK_FLOAT,
DENORM_F32TOF5_E2M2_MASK_INT,
EBITS_F5_E2M2,
MBITS_F5_E2M2,
F5_E2M2_EXP_BIAS,
MAGIC_ADDER_F5_E2M2,
F5_E2M2_MAX_INT,
SIGN_MASK_F5_E2M2,
)
Ideally, we shouldn't need to calculate all the constants ourselves; we should only have to provide the number of E and M bits, and these constants should be calculated within the function (or cached somewhere, though I think re-calculating them shouldn't take much time).
The other direction (FPx->FP32) is a bit trickier when handling denormal FPx, but should still be possible to make it more generic.
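For illustration, a hedged sketch of deriving these constants from just the exponent/mantissa bit counts (fpx_constants is a made-up helper; the magic adder mirrors the expression written above):
import struct

MBITS_F32, F32_EXP_BIAS = 23, 127

def fpx_constants(ebits: int, mbits: int) -> dict:
    exp_bias = (1 << (ebits - 1)) - 1
    max_val = 2 ** ((1 << ebits) - 1 - exp_bias) * (2 - 2 ** (-mbits))
    min_normal = 2.0 ** (1 - exp_bias)
    max_int = (1 << (ebits + mbits)) - 1
    sign_mask = 1 << (ebits + mbits)
    magic_adder = (1 << (MBITS_F32 - ebits)) - 1  # same expression as above
    # denormal mask, following the commented derivation above
    denorm_exp = (F32_EXP_BIAS - exp_bias) + (MBITS_F32 - mbits) + 1
    denorm_mask_int = denorm_exp << MBITS_F32
    denorm_mask_float = struct.unpack("!f", struct.pack("!I", denorm_mask_int))[0]
    return dict(max=max_val, min_normal=min_normal, exp_bias=exp_bias,
                max_int=max_int, sign_mask=sign_mask, magic_adder=magic_adder,
                denorm_mask_int=denorm_mask_int, denorm_mask_float=denorm_mask_float)

print(fpx_constants(2, 2))  # reproduces the hand-written F5_E2M2 values (max=7.0, min_normal=1.0, ...)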
Proposed changes
- Rename _f32_to_f4_or_f6_unpacked() and _f4_or_f6_unpacked_to_f32() to _f32_to_fpx_unpacked(x, n_ebits, n_mbits) and _fpx_unpacked_to_f32(x, n_ebits, n_mbits) (packed format is out of scope and should be handled separately for each case)
- Move the generic cast functions out of custom_cast.py to an upper level, e.g. prototype/fp_cast_utils.py (format-specific pieces, e.g. functions for packed fp4 and the custom triton kernels, should stay in custom_cast.py)
Tagging @vkuzo and @msaroufim for discussion and opinion.
The dequantize_affine function modifies the input in place, resulting in different outputs when a model is run multiple times with the same inputs.
Should we consider cloning the input, scale, and zero_point at the beginning of the function to prevent this issue?
# demo.py
# modified from https://github.com/pytorch/ao/blob/cbc74ee6a3dc0bae367db5b03bc58896fffe3ae0/test/quantization/test_quant_api.py#L449-L467
import unittest
import torch
from torchao.dtypes import to_aq, AffineQuantizedTensor
from torchao.quantization.quant_primitives import MappingType, ZeroPointDomain
from torchao.quantization.quant_api import quantize
from torchao.quantization.utils import TORCH_VERSION_AFTER_2_4
import copy
class ToyLinearModel(torch.nn.Module):
    def __init__(self, m=64, n=32, k=64):
        super().__init__()
        self.linear1 = torch.nn.Linear(m, n, bias=False).to(torch.float)
        self.linear2 = torch.nn.Linear(n, k, bias=False).to(torch.float)

    def example_inputs(self, batch_size=1):
        return (torch.randn(batch_size, self.linear1.in_features).to(torch.float),)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

@unittest.skipIf(not TORCH_VERSION_AFTER_2_4, "Test only enabled for 2.4+")
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
def test_quantized_tensor_subclass_int4():
    # weight settings
    groupsize = 32
    mapping_type = MappingType.ASYMMETRIC
    block_size = (1, groupsize)
    target_dtype = torch.int32
    quant_min = 0
    quant_max = 15
    eps = 1e-6
    zero_point_dtype = torch.bfloat16
    # !! For using `AffineQuantizedTensor.dequantize`
    preserve_zero = True
    zero_point_domain = ZeroPointDomain.INT

    # use 1024 so that we don't need padding
    m = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda")
    m_copy = copy.deepcopy(m)
    example_inputs = tuple(map(lambda x: x.to(torch.bfloat16).to("cuda"), m.example_inputs()))

    def apply_weight_quant(weight):
        return to_aq(
            weight,
            mapping_type,
            block_size,
            target_dtype,
            quant_min,
            quant_max,
            eps,
            zero_point_dtype=zero_point_dtype,
            preserve_zero=preserve_zero,
            zero_point_domain=zero_point_domain,
        )

    m = quantize(m, apply_weight_quant)
    assert isinstance(m.linear1.weight, AffineQuantizedTensor)
    assert isinstance(m.linear2.weight, AffineQuantizedTensor)

    res = m(*example_inputs)
    res2 = m(*example_inputs)
    assert torch.equal(res, res2), f"The results of the model are not consistent"
# pytest -sv demo.py
CI config and unit test template: #190
import torch
import torchvision.models.vision_transformer as models
# Load Vision Transformer model
model = models.vit_b_16(pretrained=True)
import torchao
model.eval().cuda().to(torch.bfloat16)
from torchao.quantization import apply_dynamic_quant
apply_dynamic_quant(model)
from torch._inductor import config as inductorconfig
inductorconfig.force_fuse_int_mm_with_mul = True
model = torch.compile(model, mode='max-autotune')
input_tensor = torch.randn(1, 3, 224, 224, dtype=torch.bfloat16, device='cuda')
model(input_tensor)
causes crash
[...]
self.out_proj.weight,
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1704, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
torch._dynamo.exc.TorchRuntimeError: Failed running call_module L__self___encoder_layers_encoder_layer_0_self_attention(*(FakeTensor(..., device='cuda:0', size=(1, 197, 768), dtype=torch.bfloat16,
grad_fn=<NativeLayerNormBackward0>), FakeTensor(..., device='cuda:0', size=(1, 197, 768), dtype=torch.bfloat16,
grad_fn=<NativeLayerNormBackward0>), FakeTensor(..., device='cuda:0', size=(1, 197, 768), dtype=torch.bfloat16,
grad_fn=<NativeLayerNormBackward0>)), **{'need_weights': False}):
'DynamicallyPerAxisQuantizedLinear' object has no attribute 'weight'
from user code:
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torchvision/models/vision_transformer.py", line 298, in forward
x = self.encoder(x)
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torchvision/models/vision_transformer.py", line 157, in forward
return self.ln(self.layers(self.dropout(input)))
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torchvision/models/vision_transformer.py", line 113, in forward
x, _ = self.self_attention(x, x, x, need_weights=False)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
The code is out; it's quite simple and short.
Opening this so I can track how to add this to ao and make sure it works well with torch.compile(). This will likely need Blackwell to perform decently.
https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
In order to support sub-byte dtypes for quantization, I (and many others) believe it is better to pack these smaller dtypes into existing PyTorch dtypes, trading a bit of extra computation for reduced memory bandwidth contention. Here is a preliminary algorithm in PyTorch for doing this; it supports many types of conversions, as seen in the tests.
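For orientation, here is a hedged sketch of the pack/unpack idea for 2-bit values in uint8 (hypothetical helpers written with plain tensor ops, not the algorithm from the linked PR); the unpack mirrors the shift-and-mask pattern visible in the generated kernel below:
import torch

def pack_2bit(x: torch.Tensor) -> torch.Tensor:
    # pack groups of four 2-bit values (values in [0, 3], stored as uint8) into one uint8
    x = x.reshape(*x.shape[:-1], x.shape[-1] // 4, 4)
    return (x[..., 0] << 6) | (x[..., 1] << 4) | (x[..., 2] << 2) | x[..., 3]

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    # inverse of pack_2bit: expand each uint8 back into four 2-bit values
    shifts = torch.tensor([6, 4, 2, 0], dtype=torch.uint8, device=packed.device)
    out = (packed.unsqueeze(-1) >> shifts) & 0x3
    return out.reshape(*packed.shape[:-1], packed.shape[-1] * 4)

x = torch.randint(0, 4, (2, 8), dtype=torch.uint8)
assert torch.equal(unpack_2bit(pack_2bit(x)), x)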
Inspecting the compiled Triton code seems promising because it only launches one kernel and allocates one buffer. Here is a snippet:
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 4
    x1 = (xindex // 4)
    x2 = xindex
    tmp0 = x0
    tmp1 = tl.full([1], 0, tl.int64)
    tmp2 = tmp0 >= tmp1
    tmp3 = tl.full([1], 1, tl.int64)
    tmp4 = tmp0 < tmp3
    tmp5 = tl.load(in_ptr0 + (x1), tmp4 & xmask, eviction_policy='evict_last', other=0.0)
    tmp6 = tl.full([1], 6, tl.uint8)
    tmp7 = tmp5 >> tmp6
    tmp8 = tl.full([1], 3, tl.uint8)
    tmp9 = tmp7 & tmp8
    tmp10 = tl.full(tmp9.shape, 0.0, tmp9.dtype)
    tmp11 = tl.where(tmp4, tmp9, tmp10)
    tmp12 = tmp0 >= tmp3
    tmp13 = tl.full([1], 2, tl.int64)
    tmp14 = tmp0 < tmp13
    tmp15 = tmp12 & tmp14
    tmp16 = tl.load(in_ptr0 + (x1), tmp15 & xmask, eviction_policy='evict_last', other=0.0)
    tmp17 = tl.full([1], 4, tl.uint8)
    tmp18 = tmp16 >> tmp17
    tmp19 = tmp18 & tmp8
    tmp20 = tl.full(tmp19.shape, 0.0, tmp19.dtype)
    tmp21 = tl.where(tmp15, tmp19, tmp20)
    tmp22 = tmp0 >= tmp13
    tmp23 = tl.full([1], 3, tl.int64)
    tmp24 = tmp0 < tmp23
    tmp25 = tmp22 & tmp24
    tmp26 = tl.load(in_ptr0 + (x1), tmp25 & xmask, eviction_policy='evict_last', other=0.0)
    tmp27 = tl.full([1], 2, tl.uint8)
    tmp28 = tmp26 >> tmp27
    tmp29 = tmp28 & tmp8
    tmp30 = tl.full(tmp29.shape, 0.0, tmp29.dtype)
    tmp31 = tl.where(tmp25, tmp29, tmp30)
    tmp32 = tmp0 >= tmp23
    tmp33 = tl.full([1], 4, tl.int64)
    tmp34 = tmp0 < tmp33
    tmp35 = tl.load(in_ptr0 + (x1), tmp32 & xmask, eviction_policy='evict_last', other=0.0)
    tmp36 = tl.full([1], 0, tl.uint8)
    tmp37 = tmp35 >> tmp36
    tmp38 = tmp37 & tmp8
    tmp39 = tl.full(tmp38.shape, 0.0, tmp38.dtype)
    tmp40 = tl.where(tmp32, tmp38, tmp39)
    tmp41 = tl.where(tmp25, tmp31, tmp40)
    tmp42 = tl.where(tmp15, tmp21, tmp41)
    tmp43 = tl.where(tmp4, tmp11, tmp42)
    tl.store(out_ptr0 + (x2), tmp43, xmask)
''', device_str='cuda')

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    s2 = arg2_1
    assert_size_stride(arg3_1, (s0, s1, s2), (s1*s2, s2, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s0, s1, s2, 4), (4*s1*s2, 4*s2, 4, 1), torch.uint8)
        # Source Nodes: [stack], Original ATen: [aten.stack]
        triton_poi_fused_stack_0_xnumel = 4*s0*s1*s2
        stream0 = get_raw_stream(0)
        triton_poi_fused_stack_0.run(arg3_1, buf0, triton_poi_fused_stack_0_xnumel, grid=grid(triton_poi_fused_stack_0_xnumel), stream=stream0)
        del arg3_1
    return (reinterpret_tensor(buf0, (s0, s1, 4*s2), (4*s1*s2, 4*s2, 1), 0), )
We're putting together a loose RFC for our general plans that should be out shortly.
However, we know that we want to work with researchers / OSS to land advanced pruning algorithms into torchao. These pruning algorithms should extend the Sparsifier class found in torch.ao.pruning.
We're also interested in potentially adding additional fast sparse kernels to torchao. This requires some additional discussion, in particular whether we want to land these kernels as-is for eager mode support, or try to generate them with triton.
To reproduce, install torchao from main, then import torchao
pip install git+https://github.com/pytorch/ao
python -c "import torchao"
Error
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import torchao
File ~/code/ao/torchao/__init__.py:8
6 from . import dtypes
7 import torch
----> 8 from torch.testing._internal.common_utils import IS_FBCODE
9 if not IS_FBCODE:
10 from . import _C
File ~/miniconda3/envs/dev2.3/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py:62
47 from typing import (
48 Any,
49 Callable,
(...)
58 Union,
59 )
60 from unittest.mock import MagicMock
---> 62 import expecttest
63 import numpy as np
65 import __main__ # type: ignore[import]
ModuleNotFoundError: No module named 'expecttest'
The problem is self-explanatory. 2 possible solutions:
requirements.txt
Right now if I pip install torchao on my M1 laptop it'll install the 0.1 binaries and not the latest 0.2 release.
Similarly if I try to install the nightly binaries I get a hard error
(ao) pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cpu
Looking in indexes: https://download.pytorch.org/whl/nightly/cpu
ERROR: Could not find a version that satisfies the requirement torchao-nightly (from versions: none)
ERROR: No matching distribution found for torchao-nightly
(ao) pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/
Looking in indexes: https://download.pytorch.org/whl/nightly/
Requirement already satisfied: torchao in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (0.1)
Requirement already satisfied: torch in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (2.2.2)
Requirement already satisfied: numpy in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (1.26.4)
Requirement already satisfied: sentencepiece in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (0.2.0)
Requirement already satisfied: packaging in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (24.0)
Requirement already satisfied: filelock in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (3.13.4)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (4.11.0)
Requirement already satisfied: sympy in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (1.12)
Requirement already satisfied: networkx in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (3.3)
Requirement already satisfied: jinja2 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (3.1.3)
Requirement already satisfied: fsspec in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (2024.3.1)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from jinja2->torch->torchao) (2.1.5)
Requirement already satisfied: mpmath>=0.19 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from sympy->torch->torchao) (1.3.0)
(ao) pip list
Package Version
---------------------- -----------
attrs 23.2.0
bashlex 0.18
bitsandbytes 0.42.0
bracex 2.4
certifi 2024.2.2
charset-normalizer 3.3.2
cibuildwheel 2.17.0
contourpy 1.2.1
cycler 0.12.1
exceptiongroup 1.2.1
expecttest 0.2.1
filelock 3.13.4
fonttools 4.51.0
fsspec 2024.3.1
git-filter-repo 2.38.0
huggingface-hub 0.23.0
hypothesis 6.100.5
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.3
kiwisolver 1.4.5
lxml 5.2.1
MarkupSafe 2.1.5
matplotlib 3.8.4
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
packaging 24.0
pandas 2.2.2
parameterized 0.9.0
pillow 10.3.0
pip 23.3.1
platformdirs 4.2.0
pluggy 1.5.0
pyparsing 3.1.2
pytest 7.4.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
qtorch 0.3.0
regex 2024.4.28
requests 2.31.0
safetensors 0.4.3
scipy 1.13.0
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
sortedcontainers 2.4.0
sympy 1.12
tokenizers 0.19.1
tomli 2.0.1
torch 2.2.2
torchao 0.1
tqdm 4.66.4
transformers 4.40.2
typing_extensions 4.11.0
tzdata 2024.1
unittest-xml-reporting 3.2.0
urllib3 2.2.1
wheel 0.41.2
(ao)
I'm collecting a few issues I've seen. I have no clear picture of how to solve them as of this moment, but I'm aggregating them in the hopes that inspiration will strike.
The below issue is solved by installing ao and then cd-ing out of the ao directory. IIRC PyTorch has a similar problem. In a repro shared by @jerryzh168:
Traceback (most recent call last):
File "/home/jerryzh/ao/example.py", line 2, in <module>
from torchao.quantization.quant_primitives import MappingType, ZeroPointDomain
File "/home/jerryzh/ao/torchao/__init__.py", line 8, in <module>
from . import _C
ImportError: cannot import name '_C' from partially initialized module 'torchao' (most likely due to a circular import) (/home/jerryzh/ao/torchao/__init__.py)
Another issue here is that building the fp6 kernels is failing https://hastebin.com/share/riridivafa.rust even though the nvcc and gcc versions seem fine, in a repro shared by @CoffeeVampir3.
This error shows up when you either pip install ao or build it with a mismatch in CUDA versions, in a repro shared by @vayuda:
python test/quantization/test_quant_api.py
Traceback (most recent call last):
File "/u/pj8wfq/ao/test/quantization/test_quant_api.py", line 21, in <module>
from torchao.dtypes import (
File "/u/pj8wfq/ao/torchao/__init__.py", line 8, in <module>
from . import _C
ImportError: /u/pj8wfq/ao/torchao/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
PyPI binaries are crashing on non-CUDA devices:
File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torchao/__init__.py", line 14, in
from . import _C
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
We need graceful solutions, but in the meantime I'm embarrassed to say I've been recommending a nuclear option, which is to disable the C extensions.
Specifically in torchao/__init__.py
delete
if not _IS_FBCODE:
from . import _C
from . import ops
And in setup.py
delete
ext_modules=get_extensions(),
Does torch already support int8 matmul on CUDA? Also, what version of torch is needed to run torchao now?
This issue tracks outstanding issues for a torchao 0.1 release
New Functionality
Tutorials/BE
If time permits (or v0.2)
This issue tracks outstanding feature requests for torchao. If you'd like a specific feature to be added to torchao, please comment directly here.
Quantization Techniques (based on planned, new requests)
DTypes
Sparsity APIs
Kernels
cc @cpuhrsch
A use case: storing a full backtracking pointer matrix can be okay for Needleman/CTC alignment (4x memory saving compared to a uint8 representation) if a 2-bit data type is used. Currently it's possible to do this with bit-manipulation magic, but probably not very efficiently (store and load will require masking and shifting, not fused).
Another use case: compressed BoolTensor for binary neural networks.
Another use case: extremely low-bit quantized representations.
Is something like this already implemented for quantization? A simple version of this feature could be to provide some explicit utility functions, like calculating the size of the holder uint8 tensor, and fused store and load functions (potentially explicitly batched, e.g. the actual store is delayed until some aligned number of memory lines has arrived).
In NumPy the related functionality is np.packbits and np.unpackbits; however, these are designed to work only with a 1-bit contained type. 2-bit/4-bit would be cool as well.
On 1-bit side, another related project is RoaringBitmap https://github.com/RoaringBitmap/RoaringBitmap (http://roaringbitmap.org/) - for compressed bitsets for set operations.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @izdeby
@ebsmothers shared this interesting issue pytorch/torchtune#1020
And it got me thinking a bit more about how we validate numerical correctness.
On one hand we do rely on the compiler a lot, so we can do numerical checks by comparing eager vs compiled outputs.
But on the other hand, as people start to use ao more, they're likely to compare it to existing libraries with expert-written kernels, so in this case there are a few layers of validation we can do; taking the nf4 tensor from qlora as an example, we can
Opening this issue so we can discuss more; my 2c is that 1 and 2 are sufficient for CI and 3 can be a one-time thing.
Sparsity, like quantization, offers increased model performance at the expense of some model quality. However, it is not as widely used / researched as a technique, despite offering similar performance benefits. With the recent explosion in model sizes in GenAI, and with quantization pushing 1-bit limits, there has been renewed interest in sparsity, specifically for GPU backend sparsity patterns.
The parallel nature of GPU backends makes accelerating unstructured sparsity difficult. However, there exist specific sparsity patterns (block-wise, semi-structured) that are more amenable to acceleration on GPUs. Over the last year, we've integrated these fast sparse kernels into PyTorch Core, so that all users can see a speedup with just a few lines of code:
Our goal for torchao.sparsity is to drive research / adoption of these GPU sparsity patterns.
We feel that the main problem current researchers / users face is fragmentation. Researchers rightfully aim to show end-to-end results, but this means a lot of time is spent figuring out how to integrate with PyTorch and implementation questions like: When should I mask? When/how should I store the compressed representation? Do I want in-place or out-of-place mask updates? How can I call sparse matmul instead of dense?
We hope to change that by providing tutorials and APIs for both sparse kernels (tensor subclassing) and pruning algorithms (torch.ao.pruning.Sparsifier) that users can extend. We feel like the above problems can be solved once, by torchao, letting researchers focus on pushing sparse kernel performance or more accurate pruning algorithms.
We're also hoping to create a new extension point by releasing the workflows we have designed with xFormers that enable accelerated sparse training, not just sparse inference.
As such, we plan on launching torchao.sparsity with the following features in v0.2:
However, we'd like feedback from the community to set the longer-term vision for sparsity. Also feel free to chime in with any other thoughts you want to share!
We plan to host a set of OSS pruning algorithms in torchao. These pruning algorithms should extend the torch.ao.pruning.BaseSparsifier class, like WandaSparsifier. We welcome community contributions for pruning algorithms, provided they extend the BaseSparsifier.
We have often found pruning to be very model specific, with little generalization across domains. As such we hope to land sparse training recipes for specific models / datasets, showing how different pruning algorithms can be used. We are specifically interested in recipes that compose with quantization.
Additionally, we hope that these benchmark numbers can help first-time users of sparsity better understand the tradeoffs involved and encourage researchers to contribute SOTA pruning algorithms.
While much work has been done on sparsity for inference, sparsity for training has remained much more challenging. Thanks to the work done by xFormers, we’ve upstreamed fast runtime semi-structured sparsification kernels into PyTorch Core, which allow for prune -> compress -> sparse_mm to happen faster than dense matmul. We also aim to release an example of accelerated sparse training for the OSS community to extend.
There are additional sparsity patterns that may be supported on GPUs, which would require additional fast sparse kernels. We hope that torchao can be a staging ground for these kernels. We plan to upstream these kernels to Core as we see fit, depending on adoption.
Some initial options are:
Should we generate these kernels with torch.compile rather than hand-writing them?
cc @supriyar @cpuhrsch @msaroufim @pytorch-labs/team-superblock @danthe3rd @mklasby @ngc92 @hgyhungry
Currently we need to do:
from torchao.quantization.utils import unwrap_tensor_subclass
m_unwrapped = unwrap_tensor_subclass(m)
# export
m = torch.export.export(m_unwrapped, example_inputs).module()
# aot_compile
torch._export.aot_compile(m_unwrapped, example_inputs)
This is needed to make tensor subclasses work with export/aot_compile; it should be added to the default export path directly.
Following the recent success of the LLM-QAT paper, our high-level goal is to provide a PyTorch native workflow for LLM quantization-aware training (QAT) leveraging the quantization primitives and kernels provided by torchao, which is planned to become the de facto OSS library for AO techniques and kernels in PyTorch across different platforms (#47). We also hope to eventually integrate with TorchTune, a recently open-sourced library for fine-tuning and experimenting with LLMs, to provide an end-to-end flow that supports both finetuning and QAT.
Executorch provides a mechanism for quantizing Llama2 using post-training quantization (PTQ) techniques such as GPTQ, and lowering it to backends like XNNPACK. The main goal of this workstream is to provide a QAT drop-in replacement for GPTQ but with superior accuracy, starting with Llama2 7b using the following quantization/training configurations:
We plan to adopt the same eager mode quantization implementation used by the PTQ flow. In the future, if we decide to experiment with static quantization for activations for example, then we can explore using the PT2 Export QAT flow.
This workstream is largely backend agnostic; our goal is to motivate the backends (mobile or server CPU/GPU) to build the relevant kernels once we have demonstrated the initial success of a particular quantization configuration. There is a large design space we can experiment with summarized below. The suggested quantization and training techniques are primarily motivated by the LLM-QAT paper, but also by ongoing developments across the industry.
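For intuition, here is a minimal, hedged sketch of the weight fake-quantization building block that QAT relies on (straight-through estimator, int4-style symmetric range); this is illustrative only, not the planned torchao QAT implementation:
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    def forward(self, x):
        w = self.weight
        # per-channel symmetric scale targeting an int4-style range [-8, 7]
        scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-6)
        w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale  # quantize-dequantize
        w_fq = w + (w_q - w).detach()  # straight-through estimator for gradients
        return nn.functional.linear(x, w_fq, self.bias)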
We can start with the following dimensions:
This is an extension of the recent gpt-fast efforts to quantize Llama but for QAT. An important goal here is to reuse the same quantization primitives as Workstream 1 to unify the two flows as much as possible. We can start with the following quantization configurations:
Traceback (most recent call last):
File "C:/Program Files/JetBrains/PyCharm Community Edition 2023.2.1/plugins/python-ce/helpers/pydev/pydevd.py", line 1527, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2023.2.1\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\code\foobar\scripts\quantize.py", line 37, in <module>
swap_linear_with_smooth_fq_linear(model)
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 219, in swap_linear_with_smooth_fq_linear
swap_linear_with_smooth_fq_linear(child, skip_fqn_list, new_fqn, alpha)
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 219, in swap_linear_with_smooth_fq_linear
swap_linear_with_smooth_fq_linear(child, skip_fqn_list, new_fqn, alpha)
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 219, in swap_linear_with_smooth_fq_linear
swap_linear_with_smooth_fq_linear(child, skip_fqn_list, new_fqn, alpha)
[Previous line repeated 1 more time]
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 215, in swap_linear_with_smooth_fq_linear
target_cls = source_cls_to_target_cls[type(child)]
KeyError: <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>
python-BaseException
Expected: the NonDynamicallyQuantizableLinear layer is skipped (possibly with a warning), or properly handled.
Actual: exception.
It sounds like HDCharles was planning on fixing this more generally: pytorch/pytorch#58969
Hi, I've been playing around with QLoRA using the NF4Tensor class from this great library. But I noticed that the NF4 data type is using 8 bits of memory per parameter, where it should be using ~4.1 bits according to the paper. I verified this by initializing a single tensor:
>>> t4 = torchao.dtypes.nf4tensor.NF4Tensor.from_tensor(torch.rand([1024, 4096], dtype=torch.bfloat16), 64, 256)
>>> torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
(0, 0)
>>> t4 = t4.cuda()
>>> torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
(4194304, 20971520)
which is 4194304 / (1024 * 4096) * 8 = 8 bits per parameter
I was wondering if this a bug, or is there some intrinsic limitation here? Thanks
I have encountered a problem when using QLoRA via the LoRALinear class you have in torchtune. Apparently, when the quantize_base parameter is set to True, the NF4Tensor class is called, which transforms the tensor to 4 bits via the to_nf4 function.
This function casts the tensor to bf16 without specifying the device, giving the problem of tensors on two different devices, CPU and GPU. When the tensor is created, by default it is assigned to the CPU unless otherwise specified. I got this error trying to do QLoRA on my GPU.
I leave here the original function vs the one I modified to get out of it:
Original function according to torchao.dtypes.nf4tensor:
def to_nf4(tensor, block_size: int = 64, scaler_block_size: int = 256):
    tensor1 = tensor.to(torch.bfloat16)
    return NF4Tensor.from_tensor(tensor1, block_size, scaler_block_size)
Modified function for dealing with the problem:
def to_nf4(tensor, device, block_size: int = 64, scaler_block_size: int = 256):
    tensor1 = tensor.to(device=device, dtype=torch.bfloat16)
    return NF4Tensor.from_tensor(tensor1, block_size, scaler_block_size)
We plan to add QAT for LLMs to torchao (as mentioned in the original RFC here #47)
For this to run efficiently on the GPU we'd need kernel support for W4A8 quantization (int4 weights, int8 activations).
Other places where this has been raised before
NVIDIA/cutlass#1316,
NVIDIA/cutlass#1370
cc @andrewor14
Just listing out all the issues I'm seeing with our docs; feel free to pick something up and fix it. As a first step, just add your documentation directly in a relevant subfolder in the repo and tag me to review.
For API docstrings and end usage instructions that won't change a lot please put them here https://github.com/pytorch/ao/tree/main/docs so they get rendered on pytorch.org/docs
The repo is primarily about performance so we should share performance tables directly in the README until we figure out a dashboard like solution
For each sparsity or quantization technique you're working on feel free to add another subsection
torch.ao.pruning
accuracy benchmarks on llama2 or llama3
Confirm they're visible on pytorch.org
PR is here, please feel free to comment in PR directly: #159
Currently there are many q/dq functions in torchao and pytorch, they mainly differ in the following dimensions:
Ideally, I think we should unify them, it might complicate the operator pattern that’s used by backends like xnnpack, but the code sharing and simplification of the representation it brings will be beneficial in the long term.
We defined three functions: choose_qparams_affine_per_block, quantize_affine_per_block, and dequantize_affine_per_block; please check out the docstrings of these functions in the PR for the definitions.
get_block_size(input, {"quant_type": "per_channel_group", "group_size": 32, "axis": -1})
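For intuition, a hedged sketch of what per-group affine quantize/dequantize computes (the helper names and signatures below are illustrative, not the actual torchao API):
import torch

def quantize_affine_groupwise(x, group_size=32, quant_min=0, quant_max=15):
    # asymmetric per-group affine quantization; numel must be divisible by group_size
    xg = x.reshape(-1, group_size)
    min_val, max_val = xg.amin(dim=1, keepdim=True), xg.amax(dim=1, keepdim=True)
    scale = (max_val - min_val).clamp(min=1e-6) / (quant_max - quant_min)
    zero_point = quant_min - torch.round(min_val / scale)
    q = torch.clamp(torch.round(xg / scale) + zero_point, quant_min, quant_max)
    return q.to(torch.int32).reshape(x.shape), scale, zero_point

def dequantize_affine_groupwise(q, scale, zero_point, group_size=32):
    qg = q.reshape(-1, group_size).to(scale.dtype)
    return ((qg - zero_point) * scale).reshape(q.shape)

x = torch.randn(64, 128)
q, scale, zp = quantize_affine_groupwise(x)
x_hat = dequantize_affine_groupwise(q, scale, zp)  # approximates x to within one quantization step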
This is the error https://github.com/pytorch/ao/actions/runs/8977543410/job/24656486432?pr=216
Don't have time to debug tonight but cc @svekars who might have some ideas
Run cd docs
Running Sphinx v5.0.0
torchao_version_docs: refs/pull/216/merge
Version: main
making output directory... done
Using Sphinx-Gallery to convert rst text blocks to markdown for .ipynb files.
[autosummary] generating autosummary for: api_ref_dtypes.rst, api_ref_intro.rst, api_ref_kernel.rst, api_ref_quantization.rst, api_ref_sparsity.rst, dtypes.rst, getting-started.rst, index.rst, overview.rst, performant_kernels.rst, quantization.rst, sparsity.rst
Extension error (sphinx.ext.autosummary):
Handler <function process_generate_options at 0x7f0bd5e8cfe0> for event 'builder-inited' threw an exception (exception: no module named torchao.sparsity)
make: *** [Makefile:41: html] Error 2
Error: Process completed with exit code 2.
Are there any runnable demos of using Sparse-QAT/PTQ (2:4) to accelerate inference, such as applying PTQ to a 2:4 sparse LLaMA for inference acceleration? I am curious about the potential speedup ratio this could achieve.
The overall pipeline might be: compressing the Weight matrix using 2:4 sparsity and quantizing it to INT8 format through PTQ/QAT. The Activation matrix should also be quantized to INT8 format through PTQ/QAT. After such processing, the main type of computation would be INT8*INT8.
I would like to know if there is a tutorial document available, as I am a beginner in the field of quantization.
Thx!
construct NF4 tensors in chunks and check memory traces: #196
A16W4 axis=1
A16W4+ axis=1
A8W4 axis=1
A16W3 and A16W5
Mark said you should be familiar with this @vkuzo
https://github.com/pytorch/ao/actions/runs/9487632431/job/26150568432?pr=347
=================================== FAILURES ===================================
___________________ test_inference_compile_simple[fp4_e2m1] ____________________
elem_dtype = 'fp4_e2m1'
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.parametrize("elem_dtype", SUPPORTED_ELEM_DTYPES)
def test_inference_compile_simple(elem_dtype):
    """
    Smoke test for inference compile
    """
    if elem_dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        if not IS_CUDA_GE_89:
            pytest.skip("CUDA capability >= 8.9 required for float8 in triton")
    m = nn.Sequential(nn.Linear(4, 6, bias=False, dtype=torch.bfloat16))
    m = m.cuda()
    m_mx = copy.deepcopy(m)
    block_size = 2
    swap_linear_with_mx_inference_linear(m_mx, elem_dtype, block_size)
    m_mx = torch.compile(m_mx, fullgraph="true")
    x = torch.randn(2, 4, device="cuda", dtype=torch.bfloat16)
    y_ref = m(x)
    y_mx = m_mx(x)
    sqnr = compute_error(y_ref, y_mx)
    if elem_dtype is torch.float8_e4m3fn:
        assert sqnr >= 20.0
    else:
>       assert sqnr >= 14.0
E       AssertionError: assert tensor(13.1250, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MulBackward0>) >= 14.0
test/prototype/mx_formats/test_mx_linear.py:192: AssertionError
Hi folks, thanks for the great work.
With #135 merged, vLLM could see benefit from torch.compile backend given compiler-native integration with PagedAttention kernels.
Is there an easy way to see what the latest/nightly MBU is for torch compile on say, H100 / Llama3 70B?
Also interested in cold start compile time
cc @msaroufim
We'd like to make it really easy for people to add support for custom CUDA extensions in ao and there's a few pieces of work we need to do to get there
Follow up work in a separate issue
First of all, thank you for the great library! It makes quantization really easy.
Is it possible to run autoquant once and later apply the same quantization plan again? Or would I need to manually look at logs right now to see what autoquant came up with, so I can apply the same quantization later?
// I see there's AUTOQUANT_CACHE that gets used to save the timings; maybe just saving/loading that will do?
// Seems like ^ works!
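A minimal sketch of what saving and loading that cache could look like (assuming AUTOQUANT_CACHE is a plain dict living in torchao.quantization.autoquant; verify the module path for your version):
import pickle
import torchao.quantization.autoquant as autoquant

def save_autoquant_cache(path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(autoquant.AUTOQUANT_CACHE, f)

def load_autoquant_cache(path: str) -> None:
    with open(path, "rb") as f:
        autoquant.AUTOQUANT_CACHE.update(pickle.load(f))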