pytorch-labs / ao
Custom data types and layouts for training and inference
License: BSD 3-Clause "New" or "Revised" License
Traceback (most recent call last):
File "C:\code\foo\scripts\quantize.py", line 4, in <module>
from torchao.quantization.smoothquant import (
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\__init__.py", line 7, in <module>
from .smoothquant import * # noqa: F403
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\smoothquant.py", line 17, in <module>
import torchao.quantization.quant_api as quant_api
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\quant_api.py", line 18, in <module>
from .subclass import (
File "C:\code\py-envs\foo\lib\site-packages\torchao\quantization\subclass.py", line 13, in <module>
from torch.utils._python_dispatch import return_and_correct_aliasing
ImportError: cannot import name 'return_and_correct_aliasing' from 'torch.utils._python_dispatch' (C:\code\py-envs\foo\lib\site-packages\torch\utils\_python_dispatch.py)
This project seems to rely on torch nightly, which exports return_and_correct_aliasing. It might be worthwhile to document this. I suppose one could argue it's obvious enough from this being an experimental repo, but it was surprising to me.
Nice work team, I'm looking forward to using this package.
https://github.com/pytorch/torchchat/actions/runs/9166937828/job/25203278945?pr=842
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
******** ET: a8w4dq INT4 group-wise quantized *******
Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
PyTorch version 2.4.0.dev20240507+cpu available.
Using device=cpu
Loading model...
Time to load model: 0.02 seconds
Quantizing the model with: {'linear:a8w4dq': {'groupsize': 32}}
Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]
Downloading builder script: 100%|██████████| 5.67k/5.67k [00:00<00:00, 27.6MB/s]
Traceback (most recent call last):
Time to quantize model: 3.01 seconds
File "/home/runner/work/torchchat/torchchat/export.py", line 119, in
main(args)
File "/home/runner/work/torchchat/torchchat/export.py", line 70, in main
model = _initialize_model(
File "/home/runner/work/torchchat/torchchat/build/builder.py", line 433, in initialize_model
quantize_model(model, builder_args.device, quantize, tokenizer)
File "/home/runner/work/torchchat/torchchat/quantize.py", line 58, in quantize_model
).quantized_model()
File "/home/runner/work/torchchat/torchchat/quantize.py", line 679, in quantized_model
return self.quantize(self.model)
File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/runner/work/torchchat/torchchat/quantize.py", line 629, in quantize
from torchao.quantization.quant_primitives import (
File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torchao/init.py", line 14, in
from . import _C
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
Error: Process completed with exit code 1.
Running
py pytorch/benchmarks/sparse/benchmark_semi_structured_sparsity.py --mode nvidia-fixed-k --dtype bf16 --backend cutlass
(from #174)
results in
RuntimeError: _sparse_semi_structured_linear: CUTLASS not supported
@jcaip believes it's an issue with Windows, and the best workaround would be to dual-boot Linux (which I'll try today!)
Full output:
PS C:\Users\phili\dev>
Started benchmark: nvidia-fixed-k | dtype: bf16
0%| | 0/18 [00:00<?, ?it/s]C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\semi_structured.py:111: UserWarning: The PyTorch API of SparseSemiStructuredTensor is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.sparse module for further information about the project.
warnings.warn(
Traceback (most recent call last):
File "C:\Users\phili\dev\pytorch\benchmarks\sparse\benchmark_semi_structured_sparsity.py", line 247, in <module>
df = pd.DataFrame.from_records(results)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\pandas\core\frame.py", line 2450, in from_records
first_row = next(data)
^^^^^^^^^^
File "C:\Users\phili\dev\pytorch\benchmarks\sparse\benchmark_semi_structured_sparsity.py", line 220, in <genexpr>
eval_fn(mn, 10240, mn, dtype, args.contiguous, args.backend)
File "C:\Users\phili\dev\pytorch\benchmarks\sparse\benchmark_semi_structured_sparsity.py", line 123, in test_tensor
sparse_output = torch.mm(sA, B)
^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\semi_structured.py", line 199, in __torch_dispatch__
return cls.SPARSE_DISPATCH[func._overloadpacket](func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\_semi_structured_ops.py", line 115, in semi_sparse_mm
res = A._mm(B_padded)
^^^^^^^^^^^^^^^
File "C:\Users\phili\AppData\Local\Programs\Python\Python312\Lib\site-packages\torch\sparse\semi_structured.py", line 439, in _mm
res = torch._sparse_semi_structured_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: _sparse_semi_structured_linear: CUTLASS not supported
0%| | 0/18 [00:07<?, ?it/s]
Focus - benchmarking, documentation, tutorials, prototype to beta
Due date: June 13 2024
implements decorator for affine quantization dtype
If you compare peak memory usage for gpt-fast and torchao's quantization APIs, the torchao APIs have extremely high peak memory usage (though perf is the same).
With the new benchmark code we can see https://github.com/pytorch/ao/blob/main/torchao/_models/llama/benchmark_results.txt
Lines 5, 7 and 6 correspond to no, int8wo, and int4wo quant on llama2-7b, which show peak memory of 13.88, 14.50 and 15.92 GB respectively (perf: 105.02, 147.03, 199.81 tok/s).
meanwhile for gpt-fast if i run
export MODEL_REPO=meta-llama/Llama-2-7b-chat-hf
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 64
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int4.g64.pth --compile
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth --compile
we see for no, int8wo and int4wo quant on llama2-7b peak mem: 13.88, 7.74, 4.48 GB (perf: 105.09, 150.58, 204 tok/s)
When I tried out scripts/hf_eval.py, I faced some problems (with a 16GB VRAM GPU):
Could we add a --quantize_on_cpu flag (or something similar)? Quantizing on CPU is slower than CUDA, so we probably don't want quantizing on CPU to be the default for AutoModelForCausalLM.from_pretrained().
hf_eval.py could expose more of the HFLM options, for example batch_size and max_length; a larger batch_size can help speed up inference for some tasks, like hellaswag.
Rename --task_list to --tasks to align with the lm_eval CLI; the --device flag is currently not used (but I don't think anyone will do LLM eval on CPU anyway?) (Line 5 in 42c2376)
https://arxiv.org/abs/2401.14112
I think you guys are really going to like this.
The DeepSpeed developers introduce an FP6 datatype on cards without FP8 support, while maintaining full tensor core support using a kernel they created called tc-fpX. Tests were done on an A100, and they achieved 1.69x-2.65x inference performance! I assume this can be transferred over to training (with the possible exception of the KV cache and embedding module). This is really exciting; it will breathe new life into the rapidly aging A100 given the introduction of the H100's FP8.
It was merged into deepspeed in this commit:
microsoft/DeepSpeed@ccfdb84
Getting this pushed into PyTorch as a dtype would be a major win. These are the benefits FP6 provides:
These kernels shouldn't be limited to the A100; they theoretically could work on any card with uint8_t and fp16 support. That said, these kernels were only written for the A100, so without modification they might only work on Ampere cards.
The tc-FPx kernel essentially takes 4 FP16 values, quantizes them to FP6 with some placeholders, and then pushes them into an array of 3x uint8_t.
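As a rough, hedged illustration of that packing idea only (not the actual tc-FPx bit layout; pack_fp6_codes is a made-up helper), four 6-bit codes fit exactly into three bytes:
import torch

def pack_fp6_codes(codes: torch.Tensor) -> torch.Tensor:
    # codes: uint8 tensor holding 6-bit values (< 64), last dim divisible by 4
    c = codes.reshape(*codes.shape[:-1], -1, 4).to(torch.int32)
    # concatenate four 6-bit fields into one 24-bit word
    word = (c[..., 0] << 18) | (c[..., 1] << 12) | (c[..., 2] << 6) | c[..., 3]
    # split the 24-bit word into three bytes
    b0 = (word >> 16) & 0xFF
    b1 = (word >> 8) & 0xFF
    b2 = word & 0xFF
    return torch.stack([b0, b1, b2], dim=-1).to(torch.uint8).flatten(-2)

codes = torch.randint(0, 64, (8,), dtype=torch.uint8)
packed = pack_fp6_codes(codes)  # eight 6-bit codes -> six bytes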
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel
Last year, we released pytorch-labs/torchao to provide acceleration of Generative AI models using native PyTorch techniques. Torchao added support for running quantization on GPUs, including int8 dynamic quantization (W8A8) and weight-only quantization (int8 and int4) that were composable with torch.compile. Combined, the APIs launched in torchao were able to power SOTA generative AI models across multiple modalities: Segment Anything, Stable Diffusion, and LLaMa.
The results were showcased in these blog posts -
https://pytorch.org/blog/accelerating-generative-ai/,
https://pytorch.org/blog/accelerating-generative-ai-2/,
https://pytorch.org/blog/accelerating-generative-ai-3/
Our investment in torchao is to accelerate Generative AI, using native PyTorch features, ensuring composability with torch.compile.
In 2024, we plan to adopt the following strategy for development of torchao
Let’s dive deeper into some of the coverage areas mentioned above.
Dtypes like NF4, MX4, and groupwise quantized int4 are used for implementing various optimization techniques in models. Last year, we posted a plan on how we wish to support these dtypes in PyTorch. In torchao, we will host tensor subclass based implementations of dtypes; existing examples include uint4 and NF4, which users can use for their own quantization techniques, or they can override the implementation to support other dtypes that might be useful.
Moreover, users don’t need to write triton or cuda kernels for their custom dtypes. The implementation can be in python and torch.compile will take care of generating performant kernels under the hood.
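As a hedged illustration of that point (the packed layout and helper below are made up, not a torchao API), a sub-byte dequant written with plain tensor ops can be handed to torch.compile, which fuses the unpack and rescale into a single kernel:
import torch

@torch.compile
def dequant_uint4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # two 4-bit values live in each uint8; unpack then rescale
    lo = (packed & 0xF).to(torch.float16)
    hi = (packed >> 4).to(torch.float16)
    return torch.stack([lo, hi], dim=-1).flatten(-2) * scale

packed = torch.randint(0, 255, (1024, 512), dtype=torch.uint8, device="cuda")
scale = torch.rand(1024, 1, device="cuda", dtype=torch.float16)
out = dequant_uint4(packed, scale)  # [1024, 1024] fp16, one fused kernel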
Quantization can be done on only weights or weights+activations. Typically LLM quantization techniques for BS 1 (memory BW bound) use weight-only quantization techniques. But for larger batch sizes, or longer context length cases or for general throughput bound models quantizing the activations is also beneficial. Quantization, however, impacts the model accuracy and researchers have published techniques to mitigate this accuracy impact which currently exist externally as one repository per technique.
In torchao, we will plan to support the following class of techniques using PyTorch, made available via a simple UX and following the one-file-per-technique principle.
LLM weight only quantization techniques
Post training quantization
The two most popular techniques externally are GPTQ and AWQ, available via AutoGPTQ and AutoAWQ, which include the technique as well as the performant kernels for faster quantization ops.
To that end, we will start by re-implementing the GPTQ and AWQ techniques into torchao using PyTorch via a simple/intuitive UX that supports saving/loading of quantized models, while realizing the memory savings on disk. Some open questions we need to address here include -
How much VRAM will be required for different quantization techniques
How do we convert to-from weights quantized for different backends (cpu and gpu today use different weight packing format)
In the future, as more interesting and cutting edge techniques are introduced, researchers can directly implement them in torchao or our team can re-implement them in PyTorch.
Weight and activation quantization techniques
Post training quantization
We’ve already implemented W8A8 quantization via the int_mm kernel in core. This has shown speedup on models like SAM, SDXL without any impact to model accuracy and can be turned on via a simple one-line UX implemented via module swap or tensor subclass.
However the challenge here is that some smaller layer shapes might not benefit from quantization due to the overhead in quantizing and dequantizing the activation tensors. Users can either statically ignore quantizing these layers or have a higher level API that figures out which layers are sensitive to quantization. We plan to provide a higher level API via the auto quantizer that applies this technique to the layers that stand to benefit the most to provide the benefits of quantization without having to worry too much about the configs to use.
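For intuition, here is a hedged sketch of what W8A8 dynamic quantization computes on top of the int8 matmul in core (torch._int_mm); the per-row/per-channel scaling and layout handling are simplified, and a real implementation takes care of _int_mm's shape and layout constraints more carefully:
import torch

def w8a8_linear(x, w_int8_t, w_scale):
    # dynamically quantize the activation per row (symmetric int8)
    x_scale = (x.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-6)
    x_int8 = torch.clamp(torch.round(x / x_scale), -127, 127).to(torch.int8)
    # int8 @ int8 -> int32 accumulation, then rescale back to the original dtype
    acc = torch._int_mm(x_int8, w_int8_t)       # [M, N] int32
    return acc.to(x.dtype) * x_scale * w_scale  # broadcast the two scales

M, K, N = 32, 64, 128
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
w = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)
w_scale = w.abs().amax(dim=1) / 127.0  # per output channel
w_int8 = torch.clamp(torch.round(w / w_scale[:, None]), -127, 127).to(torch.int8)
y = w8a8_linear(x, w_int8.t().contiguous(), w_scale)  # y: [M, N] bf16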
Quantization aware training
Techniques here require access to fine-tuning, to tune the model to reduce accuracy impact of quantization. Recently, research like LLM-QAT is promising, showing that we can go down to W4A8 and 4-bit KV cache for LLMs. Moreover, newer lower bit techniques like AQLM, Quip# also include a component of fine-tuning to improve the model accuracy.
We will include the APIs and workflow to enable users to do QAT on LLMs, starting with implementing the LLM-QAT paper in torchao and further extending it to support other dtypes like MX4.
Kernels
Optimized kernels are key to making models run faster during inference. Today, in core we already have performant kernels like int_mm
and 4-bit weight quantization kernels for cpu (via intel) and gpu (via tinygemm). torchao will host performant kernels that will work with different backends, with a guide on how to plug these kernels into PyTorch models via the custom ops API. These kernels will compose with torch.compile, with the expectation that the user writes a meta kernel implementation for them. For executorch, the expectation is that if the user provides a kernel that works with executorch then it should also work in eager mode.
We will also directly engage with the community, to upstream their performant kernels into torchao.
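A hedged sketch of the custom ops mechanism mentioned above (assuming a recent PyTorch that provides torch.library.custom_op; the op name and Python body are made up and stand in for a real CUDA kernel):
import torch

@torch.library.custom_op("myao::scaled_add", mutates_args=())
def scaled_add(a: torch.Tensor, b: torch.Tensor, scale: float) -> torch.Tensor:
    # placeholder for a real C++/CUDA kernel registered the same way
    return a + scale * b

# the fake ("meta") implementation only describes output metadata, which is
# what torch.compile needs in order to trace through the op
@scaled_add.register_fake
def _(a, b, scale):
    return torch.empty_like(a)

fn = torch.compile(lambda a, b: scaled_add(a, b, 2.0))
print(fn(torch.randn(4), torch.randn(4)))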
Autotuner
In order to use any CUDA kernel efficiently, we'll need to pick the right kernel hyperparameters. The same is true for an eager mode kernel. A kernel autotuner will help here. We expect that the auto quantizer, along with the kernel autotuner, will make int8 dynamic quantization and int8/int4 weight-only quantization more usable and performant. A WIP example of what this might look like can be found here.
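A toy sketch of the idea (benchmark each candidate config for a given input shape and cache the winner); this is illustrative only, not the WIP torchao autotuner:
import torch

_BEST_CONFIG = {}  # (kernel name, input shapes) -> best config

def autotune(kernel, configs, *args):
    key = (kernel.__name__, tuple(t.shape for t in args if torch.is_tensor(t)))
    if key not in _BEST_CONFIG:
        best, best_ms = None, float("inf")
        for cfg in configs:
            kernel(*args, **cfg)  # warmup / compile
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            kernel(*args, **cfg)
            end.record()
            torch.cuda.synchronize()
            ms = start.elapsed_time(end)
            if ms < best_ms:
                best, best_ms = cfg, ms
        _BEST_CONFIG[key] = best
    return kernel(*args, **_BEST_CONFIG[key])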
Release engineering
Shipping optimized, custom kernels requires extensibility mechanisms and release channels. We have custom operator support that integrates broadly, but our release mechanism might need to be optimized. It can be quite difficult to ship custom binaries across a broad range of operating systems and accelerators.
We can add a conversion util from popular model storage formats like gguf into PyTorch’s state_dict format. This will enable users to take a pre-existing quantized model from llama.cpp and have it run via PyTorch eager mode for desktop cpu/gpu and executorch for on-device cases. We’ll share more details here soon.
In addition to quantization, we’ve seen promising results with sparsity as well on GPUs. We will share more updates on what torchao will host for the space of sparsity/pruning in the near future.
We'd love to hear any feedback or questions from the OSS community on this RFC. Thank you!
cc @msaroufim @cpuhrsch @jerryzh168 @HDCharles @andrewor14 @jcaip @jisaacso
After some talks with Saroufim and the cuda mode team working on BitNet, we've outlined a strategy for implementing the BitNet 1.58 method in torch. This issue lays the groundwork for 2-bit ternary tensor quantization and the BitNet linear work for BitNet 1.58.
I've set up a staging repo (Staging) with a number of items:
This covers the initial groundwork for getting working ternary networks into torch.
When using the new SDPA, why doesn't it need to be trained now that new parameters have been added?
To reproduce
conda create -n test_ao python=3.10
conda activate test_ao
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install git+https://github.com/pytorch/ao -v
Outputs (the packages are already cached in this case)
Using pip 24.0 from /home/ubuntu/miniconda3/envs/test_ao/lib/python3.10/site-packages/pip (python 3.10)
Collecting git+https://github.com/pytorch/ao
Cloning https://github.com/pytorch/ao to /tmp/pip-req-build-bcuh0mqg
Running command git version
git version 2.34.1
Running command git clone --filter=blob:none https://github.com/pytorch/ao /tmp/pip-req-build-bcuh0mqg
Cloning into '/tmp/pip-req-build-bcuh0mqg'...
Running command git rev-parse HEAD
b91b6be24afd1220331790ff0866f5b091165cd5
Resolved https://github.com/pytorch/ao to commit b91b6be24afd1220331790ff0866f5b091165cd5
Running command git rev-parse HEAD
b91b6be24afd1220331790ff0866f5b091165cd5
Running command pip subprocess to install build dependencies
Collecting setuptools
Using cached setuptools-69.5.1-py3-none-any.whl.metadata (6.2 kB)
Collecting wheel
Using cached wheel-0.43.0-py3-none-any.whl.metadata (2.2 kB)
Collecting ninja
Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting torch
Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting filelock (from torch)
Using cached filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
Collecting typing-extensions>=4.8.0 (from torch)
Using cached typing_extensions-4.11.0-py3-none-any.whl.metadata (3.0 kB)
Collecting sympy (from torch)
Using cached sympy-1.12-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
Using cached networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Collecting jinja2 (from torch)
Using cached jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
Collecting fsspec (from torch)
Using cached fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch)
Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch)
Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-nccl-cu12==2.20.5 (from torch)
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvtx-cu12==12.1.105 (from torch)
Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)
Collecting triton==2.3.0 (from torch)
Using cached triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch)
Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
Using cached MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting mpmath>=0.19 (from sympy->torch)
Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Using cached setuptools-69.5.1-py3-none-any.whl (894 kB)
Using cached wheel-0.43.0-py3-none-any.whl (65 kB)
Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)
Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)
Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)
Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
Using cached triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
Using cached typing_extensions-4.11.0-py3-none-any.whl (34 kB)
Using cached filelock-3.14.0-py3-none-any.whl (12 kB)
Using cached fsspec-2024.3.1-py3-none-any.whl (171 kB)
Using cached jinja2-3.1.4-py3-none-any.whl (133 kB)
Using cached networkx-3.3-py3-none-any.whl (1.7 MB)
Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Using cached MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Using cached nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)
Installing collected packages: ninja, mpmath, wheel, typing-extensions, sympy, setuptools, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
Successfully installed MarkupSafe-2.1.5 filelock-3.14.0 fsspec-2024.3.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 ninja-1.11.1.1 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105 setuptools-69.5.1 sympy-1.12 torch-2.3.0 triton-2.3.0 typing-extensions-4.11.0 wheel-0.43.0
Installing build dependencies ... done
On my machine, it takes 30s just to install the cached packages. Note that this happens every time I re-install torchao. During development, it's quite annoying to have this extra 30s every time I need to re-compile CUDA/C++ code (pip install -e . only works for Python code).
If this is the first time installing torchao from source, there will be extra time downloading the packages (which are huge).
The culprit seems to be torch being in build-system.requires in pyproject.toml. Perhaps this is a limitation of pip not being able to recognize torch from conda? During the build process, it's also not clear whether it is using the existing torch (from conda) or pip-installed torch (which may cause issues if the two versions mismatch; I'm using the latest version, so issues may not arise).
2:4 sparsity is only supported on Ampere+. We've only run benchmarks with A100s, but Phil (@philipbutler) has access to consumer GPUs that could also take advantage of sparse acceleration.
import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
python benchmarks/sparse/benchmark_semi_structured_sparsity.py --mode nvidia-fixed-k --dtype bfloat16 --backend cutlass
python benchmarks/sparse/benchmark_semi_structured_sparsity.py --mode nvidia-fixed-mn --dtype bfloat16 --backend cutlass
Afterwards, it would be great to get benchmarks for the ViT-B shapes found here: https://github.com/pytorch/ao/blob/main/benchmarks/sam_vit_b_shapes.csv
Referring this https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats/custom_cast.py
Although it was originally meant for MX dtypes only (FP4 E2M1, FP6 E2M3, FP6 E3M2), expanding its functionality to support any custom FPx dtype would be useful for developing and experimenting with custom FPx kernels.
Case in point, FP6-LLM upstream added support for FP5 E2M2 (https://github.com/usyd-fsalab/fp6_llm). This is what I need to write to support FP32->FP5 E2M2.
# define constants for F32 <-> F5_E2M2
# (assumes `import struct` and the F32 constants F32_EXP_BIAS / MBITS_F32 defined in custom_cast.py)
F5_E2M2_MAX = 7.0 # (2 ** (0b11 - 0b01)) * (1 + 0.5 + 0.25)
F5_E2M2_MIN_NORMAL = 1.0 # (2 ** (0b01 - 0b01))
EBITS_F5_E2M2 = 2
MBITS_F5_E2M2 = 2
F5_E2M2_EXP_BIAS = 0b01
F5_E2M2_MAX_INT = (1 << 4) - 1
SIGN_MASK_F5_E2M2 = 1 << 4
MAGIC_ADDER_F5_E2M2 = (1 << (MBITS_F32 - EBITS_F5_E2M2)) - 1
DENORM_F32TOF5_E2M2_EXP = (
# exp bias conversion between formats
(F32_EXP_BIAS - F5_E2M2_EXP_BIAS)
# mantissa length difference between formats
+ (MBITS_F32 - MBITS_F5_E2M2)
# add one to encoded exponent for denormalized numbers
+ 1
)
DENORM_F32TOF5_E2M2_MASK_INT = DENORM_F32TOF5_E2M2_EXP << MBITS_F32
# reinterpret int32 as float32 in Python
# see https://stackoverflow.com/a/34446112/1058521
DENORM_F32TOF5_E2M2_MASK_FLOAT = struct.unpack("!f", struct.pack("!I", DENORM_F32TOF5_E2M2_MASK_INT))[0]
def f32_to_f5_e2m2_unpacked(x: Tensor):
return _f32_to_f4_or_f6_unpacked(
x,
F5_E2M2_MAX,
F5_E2M2_MIN_NORMAL,
DENORM_F32TOF5_E2M2_MASK_FLOAT,
DENORM_F32TOF5_E2M2_MASK_INT,
EBITS_F5_E2M2,
MBITS_F5_E2M2,
F5_E2M2_EXP_BIAS,
MAGIC_ADDER_F5_E2M2,
F5_E2M2_MAX_INT,
SIGN_MASK_F5_E2M2,
)
Ideally, we shouldn't need to calculate all the constants ourselves; we should only have to provide the number of E and M bits, and these constants should be calculated within the function (or cached somewhere, though I think re-calculating them shouldn't take much time).
The other direction (FPx->FP32) is a bit trickier when handling denormal FPx, but should still be possible to make it more generic.
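For illustration, a hedged sketch of deriving these constants from just the exponent/mantissa bit counts (fpx_constants is a made-up helper; the magic adder mirrors the expression written above):
import struct

MBITS_F32, F32_EXP_BIAS = 23, 127

def fpx_constants(ebits: int, mbits: int) -> dict:
    exp_bias = (1 << (ebits - 1)) - 1
    max_val = 2 ** ((1 << ebits) - 1 - exp_bias) * (2 - 2 ** (-mbits))
    min_normal = 2.0 ** (1 - exp_bias)
    max_int = (1 << (ebits + mbits)) - 1
    sign_mask = 1 << (ebits + mbits)
    magic_adder = (1 << (MBITS_F32 - ebits)) - 1  # same expression as above
    # denormal mask, following the commented derivation above
    denorm_exp = (F32_EXP_BIAS - exp_bias) + (MBITS_F32 - mbits) + 1
    denorm_mask_int = denorm_exp << MBITS_F32
    denorm_mask_float = struct.unpack("!f", struct.pack("!I", denorm_mask_int))[0]
    return dict(max=max_val, min_normal=min_normal, exp_bias=exp_bias,
                max_int=max_int, sign_mask=sign_mask, magic_adder=magic_adder,
                denorm_mask_int=denorm_mask_int, denorm_mask_float=denorm_mask_float)

print(fpx_constants(2, 2))  # reproduces the hand-written F5_E2M2 values (max=7.0, min_normal=1.0, ...)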
Proposed changes
- Rename _f32_to_f4_or_f6_unpacked() and _f4_or_f6_unpacked_to_f32() to _f32_to_fpx_unpacked(x, n_ebits, n_mbits) and _fpx_unpacked_to_f32(x, n_ebits, n_mbits) (packed format is out of scope and should be handled separately for each case)
- Move the generic cast functions out of custom_cast.py to an upper level, e.g. prototype/fp_cast_utils.py (format-specific pieces, e.g. functions for packed fp4 and the custom triton kernels, should stay in custom_cast.py)
Tagging @vkuzo and @msaroufim for discussion and opinion.
The dequantize_affine function modifies the input in place, resulting in different outputs when a model is run multiple times with the same inputs.
Should we consider cloning the input, scale, and zero_point at the beginning of the function to prevent this issue?
# demo.py
# modified from https://github.com/pytorch/ao/blob/cbc74ee6a3dc0bae367db5b03bc58896fffe3ae0/test/quantization/test_quant_api.py#L449-L467
import unittest
import torch
from torchao.dtypes import to_aq, AffineQuantizedTensor
from torchao.quantization.quant_primitives import MappingType, ZeroPointDomain
from torchao.quantization.quant_api import quantize
from torchao.quantization.utils import TORCH_VERSION_AFTER_2_4
import copy
class ToyLinearModel(torch.nn.Module):
    def __init__(self, m=64, n=32, k=64):
        super().__init__()
        self.linear1 = torch.nn.Linear(m, n, bias=False).to(torch.float)
        self.linear2 = torch.nn.Linear(n, k, bias=False).to(torch.float)

    def example_inputs(self, batch_size=1):
        return (torch.randn(batch_size, self.linear1.in_features).to(torch.float),)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

@unittest.skipIf(not TORCH_VERSION_AFTER_2_4, "Test only enabled for 2.4+")
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
def test_quantized_tensor_subclass_int4():
    # weight settings
    groupsize = 32
    mapping_type = MappingType.ASYMMETRIC
    block_size = (1, groupsize)
    target_dtype = torch.int32
    quant_min = 0
    quant_max = 15
    eps = 1e-6
    zero_point_dtype = torch.bfloat16
    # !! For using `AffineQuantizedTensor.dequantize`
    preserve_zero = True
    zero_point_domain = ZeroPointDomain.INT

    # use 1024 so that we don't need padding
    m = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda")
    m_copy = copy.deepcopy(m)
    example_inputs = tuple(map(lambda x: x.to(torch.bfloat16).to("cuda"), m.example_inputs()))

    def apply_weight_quant(weight):
        return to_aq(
            weight,
            mapping_type,
            block_size,
            target_dtype,
            quant_min,
            quant_max,
            eps,
            zero_point_dtype=zero_point_dtype,
            preserve_zero=preserve_zero,
            zero_point_domain=zero_point_domain,
        )

    m = quantize(m, apply_weight_quant)
    assert isinstance(m.linear1.weight, AffineQuantizedTensor)
    assert isinstance(m.linear2.weight, AffineQuantizedTensor)

    res = m(*example_inputs)
    res2 = m(*example_inputs)
    assert torch.equal(res, res2), f"The results of the model are not consistent"
# pytest -sv demo.py
CI config and unit test template: #190
import torch
import torchvision.models.vision_transformer as models
# Load Vision Transformer model
model = models.vit_b_16(pretrained=True)
import torchao
model.eval().cuda().to(torch.bfloat16)
from torchao.quantization import apply_dynamic_quant
apply_dynamic_quant(model)
from torch._inductor import config as inductorconfig
inductorconfig.force_fuse_int_mm_with_mul = True
model = torch.compile(model, mode='max-autotune')
input_tensor = torch.randn(1, 3, 224, 224, dtype=torch.bfloat16, device='cuda')
model(input_tensor)
causes crash
[...]
self.out_proj.weight,
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1704, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
torch._dynamo.exc.TorchRuntimeError: Failed running call_module L__self___encoder_layers_encoder_layer_0_self_attention(*(FakeTensor(..., device='cuda:0', size=(1, 197, 768), dtype=torch.bfloat16,
grad_fn=<NativeLayerNormBackward0>), FakeTensor(..., device='cuda:0', size=(1, 197, 768), dtype=torch.bfloat16,
grad_fn=<NativeLayerNormBackward0>), FakeTensor(..., device='cuda:0', size=(1, 197, 768), dtype=torch.bfloat16,
grad_fn=<NativeLayerNormBackward0>)), **{'need_weights': False}):
'DynamicallyPerAxisQuantizedLinear' object has no attribute 'weight'
from user code:
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torchvision/models/vision_transformer.py", line 298, in forward
x = self.encoder(x)
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torchvision/models/vision_transformer.py", line 157, in forward
return self.ln(self.layers(self.dropout(input)))
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
return forward_call(*args, **kwargs)
File "/scratch/cpuhrsch/miniconda3/envs/nightly20240318py310/lib/python3.10/site-packages/torchvision/models/vision_transformer.py", line 113, in forward
x, _ = self.self_attention(x, x, x, need_weights=False)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
The code is out; it's quite simple and short.
Opening this so I can track how to add this to ao and make sure it works well with torch.compile(). This will likely need Blackwell to perform decently.
https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
In order to support sub-byte dtypes for quantization, I (and many others) believe it is better to pack these smaller dtypes into existing PyTorch dtypes, trading a bit of extra computation for reduced memory bandwidth contention. Here is a preliminary algorithm in PyTorch for doing this; it supports many types of conversions, as seen in the tests.
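For orientation, here is a hedged sketch of the pack/unpack idea for 2-bit values in uint8 (hypothetical helpers written with plain tensor ops, not the algorithm from the linked PR); the unpack mirrors the shift-and-mask pattern visible in the generated kernel below:
import torch

def pack_2bit(x: torch.Tensor) -> torch.Tensor:
    # pack groups of four 2-bit values (values in [0, 3], stored as uint8) into one uint8
    x = x.reshape(*x.shape[:-1], x.shape[-1] // 4, 4)
    return (x[..., 0] << 6) | (x[..., 1] << 4) | (x[..., 2] << 2) | x[..., 3]

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    # inverse of pack_2bit: expand each uint8 back into four 2-bit values
    shifts = torch.tensor([6, 4, 2, 0], dtype=torch.uint8, device=packed.device)
    out = (packed.unsqueeze(-1) >> shifts) & 0x3
    return out.reshape(*packed.shape[:-1], packed.shape[-1] * 4)

x = torch.randint(0, 4, (2, 8), dtype=torch.uint8)
assert torch.equal(unpack_2bit(pack_2bit(x)), x)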
Inspecting the compiled Triton code seems promising because it only launches one kernel and allocates one buffer. Here is a snippet:
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 4
    x1 = (xindex // 4)
    x2 = xindex
    tmp0 = x0
    tmp1 = tl.full([1], 0, tl.int64)
    tmp2 = tmp0 >= tmp1
    tmp3 = tl.full([1], 1, tl.int64)
    tmp4 = tmp0 < tmp3
    tmp5 = tl.load(in_ptr0 + (x1), tmp4 & xmask, eviction_policy='evict_last', other=0.0)
    tmp6 = tl.full([1], 6, tl.uint8)
    tmp7 = tmp5 >> tmp6
    tmp8 = tl.full([1], 3, tl.uint8)
    tmp9 = tmp7 & tmp8
    tmp10 = tl.full(tmp9.shape, 0.0, tmp9.dtype)
    tmp11 = tl.where(tmp4, tmp9, tmp10)
    tmp12 = tmp0 >= tmp3
    tmp13 = tl.full([1], 2, tl.int64)
    tmp14 = tmp0 < tmp13
    tmp15 = tmp12 & tmp14
    tmp16 = tl.load(in_ptr0 + (x1), tmp15 & xmask, eviction_policy='evict_last', other=0.0)
    tmp17 = tl.full([1], 4, tl.uint8)
    tmp18 = tmp16 >> tmp17
    tmp19 = tmp18 & tmp8
    tmp20 = tl.full(tmp19.shape, 0.0, tmp19.dtype)
    tmp21 = tl.where(tmp15, tmp19, tmp20)
    tmp22 = tmp0 >= tmp13
    tmp23 = tl.full([1], 3, tl.int64)
    tmp24 = tmp0 < tmp23
    tmp25 = tmp22 & tmp24
    tmp26 = tl.load(in_ptr0 + (x1), tmp25 & xmask, eviction_policy='evict_last', other=0.0)
    tmp27 = tl.full([1], 2, tl.uint8)
    tmp28 = tmp26 >> tmp27
    tmp29 = tmp28 & tmp8
    tmp30 = tl.full(tmp29.shape, 0.0, tmp29.dtype)
    tmp31 = tl.where(tmp25, tmp29, tmp30)
    tmp32 = tmp0 >= tmp23
    tmp33 = tl.full([1], 4, tl.int64)
    tmp34 = tmp0 < tmp33
    tmp35 = tl.load(in_ptr0 + (x1), tmp32 & xmask, eviction_policy='evict_last', other=0.0)
    tmp36 = tl.full([1], 0, tl.uint8)
    tmp37 = tmp35 >> tmp36
    tmp38 = tmp37 & tmp8
    tmp39 = tl.full(tmp38.shape, 0.0, tmp38.dtype)
    tmp40 = tl.where(tmp32, tmp38, tmp39)
    tmp41 = tl.where(tmp25, tmp31, tmp40)
    tmp42 = tl.where(tmp15, tmp21, tmp41)
    tmp43 = tl.where(tmp4, tmp11, tmp42)
    tl.store(out_ptr0 + (x2), tmp43, xmask)
''', device_str='cuda')

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    s2 = arg2_1
    assert_size_stride(arg3_1, (s0, s1, s2), (s1*s2, s2, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s0, s1, s2, 4), (4*s1*s2, 4*s2, 4, 1), torch.uint8)
        # Source Nodes: [stack], Original ATen: [aten.stack]
        triton_poi_fused_stack_0_xnumel = 4*s0*s1*s2
        stream0 = get_raw_stream(0)
        triton_poi_fused_stack_0.run(arg3_1, buf0, triton_poi_fused_stack_0_xnumel, grid=grid(triton_poi_fused_stack_0_xnumel), stream=stream0)
        del arg3_1
    return (reinterpret_tensor(buf0, (s0, s1, 4*s2), (4*s1*s2, 4*s2, 1), 0), )
We're putting together a loose RFC for our general plans that should be out shortly.
However, we know that we want to work with researchers / OSS to land advanced pruning algorithms into torchao. These pruning algorithms should extend the Sparsifier class found in torch.ao.pruning.
We're also interested in potentially adding additional fast sparse kernels to torchao. This requires some additional discussion, in particular whether we want to land these kernels as-is for eager mode support, or try to generate them with triton.
To reproduce, install torchao from main, then import torchao
pip install git+https://github.com/pytorch/ao
python -c "import torchao"
Error
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import torchao
File ~/code/ao/torchao/__init__.py:8
6 from . import dtypes
7 import torch
----> 8 from torch.testing._internal.common_utils import IS_FBCODE
9 if not IS_FBCODE:
10 from . import _C
File ~/miniconda3/envs/dev2.3/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py:62
47 from typing import (
48 Any,
49 Callable,
(...)
58 Union,
59 )
60 from unittest.mock import MagicMock
---> 62 import expecttest
63 import numpy as np
65 import __main__ # type: ignore[import]
ModuleNotFoundError: No module named 'expecttest'
The problem is self-explanatory. 2 possible solutions:
requirements.txt
Right now if I pip install torchao on my M1 laptop it'll install the 0.1 binaries and not the latest 0.2 release.
Similarly if I try to install the nightly binaries I get a hard error
(ao) pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cpu
Looking in indexes: https://download.pytorch.org/whl/nightly/cpu
ERROR: Could not find a version that satisfies the requirement torchao-nightly (from versions: none)
ERROR: No matching distribution found for torchao-nightly
(ao) pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/
Looking in indexes: https://download.pytorch.org/whl/nightly/
Requirement already satisfied: torchao in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (0.1)
Requirement already satisfied: torch in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (2.2.2)
Requirement already satisfied: numpy in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (1.26.4)
Requirement already satisfied: sentencepiece in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (0.2.0)
Requirement already satisfied: packaging in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torchao) (24.0)
Requirement already satisfied: filelock in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (3.13.4)
Requirement already satisfied: typing-extensions>=4.8.0 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (4.11.0)
Requirement already satisfied: sympy in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (1.12)
Requirement already satisfied: networkx in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (3.3)
Requirement already satisfied: jinja2 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (3.1.3)
Requirement already satisfied: fsspec in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from torch->torchao) (2024.3.1)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from jinja2->torch->torchao) (2.1.5)
Requirement already satisfied: mpmath>=0.19 in /opt/anaconda3/envs/ao/lib/python3.10/site-packages (from sympy->torch->torchao) (1.3.0)
(ao) pip list
Package Version
---------------------- -----------
attrs 23.2.0
bashlex 0.18
bitsandbytes 0.42.0
bracex 2.4
certifi 2024.2.2
charset-normalizer 3.3.2
cibuildwheel 2.17.0
contourpy 1.2.1
cycler 0.12.1
exceptiongroup 1.2.1
expecttest 0.2.1
filelock 3.13.4
fonttools 4.51.0
fsspec 2024.3.1
git-filter-repo 2.38.0
huggingface-hub 0.23.0
hypothesis 6.100.5
idna 3.7
iniconfig 2.0.0
Jinja2 3.1.3
kiwisolver 1.4.5
lxml 5.2.1
MarkupSafe 2.1.5
matplotlib 3.8.4
mpmath 1.3.0
networkx 3.3
ninja 1.11.1.1
numpy 1.26.4
packaging 24.0
pandas 2.2.2
parameterized 0.9.0
pillow 10.3.0
pip 23.3.1
platformdirs 4.2.0
pluggy 1.5.0
pyparsing 3.1.2
pytest 7.4.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
qtorch 0.3.0
regex 2024.4.28
requests 2.31.0
safetensors 0.4.3
scipy 1.13.0
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
sortedcontainers 2.4.0
sympy 1.12
tokenizers 0.19.1
tomli 2.0.1
torch 2.2.2
torchao 0.1
tqdm 4.66.4
transformers 4.40.2
typing_extensions 4.11.0
tzdata 2024.1
unittest-xml-reporting 3.2.0
urllib3 2.2.1
wheel 0.41.2
(ao)
I'm collecting a few issues I've seen. I have no clear picture of how to solve them as of this moment, but I'm aggregating them in the hopes that inspiration will strike.
The below issue is solved by installing ao and then cd-ing out of the ao directory. IIRC PyTorch has a similar problem. In a repro shared by @jerryzh168:
Traceback (most recent call last):
File "/home/jerryzh/ao/example.py", line 2, in <module>
from torchao.quantization.quant_primitives import MappingType, ZeroPointDomain
File "/home/jerryzh/ao/torchao/__init__.py", line 8, in <module>
from . import _C
ImportError: cannot import name '_C' from partially initialized module 'torchao' (most likely due to a circular import) (/home/jerryzh/ao/torchao/__init__.py)
Another issue here is that building the fp6 kernels is failing https://hastebin.com/share/riridivafa.rust even though the nvcc and gcc versions seem fine, in a repro shared by @CoffeeVampir3.
This error shows up when you either pip install ao or build it with a mismatch in CUDA versions, in a repro shared by @vayuda:
python test/quantization/test_quant_api.py
Traceback (most recent call last):
File "/u/pj8wfq/ao/test/quantization/test_quant_api.py", line 21, in <module>
from torchao.dtypes import (
File "/u/pj8wfq/ao/torchao/__init__.py", line 8, in <module>
from . import _C
ImportError: /u/pj8wfq/ao/torchao/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
PyPI binaries are crashing on non-CUDA devices:
File "/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/torchao/__init__.py", line 14, in
from . import _C
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
We need graceful solutions, but in the meantime I'm embarrassed to say I've been recommending a nuclear option, which is to disable the C extensions.
Specifically in torchao/__init__.py
delete
if not _IS_FBCODE:
from . import _C
from . import ops
And in setup.py
delete
ext_modules=get_extensions(),
Does torch already support int8 matmul on CUDA? Also, what version of torch is needed to run torchao now?
This issue tracks outstanding issues for a torchao 0.1 release
New Functionality
Tutorials/BE
If time permits (or v0.2)
This issue tracks outstanding feature requests for torchao. If you'd like a specific feature to be added to torchao, please comment directly here.
Quantization Techniques (based on planned, new requests)
DTypes
Sparsity APIs
Kernels
cc @cpuhrsch
A use case: storing a full backtracking pointer matrix can be okay for Needleman/CTC alignment (4x memory saving compared to a uint8 representation) if a 2-bit data type is used. Currently it's possible to do this with bit-manipulation magic, but probably not very efficiently (store and load will require masking and shifting, not fused).
Another use case: compressed BoolTensor for binary neural networks.
Another use case: extremely low-bit quantized representations.
Is something like this already implemented for quantization? A simple version of this feature could be to provide some explicit utility functions, like calculating the size of the holder uint8 tensor, and fused store and load functions (potentially explicitly batched, e.g. the actual store is delayed until some aligned number of memory lines has arrived).
In NumPy the related functionality is np.packbits and np.unpackbits; however, these are designed to work only with a 1-bit contained type. 2-bit/4-bit would be cool as well.
On 1-bit side, another related project is RoaringBitmap https://github.com/RoaringBitmap/RoaringBitmap (http://roaringbitmap.org/) - for compressed bitsets for set operations.
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @izdeby
@ebsmothers shared this interesting issue pytorch/torchtune#1020
And it got me thinking a bit more about how we validate numerical correctness.
On one hand we do rely on the compiler a lot, so we can do numerical checks by comparing eager vs compiled outputs.
But on the other hand, as people start to use ao more, they're likely to compare it to existing libraries with expert-written kernels, so in this case there are a few layers of validation we can do; taking the nf4 tensor from qlora as an example, we can
Opening this issue so we can discuss more; my 2c is that 1 and 2 are sufficient for CI and 3 can be a one-time thing.
Sparsity, like quantization, offers increased model performance at the expense of some model quality. However, it is not as widely used / researched as a technique, despite offering similar performance benefits. With the recent explosion in model sizes in GenAI, and with quantization pushing 1-bit limits, there has been renewed interest in sparsity, specifically for GPU backend sparsity patterns.
The parallel nature of GPU backends makes accelerating unstructured sparsity difficult. However, there exist specific sparsity patterns (block-wise, semi-structured) that are more amenable to acceleration on GPUs. Over the last year, we've integrated these fast sparse kernels into PyTorch Core, so that all users can see a speedup with just a few lines of code:
Our goal for torchao.sparsity is to drive research / adoption of these GPU sparsity patterns.
We feel that the main problem current researchers / users face is fragmentation. Researchers rightfully aim to show end-to-end results, but this means a lot of time is spent figuring out how to integrate with PyTorch and implementation questions like: When should I mask? When/how should I store the compressed representation? Do I want in-place or out-of-place mask updates? How can I call sparse matmul instead of dense?
We hope to change that by providing tutorials and APIs for both sparse kernels (tensor subclassing) and pruning algorithms (torch.ao.pruning.Sparsifier) that users can extend. We feel like the above problems can be solved once, by torchao, letting researchers focus on pushing sparse kernel performance or more accurate pruning algorithms.
We're also hoping to create a new extension point by releasing the workflows we have designed with xFormers that enable accelerated sparse training, not just sparse inference.
As such, we plan on launching torchao.sparsity with the following features in v0.2:
However, we'd like feedback from the community to set the longer-term vision for sparsity. Also feel free to chime in with any other thoughts you want to share!
We plan to host a set of OSS pruning algorithms in torchao. These pruning algorithms should extend the torch.ao.pruning.BaseSparsifier class, like WandaSparsifier. We welcome community contributions for pruning algorithms, provided they extend the BaseSparsifier.
We have often found pruning to be very model specific, with little generalization across domains. As such we hope to land sparse training recipes for specific models / datasets, showing how different pruning algorithms can be used. We are specifically interested in recipes that compose with quantization.
Additionally, we hope that these benchmark numbers can help first-time users of sparsity better understand the tradeoffs involved and encourage researchers to contribute SOTA pruning algorithms.
While much work has been done on sparsity for inference, sparsity for training has remained much more challenging. Thanks to the work done by xFormers, we’ve upstreamed fast runtime semi-structured sparsification kernels into PyTorch Core, which allow for prune -> compress -> sparse_mm to happen faster than dense matmul. We also aim to release an example of accelerated sparse training for the OSS community to extend.
There are additional sparsity patterns that may be supported on GPUs, which would require additional fast sparse kernels. We hope that torchao can be a staging ground for these kernels. We plan to upstream these kernels to Core as we see fit, depending on adoption.
Some initial options are:
Should we generate these kernels with torch.compile rather than hand-writing them?
cc @supriyar @cpuhrsch @msaroufim @pytorch-labs/team-superblock @danthe3rd @mklasby @ngc92 @hgyhungry
Currently we need to do:
from torchao.quantization.utils import unwrap_tensor_subclass
m_unwrapped = unwrap_tensor_subclass(m)
# export
m = torch.export.export(m_unwrapped, example_inputs).module()
# aot_compile
torch._export.aot_compile(m_unwrapped, example_inputs)
This is needed to make tensor subclasses work with export/aot_compile; it should be added to the default export path directly.
Following the recent success of the LLM-QAT paper, our high-level goal is to provide a PyTorch native workflow for LLM quantization-aware training (QAT) leveraging the quantization primitives and kernels provided by torchao, which is planned to become the de facto OSS library for AO techniques and kernels in PyTorch across different platforms (#47). We also hope to eventually integrate with TorchTune, a recently open-sourced library for fine-tuning and experimenting with LLMs, to provide an end-to-end flow that supports both finetuning and QAT.
Executorch provides a mechanism for quantizing Llama2 using post-training quantization (PTQ) techniques such as GPTQ, and lowering it to backends like XNNPACK. The main goal of this workstream is to provide a QAT drop-in replacement for GPTQ but with superior accuracy, starting with Llama2 7b using the following quantization/training configurations:
We plan to adopt the same eager mode quantization implementation used by the PTQ flow. In the future, if we decide to experiment with static quantization for activations for example, then we can explore using the PT2 Export QAT flow.
This workstream is largely backend agnostic; our goal is to motivate the backends (mobile or server CPU/GPU) to build the relevant kernels once we have demonstrated the initial success of a particular quantization configuration. There is a large design space we can experiment with summarized below. The suggested quantization and training techniques are primarily motivated by the LLM-QAT paper, but also by ongoing developments across the industry.
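For intuition, here is a minimal, hedged sketch of the weight fake-quantization building block that QAT relies on (straight-through estimator, int4-style symmetric range); this is illustrative only, not the planned torchao QAT implementation:
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    def forward(self, x):
        w = self.weight
        # per-channel symmetric scale targeting an int4-style range [-8, 7]
        scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-6)
        w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale  # quantize-dequantize
        w_fq = w + (w_q - w).detach()  # straight-through estimator for gradients
        return nn.functional.linear(x, w_fq, self.bias)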
We can start with the following dimensions:
This is an extension of the recent gpt-fast efforts to quantize Llama but for QAT. An important goal here is to reuse the same quantization primitives as Workstream 1 to unify the two flows as much as possible. We can start with the following quantization configurations:
Traceback (most recent call last):
File "C:/Program Files/JetBrains/PyCharm Community Edition 2023.2.1/plugins/python-ce/helpers/pydev/pydevd.py", line 1527, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2023.2.1\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:\code\foobar\scripts\quantize.py", line 37, in <module>
swap_linear_with_smooth_fq_linear(model)
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 219, in swap_linear_with_smooth_fq_linear
swap_linear_with_smooth_fq_linear(child, skip_fqn_list, new_fqn, alpha)
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 219, in swap_linear_with_smooth_fq_linear
swap_linear_with_smooth_fq_linear(child, skip_fqn_list, new_fqn, alpha)
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 219, in swap_linear_with_smooth_fq_linear
swap_linear_with_smooth_fq_linear(child, skip_fqn_list, new_fqn, alpha)
[Previous line repeated 1 more time]
File "C:\code\py-envs\foobar\lib\site-packages\torchao\quantization\smoothquant.py", line 215, in swap_linear_with_smooth_fq_linear
target_cls = source_cls_to_target_cls[type(child)]
KeyError: <class 'torch.nn.modules.linear.NonDynamicallyQuantizableLinear'>
python-BaseException
Expected: the NonDynamicallyQuantizableLinear layer is skipped (possibly with a warning), or properly handled.
Actual: exception.
It sounds like HDCharles was planning on fixing this more generally: pytorch/pytorch#58969
Hi, I've been playing around with QLoRA using the NF4Tensor class from this great library. But I noticed that the NF4 data type is using 8 bits of memory per parameter, where it should be using ~4.1 bits according to the paper. I verified this by initializing a single tensor:
>>> t4 = torchao.dtypes.nf4tensor.NF4Tensor.from_tensor(torch.rand([1024, 4096], dtype=torch.bfloat16), 64, 256)
>>> torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
(0, 0)
>>> t4 = t4.cuda()
>>> torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
(4194304, 20971520)
which is 4194304 / (1024 * 4096) * 8 = 8 bits per parameter
I was wondering if this a bug, or is there some intrinsic limitation here? Thanks
I have encountered a problem when using QLoRA via the LoRALinear class you have in torchtune. Apparently, when the quantize_base parameter is set to True, the NF4Tensor class is called, which transforms the tensor to 4 bits via the to_nf4 function.
This function casts the tensor to bf16 without specifying the device, giving the problem of tensors on two different devices, CPU and GPU. When the tensor is created, by default it is assigned to the CPU unless otherwise specified. I got this error trying to do QLoRA on my GPU.
I leave here the original function vs the one I modified to get out of it:
Original function according to torchao.dtypes.nf4tensor:
def to_nf4(tensor, block_size: int = 64, scaler_block_size: int = 256):
    tensor1 = tensor.to(torch.bfloat16)
    return NF4Tensor.from_tensor(tensor1, block_size, scaler_block_size)
Modified function for dealing with the problem:
def to_nf4(tensor, device, block_size: int = 64, scaler_block_size: int = 256):
    tensor1 = tensor.to(device=device, dtype=torch.bfloat16)
    return NF4Tensor.from_tensor(tensor1, block_size, scaler_block_size)
We plan to add QAT for LLMs to torchao (as mentioned in the original RFC here #47)
For this to run efficiently on the GPU we'd need kernel support for W4A8 quantization (int4 weights, int8 activations).
Other places where this has been raised before
NVIDIA/cutlass#1316,
NVIDIA/cutlass#1370
cc @andrewor14
Just listing out all the issues I'm seeing with our docs; feel free to pick something up and fix it. As a first step, just add your documentation directly in a relevant subfolder in the repo and tag me to review.
For API docstrings and end usage instructions that won't change a lot please put them here https://github.com/pytorch/ao/tree/main/docs so they get rendered on pytorch.org/docs
The repo is primarily about performance so we should share performance tables directly in the README until we figure out a dashboard like solution
For each sparsity or quantization technique you're working on feel free to add another subsection
torch.ao.pruning
accuracy benchmarks on llama2 or llama3
Confirm they're visible on pytorch.org
PR is here, please feel free to comment in PR directly: #159
Currently there are many q/dq functions in torchao and pytorch, they mainly differ in the following dimensions:
Ideally, I think we should unify them, it might complicate the operator pattern that’s used by backends like xnnpack, but the code sharing and simplification of the representation it brings will be beneficial in the long term.
We defined three functions: choose_qparams_affine_per_block, quantize_affine_per_block, and dequantize_affine_per_block; please check out the docstrings of these functions in the PR for the definitions.
get_block_size(input, {"quant_type": "per_channel_group", "group_size": 32, "axis": -1})
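For intuition, a hedged sketch of what per-group affine quantize/dequantize computes (the helper names and signatures below are illustrative, not the actual torchao API):
import torch

def quantize_affine_groupwise(x, group_size=32, quant_min=0, quant_max=15):
    # asymmetric per-group affine quantization; numel must be divisible by group_size
    xg = x.reshape(-1, group_size)
    min_val, max_val = xg.amin(dim=1, keepdim=True), xg.amax(dim=1, keepdim=True)
    scale = (max_val - min_val).clamp(min=1e-6) / (quant_max - quant_min)
    zero_point = quant_min - torch.round(min_val / scale)
    q = torch.clamp(torch.round(xg / scale) + zero_point, quant_min, quant_max)
    return q.to(torch.int32).reshape(x.shape), scale, zero_point

def dequantize_affine_groupwise(q, scale, zero_point, group_size=32):
    qg = q.reshape(-1, group_size).to(scale.dtype)
    return ((qg - zero_point) * scale).reshape(q.shape)

x = torch.randn(64, 128)
q, scale, zp = quantize_affine_groupwise(x)
x_hat = dequantize_affine_groupwise(q, scale, zp)  # approximates x to within one quantization step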
This is the error https://github.com/pytorch/ao/actions/runs/8977543410/job/24656486432?pr=216
Don't have time to debug tonight but cc @svekars who might have some ideas
Run cd docs
Running Sphinx v5.0.0
torchao_version_docs: refs/pull/216/merge
Version: main
making output directory... done
Using Sphinx-Gallery to convert rst text blocks to markdown for .ipynb files.
[autosummary] generating autosummary for: api_ref_dtypes.rst, api_ref_intro.rst, api_ref_kernel.rst, api_ref_quantization.rst, api_ref_sparsity.rst, dtypes.rst, getting-started.rst, index.rst, overview.rst, performant_kernels.rst, quantization.rst, sparsity.rst
Extension error (sphinx.ext.autosummary):
Handler <function process_generate_options at 0x7f0bd5e8cfe0> for event 'builder-inited' threw an exception (exception: no module named torchao.sparsity)
make: *** [Makefile:41: html] Error 2
Error: Process completed with exit code 2.
Are there any runnable demos of using Sparse-QAT/PTQ (2:4) to accelerate inference, such as applying PTQ to a 2:4 sparse LLaMA for inference acceleration? I am curious about the potential speedup ratio this could achieve.
The overall pipeline might be: compressing the Weight matrix using 2:4 sparsity and quantizing it to INT8 format through PTQ/QAT. The Activation matrix should also be quantized to INT8 format through PTQ/QAT. After such processing, the main type of computation would be INT8*INT8.
I would like to know if there is a tutorial document available, as I am a beginner in the field of quantization.
Thx!
construct NF4 tensors in chunks and check memory traces: #196
A16W4 axis=1
A16W4+ axis=1
A8W4 axis=1
A16W3 and A16W5
Mark said you should be familiar with this @vkuzo
https://github.com/pytorch/ao/actions/runs/9487632431/job/26150568432?pr=347
=================================== FAILURES ===================================
___________________ test_inference_compile_simple[fp4_e2m1] ____________________
elem_dtype = 'fp4_e2m1'
@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
@pytest.mark.parametrize("elem_dtype", SUPPORTED_ELEM_DTYPES)
def test_inference_compile_simple(elem_dtype):
    """
    Smoke test for inference compile
    """
    if elem_dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        if not IS_CUDA_GE_89:
            pytest.skip("CUDA capability >= 8.9 required for float8 in triton")
    m = nn.Sequential(nn.Linear(4, 6, bias=False, dtype=torch.bfloat16))
    m = m.cuda()
    m_mx = copy.deepcopy(m)
    block_size = 2
    swap_linear_with_mx_inference_linear(m_mx, elem_dtype, block_size)
    m_mx = torch.compile(m_mx, fullgraph="true")
    x = torch.randn(2, 4, device="cuda", dtype=torch.bfloat16)
    y_ref = m(x)
    y_mx = m_mx(x)
    sqnr = compute_error(y_ref, y_mx)
    if elem_dtype is torch.float8_e4m3fn:
        assert sqnr >= 20.0
    else:
>       assert sqnr >= 14.0
E       AssertionError: assert tensor(13.1250, device='cuda:0', dtype=torch.bfloat16, grad_fn=<MulBackward0>) >= 14.0
test/prototype/mx_formats/test_mx_linear.py:192: AssertionError
Hi folks, thanks for the great work.
With #135 merged, vLLM could see benefit from torch.compile backend given compiler-native integration with PagedAttention kernels.
Is there an easy way to see what the latest/nightly MBU is for torch compile on say, H100 / Llama3 70B?
Also interested in cold start compile time
cc @msaroufim
We'd like to make it really easy for people to add support for custom CUDA extensions in ao and there's a few pieces of work we need to do to get there
Follow up work in a separate issue
First of all, thank you for the great library! It makes quantization really easy.
Is it possible to run autoquant once and later apply the same quantization plan again? Or would I need to manually look at logs right now to see what autoquant came up with, so I can apply the same quantization later?
// I see there's AUTOQUANT_CACHE that gets used to save the timings; maybe just saving/loading that will do?
// Seems like ^ works!
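A minimal sketch of what saving and loading that cache could look like (assuming AUTOQUANT_CACHE is a plain dict living in torchao.quantization.autoquant; verify the module path for your version):
import pickle
import torchao.quantization.autoquant as autoquant

def save_autoquant_cache(path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(autoquant.AUTOQUANT_CACHE, f)

def load_autoquant_cache(path: str) -> None:
    with open(path, "rb") as f:
        autoquant.AUTOQUANT_CACHE.update(pickle.load(f))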