Comments (9)
In the oneDNN log I observed that only 2D matmuls use quantized kernels, while 3D matmuls use FP32 kernels. How can we enable int8 kernels for these?
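For context, a minimal sketch of how such a log is typically collected, assuming oneDNN's verbose mode (the ONEDNN_VERBOSE environment variable); the model and quantization setup are whatever you are already running:

```python
import os

# oneDNN verbose mode prints one line per executed primitive, including the
# kernel name and the data types (e.g. src:u8 / wei:s8 for quantized kernels
# vs. src:f32). It must be set before the first oneDNN primitive is created,
# so set it before importing and running the model.
os.environ["ONEDNN_VERBOSE"] = "1"

# ... then build, quantize, and run the model as usual, and inspect stdout to
# see which matmul primitives dispatch as int8 vs. f32.
```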
The 3D matmuls refer to the bmms in attention, right? First of all, enabling these ops depends on the quantization recipe, i.e., in the model conversion phase of quantization we need to insert quant ops before these bmms. I'm not sure about Graviton3, but for x86 we are enabling this. cc @leslie-fang-intel @Valentine233
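For reference, the bmms in question are the q·kᵀ and attn·v products; once the batch and head dimensions are folded together they show up as 3D matmuls (the shapes below are just illustrative):

```python
import torch

B, H, S, D = 1, 12, 128, 64           # illustrative batch/head/seq/head-dim sizes
q = torch.randn(B * H, S, D)          # batch and head dims folded into dim 0
k = torch.randn(B * H, S, D)
v = torch.randn(B * H, S, D)

scores = torch.bmm(q, k.transpose(1, 2)) / D ** 0.5   # 3D matmul #1: (B*H, S, S)
attn = torch.softmax(scores, dim=-1)
out = torch.bmm(attn, v)                               # 3D matmul #2: (B*H, S, D)
```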
Hi @akote123, thanks for the question. Yes, the matmul quantization recipe is supported in X86InductorQuantizer (https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantizer/x86_inductor_quantizer.py#L786) with the PT2E quantization flow (refer to the PT2E tutorial for details on how to use this flow).
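For reference, a minimal sketch of that flow; the capture API has moved between releases (capture_pre_autograd_graph in older 2.x versions vs. torch.export.export_for_training later), and MyModel and its input shape are just placeholders:

```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

model = MyModel().eval()                       # hypothetical model with matmul/bmm ops
example_inputs = (torch.randn(1, 128, 768),)   # placeholder input shape

# Capture the model graph (the API name depends on the PyTorch version).
exported = torch.export.export_for_training(model, example_inputs).module()

# Attach the X86InductorQuantizer recipe, which also covers matmul/bmm.
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)                      # calibration run(s)
quantized = convert_pt2e(prepared)

# Lower through Inductor so the quantized kernels are actually used.
optimized = torch.compile(quantized)
optimized(*example_inputs)
```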
As for the backend optimization: if you are talking about the bmm optimization in attention, we are actually working on a customized SDPA kernel which will run the bmm with the int8 data type. It depends on the oneDNN BRGEMM optimization, so it is still under active development.
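For clarity, the op such a fused kernel would target at the PyTorch level is scaled_dot_product_attention; the int8 variant described above is not available yet, so this only shows the op in its current form:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; today this runs in fp32/bf16 on CPU, while the int8
# SDPA path mentioned above is still under development.
B, H, S, D = 1, 12, 128, 64
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)
out = F.scaled_dot_product_attention(q, k, v)   # fuses q@k.T, softmax, and attn@v
```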
Somewhat related, on dynamic quantization support, but on CUDA:
As for the backend optimization: if you are talking about the bmm optimization in attention, we are actually working on a customized SDPA kernel which will run the bmm with the int8 data type. It depends on the oneDNN BRGEMM optimization, so it is still under active development.
Thank you. In non-compile mode these mm and bmm ops are handled by MKL and not routed to oneDNN BRGEMM. So with the enablement of the SDPA kernels, will these ops be handled by oneDNN instead of MKL?
Yes, with the enablement of the SDPA kernels these ops are handled by oneDNN BRGEMM.
@leslie-fang-intel, in non-compile mode will these ops (matmul and bmm) also be directed to oneDNN BRGEMM in the future?
@leslie-fang-intel, in non-compile mode will these ops (matmul and bmm) also be directed to oneDNN BRGEMM in the future?
We don't have a plan to support non-compile mode yet. Can you give more background or details for the request? I can sync with the team and report back to you here.
@leslie-fang-intel, here I just wanted to understand why the oneDNN BRGEMM path is not followed for matmul in PyTorch. Is it because of the reorder overhead? In TensorFlow the matmuls are handled by oneDNN.
Hi @akote123, we have evaluated the oneDNN quantized matmul path previously (cc @Xia-Weiwen); it has some performance overhead because:
- The B matrix has packing overhead at runtime.
- U8 activation with U8 weight is only supported on the latest x86 CPUs such as SPR, which means we otherwise need to convert the B matrix from U8 to S8 at runtime (see the sketch below).
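To illustrate that second point, a small sketch of the u8 → s8 weight shift and the compensation it requires (plain integer reference math, not the actual oneDNN kernels; shapes are hypothetical):

```python
import torch

A_u8 = torch.randint(0, 256, (4, 8))       # u8 activation values (held as int64 here)
B_u8 = torch.randint(0, 256, (8, 3))       # u8 weight values

ref = A_u8 @ B_u8                          # the "native" u8 x u8 product

# Shift the weights into the s8 range and compensate in the output:
#   A @ B_u8 = A @ (B_s8 + 128) = A @ B_s8 + 128 * rowsum(A)
B_s8 = B_u8 - 128
compensated = A_u8 @ B_s8 + 128 * A_u8.sum(dim=1, keepdim=True)

assert torch.equal(ref, compensated)       # same result, but B had to be rewritten
```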