Comments (9)
In the oneDNN log I observed that only 2D matmuls use quantized kernels, while 3D matmuls use FP32 kernels. How can we enable int8 kernels for these?
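For context, a minimal sketch of how such a log is typically collected, assuming oneDNN's verbose mode (the ONEDNN_VERBOSE environment variable); the model and quantization setup are whatever you are already running:

```python
import os

# oneDNN verbose mode prints one line per executed primitive, including the
# kernel name and the data types (e.g. src:u8 / wei:s8 for quantized kernels
# vs. src:f32). It must be set before the first oneDNN primitive is created,
# so set it before importing and running the model.
os.environ["ONEDNN_VERBOSE"] = "1"

# ... then build, quantize, and run the model as usual, and inspect stdout to
# see which matmul primitives dispatch as int8 vs. f32.
```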
The 3D matmuls refer to the bmms in attention, right? First of all, enabling these ops depends on the quantization recipe, i.e., in the model conversion phase of quantization we need to insert quant ops before these bmms. I'm not sure about Graviton3, but for x86 we are enabling this. cc @leslie-fang-intel @Valentine233
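For reference, the bmms in question are the q·kᵀ and attn·v products; once the batch and head dimensions are folded together they show up as 3D matmuls (the shapes below are just illustrative):

```python
import torch

B, H, S, D = 1, 12, 128, 64           # illustrative batch/head/seq/head-dim sizes
q = torch.randn(B * H, S, D)          # batch and head dims folded into dim 0
k = torch.randn(B * H, S, D)
v = torch.randn(B * H, S, D)

scores = torch.bmm(q, k.transpose(1, 2)) / D ** 0.5   # 3D matmul #1: (B*H, S, S)
attn = torch.softmax(scores, dim=-1)
out = torch.bmm(attn, v)                               # 3D matmul #2: (B*H, S, D)
```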
Hi @akote123, thanks for the question. Yes, the matmul quantization recipe is supported in X86InductorQuantizer (https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/quantizer/x86_inductor_quantizer.py#L786) with the PT2E quantization flow (refer to the PT2E tutorial for details on how to use this flow).
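For reference, a minimal sketch of that flow; the capture API has moved between releases (capture_pre_autograd_graph in older 2.x versions vs. torch.export.export_for_training later), and MyModel and its input shape are just placeholders:

```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

model = MyModel().eval()                       # hypothetical model with matmul/bmm ops
example_inputs = (torch.randn(1, 128, 768),)   # placeholder input shape

# Capture the model graph (the API name depends on the PyTorch version).
exported = torch.export.export_for_training(model, example_inputs).module()

# Attach the X86InductorQuantizer recipe, which also covers matmul/bmm.
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)                      # calibration run(s)
quantized = convert_pt2e(prepared)

# Lower through Inductor so the quantized kernels are actually used.
optimized = torch.compile(quantized)
optimized(*example_inputs)
```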
As for the backend optimization: if you are talking about the bmm optimization in attention, we are actually working on a customized SDPA kernel which will run the bmm with the int8 data type. It depends on the oneDNN BRGEMM optimization, so it is still under active development.
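For clarity, the op such a fused kernel would target at the PyTorch level is scaled_dot_product_attention; the int8 variant described above is not available yet, so this only shows the op in its current form:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; today this runs in fp32/bf16 on CPU, while the int8
# SDPA path mentioned above is still under development.
B, H, S, D = 1, 12, 128, 64
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)
out = F.scaled_dot_product_attention(q, k, v)   # fuses q@k.T, softmax, and attn@v
```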
Somewhat related, on dynamic quantization support, but on CUDA:
As for the backend optimization: if you are talking about the bmm optimization in attention, we are actually working on a customized SDPA kernel which will run the bmm with the int8 data type. It depends on the oneDNN BRGEMM optimization, so it is still under active development.
Thank you. In non-compile mode these mm and bmm ops are handled by MKL and not routed to oneDNN BRGEMM. So with the enablement of the SDPA kernels, will these ops be handled by oneDNN instead of MKL?
Yes, with the enablement of the SDPA kernels these ops are handled by oneDNN BRGEMM.
@leslie-fang-intel, in non-compile mode will these ops (matmul and bmm) also be directed to oneDNN BRGEMM in the future?
@leslie-fang-intel, in non-compile mode will these ops (matmul and bmm) also be directed to oneDNN BRGEMM in the future?
We don't have a plan to support non-compile mode yet. Can you give more background or details for the request? I can sync with the team and report back to you here.
@leslie-fang-intel, here I just wanted to understand why the oneDNN BRGEMM path is not followed for matmul in PyTorch. Is it because of the reorder overhead? In TensorFlow the matmuls are handled by oneDNN.
Hi @akote123, we have evaluated the oneDNN quantized matmul path previously (cc @Xia-Weiwen); it has some performance overhead because:
- The B matrix has packing overhead at runtime.
- U8 activation with U8 weight is only supported on the latest x86 CPUs such as SPR, which means we otherwise need to convert the B matrix from U8 to S8 at runtime (see the sketch below).
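To illustrate that second point, a small sketch of the u8 → s8 weight shift and the compensation it requires (plain integer reference math, not the actual oneDNN kernels; shapes are hypothetical):

```python
import torch

A_u8 = torch.randint(0, 256, (4, 8))       # u8 activation values (held as int64 here)
B_u8 = torch.randint(0, 256, (8, 3))       # u8 weight values

ref = A_u8 @ B_u8                          # the "native" u8 x u8 product

# Shift the weights into the s8 range and compensate in the output:
#   A @ B_u8 = A @ (B_s8 + 128) = A @ B_s8 + 128 * rowsum(A)
B_s8 = B_u8 - 128
compensated = A_u8 @ B_s8 + 128 * A_u8.sum(dim=1, keepdim=True)

assert torch.equal(ref, compensated)       # same result, but B had to be rewritten
```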