[New Feature] CUTLASS kernels for w4a8 quantization

supriyar commented on July 18, 2024

[New Feature] CUTLASS kernels for w4a8 quantization

from ao.

Comments (4)

alexsamardzic commented on July 18, 2024

Working on this: NVIDIA/cutlass#1413.

supriyar commented on July 18, 2024

cc @alexsamardzic @cpuhrsch

jeromeku commented on July 18, 2024

@alexsamardzic

Great work so far on integrating w4a8 GEMM in CUTLASS!

Do you have plans to re-implement this functionality for pre-Hopper architectures using CUTLASS 3.x / CuTe rather than the CUTLASS 2.x APIs, which seem to be deprecated?

The 3.x interface has some convenient sub-byte primitives for slicing 4b tensors, but warp-level shuffling would still be needed for efficient tensor core loading and MMA.

Would be happy to help adapt 4b mixed type gemm using CuTe for Ampere.
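For readers unfamiliar with the sub-byte handling mentioned above, here is a minimal pure-Python sketch of what "w4a8" implies at the data level: signed 4-bit weights stored two per byte and multiplied against 8-bit activations. This is not the CUTLASS or CuTe API. The helper names (`pack_int4`, `unpack_int4`, `w4a8_dot`) are hypothetical, and the low-nibble-first packing order is an assumption made for illustration; a real kernel does the unpacking in registers, not element by element.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (range -8..7) two per byte.

    Assumption for this sketch: the first value of each pair goes in
    the low nibble, the second in the high nibble.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        # Mask to 4 bits (two's complement) and combine into one byte.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def w4a8_dot(packed_weights, activations):
    """Reference dot product of int4 weights with int8 activations.

    A Python int stands in for the wider accumulator a GPU kernel
    would use.
    """
    weights = unpack_int4(packed_weights)
    if len(weights) != len(activations):
        raise ValueError("length mismatch between weights and activations")
    for a in activations:
        if not -128 <= a <= 127:
            raise ValueError(f"{a} does not fit in a signed 8-bit integer")
    return sum(w * a for w, a in zip(weights, activations))


if __name__ == "__main__":
    packed = pack_int4([-8, 7, 1, -1])
    print(packed.hex())
    print(unpack_int4(packed))
    print(w4a8_dot(packed, [1, 2, 3, 4]))
```

The sketch only shows the storage convention and the arithmetic; the layout a tensor core expects, and the shuffling needed to get there, are separate concerns.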

alexsamardzic commented on July 18, 2024

> Do you have plans to re-implement this functionality for pre-Hopper architectures using CUTLASS 3.x / CuTe rather than the CUTLASS 2.x APIs, which seem to be deprecated?

(Please send further comments to the PR mentioned above - I think it makes most sense to discuss CUTLASS features on CUTLASS GitHub pages.)

As can be seen from my PR, this feature is implemented the same way as F16/S8 and the like. For my purpose, which is adding support for this operation to PyTorch on the Ampere architecture, in both eager and compiled mode, this is good enough. I'm not sure how my changes could be made more 3.x-style, since the functionality is implemented at the warp level, but if you have any suggestions, please post them either in this PR or in a separate one.
