Comments (4)
Working on this: NVIDIA/cutlass#1413.
Great work so far on integrating the w4a8 GEMM in CUTLASS!
Do you have plans to re-implement this functionality for pre-Hopper architectures using CUTLASS 3.x / CuTe rather than the CUTLASS 2.x APIs, which seem to be deprecated?
The 3.x interface has some convenient sub-byte primitives for slicing 4b tensors, but warp-level shuffling would still be needed for efficient tensor core loading and mma.
Would be happy to help adapt the 4b mixed-type GEMM using CuTe for Ampere.
> Do you have plans on re-implementing this functionality in pre-Hopper architectures using Cutlass 3.x / CuTe rather than the Cutlass 2.x APIs that seem to be deprecated?
(Please send further comments to the PR mentioned above - I think it makes most sense to discuss CUTLASS features on CUTLASS GitHub pages.)
As can be seen from my PR, this feature is implemented the same way as F16/S8 and the like. For my purpose, which is adding support for this operation to PyTorch, for the Ampere architecture and for both eager and compiled mode, this is good enough. I'm not sure how my changes could be made more 3.x-like, as the functionality is implemented at the warp level, but if you have any suggestions, please post them either to this PR or in a separate one.
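For readers unfamiliar with what a w4a8 (4-bit weight, 8-bit activation) product computes, here is a minimal pure-Python reference sketch. It is not the code from the PR and says nothing about how the CUTLASS kernel lays out data: the function names are hypothetical, the low-nibble-first packing is an assumption made for illustration, and the "unpack to a wider integer, then multiply and accumulate" step only mirrors the general idea of a mixed-input GEMM.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte.

    Assumed layout (illustrative only): the first value of each pair goes
    into the low nibble, the second into the high nibble.
    """
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: sign-extend each nibble to a Python int."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


def w4a8_dot(activations_s8, packed_weights_s4):
    """Dot product of int8 activations with packed int4 weights.

    The weights are widened first and the products are accumulated in a
    plain integer, standing in for an int32 accumulator.
    """
    weights = unpack_int4(packed_weights_s4)
    if len(weights) != len(activations_s8):
        raise ValueError("length mismatch between activations and weights")
    return sum(a * w for a, w in zip(activations_s8, weights))


if __name__ == "__main__":
    weights = [-8, 7, 3, -1]
    packed = pack_int4(weights)
    print(unpack_int4(packed) == weights)
    print(w4a8_dot([1, -2, 3, 4], packed))
```

A GPU kernel would do the unpacking in registers and feed the widened values to tensor core instructions; the sketch is only meant as a correctness reference to compare such a kernel against.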