Comments (15)
This is expected on H100: we have optimized H100 kernels for dense GEMM but not yet for sparse. The sparse kernels you're running are almost certainly Ampere kernels recompiled for Hopper. Hopper sparse kernels are on our roadmap for around the middle of 2024.
from cutlass.
Thanks for the response, that makes sense. I've rerun the benchmarks on A100 and the results seem to be closer, however there are still some areas where the sparse kernels are about 30-50% slower than the dense ones.
For example, with the settings above and M=768, N=4096, K=4096, my best runtime for dense is 0.105734 and my best runtime for sparse is 0.150016. This is on A100. Is this still the expected result?
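To put the two timings quoted above side by side, here is a small sketch. It assumes the reported runtimes are in milliseconds (the post does not state a unit) and uses the conventional 2*M*N*K operation count for a GEMM; the function names are illustrative only and are not part of CUTLASS.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Dense-equivalent FLOP count for an M x N x K GEMM (one multiply and one add per MAC)."""
    return 2 * m * n * k


def slowdown(sparse_runtime: float, dense_runtime: float) -> float:
    """Ratio of sparse to dense runtime; values above 1.0 mean sparse is slower."""
    return sparse_runtime / dense_runtime


def tflops(flops: int, runtime_ms: float) -> float:
    """Throughput in TFLOP/s, ASSUMING the runtime is given in milliseconds."""
    return flops / (runtime_ms * 1e-3) / 1e12


if __name__ == "__main__":
    flops = gemm_flops(768, 4096, 4096)
    dense_ms, sparse_ms = 0.105734, 0.150016  # timings quoted in the post
    print(f"sparse/dense runtime ratio: {slowdown(sparse_ms, dense_ms):.3f}")
    print(f"dense  throughput (assumed ms): {tflops(flops, dense_ms):.1f} TFLOP/s")
    print(f"sparse throughput (assumed ms): {tflops(flops, sparse_ms):.1f} TFLOP/s")
```

With these numbers the sparse run takes roughly 1.42x as long as the dense run, which is consistent with the "30-50% slower" range mentioned above.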
Could you please paste the logs of the best kernel of sparse and dense on A100?
I have the result from the profiler in a csv in this gist for the best one for M=768, N=4096, K=4096.
Your sparse kernel is running a small 64x128 tile, while your dense kernel is running a large 256x128 tile. Have you tried 256x128, 128x256, or 128x128 for sparse?
The tile size refers to cta_{m, n, k}, correct? If so, my first thought is that cutlass_profiler should run every configuration it has for the settings I selected, since I didn't restrict any of them explicitly; however, it doesn't. I went back and tried setting one manually, e.g. --cta_m=128, but that ended up not running anything (which I believe means the kernel doesn't exist or can't be found?). This is the same for f16 and bf16.
I believe I should have installed all the possible kernels, since my install process was the following:
cmake .. -DCUTLASS_NVCC_ARCHS=80
make cutlass_profiler -j12
and I didn't select any specific kernels to exclude/include.
You could add -DCUTLASS_LIBRARY_KERNELS=bf16_s16832spgemm_bf16_*_tt_align8 to cmake to enable all sparse GEMM tile sizes.
Using that flag produces no new kernels: cutlass_profiler reports no additional results.
What do you see when you run this command?
ls $BUILD/tools/library/generated/gemm/80/bf16_s16832spgemm_bf16/
This is what I have:
[haichengw@computelab-build-2 build_90a]$ ls tools/library/generated/gemm/80/bf16_s16832spgemm_bf16/
all_sm80_bf16_s16832spgemm_bf16_gemm_operations.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x128_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x128_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x256_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x64_128x4_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x64_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_256x128_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_256x64_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_256x64_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x256_64x4_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x64_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x64_64x4_tt_align8.cu
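The generated file names in this listing appear to encode the tile configuration. As a rough sketch, assuming the `<M>x<N>_<K>x<stages>` part of each name corresponds to cta_m, cta_n, cta_k, and the pipeline stage count (an interpretation of the naming, not something stated in this thread), a few lines of Python can turn the listing into tile shapes to compare against what the profiler actually runs. `TileConfig` and `parse_tile` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional


class TileConfig(NamedTuple):
    cta_m: int
    cta_n: int
    cta_k: int
    stages: int


# Matches the "_128x64_64x3_" style segment in a generated kernel file name.
_TILE_RE = re.compile(r"_(\d+)x(\d+)_(\d+)x(\d+)_")


def parse_tile(filename: str) -> Optional[TileConfig]:
    """Extract the assumed (cta_m, cta_n, cta_k, stages) from a kernel file name.

    Returns None for names without a tile segment, such as the
    all_sm80_..._gemm_operations.cu aggregate file.
    """
    match = _TILE_RE.search(filename)
    if match is None:
        return None
    return TileConfig(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    names = [
        "all_sm80_bf16_s16832spgemm_bf16_gemm_operations.cu",
        "cutlass_tensorop_bf16_s16832spgemm_bf16_128x64_64x3_tt_align8.cu",
        "cutlass_tensorop_bf16_s16832spgemm_bf16_256x128_64x3_tt_align8.cu",
    ]
    for name in names:
        print(name, "->", parse_tile(name))
```

Feeding it the output of the ls command above gives the set of tile shapes that were generated, which can then be checked against the cta_m/cta_n values that show up in the profiler results.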
I have the same contents as you. However, the profiler only seems to run the 64x128 kernels, even when I don't specify a cta size.
I am running exactly
../tools/profiler/cutlass_profiler --operation=spgemm --m=768 --n=4096 --k=4096 --A=bf16 --B=bf16 --C=bf16 --D=bf16 --batch_count=1 --profiling-iterations=5 --warmup-iterations=1 --output=report.csv
as per my original post
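To check which tile shapes actually appear in the profiler report, the CSV can be summarised with a short script. This is only a sketch: the column names `Operation`, `Runtime`, `cta_m`, and `cta_n` are assumptions about the report header and should be checked against the real file, and `best_by_tile` is a hypothetical helper. The sample rows in the usage section are synthetic placeholders, not measured results.

```python
import csv
import io


def best_by_tile(csv_text: str) -> dict:
    """Map (cta_m, cta_n) to the (runtime, operation) of the fastest row for that tile.

    Rows with missing or non-numeric fields are skipped.
    """
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            tile = (int(row["cta_m"]), int(row["cta_n"]))
            runtime = float(row["Runtime"])
            operation = row["Operation"]
        except (KeyError, ValueError, TypeError):
            continue  # header mismatch or malformed row
        if tile not in best or runtime < best[tile][0]:
            best[tile] = (runtime, operation)
    return best


if __name__ == "__main__":
    # Synthetic example rows with made-up kernel names and runtimes.
    sample = (
        "Operation,cta_m,cta_n,Runtime\n"
        "kernel_a,64,128,0.20\n"
        "kernel_b,64,128,0.15\n"
        "kernel_c,128,128,0.12\n"
    )
    for tile, (runtime, operation) in sorted(best_by_tile(sample).items()):
        print(tile, runtime, operation)
```

If only one (cta_m, cta_n) pair shows up for the bf16 sparse kernels, that confirms the profiler is not running the other generated tile sizes.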
What if you just run ../tools/profiler/cutlass_profiler --m=768 --n=4096 --k=4096?
When I run that, the only sparse GEMMs that don't have --cta_m=64 --cta_n=128 are those with f32, s4, or s8 inputs.
You need to use the tn (row x col) layout for f32, s4, and s8.
@jimwu6 have your issues been resolved?