I am benchmarking sparse and dense GEMMs through the cutlass profiler. I am seeing tha

[QST] Sparse GEMM runs much worse than Dense GEMM in some cases about cutlass HOT 15 OPEN

jimwu6 commented on June 10, 2024

[QST] Sparse GEMM runs much worse than Dense GEMM in some cases

from cutlass.

Comments (15)

thakkarV commented on June 10, 2024

This is expected on H100, as we have optimized H100 kernels for dense GEMM but not yet for sparse. The sparse kernels you're running are almost certainly Ampere ones recompiled for Hopper. Hopper sparse kernels are on our roadmap for sometime around the middle of 2024.

jimwu6 commented on June 10, 2024

Thanks for the response, that makes sense. I've rerun the benchmarks on A100 and the results are closer; however, there are still some cases where the sparse kernels are about 30-50% slower than the dense ones.

E.g. with the settings above and M=768, N=4096, K=4096, my best runtime for dense is 0.105734 and my best runtime for sparse is 0.150016. This is on A100. Is this still the expected result?
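
For reference, the gap between those two runtimes can be worked out directly. This is a minimal sketch that assumes both numbers are in the same unit; `slowdown_pct` is a hypothetical helper name and not part of CUTLASS.

```shell
# Relative slowdown of the sparse kernel versus the dense one, in percent.
# Both runtimes must be in the same unit; the unit itself cancels out.
slowdown_pct() {
  # $1 = dense runtime, $2 = sparse runtime
  awk -v d="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (s / d - 1) * 100 }'
}

slowdown_pct 0.105734 0.150016   # prints 41.9
```

So for this problem size the sparse kernel is roughly 42% slower, which sits inside the 30-50% range mentioned above.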

thakkarV commented on June 10, 2024

@hwu36

hwu36 commented on June 10, 2024

Could you please paste the logs of the best kernel of sparse and dense on A100?

jimwu6 commented on June 10, 2024

I have the result from the profiler in a csv in this gist for the best one for M=768, N=4096, K=4096.

hwu36 commented on June 10, 2024

Your sparse one is running a small 64x128 tile, while your dense one is running a big 256x128 tile. Have you tried 256x128, 128x256, or 128x128 for sparse?

jimwu6 commented on June 10, 2024

The tile size refers to cta_{m, n, k}, correct? If so, my first thought is that the cutlass_profiler should run every configuration it has for the settings I selected, and since I didn't explicitly specify any tile size it should run all of them. However, it doesn't. I went back and tried setting it manually, e.g. --cta_m=128, but that ended up not running anything (which I believe means that the kernel doesn't exist or can't be found?). This is the same for f16 and bf16.

I believe I should have built all the possible kernels, since my build process was the following:

cmake .. -DCUTLASS_NVCC_ARCHS=80
make cutlass_profiler -j12

and I didn't select any specific kernels to exclude/include.

hwu36 commented on June 10, 2024

You could add -DCUTLASS_LIBRARY_KERNELS=bf16_s16832spgemm_bf16_*_tt_align8 to the cmake command to enable all sparse GEMM tile sizes.

jimwu6 commented on June 10, 2024

Using that flag produces no new kernels; I verified this by seeing no additional results in the cutlass_profiler.

hwu36 commented on June 10, 2024

what do you see when you run this command

ls $BUILD/tools/library/generated/gemm/80/bf16_s16832spgemm_bf16/

this is what I have

[haichengw@computelab-build-2 build_90a]$ ls tools/library/generated/gemm/80/bf16_s16832spgemm_bf16/
all_sm80_bf16_s16832spgemm_bf16_gemm_operations.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x128_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x128_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x256_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x64_128x4_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_128x64_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_256x128_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_256x64_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_256x64_64x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x256_64x4_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x64_128x3_tt_align8.cu
cutlass_tensorop_bf16_s16832spgemm_bf16_64x64_64x4_tt_align8.cu
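
To compare a listing like this against what the profiler reports, it can help to pull the tile shape out of each file name. This is a rough sketch, assuming the names follow the pattern `<cta_m>x<cta_n>_<cta_k>x<stages>` as the listing above suggests; `tile_of` is a hypothetical helper, not something CUTLASS ships.

```shell
# Print "cta_m cta_n cta_k stages" for one generated kernel file name.
# Assumes the name contains a "_<M>x<N>_<K>x<stages>_" segment.
# Names without that segment (e.g. the all_sm80_* file) print nothing.
tile_of() {
  basename "$1" .cu |
    sed -nE 's/.*_([0-9]+)x([0-9]+)_([0-9]+)x([0-9]+)_.*/\1 \2 \3 \4/p'
}

tile_of cutlass_tensorop_bf16_s16832spgemm_bf16_128x64_64x3_tt_align8.cu
# prints: 128 64 64 3
```

Running it over every file in the directory, for example with `for f in "$BUILD"/tools/library/generated/gemm/80/bf16_s16832spgemm_bf16/*.cu; do tile_of "$f"; done`, gives the set of tile shapes that were generated.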

jimwu6 commented on June 10, 2024

I have the same contents as you. However, the profiler only seems to run the 64x128 tile, even when I don't specify cta.
I am running exactly

../tools/profiler/cutlass_profiler --operation=spgemm --m=768 --n=4096 --k=4096 --A=bf16 --B=bf16 --C=bf16 --D=bf16 --batch_count=1 --profiling-iterations=5 --warmup-iterations=1 --output=report.csv

as per my original post
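
One way to see which kernel actually won in a report written with `--output` is to scan the CSV for the smallest runtime. This is a minimal sketch, assuming the report has a header row with columns named `Operation` and `Runtime` (check the header of your own file); `best_kernel` is a hypothetical helper name.

```shell
# Print the operation name and runtime of the fastest row in a profiler CSV.
# Assumes a header row containing "Operation" and "Runtime" columns.
best_kernel() {
  # $1 = path to a CSV report written by the profiler
  awk -F, '
    NR == 1 {
      # Locate columns by header name instead of hard-coding positions.
      for (i = 1; i <= NF; i++) {
        if ($i == "Operation") op = i
        if ($i == "Runtime")   rt = i
      }
      next
    }
    # Skip rows without a positive runtime, then keep the smallest one.
    op && rt && $rt + 0 > 0 && (name == "" || $rt + 0 < best) {
      best = $rt + 0; raw = $rt; name = $op
    }
    END { if (name != "") print name, raw }
  ' "$1"
}
```

Call it as `best_kernel <file>`, passing whichever CSV file the profiler actually wrote. The operation name carries the tile shape, so the output shows directly whether a larger tile was ever in the running.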

hwu36 commented on June 10, 2024

What if you just run ../tools/profiler/cutlass_profiler --m=768 --n=4096 --k=4096

jimwu6 commented on June 10, 2024

When I run that, the only sparse GEMMs that don't have --cta_m=64 --cta_n=128 are those with f32, s4, or s8 inputs.
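
A check like this can be done mechanically by counting the tile shapes that appear in the profiler's CSV report. This is a small sketch, assuming the report has header columns named `cta_m` and `cta_n`; `tiles_seen` is a hypothetical helper name.

```shell
# List each distinct cta_m x cta_n tile in a profiler CSV with its row count.
# Assumes a header row containing "cta_m" and "cta_n" columns.
tiles_seen() {
  # $1 = path to a CSV report written by the profiler
  awk -F, '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "cta_m") m = i
        if ($i == "cta_n") n = i
      }
      next
    }
    m && n { count[$m "x" $n]++ }
    END { for (t in count) print t, count[t] }
  ' "$1" | sort
}
```

If only `64x128` shows up for the bf16 rows, the larger tiles were never launched for that data type and layout combination.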

hwu36 commented on June 10, 2024

You need to use the tn (row x col) layout for f32, s4, and s8.

mnicely commented on June 10, 2024

@jimwu6 have your issues been resolved?
