CUDA GEMM Optimization

Introduction

This repository contains CUDA kernels for general matrix-matrix multiplication (GEMM) and the corresponding performance analysis. The kernels are correct for any matrix size. Their parameters are lightly tuned for a 4096 x 4096 x 4096 GEMM on an NVIDIA GeForce RTX 3090 GPU. The kernels should be compatible with any NVIDIA GPU with compute capability 7.0 or higher.

Usage

Docker is used to build and run the CUDA kernels. The custom Docker image is based on the NVIDIA NGC CUDA 12.2.2 Docker image.

If the host computer has a different CUDA version, adjust the CUDA version of the base Docker image to match. Otherwise, compilation errors and runtime errors may occur.

Build Docker Images

To build the custom Docker image, please run the following command.

$ docker build -f docker/gemm-cuda.Dockerfile --no-cache --tag=gemm-cuda:12.2.2 .

Run Docker Container

To run the custom Docker container, please run the following command.

$ docker run -it --rm --gpus device=0 -v $(pwd):/mnt gemm-cuda:12.2.2

To profile the CUDA kernels with NVIDIA Nsight Compute, add the flags --cap-add=SYS_ADMIN and --security-opt seccomp=unconfined to the docker run command above.

Build CUDA Kernels

To build the CUDA kernels, please run the following commands inside the Docker container.

$ cmake -B build
$ cmake --build build --config Release --parallel
$ cmake --install build

Run CUDA Kernels

To run the FP32 and FP16 GEMM CUDA kernels, please run the following commands inside the Docker container.

$ ./build/src/profile_cuda_gemm_fp32
$ ./build/src/profile_cuda_gemm_fp16

Performance

All experiments were conducted on a single NVIDIA GeForce RTX 3090 GPU. Measured performance can vary between runs, sometimes by up to 25%.
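As a rough aid for reading the TFLOPS figures below, the following sketch converts between kernel latency and throughput. It assumes the common convention of counting 2 x M x N x K floating-point operations per GEMM (one multiply and one add per inner-product term); the README does not state which convention or which problem size the profilers use, so both are assumptions here. The function names are hypothetical and not part of this repository.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS for an m x n x k GEMM that took `seconds`.

    Assumes 2 * m * n * k floating-point operations per GEMM
    (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12


def gemm_seconds(m: int, n: int, k: int, tflops: float) -> float:
    """Inverse of gemm_tflops: latency implied by a TFLOPS figure."""
    return 2.0 * m * n * k / (tflops * 1e12)


if __name__ == "__main__":
    # Hypothetical usage: latency implied by a reported throughput,
    # assuming the measurement used a 4096 x 4096 x 4096 problem.
    latency = gemm_seconds(4096, 4096, 4096, 24.5971)
    print(f"{latency * 1e3:.2f} ms")
```

Under these assumptions, a kernel reported at about 24.6 TFLOPS on a 4096 x 4096 x 4096 problem takes roughly 5.6 ms per call, so a 25% swing between runs corresponds to about a millisecond.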

FP32 GEMM

None of the FP32 GEMM kernels utilize the NVIDIA Tensor Cores.

| GEMM Kernel | TFLOPS | Kernel Description |
|---|---|---|
| cuBLAS GEMM Kernel | 24.5971 | cuBLAS implementation |
| Custom GEMM Kernel V00 | 0.278129 | Non-coalesced global memory access |
| Custom GEMM Kernel V01 | 1.7218 | Coalesced global memory access |
| Custom GEMM Kernel V02 | 2.66157 | 2D block tiling |
| Custom GEMM Kernel V02 Vectorized | 1.90514 | 2D block tiling with vectorized memory access |
| Custom GEMM Kernel V03 | 8.91318 | 2D block tiling and 1D thread tiling |
| Custom GEMM Kernel V03 Vectorized | 4.04796 | 2D block tiling and 1D thread tiling with vectorized memory access |
| Custom GEMM Kernel V04 | 13.0247 | 2D block tiling and 2D thread tiling |
| Custom GEMM Kernel V04 Vectorized | 15.027 | 2D block tiling and 2D thread tiling with vectorized memory access |
| Custom GEMM Kernel V05 | 11.1448 | 2D block tiling and 2D thread tiling and matrix transpose |
| Custom GEMM Kernel V05 Vectorized | 19.6688 | 2D block tiling and 2D thread tiling and matrix transpose with vectorized memory access |
| Custom GEMM Kernel V06 | 11.0703 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose |
| Custom GEMM Kernel V06 Vectorized | 20.1649 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose with vectorized memory access |
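The gap between V00 and V01 in the table above comes down to how the threads of a warp are mapped onto the output matrix. The sketch below is a small host-side model of that effect, not the repository's kernel code. It assumes a row-major FP32 matrix, a warp of 32 threads, and 128-byte memory segments (32 floats), and it assumes that the non-coalesced variant maps consecutive threads to consecutive rows while the coalesced variant maps them to consecutive columns. All names are hypothetical.

```python
WARP_SIZE = 32
FLOATS_PER_SEGMENT = 32  # assumed 128-byte segment / 4-byte float


def segments_touched(n_cols, lane_to_row_col):
    """Count distinct memory segments one warp touches in a row-major matrix.

    `lane_to_row_col` maps a lane index (0..31) to the (row, col)
    element that the thread accesses.
    """
    segments = set()
    for lane in range(WARP_SIZE):
        row, col = lane_to_row_col(lane)
        linear_index = row * n_cols + col  # row-major element offset
        segments.add(linear_index // FLOATS_PER_SEGMENT)
    return len(segments)


def lane_to_row(lane):
    """Assumed non-coalesced mapping: consecutive lanes walk down a column."""
    return (lane, 0)


def lane_to_col(lane):
    """Assumed coalesced mapping: consecutive lanes walk along a row."""
    return (0, lane)


if __name__ == "__main__":
    print(segments_touched(4096, lane_to_row))
    print(segments_touched(4096, lane_to_col))
```

In this model, a warp touching a 4096-column matrix hits 32 separate segments with the row mapping and a single segment with the column mapping, which is the kind of difference that coalescing is meant to remove.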

FP16 GEMM

The FP16 custom GEMM kernels V00 to V06 do not utilize the NVIDIA Tensor Cores. The FP16 cuBLAS GEMM kernel and the custom GEMM kernel V07 utilize the NVIDIA Tensor Cores.

| GEMM Kernel | TFLOPS | Kernel Description |
|---|---|---|
| cuBLAS GEMM Kernel | 138.955 | cuBLAS implementation |
| Custom GEMM Kernel V00 | 0.284095 | Non-coalesced global memory access |
| Custom GEMM Kernel V01 | 1.7316 | Coalesced global memory access |
| Custom GEMM Kernel V02 | 2.46677 | 2D block tiling |
| Custom GEMM Kernel V02 Vectorized | 1.93088 | 2D block tiling with vectorized memory access |
| Custom GEMM Kernel V03 | 8.67563 | 2D block tiling and 1D thread tiling |
| Custom GEMM Kernel V03 Vectorized | 2.14047 | 2D block tiling and 1D thread tiling with vectorized memory access |
| Custom GEMM Kernel V04 | 20.2746 | 2D block tiling and 2D thread tiling |
| Custom GEMM Kernel V04 Vectorized | 22.9001 | 2D block tiling and 2D thread tiling with vectorized memory access |
| Custom GEMM Kernel V05 | 18.3736 | 2D block tiling and 2D thread tiling and matrix transpose |
| Custom GEMM Kernel V05 Vectorized | 27.962 | 2D block tiling and 2D thread tiling and matrix transpose with vectorized memory access |
| Custom GEMM Kernel V06 | 14.7622 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose |
| Custom GEMM Kernel V06 Vectorized | 28.4588 | 2D block tiling and 2D warp tiling and 2D thread tiling and matrix transpose with vectorized memory access |
| Custom GEMM Kernel V07 | 33.808 | 2D block tiling and 2D warp tiling and WMMA and matrix transpose |
| Custom GEMM Kernel V07 Vectorized | 46.7866 | 2D block tiling and 2D warp tiling and WMMA and matrix transpose with vectorized memory access |
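Most kernels in the tables above build on 2D block tiling. The sketch below is a back-of-the-envelope model of why tiling helps, not a description of the repository's implementation. It assumes square tiles that each thread block stages once per step along the K dimension (typically in shared memory), and it counts element loads from global memory only. The tile size of 32 in the usage example and the function names are hypothetical.

```python
def naive_global_loads(m: int, n: int, k: int) -> int:
    """Global-memory element loads when every output element reads its own
    row of A and column of B: 2 * k loads for each of the m * n outputs."""
    return 2 * m * n * k


def block_tiled_global_loads(m: int, n: int, k: int, tile: int) -> int:
    """Global-memory element loads with square tile x tile block tiling.

    Each block computes a tile x tile patch of C. For every step along K
    it loads one tile of A and one tile of B, which all threads in the
    block then reuse. Assumes tile divides m, n, and k for simplicity.
    """
    if m % tile or n % tile or k % tile:
        raise ValueError("this simplified model requires tile to divide m, n, k")
    blocks = (m // tile) * (n // tile)
    k_steps = k // tile
    loads_per_step = 2 * tile * tile  # one tile of A plus one tile of B
    return blocks * k_steps * loads_per_step


if __name__ == "__main__":
    naive = naive_global_loads(4096, 4096, 4096)
    tiled = block_tiled_global_loads(4096, 4096, 4096, tile=32)
    print(naive // tiled)
```

In this model the global-memory traffic drops by a factor equal to the tile size. Thread tiling and warp tiling apply the same reuse idea one level further down the memory hierarchy, which is consistent with the progression of the non-vectorized kernels from V02 to V04.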
