
MSplitGEMM

Large matrix multiplication in CUDA

Preface/Disclaimer:

This repository is a set of algorithms that perform multiplication of very large matrices using the cuBLAS library in CUDA. These algorithms are particularly useful when the multiplicand and product matrices are too large to fit in GPU memory.

Currently, this project has only been tested with square matrices, but it should be possible to perform multiplication of matrices of any legal input dimensions with a bit of fine tuning.

Introduction:

For matrices of ordinary size, cuBLAS alone is sufficient. For extremely large matrices that do not fit into GPU global memory, an alternative is to split the multiplicands into block matrices and perform the multiplication as shown in the figure below.

(Figure: block decomposition of the product, C_ij = sum_k A_ik * B_kj.)

This repository contains four algorithms that implement the above equation in different ways. Each algorithm has its own specific use cases, based on the advantages it provides.
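To make the block equation above concrete, here is a minimal host-side reference sketch of the decomposition. It assumes row-major storage, square N x N matrices, and a split factor s that divides N exactly; it is an illustration of the equation, not code from this repository.

```cuda
#include <stddef.h>

/* Reference sketch: with A, B and C split into s x s square blocks of edge
 * t = N/s, each block of the product is C_ij = sum over k of A_ik * B_kj.
 * The four algorithms below evaluate these per-block products with cuBLAS
 * on the GPU instead of on the host. */
void blocked_gemm_reference(const float *A, const float *B, float *C, int N, int s)
{
    const int t = N / s;                               /* block edge length */
    for (int bi = 0; bi < s; ++bi)
        for (int bj = 0; bj < s; ++bj) {
            /* zero block C_ij, then accumulate its s partial products */
            for (int i = 0; i < t; ++i)
                for (int j = 0; j < t; ++j)
                    C[(size_t)(bi * t + i) * N + bj * t + j] = 0.0f;
            for (int bk = 0; bk < s; ++bk)
                for (int i = 0; i < t; ++i)
                    for (int j = 0; j < t; ++j) {
                        float sum = 0.0f;
                        for (int k = 0; k < t; ++k)
                            sum += A[(size_t)(bi * t + i) * N + bk * t + k]
                                 * B[(size_t)(bk * t + k) * N + bj * t + j];
                        C[(size_t)(bi * t + i) * N + bj * t + j] += sum;
                    }
        }
}
```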

To Build/Run:

 make N will generate the executable for Algorithm N
 make 1 will generate Algorithm 1
 make 2 will generate Algorithm 2
 make 3 will generate Algorithm 3
 make 4 will generate Algorithm 4

Run the executable with a single integer command-line argument that specifies the size of the square matrices. The application supports testing with non-square matrices, but the algorithms do not yet support these operations.

Algorithm 1: Generation of one output submatrix at a time

This algorithm allocates memory for one submatrix of the first multiplicand and one submatrix of the second. It performs the single multiplication and copies the result back to the product matrix before moving on to the next computation. It uses the least memory of the four algorithms, but generates the most memory-transfer operations as the matrices are split into more submatrices.

GPU Memory Usage = (1/numSubmatrices^2 + 2/numSubmatrices) * N^2 * sizeof(type)

(Figure: illustration of Algorithm 1.)
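As a worked example of the formula above (a sketch added for illustration, assuming single-precision elements and a split factor of 4): for N = 40000, the three full matrices together need about 19.2 GB, while Algorithm 1's GPU working set is only about 3.6 GB.

```cuda
#include <stdio.h>

/* Footprint of Algorithm 1 according to the formula above:
 * (1/numSubmatrices^2 + 2/numSubmatrices) * N^2 * sizeof(type).
 * Function and variable names here are illustrative only. */
static double alg1_gpu_bytes(long long n, int s, size_t elemSize)
{
    double n2 = (double)n * (double)n;
    return (1.0 / ((double)s * s) + 2.0 / s) * n2 * (double)elemSize;
}

int main(void)
{
    long long n = 40000;   /* 40000 x 40000 single-precision matrices */
    int s = 4;             /* numSubmatrices in the formula above     */
    printf("full A+B+C footprint:      %.1f GB\n", 3.0 * n * n * sizeof(float) / 1e9);
    printf("Algorithm 1 GPU footprint: %.1f GB\n", alg1_gpu_bytes(n, s, sizeof(float)) / 1e9);
    return 0;
}
```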

Algorithm 2: Load one of the multiplicands entirely and compute each row group together

This algorithm allocates memory for one submatrix of the first multiplicand and the entire second multiplicand. Using this additional memory, we are able to launch all of the kernels for one row group at once. This allows for better performance at the cost of additional memory requirements.

GPU Memory Usage = (1 + 2/numSubmatrices) * N^2 * sizeof(type)

(Figure: illustration of Algorithm 2.)
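A minimal sketch of this pattern, under simplifying assumptions of my own: row-major single-precision matrices, a row-group count that divides N exactly, and each row group's work collapsed into a single GEMM call for brevity (the operand swap in the cublasSgemm call is the usual way to use column-major cuBLAS on row-major data). The function and variable names are not taken from the repository.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Keep the entire second multiplicand resident on the GPU and stream row
 * groups of the first multiplicand (and of the product) through it, which
 * matches the (1 + 2/numSubmatrices) * N^2 memory figure above. */
void multiply_row_groups(const float *A, const float *B, float *C,
                         int N, int numSubmatrices)
{
    const int m = N / numSubmatrices;          /* rows per group */
    const float alpha = 1.0f, beta = 0.0f;

    cublasHandle_t handle;
    cublasCreate(&handle);

    float *d_B, *d_Agrp, *d_Cgrp;
    cudaMalloc(&d_B,    (size_t)N * N * sizeof(float));  /* all of B:      N^2     */
    cudaMalloc(&d_Agrp, (size_t)m * N * sizeof(float));  /* one row group: N^2 / s */
    cudaMalloc(&d_Cgrp, (size_t)m * N * sizeof(float));  /* one row group: N^2 / s */
    cudaMemcpy(d_B, B, (size_t)N * N * sizeof(float), cudaMemcpyHostToDevice);

    for (int i = 0; i < numSubmatrices; ++i) {
        cudaMemcpy(d_Agrp, A + (size_t)i * m * N,
                   (size_t)m * N * sizeof(float), cudaMemcpyHostToDevice);
        /* Row-major C_grp(m x N) = A_grp(m x N) * B(N x N) */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, m, N, &alpha, d_B, N, d_Agrp, N, &beta, d_Cgrp, N);
        cudaMemcpy(C + (size_t)i * m * N, d_Cgrp,
                   (size_t)m * N * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_B); cudaFree(d_Agrp); cudaFree(d_Cgrp);
    cublasDestroy(handle);
}
```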

Algorithm 3: Multithreaded streaming version of Algorithm 1

Algorithm 3 is Algorithm 1 pipelined with CUDA streams, with each host thread in the operation owning a corresponding CUDA stream. This improves the performance of Algorithm 1 in most cases, provided the number of threads and streams is well matched to the number of submatrices and the size of the problem. A downside is that this algorithm uses more memory than Algorithm 1 on both the host and the GPU, because each thread or stream needs its own context to work in.

GPU and Additional Host Memory Usage = (numStreams/numSubmatrices^2 + (1 + numStreams)/numSubmatrices) * N^2 * sizeof(type)

(Figure: illustration of Algorithm 3.)
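The thread-per-stream pairing described above can be sketched as follows. This is only the scaffolding of the idea (each worker gets its own cudaStream_t and its own cuBLAS handle bound to that stream), with the actual tile transfers and GEMMs elided; it is not the repository's code.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

/* Each host thread owns one CUDA stream and one cuBLAS handle bound to that
 * stream, so the copies and GEMMs it issues are independent of the other
 * threads' work and can be pipelined by the driver. */
static void worker(int id, int numWorkers)
{
    (void)id; (void)numWorkers;        /* would be used to pick this worker's blocks */

    cudaStream_t stream;
    cublasHandle_t handle;
    cudaStreamCreate(&stream);
    cublasCreate(&handle);
    cublasSetStream(handle, stream);   /* all GEMMs from this thread run in its stream */

    /* ... allocate this worker's submatrix buffers, then for each assigned
     * block: cudaMemcpyAsync(..., stream); cublasSgemm(handle, ...);
     * cudaMemcpyAsync(..., stream); ... */
    cudaStreamSynchronize(stream);

    cublasDestroy(handle);
    cudaStreamDestroy(stream);
}

void launch_workers(int numThreads)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back(worker, t, numThreads);
    for (auto &w : workers) w.join();
}
```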

Algorithm 4: Streaming version of Algorithm 1

Algorithm 4 does not use multiple threads on the CPU; instead, it exploits the natural parallelism of CUDA streams to create the effect of multiple CPU threads.

GPU and Additional Host Memory Usage = (numStreams/numSubmatrices^2 + (1 + numStreams)/numSubmatrices) * N^2 * sizeof(type)

(Figure: illustration of Algorithm 4.)
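A hedged sketch of this single-threaded streaming pattern: one host thread issues asynchronous copies and GEMMs round-robin across several CUDA streams, so transfers and compute from different row groups can overlap. For brevity this sketch keeps all of B resident (as in Algorithm 2) rather than reproducing Algorithm 1's exact block scheme, and it assumes row-major single-precision matrices with pinned host buffers; the names and the round-robin assignment are my own.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void multiply_streamed(const float *A, const float *B, float *C,
                       int N, int numSubmatrices, int numStreams)
{
    const int m = N / numSubmatrices;                 /* rows per group */
    const float alpha = 1.0f, beta = 0.0f;
    const size_t grpBytes = (size_t)m * N * sizeof(float);

    cublasHandle_t handle;
    cublasCreate(&handle);

    float *d_B;
    cudaMalloc(&d_B, (size_t)N * N * sizeof(float));
    cudaMemcpy(d_B, B, (size_t)N * N * sizeof(float), cudaMemcpyHostToDevice);

    /* One stream plus one pair of staging buffers per pipeline slot. */
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<float*> d_Agrp(numStreams), d_Cgrp(numStreams);
    for (int j = 0; j < numStreams; ++j) {
        cudaStreamCreate(&streams[j]);
        cudaMalloc(&d_Agrp[j], grpBytes);
        cudaMalloc(&d_Cgrp[j], grpBytes);
    }

    /* A and C should live in pinned (cudaHostAlloc'd) memory for the async
     * copies to actually overlap with compute. */
    for (int i = 0; i < numSubmatrices; ++i) {
        const int j = i % numStreams;                 /* round-robin stream choice */
        cudaMemcpyAsync(d_Agrp[j], A + (size_t)i * m * N, grpBytes,
                        cudaMemcpyHostToDevice, streams[j]);
        cublasSetStream(handle, streams[j]);          /* this GEMM runs in stream j */
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, m, N, &alpha, d_B, N, d_Agrp[j], N, &beta, d_Cgrp[j], N);
        cudaMemcpyAsync(C + (size_t)i * m * N, d_Cgrp[j], grpBytes,
                        cudaMemcpyDeviceToHost, streams[j]);
        /* Work within a stream is ordered, so reusing stream j's buffers on a
         * later pass is safe without an explicit synchronize here. */
    }

    cudaDeviceSynchronize();
    for (int j = 0; j < numStreams; ++j) {
        cudaStreamDestroy(streams[j]);
        cudaFree(d_Agrp[j]); cudaFree(d_Cgrp[j]);
    }
    cudaFree(d_B);
    cublasDestroy(handle);
}
```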
