


SPLA - Specialized Parallel Linear Algebra

SPLA provides specialized functions for linear algebra computations with a C++ and C interface, which are inspired by requirements in computational material science codes.

Currently, SPLA provides functions for distributed matrix multiplications with specific matrix distributions, which cannot be used directly with a ScaLAPACK interface. All computations can optionally utilize GPUs through CUDA or ROCm, where matrices can be located either in host or device memory.


Functionality

Local GEMM

The function gemm(...) computes a local general matrix product, similar to cuBLASXt. If GPU support is enabled, the function may take any combination of host and device pointers. In addition, it may use custom multi-threading for computations on the host if the provided BLAS library does not support multi-threading.
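
Below is a minimal sketch of such a call. The exact signature of gemm(...) is an assumption here, modeled on the BLAS convention and the pgemm_ssb example further down; consult the API documentation for the authoritative interface.

#include <vector>

#include "spla/spla.hpp"

int main() {
  const int m = 64, n = 64, k = 64;
  std::vector<double> A(m * k), B(k * n), C(m * n);  // host buffers

  // With a GPU-enabled build, any of the three matrix pointers may also point
  // to device memory (e.g. allocated with cudaMalloc / hipMalloc); here the
  // computation is kept on the host.
  spla::Context ctx(SPLA_PU_HOST);

  // Assumed BLAS-like interface: C <- alpha * op(A) * op(B) + beta * C.
  spla::gemm(SPLA_OP_NONE, SPLA_OP_NONE, m, n, k, 1.0, A.data(), m,
             B.data(), k, 0.0, C.data(), m, ctx);
  return 0;
}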

Stripe-Stripe-Block

The pgemm_ssb(...) function computes

![pgemm_ssb formula](docs/images/ssb_formula.svg)

where matrices A and B are stored in a "stripe" distribution with variable block length. Matrix C can be in any supported block distribution, including the block-cyclic ScaLAPACK layout. Matrix A may be read as transposed or conjugate transposed.
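
In plain element-wise notation, the operation above roughly corresponds to the following (shown for the transpose case; the conjugate-transpose case additionally conjugates the entries of A):

$$C_{r_0+i,\,c_0+j} \;\leftarrow\; \beta\, C_{r_0+i,\,c_0+j} \;+\; \alpha \sum_{k} A_{k,i}\, B_{k,j}$$

Here the summation index k only runs over the rows of A and B stored locally on each rank, so the partial results of all ranks have to be reduced, and (r_0, c_0) are the row and column offsets into C passed to the function (set to 0, 0 in the example at the end of this README).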

For the computation of triangular block-distributed matrices, the pgemm_ssbtr(...) function is available, which allows specifying the fill mode of C.

Stripe-Block-Stripe

The pgemm_sbs(...) function computes

C ← α · A · B + β · C

where matrices A and C are stored in a "stripe" distribution with variable block length. Matrix B can be in any supported block distribution, including the block-cyclic ScaLAPACK layout.
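
In element-wise notation, this roughly corresponds to

$$C_{i,j} \;\leftarrow\; \beta\, C_{i,j} \;+\; \alpha \sum_{k} A_{i,k}\, B_{k,j}$$

where each rank stores a stripe of rows of A and the corresponding rows of C, while B is block distributed. The parts of B needed by each rank are therefore made available through communication; as described under Implementation Details, this is done with a broadcast in a ring.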

Documentation

Documentation can be found here.

Installation

The build system follows the standard CMake workflow. Example:

mkdir build
cd build
cmake .. -DSPLA_OMP=ON -DSPLA_GPU_BACKEND=CUDA -DCMAKE_INSTALL_PREFIX=${path_to_install_to}
make -j8 install

CMake options

| Option              | Values                                                 | Default | Description                           |
|---------------------|--------------------------------------------------------|---------|---------------------------------------|
| SPLA_OMP            | ON, OFF                                                | ON      | Enable multi-threading with OpenMP    |
| SPLA_HOST_BLAS      | AUTO, MKL, OPENBLAS, BLIS, CRAY_LIBSCI, ATLAS, GENERIC | AUTO    | BLAS library for computations on host |
| SPLA_GPU_BACKEND    | OFF, CUDA, ROCM                                        | OFF     | Select GPU backend                    |
| SPLA_BUILD_TESTS    | ON, OFF                                                | OFF     | Build test executables                |
| SPLA_BUILD_EXAMPLES | ON, OFF                                                | OFF     | Build examples                        |
| SPLA_INSTALL        | ON, OFF                                                | ON      | Add library to install target         |
| SPLA_FORTRAN        | ON, OFF                                                | OFF     | Build Fortran module                  |

Implementation Details

The implementation is based on a ring communication pattern as described in the paper Accelerating large-scale excited-state GW calculations on leadership HPC systems by Mauro Del Ben et al. For distributed matrix-matrix multiplications with distributions as used in the pgemm_ssb function, each process contributes to the result of every element. Therefore, some form of reduction operation is required. Compared to other reduction schemes, a ring requires more communication volume. However, by splitting up the result and computing multiple reductions concurrently, all processes share the workload at every step, and more opportunities for communication-computation overlap arise. Let's consider the example of computing a block-cyclic distributed matrix with the pgemm_ssb function on four ranks. The following image illustrates how the matrices are distributed, with the numbers indicating the assigned rank of each block:

To compute the colored blocks using the ring communication pattern, each rank starts by computing a different block. The result is then sent to a neighbouring rank, while at the same time the next block is being computed. When the result of another rank is received, the local contribution is added and the block is sent onwards:

Ideally, the location of a block at the last step is also where it has to be written to the output. However, internally SPLA uses independent block sizes and a final redistribution step is required. This allows more flexibility, since for the communication pattern to work optimally, the number of blocks has to be a multiple of the number of ranks. In addition, to maximize overlap, SPLA may process several groups of blocks in parallel, interleaving steps of the ring pattern for each group. For the function pgemm_sbs, the pattern is applied in reverse, i.e. a broadcast operation is performed in a ring. The benefits are similar, but a redistribution step is required at the beginning instead, to allow for optimal internal block sizes.
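
To make the communication pattern concrete, the following is a minimal, self-contained MPI sketch of a ring reduction over as many blocks as ranks. It only illustrates the idea and is not SPLA's implementation, which additionally overlaps every step with local GEMM work, interleaves several block groups and performs the final redistribution described above.

#include <mpi.h>
#include <vector>

// Each rank owns a partial contribution to every one of `size` blocks.
// After size - 1 steps, rank r holds the fully reduced block (r + 1) % size,
// having only ever exchanged data with its two ring neighbours.
void ring_reduce_blocks(std::vector<std::vector<double>>& blocks, MPI_Comm comm) {
  int rank = 0, size = 1;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  if (size < 2) return;

  const int send_to = (rank + 1) % size;
  const int recv_from = (rank - 1 + size) % size;
  const int block_len = static_cast<int>(blocks[0].size());
  std::vector<double> recv_buf(block_len);

  for (int step = 0; step < size - 1; ++step) {
    // Forward the block accumulated so far and receive the neighbour's block.
    const int send_block = (rank - step + size) % size;
    const int recv_block = (rank - step - 1 + size) % size;
    MPI_Sendrecv(blocks[send_block].data(), block_len, MPI_DOUBLE, send_to, 0,
                 recv_buf.data(), block_len, MPI_DOUBLE, recv_from, 0, comm,
                 MPI_STATUS_IGNORE);
    // Add the local contribution to the received partial result
    // (SPLA overlaps the next block's computation with this communication).
    for (int i = 0; i < block_len; ++i) blocks[recv_block][i] += recv_buf[i];
  }
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Every rank contributes the value rank + 1 to each element of every block.
  std::vector<std::vector<double>> blocks(size, std::vector<double>(4, rank + 1.0));
  ring_reduce_blocks(blocks, MPI_COMM_WORLD);
  // blocks[(rank + 1) % size] now holds 1 + 2 + ... + size in every element.

  MPI_Finalize();
  return 0;
}

In SPLA, the blocks are sub-blocks of the output matrix C, and after the last step a redistribution maps them to the requested output distribution.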

Benchmark

The most commonly used API for distributed matrix multiplication is based on ScaLAPACK. To allow a comparison to other libraries, the parameters for the benchmark of the pgemm_ssb function are selected such that the same operation can be expressed in a p?gemm call through a ScaLAPACK interface. Matrices A and B are set to use a fixed block size, and matrix C uses the same one-dimensional processor grid as A and B. Two types of compute nodes on Piz Daint at CSCS were used:

|     | Piz Daint - Multi-Core                             | Piz Daint - GPU                            |
|-----|----------------------------------------------------|--------------------------------------------|
| CPU | 2 x Intel Xeon E5-2695 v4 @ 2.10GHz (2 x 18 cores) | Intel Xeon E5-2690 v3 @ 2.60GHz (12 cores) |
| GPU | -                                                  | NVIDIA Tesla P100 16GB                     |

The CPU benchmarks were run on the multi-core partition, with two MPI ranks per node, such that each process had a single CPU socket with 18 threads available. The GPU benchmarks were run on the GPU partition, with one MPI rank per node. The matrix sizes were selected to represent a "tall and skinny" case, as typically used in computational material science simulations.

The plots show the performance per node / socket, with the CPU benchmark on the left and the GPU benchmark on the right.
On CPU, performance is much better compared to Intel MKL and similar to COSMA, which is a library based on a communication-optimal algorithm. At a low number of MPI ranks, COSMA will use larger internal blocks compared to SPLA, which is constrained by the requirement to assign at least one block to each rank to form a ring communication pattern. Larger blocks allow for overall faster local computations, since a single zgemm call to the BLAS library is more efficient than multiple calls with smaller sizes. With an increasing number of nodes, the computational load per node decreases, and communication costs become more pronounced. While SPLA may theoretically have a higher communication volume than COSMA, it is able to hide the cost more effectively by only using direct communication with neighbours. Therefore, at higher node counts, it outperforms the other libraries in this case.

On GPU, computations tend to be much faster, so communication cost is even more important. For all node counts in this benchmark, SPLA significantly outperforms COSMA and LIBSCI_ACC (a library provided by Cray). Internally, SPLA is able to use multiple CUDA streams, which are only individually synchronized for communication through MPI. In contrast, COSMA uses multiple streams to compute a larger block, which are synchronized as a group for communication. Therefore, SPLA achieves much better overlap of computation and communication, in addition to overall faster individual communication steps. To compute the matrix multiplication, data on host is temporarily copied to the GPU. With an increasing number of nodes, the internal block size of SPLA decreases, which requires the same data to be copied more often. Hence, with data in host memory as input, the bandwidth between host and GPU memory becomes the limiting factor. A unique feature of SPLA is its ability to use data in device memory as input / output. In this case, the bandwidth is no longer a problem and communication cost is dominant. At 256 nodes, the message size for MPI communication falls below the threshold at which the MPI implementation on Piz Daint switches to a different protocol, leading to a significant performance jump. This is difficult to optimize for when selecting internal block sizes, but if optimum performance with fixed sizes is required, a user can specify a target block size to use on their system.

Example

This is an example in C++; for C and Fortran, check the examples folder.

#include <vector>
#include <cmath>
#include <mpi.h>

#include "spla/spla.hpp"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int world_size = 1;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  int m = 100;
  int n = 100;
  int k_local = 100;

  int block_size = 256;
  int proc_grid_rows = static_cast<int>(std::sqrt(world_size));
  int proc_grid_cols = world_size / proc_grid_rows;

  std::vector<double> A(m * k_local);
  std::vector<double> B(n * k_local);
  std::vector<double> C(m * n); // Allocate full C for simplicity
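  // In a real application, fill A and B with the local data here; they are
  // left zero-initialized in this example for brevity.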

  int lda = k_local;
  int ldb = k_local;
  int ldc = m;

  {
    // Create context, which holds any resources SPLA will require, allowing reuse between functions
    // calls. The given processing unit will be used for any computations.
    spla::Context ctx(SPLA_PU_HOST);

    // Create matrix distribution for C
    auto c_dist = spla::MatrixDistribution::create_blacs_block_cyclic(
        MPI_COMM_WORLD, 'R', proc_grid_rows, proc_grid_cols, block_size, block_size);
    // This is mostly equivalent to the following ScaLAPACK calls combined:
    /*
    int info = 0;
    int rsrc = 0;
    int csrc = 0;
    int blacs_ctx = Csys2blacs_handle(MPI_COMM_WORLD);
    Cblacs_gridinit(&blacs_ctx, 'R', proc_grid_rows, proc_grid_cols);
    int desc[9];
    descinit_(desc, &m, &n, &block_size, &block_size, &rsrc, &csrc, &blacs_ctx, &ldc, &info);
    */

    double alpha = 1.0;
    double beta = 0.0;

    // Compute parallel stripe-stripe-block matrix multiplication. To describe the stripe
    // distribution of matrices A and B, only the local k dimension is required.
    spla::pgemm_ssb(m, n, k_local, SPLA_OP_TRANSPOSE, alpha, A.data(), lda, B.data(), ldb, beta,
                    C.data(), ldc, 0, 0, c_dist, ctx);

  }  // Make sure context goes out of scope before MPI_Finalize() is called.

  MPI_Finalize();
  return 0;
}
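
The example can be compiled with an MPI C++ compiler wrapper (e.g. mpic++) and linked against the installed SPLA library. The install step above also places CMake config files next to the library, so downstream CMake projects can typically locate SPLA through find_package.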

Acknowledgements

This work was supported by:

ETH Zurich - Swiss Federal Institute of Technology in Zurich
CSCS - Swiss National Supercomputing Centre
MAX - MAterials design at the eXascale (Horizon2020, grant agreement MaX CoE, No. 824143)


spla's Issues

spla fails to find multi-threaded BLIS

Trying to build with Spack against a threaded BLIS (see spack/spack#24357) gives:

     12    -- Performing Test HAVE___RESTRICT__
     13    -- Performing Test HAVE___RESTRICT__ - Success
     14    -- Found MPI_CXX: /project/d110/timuel/spack/lib/spack/env/gcc/g++ (found version "3.1")
     15    -- Found MPI: TRUE (found version "3.1") found components: CXX
     16    -- Found OpenMP_CXX: -fopenmp (found version "4.5")
     17    -- Found OpenMP: TRUE (found version "4.5")
  >> 18    CMake Error at /apps/eiger/UES/jenkins/1.4.0/software/CMake/3.20.1/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
     19      Could NOT find BLIS (missing: BLIS_LIBRARIES)
     20    Call Stack (most recent call first):
     21      /apps/eiger/UES/jenkins/1.4.0/software/CMake/3.20.1/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
     22      cmake/modules/FindBLIS.cmake:74 (find_package_handle_standard_args)
     23      CMakeLists.txt:199 (find_package)
     24    

I guess because it is only looking for blis rather than also blis-mt.

Tests fail to link

[ 96%] Linking CXX executable run_tests
cd /disk-samsung/freebsd-ports/math/spla/work/.build/tests && /usr/local/bin/cmake -E cmake_link_script CMakeFiles/run_tests.dir/link.txt --verbose=1
/usr/bin/c++ -O2 -pipe -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fno-omit-frame-pointer -O2 -pipe -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fno-omit-frame-pointer  -Wl,-rpath=/usr/local/lib/gcc11  -L/usr/local/lib/gcc11 -B/usr/local/bin -fstack-protector-strong -Wl,-rpath=/usr/local/lib/gcc11 -Wl,-rpath -Wl,/usr/local/lib -L/usr/local/lib/gcc11 -fopenmp=libomp CMakeFiles/run_tests.dir/programs/run_tests.cpp.o CMakeFiles/run_tests.dir/test_pool_allocator.cpp.o CMakeFiles/run_tests.dir/test_gemm.cpp.o CMakeFiles/run_tests.dir/test_gemm_ssb.cpp.o CMakeFiles/run_tests.dir/test_gemm_sbs.cpp.o -o run_tests  -Wl,-rpath,/disk-samsung/freebsd-ports/math/spla/work/.build/src:/usr/local/lib ../src/libspla_test.so.1.5.4 ../_deps/googletest-build/googletest/libgtest_main.a /usr/lib/libdl.so /usr/local/lib/libmpicxx.so /usr/local/lib/libmpi.so /usr/lib/libomp.so /usr/local/lib/libopenblas.so ../_deps/googletest-build/googletest/libgtest.a -pthread /usr/local/lib/libscalapack.so 
/usr/local/bin/ld: CMakeFiles/run_tests.dir/programs/run_tests.cpp.o: in function `gtest_mpi::(anonymous namespace)::PrettyMPIUnitTestResultPrinter::OnTestCaseStart(testing::TestSuite const&)':
run_tests.cpp:(.text+0xb85): undefined reference to `testing::TestSuite::test_to_run_count() const'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/programs/run_tests.cpp.o: in function `gtest_mpi::(anonymous namespace)::PrettyMPIUnitTestResultPrinter::OnTestCaseEnd(testing::TestSuite const&)':
run_tests.cpp:(.text+0x2128): undefined reference to `testing::TestSuite::test_to_run_count() const'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/test_gemm.cpp.o: in function `testing::internal::ParameterizedTestSuiteInfo<GemmTest<double> >* testing::internal::ParameterizedTestSuiteRegistry::GetTestSuitePatternHolder<GemmTest<double> >(char const*, testing::internal::CodeLocation)':
test_gemm.cpp:(.text._ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI8GemmTestIdEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE[_ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI8GemmTestIdEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE]+0x23b): undefined reference to `testing::internal::ReportInvalidTestSuiteType(char const*, testing::internal::CodeLocation)'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/test_gemm.cpp.o: in function `testing::internal::ParameterizedTestSuiteInfo<GemmTest<std::__1::complex<double> > >* testing::internal::ParameterizedTestSuiteRegistry::GetTestSuitePatternHolder<GemmTest<std::__1::complex<double> > >(char const*, testing::internal::CodeLocation)':
test_gemm.cpp:(.text._ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI8GemmTestINSt3__17complexIdEEEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE[_ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI8GemmTestINSt3__17complexIdEEEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE]+0x23b): undefined reference to `testing::internal::ReportInvalidTestSuiteType(char const*, testing::internal::CodeLocation)'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/test_gemm_ssb.cpp.o: in function `testing::internal::ParameterizedTestSuiteInfo<GemmSSBTest<double> >* testing::internal::ParameterizedTestSuiteRegistry::GetTestSuitePatternHolder<GemmSSBTest<double> >(char const*, testing::internal::CodeLocation)':
test_gemm_ssb.cpp:(.text._ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSSBTestIdEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE[_ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSSBTestIdEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE]+0x23b): undefined reference to `testing::internal::ReportInvalidTestSuiteType(char const*, testing::internal::CodeLocation)'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/test_gemm_ssb.cpp.o: in function `testing::internal::ParameterizedTestSuiteInfo<GemmSSBTest<std::__1::complex<double> > >* testing::internal::ParameterizedTestSuiteRegistry::GetTestSuitePatternHolder<GemmSSBTest<std::__1::complex<double> > >(char const*, testing::internal::CodeLocation)':
test_gemm_ssb.cpp:(.text._ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSSBTestINSt3__17complexIdEEEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE[_ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSSBTestINSt3__17complexIdEEEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE]+0x23b): undefined reference to `testing::internal::ReportInvalidTestSuiteType(char const*, testing::internal::CodeLocation)'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/test_gemm_sbs.cpp.o: in function `testing::internal::ParameterizedTestSuiteInfo<GemmSBSTest<double> >* testing::internal::ParameterizedTestSuiteRegistry::GetTestSuitePatternHolder<GemmSBSTest<double> >(char const*, testing::internal::CodeLocation)':
test_gemm_sbs.cpp:(.text._ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSBSTestIdEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE[_ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSBSTestIdEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE]+0x23b): undefined reference to `testing::internal::ReportInvalidTestSuiteType(char const*, testing::internal::CodeLocation)'
/usr/local/bin/ld: CMakeFiles/run_tests.dir/test_gemm_sbs.cpp.o:test_gemm_sbs.cpp:(.text._ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSBSTestINSt3__17complexIdEEEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE[_ZN7testing8internal30ParameterizedTestSuiteRegistry25GetTestSuitePatternHolderI11GemmSBSTestINSt3__17complexIdEEEEEPNS0_26ParameterizedTestSuiteInfoIT_EEPKcNS0_12CodeLocationE]+0x23b): more undefined references to `testing::internal::ReportInvalidTestSuiteType(char const*, testing::internal::CodeLocation)' follow
c++: error: linker command failed with exit code 1 (use -v to see invocation)
*** [tests/run_tests] Error code 1

Version: 1.5.4
clang-14
FreeBSD 13.1

SPLA hangs on a 6x6 grid

For this test case, https://github.com/electronic-structure/SIRIUS/blob/develop/apps/tests/test_wf_inner.cpp, SPLA will hang with the following submission script:

#!/bin/bash -l
#SBATCH --job-name="test_scf"
#SBATCH --nodes=36
#SBATCH --time=00:04:00
#SBATCH --account=csstaff
#SBATCH -C gpu

export MPICH_MAX_THREAD_SAFETY=multiple
export CRAY_CUDA_MPS=0
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1

srun -N 36 -n 36 --hint=nomultithread --unbuffered -c 1 ./test_wf_inner  --M=2248 --N=2248 --BS=128 --mpi_grid=6:6

The number of OMP threads does not play a role; with 12 threads it also hangs.

BLAS vendor

SPLA will always link against MKL if it is present, for example also with spack install spla ^openblas. The Spack package.py should also be changed accordingly, such that the BLAS vendor is read from spec['blas'].name and forwarded to the SPLA CMake configuration.

SpLA-1.5: FindBLAS missing in installed SPLA*Config*.cmake

The installed CMake configs of SpLA-1.5 are missing a FindBLAS call at some point. This is what we get in our Docker build environment when trying to build SIRIUS against SpLA-1.5 (upgrading from 1.4.0), after fixing #17 by moving the modules/ directory manually:

[...]
-- Configuring done
CMake Error at src/CMakeLists.txt:98 (add_library):
  Target "sirius" links to target "BLAS::blas" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?


CMake Error at apps/atoms/CMakeLists.txt:1 (add_executable):
  Target "atom" links to target "BLAS::blas" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?


CMake Error at apps/hydrogen/CMakeLists.txt:1 (add_executable):
  Target "hydrogen" links to target "BLAS::blas" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?


CMake Error at apps/dft_loop/CMakeLists.txt:1 (add_executable):
  Target "sirius.scf" links to target "BLAS::blas" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?


CMake Error at apps/utils/CMakeLists.txt:1 (add_executable):
  Target "unit_cell_tools" links to target "BLAS::blas" but the target was
  not found.  Perhaps a find_package() call is missing for an IMPORTED
  target, or an ALIAS target is missing?


-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    CMAKE_CXXFLAGS_RELEASE
    SCALAPACK_INCLUDE_DIR


CMake Generate step failed.  Build files cannot be regenerated correctly.
