
cudalibrarysamples's Introduction

CUDA Library Samples

CUDA Library Samples contains examples demonstrating the use of features in the

  • math and image processing libraries,
  • cuBLAS,
  • cuTENSOR,
  • cuSPARSE,
  • cuSOLVER,
  • cuFFT,
  • cuRAND,
  • NPP,
  • nvJPEG
  • ...

About

The CUDA Library Samples are released by NVIDIA Corporation as Open Source software under the 3-clause "New" BSD license.

GPU Accelerated Libraries

Library Examples

Copyright

Copyright (c) 2022 NVIDIA CORPORATION AND AFFILIATES. All rights reserved.

  Redistribution and use in source and binary forms, with or without modification, are permitted
  provided that the following conditions are met:
      * Redistributions of source code must retain the above copyright notice, this list of
        conditions and the following disclaimer.
      * Redistributions in binary form must reproduce the above copyright notice, this list of
        conditions and the following disclaimer in the documentation and/or other materials
        provided with the distribution.
      * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
        to endorse or promote products derived from this software without specific prior written
        permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
  IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
  FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

cudalibrarysamples's People

Contributors

alexanderakalinkin, almogsegal, cole-brower, dorispnvidia, essex-edwards, fbusato, hhbayraktar, jszuppe, kvoronin, leofang, llukas, maheshkha, malmasri7, marsaev, mferreravila, mkhadatare, mmigdal-nv, mnicely, mrogowski, nvlcambier, pmajch, qanhpham, rsdubtso, springer13, zohebk-nv

cudalibrarysamples's Issues

Confusion about the sddmm example in the cuSparse library

Hello, I read the sddmm example and I am quite confused.
https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/sddmm_csr
There are two main questions:

  1. The dimension of C is m*n. A is a 4*4 matrix and B is a 4*3 matrix. Why does C end up as a 4*4 matrix instead of a 4*3 matrix?
  2. I calculated the result of A*B; it should be
     [[ 70  80  90]
      [158 184 210]
      [246 288 330]
      [324 381 438]]
     I don't know why sddmm gives the result shown in the example (see the sketch below).
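For reference, a minimal host-side sketch of what sddmm computes, assuming row-major dense A (m x k) and B (k x n) and a CSR output pattern: the sparse C only receives values at positions already present in its pattern, i.e. C = alpha * (A*B) o spy(C) + beta * C, so the result keeps the shape and sparsity pattern of C rather than the shape of a dense A*B. The function name sddmm_ref is illustrative, not part of the sample:

void sddmm_ref(int m, int k, int n,
               const float *A, const float *B,
               const int *C_rowptr, const int *C_colind, float *C_val,
               float alpha, float beta)
{
    for (int i = 0; i < m; ++i) {
        for (int p = C_rowptr[i]; p < C_rowptr[i + 1]; ++p) {
            int   j   = C_colind[p];          /* only stored positions are computed */
            float dot = 0.0f;
            for (int l = 0; l < k; ++l)
                dot += A[i * k + l] * B[l * n + j];
            C_val[p] = alpha * dot + beta * C_val[p];
        }
    }
}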

About cudnn_samples

[screenshot: cudnn-q0]

The problem is shown in the figure above.
I can't run my cuDNN sample; the message reports error code 0.

My system is Ubuntu 16.04,
the GPU is a 2080 Ti,
CUDA is 9.0,
and cuDNN is 7.3.1 (I had tried 7.4 and 7.0 before and hit the same problem).

Can anyone help me?

The next figure shows the message after I run the command "make clean && make" in the cudnn_sample directory.
[screenshot: cudnn-q1]

LtSgemm example on non-square matrices

I tested the example on a 4 by 4 matrix and its result matches the hand-computed result.
But the problem is with non-square matrices (e.g. 2 by 3 and 3 by 4):
the computed result does not look right.
How can I compute a GEMM on non-square matrices?
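If the unexpected results come from the matrix layouts rather than from cuBLASLt itself, note that cuBLASLt, like cuBLAS, assumes column-major storage by default, so for non-square shapes the leading dimensions have to track the row counts of A, B, and C. A rough sketch of the layout setup that would correspond to C(m x n) = A(m x k) * B(k x n) with no transposes; the helper name make_layouts is illustrative, not part of the sample:

#include <cublasLt.h>

/* Column-major layouts for C(m x n) = A(m x k) * B(k x n), CUBLAS_OP_N on both
   sides. With column-major storage the leading dimension of each matrix is its
   number of rows: lda = m, ldb = k, ldc = m. */
static void make_layouts(int m, int n, int k,
                         cublasLtMatrixLayout_t *Adesc,
                         cublasLtMatrixLayout_t *Bdesc,
                         cublasLtMatrixLayout_t *Cdesc)
{
    cublasLtMatrixLayoutCreate(Adesc, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(Bdesc, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(Cdesc, CUDA_R_32F, m, n, m);
}

If the data is stored row-major instead, the order attribute would additionally need to be set on each layout via cublasLtMatrixLayoutSetAttribute (CUBLASLT_MATRIX_LAYOUT_ORDER).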

cuTensor - Python test Failed

I tested this in the container "nvcr.io/nvidia/pytorch:21.04-py3".

$ python cutensor/torch/einsum_test.py
..........FF.
======================================================================
FAIL: test_einsum_general_equivalent_results_1_test_1 (__main__.EinsumTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "cutensor/torch/einsum_test.py", line 221, in test_einsum_general_equivalent_results
    torch.testing.assert_allclose(cutensor_rslt, torch_rslt, rtol=5e-3, atol=5e-3)
  File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 250, in assert_allclose
    raise AssertionError(msg)
AssertionError: With rtol=0.005 and atol=0.005, found 1 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.012973159551620483 (-0.4519635736942291 vs. -0.4649367332458496), which occurred at index (15, 3).

======================================================================
FAIL: test_einsum_general_equivalent_results_2_test_2 (__main__.EinsumTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "cutensor/torch/einsum_test.py", line 221, in test_einsum_general_equivalent_results
    torch.testing.assert_allclose(cutensor_rslt, torch_rslt, rtol=5e-3, atol=5e-3)
  File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 250, in assert_allclose
    raise AssertionError(msg)
AssertionError: With rtol=0.005 and atol=0.005, found 4 element(s) (out of 400) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.024122774600982666 (-0.7208095192909241 vs. -0.6966867446899414), which occurred at index (21, 6).

----------------------------------------------------------------------
Ran 13 tests in 2.600s

FAILED (failures=2)

Sometimes, test0 fails as well.

When will cuSPARSELt support TF32? Are there any sample codes for BF16?

  1. I checked the sample code of spMMA and read the cuSPARSELt reference. According to the documentation, cuSPARSELt only supports FP16, BF16 and INT8. It has been announced that on A100, TF32 can also exploit the sparsity of the Tensor Cores. When will cuSPARSELt support TF32?
  2. BF16 is a new GPU datatype, especially on Ampere. Is there any sample code showing how to use BF16 with cuSPARSELt?

Extract main diagonal elements

Hello, is there a method to extract the main-diagonal elements of a sparse matrix in cuSPARSE (the format I am using is CSR)? I want to use them as a preconditioner in my PCG algorithm. Looking forward to your reply!
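For a Jacobi-style preconditioner this does not need a dedicated cuSPARSE routine; a small kernel that scans each CSR row is enough. A minimal sketch, assuming a 0-based CSR matrix with double values (the kernel name and launch configuration are illustrative):

__global__ void extract_csr_diagonal(int n, const int *rowptr, const int *colind,
                                     const double *val, double *diag)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double d = 0.0;                           // 0 if the diagonal entry is not stored
    for (int p = rowptr[row]; p < rowptr[row + 1]; ++p) {
        if (colind[p] == row) { d = val[p]; break; }
    }
    diag[row] = d;
}

// launch: extract_csr_diagonal<<<(n + 255) / 256, 256>>>(n, d_rowptr, d_colind, d_val, d_diag);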

Fail to compile cusparse example dense2sparse

Hello, I ran into a problem when testing the examples. When compiling the dense2sparse example, the following error appears:
dense2sparse_example.c:121:41: error: ‘CUSPARSE_DENSETOSPARSE_ALG_DEFAULT’ undeclared (first use in this function) CUSPARSE_DENSETOSPARSE_ALG_DEFAULT
My environment setup is:
OS: Ubuntu 16.04
GPU: GTX 2080
CUDA version / driver: 11.1 / 455.23.05
Can you help me figure out what causes this error? Thank you very much.

CuBLASLt examples aren't installed

I am packing these up for my distro to make it easier to test GPU setups. It would be nice if all examples had installation rules so the default cmake packaging setup just worked.

CMake Error in nvJPEG

CMake Error: The source directory "<...>/CUDALibrarySamples/nvJPEG" does not appear to contain CMakeLists.txt.

Additionally, NvElement.h is missing in NvJpegDecoder.h.

cuSparseLT API makes it difficult to verify 2:4 format of the weight at model load time

When I load the model, I would like to verify the 2:4 format of the constant sparse weights that come with the model.

However, cusparseLtSpMMAPruneCheck requires arguments that are only available at runtime.

For example, to create a cusparseLtMatmulDescriptor_t, we need to create descriptors for B and C. The exact dimensions of B are not known until runtime, and the same is true for C.

Can we not verify a chunk of data for the 2:4 pattern without knowing the variable dimension of B?

Sorry for the confusion, we actually swap A and B during the computations.
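Whether or not the descriptor-based cusparseLtSpMMAPruneCheck fits here, the 2:4 structure itself can be checked on the host without knowing the dimensions of B, since it is a property of the weight data alone. A minimal sketch, assuming row-major storage with the compressed dimension along the rows and a column count that is a multiple of 4 (the function name is illustrative):

#include <stdbool.h>
#include <stddef.h>

/* Every group of four consecutive values along the compressed dimension
   may contain at most two non-zeros. */
bool is_2_4_sparse(const float *w, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c + 4 <= cols; c += 4) {
            int nnz = 0;
            for (int i = 0; i < 4; ++i) {
                if (w[r * cols + c + i] != 0.0f)
                    ++nnz;
            }
            if (nnz > 2)
                return false;
        }
    }
    return true;
}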

cuSparseLT question

In the cuSPARSELt documentation and in the samples published here, the 2:4 matrix is always matrix A.
Though the API seems to allow it, I am not seeing any statement about whether matrix B could also be a sparse matrix. Can they both be sparse?

Problem with cusparseSpMV for datatype CUDA_C_64F

Hello,
I'm trying to use the routine cusparseSpMV with datatype CUDA_C_64F, called from Fortran. First of all, I tried to reproduce the example at https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spmv_csr. I can run the example for the datatypes CUDA_R_32F and CUDA_C_32F without any issue and with the correct result. However, when I change the relevant data arrays and the datatype to CUDA_C_64F, I get the message:
Unhandled exception at 0x00007ffe6f5d3a1a in Cusparse_t.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.

I'm using the CUDA toolkit version 11.5.
Operating system Windows 10
Compiler: Intel Fortran 2013 SP1

The code is given below.

Thanks in advance for any help.
Regards,
Peter

module InterfaceCUDA
Interface
! cudaMalloc
integer (c_int) function cudaMalloc ( buffer, size ) &
bind (C, name="cudaMalloc" )
use iso_c_binding
implicit none
type (c_ptr) :: buffer
integer (c_int), value :: size
end function cudaMalloc

integer (c_int) function cudaMemcpy ( dst, src, count, kind ) &
                          bind (C, name="cudaMemcpy" )

! note: cudaMemcpyHostToDevice = 1
! note: cudaMemcpyDeviceToHost = 2
! note: cudaMemcpyDeviceToDevice = 3
use iso_c_binding
type (c_ptr), value :: dst, src
integer (c_int), value :: count, kind
end function cudaMemcpy

integer (c_int) function cudaFree(buffer)  bind(C, name="cudaFree")
    use iso_c_binding
    implicit none
    type (c_ptr), value :: buffer
end function cudaFree

end Interface
end module
module InterfaceCusparse

interface
integer(c_int) function cusparseCreate(cusparseHandle) &
bind(C,name="cusparseCreate")

    use iso_c_binding
    implicit none
    type(c_ptr) ::cusparseHandle
end function cusparseCreate

integer(4) function cusparseCreateCsr(descr, rows, cols, nnz, csrRowOffsets, &
            csrColInd, csrValues, csrRowOffsetsType, csrColIndType, idxBase, valueType) &
            bind(C,name="cusparseCreateCsr")
    use iso_c_binding
    type(c_ptr) :: descr
    integer(kind=c_int64_t), value :: rows, cols, nnz
    type(c_ptr), value :: csrRowOffsets, csrColInd
    type(c_ptr), value :: csrValues
    integer(c_int), value :: csrRowOffsetsType, csrColIndType, idxBase, valueType
end

integer(c_int) function cusparseDestroy(cusparseHandle) bind(C,name="cusparseDestroy")

    use iso_c_binding
    implicit none
    type(c_ptr),value::cusparseHandle
end function cusparseDestroy

integer(c_int) function cusparseDestroyMatDescr(descrA) bind(C,name="cusparseDestroyMatDescr")

    use iso_c_binding
    implicit none
    type(c_ptr), value:: descrA
end function cusparseDestroyMatDescr

integer(c_int) function cusparseDestroyDnVec(descr) bind(C,name="cusparseDestroyDnVec")

    use iso_c_binding
    type(c_ptr), value :: descr
end function cusparseDestroyDnVec

integer(4) function cusparseDestroySpMat(descr) bind(C,name="cusparseDestroySpMat")

    use iso_c_binding
    type(c_ptr), value :: descr
end function cusparseDestroySpMat

integer(4) function cusparseCreateDnVec(descr, size, values, valueType)  &
            bind(C,name="cusparseCreateDnVec")
    use iso_c_binding
    type(c_ptr) :: descr
    integer(kind=c_int64_t), value ::  size
    type(c_ptr), value :: values
    integer(c_int), value  :: valueType
end

integer(4) function cusparseSpMV_bufferSize(handle, opA, alpha, matA, vecX, beta, vecY, computeType, alg, bufferSize) &
            bind(C,name="cusparseSpMV_bufferSize")
    use iso_c_binding
    type(c_ptr), value      :: handle
    integer(c_int), value   :: opA

!! complex(c_float_complex) :: alpha, beta ! device or host variable
complex(c_double_complex) :: alpha, beta ! device or host variable
!! real(c_float) :: alpha, beta ! device or host variable
type(c_ptr), value :: matA
type(c_ptr), value :: vecX, vecY
integer(c_int), value :: computeType, alg
integer(c_size_t) :: bufferSize
end

integer(4) function cusparseSpMV(handle, opA, alpha, matA, vecX, beta, vecY, computeType, alg, buffer)  &
            bind(C,name="cusparseSpMV")
    use iso_c_binding
    type(c_ptr), value      :: handle
    integer(c_int), value   :: opA

!! complex(c_float_complex) :: alpha, beta ! device or host variable
complex(c_double_complex) :: alpha, beta ! device or host variable
!! real(c_float) :: alpha, beta ! device or host variable
type(c_ptr), value :: matA
type(c_ptr), value :: vecX
type(c_ptr), value :: vecY
integer(c_int), value :: computeType, alg
type(c_ptr), value :: buffer
end

end interface
end module
module CudaTypes
enum, bind(C) ! cudaDataType
enumerator :: CUDA_R_16F  = 2  !! real as a half
enumerator :: CUDA_C_16F  = 6  !! complex as a pair of half numbers
enumerator :: CUDA_R_16BF = 14 !! real as a nv_bfloat16
enumerator :: CUDA_C_16BF = 15 !! complex as a pair of nv_bfloat16 numbers
enumerator :: CUDA_R_32F  = 0  !! real as a float
enumerator :: CUDA_C_32F  = 4  !! complex as a pair of float numbers
enumerator :: CUDA_R_64F  = 1  !! real as a double
enumerator :: CUDA_C_64F  = 5  !! complex as a pair of double numbers
enumerator :: CUDA_R_4I   = 16 !! real as a signed 4-bit int
enumerator :: CUDA_C_4I   = 17 !! complex as a pair of signed 4-bit int numbers
enumerator :: CUDA_R_4U   = 18 !! real as an unsigned 4-bit int
enumerator :: CUDA_C_4U   = 19 !! complex as a pair of unsigned 4-bit int numbers
enumerator :: CUDA_R_8I   = 3  !! real as a signed 8-bit int
enumerator :: CUDA_C_8I   = 7  !! complex as a pair of signed 8-bit int numbers
enumerator :: CUDA_R_8U   = 8  !! real as an unsigned 8-bit int
enumerator :: CUDA_C_8U   = 9  !! complex as a pair of unsigned 8-bit int numbers
enumerator :: CUDA_R_16I  = 20 !! real as a signed 16-bit int
enumerator :: CUDA_C_16I  = 21 !! complex as a pair of signed 16-bit int numbers
enumerator :: CUDA_R_16U  = 22 !! real as an unsigned 16-bit int
enumerator :: CUDA_C_16U  = 23 !! complex as a pair of unsigned 16-bit int numbers
enumerator :: CUDA_R_32I  = 10 !! real as a signed 32-bit int
enumerator :: CUDA_C_32I  = 11 !! complex as a pair of signed 32-bit int numbers
enumerator :: CUDA_R_32U  = 12 !! real as an unsigned 32-bit int
enumerator :: CUDA_C_32U  = 13 !! complex as a pair of unsigned 32-bit int numbers
enumerator :: CUDA_R_64I  = 24 !! real as a signed 64-bit int
enumerator :: CUDA_C_64I  = 25 !! complex as a pair of signed 64-bit int numbers
enumerator :: CUDA_R_64U  = 26 !! real as an unsigned 64-bit int
enumerator :: CUDA_C_64U  = 27 !! complex as a pair of unsigned 64-bit int numbers
end enum
end module CudaTypes
module CuSparseTypes

USE, INTRINSIC :: ISO_C_BINDING
type cusparseHandle
type(C_PTR) :: handle
end type cusparseHandle

type cusparseCsrsv2Info
type(C_PTR) :: info
end type cusparseCsrsv2Info
type cusparseCsric02Info
type(C_PTR) :: info
end type cusparseCsric02Info
type cusparseCsrilu02Info
type(C_PTR) :: info
end type cusparseCsrilu02Info
type cusparseBsrsv2Info
type(C_PTR) :: info
end type cusparseBsrsv2Info
type cusparseBsric02Info
type(C_PTR) :: info
end type cusparseBsric02Info
type cusparseBsrilu02Info
type(C_PTR) :: info
end type cusparseBsrilu02Info
type cusparseBsrsm2Info
type(C_PTR) :: info
end type cusparseBsrsm2Info
type cusparseCsrgemm2Info
type(C_PTR) :: info
end type cusparseCsrgemm2Info
type cusparseColorInfo
type(C_PTR) :: info
end type cusparseColorInfo
type cusparseCsru2csrInfo
type(C_PTR) :: info
end type cusparseCsru2csrInfo
type cusparseSpVecDescr
type(C_PTR) :: descr
end type cusparseSpVecDescr
type cusparseDnVecDescr
type(C_PTR) :: descr
end type cusparseDnVecDescr
type cusparseSpMatDescr
type(C_PTR) :: descr
end type cusparseSpMatDescr
type cusparseDnMatDescr
type(C_PTR) :: descr
end type cusparseDnMatDescr

! cuSPARSE status return values
enum, bind(C) ! cusparseStatus_t
enumerator :: CUSPARSE_STATUS_SUCCESS=0
enumerator :: CUSPARSE_STATUS_NOT_INITIALIZED=1
enumerator :: CUSPARSE_STATUS_ALLOC_FAILED=2
enumerator :: CUSPARSE_STATUS_INVALID_VALUE=3
enumerator :: CUSPARSE_STATUS_ARCH_MISMATCH=4
enumerator :: CUSPARSE_STATUS_MAPPING_ERROR=5
enumerator :: CUSPARSE_STATUS_EXECUTION_FAILED=6
enumerator :: CUSPARSE_STATUS_INTERNAL_ERROR=7
enumerator :: CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8
enumerator :: CUSPARSE_STATUS_ZERO_PIVOT=9
enumerator :: CUSPARSE_STATUS_NOT_SUPPORTED=10
end enum
enum, bind(c) ! cusparsePointerMode_t
enumerator :: CUSPARSE_POINTER_MODE_HOST = 0
enumerator :: CUSPARSE_POINTER_MODE_DEVICE = 1
end enum
enum, bind(c) ! cusparseAction_t
enumerator :: CUSPARSE_ACTION_SYMBOLIC = 0
enumerator :: CUSPARSE_ACTION_NUMERIC = 1
end enum
enum, bind(C) ! cusparseMatrixType_t
enumerator :: CUSPARSE_MATRIX_TYPE_GENERAL = 0
enumerator :: CUSPARSE_MATRIX_TYPE_SYMMETRIC = 1
enumerator :: CUSPARSE_MATRIX_TYPE_HERMITIAN = 2
enumerator :: CUSPARSE_MATRIX_TYPE_TRIANGULAR = 3
end enum
enum, bind(C) ! cusparseFillMode_t
enumerator :: CUSPARSE_FILL_MODE_LOWER = 0
enumerator :: CUSPARSE_FILL_MODE_UPPER = 1
end enum
enum, bind(C) ! cusparseDiagType_t
enumerator :: CUSPARSE_DIAG_TYPE_NON_UNIT = 0
enumerator :: CUSPARSE_DIAG_TYPE_UNIT = 1
end enum
enum, bind(C) ! cusparseIndexBase_t
enumerator :: CUSPARSE_INDEX_BASE_ZERO = 0
enumerator :: CUSPARSE_INDEX_BASE_ONE = 1
end enum
enum, bind(C) ! cusparseOperation_t
enumerator :: CUSPARSE_OPERATION_NON_TRANSPOSE = 0
enumerator :: CUSPARSE_OPERATION_TRANSPOSE = 1
enumerator :: CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE = 2
end enum
enum, bind(C) ! cusparseDirection_t
enumerator :: CUSPARSE_DIRECTION_ROW = 0
enumerator :: CUSPARSE_DIRECTION_COLUMN = 1
end enum
enum, bind(C) ! cusparseHybPartition_t
enumerator :: CUSPARSE_HYB_PARTITION_AUTO = 0
enumerator :: CUSPARSE_HYB_PARTITION_USER = 1
enumerator :: CUSPARSE_HYB_PARTITION_MAX = 2
end enum
enum, bind(C) ! cusparseSolvePolicy_t
enumerator :: CUSPARSE_SOLVE_POLICY_NO_LEVEL = 0
enumerator :: CUSPARSE_SOLVE_POLICY_USE_LEVEL = 1
end enum
enum, bind(C) ! cusparseSideMode_t
enumerator :: CUSPARSE_SIDE_LEFT = 0
enumerator :: CUSPARSE_SIDE_RIGHT = 1
end enum
enum, bind(C) ! cusparseColorAlg_t
enumerator :: CUSPARSE_COLOR_ALG0 = 0
enumerator :: CUSPARSE_COLOR_ALG1 = 1
end enum
enum, bind(C) ! cusparseAlgMode_t;
enumerator :: CUSPARSE_ALG0 = 0
enumerator :: CUSPARSE_ALG1 = 1
enumerator :: CUSPARSE_ALG_NAIVE = 0
enumerator :: CUSPARSE_ALG_MERGE_PATH = 1
end enum
enum, bind(C) ! cusparseCsr2CscAlg_t;
enumerator :: CUSPARSE_CSR2CSC_ALG1 = 1
enumerator :: CUSPARSE_CSR2CSC_ALG2 = 2
end enum
enum, bind(C) ! cusparseFormat_t;
enumerator :: CUSPARSE_FORMAT_CSR = 1
enumerator :: CUSPARSE_FORMAT_CSC = 2
enumerator :: CUSPARSE_FORMAT_COO = 3
enumerator :: CUSPARSE_FORMAT_COO_AOS = 4
end enum
enum, bind(C) ! cusparseOrder_t;
enumerator :: CUSPARSE_ORDER_COL = 1
enumerator :: CUSPARSE_ORDER_ROW = 2
end enum
enum, bind(C) ! cusparseSpMVAlg_t;
enumerator :: CUSPARSE_MV_ALG_DEFAULT = 0
enumerator :: CUSPARSE_COOMV_ALG = 1
enumerator :: CUSPARSE_CSRMV_ALG1 = 2
enumerator :: CUSPARSE_CSRMV_ALG2 = 3
end enum
enum, bind(C) ! cusparseSpMMAlg_t;
enumerator :: CUSPARSE_MM_ALG_DEFAULT = 0
enumerator :: CUSPARSE_COOMM_ALG1 = 1
enumerator :: CUSPARSE_COOMM_ALG2 = 2
enumerator :: CUSPARSE_COOMM_ALG3 = 3
enumerator :: CUSPARSE_CSRMM_ALG1 = 4
end enum
enum, bind(C) ! cusparseIndexType_t;
enumerator :: CUSPARSE_INDEX_16U = 1
enumerator :: CUSPARSE_INDEX_32I = 2
enumerator :: CUSPARSE_INDEX_64I = 3
end enum

end module
program Main_test
use CudaTypes
use CuSparseTypes
use InterfaceCusparse
use InterfaceCUDA

integer*4 hA_csrOffsets(5), hA_columns(9)
integer*8 A_num_rows, A_num_cols, A_nnz
integer ierr
Integer, parameter  :: size_of_int=4
Integer, parameter  :: size_of_real=4

!! complex(c_float_complex) hA_values(9), hX(4), hY(4), hY_result(4)
complex(c_double_complex) hA_values(9), hX(4), hY(4), hY_result(4)
complex(c_double_complex) alpha, beta
!! complex(c_float_complex) alpha, beta

data      hA_csrOffsets / 1, 4, 5, 8, 10 /
data      hA_columns / 1, 3, 4, 2, 1, 3, 4, 2, 4 /
data      hA_values / (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0),  & 
           (6.0, 0.0), (7.0, 0.0), (8.0, 0.0), (9.0, 0.0) /
data      hX / (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0) /
data      hY / (0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 0.0) /
data      hY_result / (19.0, 0.0), (8.0, 0.0), (51.0, 0.0), (52.0, 0.0) /

!pointers to data
type(C_PTR) :: AcsrOff_index, Acolumns_index, dAvalues_index, hX_index, hY_index

! Device memory management
integer*4     :: AcsrOff_i_size, Acolumns_i_size, dAvalues_d_size, hX_d_size, hY_d_size
integer*8     :: bufferSize
type(C_PTR)   :: devPtr_AcsrOff, devPtr_Acolumns, devPtr_dAvalues, devPtr_dX, devPtr_dY, devPtr_Buffer

integer*8 cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice
parameter (cudaMemcpyHostToDevice=1)
parameter (cudaMemcpyDeviceToHost=2)
parameter (cudaMemcpyDeviceToDevice=3)

integer*8 rows, cols, nnz, sizeVec
integer*4 RowOffType, ColIndType, idxBase, valueType
type(C_PTR)   :: devPtr_RowOff, devPtr_ColInd, devPtr_csrValues

! CUSPARSE APIs
type(C_PTR)          handle, bl_handle
type(C_PTR)          matA
type(C_PTR)          vecX, vecY

A_num_rows = 4
A_num_cols = 4
A_nnz      = 9
alpha      = cmplx(1.0, 0.0)
beta       = cmplx(0.0, 0.0)

AcsrOff_i_size = sizeof(hA_csrOffsets(1:5))
Acolumns_i_size = sizeof(hA_columns(1:9))
dAvalues_d_size = sizeof(hA_values(1:9))
hX_d_size = sizeof(hX(1:4))
hY_d_size = sizeof(hY(1:4))

AcsrOff_index = c_loc(hA_csrOffsets)
Acolumns_index = c_loc(hA_columns)
dAvalues_index = c_loc(hA_values)
hX_index = c_loc(hX)
hY_index = c_loc(hY)


ierr = cudaMalloc(devPtr_AcsrOff, AcsrOff_i_size)
ierr = cudaMalloc(devPtr_Acolumns, Acolumns_i_size)
ierr = cudaMalloc(devPtr_dAvalues, dAvalues_d_size)
ierr = cudaMalloc(devPtr_dX, hX_d_size)
ierr = cudaMalloc(devPtr_dY, hY_d_size)
  
ierr = cudaMemcpy(devPtr_AcsrOff,AcsrOff_index,AcsrOff_i_size,cudaMemcpyHostToDevice)
ierr = cudaMemcpy(devPtr_Acolumns,Acolumns_index,Acolumns_i_size,cudaMemcpyHostToDevice)
ierr = cudaMemcpy(devPtr_dAvalues,dAvalues_index,dAvalues_d_size,cudaMemcpyHostToDevice)
ierr = cudaMemcpy(devPtr_dX,hX_index,hX_d_size,cudaMemcpyHostToDevice)
ierr = cudaMemcpy(devPtr_dY,hY_index,hY_d_size,cudaMemcpyHostToDevice)
if (ierr /= CUSPARSE_STATUS_SUCCESS) goto 9000

ierr = cusparseCreate(handle)
if (ierr /= CUSPARSE_STATUS_SUCCESS) goto 9000

! Create sparse matrix A in CSR format
ierr = cusparseCreateCsr(matA, A_num_rows, A_num_cols, A_nnz,                      &
                                  devPtr_AcsrOff, devPtr_Acolumns, devPtr_dAvalues,  &
                                  CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,            &
                                  CUSPARSE_INDEX_BASE_ONE, CUDA_C_64F)
if (ierr /= CUSPARSE_STATUS_SUCCESS) goto 9000


!Create dense vector X
ierr = cusparseCreateDnVec(vecX, A_num_cols, devPtr_dX, CUDA_C_64F)
!Create dense vector y
ierr = cusparseCreateDnVec(vecY, A_num_rows, devPtr_dY, CUDA_C_64F)

!allocate an external buffer if needed
ierr = cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,    &
                               Alpha, matA, vecX, Beta, vecY, CUDA_C_64F,   &
                               CUSPARSE_MV_ALG_DEFAULT, bufferSize) 
if (ierr /= CUSPARSE_STATUS_SUCCESS) goto 9000

ierr = cudaMalloc(devPtr_Buffer, int(bufferSize)*16) 
if (ierr /= CUSPARSE_STATUS_SUCCESS) goto 9000

!execute SpMV
ierr = cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,    &
                    Alpha, matA, vecX, Beta, vecY, CUDA_C_64F,   &
                    CUSPARSE_MV_ALG_DEFAULT, devPtr_Buffer) 
if (ierr /= CUSPARSE_STATUS_SUCCESS) goto 9000

!device result check
ierr = cudaMemcpy(hY_index, devPtr_dY, hY_d_size, cudaMemcpyDeviceToHost)

write(*,*) (hY(i), i=1,A_num_rows)

!destroy matrix/vector descriptors
ierr = cusparseDestroySpMat(matA)
ierr = cusparseDestroyDnVec(vecX)
ierr = cusparseDestroyDnVec(vecY)
ierr = cusparseDestroy(handle)

!device memory deallocation
ierr = cudaFree(devPtr_Buffer)
ierr = cudaFree(devPtr_AcsrOff)
ierr = cudaFree(devPtr_Acolumns)
ierr = cudaFree(devPtr_dAvalues)
ierr = cudaFree(devPtr_dX)
ierr = cudaFree(devPtr_dY)

9000 if (ierr /= CUSPARSE_STATUS_SUCCESS) write(*,*) "Error occurred"
end
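For comparison, the same CUDA_C_64F call sequence condensed into one C helper; the function name spmv_c64f, the assumption that the device arrays are already populated, and the omission of error checking are mine, not part of the sample:

#include <cuda_runtime_api.h>
#include <cusparse.h>
#include <cuComplex.h>

/* y = A * x for a 1-based CSR matrix with CUDA_C_64F values.
   All pointer arguments are device pointers that the caller has filled. */
static cusparseStatus_t spmv_c64f(int64_t num_rows, int64_t num_cols, int64_t nnz,
                                  int *dRowOffsets, int *dColInd, cuDoubleComplex *dVals,
                                  cuDoubleComplex *dX, cuDoubleComplex *dY)
{
    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cusparseHandle_t     handle;
    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    size_t bufferSize = 0;
    void  *dBuffer    = NULL;

    cusparseCreate(&handle);
    cusparseCreateCsr(&matA, num_rows, num_cols, nnz, dRowOffsets, dColInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ONE, CUDA_C_64F);
    cusparseCreateDnVec(&vecX, num_cols, dX, CUDA_C_64F);
    cusparseCreateDnVec(&vecY, num_rows, dY, CUDA_C_64F);

    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY, CUDA_C_64F,
                            CUSPARSE_MV_ALG_DEFAULT, &bufferSize);
    cudaMalloc(&dBuffer, bufferSize);          /* bufferSize is already in bytes */
    cusparseStatus_t status =
        cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     &alpha, matA, vecX, &beta, vecY, CUDA_C_64F,
                     CUSPARSE_MV_ALG_DEFAULT, dBuffer);

    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroySpMat(matA);
    cusparseDestroy(handle);
    cudaFree(dBuffer);
    return status;
}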

Possible crash issue related to cusparseSparseToDense routines

I found a possible crash in cusparseSparseToDense with valid input data. The sample code, which is basically the same as the provided example except for a small change that initializes the sparse matrix from the attached files, and the data files for the sparse matrix are all attached. I tested both the CUSPARSE_ORDER_COL and CUSPARSE_ORDER_ROW cases (controlled by the macro dense_mat_in_col_major), and both fail with the same error:

CUDA API failed at line 204 with error: an illegal memory access was encountered (700)

I am using Visual Studio 2017 with SDK version 10.0.15063.0 and CUDA 11.2. Hardware is a 1080 Ti, a 2080 Ti, and an Intel i7-8700K. Tested on both the 1080 Ti and the 2080 Ti with the same error report.

Code_Data.zip

cuSparse BlockedEll sparse format documentation is incomplete online

I am looking at cusparseCreateBlockedEll, and it looks like the library does not offer any conversion helpers like those for the CSR/COO formats. Am I missing something?

Also, I am looking at the BlockedEll format description and it does not mention what should be the value for a missing block index. For example, if the last row of blocks has only one non-zero block, what should be the value of the missing block column index? Would it be -1 or something else?


Another issue is that cusparseCooSetPointers does not appear to have online documentation but is present in the header file.

Question about cusparseGetColorAlgs

Maybe this is not the correct place to ask this question, but I couldn't find documentation for cusparseGetColorAlgs. I am trying to find a function that would tell me which algorithm cusparseSpMM uses when called with CUSPARSE_SPMM_ALG_DEFAULT.

CublasLtMatMul seems slow compared with Gemm

I tried to replace SGemm() with CublasLtMatMul() for its multiple choices of algorithms and options such as tile size, but found that CublasLtMatMul() is in general slower than Gemm(). Is this expected?

Here is a profiling tool you can use to reproduce this: https://github.com/jeng1220/cuGemmProf

e.g. ./cuGemmProf -m 512 -n 768 -k 3072 --type 5,6 -l 1000 --all_algo
Device, Op(A), Op(B), M, N, K, ComputeType, A, B, C, DP4A.Restrictions(lda.ldb), TensorCoreRestrictions(m.k.A.B.C.lda.ldb.ldc), Algo, Time(ms), GFLOPS, LtAlgoId, TileId, SpliteK, Red.Sch, Swizzle, CustomId, WorkSpaceSize, WaveCount
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_DEFAULT, 0.214479, 11264.1
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO22, 0.215919, 11189
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO9, 0.223748, 10797.5
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO1_TENSOR_OP, 0.0716704, 33708.8
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO0_TENSOR_OP, 0.170531, 14167.1
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO7_TENSOR_OP, 0.198502, 12170.8
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLASLT_1ST_HEURISTIC_ALG, 0.240428, 10048.4, 1, CUBLASLT_MATMUL_TILE_64x32, 1, CUBLASLT_REDUCTION_SCHEME_NONE, 0, 0, 0, 1.000000
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO5, 0.406726, 5939.92
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_DEFAULT, 0.406746, 5939.63
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO4, 0.410825, 5880.65
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO14_TENSOR_OP, 0.406679, 5940.61
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO12_TENSOR_OP, 0.406684, 5940.53
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO3_TENSOR_OP, 0.406689, 5940.45
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLASLT_1ST_HEURISTIC_ALG, 0.583007, 4143.89, 0, CUBLASLT_MATMUL_TILE_128x64, 1, CUBLASLT_REDUCTION_SCHEME_NONE, 0, 0, 0, 0.000000

Possible bug related to cusparseDenseToSparse routines

I found a possible bug in the cusparseDenseToSparse routines with valid input data. The sample code, which is basically the same as the provided example except for a small change that initializes the dense matrix from the attached files, and the data files for the dense matrix are all attached. The attached code converts a dense matrix into CSR format; however, the column indices within each row are not in ascending order. I would like to confirm whether this behavior is expected.

I am using Visual Studio 2017 with SDK version 10.0.15063.0 and CUDA 11.2. Hardware is a 1080 Ti, a 2080 Ti, and an Intel i7-8700K. Tested on both the 1080 Ti and the 2080 Ti with the same behavior.

Code_Data.zip
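If sorted rows are required downstream, the rows can be reordered after the conversion (as far as I can tell, no particular intra-row order is promised here). A small host-side sketch that sorts the (column, value) pairs inside each CSR row; the function name is illustrative:

void sort_csr_rows(int m, const int *rowptr, int *colind, float *val)
{
    for (int i = 0; i < m; ++i) {
        /* insertion sort of the entries belonging to row i */
        for (int p = rowptr[i] + 1; p < rowptr[i + 1]; ++p) {
            int   c = colind[p];
            float v = val[p];
            int   q = p - 1;
            while (q >= rowptr[i] && colind[q] > c) {
                colind[q + 1] = colind[q];
                val[q + 1]    = val[q];
                --q;
            }
            colind[q + 1] = c;
            val[q + 1]    = v;
        }
    }
}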

cusparseSpGEMM is not working with data type CUDA_C_64F

Hello,

I am currently working on an application where I need to execute the multiplication of two complex sparse matrices. I've done the multiplication with the method cusparseSpGEMM defining the data type of both matrices and the result as CUDA_C_32F. However, when I try to change the precision with the data type CUDA_C_64F, the function cusparseSpGEMM_copy breaks the application and returns the following error:

Exception thrown at 0x00007FFD9F0F8F4E (cusparse64_11.dll) in cuF64ExampleMM.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF. occurred

I followed the example from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSE/spgemm, and I can run the example using the data type CUDA_R_32F, CUDA_R_64F, and CUDA_C_32F without any issue.

I don't know what's causing this since the example works properly for the other data types, even the single-precision complex one (CUDA_C_32F ), and the documentation says cusparseSpGEMM supports CUDA_C_64F.

The cusparseSpGEMM example using CUDA_C_64F is below.

#include <cuda_runtime_api.h> // cudaMalloc, cudaMemcpy, etc.
#include <cusparse.h>         // cusparseSpGEMM
#include <cuComplex.h>
#include <stdio.h>            // printf
#include <stdlib.h>           // EXIT_FAILURE

#define CHECK_CUDA(func)                                                       \
{                                                                              \
    cudaError_t status = (func);                                               \
    if (status != cudaSuccess) {                                               \
        printf("CUDA API failed at line %d with error: %s (%d)\n",             \
               __LINE__, cudaGetErrorString(status), status);                  \
        return EXIT_FAILURE;                                                   \
    }                                                                          \
}

#define CHECK_CUSPARSE(func)                                                   \
{                                                                              \
    cusparseStatus_t status = (func);                                          \
    if (status != CUSPARSE_STATUS_SUCCESS) {                                   \
        printf("CUSPARSE API failed at line %d with error: %s (%d)\n",         \
               __LINE__, cusparseGetErrorString(status), status);              \
        return EXIT_FAILURE;                                                   \
    }                                                                          \
}

int main(void) {
    // Host problem definition
#define   A_NUM_ROWS 4   // C compatibility
    const int A_num_rows = 4;
    const int A_num_cols = 4;
    const int A_nnz = 9;
    const int B_num_rows = 4;
    const int B_num_cols = 4;
    const int B_nnz = 8;
    int   hA_csrOffsets[] = { 0, 3, 4, 7, 9 };
    int   hA_columns[] = { 0, 2, 3, 1, 0, 2, 3, 1, 3 };
    cuDoubleComplex hA_values[] = { make_cuDoubleComplex(1,0), make_cuDoubleComplex(2,0), make_cuDoubleComplex(3,0), make_cuDoubleComplex(4,0), make_cuDoubleComplex(5,0),
                              make_cuDoubleComplex(6,0), make_cuDoubleComplex(7,0), make_cuDoubleComplex(8,0), make_cuDoubleComplex(9,0) };
    int   hB_csrOffsets[] = { 0, 2, 4, 7, 8 };
    int   hB_columns[] = { 0, 3, 1, 3, 0, 1, 2, 1 };
    cuDoubleComplex hB_values[] = { make_cuDoubleComplex(1,0), make_cuDoubleComplex(2,0), make_cuDoubleComplex(3,0), make_cuDoubleComplex(4,0), make_cuDoubleComplex(5,0),
                              make_cuDoubleComplex(6,0), make_cuDoubleComplex(7,0), make_cuDoubleComplex(8,0) };
    int   hC_csrOffsets[] = { 0, 4, 6, 10, 12 };
    int   hC_columns[] = { 0, 1, 2, 3, 1, 3, 0, 1, 2, 3, 1, 3 };
    cuDoubleComplex hC_values[] = { make_cuDoubleComplex(11,0), make_cuDoubleComplex(36,0), make_cuDoubleComplex(14,0), make_cuDoubleComplex(2,0),  make_cuDoubleComplex(12,0),
                              make_cuDoubleComplex(16,0), make_cuDoubleComplex(35,0), make_cuDoubleComplex(92,0), make_cuDoubleComplex(42,0), make_cuDoubleComplex(10,0),
                              make_cuDoubleComplex(96,0), make_cuDoubleComplex(32,0) };
    const int C_nnz = 12;
    #define C_NUM_NNZ 12   // C compatibility
    cuDoubleComplex               alpha = make_cuDoubleComplex(1,0);
    cuDoubleComplex               beta = make_cuDoubleComplex(0,0);
    cusparseOperation_t opA = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cusparseOperation_t opB = CUSPARSE_OPERATION_NON_TRANSPOSE;
    cudaDataType        computeType = CUDA_C_64F;
    //--------------------------------------------------------------------------
    // Device memory management: Allocate and copy A, B
    int* dA_csrOffsets, * dA_columns, * dB_csrOffsets, * dB_columns,
        * dC_csrOffsets, * dC_columns;
    cuDoubleComplex* dA_values, * dB_values, * dC_values;
    // allocate A
    CHECK_CUDA(cudaMalloc((void**)&dA_csrOffsets,
        (A_num_rows + 1) * sizeof(int)))
    CHECK_CUDA(cudaMalloc((void**)&dA_columns, A_nnz * sizeof(int)))
    CHECK_CUDA(cudaMalloc((void**)&dA_values, A_nnz * sizeof(cuDoubleComplex)))
    // allocate B
    CHECK_CUDA(cudaMalloc((void**)&dB_csrOffsets,
        (B_num_rows + 1) * sizeof(int)))
    CHECK_CUDA(cudaMalloc((void**)&dB_columns, B_nnz * sizeof(int)))
    CHECK_CUDA(cudaMalloc((void**)&dB_values, B_nnz * sizeof(cuDoubleComplex)))
    // allocate C offsets
    CHECK_CUDA(cudaMalloc((void**)&dC_csrOffsets,
        (A_num_rows + 1) * sizeof(int)))

    // copy A
    CHECK_CUDA(cudaMemcpy(dA_csrOffsets, hA_csrOffsets,
        (A_num_rows + 1) * sizeof(int),
        cudaMemcpyHostToDevice))
    CHECK_CUDA(cudaMemcpy(dA_columns, hA_columns, A_nnz * sizeof(int),
        cudaMemcpyHostToDevice))
    CHECK_CUDA(cudaMemcpy(dA_values, hA_values,
        A_nnz * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice))
    // copy B
    CHECK_CUDA(cudaMemcpy(dB_csrOffsets, hB_csrOffsets,
        (B_num_rows + 1) * sizeof(int),
        cudaMemcpyHostToDevice))
    CHECK_CUDA(cudaMemcpy(dB_columns, hB_columns, B_nnz * sizeof(int),
        cudaMemcpyHostToDevice))
    CHECK_CUDA(cudaMemcpy(dB_values, hB_values,
        B_nnz * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice))
    //--------------------------------------------------------------------------
    // CUSPARSE APIs
    cusparseHandle_t     handle = NULL;
    cusparseSpMatDescr_t matA, matB, matC;
    void* dBuffer1 = NULL, * dBuffer2 = NULL;
    size_t bufferSize1 = 0, bufferSize2 = 0;
    CHECK_CUSPARSE(cusparseCreate(&handle))
    // Create sparse matrix A in CSR format
    CHECK_CUSPARSE(cusparseCreateCsr(&matA, A_num_rows, A_num_cols, A_nnz,
        dA_csrOffsets, dA_columns, dA_values,
        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
        CUSPARSE_INDEX_BASE_ZERO, CUDA_C_64F))
    CHECK_CUSPARSE(cusparseCreateCsr(&matB, B_num_rows, B_num_cols, B_nnz,
        dB_csrOffsets, dB_columns, dB_values,
        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
        CUSPARSE_INDEX_BASE_ZERO, CUDA_C_64F))
    CHECK_CUSPARSE(cusparseCreateCsr(&matC, A_num_rows, B_num_cols, 0,
        NULL, NULL, NULL,
        CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
        CUSPARSE_INDEX_BASE_ZERO, CUDA_C_64F))
    //--------------------------------------------------------------------------
    // SpGEMM Computation
    cusparseSpGEMMDescr_t spgemmDesc;
    CHECK_CUSPARSE(cusparseSpGEMM_createDescr(&spgemmDesc))

    // ask bufferSize1 bytes for external memory
    CHECK_CUSPARSE(
        cusparseSpGEMM_workEstimation(handle, opA, opB,
            &alpha, matA, matB, &beta, matC,
            computeType, CUSPARSE_SPGEMM_DEFAULT,
            spgemmDesc, &bufferSize1, NULL))
    CHECK_CUDA(cudaMalloc((void**)&dBuffer1, bufferSize1))
    // inspect the matrices A and B to understand the memory requirement for
    // the next step
    CHECK_CUSPARSE(
        cusparseSpGEMM_workEstimation(handle, opA, opB,
            &alpha, matA, matB, &beta, matC,
            computeType, CUSPARSE_SPGEMM_DEFAULT,
            spgemmDesc, &bufferSize1, dBuffer1))

    // ask bufferSize2 bytes for external memory
    CHECK_CUSPARSE(
        cusparseSpGEMM_compute(handle, opA, opB,
            &alpha, matA, matB, &beta, matC,
            computeType, CUSPARSE_SPGEMM_DEFAULT,
            spgemmDesc, &bufferSize2, NULL))
    CHECK_CUDA(cudaMalloc((void**)&dBuffer2, bufferSize2))

    // compute the intermediate product of A * B
    CHECK_CUSPARSE(cusparseSpGEMM_compute(handle, opA, opB,
        &alpha, matA, matB, &beta, matC,
        computeType, CUSPARSE_SPGEMM_DEFAULT,
        spgemmDesc, &bufferSize2, dBuffer2))
    // get matrix C non-zero entries C_nnz1
    int64_t C_num_rows1, C_num_cols1, C_nnz1;
    CHECK_CUSPARSE(cusparseSpMatGetSize(matC, &C_num_rows1, &C_num_cols1,
        &C_nnz1))
    // allocate matrix C
    CHECK_CUDA(cudaMalloc((void**)&dC_columns, C_nnz1 * sizeof(int)))
    CHECK_CUDA(cudaMalloc((void**)&dC_values, C_nnz1 * sizeof(cuDoubleComplex)))
    // update matC with the new pointers
    CHECK_CUSPARSE(
        cusparseCsrSetPointers(matC, dC_csrOffsets, dC_columns, dC_values))

    // if beta != 0, cusparseSpGEMM_copy reuses/updates the values of dC_values

    // copy the final products to the matrix C
    CHECK_CUSPARSE(
        cusparseSpGEMM_copy(handle, opA, opB,
            &alpha, matA, matB, &beta, matC,
            computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc))

    // destroy matrix/vector descriptors
    CHECK_CUSPARSE(cusparseSpGEMM_destroyDescr(spgemmDesc))
    CHECK_CUSPARSE(cusparseDestroySpMat(matA))
    CHECK_CUSPARSE(cusparseDestroySpMat(matB))
    CHECK_CUSPARSE(cusparseDestroySpMat(matC))
    CHECK_CUSPARSE(cusparseDestroy(handle))
    //--------------------------------------------------------------------------
    // device result check
    int   hC_csrOffsets_tmp[A_NUM_ROWS + 1];
    int   hC_columns_tmp[C_NUM_NNZ];
    cuDoubleComplex hC_values_tmp[C_NUM_NNZ];
    CHECK_CUDA(cudaMemcpy(hC_csrOffsets_tmp, dC_csrOffsets,
        (A_num_rows + 1) * sizeof(int),
        cudaMemcpyDeviceToHost))
        CHECK_CUDA(cudaMemcpy(hC_columns_tmp, dC_columns,
            C_nnz * sizeof(int),
            cudaMemcpyDeviceToHost))
        CHECK_CUDA(cudaMemcpy(hC_values_tmp, dC_values,
            C_nnz * sizeof(cuDoubleComplex),
            cudaMemcpyDeviceToHost))
        int correct = 1;
    for (int i = 0; i < A_num_rows + 1; i++) {
        if (hC_csrOffsets_tmp[i] != hC_csrOffsets[i]) {
            printf("hC_csrOffsets[i]: %d \n", hC_csrOffsets[i]);
            printf("hC_csrOffsets_tmp[i]: %d \n", hC_csrOffsets_tmp[i]);
            correct = 0;
            break;
        }
        //printf("hC_csrOffsets_tmp[i]: %d \n", hC_csrOffsets_tmp[i]);
    }
    for (int i = 0; i < C_nnz; i++) {
        //printf("hC_columns_tmp[i]: %d - hC_values_tmp[i]: %f + j%f \n", hC_columns_tmp[i], cuCreal(hC_values_tmp[i]), cuCimag(hC_values_tmp[i]));
        if (hC_columns_tmp[i] != hC_columns[i] ||
            cuCreal(hC_values_tmp[i]) != cuCreal(hC_values[i]) ||
            cuCimag(hC_values_tmp[i]) != cuCimag(hC_values[i])) {
            printf("hC_columns[i]: %d - hC_values[i]: %f + j%f \n", hC_columns[i], cuCreal(hC_values[i]), cuCimag(hC_values[i]));
            printf("hC_columns_tmp[i]: %d - hC_values_tmp[i]: %f + j%f \n", hC_columns_tmp[i], cuCreal(hC_values_tmp[i]), cuCimag(hC_values_tmp[i]));
            correct = 0;
            break;
        }
    }
    if (correct)
        printf("spgemm_example test PASSED\n");
    else {
        printf("spgemm_example test FAILED: wrong result\n");
        return EXIT_FAILURE;
    }
    //--------------------------------------------------------------------------
    // device memory deallocation
    CHECK_CUDA(cudaFree(dBuffer1))
    CHECK_CUDA(cudaFree(dBuffer2))
    CHECK_CUDA(cudaFree(dA_csrOffsets))
    CHECK_CUDA(cudaFree(dA_columns))
    CHECK_CUDA(cudaFree(dA_values))
    CHECK_CUDA(cudaFree(dB_csrOffsets))
    CHECK_CUDA(cudaFree(dB_columns))
    CHECK_CUDA(cudaFree(dB_values))
    CHECK_CUDA(cudaFree(dC_csrOffsets))
    CHECK_CUDA(cudaFree(dC_columns))
    CHECK_CUDA(cudaFree(dC_values))
    return EXIT_SUCCESS;
}

As I said above, changing all cuDoubleComplex declarations to cuFloatComplex and CUDA_C_64F to CUDA_C_32F works fine.

The CUDA Toolkit version I'm using is cuda_11.1.1_456.81_win10.

Thanks in advance for any help with this issue.

Regards,
Diego

2:4 sparsity speedup not as expected

On an A100 with CUDA 11.5 and driver version 495.29.05:
the cuSPARSELt example with m=n=k=4096 measures 0.545792 ms in FP16,
while the CUTLASS GEMM of the same size is profiled at 0.607959 ms in FP16.
This only gives about 1.1x at best.
The blog post claimed that even smaller sizes could get a 1.5x speedup over dense, and that the speedup increases as the size grows. I wonder in what setting those data points were measured and what metric was used to report the speedup (was it samples/s/watt)?

cuSparse: cusparseScsrgemm2 much slower than SpGEMM

According to this comment, the current SpGEMM implementation may issue CUSPARSE_STATUS_INSUFFICIENT_RESOURCES for some specific inputs, so I tried the cusparseScsrgemm2 method. However, I find that cusparseScsrgemm2 is quite slow. For example, for two 600,000 x 600,000 matrices A and B, where A contains 40,000,000 entries and B is a diagonal matrix, cusparseScsrgemm2 took several seconds to compute the product of A and B, much slower than SpGEMM, which took only tens of milliseconds. I used CUDA 11.3 and a Tesla V100. The input matrices can be downloaded here. The program is as follows.

#include <cstdio>
#include <cstdlib>
#include <cusparse.h>

void mul(const int n,
         const float* const A_val, const int* const A_colind, const int* const A_rowptr, const int A_nnz,
         const float* const B_val, const int* const B_colind, const int* const B_rowptr, const int B_nnz,
         float** const C_val, int** const C_colind, int** const C_rowptr, int* const C_nnz) {

  float alpha = 1.0;

  cusparseHandle_t handle;
  cusparseCreate(&handle);

  cusparseMatDescr_t desc;
  cusparseCreateMatDescr(&desc);
  cusparseSetMatType(desc, CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase(desc, CUSPARSE_INDEX_BASE_ZERO);

  csrgemm2Info_t info = NULL;
  cusparseCreateCsrgemm2Info(&info);

  size_t buffer_size;
  cusparseScsrgemm2_bufferSizeExt(handle, n, n, n, &alpha,
                                   desc, A_nnz, A_rowptr, A_colind,
                                   desc, B_nnz, B_rowptr, B_colind,
                                   NULL,
                                   desc, B_nnz, B_rowptr, B_colind,
                                   info, &buffer_size);
  void* buffer = NULL;
  cudaMalloc(&buffer, buffer_size);

  cudaMalloc(C_rowptr, sizeof(int) * (n + 1));
  cusparseXcsrgemm2Nnz(handle, n, n, n,
                                      desc, A_nnz, A_rowptr, A_colind,
                                      desc, B_nnz, B_rowptr, B_colind,
                                      desc, B_nnz, B_rowptr, B_colind,
                                      desc, *C_rowptr, C_nnz,
                                      info, buffer);

  cudaMalloc(C_colind, sizeof(int) * *C_nnz);
  cudaMalloc(C_val, sizeof(float) * *C_nnz);

  cusparseScsrgemm2(handle, n, n, n, &alpha,
                               desc, A_nnz, A_val, A_rowptr, A_colind,
                               desc, B_nnz, B_val, B_rowptr, B_colind,
                               NULL,
                               desc, B_nnz, B_val, B_rowptr, B_colind,
                               desc, *C_val, *C_rowptr, *C_colind,
                               info, buffer);

  cusparseDestroyCsrgemm2Info(info);
  cusparseDestroyMatDescr(desc);
  cudaFree(buffer);
}

int main() {
  FILE* file = fopen("AS.bin", "rb");
  int A_nnz, B_nnz, n;
  fread(&A_nnz, sizeof(int), 1, file);
  fread(&B_nnz, sizeof(int), 1, file);
  fread(&n, sizeof(int), 1, file);
  printf("%d %d %d\n", A_nnz, B_nnz, n);
  float* h_A_val = new float[A_nnz];
  int* h_A_colind = new int[A_nnz];
  int* h_A_rowptr = new int[n + 1];
  float* h_B_val = new float[B_nnz];
  int* h_B_colind = new int[B_nnz];
  int* h_B_rowptr = new int[n + 1];
  fread(h_A_val, sizeof(float), A_nnz, file);
  fread(h_A_colind, sizeof(int), A_nnz, file);
  fread(h_A_rowptr, sizeof(int), n + 1, file);
  fread(h_B_val, sizeof(float), B_nnz, file);
  fread(h_B_colind, sizeof(int), B_nnz, file);
  fread(h_B_rowptr, sizeof(int), n + 1, file);
  fclose(file);

  float* d_A_val;
  cudaMalloc(&d_A_val, sizeof(float) * A_nnz);
  cudaMemcpy(d_A_val, h_A_val, sizeof(float) * A_nnz, cudaMemcpyHostToDevice);
  int* d_A_colind;
  cudaMalloc(&d_A_colind, sizeof(int) * A_nnz);
  cudaMemcpy(d_A_colind, h_A_colind, sizeof(int) * A_nnz, cudaMemcpyHostToDevice);
  int* d_A_rowptr;
  cudaMalloc(&d_A_rowptr, sizeof(int) * (n + 1));
  cudaMemcpy(d_A_rowptr, h_A_rowptr, sizeof(int) * (n + 1), cudaMemcpyHostToDevice);
  int d_A_nnz = A_nnz;

  float* d_B_val;
  cudaMalloc(&d_B_val, sizeof(float) * B_nnz);
  cudaMemcpy(d_B_val, h_B_val, sizeof(float) * B_nnz, cudaMemcpyHostToDevice);
  int* d_B_colind;
  cudaMalloc(&d_B_colind, sizeof(int) * B_nnz);
  cudaMemcpy(d_B_colind, h_B_colind, sizeof(int) * B_nnz, cudaMemcpyHostToDevice);
  int* d_B_rowptr;
  cudaMalloc(&d_B_rowptr, sizeof(int) * (n + 1));
  cudaMemcpy(d_B_rowptr, h_B_rowptr, sizeof(int) * (n + 1), cudaMemcpyHostToDevice);
  int d_B_nnz = B_nnz;

  float* d_C_val = NULL;
  int* d_C_colind = NULL;
  int* d_C_rowptr = NULL;
  int d_C_nnz;

  mul(n, d_A_val, d_A_colind, d_A_rowptr, d_A_nnz,
         d_B_val, d_B_colind, d_B_rowptr, d_B_nnz,
         &d_C_val, &d_C_colind, &d_C_rowptr, &d_C_nnz);

  printf("%d\n", d_C_nnz);

  return 0;
}

I have the following questions.

  1. Is the low efficiency of cusparseScsrgemm2 caused by not being able to exploit the architecture of V100?
  2. Are there any other alternatives to cusparseScsrgemm2 and SpGEMM?

Thanks.

Find CUDNN Samples

Dear developer:
I'm doing cuDNN (v8.2.1) development on Windows 10 and I'm having some problems, but I can't find the official sample code. Will cuDNN sample code be added to this repository, or where is the official cuDNN sample code?
Looking forward to your reply.

j2k support

Trying to open j2k files from DALI results in an error. I was told that this should be an nvJPEG2000 issue.

windows10 && cuda10.2 build failure

1>nvjpegDecoder.cpp(40): error C2660: 'cudaEventCreate': function does not take 2 arguments
1>nvjpegDecoder.cpp(41): error C2660: 'cudaEventCreate': function does not take 2 arguments
1>nvjpegDecoder.cpp(56): error C3861: 'nvjpegJpegStreamParseHeader': identifier not found
1>nvjpegDecoder.cpp(58): error C3861: 'nvjpegDecodeBatchedSupported': identifier not found
1>nvjpegDecoder.cpp(84): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data
1>nvjpegDecoder.cpp(331): error C2065: 'NVJPEG_BACKEND_HARDWARE': undeclared identifier

Possible confusing routine definition in cuSparse doc in CUDA 11.2

Hi there,

I just found a confusing routine definition in section 14.6.11 of the cuSPARSE library doc:

cusparseStatus_t CUSPARSEAPI
cusparseSpGEMM_copy(cusparseHandle_t      handle,
                    cusparseOperation_t   opA,
                    cusparseOperation_t   opB,
                    const void*           alpha,
                    cusparseSpMatDescr_t  matA,
                    cusparseSpMatDescr_t  matB,
                    const void*           beta,
                    cusparseSpMatDescr_t  matC,
                    cudaDataType          computeType,
                    cusparseSpGEMMAlg_t   alg,
                    cusparseSpGEMMDescr_t spgemmDescr,
                    void*                 externalBuffer2);

In the provided example and in my own tests, cusparseSpGEMM_copy only works without the last input argument, 'void* externalBuffer2'. Just FYI.

cusparseConstrainedGeMM is not working with data type CUDA_R_16F

Hi! I'm trying to use the cusparseSpGEMM. I got the correct results when the data type is CUDA_R_32F and CUDA_R_64F. However, after changing to CUDA_R_16F, the program reports an error as follows

On entry to cusparseConstrainedGeMM_bufferSize(): type of matA/matB (CUDA_R_16F), matC (CUDA_R_16F), compute (-1185136882) is not supported

I attached my code below, and I'm using the Docker image 11.2.0-devel-ubuntu18.04 on a V100 GPU. The driver version is 418.126.02 and the CUDA version is 11.2.

In the document, it says types including CUDA_R_16F, CUDA_R_32F, CUDA_R_64F are supported.

code.zip

nvJPEG failed with code 6 in CHECK_NVJPEG(status);

As titled, my program failed at line 340 in this file

I successfully compiled the project myself following the official NVIDIA nvJPEG manual here, with the following file hierarchy:
[screenshot: file hierarchy]
Here is my project repo; the only difference is that I moved the main function out of nvjpegDecoder.cpp into my own main.cpp for better modularity, and the rest is the same.

I found that this error code (6) means "Error during the execution of the device tasks", but I am not clear about what is causing it.

OS: Ubuntu 18.04.3 LTS
CPU: i9-9900KF
NVIDIA-SMI: 440.59
GPU: TITAN V
Makefile flags:

LDFLAGS=-L../lib64 -L/usr/local/cuda-11.0/lib64
INCLUDES=-I./include -I/usr/local/cuda-11.0/include
LIBS=-ldl -lrt -lcudart -lnvjpeg
RPATH=-Wl,-rpath=../lib64 -Wl,-rpath=/usr/local/cuda-11.0/lib64
CXXFLAGS:= -O2 -std=c++14 $(LDFLAGS) $(INCLUDES) $(LIBS) $(RPATH)

Any suggestion would be greatly appreciated.
Thank you.

nvJPEG encoder buffer size explode?

Hi,
I have some questions regarding nvjpeg encoder calls.
Currently I am processing large-scale whole-slide images read from another library called OpenSlide; the resolution is about 25000 * 15000 * 3 (RGB). I found that nvjpegEncodeImage() throws either NVJPEG failure: '6' or NVJPEG failure: '5', depending on which of two machine setups I use (listed below).

Next, to investigate the problem, I called nvjpegEncodeGetBufferSize() to get a rough idea of how much memory I may need (since error code 5 looks like a memory-allocator failure).

Then I found that once I pass $image_width, $image_height > 20000, $max_stream_length becomes an incredibly large value.

Code

Details as comments
I followed this sample code from the official nvJPEG documentation

nvjpegHandle_t nv_handle;
nvjpegEncoderState_t nv_enc_state;
nvjpegEncoderParams_t nv_enc_params;
cudaStream_t stream;
cudaStreamCreate(&stream);
 
// initialize nvjpeg structures
CHECK_NVJPEG(nvjpegCreateSimple(&nv_handle));
CHECK_NVJPEG(nvjpegEncoderStateCreate(nv_handle, &nv_enc_state, stream));
CHECK_NVJPEG(nvjpegEncoderParamsCreate(nv_handle, &nv_enc_params, stream));
CHECK_NVJPEG(nvjpegEncoderParamsSetSamplingFactors(nv_enc_params, NVJPEG_CSS_444, stream));

int64_t target_width, target_height;
/* whole_slide_rgb is declared as OpenCV's vector of matrices, i.e., vector<Mat>
*  So we loop through this vector to sequentially fetch every RGB image for nvJPEG encoder
*/
for(int idx=0; idx < whole_slide_rgb.size(); idx++){
    nvjpegImage_t nv_image;
    target_width = target_dimension[idx].first;
    target_height = target_dimension[idx].second;
    
    // Fill nv_image with image data, let’s say target_height, target_width image in RGB format
    Mat bgr[3];
    split(whole_slide_rgb[idx], bgr); // split RGB three channels from whole_slide_rgb[idx]
    printf("   Target width: %ld, Target height: %ld\n", target_width, target_height);
    for(int i=0; i<3; i++){
        CHECK_CUDA(cudaMalloc((void **)&(nv_image.channel[i]), target_width * target_height));
        CHECK_CUDA(cudaMemcpy(nv_image.channel[i], bgr[2-i].data, target_width * target_height, cudaMemcpyHostToDevice));
        nv_image.pitch[i] = (size_t)target_width;
    }
    
    /* Tricky part
    *  (target_width, target_height) real case, which is 25659 * 15781
    *  (25000, 15000) is close to my real image dimension, and it would output extraordinarily long length
    *  (15000, 15000) is for comparison, works fine and reasonable
    *  (10000, 10000) Also, works fine and reasonable.
    */
    size_t max_stream_length=0;
    CHECK_NVJPEG(nvjpegEncodeGetBufferSize(nv_handle, nv_enc_params, target_width, target_height, &max_stream_length));
    printf("   Max stream length estimated: %zu\n", max_stream_length);
    CHECK_NVJPEG(nvjpegEncodeGetBufferSize(nv_handle, nv_enc_params, 25000, 15000, &max_stream_length));
    printf("   Max stream length estimated: %zu\n", max_stream_length);
    CHECK_NVJPEG(nvjpegEncodeGetBufferSize(nv_handle, nv_enc_params, 15000, 15000, &max_stream_length));
    printf("   Max stream length estimated: %zu\n", max_stream_length);
    CHECK_NVJPEG(nvjpegEncodeGetBufferSize(nv_handle, nv_enc_params, 10000, 10000, &max_stream_length));
    printf("   Max stream length estimated: %zu\n", max_stream_length);
    // Compress image
    CHECK_NVJPEG(nvjpegEncodeImage(nv_handle, nv_enc_state, nv_enc_params,
        &nv_image, NVJPEG_INPUT_RGB, target_width, target_height, stream));
     
    // get compressed stream size
    size_t length;
    CHECK_NVJPEG(nvjpegEncodeRetrieveBitstream(nv_handle, nv_enc_state, NULL, &length, stream));
    // get stream itself
    cudaStreamSynchronize(stream);
    vector<unsigned char> jpeg(length);
    CHECK_NVJPEG(nvjpegEncodeRetrieveBitstream(nv_handle, nv_enc_state, jpeg.data(), &length, 0));
     
    // write stream to file
    const char* jpegstream = reinterpret_cast<const char*>(jpeg.data());
    cudaStreamSynchronize(stream);
    ofstream output_file("test"+to_string(idx)+".jpg", ios::out | ios::binary);
    output_file.write(jpegstream, length);
    output_file.close();
}
CHECK_NVJPEG(nvjpegDestroy(nv_handle));

Outputs

image
[Note]: Line 182 in the screenshot is exactly the nvjpegEncodeImage call, right after checking the max buffer size

Machine environments

Throwing NVJPEG failure: '5'

Self PC

  • OS: Ubuntu 18.04
  • host ram: 16GB ddr4
  • nvcc --version: cuda_11.0
  • GPU: RTX 2080Ti
  • CPU: Intel i7-9700K

Throwing NVJPEG failure: '6'

Super computer -- Taiwania-2

  • OS: Ubuntu 18.04
  • host ram: 754GB ddr4
  • nvcc --version: cuda_11.1
  • GPU: NVIDIA Tesla V100-SXM2 (32G)
  • CPU: Intel(R) Xeon(R) Gold 6154

Thanks for your support!

SVS tile decoding gets "NVJPEG2K_STATUS_JPEG_NOT_SUPPORTED"

Hi,
I want to decode a medical image file (SVS).
I used this sample code without any modification,
but got NVJPEG2K_STATUS_JPEG_NOT_SUPPORTED from nvjpeg2kStreamParse.
Here is my sample bitstream.
It should be decoded like this:
tile
It can be decoded successfully using openjpeg.

Will nvjpeg2k support this JPEG2000 stream in the future?

ENV:
Ubuntu 20.04.2 LTS
A100-PCIE-40GB
CUDA Version: 11.2
Driver Version: 460.73.01
nvJPEG2000 0.2.0.22

Thanks!

How to calculate J'J, where J (m x n, m != n) is a sparse matrix?

I changed the code to

    CHECK_CUSPARSE(cusparseSpGEMM_workEstimation(handle,
                                                 CUSPARSE_OPERATION_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                                 &alpha, mat_J, mat_J, &beta, mat_JTJ,
                                                 CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT,
                                                 spgemmDesc, &bufferSize1, dBuffer1));

with error

** On entry to cusparseSpGEMM_workEstimation() dimension mismatch: matA->cols != matB->rows

In fact, I pass in matrix J in COO format together with a residual vector R. I want to calculate J'J and J'R; is there an efficient pipeline for this?
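
A possible workaround, assuming cusparseSpGEMM only accepts CUSPARSE_OPERATION_NON_TRANSPOSE for both operands (which the dimension-mismatch message above suggests) and assuming J has already been converted to CSR: materialize J' explicitly by converting J from CSR to CSC with cusparseCsr2cscEx2. The CSC arrays of J are exactly the CSR arrays of J', so the usual non-transpose SpGEMM can then be run on (J', J). The helper below is a sketch of mine, not part of the sample; its output arrays must be preallocated with nnz, n+1, and nnz entries.

#include <cuda_runtime.h>
#include <cusparse.h>

// Sketch: build the CSR representation of J' (n x m) on the device from the
// CSR representation of J (m x n) by running a CSR -> CSC conversion.
void build_Jt_csr(cusparseHandle_t handle, int m, int n, int nnz,
                  const float* dJ_val, const int* dJ_rowptr, const int* dJ_colind,
                  float* dJt_val, int* dJt_rowptr, int* dJt_colind)
{
    size_t buf_size = 0;
    void*  d_buf    = NULL;
    cusparseCsr2cscEx2_bufferSize(handle, m, n, nnz,
                                  dJ_val, dJ_rowptr, dJ_colind,
                                  dJt_val, dJt_rowptr, dJt_colind,
                                  CUDA_R_32F, CUSPARSE_ACTION_NUMERIC,
                                  CUSPARSE_INDEX_BASE_ZERO,
                                  CUSPARSE_CSR2CSC_ALG1, &buf_size);
    cudaMalloc(&d_buf, buf_size);
    cusparseCsr2cscEx2(handle, m, n, nnz,
                       dJ_val, dJ_rowptr, dJ_colind,
                       dJt_val, dJt_rowptr, dJt_colind,
                       CUDA_R_32F, CUSPARSE_ACTION_NUMERIC,
                       CUSPARSE_INDEX_BASE_ZERO,
                       CUSPARSE_CSR2CSC_ALG1, d_buf);
    cudaFree(d_buf);
}

J'R can then be handled separately with cusparseSpMV, which, as far as I know, does accept CUSPARSE_OPERATION_TRANSPOSE on the matrix operand.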

Revise the product Ax = y

I think there is an error in the product illustration for the example: isn't it A * y = x instead of A * x = y?

make error (nvJPEG/nvJPEG-Decoder)

env: CUDA 10.2, Ubuntu 18.04

(base) root@xx-11-0:/workspace/CUDALibrarySamples-master/nvJPEG/nvJPEG-Decoder/build# make -B
[ 50%] Building CUDA object CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o
/workspace/CUDALibrarySamples-master/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp(56): error: identifier "nvjpegJpegStreamParseHeader" is undefined

/workspace/CUDALibrarySamples-master/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp(58): error: identifier "nvjpegDecodeBatchedSupported" is undefined

/workspace/CUDALibrarySamples-master/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp(331): error: identifier "NVJPEG_BACKEND_HARDWARE" is undefined

3 errors detected in the compilation of "/tmp/tmpxft_00000f8f_00000000-6_nvjpegDecoder.cpp1.ii".
CMakeFiles/nvjpegDecoder.dir/build.make:81: recipe for target 'CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o' failed
make[2]: *** [CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o] Error 1
CMakeFiles/Makefile2:94: recipe for target 'CMakeFiles/nvjpegDecoder.dir/all' failed
make[1]: *** [CMakeFiles/nvjpegDecoder.dir/all] Error 2
Makefile:148: recipe for target 'all' failed
make: *** [all] Error 2

Possible bug related to cusparseCnnz_compress

I found a possible bug in cusparseCnnz_compress with valid input data. The returned total number of non-zero elements and the per-row non-zero counts are incorrect for specific inputs. The sample code and the data files for the input CSR sparse matrix are attached.

I am using Visual Studio 2017 with SDK version 10.0.15063.0 and CUDA 11.2. The hardware is a 1080 Ti, a 2080 Ti, and an Intel i7-8700K; I tested on both GPUs and saw different behavior. Inside the attached package there are two subfolders, 'data_for_1080ti' and 'data_for_2080ti', containing the data that triggers wrong output on each GPU on my side. The correct and wrong total non-zero counts for 'data_for_1080ti' on the 1080 Ti are:
nnz_cpu = 1282036 (correct)
nnz_gpu = 1281069 (wrong)
The output for 'data_for_2080ti' on the 2080 Ti is:
nnz_cpu = 1590140 (correct)
nnz_gpu = 1589628 (wrong)
code_data.zip

How to process nvjpegImage_t.channel[0,1,2,...] in a CUDA kernel function

Hello, I have a question about how we can process the channel[0, 1, 2, ...] pointers of an nvjpegImage_t.

I have read this doc and this doc about the underlying objects in nvjpegImage_t.

It is obvious that channels 0, 1, and 2 represent the R, G, and B planes, according to line 146 in nvjpegDecoder.cpp and line 407 in nvjpegDecoder.h.

But when I try to run a simple CUDA kernel, as shown below, it immediately crashes, and what's worse, my printf debugging log is never shown.

  • CUDA kernel
__global__ void simple_kernel_function_RGB(unsigned char* img_R, unsigned char* img_G, unsigned char* img_B, const int& img_row, const int& img_col){
    printf("CUDA kernel function, img_row %d img_col %d \n", img_row, img_col);
    for(int i = 0; i < img_row; i++){
        for(int j = 0; j < img_col; j++){
            printf("[KERNEL]: Access i %d j %d \n", i, j);
            img_R[i * img_col + j] = 0;
            img_G[i * img_col + j] = 255;
            img_B[i * img_col + j] = 255;
        }
    }
}
  • Arguments passing
if(params.fmt == NVJPEG_OUTPUT_RGB){
            img_R = iout[batch].channel[0];
            img_G = iout[batch].channel[1];
            img_B = iout[batch].channel[2];
            printf("RGB image! \n");
            if(!img_R || !img_G || !img_B){
                fprintf(stderr, "%s", "Nullpointerexception \n");
            }
            simple_kernel_function_RGB<<<1, 1>>>(img_R, img_G, img_B, img_row, img_col);
            printf("Finished GPU kernel call \n");

image
image

I used cuda-memcheck to see where I went wrong, and it reports an invalid address read. (If the CUDA kernel call is removed, everything works fine again.)

at 0x00000070 in simple_kernel_function_RGB(unsigned char*, unsigned char*, unsigned char*, int const &, int const &)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7ffecf8df618 is out of bounds
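
The out-of-bounds address looks like a host stack address, so a likely culprit (an assumption on my part, not something confirmed in the sample) is the const int& kernel parameters: the device ends up dereferencing references that point to host memory. A minimal sketch with the dimensions passed by value instead (kernel renamed to keep it distinct):

__global__ void simple_kernel_function_RGB_v2(unsigned char* img_R, unsigned char* img_G,
                                              unsigned char* img_B, int img_row, int img_col)
{
    // Dimensions are plain ints now, so the kernel never touches host memory.
    for (int i = 0; i < img_row; i++) {
        for (int j = 0; j < img_col; j++) {
            img_R[i * img_col + j] = 0;
            img_G[i * img_col + j] = 255;
            img_B[i * img_col + j] = 255;
        }
    }
}

If the decoder allocates the channel buffers with a pitch different from the image width, indexing with iout[batch].pitch[c] instead of img_col would also be needed.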

And here are my questions:

  • Q1: Is it possible to directly operate on the pointers in nvjpegImage_t, for example in nvjpegDecoder.cpp,
iout.channel[0][0] = 0
iout.channel[1][0] = 0
iout.channel[2][0] = 0

which is meant to set the RGB values of pixel 0 to 0?
Or do we have to use a CUDA kernel launch (<<<...>>>) to operate on the pointer data on the GPU?

  • Q2: Following Q1, in my case it does not work; is some pre-setup required before operating on the nvjpegImage_t? (I think cudaMemcpy device-to-host should not be used, since nvJPEG decodes the image directly on the GPU for performance and throughput; if we move the data back to the CPU and process it there, that advantage is lost.)

  • Q3: Could the problem be my unsupported hardware, since it prints the message shown below at the very beginning?
    image

HW Specs:
CPU: i9-9900K, GPU: TITAN V
OS: Ubuntu 18.04.3 LTS, Kernel version is 5.4.0-53-generic
NVIDIA-SMI: 455.38 (CUDA Version 11.1)
nvcc:

Makefile:

#---------------------------Required files-----------------------#
SRC=./src/
BIN=./bin/

SOURCE=$(SRC)main.cu
TARGET=$(BIN)main_out

#------------------------Compiler and flag-----------------------#
# Compilers
CC=nvcc
# CC= g++

# Libraries and flags
OPENCV_LIBS=`pkg-config --cflags --libs opencv`
LDFLAGS=-L../lib64 -L/usr/local/cuda-11.1/lib64
INCLUDES=-I./include -I/usr/local/cuda-11.1/include
LIBS=-ldl -lrt -lcudart -lnvjpeg
RPATH=-Wl,-rpath=../lib64 -Wl,-rpath=/usr/local/cuda-11.1/lib64
CXXFLAGS:= -O2 -std=c++14 $(LDFLAGS) $(INCLUDES) $(LIBS) #$(RPATH)

#---------------------------Rules-------------------------------#
all: cuda

cuda: $(SOURCE)  
	$(CC) -m64 $(SOURCE) $(CXXFLAGS) -o $(TARGET)

.PHONY: clean

clean:
	rm -f $(BIN)*.o

Here main.cu is just nvjpegDecoder.cpp renamed, plus my CUDA kernel and the simple self-written functions mentioned above.

Sorry for the long post; my research project needs this library.
Many thanks in advance for any advice!

With regards,
Alfons

Failed to compile the Block-SpMM example

Hi folks, I tried to compile the Block-SpMM example but hit these errors. Any suggestions for fixing this?

nvcc -I/usr/bin/../include spmm_blockedell_example.cpp -o spmm_blockedell_example -lcudart -lcusparse
spmm_blockedell_example.cpp: In function ‘int main()’:
spmm_blockedell_example.cpp:143:21: error: ‘cusparseCreateBlockedEll’ was not declared in this scope
  143 |     CHECK_CUSPARSE( cusparseCreateBlockedEll(
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~
spmm_blockedell_example.cpp:67:32: note: in definition of macro ‘CHECK_CUSPARSE’
   67 |     cusparseStatus_t status = (func); \
      |                                ^~~~
make: *** [Makefile:54: spmm_blockedell_example] Error 1
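
For what it's worth, cusparseCreateBlockedEll only exists in fairly recent cuSPARSE releases (a CUDA 11.2+ toolkit, if I remember correctly), and the -I/usr/bin/../include path above makes me suspect the build is picking up older headers. A small check of mine (not part of the sample) to confirm which cuSPARSE the build actually sees:

#include <cstdio>
#include <cusparse.h>

// Prints the cuSPARSE version from the header used at compile time and from
// the library loaded at run time; an old or mismatched version would explain
// the missing cusparseCreateBlockedEll declaration.
int main() {
    int runtime_version = 0;
    cusparseHandle_t handle;
    cusparseCreate(&handle);
    cusparseGetVersion(handle, &runtime_version);
    printf("header: %d.%d.%d, runtime: %d\n",
           CUSPARSE_VER_MAJOR, CUSPARSE_VER_MINOR, CUSPARSE_VER_PATCH,
           runtime_version);
    cusparseDestroy(handle);
    return 0;
}

Building this with the same include path and -lcusparse as the failing command should show whether the headers and library match.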

nvJPEG-Decoder isn't faster on A100 vs V100

Hi, and thanks for providing these examples which are very useful.

I'm running the nvJPEG-Decoder on an A100 GPU and a V100 GPU, and I'm observing pretty much the same decoding times. A typical run on both the A100 and the V100 looks like:

./nvjpegDecoder -i ../input_images/ -o /tmp -w 100
Total decoding time: 12.2522
Avg decoding time per image: 1.02101
Avg images per sec: 0.979419
Avg decoding time per batch: 1.02101

Shouldn't we expect the A100 to be much faster?

Also, when I change NVJPEG_BACKEND_HARDWARE to NVJPEG_BACKEND_DEFAULT, I'm not seeing any performance hit on the A100. Is this expected? Am I doing something wrong?
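
As a side check (a sketch on my end, not part of the sample): as far as I know the dedicated hardware JPEG decode engine is only present on A100-class GPUs, so it may be worth verifying on each machine that a handle with the hardware backend can actually be created:

#include <cstdio>
#include <nvjpeg.h>

// Tries to create an nvJPEG handle with the hardware backend; failure means
// the GPU (or driver/library combination) does not expose the HW decoder.
int main() {
    nvjpegHandle_t handle;
    nvjpegStatus_t st = nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE, NULL, NULL,
                                       NVJPEG_FLAGS_DEFAULT, &handle);
    if (st == NVJPEG_STATUS_SUCCESS) {
        printf("hardware JPEG decode backend available\n");
        nvjpegDestroy(handle);
    } else {
        printf("hardware backend not available (status %d)\n", (int)st);
    }
    return 0;
}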

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Thanks for your help!

Cusparse SpGEMM insufficient resources

I used the following code to multiply two sparse matrices A and B using SpGEMM, following the sample code.

#include <cstdio>
#include <cstdlib>

#include <cusparse.h>

void mul(const int n,  // the size of matrix
         float *A_val, int *A_colind, int *A_rowptr, int A_nnz,
         float *B_val, int *B_colind, int *B_rowptr, int B_nnz,
         float **C_val, int **C_colind, int **C_rowptr, int *C_nnz) {
  float alpha = 1.0, beta = 0.0;

  cusparseHandle_t handle;
  cusparseCreate(&handle);

  cusparseSpMatDescr_t matA, matB, matC;
  cusparseCreateCsr(&matA, n, n, A_nnz,
                    A_rowptr, A_colind, A_val,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
  cusparseCreateCsr(&matB, n, n, B_nnz,
                    B_rowptr, B_colind, B_val,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
  cusparseCreateCsr(&matC, n, n, 0, NULL, NULL, NULL,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

  cusparseSpGEMMDescr_t spgemmDesc;
  cusparseSpGEMM_createDescr(&spgemmDesc);


  size_t buffer_size1 = 0, buffer_size2 = 0;
  void* buffer1 = NULL, *buffer2 = NULL;

  // First call queries the size of buffer1; the second call runs the
  // work-estimation phase with that buffer.
  cusparseSpGEMM_workEstimation(
          handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
		  &alpha, matA, matB, &beta, matC,
          CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
          &buffer_size1, NULL);
  cudaMalloc(&buffer1, buffer_size1);
  cusparseSpGEMM_workEstimation(
          handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
		  &alpha, matA, matB, &beta, matC,
          CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
          &buffer_size1, buffer1);

  // First call queries the size of buffer2; the second call performs the
  // actual multiplication and stores the result inside spgemmDesc.
  cusparseSpGEMM_compute(
          handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
		  &alpha, matA, matB, &beta, matC,
          CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
          &buffer_size2, NULL);
  cudaMalloc(&buffer2, buffer_size2);
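  // (Sketch, not in the original code: for inputs this large the compute-phase
  //  buffer can be huge, so report its size and make sure the allocation
  //  actually succeeded before interpreting the second compute call's status.)
  printf("compute-phase buffer request: %zu bytes\n", buffer_size2);
  cudaError_t alloc_err = cudaGetLastError();
  if (alloc_err != cudaSuccess) {
    printf("cudaMalloc(buffer2) failed: %s\n", cudaGetErrorString(alloc_err));
    exit(EXIT_FAILURE);
  }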
  if (CUSPARSE_STATUS_INSUFFICIENT_RESOURCES == cusparseSpGEMM_compute(
          handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
		  &alpha, matA, matB, &beta, matC,
          CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc,
          &buffer_size2, buffer2)) {
    printf("insufficient resources\n");
    exit(EXIT_FAILURE);
  }

  // Get the size of C, allocate its CSR arrays, and copy the result out of the descriptor.
  int64_t rows, cols, nnz;
  cusparseSpMatGetSize(matC, &rows, &cols, &nnz);

  *C_nnz = nnz;
  cudaMalloc(C_val, sizeof(float) * nnz);
  cudaMalloc(C_colind, sizeof(int) * nnz);
  cudaMalloc(C_rowptr, sizeof(int) * (n + 1));
  cusparseCsrSetPointers(matC, *C_rowptr, *C_colind, *C_val);
  cusparseSpGEMM_copy(
          handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE,
		  &alpha, matA, matB, &beta, matC,
          CUDA_R_32F, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc);

  cusparseSpGEMM_destroyDescr(spgemmDesc);
  cusparseDestroySpMat(matA);
  cusparseDestroySpMat(matB);
  cusparseDestroySpMat(matC);
  cudaFree(buffer1);
  cudaFree(buffer2);
}

int main() {
  FILE* file = fopen("matrix.bin", "rb");
  int A_nnz, B_nnz, n;
  fread(&A_nnz, sizeof(int), 1, file);
  fread(&B_nnz, sizeof(int), 1, file);
  fread(&n, sizeof(int), 1, file);
  printf("%d %d %d\n", A_nnz, B_nnz, n);
  float* h_A_val = new float[A_nnz];
  int* h_A_colind = new int[A_nnz];
  int* h_A_rowptr = new int[n + 1];
  float* h_B_val = new float[B_nnz];
  int* h_B_colind = new int[B_nnz];
  int* h_B_rowptr = new int[n + 1];
  fread(h_A_val, sizeof(float), A_nnz, file);
  fread(h_A_colind, sizeof(int), A_nnz, file);
  fread(h_A_rowptr, sizeof(int), n + 1, file);
  fread(h_B_val, sizeof(float), B_nnz, file);
  fread(h_B_colind, sizeof(int), B_nnz, file);
  fread(h_B_rowptr, sizeof(int), n + 1, file);
  fclose(file);

  float* d_A_val;
  cudaMalloc(&d_A_val, sizeof(float) * A_nnz);
  cudaMemcpy(d_A_val, h_A_val, sizeof(float) * A_nnz, cudaMemcpyHostToDevice);
  int* d_A_colind;
  cudaMalloc(&d_A_colind, sizeof(int) * A_nnz);
  cudaMemcpy(d_A_colind, h_A_colind, sizeof(int) * A_nnz, cudaMemcpyHostToDevice);
  int* d_A_rowptr;
  cudaMalloc(&d_A_rowptr, sizeof(int) * (n + 1));
  cudaMemcpy(d_A_rowptr, h_A_rowptr, sizeof(int) * (n + 1), cudaMemcpyHostToDevice);
  int d_A_nnz = A_nnz;

  float* d_B_val;
  cudaMalloc(&d_B_val, sizeof(float) * B_nnz);
  cudaMemcpy(d_B_val, h_B_val, sizeof(float) * B_nnz, cudaMemcpyHostToDevice);
  int* d_B_colind;
  cudaMalloc(&d_B_colind, sizeof(int) * B_nnz);
  cudaMemcpy(d_B_colind, h_B_colind, sizeof(int) * B_nnz, cudaMemcpyHostToDevice);
  int* d_B_rowptr;
  cudaMalloc(&d_B_rowptr, sizeof(int) * (n + 1));
  cudaMemcpy(d_B_rowptr, h_B_rowptr, sizeof(int) * (n + 1), cudaMemcpyHostToDevice);
  int d_B_nnz = B_nnz;

  float* d_C_val = NULL;
  int* d_C_colind = NULL;
  int* d_C_rowptr = NULL;
  int d_C_nnz;


  mul(n, d_A_val, d_A_colind, d_A_rowptr, d_A_nnz,
         d_B_val, d_B_colind, d_B_rowptr, d_B_nnz,
         &d_C_val, &d_C_colind, &d_C_rowptr, &d_C_nnz);

  printf("%d\n", d_C_nnz);

  return 0;
}

Using as input two 600,000 x 600,000 matrices A and B with 600,000 and 40,000,000 entries respectively, the second cusparseSpGEMM_compute call returns an "insufficient resources" error. According to the official documentation, this is caused by an insufficient buffer size. But how can that be? The buffer size is determined by the first invocation of cusparseSpGEMM_compute and, according to the documentation, is an upper bound on the actual size needed. Is it a bug?

I ran the test on a CentOS machine with CUDA-11.3. The GPU is Tesla V100. The compile command is
nvcc -std=c++17 -O3 mul.cu -o mul -lcusparse

The input file matrix.bin can be downloaded from here.

Compile error on an Ubuntu machine

Unable to make.
My machine info:
OS: Ubuntu 18.04 LTS
CPU: i7-5930K
GPU: RTX 2080 Ti, driver version: 455.23.05, CUDA version: 11.0 (also have tried 10.1, 10.0 on same machine and all failed)
A screen snapshot is provided below:
Screen Shot 2020-11-03 at 5 10 27 PM
and the error messages after the make command:
Screen Shot 2020-11-03 at 5 35 19 PM

Thanks!

Using int8 TC in cublasLtMatmul

Hi,
I'm trying to use the IMMA kernels with the cublasLtMatmul API, with Atype = Btype = Ctype = Dtype = CUDA_R_8I and alpha, beta = float. Here are my questions about this API:

  1. Is this configuration correct? I think Dtype should be the same as Ctype.
  2. Since Dtype is int8, there is a good chance that a result element of A*B is greater than 127 or less than -128, and this API just saturates the result to 127 or -128. It is designed to be like this, right?
  3. What should I do to avoid overflow of the result in this case, with Atype = Btype = CUDA_R_8I and alpha, beta = float? Do you have any suggestions? One idea I'm considering is sketched below.
    Thank you!
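
Regarding question 3, one option I believe works (worth double-checking against the cublasLt data-type tables) is to keep A and B as CUDA_R_8I but make C/D CUDA_R_32I with CUBLAS_COMPUTE_32I, so the accumulator never saturates; scaling or clamping back to int8 can then be done explicitly afterwards. A minimal descriptor-setup sketch under that assumption (dimensions, leading dimensions, and the cublasLtMatmul call itself are placeholders):

#include <cstdio>
#include <cublasLt.h>

// Sketch: int8 inputs accumulated into an int32 output. With CUBLAS_COMPUTE_32I
// and an int32 C/D, the scale type (alpha/beta) used here is CUDA_R_32I.
int main() {
    int m = 128, n = 128, k = 128;

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32I, CUDA_R_32I);

    cublasLtMatrixLayout_t a, b, c;
    cublasLtMatrixLayoutCreate(&a, CUDA_R_8I,  m, k, m);   // A: int8
    cublasLtMatrixLayoutCreate(&b, CUDA_R_8I,  k, n, k);   // B: int8
    cublasLtMatrixLayoutCreate(&c, CUDA_R_32I, m, n, m);   // C/D: int32, no saturation

    printf("int8-in / int32-out descriptors created\n");

    cublasLtMatrixLayoutDestroy(a);
    cublasLtMatrixLayoutDestroy(b);
    cublasLtMatrixLayoutDestroy(c);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(lt);
    return 0;
}

If float alpha/beta are a hard requirement, my understanding is that the documentation lists a separate int8-output combination with CUDA_R_32F scaling, but that path saturates as described in question 2.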
