
cudatutorial's Introduction

CUDATutorial

A CUDA tutorial for learning CUDA programming from scratch.

test environment

Turing T4 GPU

compile command

  1. compile by hand

nvcc xxx.cu -o xxx

If that does not work, please try:

nvcc xxx.cu --gpu-architecture=compute_yy -o xxx

xxx is the file name and yy is the GPU compute capability; e.g., the A100's compute capability is 80 and the Turing T4's is 75.
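For example, on the Turing T4 used in the test environment above (compute capability 75), compiling a hypothetical file reduce.cu would look like:

nvcc reduce.cu --gpu-architecture=compute_75 -o reduce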

  2. one-click compile and run

please ensure:

1. CMake version >= 3.8

2. The CUDA Toolkit is installed in the system default location; download link: https://developer.nvidia.com/cuda-downloads.

 mkdir build 
 cd build 
 cmake .. && make -j8 
 cd bin 
 ./xxx

remark

  • Related performance data is attached at the top of each code file.
  • Performance data varies across GPU platforms and NVCC compiler versions, so some counter-intuitive results are normal; just explore and debug them.
  • All comments and pull requests are welcome.

update notes

v2.0

  • add cuda stream
  • add quantize

v2.1

  • add fp32/fp16 GEMV (vec * mat, mat is column-major)

v2.2

  • add fp32/fp16 GEMV (vec * mat, mat is row-major)
  • add some code explanations (WIP)

v2.6

  • add fp32 dropout

cudatutorial's People

Contributors

russwong, wenxin-zhao


cudatutorial's Issues

8_copy_if.cu has an uninitialized variable

In 8_copy_if.cu, the variable d in line 44 is not assigned a value, but the variable is used directly in lines 60 and 65. I think it should be changed to d = src[i].
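For reference, a minimal sketch (not the repository's exact kernel, and the predicate here is purely illustrative) of where the suggested d = src[i] assignment would go in a copy_if-style kernel:

#include <cuda_runtime.h>

__global__ void copy_if(const int* src, int* dst, int* count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int d = src[i];                  // suggested fix: load the element into d before it is used
        if (d > 0) {                     // hypothetical predicate: keep positive values
            int pos = atomicAdd(count, 1);
            dst[pos] = d;                // d is now well-defined when it is written out
        }
    }
}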

9_gelu.cu code wrong

https://github.com/RussWong/CUDATutorial/blob/main/9_gelu.cu#L67

According to the GELU formula there should be only two __hmul2 calls in the outermost layer, and after commenting out the original code, compilation also fails.

I think one correct version of the code is:

const half2 y2 = __hmul2(__hmul2(__float2half2_rn(0.5F), x2),
                         __hadd2(__float2half2_rn(1.0F), __float22half2_rn(tanh_out)));
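For context, a hedged sketch of a half2 GELU (assuming the usual tanh approximation; this is not the repository's exact 9_gelu.cu) in which the outermost expression contains exactly the two __hmul2 calls described above:

#include <cuda_fp16.h>
#include <cmath>

__device__ half2 gelu_half2(half2 x2) {
    const float2 xf = __half22float2(x2);
    const float k0 = 0.7978845608028654f;   // sqrt(2 / pi)
    const float k1 = 0.044715f;
    float2 tanh_out;
    tanh_out.x = tanhf(k0 * (xf.x + k1 * xf.x * xf.x * xf.x));
    tanh_out.y = tanhf(k0 * (xf.y + k1 * xf.y * xf.y * xf.y));
    // two __hmul2 calls in the outermost layer: 0.5 * x * (1 + tanh(...))
    const half2 y2 = __hmul2(__hmul2(__float2half2_rn(0.5F), x2),
                             __hadd2(__float2half2_rn(1.0F), __float22half2_rn(tanh_out)));
    return y2;
}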

a little question about reduce in softmax

I tried to implement a block-level softmax:
Input shape (1000 * 1024), Grid(1000), Block(1024);
I want one block to process one row of data; the kernel implementation is below (see the ** comments for my questions):

__global__ void BlockSoftmax(const float* src, float* dst, int rows, int cols) {
    __shared__ float shared_max[32];
    __shared__ float shared_sum[32];

    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int row = blockIdx.x;

    float thread_max = -INFINITY;
    float thread_sum = 0;

    for (int i = threadIdx.x + threadIdx.y * blockDim.x; i < cols; i += blockDim.x * blockDim.y) {
        float val = src[row * cols + i];
        thread_max = fmax(thread_max, val);
    }

    // Reduce max using __shfl_xor_sync
    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_max = fmax(thread_max, __shfl_xor_sync(0xFFFFFFFF, thread_max, mask));
    }
   
    if (tid % warpSize == 0) {
        shared_max[tid / warpSize] = thread_max;
    }
    __syncthreads();
**// After the following code runs, my understanding is that only the first 32 threads have the correct thread_max in their registers**
    if (tid < warpSize) {
        thread_max = shared_max[tid];
    }
    __syncthreads();
**// After the following code runs, my understanding is that only the first thread has the correct thread_max in its register**
    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_max = fmax(thread_max, __shfl_xor_sync(0xFFFFFFFF, thread_max, mask));
    }
**// So I use the commented-out code below to broadcast the first thread's thread_max to the other threads in the block**
**// But the test also passes without the code below, and I don't understand why; thread_sum shows the same behavior (see the broadcast sketch after this kernel)**
    // if(tid==0){
    //     shared_max[0] = thread_max;
    // }
    // __syncthreads();
    // float max_val = shared_max[0];

    for (int i = threadIdx.x + threadIdx.y * blockDim.x; i < cols; i += blockDim.x * blockDim.y) {
        float val = src[row * cols + i];
        float exp_val = exp(val - thread_max);
        dst[row * cols + i] = exp_val;
        thread_sum += exp_val;
    }

    // Reduce sum using __shfl_xor_sync
    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_sum += __shfl_xor_sync(0xFFFFFFFF, thread_sum, mask);
    }

    if (tid % warpSize == 0) {
        shared_sum[tid / warpSize] = thread_sum;
    }
    __syncthreads();

    if (tid < warpSize) {
        thread_sum = shared_sum[tid];
    }
    __syncthreads();

    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_sum += __shfl_xor_sync(0xFFFFFFFF, thread_sum, mask);
    }

    //float sum_val = thread_sum;
    // if(tid==0){
    //     shared_sum[0] = thread_sum;
    // }
    // __syncthreads();
    // float sum_val = shared_sum[0];

    for (int i = threadIdx.x + threadIdx.y * blockDim.x; i < cols; i += blockDim.x * blockDim.y) {
        dst[row * cols + i] /= thread_sum;
    }
}
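For reference, here is a hedged sketch of the explicit broadcast pattern that the commented-out code above implements (the helper name block_reduce_max is illustrative, not from the repository): after the second warp-level reduction only warp 0 is guaranteed to hold the block-wide maximum, so writing it to shared memory and reading it back after __syncthreads() makes it visible to every thread in the block.

#include <cfloat>
#include <cuda_runtime.h>

__device__ float block_reduce_max(float val) {
    __shared__ float shared[32];          // one slot per warp (blockDim.x <= 1024)
    __shared__ float block_val;           // final broadcast slot
    int tid  = threadIdx.x;
    int lane = tid % warpSize;
    int wid  = tid / warpSize;

    // warp-level reduction: every lane ends up with its warp's max
    for (int mask = warpSize / 2; mask > 0; mask /= 2)
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, mask));

    if (lane == 0) shared[wid] = val;
    __syncthreads();

    // the first warp reduces the per-warp maxima
    if (wid == 0) {
        int num_warps = (blockDim.x + warpSize - 1) / warpSize;
        val = (lane < num_warps) ? shared[lane] : -FLT_MAX;
        for (int mask = warpSize / 2; mask > 0; mask /= 2)
            val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, mask));
        if (lane == 0) block_val = val;   // explicit broadcast, as in the commented-out code
    }
    __syncthreads();
    return block_val;                     // every thread reads the block-wide max
}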

10_fused_bias_mask_scale_and_add

10_fused_bias_mask_scale_and_add/10_fused_bias_mask_scale_and_add_fp32.cu

  • bias may not be initialized

  • The pointer used by the functor declared on line 63 points to CPU memory, not GPU memory
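A hedged sketch of how both points could be addressed (the functor name and fields here are hypothetical, not the repository's exact code): initialize bias on the host, then hand the functor a device pointer obtained via cudaMalloc/cudaMemcpy instead of the host pointer.

#include <cuda_runtime.h>
#include <vector>

struct BiasAddFunctor {
    const float* bias;                       // must be a device pointer when used inside a kernel
    __device__ float operator()(float x, int col) const {
        return x + bias[col];
    }
};

int main() {
    const int cols = 1024;
    std::vector<float> h_bias(cols, 0.1f);   // initialize bias on the host

    float* d_bias = nullptr;
    cudaMalloc(&d_bias, cols * sizeof(float));
    cudaMemcpy(d_bias, h_bias.data(), cols * sizeof(float), cudaMemcpyHostToDevice);

    BiasAddFunctor functor{d_bias};          // construct the functor with the device pointer
    // ... launch the kernel that applies functor to each element ...

    cudaFree(d_bias);
    return 0;
}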
