
cudatutorial's Introduction

CUDATutorial

A CUDA tutorial for learning CUDA programming from scratch.

test environment

Turing T4 GPU

compile command

  1. compile by hand

nvcc xxx.cu -o xxx

If that does not work, please try:

nvcc xxx.cu --gpu-architecture=compute_yy -o xxx

xxx is the file name and yy is the GPU compute capability; e.g., the A100's compute capability is 80 and the Turing T4's is 75.
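For example, on the Turing T4 used in the test environment above (compute capability 75), compiling a hypothetical file reduce.cu would look like:

nvcc reduce.cu --gpu-architecture=compute_75 -o reduce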

  2. one-click compile and run

please ensure:

1. CMake version >= 3.8

2. The CUDA Toolkit is installed in the system default location; download link: https://developer.nvidia.com/cuda-downloads.

 mkdir build 
 cd build 
 cmake .. && make -j8 
 cd bin 
 ./xxx

remark

  • Related performance data is attached at the top of each code file.
  • Performance data varies across GPU platforms and NVCC compiler versions, so some counter-intuitive results are normal; just explore and debug them.
  • All comments and pull requests are welcome.

update notes

v2.0

  • add cuda stream
  • add quantize

v2.1

  • add fp32/fp16 GEMV (vec * mat, mat is column-major)

v2.2

  • add fp32/fp16 GEMV (vec * mat, mat is row-major)
  • add some code explanations (WIP)

v2.6

  • add fp32 dropout

cudatutorial's People

Contributors

russwong, wenxin-zhao


cudatutorial's Issues

8_copy_if.cu has an uninitialized variable

In 8_copy_if.cu, the variable d in line 44 is not assigned a value, but the variable is used directly in lines 60 and 65. I think it should be changed to d = src[i].
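For reference, a minimal sketch (not the repository's exact kernel, and the predicate here is purely illustrative) of where the suggested d = src[i] assignment would go in a copy_if-style kernel:

#include <cuda_runtime.h>

__global__ void copy_if(const int* src, int* dst, int* count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int d = src[i];                  // suggested fix: load the element into d before it is used
        if (d > 0) {                     // hypothetical predicate: keep positive values
            int pos = atomicAdd(count, 1);
            dst[pos] = d;                // d is now well-defined when it is written out
        }
    }
}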

9_gelu.cu code wrong

https://github.com/RussWong/CUDATutorial/blob/main/9_gelu.cu#L67

According to the GELU formula there should be only two __hmul2 calls in the outermost layer, and after commenting out the original code, compilation also fails.

I think one correct version of the code is:

const half2 y2 = __hmul2(__hmul2(__float2half2_rn(0.5F), x2),
                         __hadd2(__float2half2_rn(1.0F), __float22half2_rn(tanh_out)));
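For context, a hedged sketch of a half2 GELU (assuming the usual tanh approximation; this is not the repository's exact 9_gelu.cu) in which the outermost expression contains exactly the two __hmul2 calls described above:

#include <cuda_fp16.h>
#include <cmath>

__device__ half2 gelu_half2(half2 x2) {
    const float2 xf = __half22float2(x2);
    const float k0 = 0.7978845608028654f;   // sqrt(2 / pi)
    const float k1 = 0.044715f;
    float2 tanh_out;
    tanh_out.x = tanhf(k0 * (xf.x + k1 * xf.x * xf.x * xf.x));
    tanh_out.y = tanhf(k0 * (xf.y + k1 * xf.y * xf.y * xf.y));
    // two __hmul2 calls in the outermost layer: 0.5 * x * (1 + tanh(...))
    const half2 y2 = __hmul2(__hmul2(__float2half2_rn(0.5F), x2),
                             __hadd2(__float2half2_rn(1.0F), __float22half2_rn(tanh_out)));
    return y2;
}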

a little question about reduce in softmax

I tried to implement a block-level softmax:
Input shape (1000 * 1024), Grid(1000), Block(1024);
I want one block to process one row of data; the kernel implementation is below (see the ** comments for my questions):

__global__ void BlockSoftmax(const float* src, float* dst, int rows, int cols) {
    __shared__ float shared_max[32];
    __shared__ float shared_sum[32];

    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int row = blockIdx.x;

    float thread_max = -INFINITY;
    float thread_sum = 0;

    for (int i = threadIdx.x + threadIdx.y * blockDim.x; i < cols; i += blockDim.x * blockDim.y) {
        float val = src[row * cols + i];
        thread_max = fmax(thread_max, val);
    }

    // Reduce max using __shfl_xor_sync
    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_max = fmax(thread_max, __shfl_xor_sync(0xFFFFFFFF, thread_max, mask));
    }
   
    if (tid % warpSize == 0) {
        shared_max[tid / warpSize] = thread_max;
    }
    __syncthreads();
**// After the following code runs, my understanding is that only the first 32 threads have the correct thread_max in their registers**
    if (tid < warpSize) {
        thread_max = shared_max[tid];
    }
    __syncthreads();
**// After the following code runs, my understanding is that only the first thread has the correct thread_max in its register**
    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_max = fmax(thread_max, __shfl_xor_sync(0xFFFFFFFF, thread_max, mask));
    }
**// So I use the commented-out code below to broadcast the first thread's thread_max to the other threads in the block**
**// But the test also passes without the code below, and I don't understand why; thread_sum shows the same behavior (see the broadcast sketch after this kernel)**
    // if(tid==0){
    //     shared_max[0] = thread_max;
    // }
    // __syncthreads();
    // float max_val = shared_max[0];

    for (int i = threadIdx.x + threadIdx.y * blockDim.x; i < cols; i += blockDim.x * blockDim.y) {
        float val = src[row * cols + i];
        float exp_val = exp(val - thread_max);
        dst[row * cols + i] = exp_val;
        thread_sum += exp_val;
    }

    // Reduce sum using __shfl_xor_sync
    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_sum += __shfl_xor_sync(0xFFFFFFFF, thread_sum, mask);
    }

    if (tid % warpSize == 0) {
        shared_sum[tid / warpSize] = thread_sum;
    }
    __syncthreads();

    if (tid < warpSize) {
        thread_sum = shared_sum[tid];
    }
    __syncthreads();

    for (int mask = warpSize / 2; mask > 0; mask /= 2) {
        thread_sum += __shfl_xor_sync(0xFFFFFFFF, thread_sum, mask);
    }

    //float sum_val = thread_sum;
    // if(tid==0){
    //     shared_sum[0] = thread_sum;
    // }
    // __syncthreads();
    // float sum_val = shared_sum[0];

    for (int i = threadIdx.x + threadIdx.y * blockDim.x; i < cols; i += blockDim.x * blockDim.y) {
        dst[row * cols + i] /= thread_sum;
    }
}
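For reference, here is a hedged sketch of the explicit broadcast pattern that the commented-out code above implements (the helper name block_reduce_max is illustrative, not from the repository): after the second warp-level reduction only warp 0 is guaranteed to hold the block-wide maximum, so writing it to shared memory and reading it back after __syncthreads() makes it visible to every thread in the block.

#include <cfloat>
#include <cuda_runtime.h>

__device__ float block_reduce_max(float val) {
    __shared__ float shared[32];          // one slot per warp (blockDim.x <= 1024)
    __shared__ float block_val;           // final broadcast slot
    int tid  = threadIdx.x;
    int lane = tid % warpSize;
    int wid  = tid / warpSize;

    // warp-level reduction: every lane ends up with its warp's max
    for (int mask = warpSize / 2; mask > 0; mask /= 2)
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, mask));

    if (lane == 0) shared[wid] = val;
    __syncthreads();

    // the first warp reduces the per-warp maxima
    if (wid == 0) {
        int num_warps = (blockDim.x + warpSize - 1) / warpSize;
        val = (lane < num_warps) ? shared[lane] : -FLT_MAX;
        for (int mask = warpSize / 2; mask > 0; mask /= 2)
            val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, mask));
        if (lane == 0) block_val = val;   // explicit broadcast, as in the commented-out code
    }
    __syncthreads();
    return block_val;                     // every thread reads the block-wide max
}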

10_fused_bias_mask_scale_and_add

10_fused_bias_mask_scale_and_add/10_fused_bias_mask_scale_and_add_fp32.cu

  • bias may not be initialized

  • The pointer used by the functor declared on line 63 points to CPU memory, not GPU memory
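A hedged sketch of how both points could be addressed (the functor name and fields here are hypothetical, not the repository's exact code): initialize bias on the host, then hand the functor a device pointer obtained via cudaMalloc/cudaMemcpy instead of the host pointer.

#include <cuda_runtime.h>
#include <vector>

struct BiasAddFunctor {
    const float* bias;                       // must be a device pointer when used inside a kernel
    __device__ float operator()(float x, int col) const {
        return x + bias[col];
    }
};

int main() {
    const int cols = 1024;
    std::vector<float> h_bias(cols, 0.1f);   // initialize bias on the host

    float* d_bias = nullptr;
    cudaMalloc(&d_bias, cols * sizeof(float));
    cudaMemcpy(d_bias, h_bias.data(), cols * sizeof(float), cudaMemcpyHostToDevice);

    BiasAddFunctor functor{d_bias};          // construct the functor with the device pointer
    // ... launch the kernel that applies functor to each element ...

    cudaFree(d_bias);
    return 0;
}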
