A CUDA tutorial for learning CUDA programming from scratch
Turing T4 GPU
- Related performance data is attached at the top of each code file.
- Performance data varies across GPU platforms and NVCC compiler versions, so some counter-intuitive results are normal; treat them as something to explore and debug.
- Comments and pull requests are welcome.
- add CUDA streams
- add quantization
- add FP32/FP16 GEMV