Git Product home page Git Product logo

how-to-optim-algorithm-in-cuda's Introduction

how-to-optim-algorithm-in-cuda

本工程记录如何基于 cuda 优化一些常见的算法。请注意,下面的介绍都分别对应了子目录的代码实现,所以想复现性能的话请查看对应子目录下面的 README 。

0. how-to-compile-pytorch-from-source

记录如何手动编译 PyTorch 源码,学习 PyTorch 的一些 cuda 实现。

1. reduce

这里记录学习 NIVDIA 的reduce优化官方博客 做的笔记。完整实验代码见这里 , 原理讲解请看:【BBuf的CUDA笔记】三,reduce优化入门学习笔记

性能和带宽的测试情况如下 (A100 PCIE 40G):

优化手段 耗时(us) 带宽利用率 加速比
reduce_baseline 990.66us 39.57% ~
reduce_v1_interleaved_addressing 479.58us 81.74% 2.06
reduce_v2_bank_conflict_free 462.02us 84.81% 2.144
reduce_v3_idle_threads_free 244.16us 83.16% 4.057
reduce_v4_unroll_last_warp 167.10us 54.10% 5.928
reduce_v5_completely_unroll 158.78us 56.94% 6.239
reduce_v6_multi_add 105.47us 85.75% 9.392

2. elementwise

将 oneflow 的 elementwise 模板抽出来方便大家使用,这个 elementwise 模板实现了高效的性能和带宽利用率,并且用法非常灵活。完整实验代码见这里 ,原理讲解请看:【BBuf 的CUDA笔记】一,解析OneFlow Element-Wise 算子实现 。这里以逐点乘为例,性能和带宽的测试情况如下 (A100 PCIE 40G):

优化手段 数据类型 耗时(us) 带宽利用率
naive elementwise float 298.46us 85.88%
oneflow elementwise float 284us 89.42%
naive elementwise half 237.28us 52.55%
oneflow elementwise half 140.74us 87.31%

可以看到无论是性能还是带宽,使用 oneflow 的 elementwise 模板相比于原始实现都有较大提升。

3. FastAtomicAdd

实现的脚本是针对half数据类型做向量的内积,用到了atomicAdd,保证数据的长度以及gridsize和blocksize都是完全一致的。一共实现了3个脚本:

  1. https://github.com/BBuf/how-to-optim-algorithm-in-cuda/blob/master/FastAtomicAdd/atomic_add_half.cu 纯half类型的atomicAdd。
  2. https://github.com/BBuf/how-to-optim-algorithm-in-cuda/blob/master/FastAtomicAdd/atomic_add_half_pack2.cu half+pack,最终使用的是half2类型的atomicAdd。
  3. https://github.com/BBuf/how-to-optim-algorithm-in-cuda/blob/master/FastAtomicAdd/fast_atomic_add_half.cu 快速原子加,虽然没有显示的pack,但本质上也是通过对单个half补0使用上了half2的原子加。

性能和带宽的测试情况如下 (A100 PCIE 40G):

原子加方式 性能(us)
纯half类型 422.36ms
pack half2类型 137.02ms
fastAtomicAdd 137.01ms

可以看到使用pack half的方式和直接使用half的fastAtomicAdd方式得到的性能结果一致,均比原始的half的原子加快3-4倍。

4. UpsampleNearest2D

upsample_nearest_2d.cu 展示了 oneflow 对 upsample_nearest2d 的前后向的优化 kernel 的用法,性能和带宽的测试情况如下 (A100 PCIE 40G):

框架 数据类型 Op类型 带宽利用率 耗时
PyTorch Float32 UpsampleNearest2D forward 28.30% 111.42us
PyTorch Float32 UpsampleNearest2D backward 60.16% 65.12us
OneFlow Float32 UpsampleNearest2D forward 52.18% 61.44us
OneFlow Float32 UpsampleNearest2D backward 77.66% 50.56us
PyTorch Float16 UpsampleNearest2D forward 16.99% 100.38us
PyTorch Float16 UpsampleNearest2D backward 31.56% 57.38us
OneFlow Float16 UpsampleNearest2D forward 43.26% 35.36us
OneFlow Float16 UpsampleNearest2D backward 44.82% 40.26us

可以看到基于 oneflow upsample_nearest2d 的前后向的优化 kernel 可以获得更好的带宽利用率和性能。注意这里的 profile 使用的是 oneflow 脚本,而不是 upsample_nearest_2d.cu ,详情请看 UpsampleNearest2D/README.md

5. indexing

在 PyTorch 中对 index_add 做了极致的优化,我这里将 PyTorch 的 index_add 实现 进行了剥离,方便大家应用于其它框架。具体请看 indexing 文件夹的 README 。其中还有和 oneflow 的 index_add 实现的各个 case 的性能比较结果。整体来说 PyTorch 在 index Tensor元素很小,但Tensor很大的情况下有较大的性能提升,其它情况和 OneFlow 基本持平。详情请看 indexing/README.md

6. 学习笔记相关博客

how-to-optim-algorithm-in-cuda's People

Contributors

bbuf avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.