- pytorch
- cupy
You will also need an NVIDIA GPU to run the code.
Implement a JIT compiler using a Python decorator!
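As a rough illustration of the decorator idea, here is a minimal sketch in plain Python. It assumes a pluggable `compile_fn` backend; the names `jit`, `toy_backend`, and `compile_log` are hypothetical and not taken from this repository. A real backend would presumably translate the function to CUDA source and build it, while the toy one here just returns the function unchanged so the example runs without a GPU.

```python
import functools

def jit(compile_fn):
    """Return a decorator that compiles the wrapped function on first call.

    `compile_fn` is a hypothetical backend: it receives the Python function
    and returns a callable that replaces it.
    """
    def decorator(fn):
        compiled = None  # filled in lazily, then reused on later calls

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            nonlocal compiled
            if compiled is None:
                compiled = compile_fn(fn)  # "just in time": compile on first use
            return compiled(*args, **kwargs)

        return wrapper
    return decorator

compile_log = []  # records each compilation so the caching is observable

def toy_backend(fn):
    """Stand-in backend (assumption): logs the compile and returns fn as-is."""
    compile_log.append(fn.__name__)
    return fn

@jit(toy_backend)
def add(a, b):
    return a + b
```

Calling `add` twice triggers `toy_backend` only once; the second call reuses the cached result.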
Implement a simple matrix exp function in CUDA!
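One way a deliberately simple first version might look is sketched below, assuming "matrix exp" means element-wise `exp` on a row-major `float32` matrix and that the kernel is built with `cupy.RawKernel`. The one-thread-per-row layout and the names `exp_rowwise` and `ceil_div` are assumptions for illustration, not the repository's actual code.

```python
EXP_ROWWISE_SRC = r"""
extern "C" __global__
void exp_rowwise(const float* x, float* y, int rows, int cols) {
    // One thread handles one whole row (simple, but little parallelism).
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    for (int c = 0; c < cols; ++c) {
        y[r * cols + c] = expf(x[r * cols + c]);
    }
}
"""

def ceil_div(a, b):
    """Smallest integer k such that k * b >= a."""
    return (a + b - 1) // b

def exp_rowwise(x, block=128):
    """Element-wise exp of a 2D array on the GPU (requires CuPy and a GPU)."""
    import cupy as cp  # lazy import: only this function needs a GPU

    x = cp.ascontiguousarray(x, dtype=cp.float32)
    rows, cols = x.shape  # assumes a 2D input
    y = cp.empty_like(x)
    kernel = cp.RawKernel(EXP_ROWWISE_SRC, "exp_rowwise")
    kernel((ceil_div(rows, block),), (block,),
           (x, y, cp.int32(rows), cp.int32(cols)))
    return y
```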
Make the exp kernel more efficient by using more parallelism! Now the performance already matches cuBLAS.
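A common way to expose more parallelism is to give every matrix element its own thread instead of, say, one thread per row. The sketch below assumes that approach, a contiguous `float32` input, and `cupy.RawKernel`; `launch_config` and `exp_elementwise` are hypothetical names.

```python
EXP_ELEMENTWISE_SRC = r"""
extern "C" __global__
void exp_elementwise(const float* x, float* y, int n) {
    // One thread per element of the flattened matrix.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = expf(x[i]);
}
"""

def launch_config(n, block=256):
    """Return (grid, block) tuples that cover n elements, one thread each."""
    return ((n + block - 1) // block,), (block,)

def exp_elementwise(x):
    """Element-wise exp on the GPU (requires CuPy and a GPU)."""
    import cupy as cp  # lazy import: only this function needs a GPU

    x = cp.ascontiguousarray(x, dtype=cp.float32)
    y = cp.empty_like(x)
    grid, block = launch_config(x.size)
    kernel = cp.RawKernel(EXP_ELEMENTWISE_SRC, "exp_elementwise")
    kernel(grid, block, (x, y, cp.int32(x.size)))
    return y
```

The bounds check `i < n` matters because the grid is rounded up and the last block may contain threads past the end of the array.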
Simplify the kernel code by using 2D partitioning. The pitfall to avoid is mapping the rows to the x dimension.
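A sketch of a 2D-partitioned kernel follows, assuming a row-major `float32` matrix and `cupy.RawKernel`. The likely reason rows should not go on the x dimension is that threads with consecutive `threadIdx.x` run together in a warp, so x should walk along columns to keep their memory accesses adjacent. The names `grid_2d` and `exp_2d` and the 32x8 block shape are assumptions.

```python
EXP_2D_SRC = r"""
extern "C" __global__
void exp_2d(const float* x, float* y, int rows, int cols) {
    // x dimension -> columns, y dimension -> rows. Neighbouring threads in x
    // then touch neighbouring addresses of a row-major matrix.
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        y[r * cols + c] = expf(x[r * cols + c]);
    }
}
"""

def grid_2d(rows, cols, block=(32, 8)):
    """Grid shape (x, y) covering a rows x cols matrix: x spans columns."""
    bx, by = block
    return ((cols + bx - 1) // bx, (rows + by - 1) // by)

def exp_2d(x, block=(32, 8)):
    """Element-wise exp of a 2D array on the GPU (requires CuPy and a GPU)."""
    import cupy as cp  # lazy import: only this function needs a GPU

    x = cp.ascontiguousarray(x, dtype=cp.float32)
    rows, cols = x.shape  # assumes a 2D input
    y = cp.empty_like(x)
    kernel = cp.RawKernel(EXP_2D_SRC, "exp_2d")
    kernel(grid_2d(rows, cols, block), block,
           (x, y, cp.int32(rows), cp.int32(cols)))
    return y
```

Swapping the roles of `r` and `c` would still give correct results, but adjacent threads would then read addresses `cols` floats apart.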
Get a first taste of fusion by creating a fused exp-div kernel!
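To illustrate what fusion buys, the sketch below assumes "exp-div" means computing `y = exp(x) / d` element-wise with `d` the same shape as `x`; that interpretation, and the names `exp_div` and `array_passes`, are assumptions. Doing both operations in one kernel avoids writing the intermediate `exp(x)` to global memory and reading it back.

```python
EXP_DIV_SRC = r"""
extern "C" __global__
void exp_div(const float* x, const float* d, float* y, int n) {
    // exp and division happen in registers; no temporary array is stored.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = expf(x[i]) / d[i];
}
"""

def array_passes(fused):
    """Count full-array global-memory reads and writes for y = exp(x) / d."""
    if fused:
        return 3  # read x, read d, write y
    return 5      # exp: read x, write tmp; div: read tmp, read d, write y

def exp_div(x, d, block=256):
    """Fused exp(x) / d on the GPU (requires CuPy and a GPU)."""
    import cupy as cp  # lazy import: only this function needs a GPU

    x = cp.ascontiguousarray(x, dtype=cp.float32)
    d = cp.ascontiguousarray(d, dtype=cp.float32)  # assumed same shape as x
    y = cp.empty_like(x)
    grid = ((x.size + block - 1) // block,)
    kernel = cp.RawKernel(EXP_DIV_SRC, "exp_div")
    kernel(grid, (block,), (x, d, y, cp.int32(x.size)))
    return y
```

For a memory-bound operation like this, going from five array passes to three is where most of the expected benefit would come from.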