
gpu-puzzles's Introduction

GPU Puzzles

GPU architectures are critical to machine learning, and they seem to become more important every day. However, you can be an expert in machine learning without ever touching GPU code, and it is hard to gain intuition when you only ever work through abstractions.

This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of providing text with concepts, it throws you right into coding and building GPU kernels. The exercises use NUMBA, which maps Python code directly to CUDA kernels. It looks like Python, but it is basically identical to writing low-level CUDA code. In a few hours, I think you can go from the basics to understanding the real algorithms that power 99% of deep learning today. If you do want to read the manual, it is here:

NUMBA CUDA Guide

I recommend doing these in Colab, as it is easy to get started. Be sure to make your own copy, turn on GPU mode in the settings (Runtime / Change runtime type, then set Hardware accelerator to GPU), and then get to coding.

Open In Colab

(If you are into this style of puzzle, also check out my Tensor Puzzles for PyTorch.)

Walkthrough Guide

!pip install -qqq git+https://github.com/danoneata/chalk@srush-patch-1
!wget -q https://github.com/srush/GPU-Puzzles/raw/main/robot.png https://github.com/srush/GPU-Puzzles/raw/main/lib.py
import numba
import numpy as np
import warnings
from lib import CudaProblem, Coord
warnings.filterwarnings(
    action="ignore", category=numba.NumbaPerformanceWarning, module="numba"
)

Puzzle 1: Map

Implement a "kernel" (GPU function) that adds 10 to each position of vector a and stores it in vector out. You have 1 thread per position.

Warning: This code looks like Python, but it is really CUDA! You cannot use standard Python tools like list comprehensions or ask for NumPy properties like shape or size (if you need the size, it is given as an argument). The puzzles only require simple operations: basically +, *, simple array indexing, for loops, and if statements. You are allowed to use local variables. If you get an error, it is probably because you did something fancy :).

Tip: Think of the function call as being run once for each thread. The only difference is that cuda.threadIdx.x changes each time.

def map_spec(a):
    return a + 10


def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 line)

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()
# Map
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]
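
If you want to check your work, here is one possible fill (spoiler!). This is an editor's sketch, not the official solution; with one thread per position, each thread handles exactly one index:

def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # Each thread writes its own slot.
        out[local_i] = a[local_i] + 10

    return call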

Puzzle 2 - Zip

Implement a kernel that adds together each position of a and b and stores it in out. You have 1 thread per position.

def zip_spec(a, b):
    return a + b


def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 line)

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()
# Zip
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [0 2 4 6]
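
A possible fill (spoiler; editor's sketch, not the official solution): each thread adds one pair of elements.

def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # One thread per position: add the matching elements.
        out[local_i] = a[local_i] + b[local_i]

    return call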

Puzzle 3 - Guards

Implement a kernel that adds 10 to each position of a and stores it in out. You have more threads than positions.

def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()
# Guard
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]
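
A possible fill (spoiler; editor's sketch). The new idea is the guard: with 8 threads but only 4 positions, the extra threads must not write out of bounds.

def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # Guard: threads with local_i >= size do nothing.
        if local_i < size:
            out[local_i] = a[local_i] + 10

    return call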

Puzzle 4 - Map 2D

Implement a kernel that adds 10 to each position of a and stores it in out. Input a is 2D and square. You have more threads than positions.

def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()
# Map 2D
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[10 11]
 [12 13]]
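
A possible fill (spoiler; editor's sketch), guarding both axes since the 3x3 block is larger than the 2x2 data:

def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # Guard both axes before touching memory.
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10

    return call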

Puzzle 5 - Broadcast

Implement a kernel that adds a and b and stores it in out. Inputs a and b are vectors. You have more threads than positions.

def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()
# Broadcast
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[0 1]
 [1 2]]
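
A possible fill (spoiler; editor's sketch). The broadcast comes down to indexing the size-1 axis of each input with 0:

def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # a is a (size, 1) column; b is a (1, size) row.
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

    return call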

Puzzle 6 - Blocks

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

Tip: A block is a group of threads. The number of threads per block is limited, but we can have many different blocks. Variable cuda.blockIdx tells us what block we are in.

def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()
# Blocks
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [10 11 12 13 14 15 16 17 18]
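
A possible fill (spoiler; editor's sketch): combine the block and thread indices into one global index, then guard as before, since 3 blocks of 4 threads cover 9 positions with 3 to spare.

def map_block_test(cuda):
    def call(out, a, size) -> None:
        # Global position = block offset + position within the block.
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i < size:
            out[i] = a[i] + 10

    return call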

Puzzle 7 - Blocks 2D

Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions.

def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)

    return call


SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))

problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()
# Blocks 2D
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
Spec : [[11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]
 [11. 11. 11. 11. 11.]]
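
A possible fill (spoiler; editor's sketch), computing a global index for both axes:

def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        # Same guard as 1D, applied to both global indices.
        if i < size and j < size:
            out[i, j] = a[i, j] + 10

    return call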

Puzzle 8 - Shared

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

Warning: Each block can only have a constant amount of shared memory that threads in that block can read and write to. This needs to be a literal Python constant, not a variable. After writing to shared memory you need to call cuda.syncthreads to ensure that threads do not cross.

(This example does not really need shared memory or syncthreads, but it is a demo.)

TPB = 4
def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()
# Shared
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             0 |             0 |             1 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [11. 11. 11. 11. 11. 11. 11. 11.]
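
A possible fill (spoiler; editor's sketch): read the staged value back out of shared memory, keeping 1 global read and 1 global write per thread.

TPB = 4
def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # Fill: the value is already staged in shared memory.
        if i < size:
            out[i] = shared[local_i] + 10

    return call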

Puzzle 9 - Pooling

Implement a kernel that sums together the last 3 positions of a and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Tip: Remember to be careful about syncing.

def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out


TPB = 8
def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()
# Pooling
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [ 0.  1.  3.  6.  9. 12. 15. 18.]
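
A possible fill (spoiler; editor's sketch): stage one value per thread in shared memory, sync, then sum up to three staged values. Note the syncthreads is placed outside the guard so that every thread in the block reaches it.

TPB = 8
def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # 1 global read per thread.
        if i < size:
            shared[local_i] = a[i]
        cuda.syncthreads()
        # Sum this position plus up to two earlier ones, all from shared.
        if i < size:
            total = shared[local_i]
            if local_i >= 1:
                total += shared[local_i - 1]
            if local_i >= 2:
                total += shared[local_i - 2]
            # 1 global write per thread.
            out[i] = total

    return call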

Puzzle 10 - Dot Product

Implement a kernel that computes the dot-product of a and b and stores it in out. You have 1 thread per position. You only need 2 global reads and 1 global write per thread.

Note: For this problem you don't need to worry about the number of shared reads. We will handle that challenge later.

def dot_spec(a, b):
    return a @ b

TPB = 8
def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)
    return call


SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()
# Dot
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0.]
Spec : 140
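
A possible fill (spoiler; editor's sketch): each thread writes one product into shared memory, then a single thread reduces them. The serial reduction costs many shared reads, which the note above says not to worry about yet.

TPB = 8
def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # 2 global reads per thread: one product into shared memory.
        if i < size:
            shared[local_i] = a[i] * b[i]
        cuda.syncthreads()
        # Thread 0 reduces the staged products (1 global write).
        if local_i == 0:
            total = 0.0
            for j in range(size):
                total += shared[j]
            out[0] = total

    return call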

Puzzle 11 - 1D Convolution

Implement a kernel that computes a 1D convolution between a and b and stores it in out. You need to handle the general case. You only need 2 global reads and 1 global write per thread.

def conv_spec(a, b):
    out = np.zeros(*a.shape)
    len = b.shape[0]
    for i in range(a.shape[0]):
        out[i] = sum([a[i + j] * b[j] for j in range(len) if i + j < a.shape[0]])
    return out


MAX_CONV = 4
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV
def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        # FILL ME IN (roughly 17 lines)

    return call


# Test 1

SIZE = 6
CONV = 3
out = np.zeros(SIZE)
a = np.arange(SIZE)
b = np.arange(CONV)
problem = CudaProblem(
    "1D Conv (Simple)",
    conv_test,
    [a, b],
    out,
    [SIZE, CONV],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()
# 1D Conv (Simple)
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0.]
Spec : [ 5.  8. 11. 14.  5.  0.]

Test 2

out = np.zeros(15)
a = np.arange(15)
b = np.arange(4)
problem = CudaProblem(
    "1D Conv (Full)",
    conv_test,
    [a, b],
    out,
    [15, 4],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()
# 1D Conv (Full)
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
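
A possible fill (spoiler; editor's sketch). It stages this block's slice of a, a small halo from the next block, and all of b in shared memory, so each thread does at most 2 global reads. It assumes b_size <= MAX_CONV and TPB >= 2 * b_size - 1, which both tests satisfy:

MAX_CONV = 4
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV
def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        a_shared = cuda.shared.array(TPB_MAX_CONV, numba.float32)
        b_shared = cuda.shared.array(MAX_CONV, numba.float32)
        # Read 1: every thread stages its own element of a.
        if i < a_size:
            a_shared[local_i] = a[i]
        # Read 2: low threads stage b; the next b_size - 1 threads stage
        # the halo of a needed at the right edge of this block.
        if local_i < b_size:
            b_shared[local_i] = b[local_i]
        else:
            halo = local_i - b_size
            if halo < b_size - 1 and i - b_size + TPB < a_size:
                a_shared[TPB + halo] = a[i - b_size + TPB]
        cuda.syncthreads()
        # Accumulate partial products, guarded at the right edge of a.
        acc = 0.0
        for k in range(b_size):
            if i + k < a_size:
                acc += a_shared[local_i + k] * b_shared[k]
        if i < a_size:
            out[i] = acc

    return call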

Puzzle 12 - Prefix Sum

Implement a kernel that computes a sum over a and stores it in out. If the size of a is greater than the block size, only store the sum of each block.

We will do this using the parallel prefix sum algorithm in shared memory. That is, each step of the algorithm should sum together half the remaining numbers. Follow the reduction-tree diagram from the original notebook (not reproduced here).

TPB = 8
def sum_spec(a):
    out = np.zeros((a.shape[0] + TPB - 1) // TPB)
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[j] = a[i : i + TPB].sum()
    return out


def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 12 lines)

    return call


# Test 1

SIZE = 8
out = np.zeros(1)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Simple)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
# Sum (Simple)
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0.]
Spec : [28.]

Test 2

SIZE = 15
out = np.zeros(2)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Full)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
# Sum (Full)
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [0. 0.]
Spec : [28. 77.]
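
A possible fill (spoiler; editor's sketch). Zero-filling the cache matters for Test 2, where the last block is only half full; without it, stale shared values leak into the sum (compare the issue reported further down):

TPB = 8
def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # Zero-fill so partially full blocks have no stale values.
        if i < size:
            cache[local_i] = a[i]
        else:
            cache[local_i] = 0
        cuda.syncthreads()
        # Tree reduction: the stride p doubles each step.
        p = 1
        while p < TPB:
            if local_i % (2 * p) == 0:
                cache[local_i] += cache[local_i + p]
            cuda.syncthreads()
            p *= 2
        if local_i == 0:
            out[cuda.blockIdx.x] = cache[0]

    return call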

Puzzle 13 - Axis Sum

Implement a kernel that computes a sum over each column of a and stores it in out.

TPB = 8
def sum_spec(a):
    out = np.zeros((a.shape[0], (a.shape[1] + TPB - 1) // TPB))
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[..., j] = a[..., i : i + TPB].sum(-1)
    return out


def axis_sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        batch = cuda.blockIdx.y
        # FILL ME IN (roughly 12 lines)

    return call


BATCH = 4
SIZE = 6
out = np.zeros((BATCH, 1))
inp = np.arange(BATCH * SIZE).reshape((BATCH, SIZE))
problem = CudaProblem(
    "Axis Sum",
    axis_sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, BATCH),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
# Axis Sum
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [[0.]
 [0.]
 [0.]
 [0.]]
Spec : [[ 15.]
 [ 51.]
 [ 87.]
 [123.]]
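
A possible fill (spoiler; editor's sketch): the same tree reduction as Puzzle 12, with cuda.blockIdx.y selecting the row to reduce.

def axis_sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        batch = cuda.blockIdx.y
        # Each block along y reduces one row of a.
        if i < size:
            cache[local_i] = a[batch, i]
        else:
            cache[local_i] = 0
        cuda.syncthreads()
        p = 1
        while p < TPB:
            if local_i % (2 * p) == 0:
                cache[local_i] += cache[local_i + p]
            cuda.syncthreads()
            p *= 2
        if local_i == 0:
            out[batch, 0] = cache[0]

    return call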

Puzzle 14 - Matrix Multiply!

Implement a kernel that multiplies square matrices a and b and stores the result in out.

Tip: The most efficient algorithm here will copy a block into shared memory before computing each of the individual row-column dot products. This is easy to do if the matrix fits in shared memory. Do that case first. Then update your code to compute a partial dot-product and iteratively move the part you copied into shared memory. You should be able to do the hard case in 6 global reads.

def matmul_spec(a, b):
    return a @ b


TPB = 3
def mm_oneblock_test(cuda):
    def call(out, a, b, size: int) -> None:
        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 14 lines)

    return call

# Test 1

SIZE = 2
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T

problem = CudaProblem(
    "Matmul (Simple)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)
# Matmul (Simple)
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[ 1  3]
 [ 3 13]]

Test 2

SIZE = 8
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T

problem = CudaProblem(
    "Matmul (Full)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(3, 3),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)
# Matmul (Full)
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             0 |             0 |             0 |             0 | 


problem.check()
Failed Tests.
Yours: [[0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]
Spec : [[  140   364   588   812  1036  1260  1484  1708]
 [  364  1100  1836  2572  3308  4044  4780  5516]
 [  588  1836  3084  4332  5580  6828  8076  9324]
 [  812  2572  4332  6092  7852  9612 11372 13132]
 [ 1036  3308  5580  7852 10124 12396 14668 16940]
 [ 1260  4044  6828  9612 12396 15180 17964 20748]
 [ 1484  4780  8076 11372 14668 17964 21260 24556]
 [ 1708  5516  9324 13132 16940 20748 24556 28364]]
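
A possible fill (spoiler; editor's sketch) that handles the hard case: walk the shared dimension one TPB-wide tile at a time, staging one tile of a and one of b per step. With SIZE = 8 and TPB = 3 that is 3 tiles, i.e. 6 global reads per thread, matching the tip above.

TPB = 3
def mm_oneblock_test(cuda):
    def call(out, a, b, size: int) -> None:
        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        acc = 0.0
        for k0 in range(0, size, TPB):
            # Stage one TPB x TPB tile of a and of b, with edge guards.
            if i < size and k0 + local_j < size:
                a_shared[local_i, local_j] = a[i, k0 + local_j]
            if k0 + local_i < size and j < size:
                b_shared[local_i, local_j] = b[k0 + local_i, j]
            cuda.syncthreads()
            # Partial dot product over this tile.
            if i < size and j < size:
                for k in range(min(TPB, size - k0)):
                    acc += a_shared[local_i, k] * b_shared[k, local_j]
            # Don't overwrite tiles other threads are still reading.
            cuda.syncthreads()
        if i < size and j < size:
            out[i, j] = acc

    return call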

gpu-puzzles's People

Contributors: jcupitt, r8bhavneet, srush


gpu-puzzles's Issues

Want more!!!

Hi srush, I am so grateful for this resource for learning CUDA!

Could we get more (especially some related to real-life cases)? ❤️❤️❤️

Issues with Puzzle 12

Hey, thanks for the great set of puzzles!
I was stuck on Puzzle 12 for a while and decided to look up the solution in your explanation video. I even tried copying your code verbatim, but it seems to fail for some reason, and I don't get why. This is the code I copied:

def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        if i < size:
            cache[local_i] = a[i]
            cuda.syncthreads()
            for k in range(3):
                p = 2**k
                if local_i % (p * 2) == 0:
                    cache[local_i] = cache[local_i] + cache[local_i + p]
                cuda.syncthreads()
            if local_i == 0:
                out[cuda.blockIdx.x] = cache[local_i]
    return call

The diagram looks fine to me (screenshot omitted). However, the test case fails:

Failed Tests.
Yours: [28. 88.]
Spec : [28. 77.]

Any clues what might be going wrong?

AttributeError: 'ScalarHistory' object has no attribute 'location'

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-52-025a9cd45d7a>](https://localhost:8080/#) in <module>()
     12     spec=conv_spec,
     13 )
---> 14 problem.show()

1 frames
[/content/lib.py](https://localhost:8080/#) in show(self, sparse)
    389     def show(self, sparse=False):
    390         results = self.run_python()
--> 391         self.score(results)
    392         return draw_results(results, self.name,
    393                             self.threadsperblock.x, self.threadsperblock.y, sparse)

[/content/lib.py](https://localhost:8080/#) in score(self, results)
    373                         count["shared_writes"] += 1
    374                     for ins in inc[1].inputs:
--> 375                         if ins.location[0].startswith("S"):
    376                             count["shared_reads"] += 1
    377                         else:

AttributeError: 'ScalarHistory' object has no attribute 'location'

problem.check() works (passes tests); observed for problems 10 and 11.

Later edit: it's probably catching a real bug, because I have fewer lines than your estimate, but I couldn't figure it out. Adding my solution here in case it helps.

================================================================

# 10) dot product
# It passes tests, but the plot doesn't work and it doesn't have 9 lines: I don't think it's right.
def call(out, a, b, size) -> None:
    a_shared = cuda.shared.array(TPB, numba.float32)
    b_shared = cuda.shared.array(TPB, numba.float32)

    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    local_i = cuda.threadIdx.x
    # FILL ME IN (roughly 9 lines)
    if i < size:
        a_shared[local_i] = a[i]
        b_shared[local_i] = b[i]
    cuda.syncthreads()
    if i == 0:
        for j in range(size):
            out[0] += a_shared[j] * b_shared[j]

answers

Hi! Thanks for the puzzles! There's a Triton reading group at EleutherAI, so each of us went through the GPU Puzzles. For this, would you be fine with, say, a repo of the answers to these puzzles? Happy to send over the notebook via DM!

Puzzle 13 - Incorrect Problem Statement

The current statement for problem 13 reads:
"Implement a kernel that computes a sum over each row of a and stores it in out."

It should read:
"Implement a kernel that computes a sum over each column of a and stores it in out."

This is evident from the fact that out is of length 4, and only solutions which sum the columns are said to be correct.

Any tips on using 1 global read for dot product (puzzle 10)?

The question stated that

You only need 1 global read and 1 global write per thread.

However, loading both a[i] and b[i] into the shared memory costs 2 global reads for each thread.
May I know if there are any hints on satisfying the puzzle's requirement?

How does the solution for 1D-conv work?

For the 1D conv, there's the solution's accumulation and my accumulation:

            # official solution
            # for k in range(b_size):
            #     if i + k < a_size:
            #         acc += sh_a[local_i + k] * sh_b[k]

            # mine
            for k in range(0, min(a_size - i, b_size)):
                acc += sh_a[local_i + k] * sh_b[k]

However, my accumulation loop fails on the last 3 threads in the first block for the 2nd test case.
As far as I can tell, the two loops should be functionally identical. Any thoughts on what might be going on?

Yours: [14. 20. 26. 32. 38. 20.  7.  0. 62. 68. 74. 80. 41. 14.  0.]
Spec : [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]

NameError: name 'unit_y' is not defined

After running problem.show() in my local environment:

NameError                                 Traceback (most recent call last)
/tmp/ipykernel_1818177/1878365812.py in <module>
      5     "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
      6 )
----> 7 problem.show()

~/code/mycode/GPU-Puzzles/lib.py in show(self, sparse)
    389         self.score(results)
    390         return draw_results(results, self.name,
--> 391                             self.threadsperblock.x, self.threadsperblock.y, sparse)
    392 
    393     def check(self):

~/code/mycode/GPU-Puzzles/lib.py in draw_results(results, name, tpbx, tpby, sparse)
    256     blocks = []
    257     locations = []
--> 258     base = draw_base(*results[Coord(0, 0)][Coord(0, 0)])
    259     for block, inner in results.items():
    260         dia = base

~/code/mycode/GPU-Puzzles/lib.py in draw_base(_, a, c, out)
    227 
    228 def draw_base(_, a, c, out):
--> 229     inputs = vcat([draw_table(d) for d in a], 2.0).center_xy()
    230     shared_tables = [[draw_table(c2.refs[i]) for i in range(1, c.rounds())] for c2 in c.caches]
    231     shareds = grid(shared_tables, 1.0).center_xy()

~/code/mycode/GPU-Puzzles/lib.py in <listcomp>(.0)
    227 
    228 def draw_base(_, a, c, out):
--> 229     inputs = vcat([draw_table(d) for d in a], 2.0).center_xy()
    230     shared_tables = [[draw_table(c2.refs[i]) for i in range(1, c.rounds())] for c2 in c.caches]
    231     shareds = grid(shared_tables, 1.0).center_xy()

~/code/mycode/GPU-Puzzles/lib.py in draw_table(tab)
    211         tab = table(tab.name, *tab.size)
    212     tab = tab.line_width(0.05)
--> 213     return tab.beside((t + vstrut(0.5)), -unit_y)
    214 
    215 

NameError: name 'unit_y' is not defined

And problem.check() raises no error, which is nice.

[Puzzle 12, Test2] Question regarding shared memory

Hi,

For this puzzle, it seems that shared CUDA arrays are lazily initialized. Since SIZE is 15 and TPB is 8, the last element of Block 1 is not overwritten; the value gets carried over from Block 0, even after calling cuda.syncthreads(). Is this normal behavior or actually a leak?

My original implementation assumed that the shared array would all be initialized to 0, so it was safe to perform the sum over the entire block. Given the above findings, there are two options for the unaligned block: 1) fill the rest with 0, or 2) check the index bounds for the sum ops ((i + step) < size). While both pass the test case, the latter gives a weird dependency graph for Block 1 (screenshot omitted), which raises my concern.

1-d convolution question, accessing values out of block

I think 2 global reads is not enough. If a_size is larger than TPB, then values outside of what is assigned to the block must be read in order to calculate the convolution at the last indices of the block. The number of values to read globally is then, in the worst case, TPB + MAX_CONV + (MAX_CONV - 1), which is more than twice TPB.

Puzzle 12 log_2 implementation?

I want a log_2() function. Any tips on how I can implement this with Numba? It looks like even the natural log doesn't run.

E.g., int(math.log(size)) returns:

LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
No definition for lowering <built-in function log>(int64,) -> float64

File "<ipython-input-235-5ed9de693fee>", line 9:
    def call(out, a, size: int) -> None:
        <source elided>

        num_steps = int(math.log(size))
        ^

During: lowering "$58call_method.25 = call $54load_method.23(size, func=$54load_method.23, args=[Var(size, <ipython-input-235-5ed9de693fee>:4)], kws=(), vararg=None)" at <ipython-input-235-5ed9de693fee> (9)

Update: I just hard coded the log_2 results for now.

SYCL version?

Thanks for doing this - any hints on how one could do this using SYCL kernels instead of CUDA?

More general solution for Puzzle 9 (pooling)?

My solution to Puzzle 9 is very specific to pooling kernels of size 3.

(spoilers ahead for my solutions)

if i < size:
  shared[local_i] = a[local_i]
  cuda.syncthreads()

  if (local_i - 2) >= 0:
    shared[local_i] = shared[local_i] + shared[local_i -1] + shared[local_i -2]
  elif (local_i - 1) >= 0:
    shared[local_i] = shared[local_i] + shared[local_i -1]
  cuda.syncthreads()

if i < size:
  out[local_i] = shared[local_i]

My other attempt, which tried to be more generic in the pooling kernel size, seems to double-count something:

if i < size:
  shared[local_i] = a[local_i]
  cuda.syncthreads()

  for j in [-2, -1]:
    if (local_i + j) >= 0:
      shared[local_i] = shared[local_i] + shared[local_i + j]
  cuda.syncthreads()

if i < size:
  out[local_i] = shared[local_i]

I figured something weird happens when it tries to do += twice on the same shared[local_i], i.e. that it might be double-counting shared[local_i]. However, it fails the test in a way that doesn't seem to align with this:
Yours: [ 0. 1. 3. 6. 10. 14. 18. 22.]
Spec : [ 0. 1. 3. 6. 9. 12. 15. 18.]

Looking for any hints on a more general approach, or on what is wrong with my solutions.

Puzzle 12 (test 2) and 13 puzzle.show() Display bug

My solution to 13 passes but throws this error with the show() function:

[/content/lib.py](https://localhost:8080/#) in myconnect(diagram, loc, color, con, name1, name2)
    186     bb2 = diagram.get_subdiagram_envelope(name2)
    187     assert bb1 is not None, name1
--> 188     assert bb2 is not None, name2
    189     off = P2(loc[0] - 0.5, loc[1] - 0.5) * 0.85
    190     dia = empty()

AssertionError: ('S0', 0)

Here is my solution, sorry if I shouldn't be posting it!

def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 12 lines)

        # hard coding in log_2 results needed.
        if TPB < 16:
            num_steps = 3
        elif TPB == 16:
            num_steps = 4

        print("number of steps", num_steps)

        if i < size:
            cache[local_i] = a[i]
            cuda.syncthreads()

            for j in range(num_steps):
                filter = 2 ** (j + 1)
                jump = 2 ** j
                print("filter and jump", filter, jump)
                if i % filter == 0:
                    print(local_i)
                    # can do the addition operation here
                    cache[local_i] = cache[local_i] + cache[local_i + jump]
                cuda.syncthreads()
        out[cuda.blockIdx.x] = cache[0]

    return call

Hint for Q11: 1D Convolution

Here is my current solution, which passes the first test within the required constraints (2 global reads and 1 global write):

MAX_CONV = 4
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV
def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        shared = cuda.shared.array(TPB_MAX_CONV, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        # FILL ME IN (roughly 17 lines)
        if i < a_size:
            shared[local_i] = a[i]
            if i < b_size:
                shared[TPB + local_i] = b[i]
            cuda.syncthreads()

            # compute conv
            s = 0
            for j in range(b_size):
                if local_i + j < a_size:
                    s += shared[local_i + j] * shared[j + TPB]
            out[i] = s

    return call

I am stuck on the second test, where the number of blocks increases. Is there a way to do this with the current shared memory size (TPB_MAX_CONV)?

Puzzle 11 issue creating shared array

My code does this:
a_s = cuda.shared.array(a_size, numba.float32)

And I get this error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function shared.array at 0x7f48d83f9050>) found for signature:
 
 >>> array(int64, class(float32))
 
There are 2 candidate implementations:
      - Of which 2 did not match due to:
      Overload of function 'array': File: numba/cuda/cudadecl.py: Line 44.
        With argument(s): '(int64, class(float32))':
       No match.

During: resolving callee type: Function(<function shared.array at 0x7f48d83f9050>)
During: typing of call at <ipython-input-150-4bdb1f01d70a> (11)


File "<ipython-input-150-4bdb1f01d70a>", line 11:
    def call(out, a, b, a_size, b_size) -> None:
        <source elided>
        #print(type(a_size), type(SIZE))
        a_s = cuda.shared.array(a_size, numba.float32)

If I replace a_size with the global variable SIZE, then things run correctly, but this fails to generalize to the later tests, where a_size changes.

Why would puzzle 14 - Matrix Multiply need shared memory?

Here is my solution to puzzle 14

        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 14 lines)
        val = 0
        if i < size and j < size:
            for k in range(size):
                val += a[i, k] * b[k, j]
            out[i, j] = val

I find that I can pass the two tests, but I haven't used the shared memory. I feel that my solution is not correct, but I don't know why shared memory is needed here. Thank you for your answer!

Inaccurate test cases or problem statement for Puzzle 7, 8

The test cases say all values should be 11, but the problem statements say:

Puzzle 7 "Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions."
(I'm assuming this means the same as Puzzle 6, which is to add 10 to each position.)
Puzzle 8 "Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a."

BTW I'm having fun and learning, thanks for this :)

Question on `syncthreads` in Puzzle 14

Here's my solution to Puzzle 14 (Matrix Multiply!), which passes both checks:
TPB = 3
def mm_oneblock_test(cuda):
    def call(out, a, b, size: int) -> None:
        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 14 lines)
        t = 0
        BLOCKS = size // TPB + (size % TPB != 0)
        for it in range(BLOCKS):
            #if it != 0:
            #    cuda.syncthreads()
            pi = it*cuda.blockDim.x+cuda.threadIdx.x
            pj = it*cuda.blockDim.y+cuda.threadIdx.y
            if i < size and pj < size:
                a_shared[local_i,local_j] = a[i,pj]
            if pi < size and j < size:
                b_shared[local_i,local_j] = b[pi,j]
            cuda.syncthreads()
            if i < size and j < size:
                for k in range(min(TPB, size-it*TPB)):
                    t += a_shared[local_i,k] * b_shared[k,local_j]
        if i < size and j < size:
            out[i,j] = t

    return call

I notice that it passes regardless of whether I do a syncthreads between reading the shared memory and modifying it for the next iteration. Why does the absence of that sync not cause any issue?

Also, any suggestions for improvement are welcome!

error: metadata-generation-failed

In Colab, the command to install the first package shows the following error. I have tried several solutions, but none of them worked; even restarting didn't solve the issue. Do you have any ideas?

Python 3.10.11
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/danoneata/chalk@srush-patch-1
  Cloning https://github.com/danoneata/chalk (to revision srush-patch-1) to /tmp/pip-req-build-onupcd9d
  Running command git clone --filter=blob:none --quiet https://github.com/danoneata/chalk /tmp/pip-req-build-onupcd9d
  Running command git checkout -b srush-patch-1 --track origin/srush-patch-1
  Switched to a new branch 'srush-patch-1'
  Branch 'srush-patch-1' set up to track remote branch 'srush-patch-1' from 'origin'.
  Resolved https://github.com/danoneata/chalk to commit fe0e7057192c6420a99b4229ad38aa405b219537
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: toolz in /usr/local/lib/python3.10/dist-packages (from chalk-diagrams==0.1.2) (0.12.0)
Collecting colour (from chalk-diagrams==0.1.2)
  Using cached colour-0.1.5-py2.py3-none-any.whl (23 kB)
Collecting svgwrite (from chalk-diagrams==0.1.2)
  Using cached svgwrite-1.4.3-py3-none-any.whl (67 kB)
Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from chalk-diagrams==0.1.2) (8.4.0)
Collecting loguru (from chalk-diagrams==0.1.2)
  Using cached loguru-0.7.0-py3-none-any.whl (59 kB)
Collecting planar (from chalk-diagrams==0.1.2)
  Using cached planar-0.4.zip (622 kB)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Puzzle 1

out[local_i] = map_spec(a[local_i])
out[local_i] = a[local_i] + 10

The first command gets an error:
Untyped global name 'map_spec': cannot determine Numba type of <class 'function'>

The second one succeeds.

Did I miss some basic concept?

Diagram seems wrong here.

I have written this code for the "pooling" task:

TPB = 8
def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)
        shared[local_i] = a[i]
        cuda.syncthreads()
        s = shared[local_i]
        if i > 0:
          s += shared[local_i - 1]
        if i > 1:
          s += shared[local_i - 2]
        out[i] = s
    return call

But the diagram and read/write stats are wrong (screenshot omitted).

Puzzle 10 minimum global reads?

Puzzle 10 (Dot Product) states that "You have 1 thread per position. You only need 1 global read and 1 global write per thread".

I am rather confused by this hint. This problem has two inputs, a and b, of the same size. Merely reading the elements of these two inputs in each thread counts as TWO global reads already, which exceeds the expectation of 1 global read.

Is the expectation supposed to be 2 global reads or am I missing some kind of trick? Thank you!

How to transfer this code to PyTorch?

Hi. Thank you for sharing such amazing tutorials. I have learned a lot from your Tensor Puzzles. : )
I have the following questions.

I am new to GPU programming (things like Numba) and have no idea how those frameworks connect together. I am wondering how to transfer code written by following your tutorials to PyTorch? It is very common in my workflow to want to write an efficient algorithm for a neural network.

In current deep learning research, where large-scale models emerge rapidly, efficiency has become the major bottleneck. I think being able to write low-level code is a must-have skill for DL researchers developing modern neural networks. Would you like to share some additional resources (e.g., tutorials, courses) related to that? If you think such resources are important, I am willing to collect them and submit a PR to your repo to benefit more DL researchers.

Thanks again for your selflessness.

Best,
Zhangzhi

Error when running notebook

After running through the lines of code as intended, the images do not show correctly when running the notebook (screenshot omitted).
