Kernel Vector Addition
The kernel function, vector_add, is executed in parallel by many threads. The goal is for one thread to handle one element of the vector.
Grid, blocks, threads
A grid contains blocks, and each block contains threads.
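As a quick illustration (this toy program is not part of the solution; the kernel name and launch shape are made up for demonstration), each thread can print where it sits in the hierarchy:

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: every thread reports its position in the hierarchy.
__global__ void show_hierarchy() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // A grid of 2 blocks, each containing 4 threads -> 8 threads total.
    show_hierarchy<<<2, 4>>>();
    cudaDeviceSynchronize(); // wait for the kernel so the printf output appears
    return 0;
}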
Global Index
Each thread in the kernel has a unique global index. This index tells the thread which specific element of the vector it should process.
The global index can be calculated with built-in CUDA variables:
threadIdx.x (thread's index within its block)
blockIdx.x (block's index within the grid)
blockDim.x (# of threads per block)

idx = (blockIdx.x * blockDim.x) + threadIdx.x

We multiply blockIdx.x by blockDim.x because thread 0 of block X is offset by the X * blockDim.x threads preceding it.
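As a quick worked check (numbers chosen purely for illustration): with blockDim.x = 256, the thread with blockIdx.x = 2 and threadIdx.x = 5 computes idx = (2 * 256) + 5 = 517, i.e. it handles element 517 of the vector.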
Integer division
Often, more threads are launched than there are elements (N), because the number of blocks must be rounded up to ensure all elements are covered.
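For a concrete sanity check (illustrative numbers; the rounding formula appears in solve below): with N = 1000 elements and 256 threads per block, blocksPerGrid = (1000 + 256 - 1) / 256 = 1255 / 256 = 4 under integer division, so 4 * 256 = 1024 threads are launched to cover 1000 elements.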
e.g. in the illustration above, 1024 threads are launched for N = 1000 elements, so the kernel must make the 24 extra threads with idx >= N exit immediately, before they can read or write out of bounds.
if (idx >= N) {
    return;
}

Solving

We start from the provided template:
#include <cuda_runtime.h>
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
}
// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int N) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();
}

The cudaDeviceSynchronize() call makes the host wait until the kernel has finished before solve returns. We need to write the vector_add function:
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    // 1. Calculate the global index (idx)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // 2. Perform the bounds check
    if (idx >= N) {
        return;
    }
    // 3. Execute the addition
    C[idx] = A[idx] + B[idx];
}

Spinning up NVIDIA TESLA T4...
Running...
A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]
N = 4
✓ Test passed!
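For reference, here is a minimal host-side driver (hypothetical; it is not part of the judge harness, and the expected-output comment simply repeats this test's values). It shows why A, B, and C must be device pointers: the inputs are copied to the GPU before solve is called, and the result is copied back afterwards. It can be appended to the solution file above (or linked against it).

#include <cstdio>
#include <cuda_runtime.h>

extern "C" void solve(const float* A, const float* B, float* C, int N);

int main() {
    const int N = 4;
    float hA[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hB[N] = {5.0f, 6.0f, 7.0f, 8.0f};
    float hC[N] = {0.0f};

    // solve expects device pointers, so allocate GPU memory...
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));

    // ...copy the inputs over, run, and copy the result back.
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);
    solve(dA, dB, dC, N);
    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%.1f ", hC[i]); // expected: 6.0 8.0 10.0 12.0
    }
    printf("\n");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}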