Kernel Vector Addition
The kernel function, vector_add, is executed in parallel by many threads. The goal is for one thread to handle one element of the vector.
Grid, blocks, threads
A grid contains blocks, and each block contains threads.
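As a quick illustration (this toy program is not part of the solution; the kernel name and launch shape are made up for demonstration), each thread can print where it sits in the hierarchy:

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: every thread reports its position in the hierarchy.
__global__ void show_hierarchy() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // A grid of 2 blocks, each containing 4 threads -> 8 threads total.
    show_hierarchy<<<2, 4>>>();
    cudaDeviceSynchronize(); // wait for the kernel so the printf output appears
    return 0;
}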
Global Index
Each thread in the kernel has a unique global index. This index tells the thread which specific element of the vector it should process.
The global index can be calculated with built-in CUDA variables:
threadIdx.x (thread's index within its block)
blockIdx.x (block's index within the grid)
blockDim.x (# of threads per block)

idx = (blockIdx.x * blockDim.x) + threadIdx.x

We multiply blockIdx.x by blockDim.x because thread 0 of block X is offset by the X * blockDim.x threads preceding it.
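As a quick worked check (numbers chosen purely for illustration): with blockDim.x = 256, the thread with blockIdx.x = 2 and threadIdx.x = 5 computes idx = (2 * 256) + 5 = 517, i.e. it handles element 517 of the vector.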
Integer division
Often, more threads are launched than there are elements (N), because the number of blocks must be rounded up to ensure all elements are covered.
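For a concrete sanity check (illustrative numbers; the rounding formula appears in solve below): with N = 1000 elements and 256 threads per block, blocksPerGrid = (1000 + 256 - 1) / 256 = 1255 / 256 = 4 under integer division, so 4 * 256 = 1024 threads are launched to cover 1000 elements.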
e.g. in the illustration above, 1024 threads are launched for N = 1000 elements, so the kernel must make the 24 extra threads with idx >= N exit immediately, before they can read or write out of bounds.
if (idx >= N) {
    return;
}

Solving

We start from the provided template:
#include <cuda_runtime.h>
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
}
// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int N) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vector_add<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();
}

The cudaDeviceSynchronize() call makes the host wait until the kernel has finished before solve returns. We need to write the vector_add function:
__global__ void vector_add(const float* A, const float* B, float* C, int N) {
    // 1. Calculate the global index (idx)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // 2. Perform the bounds check
    if (idx >= N) {
        return;
    }
    // 3. Execute the addition
    C[idx] = A[idx] + B[idx];
}

Spinning up NVIDIA TESLA T4...
Running...
A = [1.0, 2.0, 3.0, 4.0]
B = [5.0, 6.0, 7.0, 8.0]
N = 4
✓ Test passed!
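For reference, here is a minimal host-side driver (hypothetical; it is not part of the judge harness, and the expected-output comment simply repeats this test's values). It shows why A, B, and C must be device pointers: the inputs are copied to the GPU before solve is called, and the result is copied back afterwards. It can be appended to the solution file above (or linked against it).

#include <cstdio>
#include <cuda_runtime.h>

extern "C" void solve(const float* A, const float* B, float* C, int N);

int main() {
    const int N = 4;
    float hA[N] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hB[N] = {5.0f, 6.0f, 7.0f, 8.0f};
    float hC[N] = {0.0f};

    // solve expects device pointers, so allocate GPU memory...
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * sizeof(float));
    cudaMalloc(&dB, N * sizeof(float));
    cudaMalloc(&dC, N * sizeof(float));

    // ...copy the inputs over, run, and copy the result back.
    cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);
    solve(dA, dB, dC, N);
    cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%.1f ", hC[i]); // expected: 6.0 8.0 10.0 12.0
    }
    printf("\n");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}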