LeetGPU: CUDA Matrix Multiplication

Matrix Multiplication

Write a program that multiplies two matrices of 32-bit floating point numbers on a GPU. Given matrix A of dimensions MxN and matrix B of dimensions NxK, compute the product matrix C, which will have dimensions MxK. All matrices are stored in row-major format.

__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {

}

// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int M, int N, int K) {
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((K + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (M + threadsPerBlock.y - 1) / threadsPerBlock.y);

    matrix_multiplication_kernel<<>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
}

Global Indexes, revisited

Unlike the previous problem, we have multiple dimensions now. We need two global indexes:

A global row index (i)
A global column index (j)

i = (blockIdx.y * blockDim.y) + threadIdx.y
j = (blockIdx.x * blockDim.x) + threadIdx.x

With these two indicies, each thread is assigned to calculate one unique element $C_{i,j}$ .

Checks for integer division

The bounds check is written as follows:

if (i>=M || j>=K){
  return;
}

Core Matrix Multiplication

Now that a valid thread is assigned to calculate a unique element $C_{i,j}$ , we need to perform the dot product calculation of the i-th row of matrix A and the j-th column of matrix B.

$C_{i,j}=\sum_{k=0}^{N-1}{A_{i,k}\times B_{k,j}}$

To do this calculations, we need a loop that iterates N times with a local accumulator variable to store the dot product sum.

2D to 1D Indexing

Since the matrices are stored in row-major format, we can find the 1D index for an element $X_{r,c}$ in a matrix with $C_{dim}$ columns:

$Index = r \times C_{dim} + c$

Thus, we can calculate the index for $A_{i,k}, B_{k,j}, \text{and } C_{i,j}$ :

$ idx_{A_{i,k}} = iN+k \newline idx_{B_{k,j}} = kK+j \newline idx_{C_{i,j}} = i*K+j \newline $

Solving

__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {

    // 1. Calculate i and j
    int i = (blockIdx.y * blockDim.y) + threadIdx.y;
    int j = (blockIdx.x * blockDim.x) + threadIdx.x;
    // 2. Bounds check
    if (i>=M || j>=K){
      return;
    }
    // 3. Initialize accumulator
    float d = 0.0;
    // 4. Inner loop for dot product
    for (int k=0;k

Zimeng Xiong's Weblog