Zimeng Xiong's Weblog

About

LeetGPU: CUDA Matrix Multiplication

Matrix Multiplication

Write a program that multiplies two matrices of 32-bit floating point numbers on a GPU. Given matrix A of dimensions MxN and matrix B of dimensions NxK, compute the product matrix C, which will have dimensions MxK. All matrices are stored in row-major format.

__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {

}

// A, B, C are device pointers (i.e. pointers to memory on the GPU)
extern "C" void solve(const float* A, const float* B, float* C, int M, int N, int K) {
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((K + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (M + threadsPerBlock.y - 1) / threadsPerBlock.y);

    matrix_multiplication_kernel<<>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
}

Global Indexes, revisited

Unlike the previous problem, we have multiple dimensions now. We need two global indexes:

  • A global row index (i)
  • A global column index (j)
i = (blockIdx.y * blockDim.y) + threadIdx.y
j = (blockIdx.x * blockDim.x) + threadIdx.x

With these two indicies, each thread is assigned to calculate one unique element Ci,jC_{i,j}.

Checks for integer division

The bounds check is written as follows:

if (i>=M || j>=K){
  return;
}

Core Matrix Multiplication

Now that a valid thread is assigned to calculate a unique element Ci,jC_{i,j}, we need to perform the dot product calculation of the i-th row of matrix A and the j-th column of matrix B.

Ci,j=k=0N1Ai,k×Bk,jC_{i,j}=\sum_{k=0}^{N-1}{A_{i,k}\times B_{k,j}}

To do this calculations, we need a loop that iterates N times with a local accumulator variable to store the dot product sum.

2D to 1D Indexing

Since the matrices are stored in row-major format, we can find the 1D index for an element Xr,cX_{r,c} in a matrix with CdimC_{dim} columns:

Index=r×Cdim+cIndex = r \times C_{dim} + c

Thus, we can calculate the index for Ai,k,Bk,j,and Ci,jA_{i,k}, B_{k,j}, \text{and } C_{i,j}:

$ idx_{A_{i,k}} = iN+k \newline idx_{B_{k,j}} = kK+j \newline idx_{C_{i,j}} = i*K+j \newline $

Solving

__global__ void matrix_multiplication_kernel(const float* A, const float* B, float* C, int M, int N, int K) {

    // 1. Calculate i and j
    int i = (blockIdx.y * blockDim.y) + threadIdx.y;
    int j = (blockIdx.x * blockDim.x) + threadIdx.x;
    // 2. Bounds check
    if (i>=M || j>=K){
      return;
    }
    // 3. Initialize accumulator
    float d = 0.0;
    // 4. Inner loop for dot product
    for (int k=0;k
002352 visitors