c++ - how to optimize matrix multiplication (matmul) code to run fast on a single processor core

Question

Welcome To Ask or Share your Answers For Others

c++ - how to optimize matrix multiplication (matmul) code to run fast on a single processor core

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - how to optimize matrix multiplication (matmul) code to run fast on a single processor core

I am working on parallel programming concepts and trying to optimize matrix multiplication example on single core. The fastest implementation I came up so far is the following:

/* This routine performs a dgemm operation
 *  C := C + A * B
 * where A, B, and C are lda-by-lda matrices stored in column-major format.
 * On exit, A and B maintain their input values. */    
void square_dgemm (int n, double* A, double* B, double* C)
{
  /* For each row i of A */
  for (int i = 0; i < n; ++i)
    /* For each column j of B */
    for (int j = 0; j < n; ++j) 
    {
      /* Compute C(i,j) */
      double cij = C[i+j*n];
      for( int k = 0; k < n; k++ )
          cij += A[i+k*n] * B[k+j*n];
      C[i+j*n] = cij;
    }
}

The results are like below. how to reduce the loops and increase the performance

login4.stampede(72)$ tail -f job-naive.stdout
Size: 480       Mflop/s:  1818.89       Percentage: 18.95
Size: 511       Mflop/s:  2291.73       Percentage: 23.87
Size: 512       Mflop/s:  937.061       Percentage:  9.76
Size: 639       Mflop/s:  293.434       Percentage:  3.06
Size: 640       Mflop/s:  270.238       Percentage:  2.81
Size: 767       Mflop/s:  240.209       Percentage:  2.50
Size: 768       Mflop/s:  242.118       Percentage:  2.52
Size: 769       Mflop/s:  240.173       Percentage:  2.50
Average percentage of Peak = 22.0802
Grade = 33.1204

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:04:14+0000

The state-of-the-art implementation of matrix multiplication on CPUs uses GotoBLAS algorithm. Basically the loops are organized in the following order:

Loop5 for jc = 0 to N-1 in steps of NC
Loop4   for kc = 0 to K-1 in steps of KC
          //Pack KCxNC block of B
Loop3     for ic = 0 to M-1 in steps of MC
            //Pack MCxKC block of A
//--------------------Macro Kernel------------
Loop2       for jr = 0 to NC-1 in steps of NR
Loop1         for ir = 0 to MC-1 in steps of MR
//--------------------Micro Kernel------------
Loop0           for k = 0 to KC-1 in steps of 1
                //update MRxNR block of C matrix

A key insight underlying modern high-performance implementations of matrix multiplication is to organize the computations by partitioning the operands into blocks for temporal locality (3 outer most loops), and to pack (copy) such blocks into contiguous buffers that fit into various levels of memory for spatial locality (3 inner most loops).

The above figure (originally from this paper, directly used in this tutorial) illustrates the GotoBLAS algorithm as implemented in BLIS. Cache blocking parameters {MC, NC, KC} determine the submatrix sizes of Bp (KC × NC) and Ai (MC × KC), such that they fit in various caches. During the computation, row panels Bp are contiguously packed into buffer Bp to fit in the L3 cache. Blocks Ai are similarly packed into buffer Ai to fit in the L2 cache. Register block sizes {MR, NR} relate to submatrices in registers that contribute to C. In the micro-kernel (the inner most loop), a small MR × NR micro-tile of C is updated by pair of MR × KC and KC × NR slivers of Ai and Bp.

For the Strassen's algorithm with O(N^2.87) complexity, you might be interested in reading this paper. Other fast matrix multiplication algorithms with asymptotic complexity less than O(N^3) can be easily extended in this paper. There is a recent thesis about the practical fast matrix multiplication algorithms.

The following tutorials might be helpful if you want to learn more about how to optimize matrix multiplication on CPUs:

How to Optimize GEMM Wiki

GEMM: From Pure C to SSE Optimized Micro Kernels

BLISlab: A sandbox for optimizing GEMM for CPU and ARM

A most updated document about how to optimize GEMM on CPUs (with AVX2/FMA) step by step can be downloaded here: https://github.com/ULAFF/LAFF-On-HPC/blob/master/LAFF-On-PfHP.pdf

A Massive Open Online Course to be offered on edX starting in June 2019 (LAFF-On Programming for High Performance): https://github.com/ULAFF/LAFF-On-HPC http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html

Categories

c++ - how to optimize matrix multiplication (matmul) code to run fast on a single processor core

c++ - how to optimize matrix multiplication (matmul) code to run fast on a single processor core

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags