SMM Benchmark

This collection of code samples exercises different memory streaming cases when performing the matrix multiplication C_{m x n} = alpha · A_{m x k} · B_{k x n} + beta · C_{m x n}: (1) streaming the matrices A, B, and C, which is usually referred to as batched matrix multiplication, (2) streaming the inputs A and B but accumulating C within cache, (3) streaming the A and C matrices while B is kept in cache, (4) streaming the B and C matrices while A is kept in cache, and (5) not streaming any of the operands but repeating the very same multiplication until the requested number of matrix multiplications has been completed.
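The five cases differ only in which operands advance from one multiplication to the next. The following is a minimal pure-Python sketch of that idea, not the benchmark's actual code: the names `smm`, `STREAMING`, and `run_case`, the case labels, and the column-major flat-list layout are all assumptions made for illustration.

```python
def smm(a, b, c, m, n, k, alpha=1.0, beta=1.0):
    """Compute C = alpha * A * B + beta * C in place.

    a, b, c are flat lists in column-major order (an assumed layout).
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[p * m + i] * b[j * k + p]
            c[j * m + i] = alpha * acc + beta * c[j * m + i]


# Hypothetical labels: which of (A, B, C) advance per multiplication.
# An operand that does not advance is reused and can stay in cache.
STREAMING = {
    "ABC": (True, True, True),         # (1) batched multiplication
    "AB": (True, True, False),         # (2) C accumulates in cache
    "AC": (True, False, True),         # (3) B stays in cache
    "BC": (False, True, True),         # (4) A stays in cache
    "cached": (False, False, False),   # (5) same multiplication repeated
}


def run_case(case, batch_a, batch_b, batch_c, m, n, k):
    """Run len(batch_a) multiplications using the given streaming case."""
    stream_a, stream_b, stream_c = STREAMING[case]
    for i in range(len(batch_a)):
        a = batch_a[i if stream_a else 0]
        b = batch_b[i if stream_b else 0]
        c = batch_c[i if stream_c else 0]
        smm(a, b, c, m, n, k)
```

With `"AB"`, every product is accumulated into the first C matrix, while `"ABC"` writes each product to its own C matrix.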

In addition to the duration of a test case, the performance is reported in GFLOPS/s. As an alternative metric, the memory bandwidth is given (the artificial “cached” case does not report a cache-memory bandwidth). The “pseudo-performance” given in FLOPS/cycle is an artificial score: it not only uses a non-standard formula for calculating the FLOPS (2 * M * N * K - M * N rather than 2 * M * N * K) but also relies on (pseudo-)clock cycles:

$ ./specialized.sh 0
m=32 n=32 k=32 size=87381 memory=2048.0 MB (DP)

Batched (A,B,C)...
        pseudo-perf.: 10.7 FLOPS/cycle
        performance: 23.9 GFLOPS/s
        bandwidth: 11.1 GB/s
        duration: 239 ms
Finished
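The numbers in this output can be cross-checked with a little arithmetic. The sketch below is a reconstruction under stated assumptions, not the benchmark's own code: it assumes that 1 MB means 2^20 bytes, that one batch element consists of one A, one B, and one C matrix of 8-byte (DP) values, and that the function names are hypothetical.

```python
def batch_size(memory_mb, m, n, k, elem_bytes=8):
    """Number of (A, B, C) triples that fit into memory_mb (assumes 1 MB = 2**20 bytes)."""
    bytes_per_triple = (m * k + k * n + m * n) * elem_bytes
    return (memory_mb * 2**20) // bytes_per_triple


def gflops_per_s(m, n, k, size, duration_s):
    """Performance from the standard FLOP count 2 * M * N * K per multiplication."""
    return 2.0 * m * n * k * size / duration_s * 1e-9


def pseudo_flops(m, n, k):
    """Non-standard FLOP count used for the pseudo-performance score."""
    return 2 * m * n * k - m * n


size = batch_size(2048, 32, 32, 32)
print(size)
print(gflops_per_s(32, 32, 32, size, 0.239))
print(pseudo_flops(32, 32, 32))
```

Under these assumptions the computed batch size equals the reported `size=87381`, and the computed rate is close to the reported 23.9 GFLOPS/s (the displayed duration is rounded to milliseconds, so a small deviation is expected).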

There are two sub-collections of sample codes: (1) a collection of C++ code samples that carry out the multiplication using either BLAS, compiler-generated (inlined) code, LIBXSMM/dispatched, or LIBXSMM/specialized functions, and (2) a Fortran sample code comparing BLAS with LIBXSMM, including some result validation.

C/C++ Code Samples: Command Line Interface (CLI)

Takes an optional number (1st arg.) to select the streaming case (0…8)
Optionally takes the M, N, and K parameters of the GEMM, in this order
If only M is supplied, N and K “inherit” the M-value
Example I (A,B,C): ./specialized.sh 0 16 8 9
Example II (A,B): ./specialized.sh 6 16
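The argument rules above can be sketched as a small parser. This is an illustration only and not the samples' actual argument handling: the function name `parse_cli` and both default values are hypothetical, and letting K fall back to M when only M and N are given is an assumption, since the text only describes the case where M alone is supplied.

```python
def parse_cli(args, default_case=0, default_m=32):
    """Parse [case [M [N [K]]]] from a list of strings.

    default_case and default_m are placeholder values (assumptions),
    not the defaults of the actual samples.
    """
    case = int(args[0]) if len(args) > 0 else default_case
    if not 0 <= case <= 8:
        raise ValueError("streaming case must be in 0...8")
    m = int(args[1]) if len(args) > 1 else default_m
    n = int(args[2]) if len(args) > 2 else m  # N inherits the M-value
    k = int(args[3]) if len(args) > 3 else m  # K inherits the M-value (assumed)
    return case, m, n, k


print(parse_cli(["0", "16", "8", "9"]))  # arguments of Example I
print(parse_cli(["6", "16"]))            # arguments of Example II
```

For the arguments of Example II, the sketch yields M = N = K = 16 for streaming case 6.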

Fortran Code Sample: Command Line Interface (CLI)

Optionally takes the M, N, and K parameters of the GEMM, in this order
Optional problem size (in MB) of the workload; M/N/K must have been supplied
Optional total problem size (in MB), which implies the number of repeated runs
If only M is supplied, N and K “inherit” the M-value
Shows the performance of each of the streaming cases
Example I: ./smm.sh 16 8 9 1024 16384
Example II: ./smm.sh 16