SPECFEM Sample

This sample contains a dummy example of a spectral-element stiffness kernel taken from SPECFEM3D_GLOBE.

It is based on a 4th-order spectral-element stiffness kernel for simulating elastic wave propagation through the Earth. The matrix sizes used are (25,5), (5,25), and (5,5), determined by different cut-planes through a three-dimensional (5,5,5) element with a total of 125 GLL points.
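The cut-planes mentioned above can be illustrated with a minimal sketch. It assumes the usual arrangement in which a derivative matrix of size (5,5) is applied along each of the three directions of the element; the names `u`, `d`, and `matmul` as well as the data values are hypothetical and not taken from the sample's Fortran source.

```python
N = 5  # GLL points per direction in a 4th-order element (5*5*5 = 125 points)

def matmul(a, b):
    """Plain product of two matrices stored as nested lists."""
    inner = len(b)
    assert all(len(row) == inner for row in a)
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical field values u[i][j][k] and a stand-in (5,5) derivative matrix.
u = [[[float(i + 2 * j + 3 * k) for k in range(N)] for j in range(N)] for i in range(N)]
d = [[float(i - j) + 0.5 for j in range(N)] for i in range(N)]
dt = transpose(d)

# Cut 1: view the element as (5,25), with column index j + N*k.
u_5x25 = [[u[i][j][k] for k in range(N) for j in range(N)] for i in range(N)]
t1 = matmul(d, u_5x25)                  # (5,5) x (5,25) -> (5,25)

# Cut 2: five (5,5) slices, one per value of k.
u_slices = [[[u[i][j][k] for j in range(N)] for i in range(N)] for k in range(N)]
t2 = [matmul(s, dt) for s in u_slices]  # 5 times (5,5) x (5,5) -> (5,5)

# Cut 3: view the element as (25,5), with row index i + N*j.
u_25x5 = [[u[i][j][k] for k in range(N)] for j in range(N) for i in range(N)]
t3 = matmul(u_25x5, dt)                 # (25,5) x (5,5) -> (25,5)

print(len(t1), len(t1[0]))        # shape of the first result
print(len(t2[0]), len(t2[0][0]))  # shape of one slice of the second result
print(len(t3), len(t3[0]))        # shape of the third result
```

Each direction therefore reduces to small dense matrix products of fixed shape, which is the kind of workload LIBXSMM targets.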

Usage Step-by-Step

This example needs the LIBXSMM library to be built with static kernels, using MNK="5 25" (for matrix sizes (5,25), (25,5), and (5,5)).
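As a rough illustration of why MNK="5 25" covers the required sizes, the sketch below assumes that the MNK list is expanded into every (M,N,K) combination of the given values; this expansion rule is an assumption to be checked against the LIBXSMM documentation. The helper `expand_mnk` and the set `needed` are hypothetical names, and the triples in `needed` follow from reading each product as A(M,K) times B(K,N).

```python
from itertools import product

def expand_mnk(spec):
    """Expand an MNK string into all (M, N, K) triples (assumed expansion rule)."""
    sizes = [int(tok) for tok in spec.split()]
    return sorted(product(sizes, repeat=3))

kernels = expand_mnk("5 25")

# Triples for (5,5)x(5,25), (25,5)x(5,5) and (5,5)x(5,5), read as A(M,K) * B(K,N).
needed = {(5, 25, 5), (25, 5, 5), (5, 5, 5)}

for m, n, k in kernels:
    marker = "needed" if (m, n, k) in needed else ""
    print(m, n, k, marker)
```

Under this assumption the build generates more kernels than the sample uses, and the three needed shapes are among them.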

Build LIBXSMM

General Default Compilation

In LIBXSMM root directory, compile the library with:

make MNK="5 25" ALPHA=1 BETA=0

Additional Compilation Examples

Compiling only the single-precision version with aggressive optimization:

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3

For Sandy Bridge CPUs:

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=1

For Haswell CPUs:

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=2

For Knights Corner (KNC), which here also creates a Sandy Bridge version (AVX=1):

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=1 \
OFFLOAD=1 KNC=1

Installing libraries into a sub-directory workstation/:

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=1 \
OFFLOAD=1 KNC=1 \
PREFIX=workstation/ install-minimal

Build SPECFEM Example Code

For default CPU host:

cd sample/specfem
make

For Knights Corner (KNC):

cd sample/specfem
make KNC=1

Specific Fortran compiler flags can be added as well, for example:

cd sample/specfem
make FCFLAGS="-O3 -fopenmp" [...]

Note that the two build steps above (building LIBXSMM and building the example code) can be combined by specifying the "specfem" make target in the LIBXSMM root directory:

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=1 specfem

For Knights Corner, this would need two steps:

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=1 OFFLOAD=1 KNC=1
make OPT=3 specfem_mic

Run the Performance Test

For default CPU host:

./specfem.sh

For Knights Corner (KNC):

./specfem.sh -mic

Results

Using Intel Compiler suite: icpc 15.0.2, icc 15.0.2, and ifort 15.0.2.

Sandy Bridge - Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Library compilation (in the root directory):

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=1

Single-threaded example run:

cd sample/specfem
make; OMP_NUM_THREADS=1 ./specfem.sh

Output:

===============================================================
average over           15 repetitions
 timing with Deville loops    =   0.1269
 timing with unrolled loops   =   0.1737 / speedup =   -36.87 %
 timing with LIBXSMM dispatch =   0.1697 / speedup =   -33.77 %
 timing with LIBXSMM prefetch =   0.1611 / speedup =   -26.98 %
 timing with LIBXSMM static   =   0.1392 / speedup =    -9.70 %
===============================================================
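The "speedup" column in this output appears to be measured relative to the Deville loops timing. The sketch below is inferred from the printed numbers rather than taken from the sample's source; the function name `speedup_percent` is hypothetical. With the timings above it reproduces the printed percentages to within the rounding of the four-digit timings.

```python
def speedup_percent(t_reference, t_variant):
    """Relative gain of a variant over the reference timing, in percent.

    Negative values mean the variant was slower than the reference.
    """
    return (t_reference - t_variant) / t_reference * 100.0

# Timings copied from the Sandy Bridge single-threaded output above.
t_deville = 0.1269
variants = {
    "unrolled loops": 0.1737,
    "LIBXSMM dispatch": 0.1697,
    "LIBXSMM prefetch": 0.1611,
    "LIBXSMM static": 0.1392,
}

for name, t in variants.items():
    print(f"{name:18s} {speedup_percent(t_deville, t):8.2f} %")
```

Read this way, a negative percentage means the variant ran slower than the Deville loops, and a positive one means it ran faster.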

Haswell - Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Library compilation (in the root directory):

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 AVX=2

Single-threaded example run:

cd sample/specfem
make; OMP_NUM_THREADS=1 ./specfem.sh

Output:

===============================================================
average over           15 repetitions
 timing with Deville loops    =   0.1028
 timing with unrolled loops   =   0.1385 / speedup =   -34.73 %
 timing with LIBXSMM dispatch =   0.1408 / speedup =   -37.02 %
 timing with LIBXSMM prefetch =   0.1327 / speedup =   -29.07 %
 timing with LIBXSMM static   =   0.1151 / speedup =   -11.93 %
===============================================================

Multi-threaded example run:

cd sample/specfem
make OPT=3; OMP_NUM_THREADS=24 ./specfem.sh

Output:

OpenMP information:
  number of threads =           24

[...]

===============================================================
average over           15 repetitions
 timing with Deville loops    =   0.0064
 timing with unrolled loops   =   0.0349 / speedup =  -446.71 %
 timing with LIBXSMM dispatch =   0.0082 / speedup =   -28.34 %
 timing with LIBXSMM prefetch =   0.0076 / speedup =   -19.59 %
 timing with LIBXSMM static   =   0.0068 / speedup =    -5.78 %
===============================================================

Knights Corner - Intel Xeon Phi B1PRQ-5110P/5120D

Library compilation (in the root directory):

make MNK="5 25" ALPHA=1 BETA=0 PRECISION=1 OPT=3 OFFLOAD=1 KNC=1

Multi-threaded example run:

cd sample/specfem
make FCFLAGS="-O3 -fopenmp -warn" OPT=3 KNC=1; ./specfem.sh -mic

Output:

OpenMP information:
  number of threads =          236

[...]

===============================================================
average over           15 repetitions
 timing with Deville loops    =   0.0164
 timing with unrolled loops   =   0.6982 / speedup = -4162.10 %
 timing with LIBXSMM dispatch =   0.0170 / speedup =    -3.89 %
 timing with LIBXSMM static   =   0.0149 / speedup =     9.22 %
===============================================================