Benchmarking
In the HPC-SIG, we're concerned with two types of benchmarks: system and cluster.
System Benchmarks
These are benchmarks where the CPU architecture and features, together with memory capacity and speed, are the main factors determining performance.
They measure how well toolchains generate optimal code for each micro-architecture within the Arm architecture, as well as the ability of the different vendors' hardware (CPU, memory) to execute that code.
A number of additional factors can also impact performance:
- Number of cores, threads, use of hyper-threading
- Base/Max frequency, power scheduling
- Cache configuration (per core, cluster, socket)
- Amount of RAM per core, NUMA effects, multi-channel
To make sure these parameters are measured and taken into account, we need a strategy for running the benchmarks that not only tracks those changes but also enforces a consistent, repeatable and accurate execution model.
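As an illustration of what "measured and taken into account" could mean in practice, here is a minimal sketch of a configuration snapshot the harness might record alongside each run. It assumes a Linux host and Python; the /sys and /proc paths used are standard but not guaranteed on every kernel, and the field names are our own.

```python
#!/usr/bin/env python3
"""Minimal sketch: record the hardware factors listed above (cores,
frequency, cache, NUMA, RAM) so each benchmark result can be tied to
the exact machine configuration. Linux-only; paths may vary."""

import json
import os
from pathlib import Path

def read(path):
    """Return the stripped contents of a sysfs/procfs file, or None."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def system_snapshot():
    meminfo = read("/proc/meminfo")
    cache_dir = Path("/sys/devices/system/cpu/cpu0/cache")
    return {
        "online_cpus": os.cpu_count(),
        # Maximum frequency (kHz) of CPU0; other cores may differ.
        "cpu0_max_freq_khz": read(
            "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq"),
        # Cache size per level/index for CPU0.
        "cpu0_caches": {p.name: read(p / "size")
                        for p in sorted(cache_dir.glob("index*"))},
        # NUMA nodes present on the system.
        "numa_nodes": sorted(
            p.name for p in Path("/sys/devices/system/node").glob("node[0-9]*")),
        # First line of /proc/meminfo is "MemTotal: ... kB".
        "mem_total": meminfo.splitlines()[0] if meminfo else None,
    }

if __name__ == "__main__":
    print(json.dumps(system_snapshot(), indent=2))
```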
For example, different machines will have DIMMs populated in different slots and channels, and CPU sockets will have different access to them. By understanding the system configuration, we can pin (taskset) the process(es) to specific cores/threads (a pinning sketch follows this list) to make sure:
- Cache and memory utilisation will be pushed to the maximum,
- Concurrent processes (either different programs, like SPEC, or different threads of the same program, like Lulesh) don't starve each other's resources.
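A minimal sketch of such pinning, assuming a Linux host and Python (os.sched_setaffinity); the benchmark command and core set below are placeholders, not the real harness:

```python
#!/usr/bin/env python3
"""Minimal sketch: launch a benchmark pinned to a chosen set of cores
so cache/memory behaviour is repeatable and concurrent runs don't
starve each other. Linux-only; the command and cores are placeholders."""

import os
import subprocess

def run_pinned(cmd, cores, env=None):
    """Run `cmd` with its CPU affinity restricted to `cores`."""
    def set_affinity():
        # Runs in the child between fork and exec, so the benchmark
        # never executes on any other core. 0 == the calling process.
        os.sched_setaffinity(0, cores)

    return subprocess.run(cmd, preexec_fn=set_affinity, env=env, check=True)

if __name__ == "__main__":
    # Hypothetical single-core run pinned to core 3.
    run_pinned(["./himeno"], {3})
```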
These are the benchmarks we are working on to bootstrap our harness:
- Lulesh: OpenMP, from one core (CPU/memory bound) to all cores (scalability, locality, OpenMP tool support); a thread-sweep sketch follows this list.
- Himeno: single-core matrix operations (cache hierarchy, toolchain optimisation)
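The thread sweep mentioned above could look roughly like this sketch; the binary name, its -s (problem size) flag and the simple wall-clock timing are assumptions, not the final harness:

```python
#!/usr/bin/env python3
"""Minimal sketch: sweep OMP_NUM_THREADS from one core to all cores
for an OpenMP benchmark such as Lulesh. The binary name and -s flag
are placeholders; timing is plain wall-clock around each run."""

import os
import subprocess
import time

def sweep(binary="./lulesh2.0", size="50"):
    results = {}
    for threads in range(1, (os.cpu_count() or 1) + 1):
        env = dict(os.environ, OMP_NUM_THREADS=str(threads))
        start = time.perf_counter()
        subprocess.run([binary, "-s", size], env=env, check=True,
                       stdout=subprocess.DEVNULL)
        results[threads] = time.perf_counter() - start
    return results

if __name__ == "__main__":
    for threads, seconds in sweep().items():
        print(f"{threads:3d} threads: {seconds:8.2f} s")
```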
The benchmarks we'll be looking at next are:
- Open (single and multiple runs):
  - Polybench: 30 programs (LU, Jacobi, matrix ops, solvers, decomposition)
  - SciMark2: FFT, sparse matrix, Monte Carlo, LU, etc.
  - Livermore Loops: loop vectorisation targets
- Closed:
  - SPEC 2017: focusing on the HPC side of it (modelling, solving, rendering)
The benchmarks that the TCWG team already runs are:
- SPEC 2k and SPEC 2k6 (compiler benchmarks)
- LLVM Test-Suite (including TSVC, Polybench, SciMark2)
- EEMBC (tailored to embedded and mobile workloads)
Cluster Benchmarks
These are scalability benchmarks, where performance depends on toolchain optimisation quality, library availability and micro-arch affinity and, more importantly, on network topology, bandwidth, latency and special features (RDMA, sub-networks, MPI integration).
Most of those benchmarks need special care when picking the parameters of the run. Data locality, network topology and raw single-core power can make a big difference to the amount of useful work a benchmark does.
For example, with too fine a grid the cluster will spend most of its time reducing (i.e. aggregating the data, collecting results and writing to global memory); but the number of cores in each node and the power of those cores are limited, so there's a different balance for every cluster.
In light of that, the exercise of looking at the benchmarks below is more about the cluster than it is about each node. Loop vectorisation is probably the compiler optimisation most likely to matter here, but still less than optimised libraries and network configuration. Those, in turn, are likely to have less impact than choosing perfectly fine-tuned parameters.
Given that our clusters are very small (<5 nodes), we are likely to learn very little from running those benchmarks, so their priority is much lower than that of the ones above. But we still want to run them as a complete exercise: proving we can automatically provision an entire cluster and run a benchmark with reasonable performance.
The benchmarks we'll be looking at are:
- Lulesh: MPI and MPI+OpenMP versions (scalability, network bottlenecks)
- HPL: standard LU factorisation with partial pivoting, scalable to many nodes (a sizing sketch follows this list)
- HPCG: more elaborate data access patterns, stressing parts of the system that HPL doesn't
- HPGMG: large-scale solvers, stressing both computation and scalability
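As an example of the parameter-picking problem mentioned above, HPL's problem size N is commonly chosen so that the N x N double-precision matrix fills a large fraction of the cluster's aggregate memory and is a multiple of the block size NB. A minimal sketch of that rule of thumb (node count, memory per node and NB below are hypothetical):

```python
#!/usr/bin/env python3
"""Minimal sketch: derive an HPL problem size N from aggregate cluster
memory, using the common rule of thumb that the N x N matrix of 8-byte
doubles should occupy roughly 80% of total RAM and that N should be a
multiple of the block size NB. All inputs are hypothetical."""

import math

def hpl_problem_size(nodes, mem_per_node_gib, fraction=0.8, nb=192):
    total_bytes = nodes * mem_per_node_gib * 1024**3
    # N^2 * 8 bytes <= fraction * total memory  =>  N <= sqrt(fraction * mem / 8)
    n = int(math.sqrt(fraction * total_bytes / 8))
    # Round down to a multiple of NB so the block decomposition divides evenly.
    return n - (n % nb)

if __name__ == "__main__":
    # Example: 4 nodes with 64 GiB each.
    print("Suggested N:", hpl_problem_size(nodes=4, mem_per_node_gib=64))
```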
Further Applications/Benchmarks
Here's an additional list of compiler benchmarks from the Polly team, many of them relevant to HPC:
http://llvm.org/docs/Proposals/TestSuite.html
These are benchmarks we're not currently looking at, nor have plans to, but they can serve as a list for future work.