HPC-SIG Bootstrap Tasks

Introduction

The HPC SIG has a number of stakeholders and a number of uncertainties. To unblock as many avenues of research as possible, a few tasks need to be done first.

These tasks fall broadly into three categories:

  1. Infrastructure: Most SIG members won't have hardware available soon enough, so we need to facilitate and document clear processes for using OpenHPC on QEMU clusters. This means understanding the network topology of QEMU's User Network system, as well as making sure that the usual cluster management systems (PXE, DHCP, Slurm, Nagios) work well on QEMU clusters.
  2. Emulation: Similarly, SVE won't be available in hardware for a while, so having a way to emulate SVE in software will allow all members to test their code generation tools (hand-written code, compilers). A full-featured QEMU is our target, but other alternatives may help while we work on QEMU.
  3. Libraries & Benchmarks: There is a huge number of HPC libraries out there and we can't look at them all. More importantly, it would be inefficient for us to look directly at any one of them in particular, to the detriment of others. We need to define how deep our investigations go, and how we help SIG members leverage their own efforts on the same libraries.

The primary idea of the bootstrap effort is to bring all members together by spending time on the common, shared parts, so that each member's own bootstrap gets easier.

It will take some time to reach this synergy, and that is the aim of this document: to show what we're doing and how, so that members can follow the progress and discuss the priorities at such an important moment.

Later on, when our roadmap is defined and the process is ongoing, we can go back to a more traditional development process (project management, roadmaps, monthly updates, etc).

Infrastructure

The final goal of this bootstrap task is to be able to easily set up an AArch64+SVE OpenHPC cluster using QEMU.

We can approach it from two sides: implementing SVE in QEMU (see Emulation below) and defining the QEMU infrastructure (this task) to allow multiple machines in a cluster configuration.

We shall work with CentOS 7.3 + OpenHPC 1.3 for both x86 and AArch64.

QEMU Networking (both x86_64 and AArch64)

QEMU has a fully functional User Network emulation layer, which allows you to create multiple networks onto which the virtual machines can attach their ethernet devices.

Using Virt Manager, you can end up with the two networks you need by going to "Edit > Connection Details > Virtual Networks" and adding a new, "internal" network (no DHCP) alongside the existing "default" one.
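
If you prefer the command line, the same internal network can be defined with virsh. A minimal sketch, with a hypothetical bridge name and subnet (adjust to your setup); leaving out the <forward> element makes the network isolated, and leaving out <dhcp> means libvirt won't serve DHCP on it:

$ cat > internal.xml <<'EOF'
<network>
  <name>internal</name>
  <bridge name='virbr1'/>
  <ip address='10.1.1.1' netmask='255.255.255.0'/>
</network>
EOF
$ sudo virsh net-define internal.xml
$ sudo virsh net-start internal
$ sudo virsh net-autostart internal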

Once you create your master node, make sure one of its interfaces is connected to the "default" network (in NAT mode) so that it has Internet access, and add a secondary NIC connected to the internal one. That second NIC is the one you'll serve PXE/DHCP over to the slaves.
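
For example, when creating the master with virt-install, both NICs can be requested up front. A rough sketch (the name, sizes and install source below are illustrative, not a recommendation):

$ sudo virt-install --name ohpc-master --memory 4096 --vcpus 4 \
    --disk size=40 \
    --network network=default \
    --network network=internal \
    --location <CentOS-7.3-install-tree-URL>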

On the slave nodes, add the NIC to the internal network only, so that they don't see the default network directly, but only via the master. The main differences from the OpenHPC 1.3 docs are:

  • You need to change your DHCP configuration to be "authoritative" (/etc/warewulf/dhcpd-template.conf), as your master will be the only DHCP server on that network;
  • You need to set up NAT via iptables, to route traffic from the internal to the external network (see the sketch below);
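
A minimal sketch of the NAT part, assuming the external (NAT'd) interface is eth0 and the internal one is eth1; adjust the interface names and make the rules persistent in whatever way your distribution prefers:

$ sudo sysctl -w net.ipv4.ip_forward=1
$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
$ sudo iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
$ sudo iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT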

QEMU Images (AArch64 only)

Debian/Ubuntu installations should have the AArch64 UEFI image from packages (qemu-efi), but Fedora needs a few additional steps.

If you're not on Fedora (e.g. Arch Linux), you'll need to adapt Fedora's steps: extract the RPM and copy the files into the right places.

  • Go to the repository mentioned in those steps, in the EDK2 directory
  • Download the "aarch64" RPM (there should be only one)
  • Unpack it with rpm2cpio; note that rpm2cpio writes a cpio archive to stdout, so pipe it into cpio rather than letting it hit your terminal (see the sketch after this list).
  • Copy the CODE/VARS files into an "aarch64" directory inside the one where the x86 ones are (on my box, it's /usr/share/ovmf)
    • IMPORTANT: virt-install didn't like it when I copied them in alongside the x86 ones, but was happy when I created an "aarch64" directory inside.
  • Update your /etc/libvirt/qemu.conf as per the UEFI instructions, paying attention to your local directory layout.
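
Roughly, the extraction and copy steps look like this (the RPM file name and the paths inside it are illustrative; check what the unpacked tree actually contains):

$ rpm2cpio edk2.git-aarch64-<version>.noarch.rpm | cpio -idmv
$ sudo mkdir -p /usr/share/ovmf/aarch64
$ sudo cp ./usr/share/edk2.git/aarch64/*pflash*.raw /usr/share/ovmf/aarch64/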

This would be enough if there weren't a bug in libvirt. It has to do with how interrupts are emulated: the new GIC version 3 doesn't work well with AArch64 guests yet.

Find your guest's configuration file (/etc/libvirt/qemu/*.xml) and change it as per the bug report.
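
A sketch of the workaround, assuming the fix in the bug report is to pin the emulated interrupt controller (GIC) to version 2; virsh edit is a safer way to touch the same file:

$ sudo virsh edit <guest-name>

and then, inside the <features> element, add:

  <gic version='2'/>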

Also, avoid a graphical display and prefer a console interface: in virt-manager, remove the display (if any) and add a serial console, and QEMU will automatically redirect the output there.
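
Once the guest is running, you can attach to that console with:

$ sudo virsh console <guest-name>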

After these changes, you may need to restart libvirtd:

$ sudo systemctl restart libvirtd

QEMU Version

The latest QEMU release is 2.8, but that version is quite slow when emulating multiple cores, especially across architectures (like AArch64 on x86_64). There was a big change in how QEMU handles locking on the master branch (MTTCG), which will be part of 2.9, but that isn't released yet. If you want to try it out, you'll have to compile from master (or install a *-git package from your distro, if available).
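
If you do build from master, something along these lines should be enough for our purposes. A rough sketch, assuming the upstream repository URL (use whichever mirror or *-git package you prefer); --target-list keeps the build down to the AArch64 system and user-mode emulators:

$ git clone git://git.qemu.org/qemu.git
$ cd qemu
$ ./configure --target-list=aarch64-softmmu,aarch64-linux-user
$ make -j$(nproc)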

Emulation

The goal here is to have some form of SVE emulation going, so that we can begin testing SVE workloads.

Linaro is working on an upstream QEMU SVE implementation, but if we can find something already existing, it would speed up adoption and member bootstrapping.

ARM has an instruction emulator, which could be used on hardware or maybe even inside QEMU to wrap just the SVE part for now, but that depends on how it can be distributed.

After several discussions with the QEMU team and with HPC-SIG members who use emulators, "user emulation" is the best way to start. This means running ARM software on x86_64 machines directly on the x86_64 kernel, with QEMU translating the ARM instructions and forwarding the system calls to the host. This is not the same as ARM's emulator (which requires actual ARM hardware), though it still requires a physical deployment (an x86_64 cluster) to test distributed workloads.
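
As a trivial example of the user-emulation flow, assuming an AArch64 cross-compiler package (e.g. aarch64-linux-gnu-gcc) is installed; statically linking avoids having to point QEMU at an AArch64 sysroot (otherwise pass -L <sysroot> to qemu-aarch64):

$ aarch64-linux-gnu-gcc -static -O2 -o hello hello.c
$ qemu-aarch64 ./hello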

We can start with single-threaded SVE workloads, and move up once we have either full-system emulation or actual hardware to provide the required ecosystem.

Libraries & Benchmarks

There is an almost infinite number of libraries we could work on, and not all of them are important for bootstrapping others.

The focus now should be on two fronts:

  1. Unblocking further work: things like OpenMP, BLAS and FFTW, making sure they run well on AArch64 (not necessarily with SVE for now) and can be optimised. These libraries are heavily used by other libraries, so we get the largest leverage by looking at them first, allowing our members and the rest of the community to focus on their own libraries instead.
  2. Experimentation: specific libraries for specific problems that are small enough and vectoriser-friendly enough that we can get them using SVE relatively quickly (with the help of the members), so as to have a few proofs of concept of how SVE can and should work.

The highest-priority library task is to make sure OpenMP works well on plain AArch64 (no SVE). "Works well" generally means performance in line with the slow-down or speed-up that other applications see on a particular sub-architecture. So, if a certain chip is on average x% slower than a top-of-the-line Xeon on other benchmarks, OpenMP should be about x% slower and should scale in the same way as more cores are added.

This also involves investigating both GOMP and LLVM's OpenMP runtimes, and making them available in OpenHPC, so that members (and everyone else) can experiment on their own. LLVM 4.0 has been released with OpenMP built by default, so having that version in OpenHPC as a whole is our target.
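
As a quick, hand-rolled way to compare the two runtimes on the same AArch64 box, the same code can be built against each and scaled over threads (the benchmark source and thread counts below are placeholders):

$ gcc -O2 -fopenmp bench.c -o bench-gomp     # links against GOMP (libgomp)
$ clang -O2 -fopenmp bench.c -o bench-llvm   # links against LLVM's libomp
$ for t in 1 2 4 8; do OMP_NUM_THREADS=$t ./bench-gomp; done
$ for t in 1 2 4 8; do OMP_NUM_THREADS=$t ./bench-llvm; done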

SVE Enablement

The core libraries (BLAS, FFTW, LAPACK) are too big and too important to start adding SVE to directly. We need to start with smaller, niche libraries, ideally ones that already vectorise for NEON and AVX-512, so that a migration path is already in place.

CERN has identified a few choices:

Riken has identified these programs:

Benchmarks

There are many benchmarks that exhibit parallelisation, and here is a list of a few interesting ones:

As well as the standard compiler benchmarks: