This page describes a validation plan for OpenHPC deployment on AArch64 systems. It is not restricted to hardware, cloud or emulation systems and should be reasonably independent of the medium.
Our main concern is how to automate the deployment on a medium that is not only easy to use but also meaningful.
The key points we need are:
- Full automation. We need to be able to start new images, install OpenHPC, download some libraries, configure the master and at least one compute node, compile a few HPC programs, run them and make sure that their output matches what is expected.
- Representative images. We need to make sure that the minimum number of nodes brings a meaningful result, that the programs we run are the kind that are executed on real clusters and that the environment we have chosen is not hiding real-world errors.
- Change driven. Builds and validation need to be triggered automatically by relevant changes to relevant repositories, or at least run at a frequency compatible with said repositories.
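The "full automation" point can be pictured as an ordered list of stages in which a failure stops everything that depends on it. The following is only a minimal sketch under that assumption: the stage names are hypothetical labels taken from the list above, and the lambdas are placeholders for real provisioning, install, build and run steps.

```python
from typing import Callable, List, Tuple

# Each stage is a (name, action) pair; the action returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, str]]:
    """Run stages in order; after the first failure, mark the rest as skipped.

    Returns a list of (stage name, status) pairs, where status is one of
    "passed", "failed" or "skipped".
    """
    report = []
    failed = False
    for name, action in stages:
        if failed:
            report.append((name, "skipped"))
            continue
        try:
            ok = bool(action())
        except Exception:
            # An exception in a stage counts as a failure of that stage.
            ok = False
        report.append((name, "passed" if ok else "failed"))
        failed = not ok
    return report


# Hypothetical stages mirroring the key points; all actions are placeholders.
EXAMPLE_STAGES: List[Stage] = [
    ("start images", lambda: True),
    ("install OpenHPC", lambda: True),
    ("configure master and compute node", lambda: True),
    ("compile HPC programs", lambda: False),  # simulate a build failure
    ("run programs and check output", lambda: True),
]

if __name__ == "__main__":
    for name, status in run_pipeline(EXAMPLE_STAGES):
        print(f"{status:8} {name}")
```

Reporting skipped stages explicitly, rather than stopping silently, makes it easier to see how far a given run got when a change breaks the deployment.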
To achieve the final stage of continuous integration, we'll need to go through a set of steps, from a fully manual setup to a fully automated one. Not all steps will need to be done in this effort, as many of them already exist for other projects, and we should aim to not duplicate any previous (or future) efforts in achieving our CI loop.
Step #0: Identify Process
The first step is to identify every step that is needed, on both x86_64 and AArch64, so that we can get a fully repeatable process.
Engineering Specification
This will likely involve:
- Installing CentOS/SLES on QEMU, container and bare-metal environments.
- Applying the industry-standard changes to those images (security, authentication, performance, etc.).
- Installing OpenHPC directly from its upstream documentation, identifying all pertinent steps that are taken or not taken and the reasons why, per environment.
- Running a baseline validation on the core setup. This should be in a git repo somewhere, probably GitHub, so we can share it with our members and the community.
- Installing all additional components that are needed for the test, for example LLVM's libomp (to compare with GNU's GOMP), special libraries, etc.
- Downloading all tests that will be compiled and executed as part of the validation.
- Compiling all tests, possibly with different toolchains, options and libraries, and identifying every problem (errors, warnings, etc.).
- Running all tests that were successfully compiled and making sure that their output is as expected, taking into account the floating point nature of most of them.
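The "different toolchains, options and libraries" item above amounts to a build matrix. A minimal sketch of how such a matrix could be generated is shown here. The toolchain names, optimisation levels and the `stream.c` file name are assumptions made for illustration; the sketch relies only on GCC linking GNU's libgomp with `-fopenmp` and on upstream Clang linking LLVM's libomp by default with the same flag.

```python
from itertools import product
from typing import Dict, List

# Assumed toolchains: GCC links libgomp with -fopenmp, while upstream Clang
# links libomp by default with the same flag.
TOOLCHAINS: Dict[str, List[str]] = {
    "gcc-gomp": ["gcc", "-fopenmp"],
    "clang-libomp": ["clang", "-fopenmp"],
}

# Assumed optimisation levels; a real plan would list the options that matter.
OPT_LEVELS = ["-O2", "-O3"]


def build_matrix(source: str) -> List[dict]:
    """Return one compile job per (toolchain, optimisation level) pair."""
    stem = source.rsplit(".", 1)[0]
    jobs = []
    for (name, base), opt in product(TOOLCHAINS.items(), OPT_LEVELS):
        output = f"{stem}-{name}{opt}"
        jobs.append({
            "toolchain": name,
            # -Wall is included so that warnings are reported, not only errors.
            "command": base + [opt, "-Wall", source, "-o", output],
        })
    return jobs


if __name__ == "__main__":
    # "stream.c" is a hypothetical test source used only for illustration.
    for job in build_matrix("stream.c"):
        print(" ".join(job["command"]))
```

Keeping the commands as argument lists means they can be passed directly to `subprocess.run` without shell quoting, and the captured compiler output can be stored per job to record every warning and error.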
Once these steps are reproduced by hand, and documented thoroughly, we can start automating the process.
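Because most of the tests produce floating point results, comparing their output byte for byte against a reference is likely to give false failures across toolchains and architectures. One possible approach, sketched here with assumed default tolerances, is to require the non-numeric text to be identical and to compare the numbers pairwise with `math.isclose`.

```python
import math
import re

# Matches integers, decimals and scientific notation such as -1.5e-03.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def outputs_match(expected: str, actual: str,
                  rel_tol: float = 1e-6, abs_tol: float = 1e-12) -> bool:
    """Compare two program outputs, allowing small floating point drift.

    The text around the numbers must be identical; the numbers themselves
    are compared pairwise within the given tolerances.
    """
    # Replace every number with a placeholder and compare the skeletons.
    if NUMBER.sub("#", expected) != NUMBER.sub("#", actual):
        return False  # labels, layout or the count of numbers differ
    exp_nums = [float(n) for n in NUMBER.findall(expected)]
    act_nums = [float(n) for n in NUMBER.findall(actual)]
    return all(math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
               for e, a in zip(exp_nums, act_nums))


if __name__ == "__main__":
    print(outputs_match("sum = 1.0000000", "sum = 1.0000001"))
    print(outputs_match("sum = 1.0", "sum = 1.1"))
```

The tolerances are placeholders: each workload would need values chosen from its own numerical properties, and some tests may need a different comparison altogether.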
Acceptance Criteria
The outcomes of this step are:
- At least one deployment of OpenHPC on AArch64 running OpenMP and MPI workloads on at least one compute node.
- A document, with step-by-step instructions, on how to install, set up and execute a minimal OpenHPC validation suite.
- At least one base test and one workload test must complete successfully.
- Not all of this task needs to be finished before the following ones start, but it would be good to have at least one successful base test (before additional components).
- (optional) GitHub repositories containing a few examples of base tests, scripts to use, etc.
Step #1: Automating deployment
After we know how to validate OpenHPC on AArch64 by hand, we need to identify which deployment solution is both easy to automate and meaningful in the HPC community.
Engineering Specification
Although bare-metal deployments are the most common nowadays, containers are becoming ubiquitous in large-scale, heterogeneous HPC clusters. Containers are also a very easy way to deploy base images on a busy validation hardware infrastructure.
We have a few options:
- Cross-QEMU containers (emulation) on x86_64 hardware: This is slow and problematic, but can sometimes be the only way some SIG members will have access to an ARM environment.
  - PROs: AArch64 QEMU is available upstream, so anyone can use it.
  - CONs: It's really slow and support for HPC operating systems is not complete.
- Native containers (acceleration) on AArch64 hardware: This is the easiest way to deploy images, but will be limited by hardware availability and upstream support for the existing hardware.
  - PROs: Fast, reliable, easy to automate.
  - CONs: Needs access to hardware (developer cloud would help here).
- Bare-metal AArch64 hardware: This is closer to most current deployments and would allow us to test real-world situations (10GbE, InfiniBand, UEFI via serial, etc.).
  - PROs: Tests real-world cases for large-scale deployment.
  - CONs: Really limited on hardware, as each machine will be entirely dedicated.
We shouldn't focus on a single one, as all of them are important, but we should prioritise them, pick the most important to do first, and leave the others until after step #2 has finished its first iteration.
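The difference between the emulated and the accelerated options can be illustrated with the way an AArch64 guest would be started under QEMU. The following is only a sketch: the memory and CPU defaults are arbitrary assumptions, the choice of `cortex-a57` as the emulated CPU is an assumption, and a usable command would also need firmware, disk and network options that are left out here.

```python
import platform
from typing import List, Optional


def qemu_command(host_arch: Optional[str] = None,
                 memory_mb: int = 4096, cpus: int = 4) -> List[str]:
    """Build a partial qemu-system-aarch64 command line for the given host.

    On an AArch64 host the guest can use KVM acceleration; on any other
    host (for example x86_64) it falls back to TCG emulation, which is slow.
    """
    arch = host_arch or platform.machine()
    if arch in ("aarch64", "arm64"):
        # Native host: hardware acceleration with the host CPU model.
        accel = ["-accel", "kvm", "-cpu", "host"]
    else:
        # Cross-architecture host: software emulation of an assumed CPU model.
        accel = ["-accel", "tcg", "-cpu", "cortex-a57"]
    return ["qemu-system-aarch64", "-machine", "virt", *accel,
            "-m", str(memory_mb), "-smp", str(cpus), "-nographic"]


if __name__ == "__main__":
    print(" ".join(qemu_command("x86_64")))
    print(" ".join(qemu_command("aarch64")))
```

Passing `host_arch` explicitly keeps the function testable on any machine; when it is omitted, the host architecture is detected with `platform.machine()`.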