...
To achieve the final stage of continuous integration, we'll need to go through a set of steps, from a fully manual setup to a fully automated one. Not all steps will need to be done in this effort, as many of them already exist for other projects, and we should aim to not duplicate any previous (or future) efforts in achieving our CI loop.
First Iteration
The first steps are the bootstrap process, where we'll define the best practices, consult other Linaro teams, the upstream OpenHPC community and the SIG members for the multiple decisions we'll need to take to guide us to a meaningful validation process without duplication of work or moving in the wrong direction.
Step #0: Identify Process
The first step is to identify every step that is needed, on both x86_64 and AArch64, so that we can get a fully repeatable process.
Engineering Specification
This will likely involve:
...
Once these steps are reproduced by hand, and documented thoroughly, we can start automating the process.
Acceptance Criteria
The outcomes of this step are:
- At least one deployment of OpenHPC on AArch64 running OpenMP and MPI workloads in at least one compute node.
- A document, with step-by-step instructions, on how to install, setup and execute a minimal OpenHPC validation suite.
- At least one base test and one workload test need to be achieved successfully.
- Not all of this task needs to be finished before the following ones start, but it would be good to have at least one successful base test (before additional components).
- (optional) GitHub repositories containing a few examples on base tests, scripts to use, etc.
Step #1: Automating deployment (0.1 ~ 0.2)
After we know how to validate OpenHPC on AArch64 by hand, we need to identify which deployment solution is both easy to automate and meaningful in the HPC community.
Engineering Specification
Although bare-metal deployments are the most common nowadays, containers are becoming ubiquitous in large-scale, heterogeneous HPC clusters. Containers are also a very easy way to deploy base images on busy validation hardware.
...
We shouldn't focus on a single one, as all of them are important, but we should prioritise them, pick the most important to do first, and leave the others until after step #2 has finished its first iteration.
Acceptance Criteria
The outcomes of this step are:
- Having validated with the SIG members that the chosen deployment is representative of their needs.
- At least one successful deployment of an OS + changes (using Ansible, SALT, etc.) using a reproducible and scalable method.
- Being able to install OpenHPC by hand on the installation above and run the same base tests that were produced in step #0.
Step #2: OpenHPC deployment (0.3 ~ 0.5)
Once we can quickly deploy CentOS/SLES (and potentially other) images on demand, and successfully install OpenHPC by hand using the instructions produced in step #0, we need to work on how to automate that deployment.
Engineering Specification
This is not as simple as having OS images on demand because some parameters should be chosen on demand, too. For example, the number of nodes, which toolchain to use, which MPI stack, etc.
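The on-demand parameters above can be thought of as a matrix of variations, each of which needs its own deployment and test run. A minimal sketch, assuming every combination is valid; the `Variation` class, the `build_matrix` helper and the option values (`gnu`, `llvm`, `openmpi`, `mpich`) are hypothetical names chosen for illustration, not a list agreed with upstream:

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variation:
    """One deployment configuration to be installed and tested."""
    nodes: int
    toolchain: str
    mpi: str


def build_matrix(node_counts, toolchains, mpi_stacks):
    """Return every combination of the on-demand parameters."""
    return [
        Variation(nodes, toolchain, mpi)
        for nodes, toolchain, mpi in product(node_counts, toolchains, mpi_stacks)
    ]


# Hypothetical option values, for illustration only.
matrix = build_matrix(
    node_counts=[1, 2],
    toolchains=["gnu", "llvm"],
    mpi_stacks=["openmpi", "mpich"],
)

for variation in matrix:
    print(f"nodes={variation.nodes} toolchain={variation.toolchain} mpi={variation.mpi}")
```

In practice some combinations may not be supported, so a real job would probably filter the matrix before scheduling deployments.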
...
In the end, having an upstream-approved process, even if it's a bit tedious, is perhaps more important than following the guidelines of other teams. But, in the absence of that, we should try to integrate our jobs as much as possible with the other infrastructure teams at Linaro, most notably Builds & Baselines, Systems and the Lab.
Acceptance Criteria
The outcome of this step is a process that:
- is accepted upstream (OpenHPC), so that it can be repeated by other members of the community, not just our members.
- is accepted by the infrastructure teams at Linaro, and has had their review and input.
- is fully automated, with a few choices (compiler, MPI, etc.), and can be deployed on top of a vanilla image produced in step #1.
- performs at least one OpenMP and one MPI test successfully on each variation.
Step #3: Workload Testing (0.6 ~ 0.8)
Once OpenHPC is being automatically deployed and tested for simple programs, we need to start looking at the packages and tests that the SIG members want us to continuously test.
Engineering Specification
For a first step, we want the workloads / libraries to be available through OpenHPC, and we're already working on getting some in there, but ultimately, we may be able to provide a git repository and a recipe (as a script or playbooks), and run that as additional testing.
...
This last step is not necessary on the first run of step #3, but we ultimately want to track benchmark numbers and be able to identify regressions, spikes and trends.
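As an illustration of what tracking benchmark numbers could look like, the sketch below compares the latest result with the median of earlier runs and estimates a trend. It assumes higher numbers are better, that a "spike" means an unexpected jump above the baseline, and that a 10% tolerance is acceptable; the function names `classify` and `trend` are hypothetical and the thresholds would need to be tuned per workload:

```python
import statistics


def classify(history, latest, tolerance=0.10):
    """Compare the latest result with the median of earlier runs.

    Assumes higher numbers are better (for example, throughput).
    Returns "regression", "spike" or "ok".
    """
    baseline = statistics.median(history)
    change = (latest - baseline) / baseline
    if change < -tolerance:
        return "regression"
    if change > tolerance:
        return "spike"
    return "ok"


def trend(history):
    """Least-squares slope of the results against the run index.

    Needs at least two results; the unit is "result units per run".
    """
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(history)
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(history))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Made-up numbers, for illustration only.
previous_runs = [100.0, 102.0, 98.0, 101.0]
print(classify(previous_runs, 85.0))
print(trend(previous_runs))
```

Using the median as the baseline keeps a single noisy run from shifting the reference point, which matters on shared validation hardware.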
Acceptance Criteria
The first iteration of this step needs to produce:
- A framework for how to create tests, separated into steps (download, install, build, test), and applicable to OpenHPC packages.
- A way to analyse the results based on the nature of the workload (statistical, exact, approximate) or to allow external scripts to validate output.
- A way to communicate pass/fail back to the process that initiated the deployment in the first place, so that we can have a nice green/red status.
- An example with at least one OpenHPC package being installed, compiled, tested and the results showing green/red on real errors.
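The criteria above could be sketched as follows. This is only one possible shape for such a framework, not an agreed design: the names `run_test`, `exact`, `approximate`, `statistical` and `exit_code` are hypothetical, each step is assumed to be a plain callable, and the "test" step is assumed to return the output that the validator checks.

```python
import math
import statistics

STEP_ORDER = ("download", "install", "build", "test")


def exact(expected):
    """Validator: the output must match the expected value exactly."""
    return lambda actual: actual == expected


def approximate(expected, rel_tol=1e-6):
    """Validator: a numeric output must be within a relative tolerance."""
    return lambda actual: math.isclose(actual, expected, rel_tol=rel_tol)


def statistical(expected_mean, max_deviation):
    """Validator: the mean of several samples must stay near the expected mean."""
    return lambda samples: abs(statistics.mean(samples) - expected_mean) <= max_deviation


def run_test(steps, validator):
    """Run the steps in order and stop at the first failure.

    `steps` maps a step name to a callable. Every step except "test"
    returns True or False; "test" returns the output given to `validator`.
    Returns a tuple (passed, failed_step), where failed_step is None on success.
    """
    for name in STEP_ORDER:
        try:
            result = steps[name]()
        except Exception:
            return False, name
        if name == "test":
            if not validator(result):
                return False, name
        elif not result:
            return False, name
    return True, None


def exit_code(passed):
    """Map pass/fail to a process exit status: 0 is green, 1 is red."""
    return 0 if passed else 1


# Stand-in steps, for illustration only; real ones would call the package tools.
steps = {
    "download": lambda: True,
    "install": lambda: True,
    "build": lambda: True,
    "test": lambda: 3.14159,
}

passed, failed_step = run_test(steps, approximate(math.pi, rel_tol=1e-5))
print("green" if passed else f"red (failed at {failed_step})")
```

The process that initiated the deployment could then read the status from the job's exit code, for example by passing `exit_code(passed)` to `sys.exit`. An external script can be plugged in as a validator as long as it takes the test output and returns True or False.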
Second Iteration
Once we can successfully deploy an OpenHPC image with some options and an HPC test reporting red/green status, we need to refine the options and improve the availability of the workloads.