Work on InfiniBand Ansible automation in the OpenHPC recipes starting now
Ansible work on Lustre will have to wait until Whamcloud publishes the new version
Ansible OpenHPC development will move to Linaro
Upstreaming those Ansible playbooks will have to wait until both Linaro and Fujitsu are running the same set of changes (minus local ones); only then can we start welcoming entirely new playbooks from other labs into our repos.
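As a starting point for the InfiniBand work in the OpenHPC recipes, a minimal playbook sketch. The package and service names below are assumptions for a CentOS-based node, not the recipe's final form:

```yaml
# Hypothetical fragment: install the in-distro RDMA stack and run the
# subnet manager on a single designated node. Group/host names are made up.
- name: Configure InfiniBand on compute nodes
  hosts: compute
  become: true
  tasks:
    - name: Install the distro RDMA stack
      package:
        name:
          - rdma-core
          - libibverbs-utils
          - infiniband-diags
        state: present

    - name: Run the subnet manager on one node only
      service:
        name: opensm
        state: started
        enabled: true
      when: inventory_hostname == groups['compute'][0]
```

Running opensm on a single node avoids the multiple-subnet-manager conflicts we have been chasing on the D03s.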
Due to long-term illness, Fujitsu is replacing Takahiro Miyoshi with Masakazu Ueno, who will continue the work in HPC-212 (LLVM TSVC improvements). The last of Takahiro's patches was committed this week.
@Renato to update the s278 task with info on how to gather the data needed to start debugging, to give Masakazu a head start in the LLVM world.
Both Renato and Masaki spent a good amount of time building their LLVM environments using Linaro's git repositories and scripts, and hope to resume coding in the coming weeks. :)
The benchmark harness is now in Alpha version, so we're encouraging everyone to start using it locally and report bugs, propose improvements, send patches for new benchmarks, etc.
Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs and Mr-Provisioner's client, and handling overall lab stability tasks.
OpenSM still causing issues when setting up IB on the D03s
Looking at different binaries of the Mellanox drivers
Talking about benchmarks, noise and how to use perf to find issues
IPoIB tests are too slow (~15 Gb/s) while native IB tests are fast (45 Gb/s+)
When we get IB jobs running and stable, we'll look at OpenMPI's MTT
InfiniBand:
Lustre:
Replicated Pak's Amberwing setup with multi-node using the MLNX_OFED drivers; it works fine, but the install process is cumbersome. Working to automate it.
Tried building the Lustre server on an x86 VM and got some odd build errors (AVX512 instructions on a 10-year-old server), possibly caused by CPU auto-detection.
Baptiste said there's a way to copy the host CPU features into the VM; will try that next. If it doesn't work, the fallback is forcing configure options to disable AVX512.
That work will be updated in HPC-322.
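One way to copy host CPU features into the guest is libvirt's host-model CPU mode. A hypothetical task sketch; the VM name, sizing, and disk path are made up for illustration:

```yaml
# Assumption: the build VM is managed by libvirt and can be recreated
# with virt-install. --cpu host-model mirrors the host CPU's feature
# flags into the guest, so autodetection during the Lustre build sees
# the real hardware.
- name: Recreate the Lustre build VM with the host CPU model
  command: >
    virt-install --name lustre-build --import
    --memory 8192 --vcpus 8
    --disk /var/lib/libvirt/images/lustre-build.qcow2
    --cpu host-model --noautoconsole
```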
Had to move back to helping Post-K development, so didn't have time to continue work on the patch under upstream review.
The current patch doesn't help the other loops under investigation; those will need additional work later.
Having trouble with the InfiniBand setup, which has delayed adding support for IB configuration in the Ansible recipes.
Not getting full speed on the Mellanox fabric. It may help to use an auxiliary card on PCIe lanes managed by the second CPU, but that needs Socket Direct support (only available in the closed-source drivers).
Would prefer to upstream the Ansible recipes into another repository (Linaro, OpenHPC) instead of having his own repository be the upstream.
Post-K uses a custom Lustre client/server, so they don't have the same problems we do with the server's kernel modules.
Fujitsu will use the commercial version of the Mellanox drivers, but also wants the freedom to use the open-source ones.
We may need special handling in the Ansible recipes to choose which ones to install, or to leave that choice aside (i.e. not overwrite existing drivers).
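The "don't overwrite existing drivers" behaviour could be sketched along these lines. This is an assumption, not the agreed design; the check relies on `ofed_info`, which the MLNX_OFED packages ship:

```yaml
# Hypothetical tasks: detect a vendor (MLNX_OFED) install and only fall
# back to the distro RDMA stack when no vendor driver is present.
- name: Check for an existing MLNX_OFED install
  stat:
    path: /usr/bin/ofed_info
  register: mlnx_ofed

- name: Install the distro RDMA stack only when no vendor driver exists
  package:
    name: rdma-core
    state: present
  when: not mlnx_ofed.stat.exists
```

This keeps the recipes neutral: sites that bought the commercial drivers keep them, everyone else gets the open-source stack.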
Progress on the LLVM and HCQC work was reported in his YVR18 slides. Will share the source, so that we can merge it with other compiler work (Takahiro, Renato, TCWG?).
InfiniBand progress in the lab:
Following up with Mellanox to upstream required features:
Testing Lustre: