2018-11-30
Lab move
- Lab is currently down, lots of new machines added
- New internal switch for warewulf provisioning
- Nothing working yet, will work throughout next week to bring it up
...
- Pak is working on PAPI support and wants to upstream it, but needs to find out how
- Would be good to have that for other arches too, so we can enable OpenHPC packages
- Testing in the Linaro lab would help make sure of that
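For reference, a minimal sketch (ours, not Pak's patch) of the kind of counter read that working PAPI support enables on these machines; it uses the standard PAPI low-level API and assumes the preset events map onto the core's PMU:

```c
/* Minimal PAPI sketch: read cycle and instruction counters around a dummy loop.
 * Assumes a PAPI build whose preset events map onto the Arm core's PMU. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long vals[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_create_eventset(&evset) != PAPI_OK ||
        PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK ||
        PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK)
        return 1;

    PAPI_start(evset);
    volatile double sum = 0.0;
    for (int i = 0; i < 1000000; i++)   /* dummy workload */
        sum += i * 0.5;
    PAPI_stop(evset, vals);

    printf("cycles=%lld instructions=%lld\n", vals[0], vals[1]);
    return 0;
}
```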
2018-11-22
OpenBLAS
- Cleanup of the Arm builds, simplifying the generic ARMV8 target vs core-specific targets and adding support for more cores
- Performance improvements across the board, and a guarantee that the ARMV8 target only holds ARMv8.0 code (not potentially v8.1 as before)
- Tested on Synquacer (A53), D03 (A57), D05 (A72), ThunderX, Amberwing (Falkor), ThunderX2, Moonshot (XGene)
- Pull request: https://github.com/xianyi/OpenBLAS/pull/1876
- Would be good for Fujitsu to test that code on Post-K
- ThunderX2 builds might actually be good for Post-K (larger caches)
- Will need to add -march=armv8.2-a+sve (in Makefile.arm64) to see SVE code coming out (see the sketch below)
- We can later add a Post-K mode when cpuinfo/cache/TLB details are public
...
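A quick way to sanity-check the flag once it is in Makefile.arm64 (a sketch under assumed GCC behaviour, not OpenBLAS code): compile a trivial kernel with the option and look for SVE instructions in the assembly.

```c
/* Sketch: a trivial daxpy-style loop to confirm the toolchain emits SVE code once
 * -march=armv8.2-a+sve is added. Build e.g. with:
 *   gcc -O3 -march=armv8.2-a+sve -S sve_check.c
 * and look for z-register / predicated instructions (whilelo, ld1d, fmla z...) in sve_check.s. */
void daxpy(long n, double a, const double *x, double *y)
{
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}
```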
2018-11-08
OpenHPC Mellanox Ansible task
...
2018-10-26
IB performance issues
- Software issues are being resolved (HPC-341); we need to push the fixes upstream
- Need to test on D03, D05, QDF, etc., to make sure it's not TX2-specific
- @Renato: check who can upstream the Mellanox patch (Ilias?)
- Hardware timing issues will take time to resolve, and there is little we can do about them ourselves
- We can identify them (by running on different hardware and investigating)
- And report back to the vendors, if they haven't seen them yet
- Intel writes directly to cache (bypassing memory)
- Can we do that too? It would speed things up considerably
- We're adding an IB performance job to Jenkins
...
- We're only running dual-node for now; could add single node (loopback, shared memory), see the ping-pong sketch below
- Could also add UCX perf tests to the same job
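For illustration, a rough MPI ping-pong sketch of the kind of measurement the job wraps (not the actual Jenkins job): run with two ranks across nodes over IB, or on one node for the loopback/shared-memory baseline.

```c
/* Minimal MPI ping-pong latency sketch. Run with two ranks, e.g.:
 *   mpirun -np 2 ./pingpong          (single node / shared memory)
 *   mpirun -np 2 -H node1,node2 ./pingpong   (dual node over IB)   */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 1000;
    char buf[8] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)  /* average one-way latency in microseconds */
        printf("avg one-way latency: %.2f us\n", (t1 - t0) / iters / 2 * 1e6);
    MPI_Finalize();
    return 0;
}
```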
2018-10-25
InfiniBand installation on OpenHPC is being tracked in HPC-351
...
2018-10-19
Contacting Mellanox for upstreaming OpenSM virtualisation
...
- In our lab and the main lab, so not an issue with our setup
- Haven't tested on others, will try next week
2018-10-12
IB perf automation is going strong; just finished the Ansible module that parses results into JUnit XML for Jenkins
...
- Will upstream to LLVM and GNU, so Arm can pick it up and release it in their compiler
- Working with Mellanox on ARMv8.1 atomics in the MPI libraries (see the sketch below)
- ISVs finally seem to be joining the bandwagon, doing local tests and validation
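For context on the atomics item, a small sketch (not the Mellanox/MPI patches themselves) of why the baseline architecture matters: built with -march=armv8.1-a, GCC lowers these builtins to single LSE instructions instead of ARMv8.0 exclusive-load/store retry loops.

```c
/* Sketch of ARMv8.1 LSE atomics via compiler builtins.
 * With -march=armv8.0-a these compile to ldxr/stxr loops;
 * with -march=armv8.1-a they become single ldaddal/casal instructions. */
#include <stdint.h>

uint64_t fetch_add64(uint64_t *p, uint64_t v)
{
    return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

int cas64(uint64_t *p, uint64_t expected, uint64_t desired)
{
    return __atomic_compare_exchange_n(p, &expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```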
2018-10-11
Work on InfiniBand Ansible automation in the OpenHPC recipes is starting now
...
Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs, Mr-Provisioner's client and overall lab stability tasks.
Slides
2018-10-05
OpenSM still causing issues when setting up IB on the D03s
...
- Goal is to upload (some) results to OpenMPI's website
2018-08-31
Pak
Infiniband:
- Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
- If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
- Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
- Need to understand what we can do with our switch regarding the subnet manager, and whether we will have to use opensm
- Action: to work on HPC-292, trying to set up a subnet manager in the switch, and if that's not possible, documenting the opensm setup during cluster provisioning
- Requested upstreaming of the features needed for our clusters to work: Socket Direct and Multi-Host (see 2018-08-30); no response yet.
...
- Usually needs at least 4 servers for redundancy (two disks, metadata), but it was made to work on a single x86 machine, with server and client working
- The client builds and installs on Arm, but fails to communicate with the server; may be a card issue (ConnectX-5 on Arm vs ConnectX-3/4 on x86)
- Building the server on Arm hits some build issues (platform not recognised), possibly due to old autoconf scripts
- Action: try different cards on the x86 side and try a newer autoconf script, update HPC-321
Renato
Replicated Pak's Amberwing setup with multiple nodes using the MLNX_OFED drivers; it works fine, but the install process is cumbersome. Working to automate it.
...
That work will be updated in HPC-322.
2018-08-30
Takahiro
Had to move back to helping Post-K development, so didn't have time to continue working on the patch under upstream review.
The current patch doesn't help the other loops under investigation; those will need additional work later.
Takeharu
Having trouble with the InfiniBand setup, which has delayed adding support for IB configuration in the Ansible recipes.
...
We may need special handling in the Ansible recipes to choose which ones to install, or to leave that aside (i.e. not overwrite existing drivers).
Masaki
Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).
Renato
InfiniBand progress in the lab:
...
- The Whamcloud client builds on Arm (both Huawei and Qualcomm machines) and the packages install successfully
- The server needs kernel drivers that were removed from staging, so we will start with an Intel server
- We don't have a spare x86_64 server, so we'll probably create a new VM on our admin server (really bad performance)
Slides