2018-08-
...
31
Pak
Infiniband:
- Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
- If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
- Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
- Need to understand what we can do with our switch regarding subnet manager, and what we will have to use opensm
- Action: to work on HPC-292 trying to setup a subnet in the switch, and if not, listing the opensm setup during cluster provisioning
- Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.
Lustre:
- Usually needs at least 4 servers for redundancy (two disks, metadata), but made it work on single x86 machine, server and client working
- Client builds and installs on Arm, but fails to communicate with the server. May be card issues (ConnectX5 on Arm vs X3/4 on x86).
- Building the server on Arm has some build issues (platform not recognised), may be due to old autoconf scripts.
- Action: try different cards on the x86 side and try a newer autoconf script, update HPC-321
Renato
Replicated Pak's Amberwing setup with multi-node using MLNX_OFED drivers, works fine, but install process is cumbersome. Working to automate it.
Tried building Lustre server on an x86 VM and got some weird build errors (AVX512 on a 10y.o server), bay be auto-detect.
Baptiste said there's a way to copy the host CPU features into the VM, will try that next. If it doesn't work, try to force configure options to disable AVX512.
That work will be updated in HPC-322.
2018-08-30
Takahiro
Had to move back to help Post-K development, didn't have time to continue working on upstream reviewed patch.
...
Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).
Renato
Infiniband progress in the lab:
...