2018-10-05
OpenSM still causing issues when setting up IB on the D03s
- Best route is to enable it on the switch
- Subnet created with P_Key, but don't know how to add nodes to it
- LIDs change when the SM changes or restarts; the switch should know
- @Pak will try to set it up
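A minimal sketch for checking whether LIDs moved after an SM restart, assuming the nodes expose the usual Linux sysfs layout (`/sys/class/infiniband/<device>/ports/<port>/lid` and `sm_lid`). The function names are hypothetical and not part of any existing tooling here; take a snapshot before and after the restart and compare.

```python
from pathlib import Path


def read_port_lids(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): (lid, sm_lid)} for each IB port under sysfs_root.

    Assumes the sysfs files hold hex values such as "0x3".
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # no IB stack loaded, or not a Linux host
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            try:
                lid = int((port / "lid").read_text().strip(), 16)
                sm_lid = int((port / "sm_lid").read_text().strip(), 16)
            except (OSError, ValueError):
                continue  # skip ports we cannot read or parse
            result[(dev.name, port.name)] = (lid, sm_lid)
    return result


def changed_lids(before, after):
    """Ports whose (lid, sm_lid) pair differs between two snapshots."""
    return {
        key: (before.get(key), after[key])
        for key in after
        if before.get(key) != after[key]
    }
```

Usage: call `read_port_lids()` on a node, restart the SM, call it again, and pass both snapshots to `changed_lids()`; an empty result means the assignment survived the restart.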
Looking at different binaries on Mellanox drivers
- To do with host names on RODATA
- Can also have v8.1 instructions for newer cores
- We have to be careful with older arches
Talking about benchmarks, noise and how to use perf to find issues
- Thinking about hwloc support for Arm cores
- INRIA has done some work, should upstream it
- Added issue in benchmark_harness to use it
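One way to put a number on benchmark noise before reaching for perf is the coefficient of variation across repeated runs. This is an illustrative sketch only: the function names and the 2% threshold are assumptions, not something benchmark_harness is known to implement.

```python
import statistics


def noise_summary(samples):
    """Mean, sample standard deviation and coefficient of variation of run times."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # needs at least two samples
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def is_noisy(samples, max_cv=0.02):
    """Flag a benchmark whose run-to-run variation exceeds max_cv (assumed 2%)."""
    return noise_summary(samples)["cv"] > max_cv
```

Runs flagged as noisy are the ones worth profiling with perf or re-running with threads pinned to cores.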
IPoIB tests too slow (15GB) while pure IB are fast (45GB+)
- Second ports look open, may need to flip the cables next time in London
- @Baptiste is finishing the Jenkins job to automate it
When we get IB jobs running and stable, we'll look at OpenMPI's MTT
- Goal is to upload (some) results to OpenMPI's website
2018-08-31
Pak
InfiniBand:
- Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
- If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
- Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
- Need to understand what we can do with our switch regarding the subnet manager, and what we will have to use opensm for
- Action: to work on HPC-292, trying to set up a subnet in the switch, and if not, listing the opensm setup during cluster provisioning
- Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.
...