2018-10-05

OpenSM still causing issues when setting up IB on the D03s

  • Best route is to enable it on the switch
  • Subnet created with P_Key, but don't know how to add nodes to it
  • LID changes when SM changes / restarts, switch should know
  • @Pak will try to set it up
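A minimal sketch for spotting the LID change noted above, assuming the host exposes the port's `lid` and `sm_lid` as hex strings under sysfs (for example `/sys/class/infiniband/<device>/ports/<port>/`). The helper names are hypothetical, and the directory is passed in so the check can be pointed at whichever device and port the node actually has.

```python
from pathlib import Path


def read_port_lids(port_dir):
    """Return (lid, sm_lid) as integers from a sysfs-style port directory.

    Assumes both files hold a hex string such as "0x3".
    """
    port = Path(port_dir)
    lid = int((port / "lid").read_text().strip(), 16)
    sm_lid = int((port / "sm_lid").read_text().strip(), 16)
    return lid, sm_lid


def lid_changed(before, after):
    """True if either the port LID or the SM LID differs between two reads."""
    return before != after
```

Reading the pair before and after an SM restart and comparing with `lid_changed` would show whether the node was reassigned. The device name (e.g. `mlx5_0`) and port number are assumptions that depend on the installed card.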

Looking at different binaries on Mellanox drivers

  • To do with host names on RODATA
  • Can also have v8.1 instructions for newer cores
  • We have to be careful with older arches
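One way to be careful with older arches is to check at run time whether a core reports the ARMv8.1 LSE atomics before picking a binary. A rough sketch, assuming a Linux host where `/proc/cpuinfo` has a `Features` line and LSE shows up as the `atomics` flag. The function names are hypothetical, and the sample strings in the usage note are made up for illustration, not taken from our machines.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags on the first 'Features' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            return set(value.split())
    return set()


def has_lse_atomics(cpuinfo_text):
    """True if the core advertises the ARMv8.1 LSE atomics ('atomics' flag)."""
    return "atomics" in cpu_features(cpuinfo_text)
```

Usage would be along the lines of `has_lse_atomics(open("/proc/cpuinfo").read())`. This only looks at the first core listed, which is an assumption that all cores on the node are identical.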

Talking about benchmarks, noise and how to use perf to find issues

  • Thinking about hwloc support for Arm cores
  • INRIA has done some work, should upstream it
  • Added issue in benchmark_harness to use it

IPoIB tests too slow (15GB) while pure IB are fast (45GB+)

  • Second ports look open, may need to flip the cables next time in London
  • @Baptiste is finishing the Jenkins job to automate it

When we get IB jobs running and stable, we'll look at OpenMPI's MTT

  • Goal is to upload (some) results to OpenMPI's website

2018-08-31

Pak

InfiniBand:

  • Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
  • If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
  • Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
  • Need to understand what we can do with our switch regarding the subnet manager, and what we will have to use opensm for.
  • Action: to work on HPC-292, trying to set up a subnet in the switch and, if that fails, listing the opensm setup needed during cluster provisioning.
  • Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.

...