2018-11-30

Lab move

  • The lab is currently down; lots of new machines are being added
  • New internal switch for Warewulf provisioning
  • Nothing is working yet; we will work throughout next week to bring it up

...

  • Pak is working on PAPI support; he wants to upstream it but needs to find out how
  • It would be good to have that for other arches, so we can enable the OpenHPC packages
  • Testing in the Linaro lab would help make sure of that

2018-11-22

OpenBLAS

  • Cleanup of the Arm builds, simplifying the generic ARMV8 target vs the core-specific ones and adding support for more cores
  • Performance improvements across the board, and the generic ARMV8 target is now guaranteed to hold only ARMv8.0 code (not potentially v8.1 as before)
  • Tested on Synquacer (A53), D03 (A57), D05 (A72), ThunderX, Amberwing (Falkor), ThunderX2, Moonshot (XGene)
  • Pull request: https://github.com/xianyi/OpenBLAS/pull/1876
  • Would be good for Fujitsu to test that code on Post-K
    • ThunderX2 builds might actually be a good fit for Post-K (larger caches)
    • Will need to add -march=armv8.2-a+sve (in Makefile.arm64) to see SVE code coming out; see the sketch after this list
    • We can later add a proper Post-K target once the cpuinfo/cache/TLB details are public
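A minimal sketch of what those builds could look like, assuming a GCC toolchain and OpenBLAS's usual TARGET names; the exact placement of the SVE flags (command line vs. editing Makefile.arm64, as suggested above) is an assumption:

    # Untested sketch of the builds discussed above.
    make TARGET=ARMV8                                  # generic target, ARMv8.0-only code
    make TARGET=THUNDERX2T99                           # tuned build, suggested starting point for Post-K
    make TARGET=ARMV8 CFLAGS="-march=armv8.2-a+sve"    # ask GCC to emit SVE code where it can
    # If the build overrides CFLAGS internally, add the flag to Makefile.arm64
    # instead, as noted in the bullet above.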

...

Attachments: report20181122.pdf, 20181122-tkato.pptx

2018-11-08

OpenHPC Mellanox Ansible task

...

Attachments: report20181108.pdf, OpenHPC Ansible Merge.pdf, 20181108-tkato.pptx

2018-10-26

IB performance issues

  • Software issues are being resolved (HPC-341), we need to push them upstream
    • Need to test on D03, D05, QDF, etc, to make sure it's not TX2 specific
    • @Renato: check who can upstream the Mellanox patch (Ilias?)
  • Hardware timing issues will need time to be resolved, and there is nothing we can do about them ourselves
    • We can identify them (by running on different hardware, investigating)
    • And report back to the vendors, if they haven't seen it yet
  • Intel writes directly to cache (bypassing memory)
    • Can we do that too? It would speed things up considerably
  • We're adding an IB performance job to Jenkins
    • We can use that to test changes in OFED drivers (Mellanox or Inbox)
    • OpenUCX performance tests can be done on a single-node system
    • OpenMPI seems to perform better on shared memory than UCX

...

  • We're only running dual-node for now; we could add single-node runs (loopback, shared memory)
  • Could also add UCX perf tests to the same job (see the sketch below)
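A minimal sketch of what those single-node and dual-node runs could look like, assuming OpenMPI built with UCX support and the OSU micro-benchmarks; host names and paths are placeholders:

    # Single node: compare the UCX PML against the shared-memory (vader) BTL.
    mpirun -np 2 --mca pml ucx ./osu_latency
    mpirun -np 2 --mca pml ob1 --mca btl vader,self ./osu_latency

    # Dual node over InfiniBand, forcing UCX:
    mpirun -np 2 -H node1,node2 --mca pml ucx ./osu_bw

    # Raw UCX point-to-point, no MPI involved:
    ucx_perftest -t tag_lat           # server side, on node1
    ucx_perftest node1 -t tag_lat     # client side, on node2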

2018-10-25

InfiniBand installation on OpenHPC is being tracked in HPC-351

...

Attachments: report20181025.pdf, 20181025-tkato.pptx

2018-10-19

Contacting Mellanox for upstreaming OpenSM virtualisation

...

  • In our lab and the main lab, so not an issue with our setup
  • Haven't tested on others, will try next week

2018-10-12

IB perf automation is going strong; the Ansible module that parses results into JUnit XML for Jenkins is now finished (see the sketch below).
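A minimal sketch of that kind of conversion, assuming osu_bw-style output; the message size, threshold and test names are hypothetical, not what the module actually uses:

    # Take the bandwidth at the 4 MiB message size from an osu_bw run and emit a
    # single JUnit test case that Jenkins can chart and gate on (no error handling).
    BW=$(awk '$1 == 4194304 {print $2}' osu_bw.out)
    THRESHOLD=10000   # MB/s, hypothetical pass/fail limit
    {
      echo '<testsuite name="ib-perf" tests="1">'
      echo '  <testcase classname="osu_bw" name="bw_4MiB" time="0">'
      [ "${BW%.*}" -lt "$THRESHOLD" ] && \
        echo '    <failure message="bandwidth below threshold"/>'
      echo '  </testcase>'
      echo '</testsuite>'
    } > ib-perf.xml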

...

  • Will upstream to LLVM and GNU, so Arm can pick it up and release it in their compiler
  • Working with Mellanox on ARMv8.1 atomics in the MPI libraries (see the sketch after this list)
  • ISVs finally seem to be joining the bandwagon, doing local tests and validation
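A minimal sketch of how to check whether a build actually picks up the ARMv8.1 LSE atomics rather than the exclusive-load/store loops; the test function is made up for illustration:

    # Compile a trivial atomic increment with and without ARMv8.1 and look at
    # which instruction sequence GCC emits (run on an aarch64 machine).
    echo 'int f(int *p) { return __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST); }' > atomics.c
    gcc -O2 -march=armv8.1-a -S -o - atomics.c | grep -i ldadd    # LSE: single ldaddal
    gcc -O2 -march=armv8-a   -S -o - atomics.c | grep -i ldaxr    # v8.0: ldaxr/stlxr retry loop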

2018-10-11

Work on InfiniBand Ansible automation in the OpenHPC recipes is starting now.

...

Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs and Mr-Provisioner's client, and of improving overall lab stability.

Slides

Attachments: report20181011.pdf, 20181011-tkato.pptx

2018-10-05

OpenSM still causing issues when setting up IB on the D03s

...

  • Goal is to upload (some) results to OpenMPI's website

2018-08-31

Pak

Infiniband:

  • Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
  • If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
  • Action: update HPC-294 with the tests and expected results, to make sure IB is working and of good enough quality.
  • Need to understand what we can do with our switch regarding the subnet manager, and whether we will have to use opensm
  • Action: work on HPC-292, trying to set up a subnet manager on the switch, and if that is not possible, document the opensm setup needed during cluster provisioning (see the sketch after this list)
  • Requested upstreaming of the features needed for our clusters to work: Socket Direct and multi-host (see 2018-08-30); no response yet.
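A minimal sketch of the fallback opensm setup on a CentOS/OpenHPC node, assuming the inbox (or MLNX_OFED) InfiniBand stack; package names are the CentOS ones:

    # Run the subnet manager on one node when the switch cannot provide it,
    # then confirm the fabric comes up.
    yum install -y opensm infiniband-diags
    systemctl enable --now opensm
    sminfo     # should report an active subnet manager on the fabric
    ibstat     # port state should move from INIT to ACTIVE once the SM is running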

...

  • Usually needs at least 4 servers for redundancy (two disks, metadata), but we made it work on a single x86 machine, with both server and client working
  • The client builds and installs on Arm, but fails to communicate with the server; this may be a card mismatch (ConnectX-5 on Arm vs ConnectX-3/4 on x86)
  • Building the server on Arm hits some build issues (platform not recognised), which may be due to old autoconf scripts
  • Action: try different cards on the x86 side, try regenerating with newer autotools (see the sketch after this list), and update HPC-321
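A minimal sketch of that regeneration step, assuming an autotools-based server tree; the directory name is hypothetical:

    # Regenerate the configure scripts with the host's (newer) autotools so that
    # aarch64 is recognised, instead of relying on the shipped configure script.
    cd server-source/        # hypothetical checkout of the server tree
    autoreconf -fi           # or the project's own autogen.sh, if it provides one
    ./configure && make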

Renato

Replicated Pak's multi-node Amberwing setup using the MLNX_OFED drivers; it works fine, but the install process is cumbersome. Working to automate it.

...

That work will be updated in HPC-322.

2018-08-30

Takahiro

Had to move back to helping with Post-K development, so didn't have time to continue working on the patch under upstream review.

The current patch doesn't help the other loops under investigation; those will need additional work later.

Takeharu

Having trouble with Infiniband setup, which has delayed adding support for IB configuration in the Ansible recipes.

...

We may need special handling in the Ansible recipes to choose which drivers to install, or to leave that choice alone (i.e. not overwrite existing drivers); see the sketch below.
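A minimal sketch of the check such a recipe could wrap, assuming MLNX_OFED exposes ofed_info and that the inbox stack comes from the distro packages:

    # Leave an existing vendor stack alone; only install the inbox drivers when
    # nothing is there yet.
    if command -v ofed_info >/dev/null 2>&1; then
        echo "MLNX_OFED $(ofed_info -s) already installed, not touching the drivers"
    else
        yum install -y rdma-core libibverbs-utils infiniband-diags
    fi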

Masaki

Progress on the LLVM and HCQC work was reported in his YVR18 slides. He will share the source so that we can merge it with the other compiler work (Takahiro, Renato, TCWG?).

Renato

Infiniband progress in the lab:

...

  • The Lustre client from Whamcloud builds on Arm (both the Huawei and Qualcomm machines) and the packages install successfully
  • The server needs kernel drivers that were removed from staging, so we will start with an Intel (x86_64) server
  • We don't have a spare x86_64 server, so we'll probably create a new VM on our admin server (with really poor performance)

Slides

Attachments: 20180730-tkato.pptx, report20180830.pdf