Weekly Sync Minutes

2018-12-06

The meeting on January 3rd 2019 has been pushed back to January 10th by vote; calendar update pending.

Lab Move

  •       Lab is currently up, although provisioning of the clusters is not functional (BMC errors, and firmware issues keeping the machines from PXE booting)
  •       Rewiring and firmware installation are needed and will be done in the next few days.
  •       Added a cluster of x86_64 machines, powered by Intel Xeon E5-2450L "Sandy Bridge-EN" CPUs.

OpenHPC

  •     Issues with InfiniBand support in the automation on both Fujitsu's and Linaro's sides; we are collaborating to fix them quickly.
  •     Full tests on Linaro's side are pending provisioning service availability; in the meantime IB is functional.

LLVM

  •     JumpThreading: requires more patches, which are in the process of being upstreamed
  •    MachineModuloSched: optimization potentially not upstreamable, more discussions needed.
  •    Greedy Register Allocator: '619.lbm_s' benchmark from SPEC CPU 2017 is a very good fit for benchmarking this feature.
  •    Loop Vectorization: HPC-212 is resolved! Work on HPC-213 can now resume.

2018-11-30

Lab move

  • Lab is currently down, lots of new machines added
  • New internal switch for warewulf provisioning
  • Nothing working yet, will work throughout next week to bring it up

OpenBLAS

  • Patch merged, new arches added, correctly identifying them
  • Ad-hoc builder in https://openblas.ddns.net/
  • We'll have to do something similar for FFTW, too!

D06

  • Pak working on PAPI support, wants to upstream it but needs to find out how
  • Would be good to have that for other arches, so we can enable OpenHPC packages
  • Testing in the Linaro lab would help making sure of that

2018-11-22

OpenBLAS

  • Cleanup of the Arm builds, simplifying ARMv8 vs cores and adding support for more cores
  • Performance improvement across the board and guaranteeing ARMv8 only holds ARMv8.0 code (not potentially v8.1 as before)
  • Tested on Synquacer (A53), D03 (A57), D05 (A72), ThunderX, Amberwing (Falkor), ThunderX2, Moonshot (XGene)
  • Pull request: https://github.com/xianyi/OpenBLAS/pull/1876
  • Would be good for Fujitsu to test that code on Post-K
    • ThunderX2 builds might actually be good for Post-K (larger caches)
    • Will need to add march=armv8.2+sve (in Makefile.arm64) to see SVE code coming out
    • We can later add Post-K mode when cpuinfo/cache/TLB details are public

OpenHPC

  • Mellanox code rebooting nodes on non-SMS machines
  • Will send pull request to master later
  • Working on Baptiste's code in a new branch inside Fujitsu
  • @Baptiste to add a branch with all the patches for Fujitsu
  • Testing IB changes (MOFED) in HPC Lab, working so far

LLVM

  • Completed moving work into Linaro's git
  • New branch for regalloc, initial support for control flow, but not split&spill
  • Found JumpThreading bugs, fixed
  • Created random testing for branch elimination, will run next week
  • Some new basic blocks added, need to check the vectoriser still recognises all patterns

Infrastructure tasks

  • Tried to PXE boot ThunderX2, changing parameters in BIOS, provisioner, will try different ports next week
  • Tried to upgrade Amberwing's firmware, but getting unknown failures, in contact with Qualcomm

2018-11-08

OpenHPC Mellanox Ansible task

  • Fujitsu confirms their MOFED task works in their cluster (HPC-351)
    • Works on both SMS and Warewulf VNFS
    • Not confirmed if Ansible runs on compute nodes (the way Linaro does)
      • Linaro will test locally
  • Restart during installation will split installation in two
    • This is required because further tasks later (Lustre, IPoIB) will require IB interfaces up
    • Our IPoIB step should run after restart (or not, if using OSS drivers)

LLVM changes almost done, using Linaro's git now

  • Bug found in branch elimination, fixing
  • Pipeliner discussed with Arm, will update the list with a new RFC
  • Regression testing for s3111, should be ready in a few days
  • s278 forcing vectoriser works, need to work on legality

Ansible repository merge

  • Needs a cleanup on the big patches, then we can start proposing merges
  • We'll have to test on both sides and only merge when it's green everywhere
  • The order doesn't matter much, but getting the clean up first would help either way

OpenHPC v1.3.6 has GCC 8

  • SVE QEMU user emulation available upstream
  • Fujitsu SVE hardware can now be tested with OpenHPC
  • Linaro still has to test the new release and move to it by default

OpenBLAS

  • Improve ARMv8 base support, would be good for undetected/internal/experimental cores
  • Need to also improve libm (Arm is doing it) & string functions (@renato: ask again about Cortex Strings)
  • Make sure they're not on by default, as specialised kernels can't use FP registers (-mfpu=none could help?)

2018-10-26

IB performance issues

  • Software issues are being resolved (HPC-341), we need to push them upstream
    • Need to test on D03, D05, QDF, etc, to make sure it's not TX2 specific
    • @Renato: check who can upstream the Mellanox patch (Ilias?)
  • Hardware timing issues will need time to be resolved and we can't do anything
    • We can identify them (by running on different hardware, investigating)
    • And report back to the vendors, if they haven't seen it yet
  • Intel writes directly to cache (bypasses memory)
    • Can we do that, too? This would speed up considerably
  • We're adding an IB performance job to Jenkins
    • We can use that to test changes in OFED drivers (Mellanox or Inbox)
    • OpenUCX performance tests can be done on a single-node system
    • OpenMPI seems to perform better on shared memory than UCX

Adding IB test job to Jenkins

  • We're only running dual-node for now, could add single node (loopback, shared mem)
  • Could also add UCX perf tests to the same job

2018-10-25

Infiniband installation on OpenHPC tracking on HPC-351

  • Code mostly finished, will test next week
  • Will submit a pull request once finished

We need to move the repository to Linaro; Fujitsu forks it, and we all send pull requests to it

  • Pull requests will be merged on master, but we still keep the production branch for our lab
  • Fujitsu should slowly review our own changes, so that we can merge them to master, too
  • We can still keep separate branches for each lab, so that we can slowly review each other's patches

LLVM development on track

  • GVN and pipeliners done, no tests yet but rebased to trunk
  • regalloc still working, still on LLVM 6
  • There was a round table at LLVM dev meeting about pipeliner
  • We sent our work to them, but have not yet got feedback on the discussions
  • GitHub's monorepo move should not affect those changes, we'll move when they're done

Finishing the cleanup of the lab

  • Mr-Provisioner client is upstream
  • Removing the old Ansible client from our lab (almost finished)
  • Adding an Infiniband automated test in Jenkins

Working on ERP for next release

  • Tested on our machines, kernel 4.18, working well
  • Some backports coming, we'll test again

Will look into OpenHPC test suite next week

  • Fix long/short run issues
  • understand why tests all run in 0s
  • add missing tests

Looked at OpenBLAS performance

Arm claims improvements on their Fortran support (commercial compiler)

  • None of this is upstream, so we have no idea what's going on
  • Fujitsu has to use their own old compiler or gfortran
  • Our work is independent of language
  • F18 is a new Fortran compiler for LLVM written in C++17 (sounds promising)

2018-10-19

Contacting Mellanox for upstreaming OpenSM virtualisation

  • This is a known issue and present in RedHat's release notes
  • Not very high priority for Mellanox, but we have to keep pushing

Got CentOS VM on AArch64 and trying to get IB interface through

  • Not able to get the driver through, may need changes to configuration
  • May be worth trying PCI passthrough: nodedev list

OpenBLAS hack to enable A57 instead of ARMv8

  • This is not a good solution, but it's better than the current way
  • We need to work with OpenBLAS anyway, so if no one wants the hack, we ignore it for now
  • If members want, Linaro can hold a temporary overlay on OBS
  • Huawei tested OpenBLAS last year and it was good enough, built by hand, thus A57 on D05
  • Huawei's new chip is custom and not A57, so the build could be worse than on D05, as it would fall back to ARMv8

Mr-Provisioner Client done

  • Moving the provisioning jobs in Jenkins to use the client directly
  • Will develop Ansible bindings later, simplify our setup and ERP's repositories
  • With documents and everything

ERP CentOS failed on D03

  • In our lab and the main lab, so not an issue with our setup
  • Haven't tested on others, will try next week

2018-10-12

IB perf automation going strong, just finished the Ansible module to parse results into JUnit XML for Jenkins

  • Jenkins' report is a bit terse, trying to work with JSON too, for Squad
  • May reuse the same logic for OpenHPC test-suite
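
As a rough illustration of the result-parsing step, the sketch below turns a few benchmark measurements into JUnit XML that Jenkins can ingest. The input shape, the test names, the minimum thresholds and the function name `to_junit_xml` are all assumptions made for this example; the actual Ansible module is not shown in these minutes.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name, results):
    """Build a JUnit XML string from (name, value, minimum) tuples.

    A result counts as a failure when its measured value falls below
    the (hypothetical) minimum expected for that test.
    """
    failures = sum(1 for _, value, minimum in results if value < minimum)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, value, minimum in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=name)
        if value < minimum:
            failure = ET.SubElement(
                case, "failure", message=f"{value} is below minimum {minimum}"
            )
            failure.text = f"measured={value} minimum={minimum}"
        else:
            # Keep the measurement visible in the report for passing tests.
            ET.SubElement(case, "system-out").text = f"measured={value}"
    return ET.tostring(suite, encoding="unicode")


if __name__ == "__main__":
    # Illustrative numbers only, not real lab measurements.
    sample = [("bandwidth_test", 45.0, 40.0), ("ipoib_test", 15.0, 40.0)]
    print(to_junit_xml("ib_perf", sample))
```

The same structure could plausibly be fed from a JSON document instead of tuples if the results also need to go to Squad.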

Continuing with infrastructure refactoring.

  • Benchmark jobs merge pushed, tested and in production
  • Other jobs need provisioner client to be fully working
  • Kea is now available in the ERP OBS
  • This helps us move the lab infrastructure from x86 to Synquacer!

Trying to create VMs on D05 - CentOS, not being very successful

  • Machines boot into EL2, virt-builder works mostly
  • but virt-install doesn't, which is weird, since it works on synquacers

Huawei working on upstream compiler (gcc, llvm) support

  • Will upstream to llvm and gnu, so Arm can pick up and release on their compiler
  • Working with Mellanox on v8.1 atomics on MPI libraries
  • ISVs seem to be finally joining the bandwagon, doing local tests and validation

2018-10-11

Work on InfiniBand Ansible automation in the OpenHPC recipes starting now

  • enable_mellanox_ib: using MOFED drivers (download ISO, build, install, reboot)
  • enable_linux_ib: using INBOX OFED driver, just package install
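
The two switches above can be read as selecting one of two install paths. The sketch below only models that selection in Python, under the assumption that the switches are mutually exclusive; the function name `ib_install_steps` is hypothetical, and the step names are taken from the notes above rather than from the real playbooks.

```python
def ib_install_steps(enable_mellanox_ib=False, enable_linux_ib=False):
    """Return the ordered install steps for the selected InfiniBand stack."""
    if enable_mellanox_ib and enable_linux_ib:
        # Assumption: only one driver stack is installed at a time.
        raise ValueError("choose either MOFED or the inbox OFED driver, not both")
    if enable_mellanox_ib:
        # MOFED path: the reboot comes last, so later tasks see the IB interfaces.
        return ["download ISO", "build", "install", "reboot"]
    if enable_linux_ib:
        # Inbox OFED path: a plain package install, no reboot step listed.
        return ["package install"]
    # Neither switch set: no InfiniBand steps are scheduled.
    return []


if __name__ == "__main__":
    print(ib_install_steps(enable_mellanox_ib=True))
    print(ib_install_steps(enable_linux_ib=True))
```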

Ansible work on Lustre will have to wait until Whamcloud publishes the new version

  • They have promised full Arm support by then...

Ansible OpenHPC development will change to Linaro

  • Linaro's repo will be the upstream
  • We're all going to have local forks and work there, ultimately pushing to Linaro's as branches
  • Linaro's Lab will use the branch production, Fujitsu will create branches for them
  • We need some effort to make sure the two don't diverge too much (by testing each other patches and merging to master frequently, then rebasing)

Upstreaming those Ansible playbooks will have to wait until both Linaro and Fujitsu are using the same set of changes (minus local ones), so that we can start welcoming other labs' entirely new playbooks into our repos.

Due to long-term sickness, Fujitsu is replacing Takahiro Miyoshi with Masakazu Ueno, who will continue the work in HPC-212 (LLVM TSVC improvements). The last of Takahiro's patches was committed this week.

@Renato to update the s278 task with some info on how to get the information necessary to start debugging, to give Masakazu a head start in the LLVM world.

Both Renato and Masaki spent a good amount of time building their LLVM environments, and using Linaro's git and scripts, and hope to resume coding in the following weeks. :)

The benchmark harness is now in Alpha version, so we're encouraging everyone to start using it locally and report bugs, propose improvements, send patches for new benchmarks, etc.

Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs, Mr-Provisioner's client and overall lab stability tasks.

Slides

2018-10-05

OpenSM still causing issues when setting up IB on the D03s

  • Best route is to enable it on the switch
  • Subnet created with P_Key, but don't know how to add nodes to it
  • LID changes when SM changes / restarts, switch should know
  • @Pak will try to set it up

Looking at different binaries on Mellanox drivers

  • To do with host names on RODATA
  • Can also have v8.1 instructions for newer cores
  • We have to be careful with older arches

Talking about benchmarks, noise and how to use perf to find issues

  • Thinking about hwloc support for Arm cores
  • INRIA has done some work, should upstream it
  • Added issue in benchmark_harness to use it

IPoIB tests too slow (15GB) while pure IB are fast (45GB+)

  • Second ports look open, may need to flip the cables next time in London
  • @Baptiste is finishing the Jenkins job to automate it

When we get IB jobs running and stable, we'll look at OpenMPI's MTT

  • Goal is to upload (some) results to OpenMPI's website

2018-08-31

Pak

Infiniband:

  • Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
  • If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
  • Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
  • Need to understand what we can do with our switch regarding the subnet manager, and when we will have to use opensm
  • Action: to work on HPC-292, trying to set up a subnet in the switch and, if that is not possible, listing the opensm setup during cluster provisioning
  • Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.

Lustre:

  • Usually needs at least 4 servers for redundancy (two disks, metadata), but made it work on single x86 machine, server and client working
  • Client builds and installs on Arm, but fails to communicate with the server. May be card issues (ConnectX5 on Arm vs X3/4 on x86).
  • Building the server on Arm has some build issues (platform not recognised), may be due to old autoconf scripts.
  • Action: try different cards on the x86 side and try a newer autoconf script, update HPC-321

Renato

Replicated Pak's Amberwing setup with multi-node using MLNX_OFED drivers, works fine, but install process is cumbersome. Working to automate it.

Tried building Lustre server on an x86 VM and got some weird build errors (AVX512 on a 10-year-old server), may be auto-detect.

Baptiste said there's a way to copy the host CPU features into the VM, will try that next. If it doesn't work, try to force configure options to disable AVX512.

That work will be updated in HPC-322.

2018-08-30

Takahiro

Had to move back to help Post-K development, didn't have time to continue working on upstream reviewed patch.

Current patch doesn't help other loops under investigation, will need additional work for those later.

Takeharu

Having trouble with Infiniband setup, which has delayed adding support for IB configuration in the Ansible recipes.

Not getting full speed on Mellanox fabric. May help to use auxiliary card on a PCI lane managed by the second CPU. Will need Socket Direct support (only on closed source drivers).

Would prefer to upstream the Ansible recipes into another repository (Linaro, OpenHPC) instead of having his own being the upstream.

Post-K uses a custom Lustre client/server, so they don't have the same problems we do with the server's kernel modules.

Fujitsu will use the commercial version of the Mellanox drivers, but also wants the freedom to use the open source ones.

We may need special handling in the Ansible recipes to choose which ones to install, or to leave that aside (ie. not overwrite existing drivers).

Masaki

Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).

Renato

Infiniband progress in the lab:

  • Huawei servers use ConnectX5 with two ports each: one to IB switch (for MPI), one to 100GB Eth switch (for Lustre)
  • Qualcomm servers use ConnectX4 in multi-node: OSS drivers don't support it, so we need to use MLNX_OFED. Provisioning / orchestration not ready for that.

Following up with Mellanox to upstream required features:

  • Socket Direct: needed to have aux card on second CPU working to maximise bandwidth
  • Multi-node: needed for the Amberwing aux. riser to make ports visible on the second node

Testing Lustre:

  • Client from whamcloud builds on Arm (both Huawei and Qualcomm) and packages install successfully
  • Server needs kernel drivers that were removed from staging, so we will start with Intel server
  • We don't have a spare x86_64 server, so we'll probably create a new VM on our admin server (really bad performance)

Slides
