
Bi-Weekly HPC Engineering Sync Minutes



2019-11-14

Attendance

Engineering Members 

Name | Present
Paul Isaacs (HPC Tech Lead, Linaro) | (tick)
Baptiste Gerondeau (HPC Engineer, Linaro) | (tick)
Masakazu Ueno (Fujitsu) | (tick)
Masaki Arai (Fujitsu) | (tick)

Legend: (tick) Present, (error) Not present


Optional / Guests 

Name | Present
Mark Orvek (VP Engineering, Linaro) | (error)
Elsie Wahlig (Sr Director LDCG, Linaro) | (error)
Graeme Gregory (LDCG Engineering Mgr, Linaro) | (error)
Victor Duan (Japan Country Mgr, Linaro) | (error)
Jammy Zhou (China Country Mgr, Linaro) | (error)


Dial in Information

Paul Isaacs is inviting you to a scheduled Zoom meeting.

Topic: 2019-11-14 HPC Engineering Meeting Agenda/Minutes

Join Zoom Meeting
https://zoom.us/j/2990402863

One tap mobile
+16465588656,,2990402863# US (New York)
+17207072699,,2990402863# US

Dial by your location
+1 646 558 8656 US (New York)
+1 720 707 2699 US
+1 877 853 5247 US Toll-free
+1 888 788 0099 US Toll-free
Meeting ID: 611 276 1834
Find your local number: https://zoom.us/u/axpe6BG2s


Location | Local Time | Time Zone | UTC Offset
San Jose (USA - California) | Thursday, November 14, 2019 at 5:00:00 am | PST | UTC-8 hours
London (United Kingdom - England) | Thursday, November 14, 2019 at 1:00:00 pm | GMT | UTC+0 hours
Paris (France - Île-de-France) | Thursday, November 14, 2019 at 2:00:00 pm | CET | UTC+1 hour
Tokyo (Japan) | Thursday, November 14, 2019 at 10:00:00 pm | JST | UTC+9 hours
Corresponding UTC (GMT) | Thursday, November 14, 2019 at 13:00:00

Agenda 

  • Previous meeting notes: 
  • Topic
    • Previous format/topics
    • Paul - Colo update
    • Baptiste - TensorFlow script. Lab Updates
    • Masaki Arai (Fujitsu) - Compiler Updates
    • AOB
  • Next Meetings: 
    • Meeting 28 November 2019
    • SC'19 November 16-22 2019 

Minutes (DRAFT - please add relevant links to your topics)

  • Previous meeting notes:  
  • Topic
    • Colo Lab Updates - Secondary power connections added to most nodes. Additional nodes added to the switch used for Warewulf builds (shared with OpenHPC test nodes).
    • TensorFlow script has been documented - comments appreciated - Building TensorFlow on AArch64
      • Problems building NumPy. The bug appears on both OpenHPC and RedHat. Perhaps consider fetching the latest toolchain builds from the Linaro Toolchain team (see the sketch after these notes). The GCC bug continues to be a problem.
    • Compiler Updates - Open
      • LLVM updates continuing for A64FX by Fujitsu under NDA.
      • GCC updates for A64FX - there are currently no Fujitsu resources. Non-NDA resources cannot be used until the A64FX technical specification is made available.
      • A64FX specification 'may' be released 1Q2020.
    • AOB
  • Next Meetings: 
    • Meeting 28 November 2019
    • SC'19 November 16-22 2019
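
A minimal sketch of the toolchain swap suggested above, assuming a prebuilt GCC from the Linaro Toolchain team has been unpacked under /opt/gcc-arm (the path and the pip invocation are illustrative, not the lab's actual setup):

  # Point the build at the newer GCC instead of the distro/OpenHPC compiler
  export PATH=/opt/gcc-arm/bin:$PATH
  export CC=/opt/gcc-arm/bin/gcc CXX=/opt/gcc-arm/bin/g++

  # Rebuild NumPy from source so it is compiled with the newer toolchain
  pip3 install --user --no-binary :all: numpy

  # Sanity check that the rebuilt NumPy imports cleanly
  python3 -c "import numpy; print(numpy.__version__)"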

Recording Link



2019-10-31

Attendance

Engineering Members 

Name | Present
Paul Isaacs (HPC Tech Lead, Linaro) | (tick)
Baptiste Gerondeau (HPC Engineer, Linaro) | (tick)
Masakazu Ueno (Fujitsu) | (error)
Masaki Arai (Fujitsu) | (tick)

Legend: (tick) Present, (error) Not present


Optional / Guests 

Name | Present
Mark Orvek (VP Engineering, Linaro) | (error)
Elsie Wahlig (Sr Director LDCG, Linaro) | (tick)
Graeme Gregory (LDCG Engineering Mgr, Linaro) | (error)
Victor Duan (Japan Country Mgr, Linaro) | (error)
Jammy Zhou (China Country Mgr, Linaro) | (error)


Dial in Information

Paul Isaacs is inviting you to a scheduled Zoom meeting.

Topic: 2019-10-31 HPC Engineering Meeting Agenda/Minutes

Join Zoom Meeting
https://zoom.us/j/2990402863

One tap mobile
+16465588656,,2990402863# US (New York)
+17207072699,,2990402863# US

Dial by your location
+1 646 558 8656 US (New York)
+1 720 707 2699 US
+1 877 853 5247 US Toll-free
+1 888 788 0099 US Toll-free
Meeting ID: 611 276 1834
Find your local number: https://zoom.us/u/axpe6BG2s


Location | Local Time | Time Zone | UTC Offset
San Jose (USA - California) | Thursday, October 31, 2019 at 6:00:00 am | PDT | UTC-7 hours
London (United Kingdom - England) | Thursday, October 31, 2019 at 1:00:00 pm | GMT | UTC+0 hours
Paris (France - Île-de-France) | Thursday, October 31, 2019 at 2:00:00 pm | CET | UTC+1 hour
Tokyo (Japan) | Thursday, October 31, 2019 at 10:00:00 pm | JST | UTC+9 hours
Corresponding UTC (GMT) | Thursday, October 31, 2019 at 13:00:00

Agenda 

  • Previous meeting notes: 
  • Topic
    • Previous format/topics
    • Paul - Workstation config
    • Dev boards 
    • Jinshui/Peter Lui (Futurewei) - 

      Back in Aug/Sept, when Joshua Terry was the Linaro HPC-SIG technical lead, Peter discussed with him the ARM HPC system integration and testing environment options. One of the options is that Futurewei may be able to host an ARM64 cluster (for example, 16-32 two-socket nodes). But then Joshua left Linaro and the discussion paused.

    • Baptiste - TensorFlow script. Lab Updates
    • Masaki Arai (Fujitsu) - Compiler Updates
    • AOB
  • Next Meetings: 
    • Meeting 14 November 2019
    • SC'19 November 16-22 2019 

Minutes

  • Previous meeting notes:  
  • Topic
    • Windows 10 Workstation config:
      • QEMU launch command for an emulated AArch64/Ubuntu environment
        • "C:\Program Files\qemu"\qemu-system-aarch64 -m 8192 -cpu cortex-a57 -smp 4 -M virt -nographic -drive file=aarch64_flash0.img,format=raw,if=pflash -drive file=aarch64_flash1.img,format=raw,if=pflash -drive if=none,file=eoan-server-cloudimg-arm64.img,id=hd0 -device virtio-blk-device,drive=hd0 -drive if=none,file=cloud.img,id=hd1 -device virtio-blk-device,drive=hd1 -netdev tap,ifname=EthernetTAP,id=network01 -device e1000,netdev=network01,mac=52:54:00:12:34:56 -accel tcg,thread=multi
      • However, the network connection is not yet working. Comments appreciated (a hedged user-mode networking sketch follows these notes).
      • Baptiste comment: To be honest, going through Windows adds a lot of complexity to the setup. I would recommend installing an Ubuntu/Debian dual-boot and running QEMU (+ libvirt) from there if the network problems persist. CUDA/GPU support should be solid on Ubuntu.
        • Paul - Dual boot is not currently an option, as the only graphics card is a single Nvidia GPU (not an Intel embedded/Nvidia external pairing), which causes lock-ups. With nographics, Ubuntu still crashes complaining about ACPI issues, and with ACPI turned off no storage is recognised. Therefore, QEMU on Windows is currently the only option for emulating AArch64 on this hardware.
    • Lab Updates - hardware changes at the colo on Nov 11/12 2019. Currently no known facility to host water-cooled nodes.
    • TensorFlow script has been documented - comments appreciated - Building TensorFlow on AArch64
      • Problems building NumPy. The bug appears on both OpenHPC and RedHat. Perhaps consider fetching the latest toolchain builds from the Linaro Toolchain team.
    • Futurewei hosting - How does this affect 'EAR' restrictions on technology transfer if we log in to a Futurewei-hosted system?
    • Compiler Updates - Open
      • LLVM updates continuing for A64FX by Fujitsu under NDA.
      • GCC updates for A64FX - there are currently no Fujitsu resources. Non-NDA resources cannot be used until the A64FX technical specification is made available.
      • A64FX specification 'may' be released 1Q2020.
    • AOB
  • Next Meetings: 
    • Meeting 14 November 2019
    • SC'19 November 16-22 2019
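
As a possible workaround for the TAP networking problem noted above, a hedged sketch using QEMU's built-in user-mode (SLIRP) networking, which avoids the Windows TAP driver entirely; only the network options of the command in the minutes change, and the forwarded host port 2222 is an arbitrary choice:

  # In the QEMU command above, replace
  #   -netdev tap,ifname=EthernetTAP,id=network01 -device e1000,netdev=network01,mac=52:54:00:12:34:56
  # with user-mode networking plus an SSH port forward:
  -netdev user,id=network01,hostfwd=tcp::2222-:22 -device e1000,netdev=network01

  # Then reach the guest from the Windows host through the forwarded port
  ssh -p 2222 ubuntu@localhost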

Recording Link

2019-06-05

Attending

  • Baptiste
  • Graeme
  • Elsie
  • Masaki Arai
  • Ueno

Notes

Lab - Baptiste

  • InfiniBand crashes on TX2 servers when used with virtual environments
  • ldiskfs - patches added to the kernel
  • CentOS 7.5 kernel (49.10.1.el7a.aarch64) - has issues writing to block devices (ZFS and Lustre)

2019-02-21

(Renato) My last Asia Engineering sync

  • It was a privilege working with you all
  • Thank you for the hard work and friendship
  • Wish you all the best in the future!
  • Welcoming Rafael, the next tech-lead!

LLVM CFG simplification

  • Branch elimination patch in active review, will update tomorrow
  • Using HCQC on Himeno (plans to use in SPEC17 later)
  • LLVM unrolls unnecessarily (too eager) after failing to vectorise

TSVC in LLVM

  • To convert the branch into a select, may need SimplifyCFG
  • Try adding the pass before vectorisation; try to find a SimplifyCFG utility inside the vectoriser later

Infrastructure

  • Suse recipes in progress
  • Infiniband almost done
  • TX2 boot bugs, investigating with Marvell
  • Finally installed the x86 machines, Debian/CentOS, OpenHPC, benchmarks
  • Getting "stack smashing" errors on x86 (kernel 3.x, old libc)
  • Working on OpenMPI MTT tests; need access to the private test repo, which requires membership
  • Hello World OpenMPI tests working (both upstream and OpenHPC stacks)
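
For reference, a minimal sketch of this kind of hello-world smoke test, assuming the standard MPI hello.c and OpenHPC modules are already in place (module, file and host names are illustrative):

  # Load the OpenHPC toolchain and MPI stack (names depend on the installed release)
  module load gnu7 openmpi3

  # Build the usual MPI hello-world source
  mpicc -O2 hello.c -o hello

  # Run across two compute nodes to exercise the fabric as well as shared memory
  mpirun -np 8 -H node1:4,node2:4 ./hello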

2019-02-07

Ansible OpenHPC

  • Fujitsu has finished their work; Linaro has validated it and is using it in production
  • We have a new community contributor, sending patches, issues

LLVM

  • Redundant branch elimination almost ready; trunk broke a local test, needs fixing
  • s278 doesn't seem to convert the condition to a select, working on it

Lab

  • Working on local network for Warewulf setup
  • QDFs working with OpenHPC IB via Ansible (needs MOFED for Multi-Host)

Benchmarks

  • Himeno runs 5x, 10x, 20x, identifying sources of noise (duality in branch-misses on D05), not much else (see the perf sketch after this list)
  • Results show big differences in compiler performance, all statistically significant 
  • Shows the harness works for its intended purpose, which was the point of the test (not to follow specific regressions)
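
A minimal sketch of the kind of repeated run used to look at that noise, assuming a locally built himeno binary (binary name and core pinning are illustrative):

  # Repeat the benchmark and collect branch statistics to expose bimodal behaviour
  for i in $(seq 1 20); do
    perf stat -e cycles,instructions,branches,branch-misses \
      taskset -c 0 ./himeno 2>> himeno-perf.log
  done

  # Pull out the branch-miss counts for a quick look at the spread
  grep branch-misses himeno-perf.log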

2019-01-24

LLVM

  • HPC-154: RFC document ready, preparing test files
  • HPC-155: updating the tree, new pragma "loop pipeline" may be interesting
  • Linaro will be moving to the new GitHub repo in a few months, with updated scripts

HPC Lab

  • QDF cluster working with IB with MOFED (needed because of multi-host)
  • TX2 cards are plugged in but don't show up on the switch; may need a firmware update
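
A hedged sketch of the checks that can narrow down whether the TX2 ports or the switch are at fault, using the standard infiniband-diags tools (run from a node with fabric access):

  # Check local HCA port state and firmware version on the TX2 node
  ibstat

  # List what the subnet manager sees on the fabric; missing nodes point at cabling or firmware
  ibswitches
  iblinkinfo

  # Query the fabric for ports reporting errors
  ibqueryerrors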

2019-01-11

Astra found race condition in RedHat kernel 4.14

 - On rhash, the "deferred_worker" function sometimes hangs at 100% CPU on one core

 - Hard to hit: ~4 nodes out of hundreds over many days running

 - Will try different kernels (upstream 4.14, newer, ERP?)

 - @Kevin to send more info, @Renato to contact the kernel guys, copy Sandia

Mellanox OFED 4.5 seems to shave off a few usec of latency

 - 2usec with 4.4, 1.8usec with 4.5, but building UCX/OpenMPI with TX2 opts makes it 1.5usec (see the measurement sketch below)

 - @Renato to ask Ilias again about upstreaming that to kernel, so that we can have it in OSS
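
For reference, a hedged sketch of how such latency numbers are typically collected with the perftest tools (device and host names are placeholders, not the lab's actual ones):

  # On the server node: start the latency test on the Mellanox device
  ib_send_lat -d mlx5_0

  # On the client node: point at the server to collect the latency figures
  ib_send_lat -d mlx5_0 server-node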

Weird "kernel hung for 30 seconds" message in /var/log/system

 - Sporadically, every few days on hundreds of nodes

 - Toolchain has seen the same on APM and Tegra buildbots

 - Seems to be when the nodes are 100% busy for too long

 - Could be power scheduling?

 - Not a big deal, though

2019-01-10

Lab move complete

 - Arm machines working, network operational

 - OpenSUSE LEAP installed on D05 and QDF (via autoyast)

 - TX2 still needs some changes

OpenHPC Ansible

 - Merge done, should ask Sandia to give it a go

 - Forgot to add some options (enable_reboot / mellanox), will add later

 - New option "force_service", need a description in readme (@baptiste)

 - We need to ask other Ansible users in OpenHPC to look at them

LLVM

 - 7.0 has added an "unconditional branches" optimisation, reducing the effect of local changes

 - Though, that opt already helps HPC applications

 - Machine pipeliner patch on https://reviews.llvm.org/D55106

 - Preparing patch for regalloc, still found more room for improvement

 - TSVC s278 has an IF that cannot be converted, even with --force-vector

 - Will try SimplifyCFG or update the vectoriser if-conversion functions


2018-12-20

Lab update

    - QDF, TX2 and D05 operational and PXE booting

OpenHPC

    - Infiniband Support merged to master

    - ISO for aarch64 MOFED should be added in the future

Machine Pipeliner:

   - HPC-154 is getting prepared for upstreaming

   - Huawei patch still needs investigation to see how it conflicts with the Machine Pipeliner

   - HPC-156 is under investigation and patches are being tested

Loop Vectorization:

   - HPC-213: the "force-vector" flag's output needs to be investigated (see the sketch below)
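
A minimal sketch of how the vectoriser's decisions, and the effect of forcing vectorisation, can be inspected with clang's optimisation remarks; tsvc.c is a placeholder for the kernel under investigation:

  # Ask the loop vectoriser to report what it did, what it missed, and why
  clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
        -Rpass-analysis=loop-vectorize -c tsvc.c

  # Force a vector width to see whether the loop vectorises once the width decision is overridden
  clang -O3 -mllvm -force-vector-width=4 -Rpass=loop-vectorize -c tsvc.c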

2018-12-14

Lab update

  • New machines, new internal network, IB everywhere
  • Amberwing, ThunderX2 machines fully operational
  • D05 still not booting (PXE-E07)
    • may need separate network card for PXE boot
    • flash a different firmware on the NICs directly
  • x86_64 machines still not operational (broken UEFI BIOS)

IB PXE problem

  • PXE segfaults over Mellanox, possibly Mellanox UEFI driver or IPoIB
  • Ask Mellanox guys to go to UEFI plugfest in Seattle to work on it

OpenHPC Ansible changes

  • Merging with Fujitsu changes, testing on our side, trying to get it before holidays
  • Sandia interested in joining the effort, we'll see next year

OpenSUSE

  • Able to install SUSE Leap 15.1 on D05, needed support for AutoYast in Mr-Provisioner
  • Adding support into hpc_lab_setup

2018-12-06

The meeting on January 3rd 2019 has been voted to be pushed back to January 10th; calendar update pending.

Lab Move

  • Lab is currently up, although provisioning of the clusters is not functional (BMC errors and firmware issues are keeping the machines from PXE booting)
  • Rewiring and firmware installation are needed and will be done in the next few days.
  • Added a cluster of x86_64 machines, each powered by two Intel Xeon E5-2450L "Sandy Bridge-EN" CPUs.

OpenHPC

  • Issues with InfiniBand support in the automation on both Fujitsu's and Linaro's sides; we are collaborating to fix them quickly.
  • Full tests on Linaro's side are pending provisioning service availability; in the meantime IB is functional.

LLVM

  • JumpThreading: requires more patches, which are in the process of being upstreamed
  • MachineModuloSched: optimization potentially not upstreamable, more discussions needed.
  • Greedy Register Allocator: '619.lbm_s' benchmark from SPEC CPU 2017 is a very good fit for benchmarking this feature.
  • Loop Vectorization: HPC-212 is resolved! Work on HPC-213 can now resume.

2018-11-30

Lab move

  • Lab is currently down, lots of new machines added
  • New internal switch for warewulf provisioning
  • Nothing working yet, will work throughout next week to bring it up

OpenBLAS

  • Patch merged, new arches added, correctly identifying them
  • Ad-hoc builder in https://openblas.ddns.net/
  • We'll have to do something similar for FFTW, too!

D06

  • Pak working on PAPI support; wants to upstream it but needs to find out how
  • Would be good to have that for other arches, so we can enable OpenHPC packages
  • Testing in the Linaro lab would help make sure of that

2018-11-22

OpenBLAS

  • Cleanup of the Arm builds, simplifying ARMv8 vs cores and adding support for more cores
  • Performance improvement across the board and guaranteeing ARMv8 only holds ARMv8.0 code (not potentially v8.1 as before)
  • Tested on Synquacer (A53), D03 (A57), D05 (A72), ThunderX, Amberwing (Falkor), ThunderX2, Moonshot (XGene)
  • Pull request: https://github.com/xianyi/OpenBLAS/pull/1876
  • Would be good for Fujitsu to test that code on Post-K
    • ThunderX2 builds might actually be good for Post-K (larger caches)
    • Will need to add -march=armv8.2-a+sve (in Makefile.arm64) to see SVE code coming out (see the sketch after this list)
    • We can later add Post-K mode when cpuinfo/cache/TLB details are public
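
A hedged sketch of the build being suggested above; the exact flag placement in Makefile.arm64 may differ, and THUNDERX2T99 is simply the closest existing target:

  # Build the ThunderX2 kernels (assumes a GCC new enough to understand SVE, e.g. GCC 8)
  make TARGET=THUNDERX2T99 CC=gcc FC=gfortran -j$(nproc)

  # To get SVE code out, append -march=armv8.2-a+sve to the target's flags in Makefile.arm64,
  # rebuild, and then look for SVE (z-register) instructions as a rough check:
  objdump -d libopenblas.a | grep -c ' z[0-9]'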

OpenHPC

  • Mellanox code rebooting nodes on non-SMS machines
  • Will send pull request to master later
  • Working on Baptiste's code in a new branch inside Fujitsu
  • @Baptiste to add a branch with all the patches for Fujitsu
  • Testing IB changes (MOFED) in HPC Lab, working so far

LLVM

  • Completed moving work into Linaro's git
  • New branch for regalloc, initial support for control flow, but not split&spill
  • Found JumpThreading bugs, fixed
  • Created random testing for branch elimination, will run next week
  • Some new basic blocks added, need to check the vectoriser still recognises all patterns

Infrastructure tasks

  • Tried to PXE boot ThunderX2, changing parameters in BIOS, provisioner, will try different ports next week
  • Tried to upgrade Amberwing's firmware, but getting unknown failures, in contact with Qualcomm

2018-11-08

OpenHPC Mellanox Ansible task

  • Fujitsu confirms their MOFED task works in their cluster (HPC-351)
    • Works on both SMS and Warewulf VNFS
    • Not confirmed if Ansible runs on compute nodes (the way Linaro does)
      • Linaro will test locally
  • Restart during installation will split installation in two
    • This is required because further tasks later (Lustre, IPoIB) will require IB interfaces up
    • Our IPoIB step should run after restart (or not, if using OSS drivers)

LLVM changes almost done, using Linaro's git now

  • Bug found in branch elimination, fixing
  • Pipeliner discussed with Arm, will update the list with a new RFC
  • Regression testing for s3111, should be ready in a few days
  • s278 forcing vectoriser works, need to work on legality

Ansible repository merge

  • Needs a cleanup on the big patches, then we can start proposing merges
  • We'll have to test on both sides and only merge when it's green everywhere
  • The order doesn't matter much, but getting the clean up first would help either way

OpenHPC v1.3.6 has GCC 8

  • SVE QEMU user emulation available upstream (see the example after this list)
  • Fujitsu SVE hardware can now be tested with OpenHPC
  • Linaro still has to test the new release and move to it by default
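
A minimal sketch of testing SVE code under QEMU user emulation as mentioned above; the source file is a placeholder, the module name may differ per release, and pinning the vector length needs a newer QEMU than basic SVE support does:

  # Use the GCC 8 toolchain shipped with OpenHPC v1.3.6
  module load gnu8

  # Build an SVE-enabled binary
  gcc -O3 -march=armv8.2-a+sve sve_kernel.c -o sve_kernel

  # Run it under QEMU's AArch64 user-mode emulator with SVE enabled via -cpu max
  qemu-aarch64 -cpu max ./sve_kernel

  # Newer QEMU builds can also pin the SVE vector length, e.g. 512-bit:
  # qemu-aarch64 -cpu max,sve-max-vq=4 ./sve_kernel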

OpenBLAS

  • Improve ARMv8 base support, would be good for undetected/internal/experimental cores
  • Need to also improve libm (Arm is doing it) & string functions (@renato: ask again about Cortex Strings)
  • Make sure they're not on by default, as specialised kernels can't use FP registers (-mfpu=none could help?)

2018-10-26

IB performance issues

  • Software issues are being resolved (HPC-341), we need to push them upstream
    • Need to test on D03, D05, QDF, etc, to make sure it's not TX2 specific
    • @Renato: check who can upstream the Mellanox patch (Ilias?)
  • Hardware timing issues will need time to be resolved and we can't do anything
    • We can identify them (by running on different hardware, investigating)
    • And report back to the vendors, if they haven't seen it yet
  • Intel writes directly to cache (bypasses memory)
    • Can we do that, too? This would speed up considerably
  • We're adding an IB performance job to Jenkins
    • We can use that to test changes in OFED drivers (Mellanox or Inbox)
    • OpenUCX performance tests can be done on a single-node system (see the sketch after this list)
    • OpenMPI seems to perform better on shared memory than UCX
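
A hedged sketch of the single-node UCX checks mentioned in the last two items, runnable on one machine over loopback/shared memory (test names are from ucx_perftest; the transport selection is illustrative):

  # Start the ucx_perftest server in the background on the same node
  ucx_perftest -t tag_lat &

  # Connect over localhost and measure tag-matching latency
  ucx_perftest localhost -t tag_lat

  # Bandwidth variant restricted to shared memory, to compare against the IB path
  UCX_TLS=sm ucx_perftest localhost -t tag_bw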

Adding IB test job to Jenkins

  • We're only running dual-node for now, could add single node (loopback, shared mem)
  • Could also add UCX perf tests to the same job

2018-10-25

Infiniband installation on OpenHPC tracking on HPC-351

  • Code mostly finished, will test next week
  • Will submit a pull request once finished

We need to move the repository to Linaro; Fujitsu forks it, and we all send pull requests to it (see the workflow sketch after this list)

  • Pull requests will be merged to master, but we still keep the production branch for our lab
  • Fujitsu should slowly review our own changes, so that we can merge them to master, too
  • We can still keep separate branches for each lab, so that we can slowly review each other's patches
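
A hedged sketch of the fork-and-branch flow being proposed (repository URLs and branch names are placeholders, not the real ones):

  # Clone your fork and track the Linaro repository as upstream
  git clone <url-of-your-fork> && cd <repo>
  git remote add upstream <url-of-linaro-repo>

  # Do the work on a topic branch, push it to the fork, then open a pull request to master
  git checkout -b ib-mofed-support
  git push origin ib-mofed-support

  # Keep lab-specific branches (e.g. production) rebased on the shared master
  git fetch upstream
  git rebase upstream/master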

LLVM development on track

  • GVN and pipeliner done, no tests yet but rebased to trunk
  • Still working on regalloc, still on LLVM 6
  • There was a round table at LLVM dev meeting about pipeliner
  • We sent our work to them, but not yet got feedback on discussions
  • GitHub's monorepo move should not affect those changes, we'll move when they're done

Finishing the cleanup of the lab

  • Mr-Provisioner client is upstream
  • Removing the old Ansible client from our lab (almost finished)
  • Adding an Infiniband automated test in Jenkins

Working on ERP for next release

  • Tested on our machines, kernel 4.18, working well
  • Some backports coming, we'll test again

Will look into OpenHPC test suite next week

  • Fix long/short run issues
  • Understand why all tests run in 0s
  • Add missing tests

Looked at OpenBLAS performance

Arm claims improvements on their Fortran support (commercial compiler)

  • None of this is upstream, so we have no idea what's going on
  • Fujitsu has to use their own old compiler or gfortran
  • Our work is independent of language
  • F18 is a new Fortran compiler for LLVM written in C++17 (sounds promising)

2018-10-19

Contacting Mellanox for upstreaming OpenSM virtualisation

  • This is a known issue and present in RedHat's release notes
  • Not very high priority for Mellanox, but we have to keep pushing

Got CentOS VM on AArch64 and trying to get IB interface through

  • Not able to get the driver through, may need changes to configuration
  • May be worth trying PCI passthrough: virsh nodedev-list (see the sketch below)
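
A hedged sketch of that PCI passthrough route (the PCI address, guest name and hostdev XML file are placeholders for the actual ConnectX device and VM):

  # List the PCI devices libvirt knows about and locate the Mellanox HCA
  virsh nodedev-list --cap pci
  lspci | grep -i mellanox

  # Detach the HCA from the host so it can be handed to the guest
  virsh nodedev-detach pci_0000_01_00_0

  # Attach it to the guest using a hand-written <hostdev> description of that PCI address
  virsh attach-device centos-guest hostdev-connectx.xml --persistent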

OpenBLAS hack to enable A57 instead of ARMv8

  • This is not a good solution, but it's better than the current way
  • We need to work with OpenBLAS anyway, so if no one wants the hack, we ignore it for now
  • If members want, Linaro can hold a temporary overlay on OBS
  • Huawei tested OpenBLAS last year and it was good enough, built by hand, thus A57 on D05
  • Huawei's new chip is custom and not A57, so the build could be worse than on D05 as it would fall back to ARMv8

Mr-Provisioner Client done

  • Moving the provisioning jobs in Jenkins to use the client directly
  • Will develop Ansible bindings later, simplify our setup and ERP's repositories
  • With documents and everything

ERP CentOS failed on D03

  • In our lab and the main lab, so not an issue with our setup
  • Haven't tested on others, will try next week

2018-10-12

IB perf automation going strong, just finished the Ansible module to parse results into JUnit XML for Jenkins

  • Jenkins' report is a bit terse, trying to work with JSON too, for Squad
  • May reuse the same logic for OpenHPC test-suite

Continuing with infrastructure refactoring.

  • Benchmark jobs merge pushed, tested and in production
  • Other jobs need provisioner client to be fully working
  • Kea is now available in the ERP OBS
  • This helps us move the lab infrastructure from x86 to Synquacer!

Trying to create VMs on D05 - CentOS, not being very successful

  • Machines boot into EL2, virt-builder works mostly
  • but virt-install doesn't, which is weird, since it works on the Synquacers (see the sketch below)
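
For reference, a hedged sketch of the sequence being attempted (image name, sizes and OS variant are illustrative):

  # Build a CentOS disk image for aarch64
  virt-builder centos-7.5 --arch aarch64 --size 20G -o centos-d05.img

  # Import it as a guest; this is the step that currently fails on the D05
  virt-install --name centos-d05 --arch aarch64 --memory 4096 --vcpus 4 \
    --import --disk path=centos-d05.img --os-variant centos7.0 --graphics none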

Huawei working on upstream compiler (gcc, llvm) support

  • Will upstream to LLVM and GNU, so Arm can pick up and release on their compiler
  • Working with Mellanox on v8.1 atomics on MPI libraries
  • ISVs seem to be finally joining the bandwagon, doing local tests and validation

2018-10-11

Work on InfiniBand Ansible automation in the OpenHPC recipes starting now

  • enable_mellanox_ib: using MOFED drivers (download ISO, build, install, reboot; see the sketch after this list)
  • enable_linux_ib: using INBOX OFED driver, just package install
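
A hedged sketch of the manual steps the enable_mellanox_ib task automates (the ISO name is a placeholder for whichever MLNX_OFED release is in use):

  # Mount the MLNX_OFED ISO and build/install the drivers against the running kernel
  mount -o ro,loop MLNX_OFED_LINUX-<version>-aarch64.iso /mnt
  /mnt/mlnxofedinstall --add-kernel-support

  # Restart the IB stack (or reboot the node) so the new drivers are loaded
  /etc/init.d/openibd restart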

Ansible work on Lustre will have to wait until Whamcloud publishes the new version

  • They have promised full Arm support by then...

Ansible OpenHPC development will change to Linaro

  • Linaro's repo will be the upstream
  • We're all going to have local forks and work there, ultimately pushing to Linaro's as branches
  • Linaro's Lab will use the branch production, Fujitsu will create branches for them
  • We need some effort to make sure the two don't diverge too much (by testing each other's patches and merging to master frequently, then rebasing)

Upstreaming those Ansible playbooks will have to wait until both Linaro and Fujitsu are using the same set of changes (minus local ones), so that we can start welcoming other labs' entirely new playbooks into our repos.

Due to long term sickness, Fujitsu is replacing Takahiro Miyoshi with Masakazu Ueno, who will continue the work in HPC-212 (LLVM TSVC improvements). The last of Takahiro's patches was committed this week.

@Renato to update the s278 task with some info on how to get the information necessary to start debugging, to give Masakazu a head start in the LLVM world.

Both Renato and Masaki spent a good amount of time building their LLVM environments, and using Linaro's git and scripts, and hope to resume coding in the following weeks. :)

The benchmark harness is now in Alpha version, so we're encouraging everyone to start using it locally and report bugs, propose improvements, send patches for new benchmarks, etc.

Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs, Mr-Provisioner's client and overall lab stability tasks.

Slides

2018-10-05

OpenSM still causing issues when setting up IB on the D03s

  • Best route is to enable it on the switch
  • Subnet created with P_Key, but don't know how to add nodes to it
  • LID changes when SM changes / restarts, switch should know
  • @Pak will try to set it up

Looking at different binaries on Mellanox drivers

  • To do with host names on RODATA
  • Can also have v8.1 instructions for newer cores
  • We have to be careful with older arches

Talking about benchmarks, noise and how to use perf to find issues

  • Thinking about hwloc support for Arm cores
  • INRIA has done some work, should upstream it
  • Added issue in benchmark_harness to use it

IPoIB tests are too slow (15GB) while pure IB is fast (45GB+)

  • Second ports look open, may need to flip the cables next time in London
  • @Baptiste is finishing the Jenkins job to automate it

When we get IB jobs running and stable, we'll look at OpenMPI's MTT

  • Goal is to upload (some) results to OpenMPI's website

2018-08-31

Pak

Infiniband:

  • Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
  • If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
  • Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
  • Need to understand what we can do with our switch regarding the subnet manager, and whether we will have to use opensm
  • Action: to work on HPC-292, trying to set up a subnet in the switch, and if that is not possible, documenting the opensm setup during cluster provisioning
  • Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.

Lustre:

  • Usually needs at least 4 servers for redundancy (two disks, metadata), but made it work on a single x86 machine, server and client working
  • Client builds and installs on Arm, but fails to communicate with the server. May be card issues (ConnectX5 on Arm vs X3/4 on x86).
  • Building the server on Arm has some build issues (platform not recognised), may be due to old autoconf scripts.
  • Action: try different cards on the x86 side and try a newer autoconf script, update HPC-321

Renato

Replicated Pak's Amberwing setup with multi-node using MLNX_OFED drivers, works fine, but install process is cumbersome. Working to automate it.

Tried building the Lustre server on an x86 VM and got some weird build errors (AVX512 on a 10-year-old server); may be auto-detection.

Baptiste said there's a way to copy the host CPU features into the VM, will try that next. If it doesn't work, try to force configure options to disable AVX512.
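
A hedged sketch of that host-CPU approach for a libvirt guest (guest name, image and sizing are placeholders):

  # When creating the guest, expose the host CPU model directly
  virt-install --name lustre-build --cpu host-passthrough --memory 8192 --vcpus 8 \
    --import --disk path=lustre-server.img --graphics none

  # For an existing guest, edit its definition and set <cpu mode='host-passthrough'/>
  virsh edit lustre-build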

That work will be updated in HPC-322.

2018-08-30

Takahiro

Had to move back to help Post-K development, didn't have time to continue working on upstream reviewed patch.

Current patch doesn't help other loops under investigation, will need additional work for those later.

Takeharu

Having trouble with Infiniband setup, which has delayed adding support for IB configuration in the Ansible recipes.

Not getting full speed on Mellanox fabric. May help to use auxiliary card on a PCI lane managed by the second CPU. Will need Socket Direct support (only on closed source drivers).

Would prefer to upstream the Ansible recipes into another repository (Linaro, OpenHPC) instead of having his own be the upstream.

Post-K uses a custom Lustre client/server, so they don't have the same problems we do with the server's kernel modules.

Fujitsu will use the commercial version of the Mellanox drivers, but also wants the freedom to use the open source ones.

We may need special handling in the Ansible recipes to choose which ones to install, or to leave that aside (ie. not overwrite existing drivers).

Masaki

Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).

Renato

Infiniband progress in the lab:

  • Huawei servers use ConnectX5 with two ports each: one to the IB switch (for MPI), one to the 100Gb Ethernet switch (for Lustre)
  • Qualcomm servers use ConnectX4 in multi-node: OSS drivers don't support it, so we need to use MLNX_OFED. Provisioning / orchestration not ready for that.

Following up with Mellanox to upstream required features:

  • Socket Direct: needed to have aux card on second CPU working to maximise bandwidth
  • Multi-node: needed for the Amberwing aux. riser to make ports visible on the second node

Testing Lustre:

  • Client from whamcloud builds on Arm (both Huawei and Qualcomm) and packages install successfully
  • Server needs kernel drivers that were removed from staging, so we will start with Intel server
  • We don't have a spare x86_64 server, so we'll probably create a new VM on our admin server (really bad performance)

Slides
