Attendance

Engineering Members

Name | Present
Paul Isaac's (HPC Tech Lead, Linaro) | (tick)(error)
Baptiste Gerondeau (HPC Engineer, Linaro) | (tick)(error)
Masakazu Ueno (Fujitsu) | (error)
Masaki Arai (Fujitsu) | (tick)(error)

(error) Not present

(tick) Present


Optional / Guests

Name | Present
Mark Orvek (VP Engineering, Linaro) | (tick)(error)
Elsie Wahlig (Sr Director LDCG, Linaro) | (tick)(error)
Graeme Gregory (LDCG Engineering Mgr, Linaro) | (tick)(error)
Victor Duan (Japan Country Mgr, Linaro) | (tick)(error)
Jammy Zhou (China Country Mgr, Linaro) | (tick)(error)


Dial in Information

Paul Isaac's is inviting you to a scheduled Zoom meeting.

...

Dial by your location
+1 646 558 8656 US (New York)
+1 720 707 2699 US
+1 877 853 5247 US Toll-free
+1 888 788 0099 US Toll-free
Meeting ID: 611 276 1834
Find your local number: https://zoom.us/u/axpe6BG2s


Location | Local Time | Time Zone | UTC Offset
San Jose (USA - California) | Thursday, October 31, 2019 at 6:00:00 am | PDT | UTC-7 hours
London (United Kingdom - England) | Thursday, October 31, 2019 at 1:00:00 pm | GMT | UTC+0 hours
Paris (France - Île-de-France) | Thursday, October 31, 2019 at 2:00:00 pm | CET | UTC+1 hour
Tokyo (Japan) | Thursday, October 31, 2019 at 10:00:00 pm | JST | UTC+9 hours
Corresponding UTC (GMT) | Thursday, October 31, 2019 at 13:00:00

Agenda 

  • Previous meeting notes: 
  • Topic
    • Previous format/topics
    • Lab Updates
    • Jinshui/Peter Lui (Futurewei) - 

      Back in Aug/Sept, when Joshua Terry was the Linaro HPC-SIG technical lead, Peter discussed ARM HPC system integration and testing environment options with him. One option is that Futurewei may be able to host an ARM64 cluster (for example, 16-32 two-socket nodes). Joshua then left Linaro and the discussion paused.

    • Masaki Arai (Fujitsu) - Compiler Updates
    • AOB
  • Next Meetings: 
    • Meeting 14 November 2019
    • SC'19 November 16-22 2019 

Minutes

  • Previous meeting notes:  
  • Topic
    • Lab Updates
    • Futurewei hosting
    • Compiler Updates
    • AOB
  • Next Meetings: 
    • Meeting 14 November 2019
    • SC'19 November 16-22 2019

Recording Link

2019-06-05

Attending

  • Baptiste
  • Graeme
  • Elsie
  • Masaki Arai
  • Ueno

...

Attachments:

  • 20190530-masakazu (1).pptx
  • task2019 (1).pdf

2019-02-21

(Renato) My last Asia Engineering sync

...

Attachments:

  • report20190221.pdf
  • 20190221-masakazu.pptx

2019-02-07

Ansible OpenHPC

  • Fujitsu has finished their work; Linaro has validated it and is using it in production
  • We have a new community contributor, who is sending patches and filing issues

...

Attachments:

  • report20190207.pdf
  • 20190207-biweekly-meeting-memo.pptx
  • 20190207-masakazu.pptx

2019-01-24

LLVM

  • HPC-154: RFC document ready, preparing test files
  • HPC-155: updating the tree, new pragma "loop pipeline" may be interesting
  • Linaro will be moving to the new GitHub repo in a few months, with updated scripts

...

Attachments:

  • report20190124.pdf

2019-01-11

Astra found a race condition in the Red Hat 4.14 kernel

...

 - Not a big deal, though

2019-01-10

Lab move complete

 - Arm machines working, network operational

...

Attachments:

  • report20190110.pdf
  • 20190110-biweekly-meeting-memo.pptx
  • 20190110-masakazu.pptx

2018-12-20

Lab update

    - QDF, TX2 and D05 operational and PXE booting

...

Attachments:

  • report20181220.pdf
  • 20181220-biweekly-meeting-memo.pptx
  • 20181220-masakazu.pptx

2018-12-14

Lab update

  • New machines, new internal network, IB everywhere
  • Amberwing, ThunderX2 machines fully operational
  • D05 still not booting (PXE-E07)
    • may need separate network card for PXE boot
    • flash a different firmware on the NICs directly
  • x86_64 machines still not operational (broken UEFI BIOS)

...

  • Able to install SUSE Leap 15.1 on D05; this needed AutoYaST support in Mr-Provisioner
  • Adding support into hpc_lab_setup

2018-12-06

The meeting on January 3rd 2019 has been pushed back, by vote, to January 10th; calendar update pending.

...

Attachments:

  • report20181206.pdf
  • 20181206-tkato.pptx
  • 20181206-masakazu.pptx

2018-11-30

Lab move

  • Lab is currently down, lots of new machines added
  • New internal switch for warewulf provisioning
  • Nothing working yet, will work throughout next week to bring it up

...

  • Pak is working on PAPI support and wants to upstream it, but needs to find out how
  • Would be good to have that for other arches, so we can enable OpenHPC packages
  • Testing in the Linaro lab would help make sure of that

2018-11-22

OpenBLAS

  • Cleanup of the Arm builds, simplifying the generic ARMv8 vs per-core targets and adding support for more cores
  • Performance improvement across the board, and a guarantee that the ARMv8 target only contains ARMv8.0 code (not potentially v8.1, as before)
  • Tested on Synquacer (A53), D03 (A57), D05 (A72), ThunderX, Amberwing (Falkor), ThunderX2, Moonshot (XGene)
  • Pull request: https://github.com/xianyi/OpenBLAS/pull/1876
  • Would be good for Fujitsu to test that code on Post-K
    • ThunderX2 builds might actually be good for Post-K (larger caches)
    • Will need to add -march=armv8.2-a+sve (in Makefile.arm64) to see SVE code coming out
    • We can later add Post-K mode when cpuinfo/cache/TLB details are public

...

Attachments:

  • report20181122.pdf
  • 20181122-tkato.pptx

2018-11-08

OpenHPC Mellanox Ansible task

...

Attachments:

  • report20181108.pdf
  • OpenHPC Ansible Merge.pdf
  • 20181108-tkato.pptx

2018-10-26

IB performance issues

  • Software issues are being resolved (HPC-341); we need to push the fixes upstream
    • Need to test on D03, D05, QDF, etc, to make sure it's not TX2 specific
    • @Renato: check who can upstream the Mellanox patch (Ilias?)
  • Hardware timing issues will need time to be resolved and we can't do anything
    • We can identify them (by running on different hardware, investigating)
    • And report back to the vendors, if they haven't seen it yet
  • Intel writes directly to cache (bypasses memory)
    • Can we do that, too? This would speed things up considerably
  • We're adding an IB performance job to Jenkins
    • We can use that to test changes in OFED drivers (Mellanox or Inbox)
    • OpenUCX performance tests can be done on a single-node system
    • OpenMPI seems to perform better on shared memory than UCX

...

  • We're only running dual-node for now, could add single node (loopback, shared mem)
  • Could also add UCX perf tests to the same job

2018-10-25

InfiniBand installation on OpenHPC is being tracked in HPC-351

...

Attachments:

  • report20181025.pdf
  • 20181025-tkato.pptx

2018-10-19

Contacting Mellanox for upstreaming OpenSM virtualisation

...

  • In our lab and the main lab, so not an issue with our setup
  • Haven't tested on others, will try next week

2018-10-12

IB perf automation going strong, just finished the Ansible module to parse results into JUnit XML for Jenkins
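
For readers unfamiliar with the JUnit XML format that Jenkins consumes, the following is a minimal sketch of the kind of conversion described above. It is not the Ansible module itself: the function name, the result keys (name, value, minimum) and the sample figures are all hypothetical, and it only assumes that each benchmark result can be reduced to a measured bandwidth and a lowest acceptable bandwidth.

```python
import xml.etree.ElementTree as ET


def results_to_junit(suite_name, results):
    """Return a JUnit-style XML string for a list of benchmark results.

    Each result is a dict with hypothetical keys:
      name    - test case name
      value   - measured bandwidth
      minimum - lowest acceptable bandwidth
    A test case is reported as failed when value < minimum.
    """
    failed = [r for r in results if r["value"] < r["minimum"]]
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(len(failed)),
    )
    for r in results:
        case = ET.SubElement(
            suite, "testcase", classname=suite_name, name=r["name"]
        )
        if r["value"] < r["minimum"]:
            # A <failure> child marks the test case as failed.
            failure = ET.SubElement(
                case,
                "failure",
                message=f"{r['value']} is below the minimum of {r['minimum']}",
            )
            failure.text = f"measured={r['value']} minimum={r['minimum']}"
    return ET.tostring(suite, encoding="unicode")


if __name__ == "__main__":
    # Made-up sample figures, for illustration only.
    sample = [
        {"name": "write_bw", "value": 90.0, "minimum": 80.0},
        {"name": "read_bw", "value": 60.0, "minimum": 80.0},
    ]
    print(results_to_junit("ib_perf", sample))
```

Writing the returned string to a file in the job workspace would let a JUnit report step in Jenkins pick it up; the file name and location are left to the job configuration.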

...

  • Will upstream to LLVM and GNU, so Arm can pick it up and release it in their compiler
  • Working with Mellanox on v8.1 atomics on MPI libraries
  • ISVs seem to be finally joining the bandwagon, doing local tests and validation

2018-10-11

Work on InfiniBand Ansible automation in the OpenHPC recipes starting now

...

Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs, Mr-Provisioner's client and overall lab stability tasks.

Slides

  • report20181011.pdf
  • 20181011-tkato.pptx

2018-10-05

OpenSM still causing issues when setting up IB on the D03s

...

  • Goal is to upload (some) results to OpenMPI's website

2018-08-31

Pak

Infiniband:

  • Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
  • If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
  • Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
  • Need to understand what we can do with our switch regarding the subnet manager, and for what we will have to use opensm
  • Action: to work on HPC-292, trying to set up a subnet in the switch and, if that is not possible, listing the opensm setup during cluster provisioning
  • Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.

...

  • Usually needs at least 4 servers for redundancy (two disks, metadata), but made it work on single x86 machine, server and client working
  • Client builds and installs on Arm, but fails to communicate with the server. May be card issues (ConnectX5 on Arm vs X3/4 on x86).
  • Building the server on Arm has some build issues (platform not recognised), may be due to old autoconf scripts.
  • Action: try different cards on the x86 side and try a newer autoconf script, update HPC-321

Renato

Replicated Pak's Amberwing setup with multi-node using MLNX_OFED drivers, works fine, but install process is cumbersome. Working to automate it.

...

That work will be updated in HPC-322.

2018-08-30

Takahiro

Had to move back to help Post-K development, so didn't have time to continue working on the upstream-reviewed patch.

Current patch doesn't help other loops under investigation, will need additional work for those later.

Takeharu

Having trouble with Infiniband setup, which has delayed adding support for IB configuration in the Ansible recipes.

...

We may need special handling in the Ansible recipes to choose which ones to install, or to leave that aside (i.e. not overwrite existing drivers).

Masaki

Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).

Renato

Infiniband progress in the lab:

...

  • Client from whamcloud builds on Arm (both Huawei and Qualcomm) and packages install successfully
  • Server needs kernel drivers that were removed from staging, so we will start with Intel server
  • We don't have a spare x86_64 server, so we'll probably create a new VM on our admin server (really bad performance)

Slides

  • 20180730-tkato.pptx
  • report20180830.pdf