
2020-01-09

Attendance

Engineering Members 

Name | Present
Paul Isaac's (HPC Tech Lead, Linaro) | Yes
Baptiste Gerondeau (HPC Engineer, Linaro) | Yes
Masakazu Ueno (Fujitsu) | Yes
Masaki Arai (Fujitsu) | Yes
Nakashima Kouta (Fujitsu) | No


Optional / Guests 

Name | Present
Mark Orvek (VP Engineering, Linaro) | No
Elsie Wahlig (Sr Director LDCG, Linaro) | No
Graeme Gregory (LDCG Engineering Mgr, Linaro) | No
Victor Duan (Japan Country Mgr, Linaro) | No
Jammy Zhou (China Country Mgr, Linaro) | No


Dial in Information

Paul Isaac's is inviting you to a scheduled Zoom meeting.

Topic: 2020-01-09 HPC Engineering Meeting Agenda/Minutes

Join Zoom Meeting
https://zoom.us/j/2990402863

One tap mobile
+16465588656,,2990402863# US (New York)
+17207072699,,2990402863# US

Dial by your location
+1 646 558 8656 US (New York)
+1 720 707 2699 US
+1 877 853 5247 US Toll-free
+1 888 788 0099 US Toll-free
Meeting ID: 611 276 1834
Find your local number: https://zoom.us/u/axpe6BG2s


Location | Local Time | Time Zone | UTC Offset
San Jose (USA - California) | Thursday, November 14, 2019 at 6:00:00 am | PDT | UTC-7 hours
London (United Kingdom - England) | Thursday, November 14, 2019 at 1:00:00 pm | GMT | UTC+0 hours
Paris (France - Île-de-France) | Thursday, November 14, 2019 at 2:00:00 pm | CET | UTC+1 hour
Tokyo (Japan) | Thursday, November 14, 2019 at 10:00:00 pm | JST | UTC+9 hours
Corresponding UTC (GMT) | Thursday, November 14, 2019 at 13:00:00

Agenda 

Minutes (DRAFT - please add relevant links to your topics)

Recording Link




2019-11-14

Attendance

Engineering Members 

Name | Present
Paul Isaac's (HPC Tech Lead, Linaro) | Yes
Baptiste Gerondeau (HPC Engineer, Linaro) | Yes
Masakazu Ueno (Fujitsu) | Yes
Masaki Arai (Fujitsu) | Yes


Optional / Guests 

Name | Present
Mark Orvek (VP Engineering, Linaro) | No
Elsie Wahlig (Sr Director LDCG, Linaro) | No
Graeme Gregory (LDCG Engineering Mgr, Linaro) | No
Victor Duan (Japan Country Mgr, Linaro) | No
Jammy Zhou (China Country Mgr, Linaro) | No


Dial in Information

Paul Isaac's is inviting you to a scheduled Zoom meeting.

Topic: 2019-11-14 HPC Engineering Meeting Agenda/Minutes



Location | Local Time | Time Zone | UTC Offset
San Jose (USA - California) | Thursday, November 14, 2019 at 6:00:00 am | PDT | UTC-7 hours
London (United Kingdom - England) | Thursday, November 14, 2019 at 1:00:00 pm | GMT | UTC+0 hours
Paris (France - Île-de-France) | Thursday, November 14, 2019 at 2:00:00 pm | CET | UTC+1 hour
Tokyo (Japan) | Thursday, November 14, 2019 at 10:00:00 pm | JST | UTC+9 hours
Corresponding UTC (GMT) | Thursday, November 14, 2019 at 13:00:00

Agenda 

Minutes (DRAFT - please add relevant links to your topics)

Recording Link



2019-10-31

Attendance

Engineering Members 

Name | Present
Paul Isaac's (HPC Tech Lead, Linaro) | Yes
Baptiste Gerondeau (HPC Engineer, Linaro) | Yes
Masakazu Ueno (Fujitsu) | No
Masaki Arai (Fujitsu) | Yes


Optional / Guests 

Name | Present
Mark Orvek (VP Engineering, Linaro) | No
Elsie Wahlig (Sr Director LDCG, Linaro) | Yes
Graeme Gregory (LDCG Engineering Mgr, Linaro) | No
Victor Duan (Japan Country Mgr, Linaro) | No
Jammy Zhou (China Country Mgr, Linaro) | No


Dial in Information

Paul Isaac's is inviting you to a scheduled Zoom meeting.

Topic: 2019-10-31 HPC Engineering Meeting Agenda/Minutes



Location | Local Time | Time Zone | UTC Offset
San Jose (USA - California) | Thursday, October 31, 2019 at 6:00:00 am | PDT | UTC-7 hours
London (United Kingdom - England) | Thursday, October 31, 2019 at 1:00:00 pm | GMT | UTC+0 hours
Paris (France - Île-de-France) | Thursday, October 31, 2019 at 2:00:00 pm | CET | UTC+1 hour
Tokyo (Japan) | Thursday, October 31, 2019 at 10:00:00 pm | JST | UTC+9 hours
Corresponding UTC (GMT) | Thursday, October 31, 2019 at 13:00:00

Agenda 

Minutes

Recording Link

2019-06-05

Attending

Notes

Lab - Baptiste

2019-02-21

(Renato) My last Asia Engineering sync

LLVM CFG simplification

TSVC in LLVM

Infrastructure

2019-02-07

Ansible OpenHPC

LLVM

Lab

Benchmarks

2019-01-24

LLVM

HPC Lab

2019-01-11

Astra found race condition in RedHat kernel 4.14

 - On rhash, the "deferred_worker" function sometimes hangs at 100% CPU on one core

 - Hard to hit: ~4 nodes out of hundreds over many days running

 - Will try different kernels (upstream 4.14, newer, ERP?)

 - @Kevin to send more info, @Renato to contact the kernel guys, copy Sandia

Mellanox OFED 4.5 seems to shave off a few usec latency

 - 2usec with 4.4, 1.8usec with 4.5, but building UCX/OpenMPI with TX2 opts makes it 1.5usec

 - @Renato to ask Ilias again about upstreaming that to kernel, so that we can have it in OSS

Weird "kernel hung for 30 seconds" message in /var/log/system

 - Sporadically, every few days on hundreds of nodes

 - Toolchain has seen the same on APM, Tegra buildbots

 - Seems to be when the nodes are 100% busy for too long

 - Could be power scheduling?

 - Not a big deal, though

2019-01-10

Lab move complete

 - Arm machines working, network operational

 - OpenSUSE LEAP installed on D05 and QDF (via autoyast)

 - TX2 still needs some changes

OpenHPC Ansible

 - Merge done, should ask Sandia to give it a go

 - Forgot to add some options (enable_reboot / mellanox), will add later

 - New option "force_service", need a description in readme (@baptiste)

 - We need to ask other Ansible users in OpenHPC to look at them

LLVM

 - 7.0 has added "unconditional branches" optimisation, reducing effect of local changes

 - Though, that opt already helps HPC applications

 - Machine pipeliner patch on https://reviews.llvm.org/D55106

 - Preparing patch for regalloc, still found more room for improvement

 - TSVC s278 has an IF that cannot be converted, even with --force-vector

 - Will try Simplify CFG or update the vectoriser if-conversion functions
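
The s278 item above is about if-conversion: a loop with an IF can only be vectorised once the control flow is turned into straight-line code that computes both sides and selects a result. A minimal Python sketch of that transformation follows. The loop body is an assumption about the general shape of s278 (an if/else that updates one of two arrays), not a copy of the TSVC sources, and the function names and input values are hypothetical. The conditional expressions stand in for the select operations a vectoriser would emit.

```python
def s278_branchy(a, b, c, d, e):
    """Scalar loop with an if/else that updates either c[i] or b[i]."""
    for i in range(len(a)):
        if a[i] > 0.0:
            c[i] = -c[i] + d[i] * e[i]
        else:
            b[i] = -b[i] + d[i] * e[i]
        a[i] = b[i] + c[i] * d[i]


def s278_if_converted(a, b, c, d, e):
    """Same loop after if-conversion: compute both sides, then select."""
    for i in range(len(a)):
        t = d[i] * e[i]
        mask = a[i] > 0.0
        new_b = -b[i] + t
        new_c = -c[i] + t
        # Each array keeps its old value on the side that was not taken.
        b[i] = b[i] if mask else new_b
        c[i] = new_c if mask else c[i]
        a[i] = b[i] + c[i] * d[i]


def make_inputs():
    # Made-up values chosen to exercise both sides of the branch.
    return (
        [1.0, -2.0, 0.0, 3.0],
        [1.0, 2.0, 3.0, 4.0],
        [4.0, 3.0, 2.0, 1.0],
        [0.5, 1.5, 2.5, 3.5],
        [2.0, 2.0, 2.0, 2.0],
    )


x = make_inputs()
y = make_inputs()
s278_branchy(*x)
s278_if_converted(*y)
print(x == y)
```

Both versions leave the arrays in the same state; the second form has no control flow that depends on the data inside the loop body, which is the property the vectoriser needs.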


2018-12-20

Lab update

 - QDF, TX2 and D05 operational and PXE booting

OpenHPC

 - Infiniband Support merged to master

 - ISO for aarch64 MOFED should be added in the future

Machine Pipeliner:

 - HPC-154 is getting prepared for upstreaming

 - Huawei patch still needs investigation to see how it conflicts with the Machine Pipeliner

 - HPC-156 is under investigation and patches are being tested

Loop Vectorization:

 - HPC-213: the "force-vector" flag's output needs to be investigated

2018-12-14

Lab update

IB PXE problem

OpenHPC Ansible changes

OpenSUSE

2018-12-06

The meeting on January 3rd 2019 has been pushed back, by vote, to January 10th; calendar update pending.

Lab Move

OpenHPC

LLVM

2018-11-30

Lab move

OpenBLAS

D06

2018-11-22

OpenBLAS

OpenHPC

LLVM

Infrastructure tasks

2018-11-08

OpenHPC Mellanox Ansible task

LLVM changes almost done, using Linaro's git now

Ansible repository merge

OpenHPC v1.3.6 has GCC 8

OpenBLAS

2018-10-26

IB performance issues

Adding IB test job to Jenkins

2018-10-25

Infiniband installation on OpenHPC tracking on HPC-351

We need to move the repository to Linaro; Fujitsu forks it, and we all send pull requests to the Linaro repository

LLVM development on track

Finishing the cleanup of the lab

Working on ERP for next release

Will look into OpenHPC test suite next week

Looked at OpenBLAS performance

Arm claims improvements on their Fortran support (commercial compiler)

2018-10-19

Contacting Mellanox for upstreaming OpenSM virtualisation

Got CentOS VM on AArch64 and trying to get IB interface through

OpenBLAS hack to enable A57 instead of ARMv8

Mr-Provisioner Client done

ERP CentOS failed on D03

2018-10-12

IB perf automation going strong, just finished the Ansible module to parse results into JUnit XML for Jenkins
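
For readers unfamiliar with the JUnit XML step, here is a minimal Python sketch of turning benchmark numbers into a report Jenkins can read. It is not the Ansible module mentioned above: the function name, the result keys (name, value, limit, higher_is_better) and the sample numbers are all hypothetical, and only the basic testsuite/testcase/failure elements of the JUnit format are used.

```python
import xml.etree.ElementTree as ET


def results_to_junit(suite_name, results):
    """Convert benchmark results into a JUnit XML string.

    `results` is a list of dicts with hypothetical keys:
    name, value, limit and higher_is_better.
    """
    suite = ET.Element("testsuite", name=suite_name, tests=str(len(results)))
    failures = 0
    for r in results:
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=r["name"])
        if r["higher_is_better"]:
            passed = r["value"] >= r["limit"]
        else:
            passed = r["value"] <= r["limit"]
        if not passed:
            # A <failure> child marks the test case as failed.
            failures += 1
            failure = ET.SubElement(case, "failure", message="outside limit")
            failure.text = f"measured {r['value']}, limit {r['limit']}"
    suite.set("failures", str(failures))
    return ET.tostring(suite, encoding="unicode")


# Made-up numbers, for illustration only.
sample = [
    {"name": "bandwidth", "value": 90.0, "limit": 80.0, "higher_is_better": True},
    {"name": "latency", "value": 2.5, "limit": 2.0, "higher_is_better": False},
]
junit_xml = results_to_junit("ib_perf", sample)
print(junit_xml)
```

Expressing each measurement as a test case with a limit lets Jenkins show performance regressions as failed tests, without any plugin beyond its JUnit report support.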

Continuing with infrastructure refactoring.

Trying to create VMs on D05 - CentOS, not being very successful

Huawei working on upstream compiler (gcc, llvm) support

2018-10-11

Work on InfiniBand Ansible automation in the OpenHPC recipes starting now

Ansible work on Lustre will have to wait until Whamcloud publishes the new version

Ansible OpenHPC development will change to Linaro

Upstreaming those Ansible playbooks will have to wait until both Linaro and Fujitsu are using the same set of changes (minus local ones), so that we can start welcoming other labs' entirely new playbooks into our repos.

Due to long-term sickness, Fujitsu is replacing Takahiro Miyoshi with Masakazu Ueno, who will continue the work in HPC-212 (LLVM TSVC improvements). The last of Takahiro's patches was committed this week.

@Renato to update the s278 task with some info on how to get the information necessary to start debugging, to give Masakazu a head start in the LLVM world.

Both Renato and Masaki spent a good amount of time building their LLVM environments, and using Linaro's git and scripts, and hope to resume coding in the following weeks. :)

The benchmark harness is now in Alpha version, so we're encouraging everyone to start using it locally and report bugs, propose improvements, send patches for new benchmarks, etc.

Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs, Mr-Provisioner's client and overall lab stability tasks.

Slides

2018-10-05

OpenSM still causing issues when setting up IB on the D03s

Looking at different binaries on Mellanox drivers

Talking about benchmarks, noise and how to use perf to find issues

IPoIB tests too slow (15GB) while pure IB are fast (45GB+)

When we get IB jobs running and stable, we'll look at OpenMPI's MTT

2018-08-31

Pak

Infiniband:

Lustre:

Renato

Replicated Pak's Amberwing setup with multi-node using MLNX_OFED drivers, works fine, but install process is cumbersome. Working to automate it.

Tried building Lustre server on an x86 VM and got some weird build errors (AVX512 on a 10-year-old server), may be auto-detect.

Baptiste said there's a way to copy the host CPU features into the VM, will try that next. If it doesn't work, try to force configure options to disable AVX512.

That work will be updated in HPC-322.

2018-08-30

Takahiro

Had to move back to help with Post-K development, so didn't have time to continue working on the upstream-reviewed patch.

Current patch doesn't help other loops under investigation, will need additional work for those later.

Takeharu

Having trouble with Infiniband setup, which has delayed adding support for IB configuration in the Ansible recipes.

Not getting full speed on Mellanox fabric. May help to use auxiliary card on a PCI lane managed by the second CPU. Will need Socket Direct support (only on closed source drivers).

Would prefer to upstream the Ansible recipes into another repository (Linaro, OpenHPC) instead of having his own being the upstream.

Post-K uses a custom Lustre client/server, so they don't have the same problems we do with the server's kernel modules.

Fujitsu will use the commercial version of the Mellanox drivers, but also wants the freedom to use the open source ones.

We may need special handling in the Ansible recipes to choose which ones to install, or to leave that aside (i.e. not overwrite existing drivers).

Masaki

Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).

Renato

Infiniband progress in the lab:

Following up with Mellanox to upstream required features:

Testing Lustre:

Slides