[ Attendance ] [ Dial in Information ] [ Agenda ] [ Minutes ]
Attendance
Engineering Members
Name | Present |
---|---|
Paul Isaacs (HPC Tech Lead, Linaro) | |
Baptiste Gerondeau (HPC Engineer, Linaro) | |
Masakazu Ueno (Fujitsu) | |
Masaki Arai (Fujitsu) | |
Not present
Present
Optional / Guests
Name | Present |
---|---|
Mark Orvek (VP Engineering, Linaro) | |
Elsie Wahlig (Sr Director LDCG, Linaro) | |
Graeme Gregory (LDCG Engineering Mgr, Linaro) | |
Victor Duan (Japan Country Mgr, Linaro) | |
Jammy Zhou (China Country Mgr, Linaro) | |
Dial in Information
Paul Isaacs is inviting you to a scheduled Zoom meeting.
...
Dial by your location
+1 646 558 8656 US (New York)
+1 720 707 2699 US
+1 877 853 5247 US Toll-free
+1 888 788 0099 US Toll-free
Meeting ID: 611 276 1834
Find your local number: https://zoom.us/u/axpe6BG2s
Location | Local Time | Time Zone | UTC Offset |
---|---|---|---|
San Jose (USA - California) | Thursday, October 31, 2019 at 6:00:00 am | PDT | UTC-7 hours |
London (United Kingdom - England) | Thursday, October 31, 2019 at 1:00:00 pm | GMT | UTC+0 hours |
Paris (France - Île-de-France) | Thursday, October 31, 2019 at 2:00:00 pm | CET | UTC+1 hour |
Tokyo (Japan) | Thursday, October 31, 2019 at 10:00:00 pm | JST | UTC+9 hours |
Corresponding UTC (GMT) | Thursday, October 31, 2019 at 13:00:00 | | |
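As a sanity check on the offsets in the table, the local times can be derived from the 13:00 UTC slot with a short Python snippet (assuming Python 3.9+ for the standard zoneinfo module):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Meeting time: 13:00 UTC on 31 October 2019.
meeting_utc = datetime(2019, 10, 31, 13, 0, tzinfo=ZoneInfo("UTC"))

# IANA zone names for the locations in the table above.
zones = {
    "San Jose": "America/Los_Angeles",
    "London": "Europe/London",
    "Paris": "Europe/Paris",
    "Tokyo": "Asia/Tokyo",
}

for city, zone in zones.items():
    local = meeting_utc.astimezone(ZoneInfo(zone))
    # %Z prints the abbreviation (PDT, GMT, CET, JST); %z the UTC offset.
    print(f"{city}: {local:%Y-%m-%d %H:%M %Z (%z)}")
```

Running this reproduces the PDT/GMT/CET/JST rows above (note London is already back on GMT and Paris on CET by 31 October).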
Agenda
- Previous meeting notes:
- Topic
- Previous format/topics
- Lab Updates
- Jinshui/Peter Lui (Futurewei) -
Back in Aug/Sept, when Joshua Terry was the Linaro HPC-SIG technical lead, Peter discussed Arm HPC system integration and testing environment options with him. One option was that Futurewei may be able to host an Arm64 cluster (for example, 16-32 2-socket nodes). Joshua then left Linaro and the discussion paused.
- Masaki Arai (Fujitsu) - Compiler Updates
- AOB
- Next Meetings:
- Meeting 14 November 2019
- SC'19 November 16-22 2019
Minutes
- Previous meeting notes:
- Topic
- Lab Updates
- Futurewei hosting
- Compiler Updates
- AOB
- Next Meetings:
- Meeting 14 November 2019
- SC'19 November 16-22 2019
Recording Link
2019-06-05
Attending
- Baptiste
- Graeme
- Elsie
- Masaki Arai
- Ueno
...
2019-02-21
(Renato) My last Asia Engineering sync
...
2019-02-07
Ansible OpenHPC
- Fujitsu has finished their work; Linaro has validated it and is using it in production
- We have a new community contributor sending patches and filing issues
...
2019-01-24
LLVM
- HPC-154: RFC document ready, preparing test files
- HPC-155: updating the tree, new pragma "loop pipeline" may be interesting
- Linaro will be moving to the new GitHub repo in a few months, with updated scripts
...
2019-01-11
Astra found a race condition in the Red Hat 4.14 kernel
...
- Not a big deal, though
2019-01-10
Lab move complete
- Arm machines working, network operational
...
2018-12-20
Lab update
- QDF, TX2 and D05 operational and PXE booting
...
2018-12-14
Lab update
- New machines, new internal network, IB everywhere
- Amberwing, ThunderX2 machines fully operational
- D05 still not booting (PXE-E07)
- may need separate network card for PXE boot
- flash a different firmware on the NICs directly
- x86_64 machines still not operational (broken UEFI BIOS)
...
- Able to install SUSE Leap 15.1 on D05; needed support for AutoYaST in Mr-Provisioner
- Adding support into hpc_lab_setup
2018-12-06
It was voted to push the January 3rd 2019 meeting back to January 10th; the calendar update is pending.
...
2018-11-30
Lab move
- Lab is currently down, lots of new machines added
- New internal switch for warewulf provisioning
- Nothing working yet, will work throughout next week to bring it up
...
- Pak is working on PAPI support; he wants to upstream it but needs to find out how
- It would be good to have that for other arches, so we can enable OpenHPC packages
- Testing in the Linaro lab would help make sure of that
2018-11-22
OpenBLAS
- Cleanup of the Arm builds, simplifying ARMv8 vs cores and adding support for more cores
- Performance improvement across the board and guaranteeing ARMv8 only holds ARMv8.0 code (not potentially v8.1 as before)
- Tested on Synquacer (A53), D03 (A57), D05 (A72), ThunderX, Amberwing (Falkor), ThunderX2, Moonshot (XGene)
- Pull request: https://github.com/xianyi/OpenBLAS/pull/1876
- Would be good for Fujitsu to test that code on Post-K
- ThunderX2 builds might actually be good for Post-K (larger caches)
- Will need to add -march=armv8.2-a+sve (in Makefile.arm64) to see SVE code coming out; a quick check for this is sketched below
- We can later add Post-K mode when cpuinfo/cache/TLB details are public
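As a rough way to confirm SVE instructions are actually being emitted (a hypothetical check, not something from the OpenBLAS tree), one can disassemble the built library and look for SVE vector/predicate register operands:

```python
import re
import subprocess

# Path to the library built with -march=armv8.2-a+sve (adjust as needed).
LIB = "libopenblas.so"

# Disassemble with binutils objdump (needs an aarch64-capable objdump).
asm = subprocess.run(
    ["objdump", "-d", LIB], capture_output=True, text=True, check=True
).stdout

# SVE instructions operate on z0-z31 vector and p0-p15 predicate
# registers, e.g. "ptrue p0.s" or "ld1w {z0.s}, p0/z, [x0]".
sve = [line for line in asm.splitlines()
       if re.search(r"\b[zp]\d+\.[bhsd]\b", line)]

print(f"{len(sve)} instructions touch SVE registers")
print("\n".join(sve[:10]))  # show a small sample
```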
...
2018-11-08
OpenHPC Mellanox Ansible task
...
2018-10-26
IB performance issues
- Software issues are being resolved (HPC-341), we need to push them upstream
- Need to test on D03, D05, QDF, etc, to make sure it's not TX2 specific
- @Renato: check who can upstream the Mellanox patch (Ilias?)
- Hardware timing issues will need time to be resolved and we can't do anything
- We can identify them (by running on different hardware, investigating)
- And report back to the vendors, if they haven't seen it yet
- Intel writes directly to cache (bypasses memory)
- Can we do that, too? This would speed up considerably
- We're adding an IB performance job to Jenkins (a sketch of such a check follows this list)
...
- We're only running dual-node for now, could add single node (loopback, shared mem)
- Could also add UCX perf tests to the same job
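A minimal sketch of what that dual-node check could look like (node names, the mlx5_0 device, and the 80 Gb/s threshold are all assumptions, not values from the lab; the real job drives something like this from Jenkins):

```python
import re
import subprocess

# Hypothetical node names and HCA device; adjust for the actual lab.
SERVER, CLIENT, DEVICE = "node1", "node2", "mlx5_0"
MIN_GBITS = 80.0  # assumed pass threshold for a 100Gb link

# Start ib_write_bw in server mode on one node, then run the client
# against it; --report_gbits makes the numbers easy to compare.
server = subprocess.Popen(
    ["ssh", SERVER, "ib_write_bw", "-d", DEVICE, "--report_gbits"])
try:
    out = subprocess.run(
        ["ssh", CLIENT, "ib_write_bw", SERVER, "-d", DEVICE, "--report_gbits"],
        capture_output=True, text=True, check=True).stdout
finally:
    server.wait(timeout=60)

# The results table columns are: #bytes, #iterations, BW peak,
# BW average, MsgRate. Grab the average from the last data row.
rows = [line.split() for line in out.splitlines() if re.match(r"\s*\d", line)]
avg_gbits = float(rows[-1][3])

print(f"average bandwidth: {avg_gbits} Gb/s")
assert avg_gbits >= MIN_GBITS, "IB bandwidth below threshold"
```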
2018-10-25
Infiniband installation on OpenHPC tracking on HPC-351
...
2018-10-19
Contacting Mellanox for upstreaming OpenSM virtualisation
...
- In our lab and the main lab, so not an issue with our setup
- Haven't tested on others, will try next week
2018-10-12
IB perf automation going strong, just finished the Ansible module to parse results into JUnit XML for Jenkins
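The core of such a conversion might look like the following Python sketch (field names and the results structure are assumptions; the actual Ansible module wraps logic like this in Ansible's module boilerplate):

```python
import xml.etree.ElementTree as ET

def results_to_junit(results, suite="ib-perf"):
    """Render parsed benchmark results as JUnit XML for Jenkins.

    `results` is assumed to be a list of dicts like
    {"name": "ib_write_bw", "value": 91.8, "threshold": 80.0},
    where a value below the threshold counts as a failure.
    """
    failures = sum(1 for r in results if r["value"] < r["threshold"])
    suite_el = ET.Element("testsuite", name=suite,
                          tests=str(len(results)), failures=str(failures))
    for r in results:
        case = ET.SubElement(suite_el, "testcase",
                             classname=suite, name=r["name"])
        if r["value"] < r["threshold"]:
            ET.SubElement(case, "failure",
                          message=f"{r['value']} below {r['threshold']}")
    return ET.tostring(suite_el, encoding="unicode")

# Example: one passing and one failing bandwidth measurement.
print(results_to_junit([
    {"name": "ib_write_bw", "value": 91.8, "threshold": 80.0},
    {"name": "ib_read_bw", "value": 70.0, "threshold": 80.0},
]))
```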
...
- Will upstream to LLVM and GNU, so Arm can pick it up and release it in their compiler
- Working with Mellanox on v8.1 atomics on MPI libraries
- ISVs finally seem to be jumping on the bandwagon, doing local tests and validation
2018-10-11
Work on InfiniBand Ansible automation in the OpenHPC recipes starting now
...
Baptiste is doing the arduous (and very valuable) work of refactoring our Jenkins jobs, Mr-Provisioner's client and overall lab stability tasks.
Slides
2018-10-05
OpenSM still causing issues when setting up IB on the D03s
...
- Goal is to upload (some) results to OpenMPI's website
2018-08-31
Pak
Infiniband:
- Seems to be working on D05/D03 cluster, but due to the big difference between the two machines, it's not good for testing latency/bandwidth.
- If we had at least two D05s for cluster setup, it would be enough, but our other D05 runs Debian and benchmarks and doesn't have a Mellanox card.
- Action: to update HPC-294 with the tests and expected results to make sure IB is working and of good enough quality.
- Need to understand what we can do with our switch regarding the subnet manager, and whether we will have to use opensm (a quick way to check for an active SM is sketched after this list)
- Action: to work on HPC-292 trying to set up a subnet manager in the switch, and if that's not possible, adding the opensm setup to cluster provisioning
- Requested upstreaming for the feature needed for our clusters to work: socket direct and multi-host (see 2018-08-30), no response yet.
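To check whether any subnet manager is already answering on the fabric (switch-embedded or opensm), a hypothetical probe using sminfo from infiniband-diags could look like this:

```python
import subprocess

# sminfo (from infiniband-diags) queries the active subnet manager;
# it exits non-zero if no SM is running on the fabric.
try:
    out = subprocess.run(["sminfo"], capture_output=True, text=True,
                         check=True, timeout=10).stdout
    print("active subnet manager found:", out.strip())
except (subprocess.CalledProcessError, FileNotFoundError,
        subprocess.TimeoutExpired):
    print("no subnet manager reachable - opensm needed on a host")
```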
...
- Usually needs at least 4 servers for redundancy (two disks, metadata), but made it work on a single x86 machine, with server and client working
- Client builds and installs on Arm, but fails to communicate with the server. May be card issues (ConnectX5 on Arm vs X3/4 on x86).
- Building the server on Arm has some build issues (platform not recognised), possibly due to old autoconf scripts.
- Action: try different cards on the x86 side and try a newer autoconf script, update HPC-321
Renato
Replicated Pak's Amberwing setup with multi-node using MLNX_OFED drivers; it works fine, but the install process is cumbersome. Working to automate it.
...
That work will be updated in HPC-322.
2018-08-30
Takahiro
Had to move back to help Post-K development, so he didn't have time to continue working on the patch under upstream review.
The current patch doesn't help the other loops under investigation; those will need additional work later.
Takeharu
Having trouble with Infiniband setup, which has delayed adding support for IB configuration in the Ansible recipes.
...
We may need special handling in the Ansible recipes to choose which ones to install, or to leave that aside (i.e. not overwrite existing drivers).
Masaki
Progress on LLVM and HCQC work reported in his YVR18 slides. Will share the source, so that we can merge with other compiler work (Takahiro, Renato, TCWG?).
Renato
Infiniband progress in the lab:
...
- Client from whamcloud builds on Arm (both Huawei and Qualcomm) and packages install successfully
- Server needs kernel drivers that were removed from staging, so we will start with Intel server
- We don't have a spare x86_64 server, so we'll probably create a new VM on our admin server (really bad performance)
Slides