2018-08-31

Pak

Infiniband:

Lustre:

Renato

Replicated Pak's multi-node Amberwing setup using the MLNX_OFED drivers; it works fine, but the install process is cumbersome. Working to automate it.
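
A minimal sketch of what the automation could look like, assuming the stock MLNX_OFED tarball layout (the version and distro strings below are placeholder assumptions; `ofed_info` and `mlnxofedinstall` are the standard OFED tools). It only echoes the install command rather than running it:

```shell
#!/bin/sh
# Skip the install if an OFED stack is already present, otherwise
# point at the bundled installer. Version/distro strings are placeholders.
OFED_VER="4.4-1.0.0.0"
OFED_DIR="MLNX_OFED_LINUX-${OFED_VER}-rhel7.5-aarch64"

if command -v ofed_info >/dev/null 2>&1; then
    echo "OFED already installed: $(ofed_info -s)"
else
    # --add-kernel-support rebuilds the modules against the running kernel
    echo "Would run: ./${OFED_DIR}/mlnxofedinstall --add-kernel-support --without-fw-update"
fi
```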

Tried building the Lustre server on an x86 VM and hit some strange build errors (AVX512 instructions on a ~10-year-old server); may be the build's CPU auto-detection.

Baptiste said there's a way to copy the host's CPU features into the VM; will try that next. If it doesn't work, will try forcing configure options to disable AVX512.
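
If the host-feature copy Baptiste mentioned is the usual libvirt mechanism, it's a one-line change to the guest definition (a sketch; whether this is the exact approach he meant is an assumption):

```xml
<!-- In the domain XML (virsh edit <guest>): pass the host CPU model,
     including its feature flags, straight through to the VM, so the
     guest won't report AVX512 unless the old server actually has it. -->
<cpu mode='host-passthrough'/>
```

The plain-QEMU equivalent is `-cpu host`.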

That work will be updated in HPC-322.

2018-08-30

Takahiro

Had to move back to helping Post-K development, so didn't have time to continue work on the patch under upstream review.

The current patch doesn't help the other loops under investigation; those will need additional work later.

Takeharu

Having trouble with the Infiniband setup, which has delayed adding IB configuration support to the Ansible recipes.

Not getting full speed on the Mellanox fabric. It may help to use the auxiliary card on a PCIe lane managed by the second CPU, but that needs Socket Direct support (only available in the closed source drivers).
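
One quick check for the locality question above: sysfs exposes which NUMA node a PCI device hangs off, so you can see whether the card sits under the first or second CPU (a sketch; `ib0` is an assumed interface name):

```shell
#!/bin/sh
# Print the NUMA node owning the IB card; on a typical two-socket box,
# node 1 would mean the second CPU. "ib0" is an assumed interface name.
NODE_FILE=/sys/class/net/ib0/device/numa_node
if [ -r "$NODE_FILE" ]; then
    echo "ib0 is on NUMA node $(cat "$NODE_FILE")"
else
    echo "no ib0 on this host"
fi
```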

Would prefer to upstream the Ansible recipes into another repository (Linaro, OpenHPC) instead of his own repository being the upstream.

Post-K uses a custom Lustre client/server, so they don't have the same problems we do with the server's kernel modules.

Fujitsu will use the commercial version of the Mellanox drivers, but also wants the freedom to use the open source ones.

We may need special handling in the Ansible recipes to choose which ones to install, or to leave that decision aside (i.e. not overwrite existing drivers).
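
That handling could look roughly like this in the recipes (a sketch; the `ib_driver` variable and the `rdma-core` package name are assumptions, the modules used are stock Ansible):

```yaml
# Detect an existing MLNX_OFED install and leave it alone.
- name: Check for an existing MLNX_OFED install
  command: ofed_info -s
  register: ofed_check
  failed_when: false
  changed_when: false

# Only install the open source stack when nothing is there already
# and the user asked for it (ib_driver is a hypothetical variable).
- name: Install upstream RDMA drivers
  package:
    name: rdma-core
    state: present
  when: ofed_check.rc != 0 and ib_driver == 'upstream'
```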

Masaki

Progress on the LLVM and HCQC work is reported in his YVR18 slides. Will share the source so we can merge it with other compiler work (Takahiro, Renato, TCWG?).

Renato

Infiniband progress in the lab:

Following up with Mellanox to upstream required features:

Testing Lustre:

Slides