2019-01-11
Astra found race condition in RedHat kernel 4.14
- On rhash, "defer_work" function sometimes hang on 100% CPU for one core
- Hard to hit: ~4 nodes out of hundreds over many days running
- Will try different kernels (upstream 4.14, newer, ERP?)
- @Kevin to send more info, @Renato to contact the kernel guys, copy Sandia
Mellanox OFED 4.5 seems to shave off a few usec latency
- 2usec with 4.4, 1.8usec with 4.5, but building UCX/OpenMPI with TX2 opts makes it 1.5usec
- @Renato to ask Ilias again about upstreaming that to kernel, so that we can have it in OSS
Weird "kernel hung for 30 seconds" message in /var/log/system
- Sporadically, every few days on hundreds of nodes
- Toolchain has seem the same on APM, Tegra buildbots
- Seems to be when the nodes are 100% busy for too long
- Could be power scheduling?
- Not a big deal, though
2019-01-10
Lab move complete
- Arm machines working, network operational
...