2019-01-11

Astra found a race condition in the Red Hat 4.14 kernel

 - On rhash, the "defer_work" function sometimes hangs at 100% CPU on one core

 - Hard to hit: ~4 nodes out of hundreds over many days running

 - Will try different kernels (upstream 4.14, newer, ERP?)

 - @Kevin to send more info, @Renato to contact the kernel guys, copy Sandia

Mellanox OFED 4.5 seems to shave a fraction of a usec off latency

 - 2 usec with 4.4, 1.8 usec with 4.5; building UCX/OpenMPI with TX2 opts brings it down to 1.5 usec

 - @Renato to ask Ilias again about upstreaming that to the kernel, so that we can have it in OSS

Weird "kernel hung for 30 seconds" message in /var/log/system

 - Sporadically, every few days on hundreds of nodes

 - Toolchain has seen the same on APM, Tegra buildbots

 - Seems to be when the nodes are 100% busy for too long

 - Could be power scheduling?

 - Not a big deal, though

2019-01-10

Lab move complete

 - Arm machines working, network operational

...