Intermittent OpenCI LAVA Failure (timeouts)

Description

On some of our OpenCI test runs I see LAVA errors like

2021-09-15T14:44:45 lava-test-interactive-retry failed: 1 of 1 attempts. 'lava-test-interactive timed out after 900 seconds'
2021-09-15T14:44:45 lava-test-interactive timed out after 900 seconds

in LAVA logs such as https://ci-builds.trustedfirmware.org/static-files/gRa8QQWPi_HnlWwULQFsICHMgv-DmRkioSu8HUb2IYcxNjMxNzIzNTYxNDk5OjE2OmpvYW5uYWZhcmxleS1hcm06am9iL3RmLWEtYnVpbGRlci80Mzk3MjYvYXJ0aWZhY3Q=/lava.log, from an L2 run on a partner patch under review.
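For triage, the timeout lines quoted above can be picked out of a saved log with a short script. This is only a minimal sketch: the function name `find_timeouts` and the regular expression are illustrative, and it assumes the log lines look exactly like the two-line excerpt in this description (a real lava.log may be structured differently).

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt in this ticket, e.g.
#   2021-09-15T14:44:45 lava-test-interactive timed out after 900 seconds
# Assumption: timestamp, action name, then the fixed "timed out after N seconds" text.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<action>[\w-]+) timed out after (?P<secs>\d+) seconds$"
)


def find_timeouts(log_text):
    """Return (timestamp, action, seconds) for each timeout line found."""
    hits = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.match(line.strip())
        if m:
            hits.append((m.group("ts"), m.group("action"), int(m.group("secs"))))
    return hits


sample = """\
2021-09-15T14:44:45 lava-test-interactive-retry failed: 1 of 1 attempts. 'lava-test-interactive timed out after 900 seconds'
2021-09-15T14:44:45 lava-test-interactive timed out after 900 seconds
"""

hits = find_timeouts(sample)
print(hits)
# Count timeouts per action name, useful when scanning several logs at once.
print(Counter(action for _, action, _ in hits))
```

Only the second sample line is counted; the retry summary line is deliberately skipped so each timeout is reported once.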

An L2 job showing this https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/9856 shows 4 failures of this type.

On a rerun of the 4 failed tests, 3 of them passed (https://ci.trustedfirmware.org/job/tf-ci-gateway/13459/); the one that failed showed the above error.

On rerunning the single failed test, it failed again, failed once more on a further rerun, and finally passed on the rerun after that.

L1, L2 and main jobs all suffer this issue.

Do we know what’s going on in LAVA? This is not new and, as seen, can be worked around, but having to spend time rerunning failed tests because of what looks like a LAVA issue is time-consuming and annoying.

In further discussion, Leonardo provided the following input:

Yes, I have also observed similar behaviour. What I suspect is that when an L1 or L2 job is launched, the LAVA lab gets a burst of jobs to process; LAVA in turn processes these 8 at a time on the same physical machine, and at some point the execution of each job slows down, causing timeouts. The naive approach here is to increase the timeout value (it is now 900 seconds, 15 minutes), but I am not sure this is the best solution. Another option is to reduce the number of concurrent jobs so that, in theory, each job would be processed faster.
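The suspected mechanism can be illustrated with a toy model. This is a rough sketch and not a measurement of the lab: it assumes simple processor sharing, where each job slows down in proportion to how many jobs compete for the machine's cores. Only the 900-second timeout comes from this ticket; `SOLO_S`, `CORES`, and the function `estimated_wall_time` are hypothetical.

```python
def estimated_wall_time(solo_seconds, concurrent_jobs, effective_cores):
    """Toy estimate: jobs slow down once they outnumber the available cores."""
    slowdown = max(1.0, concurrent_jobs / effective_cores)
    return solo_seconds * slowdown


TIMEOUT_S = 900  # timeout reported in this ticket
SOLO_S = 500     # hypothetical: run time of one job on an otherwise idle machine
CORES = 4        # hypothetical: cores effectively available to the jobs

for concurrent in (2, 4, 8):
    wall = estimated_wall_time(SOLO_S, concurrent, CORES)
    status = "TIMEOUT" if wall > TIMEOUT_S else "ok"
    print(f"{concurrent} concurrent jobs: ~{wall:.0f}s per job ({status})")
```

Under these assumptions, lowering concurrency does not shorten the total time for the whole burst, but it does shorten each job's wall-clock time, which is what a per-job timeout measures. Raising the timeout instead would hide the slowdown without removing it.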

Environment

None

Engineering Progress Update

None

Attachments

uart1_full.txt (13 Oct, 2021)

model_log.txt (13 Oct, 2021)

build.log (13 Oct, 2021)

run-11.sh (13 Oct, 2021)

uart0_full-2.txt (13 Oct, 2021)

Activity

Glen Valante

February 15, 2022 at 10:32 PM

Bulk close of resolved issues.

Leonardo Sandoval

December 14, 2021 at 6:26 PM

All intermittent LAVA job failures have been fixed, together with some bugs found in the lava-expect scripts. If there is an issue in this area in the future, let's track it in a separate ticket.

Paul Sokolovskyy

December 13, 2021 at 8:34 PM

I checked 2 previously mentioned configs:

And didn’t see failures due to timeout after https://review.trustedfirmware.org/c/ci/tf-a-ci-scripts/+/12843 was merged.

So, from my side, this ticket is clear to be resolved. (And actually, @Leonardo Sandoval, I would suggest resolving it and then creating new, more specific tickets if such issues reappear.)

Leonardo Sandoval

December 6, 2021 at 6:00 PM

As I can’t run the original scripts as of now, I won’t rush that. My plan is to look into learning to run the original scripts during the holiday slowdown (unless I get other tasks, or vice versa, won’t have other tasks in the queue).

You can’t, but Arm CI can, so we can move in parallel at this point. Up to you, np.

Paul Sokolovskyy

December 6, 2021 at 5:47 PM

if you propose the latter, make sure you also do the same task of the (original) expect scripts

As I can’t run the original scripts as of now, I won’t rush that. My plan is to look into learning to run the original scripts during the holiday slowdown (unless I get other tasks, or vice versa, won’t have other tasks in the queue).

Delivered

Details

Assignee

Leonardo Sandoval (Deactivated)

Reporter

Joanna Farley

Labels

LAB

Upstream

Share Visibility

Dave Pigott

Don Harbin

Maria Högberg

Time tracking

1h logged

Components

LAB

Priority

Major

Created September 16, 2021 at 10:51 AM

Updated February 15, 2022 at 10:32 PM

Resolved December 14, 2021 at 6:26 PM