Intermittent OpenCI LAVA Failure (timeouts)

Description

On some of our OpenCI test runs I see LAVA errors like

2021-09-15T14:44:45 lava-test-interactive-retry failed: 1 of 1 attempts. 'lava-test-interactive timed out after 900 seconds'
2021-09-15T14:44:45 lava-test-interactive timed out after 900 seconds

in LAVA logs such as https://ci-builds.trustedfirmware.org/static-files/gRa8QQWPi_HnlWwULQFsICHMgv-DmRkioSu8HUb2IYcxNjMxNzIzNTYxNDk5OjE2OmpvYW5uYWZhcmxleS1hcm06am9iL3RmLWEtYnVpbGRlci80Mzk3MjYvYXJ0aWZhY3Q=/lava.log, taken from an L2 run on a partner patch under review.
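To tell this kind of failure apart from a genuine test failure when triaging, the timeout lines can be picked out of a downloaded lava.log. The following is a minimal sketch, assuming the log lines have exactly the shape quoted above; the names `TIMEOUT_RE` and `find_timeouts` are hypothetical and not part of LAVA or the CI scripts.

```python
import re

# Assumed line shape, taken from the log excerpt in this ticket:
#   2021-09-15T14:44:45 lava-test-interactive timed out after 900 seconds
# The "-retry failed: ..." summary line is deliberately not matched, so each
# timeout is counted once.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<action>[\w-]+) timed out after (?P<seconds>\d+) seconds\s*$"
)


def find_timeouts(log_text):
    """Return a list of (timestamp, action, seconds) for each timeout line."""
    hits = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.match(line.strip())
        if match:
            hits.append(
                (match.group("ts"), match.group("action"), int(match.group("seconds")))
            )
    return hits
```

Calling `find_timeouts(open("lava.log").read())` on a saved log would return an empty list for runs that did not hit the timeout, which makes it easy to separate reruns caused by this issue from other failures.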

An L2 job showing this (https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/9856) had 4 failures of this type.

On a rerun of the 4 failed tests, 3 of them passed (https://ci.trustedfirmware.org/job/tf-ci-gateway/13459/); the one that failed had the above error.

On rerunning the single failed test, it failed again, failed once more on a further rerun, and finally passed on the rerun after that.

L1, L2 and main jobs all suffer from this issue.

Do we know what’s going on in LAVA? This is not new and, as seen above, can be worked around, but having to spend time rerunning tests that failed due to what looks like a LAVA issue is time-consuming and annoying.

In further discussion, Leonardo provided the following input:

Yes, I have also observed similar behaviour. What I suspect is that when an L1 or L2 job is launched, the LAVA lab gets a burst of jobs to process. LAVA, in turn, processes these 8 at a time on the same physical machine, and at some point the execution of each job slows down, causing timeouts. The naive approach here is to increase the timeout value (currently 900 seconds, i.e. 15 minutes), but I am not sure this is the best solution. Another option is to reduce the number of concurrent jobs, which, in theory, would let them process faster.
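The trade-off between the two options can be illustrated with a toy model. This is only a sketch of the hypothesis above, not a measurement of the LAVA lab: it assumes the host shares its capacity equally between running jobs, and `work_seconds` and `host_slots` are hypothetical parameters. Only the figures of 8 concurrent jobs and the 900-second timeout come from this ticket.

```python
def wall_time(work_seconds, concurrent_jobs, host_slots):
    """Estimate one job's wall-clock time under equal sharing of the host.

    work_seconds:    time the job needs on an otherwise idle host (assumed)
    concurrent_jobs: jobs running at the same time on the host
    host_slots:      jobs the host can run at full speed (assumed)
    """
    # No slowdown until the host is oversubscribed; linear slowdown after that.
    slowdown = max(1.0, concurrent_jobs / host_slots)
    return work_seconds * slowdown


def times_out(work_seconds, concurrent_jobs, host_slots, timeout_seconds=900):
    """Return True if the estimated wall-clock time exceeds the timeout."""
    return wall_time(work_seconds, concurrent_jobs, host_slots) > timeout_seconds
```

With the made-up values `work_seconds=300` and `host_slots=2`, `times_out(300, 8, 2)` is true (estimated 1200 seconds), while both `times_out(300, 4, 2)` and `times_out(300, 8, 2, timeout_seconds=1800)` are false. Under this model either option avoids the timeout, but only reducing concurrency shortens the time each individual job takes.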

Environment

None

Engineering Progress Update

None

Attachments

5

Activity

Glen Valante 
February 15, 2022 at 10:32 PM

Bulk close of resolved issues.

Leonardo Sandoval 
December 14, 2021 at 6:26 PM

All intermittent LAVA job failures have been fixed, together with some bugs found in the lava-expect scripts. If there is an issue in this area in the future, let's track it as a separate ticket.

Paul Sokolovskyy 
December 13, 2021 at 8:34 PM

I checked 2 previously mentioned configs:

And didn’t see failures due to timeout after https://review.trustedfirmware.org/c/ci/tf-a-ci-scripts/+/12843 was merged.

So, from my side, this ticket is clear to be resolved. (And actually, I would suggest resolving it, and then creating new, more specific tickets if such issues reappear.)

Leonardo Sandoval 
December 6, 2021 at 6:00 PM

As I can’t run the original scripts as of now, I won’t rush that. My plan is to look into learning to run the original scripts during the holiday slowdown (unless I get other tasks or, vice versa, have no other tasks in the queue).

You can’t, but Arm CI can, so we can move in parallel at this point. Up to you, np.

Paul Sokolovskyy 
December 6, 2021 at 5:47 PM

if you propose the latter, make sure you also do the same task of the (original) expect scripts

As I can’t run the original scripts as of now, I won’t rush that. My plan is to look into learning to run the original scripts during the holiday slowdown (unless I get other tasks or, vice versa, have no other tasks in the queue).

Delivered

Labels

Upstream

Share Visibility

Dave Pigott
Don Harbin
Maria Högberg

Original estimate

Time tracking

1h logged

Created September 16, 2021 at 10:51 AM
Updated February 15, 2022 at 10:32 PM
Resolved December 14, 2021 at 6:26 PM