On some of our OpenCI test runs I see LAVA errors like
2021-09-15T14:44:45 lava-test-interactive-retry failed: 1 of 1 attempts. 'lava-test-interactive timed out after 900 seconds'
2021-09-15T14:44:45 lava-test-interactive timed out after 900 seconds
in LAVA logs such as https://ci-builds.trustedfirmware.org/static-files/gRa8QQWPi_HnlWwULQFsICHMgv-DmRkioSu8HUb2IYcxNjMxNzIzNTYxNDk5OjE2OmpvYW5uYWZhcmxleS1hcm06am9iL3RmLWEtYnVpbGRlci80Mzk3MjYvYXJ0aWZhY3Q=/lava.log, from an L2 run on a partner patch under review.
One L2 job showing this, https://review.trustedfirmware.org/c/TF-A/trusted-firmware-a/+/9856, had 4 failures of this type.
On a rerun of those 4 failed tests, 3 of them passed (https://ci.trustedfirmware.org/job/tf-ci-gateway/13459/); the one that still failed hit the above error.
On re-running that single failed test, it failed again, then again on a further rerun, and finally passed on the rerun after that.
L1, L2 and main jobs all suffer from this issue.
Do we know what is going on in LAVA? This is not new and, as seen, can be worked around, but having to spend time re-running tests that failed due to what looks like a LAVA issue is time-consuming and annoying.
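A quick way to spot this class of failure without opening every lava.log by hand is to scan the log text for the timeout signature quoted above. A minimal sketch (the regular expression is derived from the messages above; the file names passed on the command line are placeholders):

```python
import re
import sys

# Matches the interactive-test timeout messages quoted above, e.g.
# "lava-test-interactive timed out after 900 seconds".
TIMEOUT_RE = re.compile(
    r"lava-test-interactive(?:-retry)?\b.*timed out after (\d+) seconds"
)

def find_timeouts(log_path):
    """Return (line_number, line) pairs for interactive-test timeouts in a lava.log."""
    hits = []
    with open(log_path, errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if TIMEOUT_RE.search(line):
                hits.append((lineno, line.rstrip()))
    return hits

if __name__ == "__main__":
    # Usage: python find_lava_timeouts.py lava.log [more lava.log files...]
    for path in sys.argv[1:]:
        for lineno, line in find_timeouts(path):
            print(f"{path}:{lineno}: {line}")
```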
In further discussion, Leonardo provided the following input:
Yes, I have also observed similar behaviour. What I suspect is that when an L1/L2 job is launched, the LAVA lab gets a burst of jobs to process; LAVA processes these 8 at a time on the same physical machine, and at some point the execution of each job slows down, producing timeouts. The naive approach here is to increase the timeout value (currently 900 seconds, i.e. 15 minutes), but I am not sure that is the best solution. Another option is to reduce the number of concurrent jobs, which, in theory, would let each job complete faster.
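On the first option, the knob lives in the LAVA job definition's timeouts block (the 900 seconds above corresponds to an action-level 15 minutes). Below is a minimal sketch of raising it programmatically, assuming the standard LAVA timeouts layout; the exact keys used by the tf-a job templates may differ, and the example definition is hypothetical:

```python
import yaml  # pip install pyyaml

def bump_timeouts(job_yaml, minutes=30):
    """Raise the job- and action-level timeouts in a LAVA job definition.

    Assumes the definition carries LAVA's usual top-level block:
        timeouts:
          job:
            minutes: ...
          action:
            minutes: ...
    """
    job = yaml.safe_load(job_yaml)
    timeouts = job.setdefault("timeouts", {})
    for scope in ("job", "action"):
        timeouts[scope] = {"minutes": minutes}
    return yaml.safe_dump(job, sort_keys=False)

if __name__ == "__main__":
    # Cut-down example definition (hypothetical values).
    example = """\
device_type: fvp
job_name: tf-a-example
timeouts:
  job:
    minutes: 30
  action:
    minutes: 15
"""
    print(bump_timeouts(example, minutes=30))
```

As Leonardo notes, raising the timeout is the naive fix; reducing the per-machine concurrency addresses the slowdown itself.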
Environment
None
Engineering Progress Update
None
Attachments (5)
uart1_full.txt (13 Oct, 2021)
model_log.txt (13 Oct, 2021)
build.log (13 Oct, 2021)
run-11.sh (13 Oct, 2021)
uart0_full-2.txt (13 Oct, 2021)
Activity
Glen Valante
February 15, 2022 at 10:32 PM
Bulk close of resolved issues.
Leonardo Sandoval
December 14, 2021 at 6:26 PM
All LAVA intermittent jobs have been fixed, together with some bugs found in the lava-expect scripts. If there is an issue in this area in the future, let's track it as a separate ticket.
So, from my side, this ticket is clear to be resolved. (And actually, @Leonardo Sandoval, I would suggest resolving it and then creating new, more specific tickets if such issues reappear.)
Leonardo Sandoval
December 6, 2021 at 6:00 PM
> As I can't run the original scripts as of now, I won't rush with that. My plan is to look into learning to run the original scripts during the holiday slowdown (unless I get other tasks, or, vice versa, have no other tasks in the queue).
You can't, but Arm CI can, so we can move in parallel at this point. Up to you, np.
Paul Sokolovskyy
December 6, 2021 at 5:47 PM
> if you propose the latter, make sure you also do the same task for the (original) expect scripts
As I can't run the original scripts as of now, I won't rush with that. My plan is to look into learning to run the original scripts during the holiday slowdown (unless I get other tasks, or, vice versa, have no other tasks in the queue).