TF-A CI / tf-l3-code-coverage tests random failures

Description

Hi,

We have been experiencing random failures with the tf-l3-code-coverage test group since around mid-August.

This is currently resulting in the TF-A CI main job failing.

It is not always the same test in this group that is affected, but the failure modes look similar.

It seems that the job builds, runs and produces results, but never ends and eventually times out.

The LAVA log shows:

2023-08-31T01:29:17 covtrace-FVP_Base_RevC_2xAEMvA.cluster0.cpu0.log 88005504 13 4
2023-08-31T01:29:17 covtrace-FVP_Base_RevC_2xAEMvA.cluster0.cpu0.log 88005508 2 4
2023-08-31T01:29:17 covtrace-FVP_Base_RevC_2xAEMvA.cluster0.cpu0.log 8800550c 2 4
2023-08-31T01:29:17 covtrace-FVP_Base_RevC_2xAEMvA.cluster0.cpu0.log 88005510 3631 4
2023-08-31T01:29:17 Stopping container lava-1889970-2.1.2 from action run-fvp
2023-08-31T01:29:17 Calling: 'nice' 'docker' 'stop' 'lava-1889970-2.1.2'
2023-08-31T01:33:11 Failed to clean after action 'run-fvp': job timed out after 300 seconds
2023-08-31T01:33:11 Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 206, in cleanup
    child.cleanup(connection)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/actions/boot/fvp.py", line 347, in cleanup
    super().cleanup(connection)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/actions/boot/fvp.py", line 189, in cleanup
    return_value = self.run_cmd(["docker", "stop", self.container], allow_fail=True)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 674, in run_cmd
    proc.expect(pexpect.EOF)
  File "/usr/lib/python3/dist-packages/pexpect/spawnbase.py", line 343, in expect
    return self.expect_list(compiled_pattern_list,
  File "/usr/lib/python3/dist-packages/pexpect/spawnbase.py", line 372, in expect_list
    return exp.expect_loop(timeout)
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 169, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/lib/python3/dist-packages/pexpect/pty_spawn.py", line 500, in read_nonblocking
    if (timeout != 0) and select(timeout):
  File "/usr/lib/python3/dist-packages/pexpect/pty_spawn.py", line 450, in select
    return select_ignore_interrupts([self.child_fd], [], [], timeout)[0]
  File "/usr/lib/python3/dist-packages/pexpect/utils.py", line 143, in select_ignore_interrupts
    return select.select(iwtd, owtd, ewtd, timeout)
  File "/usr/lib/python3/dist-packages/lava_common/timeout.py", line 76, in _timed_out
    raise self.exception("%s timed out after %s seconds" % (self.name, duration))
lava_common.exceptions.JobError: job timed out after 300 seconds
2023-08-31T01:33:11 Failed to clean after action 'boot-fvp-main': Failed to clean after job
2023-08-31T01:33:11 Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 206, in cleanup
    child.cleanup(connection)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 844, in cleanup
    self.pipeline.cleanup(connection)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 215, in cleanup
    raise InfrastructureError("Failed to clean after job")
lava_common.exceptions.InfrastructureError: Failed to clean after job
2023-08-31T01:33:11 Failed to clean after action 'boot-fvp': Failed to clean after job
2023-08-31T01:33:11 Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 206, in cleanup
    child.cleanup(connection)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 844, in cleanup
    self.pipeline.cleanup(connection)
  File "/usr/lib/python3/dist-packages/lava_dispatcher/action.py", line 215, in cleanup
    raise InfrastructureError("Failed to clean after job")
lava_common.exceptions.InfrastructureError: Failed to clean after job
2023-08-31T01:33:11 InfrastructureError: The Infrastructure is not working correctly. Please report this error to LAVA admins.
2023-08-31T01:33:11 {'case': 'job', 'definition': 'lava', 'error_msg': 'Failed to clean after job', 'error_type': 'Infrastructure', 'result': 'fail'}

The LAVA job page does not give much more detail: https://tf.validation.linaro.org/scheduler/job/1889970
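
To put a number on how long the cleanup was stuck, here is a minimal Python sketch that measures the gap between the `docker stop` call and the cleanup failure in the log. The helper names `first_timestamp` and `seconds_between` are made up for illustration, and the embedded excerpt is just two lines copied from the log in this ticket.

```python
from datetime import datetime

# Two entries copied from the LAVA log in this ticket (one entry per line).
LOG_EXCERPT = """\
2023-08-31T01:29:17 Calling: 'nice' 'docker' 'stop' 'lava-1889970-2.1.2'
2023-08-31T01:33:11 Failed to clean after action 'run-fvp': job timed out after 300 seconds
"""


def first_timestamp(log: str, needle: str) -> datetime:
    """Return the timestamp of the first log line containing `needle`."""
    for line in log.splitlines():
        if needle in line:
            # Each entry starts with an ISO 8601 timestamp followed by a space.
            return datetime.fromisoformat(line.split(" ", 1)[0])
    raise ValueError(f"no log line contains {needle!r}")


def seconds_between(log: str, start_needle: str, end_needle: str) -> float:
    """Elapsed seconds between the first lines matching each needle."""
    start = first_timestamp(log, start_needle)
    end = first_timestamp(log, end_needle)
    return (end - start).total_seconds()


if __name__ == "__main__":
    gap = seconds_between(LOG_EXCERPT, "Calling:", "Failed to clean after action")
    print(f"docker stop was pending for {gap:.0f} s before the timeout fired")
```

For this excerpt the gap is 234 seconds, which is less than the 300 seconds named in the error. That would be consistent with the 300 s covering more than the `docker stop` call alone, although the log does not say so explicitly.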

Examples of failing jobs from the last few days:

 

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/47109/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/47046/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46966/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46898/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46720/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46544/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46390/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46352/

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/46275/

Environment

None

Engineering Progress Update

None

Activity

Paul Sokolovskyy 
October 5, 2023 at 11:35 AM

:

This issue does not seem to happen any longer (after the LAVA upgrade and a few adjustments we did for this test group), so I believe we could conclude and close this ticket.

First of all, sorry for the lack of updates; this issue was backlogged by Linaro’s offsite meeting and then my vacation. I actually tried to look into it quickly before my vacation, but what I saw at that time seemed like exactly the kind of issue that could be attributed to the LAVA upgrade, so I put off looking into it as it seemed rather involved. It’s a miracle that it resolved itself in the meantime; I guess it’s actually down to the tweaks you guys made.

So, for completeness, looking at https://ci.trustedfirmware.org/job/tf-a-main/857/ , we have

Code coverage: tf-a-ci-gateway (tf-l3-code-coverage), build #50029 (15 min)

(15 mins! How cool is that!) And, importantly, the actual code coverage stats are there and look sane: https://ci.trustedfirmware.org/job/tf-a-ci-gateway/50029/ (what I saw previously looked as if the trace data, the source for codecov, wasn’t being produced on the LAVA side).

So, as long as you have sanity-checked it too, I guess we can indeed close it, thanks!

Olivier Deprez 
October 4, 2023 at 9:04 AM

This issue does not seem to happen any longer (after the LAVA upgrade and a few adjustments we did for this test group), so I believe we could conclude and close this ticket, if you agree.

Olivier Deprez 
September 13, 2023 at 10:03 AM
(edited)

The last couple of runs, since Sep 7th, no longer seem to have random failures, but the job is still failing, perhaps for another reason:

https://ci.trustedfirmware.org/job/tf-a-ci-gateway/48076/console

00:12:38.441 Writing directory view page.
00:12:38.441 Overall coverage rate:
00:12:38.441 lines......: 69.0% (3128 of 4531 lines)
00:12:38.441 functions..: 58.9% (622 of 1056 functions)
00:12:38.441 branches...: 55.5% (826 of 1489 branches)
00:12:38.444 ++ generate_header /home/buildslave/workspace/tf-a-ci-gateway/report.html
00:12:38.444 ++ local cov_html=/home/buildslave/workspace/tf-a-ci-gateway/merge/outdir/lcov/index.html
00:12:38.444 ++ local out_report=/home/buildslave/workspace/tf-a-ci-gateway/report.html
00:12:38.444 ++ python3 -
00:12:38.479 Traceback (most recent call last):
00:12:38.479 File "<stdin>", line 17, in <module>
00:12:38.479 FileNotFoundError: [Errno 2] No such file or directory: '/home/buildslave/workspace/tf-a-ci-gateway/merge/outdir/lcov/index.html'
00:12:38.504 Build step 'Execute scripts' changed build result to FAILURE
00:12:38.504 Build step 'Execute scripts' marked build as failure
00:12:38.505 Archiving artifacts
00:12:45.947 Finished: FAILURE
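
The failure above is a bare `FileNotFoundError` from an inline `python3 -` script. One way to make this kind of failure easier to read is to check for the report before using it. This is only a sketch and not the actual CI script: the function name `find_coverage_index` is hypothetical, and the directory layout is taken from the path in the console log.

```python
import sys
from pathlib import Path


def find_coverage_index(workspace: Path) -> Path:
    """Return the lcov index.html under `workspace`, or raise a descriptive error."""
    # Layout assumed from the console log: <workspace>/merge/outdir/lcov/index.html
    index = workspace / "merge" / "outdir" / "lcov" / "index.html"
    if not index.is_file():
        raise FileNotFoundError(
            f"lcov HTML report not found at {index}; "
            "check where the coverage report was actually written"
        )
    return index


if __name__ == "__main__":
    # Workspace defaults to the current directory when no argument is given.
    workspace = Path(sys.argv[1]) if len(sys.argv) > 1 else Path.cwd()
    try:
        print(find_coverage_index(workspace))
    except FileNotFoundError as err:
        print(f"error: {err}", file=sys.stderr)
        sys.exit(1)
```

Run against the job workspace, this would either print the report path or exit with a message that names the expected location, which is a little more explicit than the raw traceback.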

Benjamin Copeland 
September 12, 2023 at 12:53 PM

FYI LAVA upgrade is done.

Paul Sokolovskyy 
September 1, 2023 at 9:55 AM

:

Ok but let’s be careful from now on as we’ll start our pre-release activities for an Oct/Nov

Added a note to https://linaro.atlassian.net/browse/STG-4919 . The current plan is to upgrade next week, so it should be a sustainable plan.

Fixed

Created August 31, 2023 at 9:08 AM
Updated October 23, 2023 at 1:44 PM
Resolved October 5, 2023 at 11:35 AM
