In-development systems (like jenkins.openci-test.arm.com) should not have automatically triggered jobs running, and should avoid side effects

Description

Att: ,

This is a matter which was brought up a few times already, e.g. I believe I mentioned it around our first meeting, and most recently in the response to Sufyan on ECLAIR matters, but I appreciate that it’s one of the many “bootstrap” tasks whose priority is easy to overlook. So:

It’s a rule for OpenCI systems that any non-production system should avoid automatically running jobs which may put noticeable load on the CI backend. The CI backend includes the test services (LAVA and TuxSuite) and the ECLAIR license server, among possible others. That’s because most backend services are either scarce or metered resources. The production server alone generates enough load for the backend. In fact, it was a common problem that peak usage patterns overloaded the backend, with the entire system going into a “death spiral” due to positive feedback. A lot of effort over the last couple of years was put into making sure that pattern is broken and the backend is tuned to cope with typical production load patterns. However, if non-production systems add load, the balance may be broken.

Thus, the pattern we follow is that non-production job configs should be patched to disable any triggers (we commonly use 2 types: timed (aka cron) triggers and Gerrit review triggers). Instead, on non-production systems, which are used for development and testing, jobs should be started manually on a case-by-case basis by the individual developers working on particular tasks.
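As a minimal sketch of the patching step: assume a job config has already been parsed into a Python dict, with automatic triggers listed under a top-level `triggers` key. That key name, the `timed`/`gerrit` entries, and the `disable_triggers` helper are illustrative assumptions, not taken from the actual OpenCI repos.

```python
import copy


def disable_triggers(job):
    """Return a copy of a parsed job config with all automatic triggers removed.

    Assumption: triggers live under a top-level "triggers" key.
    The real job config layout may differ.
    """
    patched = copy.deepcopy(job)
    patched.pop("triggers", None)
    return patched


# Hypothetical job definition with a timed (cron) and a Gerrit review trigger.
example_job = {
    "name": "tf-a-main",
    "triggers": [
        {"timed": "H 2 * * *"},
        {"gerrit": {"trigger-on": ["patchset-created-event"]}},
    ],
    "builders": [{"shell": "echo build"}],
}

patched_job = disable_triggers(example_job)
```

The original dict is left untouched, so the same helper could be run over a whole set of configs and the result reviewed before committing anything.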

Additionally, some jobs in the CI have a large fan-out factor: starting one job can spawn 100 sub-jobs, and that’s what causes the actual load. Running such jobs should be avoided unless strictly (and rarely enough) necessary. Instead, a single individual sub-job (running CI for a single configuration, for example) should be routinely used, or a higher-level job running ~5 configs instead of 100.
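The fan-out guideline could be enforced with a small guard before a manual run. This is only a sketch: the `check_fanout` helper and the config names are hypothetical, and the default cap of 5 simply mirrors the “~5 configs” figure above.

```python
def check_fanout(requested_configs, cap=5):
    """Refuse to start a manual run that would fan out to too many configs.

    Returns the requested configs as a list when within the cap,
    otherwise raises ValueError.
    """
    requested = list(requested_configs)
    if len(requested) > cap:
        raise ValueError(
            f"{len(requested)} configs requested, cap is {cap}; "
            "run a smaller subset on non-production systems"
        )
    return requested


# Hypothetical config names chosen by a developer for one task.
to_run = check_fanout(["config-a", "config-b", "config-c"])
```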

All these rules are followed on the current OpenCI staging instance, https://ci.staging.trustedfirmware.org/ . This ticket is to ensure that jenkins.openci-test.arm.com, and any other OpenCI system set up at Arm, follows the same protocol while it’s in the “development and testing” phase. When it is ready for end-to-end integration testing of actual production-level jobs, that should be done by special arrangement.

Comments provide discussion of various points on how to achieve that.

Environment

None

Engineering Progress Update

None

Activity

Arthur She December 18, 2024 at 4:05 AM

Hi, I think it’s a guideline for migrating jobs, and I already know it. We can close it now.

Karen Power December 17, 2024 at 6:26 AM

Hi, have you had a chance to look at this ticket yet? Thanks.

Saheer Babu December 6, 2024 at 11:34 AM

Thanks for providing the existing workflow.

>We don’t need a meeting to disable triggers once, but we need to come up with a solution which will keep them disabled sustainably.

Okay, understood, please create one for Tuesday 9:30.

Paul Sokolovskyy December 6, 2024 at 10:23 AM

I’ll describe how it’s done currently for OpenCI staging (arguably, not in an ideal way)

Ok, I think this is important information, so I’m going ahead. That’s because OpenCI consists of both production and staging servers, so the migration would (eventually) include setup of both. And as we touch on how to make sure that production is not suffocated by other instances, it’s the right time to bring it up. So:

There are production job config repos under:

One each for TF-A, TF-M, and admin/maint jobs respectively. Then there’s a corresponding “staging” counterpart under the “next/” namespace (the original idea is that staging is where things are developed/tested for the next version of production):

How the staging repos were initially set up should be assumed to be: a) the production repo was cloned; b) changes were applied to disable automatic triggers, redact email notifications, and make other ad hoc changes, as discussed in this ticket.

How the staging repos are intended to be used: changes to the production-named jobs are not recommended, for an obvious reason: if multiple people do that in parallel, they conflict with each other. Instead, individual developers are expected to make a copy of the upstream job with their username as a prefix (e.g. I’d copy tf-a-main to pfalcon-tf-a-main), and make further needed changes there.
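The personal-copy convention can be sketched as follows. The helper names and the one-file-per-job layout with a `.yaml` extension are assumptions for illustration. Any job name stored inside the file would still need to be edited by hand.

```python
import shutil
from pathlib import Path


def personal_job_name(username, job_name):
    """Prefix a job name with the developer's username."""
    return f"{username}-{job_name}"


def copy_job_file(job_file, username):
    """Copy a job file next to the original under a username-prefixed name.

    Assumption: each job lives in its own file named after the job.
    Only the file is copied; its contents are not rewritten.
    """
    job_file = Path(job_file)
    target = job_file.with_name(personal_job_name(username, job_file.name))
    shutil.copyfile(job_file, target)
    return target
```

With this naming, `personal_job_name("pfalcon", "tf-a-main")` yields `pfalcon-tf-a-main`, matching the example above.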

Now, how the staging repos are maintained. From time to time, someone (and that’s usually me) syncs some of the jobs manually. Literally, if I need to debug a TF-A issue, I start with syncing the TF-A job set to the staging repos. I just copy over the files, then do git co -p or git add -p to keep only relevant changes without disturbing earlier applied fixups. Then I copy the files over my personal copies and repeat the process of selecting/discarding changes.
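Before copying files over, it may help to see which job files differ at all between the two checkouts. The sketch below uses Python’s standard `filecmp` module. The directory arguments, the `*.yaml` pattern, and the `jobs_needing_sync` helper are assumptions. Files that differ only because of intentional staging fixups will show up too, so the result is a list of candidates for the interactive `git add -p` review, not a replacement for it.

```python
import filecmp
from pathlib import Path


def jobs_needing_sync(prod_dir, staging_dir, pattern="*.yaml"):
    """List production job files that are missing from, or differ in, staging.

    Compares file contents byte by byte (shallow=False), not just metadata.
    """
    prod_dir, staging_dir = Path(prod_dir), Path(staging_dir)
    changed = []
    for prod_file in sorted(prod_dir.glob(pattern)):
        staging_file = staging_dir / prod_file.name
        if not staging_file.exists() or not filecmp.cmp(
            prod_file, staging_file, shallow=False
        ):
            changed.append(prod_file.name)
    return changed
```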

Syncing all personal jobs is not part of the process; it’s up to individual developers to sync up their jobs if, and when, they need them again.

That’s it. The process is manual and can be called cumbersome. There are definitely thoughts of automating it somehow, but that was never requested. If anything, I’m surprised that few people complain about it, which means that everyone is familiar with the process and understands the pros and cons of further changes (e.g. a manual sync is actually a chance for additional review of the accumulated changes; it also means that some stupid, hastily hacked-up script won’t overwrite your precious changes behind your back, etc.).

So, I obviously don’t suggest that maintenance of the branch for http://jenkins.openci-test.arm.com should follow it. Instead, I just show what process has “always” (definitely before me) been in place to maintain the requirement that staging is synced up with prod somehow, while forbidden elements are redacted, and the git workflow stays the one familiar to everyone. If, as seems to be hinted at, you’re ok with following a workflow where the upstream git branch is rebased to keep its maintenance lean and mean, at the expense of everyone being required to use “git pull --rebase” with it, I’m all for it; I’ve wanted to see an alternative approach for a while.

Paul Sokolovskyy December 6, 2024 at 6:57 AM

>I have disabled the triggers.

Thanks! Triggers were the immediate concern, but there may be other configuration bits which require changing while testing. The wide category is “post-build notifications/handlers”. A specific example is [build results] email notifications. I don’t say they should be disabled right away, but if actual TF developers get “not real” (not production) notifications, they may get confused at first, and annoyed if they come in volume. I’d ask the TF development team about it.

Another example of a “post-build handler” which we have somewhere in the TF-A loop (at the tf-a-main level IIRC) is that if the build passes, the development branch is merged into the main branch. Of course, that would need to be disabled too. I can’t remember other things right away, but that doesn’t mean there are none.

So, stay tuned for the description of how it’s handled for OpenCI staging, which is not ideal again, but all the small details and corner cases like the above are the reason it is how it is.

Done

Details

Assignee

Reporter

Upstream

No

Priority

Created December 5, 2024 at 2:03 PM
Updated December 18, 2024 at 4:05 AM
Resolved December 18, 2024 at 4:05 AM