Add optional keep-alive to reconnect if the Jenkins controller forgot about the running agent #749
Open
jimklimov wants to merge 13 commits into jenkinsci:master
…laveExists` on server side, to reconnect an agent dropped by the server unilaterally [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
…196] Signed-off-by: Jim Klimov <[email protected]>
… than default "INFO" [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
… stderrLogPath, and provide logContains() methods for String and regex.Pattern [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
… test case from unilateral loss by server (while client thinks it is connected), and from the checks for probe-sending [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
Good to know in testing whether the client crashed or was told to exit. Signed-off-by: Jim Klimov <[email protected]>
219a9d0 to a61636c
jimklimov added a commit to jimklimov/jenkins-swarm-nutci that referenced this pull request (Jan 22, 2026)
Introduced by jenkinsci/swarm-plugin#749 Signed-off-by: Jim Klimov <[email protected]>
… ungraceful disconnection test code [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
Signed-off-by: Jim Klimov <[email protected]>
…connected Node and Computer are same as seen before (should not be) [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
…t from here; make @link into plain @code [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
… and associated Computer are not null [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
7c86403 to 3edf800
…tsDissociatedComputer() [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
27d039c to 3a9b024
Contributor, Author
https://ci.jenkins.io/job/Plugins/job/swarm-plugin/job/PR-749/8/pipeline-overview/?selected-node=87 This does not seem related to the new feature and its tests?
…D if associated computer is null [JENKINS-75196] Signed-off-by: Jim Klimov <[email protected]>
3a9b024 to 13a4fe9
Contributor, Author
Ready for review from my side. The incrementals' HPI and JAR are deployed to the server and its agents where the problem was manifesting, to gather some real-life results too.
Originally posted as https://issues.jenkins.io/browse/JENKINS-75196
There are practical situations that bite me often enough to care: the swarm agent keeps running and deems itself connected, while the server no longer lists it among the nodes. The reasons for this vary; for me it most often manifests due to severe lags with near-full ZFS, or to access point restarts with agents at home and the server in the cloud.
The new keep-alive ability is optional, off by default (no-op until plugin users enable it on the client side).
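To make the intended behavior concrete, here is a minimal sketch of a single keep-alive round. It is not the PR's code: the class `KeepAliveSketch`, the `Action` enum, and the `controllerKnowsAgent` probe are hypothetical names, and the probe merely stands in for whatever request the client sends to the server-side check. The sketch only assumes what is stated above: the feature does nothing unless enabled, and a reconnect is wanted when the controller no longer knows the agent.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names come from the PR itself.
final class KeepAliveSketch {
    /** Outcome of a single keep-alive round. */
    enum Action { NONE, RECONNECT }

    private final boolean enabled;                      // assumed to default to false (opt-in)
    private final BooleanSupplier controllerKnowsAgent; // stand-in for the server-side existence check

    KeepAliveSketch(boolean enabled, BooleanSupplier controllerKnowsAgent) {
        this.enabled = enabled;
        this.controllerKnowsAgent = controllerKnowsAgent;
    }

    /** Runs one round: a no-op when disabled, otherwise asks the controller. */
    Action probeOnce() {
        if (!enabled) {
            return Action.NONE; // the probe is never sent when the feature is off
        }
        // The agent believes it is connected; reconnect only if the controller disagrees.
        return controllerKnowsAgent.getAsBoolean() ? Action.NONE : Action.RECONNECT;
    }
}
```

A client loop could call `probeOnce()` on a timer and start its normal reconnection path whenever the result is `RECONNECT`. How a failed or timed-out probe should be treated is left open in this sketch.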
Testing done
Added unit tests that reproduce the behavior and pass only with the proposed reconnection code.
Enhanced logging support in `SwarmClientRule` so the agent logs can be not only seen, but also searched and matched as part of the checks.
Live testing on the deployment that suffered from the problem is to begin shortly.
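The log matching described above could look roughly like the following sketch. The commit messages mention `logContains()` methods for `String` and `regex.Pattern`; everything else here (the class name `LogMatchSketch`, the constructor, reading the whole file on each call) is an assumption for illustration and not the actual `SwarmClientRule` code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper; the real test rule may be structured differently.
final class LogMatchSketch {
    private final Path logPath; // e.g. the agent's stdout or stderr log file

    LogMatchSketch(Path logPath) {
        this.logPath = logPath;
    }

    /** Reads the whole log as UTF-8; adequate for short test logs. */
    private String readLog() {
        try {
            return Files.readString(logPath);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the literal text occurs anywhere in the log. */
    boolean logContains(String needle) {
        return readLog().contains(needle);
    }

    /** True if the pattern matches any part of the log (find, not a full match). */
    boolean logContains(Pattern pattern) {
        return pattern.matcher(readLog()).find();
    }
}
```

A test can then assert on the agent's output, for example that a probe was sent or that a reconnection happened, rather than relying on a human reading the log.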
Submitter checklist
Other notes
For full disclosure, the core stub of changes (especially the server-side handling of `isCheckSlaveExistsSupported()`, and its use from the client) was suggested by AI, as I was testing whether it can make useful code. From my layman description, it worked out a surprisingly reasonable solution - the sort of logic which I wanted to have but did not really know how to achieve ;) This generated code served as a stump for lots of manual chiseling in the following days, mostly on tests to make sure I can reproduce the "ungraceful" forgetting of the agent (ended up using reflection a bit). Possibly the short and somewhat wrong first version of the tests also started there by AI.

I am concerned that the new server endpoint may want some Authorization protection (e.g. so that only swarm-agent users may poke it?), but this could complicate the checks and so far I do not know how to add that. Maybe a task for another PR, or a reason for another talk with AI ;) First of all - is it needed? Many other endpoints seem anonymous anyway.