Add optional keep-alive to reconnect if the Jenkins controller forgot about the running agent#749

Open
jimklimov wants to merge 13 commits into jenkinsci:master from jimklimov:JENKINS-75196

Contributor

jimklimov commented Jan 22, 2026

Originally posted as https://issues.jenkins.io/browse/JENKINS-75196

There are practical situations, which bite me often enough to care, where the swarm agent keeps running and considers itself connected while the server no longer lists it among the nodes. The reasons vary: for me it most often happens due to severe lags with near-full ZFS, or to access point restarts when the agents are at home and the server is in the cloud.

The new keep-alive ability is optional, off by default (no-op until plugin users enable it on the client side).
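
To illustrate the idea, here is a minimal sketch of an opt-in keep-alive check. It is not the code from this PR: the class `KeepAliveSketch`, the `controllerKnowsAgent` predicate (a stand-in for whatever probe the client sends to the controller), and the `reconnect` callback are all hypothetical names chosen for the example.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the swarm plugin.
class KeepAliveSketch {
    private final String agentName;
    // Assumption: answers "does the controller still list this agent?"
    private final Predicate<String> controllerKnowsAgent;
    // Assumption: whatever the client does to register itself again.
    private final Runnable reconnect;
    // Off by default in the PR, so the caller has to opt in.
    private final boolean enabled;

    KeepAliveSketch(String agentName, Predicate<String> controllerKnowsAgent,
                    Runnable reconnect, boolean enabled) {
        this.agentName = agentName;
        this.controllerKnowsAgent = controllerKnowsAgent;
        this.reconnect = reconnect;
        this.enabled = enabled;
    }

    /** Runs one probe; returns true only if a reconnect was triggered. */
    boolean checkOnce() {
        if (!enabled) {
            return false; // feature disabled: strictly a no-op
        }
        if (controllerKnowsAgent.test(agentName)) {
            return false; // controller still lists the agent
        }
        reconnect.run(); // controller forgot the agent
        return true;
    }

    /** Schedules periodic probes; the caller shuts the scheduler down. */
    ScheduledExecutorService start(long intervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log and continue.
                System.err.println("keep-alive probe failed: " + e);
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Keeping the single probe in `checkOnce()` separate from the scheduling makes the disabled, "still known", and "forgotten" cases easy to exercise in a unit test without waiting on timers.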

Testing done

Added unit tests that reproduce the behavior and pass with the proposed reconnection code.

Improved log handling in SwarmClientRule so that the agent logs can be not only viewed, but also searched and matched as part of the checks.
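
As a rough illustration of log matching in a test, the sketch below checks a log file for a literal string or a regular expression. The class `LogMatchSketch` and the `Path`-based signatures are assumptions made for this example; the actual methods in SwarmClientRule are not shown in this conversation and may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper: signatures are assumed, not taken from SwarmClientRule.
class LogMatchSketch {
    /** True if the log file contains the literal text anywhere. */
    static boolean logContains(Path log, String needle) throws IOException {
        return Files.readString(log, StandardCharsets.UTF_8).contains(needle);
    }

    /** True if the pattern matches any part of the log file. */
    static boolean logContains(Path log, Pattern pattern) throws IOException {
        return pattern.matcher(Files.readString(log, StandardCharsets.UTF_8)).find();
    }
}
```

Reading the whole file on each call is simple and adequate for short test logs; a long-running agent log would call for streaming lines instead.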

Live testing on the affected deployment is to begin shortly.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

Other notes

For full disclosure, the core stub of the changes (especially the server-side handling of isCheckSlaveExistsSupported(), and its use from the client) was suggested by AI while I was testing whether it can produce useful code. From my layman's description, it worked out a surprisingly reasonable solution: the sort of logic I wanted to have but did not really know how to achieve ;) This generated code served as a stump for lots of manual chiseling over the following days, mostly on tests to make sure I can reproduce the "ungraceful" forgetting of the agent (I ended up using reflection a bit). The short and somewhat wrong first version of the tests possibly also originated from the AI.

I am concerned that the new server endpoint may need some authorization protection (e.g. so that only swarm-agent users may poke it?), but this could complicate the checks and so far I do not know how to add it. Maybe that is a task for another PR, or a reason for another talk with AI ;) First of all, is it needed? Many other endpoints seem to be anonymous anyway.

…laveExists` on server side, to reconnect an agent dropped by the server unilaterally [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
… than default "INFO" [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
… stderrLogPath, and provide logContains() methods for String and regex.Pattern [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
… test case from unilateral loss by server (while client thinks it is connected), and from the checks for probe-sending [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
Good to know in testing whether the client crashed or was told to exit.

Signed-off-by: Jim Klimov <[email protected]>
@jimklimov jimklimov requested a review from a team as a code owner January 22, 2026 20:37
jimklimov added a commit to jimklimov/jenkins-swarm-nutci that referenced this pull request Jan 22, 2026
… ungraceful disconnection test code [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
…connected Node and Computer are same as seen before (should not be) [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
…t from here; make @link into plain @code [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
… and associated Computer are not null [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
…tsDissociatedComputer() [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
Contributor Author

jimklimov commented Jan 25, 2026

https://ci.jenkins.io/job/Plugins/job/swarm-plugin/job/PR-749/8/pipeline-overview/?selected-node=87

21:42:07  [ERROR] Failures: 
21:42:07  [ERROR]   SwarmClientIntegrationTest.keepAliveReconnectsRemoved Unable to clean up temporary folder Z:\Temp\junit16713350743250440301

This does not seem related to the new feature or its tests, does it?

…D if associated computer is null [JENKINS-75196]

Signed-off-by: Jim Klimov <[email protected]>
Contributor Author

Ready for review from my side. The incrementals' HPI and JAR are deployed to the server and its agents where the problem was manifesting, to gather some real-life results too.
