Jenkins 31084 graceful shutdown by akomakom · Pull Request #115 · jenkinsci/swarm-plugin

akomakom · 2019-06-05T20:58:53Z

Work in progress. Seems to be working so far.

Question: should doGetSlaveReadyForShutdown also check secrets/permissions?

basil

This looks great so far! I like the use of a ShutdownHook, and the backend APIs are pretty much exactly what I expected.

Have you given any thought as to how this functionality might be activated or deactivated? The Java API states:

Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the virtual machine is terminated due to user logoff or system shutdown the underlying operating system may only allow a fixed amount of time in which to shut down and exit. It is therefore inadvisable to attempt any user interaction or to perform a long-running computation in a shutdown hook.

For this reason as well as to preserve backwards compatibility of the existing command-line API, I think this new functionality should probably be opt-in rather than opt-out by default. What do you think?
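
A minimal sketch of what opt-in could look like, assuming a hypothetical boolean option (called `gracefulShutdown` here; it is not an existing swarm-client flag) and a hypothetical `drain` action. The hook is only registered when the option is set, so the default command-line behavior stays unchanged.

```java
import java.util.Optional;

/** Sketch only: registers a shutdown hook when the (hypothetical) option is enabled. */
final class OptInShutdown {
    private OptInShutdown() {}

    /**
     * Returns the registered hook thread, or an empty Optional when the
     * option is off and nothing was registered.
     */
    static Optional<Thread> registerIfEnabled(boolean gracefulShutdown, Runnable drain) {
        if (!gracefulShutdown) {
            return Optional.empty(); // default: existing behavior, no hook
        }
        Thread hook = new Thread(drain, "graceful-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return Optional.of(hook);
    }
}
```

Returning the thread lets the caller remove the hook again with `Runtime.removeShutdownHook` if the client disconnects normally.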

basil · 2019-06-06T02:37:54Z

client/src/main/java/hudson/plugins/swarm/Client.java

+
+
+    public static class ShutdownHook extends Thread {
+        private SwarmClient client;


[nit] These can be private final.

basil · 2019-06-06T02:42:44Z

client/src/main/java/hudson/plugins/swarm/Client.java

+            try {
+                client.takeSlaveOfflineAndWait(target, "JVM is shutting down");
+            } catch (Exception e) {
+                logger.log(Level.SEVERE, "Unable to perform graceful slave shutdown: ", e);


What would happen if you started a shutdown operation (e.g., by sending a SIGINT with Ctrl-C or a SIGTERM with kill(1)) and then sent another signal (e.g., another SIGINT by pressing Ctrl-C again) while we were in the middle of takeSlaveOfflineAndWait()? I think the expected behavior is that we would stop immediately without continuing to wait in takeSlaveOfflineAndWait(). I think that your current code probably already handles this in that an InterruptedException would be thrown (and you are catching Exception here). It might be worth validating this with a manual test. Even if the code works as expected, might it be worth printing an additional message in this case (e.g., "terminating with extreme prejudice")?
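
One way to make the interrupted case visible is to catch `InterruptedException` separately from other failures. The sketch below assumes a hypothetical `OfflineWaiter` interface standing in for `takeSlaveOfflineAndWait()`. Whether a second signal actually interrupts the hook thread is not established here and would still need the manual test.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only: a hook body that reports interruption separately from other errors. */
final class InterruptAwareHook implements Runnable {
    /** Hypothetical stand-in for SwarmClient.takeSlaveOfflineAndWait(...). */
    interface OfflineWaiter {
        void takeOfflineAndWait(String reason) throws Exception;
    }

    enum Outcome { NOT_RUN, COMPLETED, INTERRUPTED, FAILED }

    private static final Logger logger =
            Logger.getLogger(InterruptAwareHook.class.getName());

    private final OfflineWaiter waiter;
    private volatile Outcome outcome = Outcome.NOT_RUN;

    InterruptAwareHook(OfflineWaiter waiter) {
        this.waiter = waiter;
    }

    @Override
    public void run() {
        try {
            waiter.takeOfflineAndWait("JVM is shutting down");
            outcome = Outcome.COMPLETED;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop waiting immediately.
            Thread.currentThread().interrupt();
            logger.log(Level.WARNING,
                    "Interrupted while waiting for builds to finish; exiting without waiting further");
            outcome = Outcome.INTERRUPTED;
        } catch (Exception e) {
            logger.log(Level.SEVERE, "Unable to perform graceful slave shutdown", e);
            outcome = Outcome.FAILED;
        }
    }

    Outcome outcome() {
        return outcome;
    }
}
```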

basil · 2019-06-06T02:43:31Z

plugin/src/main/java/hudson/plugins/swarm/PluginImpl.java

+
+        node.toComputer().setTemporarilyOffline(true, new OfflineCause.UserCause(User.current(), reason));
+
+        normalResponse(req, rsp, "<response/>"); //TODO: not sure what to send here


I'm not sure either offhand. Assuming that the default response status is javax.servlet.http.HttpServletResponse.SC_OK, can we rely on the response status alone and simply avoid calling rsp.getCompressedWriter() in the first place for such responses? I can look into this further if you're having trouble.

basil · 2019-06-06T02:44:02Z

plugin/src/main/java/hudson/plugins/swarm/PluginImpl.java

+    }
+
+    public void doGetSlaveReadyForShutdown(StaplerRequest req, StaplerResponse rsp, @QueryParameter String name) throws IOException {
+        Jenkins jenkins = Jenkins.getInstance();


Regarding checking for secrets/permissions: I can't see any disadvantages to checking, and I can't see any advantages to not checking. In contrast, if we don't check, we increase the attack surface for potential security issues. Based on the above reasoning, I think I'd prefer to check secrets/permissions.

akomakom · 2019-06-06T17:10:52Z

> This looks great so far! I like the use of a ShutdownHook, and the backend APIs are pretty much exactly what I expected.
>
> Have you given any thought as to how this functionality might be activated or deactivated? The Java API states:
>
> > Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the virtual machine is terminated due to user logoff or system shutdown the underlying operating system may only allow a fixed amount of time in which to shut down and exit. It is therefore inadvisable to attempt any user interaction or to perform a long-running computation in a shutdown hook.
>
> For this reason as well as to preserve backwards compatibility of the existing command-line API, I think this new functionality should probably be opt-in rather than opt-out by default. What do you think?

I have mixed feelings about the design.
First off, the expected use case is the opposite of "finish work quickly". I have jobs that take 8 hours, which is how long the ShutdownHook would have to delay VM termination.

Then there is the opt-in. I see two main approaches:

1. A new flag or signal to enable graceful shutdown behavior.
2. A second kill required to terminate with prejudice.

Number 2 is not really opt-in, since it will require instrumentation changes. It's also not trivial to do using only Java's shutdown hooks (without explicitly registering signal handlers), but we might as well require kill -9 the second time to make it easy.

Which leaves these options:

1. -gracefulShutdown=true changes SIGINT behavior to what I have in this PR. This still violates the "quick" directive for shutdown hooks.
2. kill -SIGUSR1 (for example) triggers graceful shutdown, as opposed to SIGINT/SIGTERM, which still kill instantly. (But I have yet to find a signal handling approach that doesn't rely on sun.misc.*)
3. Some other mechanism, e.g., an HTTP listener, a file monitor, etc.

I'm in favor of choice 3 because I am not confident about the suitability of shutdown hooks for this use case. An HTTP listener is probably too heavy and opens some security vulnerabilities, but a command file (a la nagios.cmd) would work, e.g., -commandFile /some/path. Then echo 'gracefulShutdown' > /some/path initiates shutdown (the details of the file can also be hidden behind Java, i.e., java -jar swarm-client.jar gracefulShutdown). Maybe even two files, command and status, so that the new process can request shutdown and monitor the results, blocking until ready. The status file could potentially be used for monitoring by external tools.
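
The command-file idea could be sketched roughly as follows. The `gracefulShutdown` command word comes from the comment above; the class and method names are hypothetical, and the sketch polls a regular file rather than a named pipe.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Sketch only: polls a command file for a graceful-shutdown request. */
final class CommandFilePoller {
    static final String SHUTDOWN_COMMAND = "gracefulShutdown";

    private final Path commandFile;

    CommandFilePoller(Path commandFile) {
        this.commandFile = commandFile;
    }

    /** Returns true if any line of the file is exactly the shutdown command. */
    boolean shutdownRequested() {
        if (!Files.isRegularFile(commandFile)) {
            return false;
        }
        try {
            List<String> lines = Files.readAllLines(commandFile, StandardCharsets.UTF_8);
            return lines.stream().map(String::trim).anyMatch(SHUTDOWN_COMMAND::equals);
        } catch (IOException e) {
            // Treat an unreadable file as "no request" and try again on the next poll.
            return false;
        }
    }

    /** Blocks, polling at the given interval, then runs the action once requested. */
    void pollUntilRequested(long intervalMillis, Runnable onShutdown)
            throws InterruptedException {
        while (!shutdownRequested()) {
            Thread.sleep(intervalMillis);
        }
        onShutdown.run();
    }
}
```

With this shape, `echo 'gracefulShutdown' > /some/path` would cause the next poll to trigger the action. A status file would need a separate writer and is not covered here.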

akomakom · 2019-06-06T18:11:17Z

Another half-baked idea:

1. Run the jar to shut down the other process, i.e., java -jar swarm-client.jar .... -gracefulShutdown (same command line + shutdown switch).
2. The new process communicates with Jenkins.
3. Jenkins marks the original slave offline.
4. The original instance determines that it should shut down by also communicating with Jenkins (by polling? by piggybacking on some existing channel?) and exits once jobs finish.

This might not work with unique client names.

Alternatively:

Run the jar as above, so that it:

1. Communicates with Jenkins and marks the old slave offline.
2. Blocks until jobs finish.
3. Kills the original process.

Again, unique client names make it difficult to find the right instance to work with if multiple instances are present.
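
The second variant (mark offline, block until jobs finish, stop the original process) could be sketched as below. The `Controller` interface is hypothetical and only stands in for whatever calls to Jenkins would be needed; the timeout is an addition for the sketch and is not part of the proposal above.

```java
import java.util.concurrent.TimeUnit;

/** Sketch only: drain a node, then stop the original client process. */
final class DrainThenStop {
    /** Hypothetical view of what the second process needs from Jenkins. */
    interface Controller {
        void markOffline(String nodeName, String reason);
        boolean isIdle(String nodeName);
    }

    private DrainThenStop() {}

    /**
     * Marks the node offline, polls until it is idle, then runs stopOriginal.
     * Returns false (without stopping anything) if the timeout elapses first.
     */
    static boolean drain(Controller controller, String nodeName,
                         long pollMillis, long timeoutMillis,
                         Runnable stopOriginal) throws InterruptedException {
        controller.markOffline(nodeName, "graceful shutdown requested");
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!controller.isIdle(nodeName)) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still busy; leave the original process running
            }
            Thread.sleep(pollMillis);
        }
        stopOriginal.run();
        return true;
    }
}
```

The node name is passed in explicitly, which is exactly where the unique-client-name problem shows up: the caller has to know which instance it is draining.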

akomakom added 2 commits June 5, 2019 16:55

New endpoints and shutdown hook

b5f32f6

cleanup

f989b07

basil reviewed Jun 6, 2019


basil added the feature New features and improvements label Jun 19, 2019

akomakom wants to merge 2 commits into jenkinsci:master from akomakom:JENKINS-31084-graceful-shutdown
