
Conversation


@gherceg gherceg commented Dec 30, 2025

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-19047

How this command is used

We have a process that wraps celery worker processes. Its main purpose is to catch SIGTERM signals, call this command on the underlying celery worker, and exit so that a new celery worker can be spun up while the old worker is given time to shut down gracefully.
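As an illustration, a minimal sketch of such a wrapper, assuming the management command is invoked via manage.py with the worker hostname as an argument (the hostname value and invocation details here are assumptions, not the actual wrapper):

```python
import signal
import subprocess
import sys

WORKER_HOSTNAME = 'celery@main_worker.example-machine'  # hypothetical worker name


def handle_sigterm(signum, frame):
    # Ask the wrapped worker to shut down warmly, then exit so the process
    # manager can start a replacement worker while the old one drains.
    subprocess.run([
        'python', 'manage.py', 'shutdown_celery_worker_by_hostname',
        WORKER_HOSTNAME,
    ], check=False)
    sys.exit(0)


signal.signal(signal.SIGTERM, handle_sigterm)
```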

The problem

The shutdown is communicated to the celery worker via a "broadcast". In celery, broadcasts are sent through the celery broker, and the worker reads the message from that broker.
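For reference, a minimal sketch of how such a broadcast can be sent to a specific worker over a specific broker using celery's remote control API (the broker URL and worker hostname are placeholders):

```python
from celery import Celery

# Point the app at the broker the target worker is actually reading from.
app = Celery(broker='amqp://guest@broker-a:5672//')

# 'shutdown' is a built-in remote control command; destination restricts the
# broadcast to the named worker rather than every worker on that broker.
app.control.broadcast('shutdown', destination=['celery@main_worker.example-machine'])
```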

As part of a broker migration, some machines are configured as "bridges" that read from the old broker and write to the new broker, while other machines are configured to read from and write to the new broker entirely.
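The relevant settings in each configuration look roughly like this (a sketch; the URLs are placeholders):

```python
# Non-bridge machine: a single broker is used for both reading and writing.
CELERY_BROKER_URL = 'amqp://broker-b:5672//'

# Bridge machine during a migration: drain messages from the old broker while
# publishing new messages to the new broker.
CELERY_BROKER_READ_URL = 'amqp://broker-a:5672//'
CELERY_BROKER_WRITE_URL = 'amqp://broker-b:5672//'
```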

This means this command needs to successfully broadcast the shutdown message to the right broker in each of these scenarios, or else the worker will never receive the signal to shut down.

Scenarios

In each of these scenarios we need to consider machines that act as the "bridge" and machines that don't.

Start of migration

At the start of a migration from broker A to broker B, CELERY_BROKER_READ_URL is updated to broker A on the bridge machine responsible for draining broker A, and CELERY_BROKER_URL is set to broker B on non-bridge machines.

On the bridge machine, settings will be updated to define CELERY_BROKER_READ_URL and CELERY_BROKER_WRITE_URL, so the shutdown command will run with these updated settings while the celery worker is still running with CELERY_BROKER_URL set to broker A. This means the command needs to broadcast the shutdown to broker A.

On machines that do not act as the bridge between the two brokers, CELERY_BROKER_URL will be set to broker A for existing workers, while the command to shut down workers will run with it set to broker B. Similarly, this command needs to broadcast the shutdown to broker A.

Cut over

When we cut over to broker B, CELERY_BROKER_URL gets updated to broker B, and the READ and WRITE vars are removed because of how rabbitmq_migration_bridge works. This means the command will run with CELERY_BROKER_URL set to broker B, and will not communicate shutdowns to celery workers running with CELERY_BROKER_READ_URL set to broker A on bridge machines. Again, the solution is for this command to broadcast the shutdown to broker A.

For non-bridge machines, this is fine since the workers are already configured to read from and write to broker B.

Revert

The third scenario is unlikely, but still worth considering. After cutting over to broker B, if we decide we need to revert, ideally we would reset CELERY_BROKER_READ_URL to broker A, and keep CELERY_BROKER_URL set to broker B. In this case, on bridge machines the command will run with CELERY_BROKER_READ_URL set to broker A, but the existing celery workers will be running with CELERY_BROKER_URL set to broker B.

Solution

While we could come up with a different solution for each of these scenarios, the simplest is to depend on OLD_BROKER_URL. As long as it is set, shutdowns should be broadcast to both broker A and broker B. This adds an imperative step at the end of the migration: when cutting over, OLD_BROKER_URL must be left set even after removing rabbitmq_migration_bridge so that the shutdown command continues to work. Once all workers are running against broker B, OLD_BROKER_URL can be removed. A sketch of this approach is below.
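A minimal sketch of what the command (corehq/apps/hqadmin/management/commands/shutdown_celery_worker_by_hostname.py) could do under this approach — ping the worker on each candidate broker and broadcast the shutdown on the first broker that responds. This illustrates the idea and is not the exact implementation:

```python
from django.conf import settings
from kombu import Connection


def broker_urls_to_try():
    # Always try the currently configured broker; also try the old broker if a
    # migration is in progress (i.e. OLD_BROKER_URL is set).
    urls = [settings.CELERY_BROKER_URL]
    old_url = getattr(settings, 'OLD_BROKER_URL', None)
    if old_url:
        urls.append(old_url)
    return urls


def shutdown_worker(app, hostname):
    for broker_url in broker_urls_to_try():
        with Connection(broker_url) as conn:
            # Ping first so we know the worker is reachable over this broker
            # before broadcasting the warm shutdown.
            if not app.control.ping(destination=[hostname], timeout=5, connection=conn):
                continue
            app.control.broadcast('shutdown', destination=[hostname], connection=conn)
        return True
    return False
```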

Feature Flag

Safety Assurance

Safety story

Tested on staging. This command is very important as it is used to shut down a celery worker whenever a SIGTERM is received by the supervisor process. If it fails, workers will not shut down, and we will accumulate stale workers over time and run out of memory on the machine.

Automated test coverage

QA Plan

No

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

- ping worker before attempting shutdown
- avoid pinging worker if using custom broker_conn
- log helpful messages if ping fails in either case
@gherceg gherceg added the product/invisible Change has no end-user visible impact label Dec 30, 2025

Comment on lines +34 to +42
for broker_url in [current_broker_url, old_broker_url]:
    broker_conn = Connection(broker_url)
    succeeded = self._shutdown(hostname, broker_conn)
    broker_conn.release()
    if succeeded:
        print(
            '[Broker Migration In Progress] Initiated warm shutdown '
            f'of {hostname}'
        )

Bug: During broker migration, the command sends a redundant shutdown signal on success and incorrectly exits with a success code (0) if both shutdown attempts fail.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

In the broker migration scenario, the command iterates through two broker URLs to send a shutdown signal. If the first attempt is successful, the code does not exit the loop and proceeds to send a second, redundant shutdown signal to the other broker. Furthermore, if both shutdown attempts fail, the loop completes and the command exits with a success code (0). This is inconsistent with the non-migration path which exits with an error code (1) on failure. This will cause wrapper processes, which rely on the exit code, to incorrectly believe a failed shutdown was successful.

💡 Suggested Fix

After a successful shutdown attempt within the for loop, add a return statement to exit immediately. After the loop, if no attempt was successful, call exit(1) to signal failure, making the behavior consistent with the non-migration path.
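Concretely, the suggested change might look like this (a sketch based on the snippet above, not a verified patch):

```python
for broker_url in [current_broker_url, old_broker_url]:
    broker_conn = Connection(broker_url)
    succeeded = self._shutdown(hostname, broker_conn)
    broker_conn.release()
    if succeeded:
        print(
            '[Broker Migration In Progress] Initiated warm shutdown '
            f'of {hostname}'
        )
        return  # stop after the first broker that accepts the shutdown
exit(1)  # neither broker accepted the shutdown; fail like the non-migration path
```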



gherceg commented Dec 31, 2025

The more I think about this, the more inclined I am to just accept the state of the world as is, and say that when migrating brokers (which we do so rarely), remote control commands will not work, so we should avoid worker restarts during that time. The primary motivation is that no matter what, we are going to enter this state at some point, so the added complexity doesn't fully mitigate the risk and therefore doesn't seem worth it. I'm likely going to close this, but will leave it open until others have a chance to weigh in.
