
Conversation


@gherceg gherceg commented Dec 30, 2025

Product Description

Technical Summary

https://dimagi.atlassian.net/browse/SAAS-19047

How this command is used

We have a process that wraps celery worker processes. Its main purpose is to catch SIGTERM signals, call this command on the underlying celery worker, and exit so that a new celery worker can be spun up while the old worker is given time to shut down gracefully.
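As an illustration, a minimal sketch of such a wrapper, assuming the management command is invoked via manage.py with the worker hostname as an argument (the hostname value and invocation details here are assumptions, not the actual wrapper):

```python
import signal
import subprocess
import sys

WORKER_HOSTNAME = 'celery@main_worker.example-machine'  # hypothetical worker name


def handle_sigterm(signum, frame):
    # Ask the wrapped worker to shut down warmly, then exit so the process
    # manager can start a replacement worker while the old one drains.
    subprocess.run([
        'python', 'manage.py', 'shutdown_celery_worker_by_hostname',
        WORKER_HOSTNAME,
    ], check=False)
    sys.exit(0)


signal.signal(signal.SIGTERM, handle_sigterm)
```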

The problem

The shutdown is communicated to the celery worker via a "broadcast". In celery, broadcasts are sent through the celery broker, and the worker reads the message from that broker.
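For reference, a minimal sketch of how such a broadcast can be sent to a specific worker over a specific broker using celery's remote control API (the broker URL and worker hostname are placeholders):

```python
from celery import Celery

# Point the app at the broker the target worker is actually reading from.
app = Celery(broker='amqp://guest@broker-a:5672//')

# 'shutdown' is a built-in remote control command; destination restricts the
# broadcast to the named worker rather than every worker on that broker.
app.control.broadcast('shutdown', destination=['celery@main_worker.example-machine'])
```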

As part of a broker migration, some machines are configured as "bridges" that read from the old broker and write to the new broker, while other machines are configured to read from and write to the new broker entirely.
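The relevant settings in each configuration look roughly like this (a sketch; the URLs are placeholders):

```python
# Non-bridge machine: a single broker is used for both reading and writing.
CELERY_BROKER_URL = 'amqp://broker-b:5672//'

# Bridge machine during a migration: drain messages from the old broker while
# publishing new messages to the new broker.
CELERY_BROKER_READ_URL = 'amqp://broker-a:5672//'
CELERY_BROKER_WRITE_URL = 'amqp://broker-b:5672//'
```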

This means this command needs to successfully broadcast the shutdown message to the right broker in each of these scenarios, or else the worker will never receive the signal to shut down.

Scenarios

In each of these scenarios we need to consider machines that act as the "bridge" and machines that don't.

Start of migration

At the start of a migration from broker A to broker B, CELERY_BROKER_READ_URL is updated to broker A on the bridge machine responsible for draining broker A, and CELERY_BROKER_URL is set to broker B on non-bridge machines.

On the bridge machine, settings will be updated to define CELERY_BROKER_READ_URL and CELERY_BROKER_WRITE_URL, so the shutdown command will run with these updated settings while the celery worker is still running with CELERY_BROKER_URL set to broker A. This means the command needs to broadcast the shutdown to broker A.

On machines that do not act as the bridge between the two brokers, CELERY_BROKER_URL will be set to broker A for existing workers, while the command to shut down workers will run with it set to broker B. Similarly, this command needs to broadcast the shutdown to broker A.

Cut over

When we cut over to broker B, CELERY_BROKER_URL gets updated to broker B, and the READ and WRITE vars are removed because of how rabbitmq_migration_bridge works. This means the command will run with CELERY_BROKER_URL set to broker B, and will not communicate shutdowns to celery workers running with CELERY_BROKER_READ_URL set to broker A on bridge machines. Again, the solution is for this command to broadcast the shutdown to broker A.

For non-bridge machines, this is fine since the workers are already configured to read from and write to broker B.

Revert

The third scenario is unlikely, but still worth considering. After cutting over to broker B, if we decide we need to revert, ideally we would reset CELERY_BROKER_READ_URL to broker A, and keep CELERY_BROKER_URL set to broker B. In this case, on bridge machines the command will run with CELERY_BROKER_READ_URL set to broker A, but the existing celery workers will be running with CELERY_BROKER_URL set to broker B.

Solution

While we could come up with a different solution for each of these scenarios, the simplest is to depend on OLD_BROKER_URL. As long as it is set, shutdowns should be broadcast to both broker A and broker B. This adds an imperative step at the end of the migration: when cutting over, OLD_BROKER_URL must be left set even after removing rabbitmq_migration_bridge so that the shutdown command continues to work. Once all workers are running against broker B, OLD_BROKER_URL can be removed. A sketch of this approach is below.
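A minimal sketch of what the command (corehq/apps/hqadmin/management/commands/shutdown_celery_worker_by_hostname.py) could do under this approach — ping the worker on each candidate broker and broadcast the shutdown on the first broker that responds. This illustrates the idea and is not the exact implementation:

```python
from django.conf import settings
from kombu import Connection


def broker_urls_to_try():
    # Always try the currently configured broker; also try the old broker if a
    # migration is in progress (i.e. OLD_BROKER_URL is set).
    urls = [settings.CELERY_BROKER_URL]
    old_url = getattr(settings, 'OLD_BROKER_URL', None)
    if old_url:
        urls.append(old_url)
    return urls


def shutdown_worker(app, hostname):
    for broker_url in broker_urls_to_try():
        with Connection(broker_url) as conn:
            # Ping first so we know the worker is reachable over this broker
            # before broadcasting the warm shutdown.
            if not app.control.ping(destination=[hostname], timeout=5, connection=conn):
                continue
            app.control.broadcast('shutdown', destination=[hostname], connection=conn)
        return True
    return False
```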

Feature Flag

Safety Assurance

Safety story

Tested on staging. This command is very important as it is used to shut down a celery worker whenever a SIGTERM is received by the supervisor process. If it fails, workers will not shut down, and we will accumulate stale workers over time and run out of memory on the machine.

Automated test coverage

QA Plan

No

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

- ping worker before attempting shutdown
- avoid pinging worker if using custom broker_conn
- log helpful messages if ping fails in either case
@gherceg gherceg added the product/invisible Change has no end-user visible impact label Dec 30, 2025

Comment on lines +34 to +42
for broker_url in [current_broker_url, old_broker_url]:
    broker_conn = Connection(broker_url)
    succeeded = self._shutdown(hostname, broker_conn)
    broker_conn.release()
    if succeeded:
        print(
            '[Broker Migration In Progress] Initiated warm shutdown '
            f'of {hostname}'
        )

Bug: During broker migration, the command sends a redundant shutdown signal on success and incorrectly exits with a success code (0) if both shutdown attempts fail.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

In the broker migration scenario, the command iterates through two broker URLs to send a shutdown signal. If the first attempt is successful, the code does not exit the loop and proceeds to send a second, redundant shutdown signal to the other broker. Furthermore, if both shutdown attempts fail, the loop completes and the command exits with a success code (0). This is inconsistent with the non-migration path which exits with an error code (1) on failure. This will cause wrapper processes, which rely on the exit code, to incorrectly believe a failed shutdown was successful.

💡 Suggested Fix

After a successful shutdown attempt within the for loop, add a return statement to exit immediately. After the loop, if no attempt was successful, call exit(1) to signal failure, making the behavior consistent with the non-migration path.
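Concretely, the suggested change might look like this (a sketch based on the snippet above, not a verified patch):

```python
for broker_url in [current_broker_url, old_broker_url]:
    broker_conn = Connection(broker_url)
    succeeded = self._shutdown(hostname, broker_conn)
    broker_conn.release()
    if succeeded:
        print(
            '[Broker Migration In Progress] Initiated warm shutdown '
            f'of {hostname}'
        )
        return  # stop after the first broker that accepts the shutdown
exit(1)  # neither broker accepted the shutdown; fail like the non-migration path
```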



gherceg commented Dec 31, 2025

The more I think about this, the more inclined I am to just accept the state of the world as is, and say that when migrating brokers (which we do so rarely), remote control commands will not work, so we should avoid worker restarts during that time. The primary motivation is that no matter what, we are going to enter this state at some point, so the added complexity doesn't fully mitigate the risk and therefore doesn't seem worth it. I'm likely going to close this, but will leave it open until others have a chance to weigh in.
