Support shutting down celery workers in various broker migration states #37224
base: master
Conversation
- ping worker before attempting shutdown
- avoid pinging worker if using custom broker_conn
- log helpful messages if ping fails in either case
corehq/apps/hqadmin/management/commands/shutdown_celery_worker_by_hostname.py
```python
for broker_url in [current_broker_url, old_broker_url]:
    broker_conn = Connection(broker_url)
    succeeded = self._shutdown(hostname, broker_conn)
    broker_conn.release()
    if succeeded:
        print(
            '[Broker Migration In Progress] Initiated warm shutdown '
            f'of {hostname}'
        )
```
Bug: During broker migration, the command sends a redundant shutdown signal on success and incorrectly exits with a success code (0) if both shutdown attempts fail.
Severity: CRITICAL | Confidence: High
🔍 Detailed Analysis
In the broker migration scenario, the command iterates through two broker URLs to send a shutdown signal. If the first attempt is successful, the code does not exit the loop and proceeds to send a second, redundant shutdown signal to the other broker. Furthermore, if both shutdown attempts fail, the loop completes and the command exits with a success code (0). This is inconsistent with the non-migration path which exits with an error code (1) on failure. This will cause wrapper processes, which rely on the exit code, to incorrectly believe a failed shutdown was successful.
💡 Suggested Fix
After a successful shutdown attempt within the for loop, add a return statement to exit immediately. After the loop, if no attempt was successful, call exit(1) to signal failure, making the behavior consistent with the non-migration path.
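A minimal sketch of that change, assuming the loop lives in the command's handle() method and reusing the names from the diff above (not the actual implementation):

```python
for broker_url in [current_broker_url, old_broker_url]:
    broker_conn = Connection(broker_url)
    succeeded = self._shutdown(hostname, broker_conn)
    broker_conn.release()
    if succeeded:
        print(
            '[Broker Migration In Progress] Initiated warm shutdown '
            f'of {hostname}'
        )
        # Stop after the first broker that accepts the shutdown.
        return

# Neither broker reached the worker; exit non-zero so wrapper processes
# see a failure, consistent with the non-migration path.
exit(1)
```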
The more I think about this, the more inclined I am to just accept the state of the world as is and say that when migrating brokers (which we so rarely do), remote control commands will not work, so we should avoid worker restarts during that window. The primary motivation is that no matter what, we are going to enter this state at some point, so the added complexity doesn't even fully mitigate the risk and therefore doesn't seem worth it. Likely going to close this, but will leave it open until others have a chance to weigh in.
Product Description
Technical Summary
https://dimagi.atlassian.net/browse/SAAS-19047
How this command is used
We have a process that wraps celery worker processes, with the main goal of catching SIGTERM signals, calling this command on the underlying celery worker, and exiting to spin up a new celery worker while giving the old worker time to shut down gracefully.
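As a rough illustration of that wrapper behavior (the hostname value and the exact command-line invocation here are assumptions, not the actual wrapper code):

```python
import signal
import subprocess
import sys

WORKER_HOSTNAME = 'celery@example-machine'  # assumed example hostname

def handle_sigterm(signum, frame):
    # Ask the underlying celery worker to shut down warmly, then exit so the
    # process manager can spin up a replacement worker in its place.
    subprocess.check_call([
        'python', 'manage.py', 'shutdown_celery_worker_by_hostname',
        WORKER_HOSTNAME,
    ])
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```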
The problem
The way this shutdown is communicated to the celery worker is via a "broadcast". In celery, broadcasts are sent via the celery broker, and the worker reads this message from that broker.
As part of a broker migration, some machines are configured as "bridges" where they read from the old broker and write to the new broker, and other machines are configured to read from and write to the new broker entirely.
This means this command needs to successfully broadcast the shutdown message to the right broker in various scenarios, or else the worker will never receive the signal to shut down.
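For illustration only (not the command's actual code; the app instance and hostname are assumptions), celery allows pinging a worker and broadcasting the shutdown command over an explicit broker connection:

```python
from celery import Celery
from kombu import Connection

app = Celery('app')  # assumed; in practice this is the configured HQ celery app

def broadcast_shutdown(hostname, broker_url):
    """Ping the worker over the given broker, then ask it to shut down warmly."""
    conn = Connection(broker_url)
    try:
        # If the worker is not listening on this broker, ping returns no replies.
        if not app.control.ping(destination=[hostname], connection=conn, timeout=5):
            return False
        app.control.broadcast('shutdown', destination=[hostname], connection=conn)
        return True
    finally:
        conn.release()
```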
Scenarios
In each of these scenarios we need to consider machines that act as the "bridge" and machines that don't.
Start of migration
At the start of a migration from broker A to broker B, CELERY_BROKER_READ_URL is updated to broker A on the bridge machine responsible for draining broker A, and CELERY_BROKER_URL is set to broker B on non-bridge machines.

On the bridge machine, settings will be updated to define CELERY_BROKER_READ_URL and CELERY_BROKER_WRITE_URL, and the shutdown command will run with these updated settings while the celery worker is still running with CELERY_BROKER_URL set to broker A. This means this command needs to broadcast the shutdown to broker A.

On machines that do not act as the bridge between two brokers, CELERY_BROKER_URL will be set to broker A for existing workers, and the command to shut down workers will run with it set to broker B. Similarly, this command needs to broadcast the shutdown to broker A.

Cut over

When we cut over to broker B, CELERY_BROKER_URL gets updated to broker B, and the READ and WRITE vars are removed because of how rabbitmq_migration_bridge works. This means the command will run with CELERY_BROKER_URL set to broker B, and will not communicate shutdowns to celery workers running with CELERY_BROKER_READ_URL set to broker A on bridge machines. Again, the solution is for this command to broadcast the shutdown to broker A.

For non-bridge machines, this is fine since the workers are already configured to read from and write to broker B.

Revert

The third scenario is unlikely, but still worth considering. After cutting over to broker B, if we decide we need to revert, ideally we would reset CELERY_BROKER_READ_URL to broker A and keep CELERY_BROKER_URL set to broker B. In this case, on bridge machines the command will run with CELERY_BROKER_READ_URL set to broker A, but the existing celery workers will be running with CELERY_BROKER_URL set to broker B.

Solution

While we can come up with different solutions for each of these scenarios, ultimately the easiest solution is to depend on OLD_BROKER_URL. As long as this is set, shutdowns should be communicated to both broker A and broker B. This means there would be an imperative step at the end of the migration when cutting over: leave OLD_BROKER_URL set even when removing rabbitmq_migration_bridge so that the shutdown command works properly. Once all workers are running against broker B, OLD_BROKER_URL can be removed.
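A sketch of what this could look like in the shutdown command, assuming OLD_BROKER_URL is exposed as a Django setting (names here are illustrative, not the actual implementation):

```python
from django.conf import settings

def broker_urls_to_notify():
    """Return every broker URL the shutdown broadcast should be sent to."""
    urls = [settings.CELERY_BROKER_URL]
    old_broker_url = getattr(settings, 'OLD_BROKER_URL', None)
    if old_broker_url:
        # Mid-migration: workers may still be attached to the old broker,
        # so broadcast the shutdown there as well.
        urls.append(old_broker_url)
    return urls
```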
Feature Flag

Safety Assurance
Safety story
Tested on staging. This command is very important as it is used to shut down a celery worker whenever a SIGTERM is received by the supervisor process. If this fails, it means workers will not shut down, and we will accumulate stale workers over time and run out of memory on the machine.
Automated test coverage
QA Plan
No
Labels & Review