Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -1,24 +1,73 @@
from celery import Celery
from django.conf import settings
from django.core.management.base import BaseCommand

from celery import Celery
from kombu import Connection

from corehq.apps.hqadmin.utils import parse_celery_pings


class Command(BaseCommand):
help = "Gracefully shutsdown a celery worker"
help = "Gracefully shuts down a celery worker"

def add_arguments(self, parser):
parser.add_argument('hostname')

def handle(self, hostname, **options):
celery = Celery()
celery.config_from_object(settings)
celery.control.broadcast('shutdown', destination=[hostname])
worker_responses = celery.control.ping(timeout=10, destination=[hostname])
pings = parse_celery_pings(worker_responses)
if hostname in pings:
print('Did not shutdown worker')
self.celery = Celery()
self.celery.config_from_object(settings)

current_broker_url = getattr(settings, 'CELERY_BROKER_URL', None)
assert current_broker_url is not None, "CELERY_BROKER_URL is not set"

# as long as OLD_BROKER_URL is set, we are going to send shutdown
# broadcasts to both the old and new broker urls
old_broker_url = getattr(settings, 'OLD_BROKER_URL', None)
migration_in_progress = old_broker_url is not None

if not migration_in_progress:
succeeded = self._shutdown(hostname)
if succeeded:
print(f'Successfully initiated warm shutdown of {hostname}')
return
exit(1)
print('Successfully initiated warm shutdown')

for broker_url in [current_broker_url, old_broker_url]:
broker_conn = Connection(broker_url)
succeeded = self._shutdown(hostname, broker_conn)
Comment on lines +35 to +36

This comment was marked as outdated.

broker_conn.release()
if succeeded:
print(
'[Broker Migration In Progress] Initiated warm shutdown '
f'of {hostname}'
)
Comment on lines +34 to +42
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: During broker migration, the command sends a redundant shutdown signal on success and incorrectly exits with a success code (0) if both shutdown attempts fail.
Severity: CRITICAL | Confidence: High

🔍 Detailed Analysis

In the broker migration scenario, the command iterates through two broker URLs to send a shutdown signal. If the first attempt is successful, the code does not exit the loop and proceeds to send a second, redundant shutdown signal to the other broker. Furthermore, if both shutdown attempts fail, the loop completes and the command exits with a success code (0). This is inconsistent with the non-migration path which exits with an error code (1) on failure. This will cause wrapper processes, which rely on the exit code, to incorrectly believe a failed shutdown was successful.

💡 Suggested Fix

After a successful shutdown attempt within the for loop, add a return statement to exit immediately. After the loop, if no attempt was successful, call exit(1) to signal failure, making the behavior consistent with the non-migration path.

🤖 Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location:
corehq/apps/hqadmin/management/commands/shutdown_celery_worker_by_hostname.py#L34-L42

Potential issue: In the broker migration scenario, the command iterates through two
broker URLs to send a shutdown signal. If the first attempt is successful, the code does
not exit the loop and proceeds to send a second, redundant shutdown signal to the other
broker. Furthermore, if both shutdown attempts fail, the loop completes and the command
exits with a success code (0). This is inconsistent with the non-migration path which
exits with an error code (1) on failure. This will cause wrapper processes, which rely
on the exit code, to incorrectly believe a failed shutdown was successful.

Did we get this right? 👍 / 👎 to inform future reviews.
Reference ID: 8068718


def _shutdown(self, hostname, broker_conn=None):
# if using a custom broker connection, it is unlikely a ping will
# work properly since the worker might be writing to a different
# broker than the one it is reading from
check_worker_up = broker_conn is None

if check_worker_up and not self._is_worker_up(hostname):
print(f'{hostname} did not respond to ping. Aborted shutdown.')
return False

kwargs = {'destination': [hostname]}
if broker_conn is not None:
# use a custom broker connection
kwargs['connection'] = broker_conn
self.celery.control.broadcast('shutdown', **kwargs)

if check_worker_up and self._is_worker_up(hostname):
# if worker is still up, the shutdown likely did not succeed
# or it is just a slow shutdown
print(f'{hostname} responded to ping after initiating shutdown.')
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that this branch results in sys.exit(1) (command failure), and that means that the shutdown either failed entirely or some workers are still in the process of shutting down and therefore it might not have failed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I was attempting to preserve behavior since this is how it functioned before as well, though I'm not sure I agree with it. In practice, I don't think the exit code here actually makes any difference since this is called in the context of a trap in the celery bash runner, and we don't do anything special to handle a failure here. Perhaps it would be more clear to remove the exit(1) from this command entirely.


return True

def _is_worker_up(self, hostname):
worker_responses = self.celery.control.ping(
timeout=10, destination=[hostname]
)
pings = parse_celery_pings(worker_responses)
return hostname in pings
Loading