fix: emit heartbeat in GenericTransfer pagination loop to prevent zombie detection (#64658) #64709
|
Don’t think it’s a good idea to let a specific operator handle heartbeats imho. |
|
Thanks for the feedback @dabla, and totally fair point from an architectural perspective! I agree that having a specific operator manage its own heartbeats is not ideal as a general pattern. The reason I went this route here is that So this PR is really meant as a practical short-term fix for users hitting this today, while a proper That said I am fully open to whatever direction the maintainers prefer:
Would love guidance from a core maintainer on which approach fits better with the project's direction. |
|
The failure in Could a maintainer please re-run the failed job? Thanks! |
Closes #64658
Problem
Long-running `GenericTransfer` tasks (>3 hours) were being incorrectly killed by the scheduler due to a heartbeat timeout / zombie detection false positive.
During the paginated transfer loop, the operator performs long blocking work (bulk inserts via `executemany`) without emitting any heartbeat to the Airflow metadata DB. The scheduler's `_find_and_purge_task_instances_without_heartbeats` routine in `scheduler_job_runner.py` checks `last_heartbeat_at` periodically. If it goes stale beyond `task_instance_heartbeat_timeout`, the task is treated as a zombie and terminated, even though it is actively processing data.
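To make the failure mode concrete, here is a minimal sketch of the kind of staleness comparison described above. It is not the scheduler's actual code: the function `is_stale`, the 300-second timeout, and the timestamps are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_heartbeat_at, now, heartbeat_timeout_seconds):
    """Return True when the last heartbeat is older than the allowed timeout.

    Simplified stand-in for the scheduler's check; the real routine works
    against the metadata DB rather than in-memory values.
    """
    return now - last_heartbeat_at > timedelta(seconds=heartbeat_timeout_seconds)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timeout = 300  # hypothetical timeout in seconds, not a recommended value

# A transfer that has been blocked in bulk inserts for 3 hours without a heartbeat.
stale_hb = now - timedelta(hours=3)
# A transfer that emitted a heartbeat 30 seconds ago.
fresh_hb = now - timedelta(seconds=30)

print(is_stale(stale_hb, now, timeout))  # True
print(is_stale(fresh_hb, now, timeout))  # False
```

Under this model, a task that is busy but silent is indistinguishable from a dead one, which is why emitting a heartbeat during the blocking loop avoids the false positive.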
This affects both:
- `execute_complete` (called per page when deferred)
- `execute` (iterates over a list of SQL statements)
Fix
- `_emit_transfer_heartbeat()` helper that calls `ti.heartbeat()` or `ti.update_heartbeat()` (first match wins via `getattr`) after each page in `execute_complete()` and after each SQL batch in `execute()`
- task instance (no regression for older runtimes)
- long-running transfers:
  - `[scheduler] task_instance_heartbeat_timeout`
  - `[celery_broker_transport_options] visibility_timeout`
  - `[scheduler] task_instance_heartbeat_sec`
- `test_heartbeat_called_during_paginated_transfer` to verify heartbeat is called once per page during a paginated transfer
Testing
Related Issues