fix: emit heartbeat in GenericTransfer pagination loop to prevent zombie detection (#64658)#64709

Open
Pranaykarvi wants to merge 3 commits into apache:main from
Pranaykarvi:fix/64658-long-running-task-heartbeat-mismatch
Conversation

@Pranaykarvi
Contributor

Closes #64658

Problem

Long-running GenericTransfer tasks (>3 hours) were being incorrectly
killed by the scheduler due to a heartbeat timeout / zombie detection
false positive.

During the paginated transfer loop, the operator performs long blocking
work (bulk inserts via executemany) without emitting any heartbeat to
the Airflow metadata DB. The scheduler's
_find_and_purge_task_instances_without_heartbeats routine in
scheduler_job_runner.py checks last_heartbeat_at periodically — if
it goes stale beyond task_instance_heartbeat_timeout, the task is
treated as a zombie and terminated, even though it is actively
processing data.

This affects both:

  • The paginated path (execute_complete — called per page when deferred)
  • The non-paginated multi-SQL path (execute — iterates over a list of SQL statements)

Fix

  • Added _emit_transfer_heartbeat() helper that calls ti.heartbeat()
    or ti.update_heartbeat() (first match wins via getattr) after each
    page in execute_complete() and after each SQL batch in execute()
  • Helper is best-effort — no-ops cleanly if neither method exists on the
    task instance (no regression for older runtimes)
  • Added docstring note on tuning the following config values for
    long-running transfers:
    • [scheduler] task_instance_heartbeat_timeout
    • [celery_broker_transport_options] visibility_timeout
    • [scheduler] task_instance_heartbeat_sec
  • Added test_heartbeat_called_during_paginated_transfer to verify
    heartbeat is called once per page during a paginated transfer

Testing

uv run --project providers/common/sql pytest \
  providers/common/sql/tests/unit/common/sql/operators/test_generic_transfer.py \
  -xvs

Related Issues

@dabla
Contributor

dabla commented Apr 4, 2026

Don't think it's a good idea to let a specific operator handle heartbeats imho.

@Pranaykarvi
Contributor Author

Thanks for the feedback @dabla, and totally fair point from an architectural perspective!

I agree that having a specific operator manage its own heartbeats is not ideal as a general pattern.
A cleaner long-term solution would be something like a background heartbeat thread inside LocalTaskJob
that fires independently of what the operator's main thread is doing — that way every long-running
operator benefits without needing operator-level changes.
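
As a rough illustration of that idea, a daemon thread could emit heartbeats on a timer regardless of what the main thread is doing. This is a hypothetical sketch, not Airflow's actual LocalTaskJob code; the `update_heartbeat()` method on the task instance is an assumption carried over from the fix above.

```python
# Hypothetical framework-level heartbeat thread: fires on a fixed interval,
# independent of whether the operator's main thread is blocked in executemany().
import threading


class HeartbeatThread(threading.Thread):
    def __init__(self, ti, interval_sec: float = 5.0):
        super().__init__(daemon=True)  # daemon: never blocks interpreter exit
        self._ti = ti
        self._interval = interval_sec
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait() doubles as the sleep and the stop signal: it returns
        # False on timeout (emit a heartbeat) and True once stop() is called.
        while not self._stop_event.wait(self._interval):
            self._ti.update_heartbeat()

    def stop(self) -> None:
        self._stop_event.set()
```

The framework would start this thread before handing control to the operator and stop it on task completion, so every long-running operator benefits without operator-level changes.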

The reason I went this route here is that GenericTransfer blocks the Python thread completely during
executemany() across potentially thousands of paginated batches. During that time there is simply no
gap for any framework-level mechanism to fire a heartbeat, which is what causes the scheduler to
incorrectly treat the task as a zombie.

So this PR is really meant as a practical short-term fix for users hitting this today, while a proper
framework-level solution could be tracked as a separate improvement.

That said, I am fully open to whatever direction the maintainers prefer:

  • Keep this as a targeted operator-level fix for now
  • Or close this and open a dedicated issue proposing a background heartbeat thread at the LocalTaskJob
    level as the right long-term fix

Would love guidance from a core maintainer on which approach fits better with the project's direction.
Happy to do the work either way!

@Pranaykarvi
Contributor Author

The failure in Tests / Low dep tests: providers / All-prov:LowestDeps:14:3.10:exasol...odbc
appears to be a transient CI infrastructure issue — the Docker image failed to load from cache
with Error when loading image: None, which is unrelated to the changes in this PR.

Could a maintainer please re-run the failed job? Thanks!



Successfully merging this pull request may close these issues.

Long running task gets incorrectly killed in long-running DAG due to scheduler state mismatch
