fix: emit heartbeat in GenericTransfer pagination loop to prevent zombie detection (#64658)#64709

Open
Pranaykarvi wants to merge 3 commits into apache:main from
Pranaykarvi:fix/64658-long-running-task-heartbeat-mismatch
Conversation

@Pranaykarvi
Contributor

Closes #64658

Problem

Long-running GenericTransfer tasks (>3 hours) were being incorrectly
killed by the scheduler due to a heartbeat timeout / zombie detection
false positive.

During the paginated transfer loop, the operator performs long blocking
work (bulk inserts via executemany) without emitting any heartbeat to
the Airflow metadata DB. The scheduler's
_find_and_purge_task_instances_without_heartbeats routine in
scheduler_job_runner.py checks last_heartbeat_at periodically — if
it goes stale beyond task_instance_heartbeat_timeout, the task is
treated as a zombie and terminated, even though it is actively
processing data.

This affects both:

  • The paginated path (execute_complete — called per page when deferred)
  • The non-paginated multi-SQL path (execute — iterates over a list of SQL statements)

Fix

  • Added _emit_transfer_heartbeat() helper that calls ti.heartbeat()
    or ti.update_heartbeat() (first match wins via getattr) after each
    page in execute_complete() and after each SQL batch in execute()
  • Helper is best-effort — no-ops cleanly if neither method exists on the
    task instance (no regression for older runtimes)
  • Added docstring note on tuning the following config values for
    long-running transfers:
    • [scheduler] task_instance_heartbeat_timeout
    • [celery_broker_transport_options] visibility_timeout
    • [scheduler] task_instance_heartbeat_sec
  • Added test_heartbeat_called_during_paginated_transfer to verify
    heartbeat is called once per page during a paginated transfer

Testing

uv run --project providers/common/sql pytest \
  providers/common/sql/tests/unit/common/sql/operators/test_generic_transfer.py \
  -xvs

Related Issues

@dabla
Contributor

dabla commented Apr 4, 2026

Don't think it's a good idea to let a specific operator handle heartbeats imho.

@Pranaykarvi
Contributor Author

Thanks for the feedback @dabla, and totally fair point from an architectural perspective!

I agree that having a specific operator manage its own heartbeats is not ideal as a general pattern.
A cleaner long-term solution would be something like a background heartbeat thread inside LocalTaskJob
that fires independently of what the operator's main thread is doing — that way every long-running
operator benefits without needing operator-level changes.
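
As a rough illustration of that idea, a daemon thread could emit heartbeats on a timer regardless of what the main thread is doing. This is a hypothetical sketch, not Airflow's actual LocalTaskJob code; the `update_heartbeat()` method on the task instance is an assumption carried over from the fix above.

```python
# Hypothetical framework-level heartbeat thread: fires on a fixed interval,
# independent of whether the operator's main thread is blocked in executemany().
import threading


class HeartbeatThread(threading.Thread):
    def __init__(self, ti, interval_sec: float = 5.0):
        super().__init__(daemon=True)  # daemon: never blocks interpreter exit
        self._ti = ti
        self._interval = interval_sec
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait() doubles as the sleep and the stop signal: it returns
        # False on timeout (emit a heartbeat) and True once stop() is called.
        while not self._stop_event.wait(self._interval):
            self._ti.update_heartbeat()

    def stop(self) -> None:
        self._stop_event.set()
```

The framework would start this thread before handing control to the operator and stop it on task completion, so every long-running operator benefits without operator-level changes.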

The reason I went this route here is that GenericTransfer blocks the Python thread completely during
executemany() across potentially thousands of paginated batches. During that time there is simply no
gap for any framework-level mechanism to fire a heartbeat, which is what causes the scheduler to
incorrectly treat the task as a zombie.

So this PR is really meant as a practical short-term fix for users hitting this today, while a proper
framework-level solution could be tracked as a separate improvement.

That said, I am fully open to whatever direction the maintainers prefer:

  • Keep this as a targeted operator-level fix for now
  • Or close this and open a dedicated issue proposing a background heartbeat thread at the LocalTaskJob
    level as the right long-term fix

Would love guidance from a core maintainer on which approach fits better with the project's direction.
Happy to do the work either way!

@Pranaykarvi
Contributor Author

The failure in Tests / Low dep tests: providers / All-prov:LowestDeps:14:3.10:exasol...odbc
appears to be a transient CI infrastructure issue — the Docker image failed to load from cache
with Error when loading image: None, which is unrelated to the changes in this PR.

Could a maintainer please re-run the failed job? Thanks!



Successfully merging this pull request may close these issues.

Long running task gets incorrectly killed in long-running DAG due to scheduler state mismatch
