Fix OTel metrics lost in forked task processes #64703

Merged
potiuk merged 1 commit into apache:main from MichaelRBlack:fix/otel-metrics-lost-in-forked-and-short-lived-processes on Apr 4, 2026

@MichaelRBlack
Contributor

Summary

Fixes #64690 — task-level OTel metrics (ti.finish, ti.start) are silently dropped in forked task subprocesses (LocalExecutor, CeleryExecutor).

Root cause

stats.py correctly detects PID mismatches after fork and re-initializes the Stats instance by calling get_otel_logger(). This creates a fresh MeterProvider and calls metrics.set_meter_provider().
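The PID check described above can be sketched roughly as follows. This is a minimal illustration, not the code in stats.py: the class name `LazyStats` and the `factory` argument are hypothetical, and the real implementation may differ in detail.

```python
import os


class LazyStats:
    """Hypothetical holder that rebuilds its backend when the PID changes."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds a fresh backend
        self._backend = None
        self._pid = None  # PID of the process that built the backend

    def get(self):
        current_pid = os.getpid()
        # After fork() the child inherits _backend and _pid from the parent,
        # so a PID mismatch means the backend belongs to another process.
        if self._backend is None or self._pid != current_pid:
            self._backend = self._factory()
            self._pid = current_pid
        return self._backend
```

In this sketch, `factory` plays the role that get_otel_logger() plays in the description: it is called once per process, and again whenever the recorded PID no longer matches the running process.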

However, the OTel Python SDK uses a Once() guard on set_meter_provider() that only allows it to be called once per process. The Once._done = True flag from the parent survives fork(), so the child's set_meter_provider() silently fails with:

WARNING - Overriding of current MeterProvider is not allowed
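To illustrate why the guard's flag carries over, here is a small stdlib-only sketch. The `Once` class, `set_provider`, and `child_can_register` are simplified stand-ins written for this example, not the SDK's own code, and the sketch assumes a platform where `os.fork()` is available.

```python
import os
import threading


class Once:
    """Simplified stand-in for a run-once guard (not the SDK's own class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do_once(self, func):
        # Run func only if no earlier call has completed.
        with self._lock:
            if self._done:
                return False
            func()
            self._done = True
            return True


_SET_ONCE = Once()
_PROVIDER = None


def set_provider(provider):
    """Register a provider; returns False if one was already registered."""

    def _set():
        global _PROVIDER
        _PROVIDER = provider

    return _SET_ONCE.do_once(_set)


def child_can_register():
    """Fork and report whether the child managed to register a provider."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: its memory, including _SET_ONCE._done, is a copy of the parent's.
        try:
            ok = set_provider("child-provider")
            os.write(write_fd, b"1" if ok else b"0")
        finally:
            os._exit(0)
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"
```

If the parent has already called `set_provider(...)` once, `child_can_register()` returns `False`: the child starts with `_done` already set, so its registration attempt is rejected just like a second call in the parent would be.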

The child ends up using the parent's stale MeterProvider whose PeriodicExportingMetricReader background thread is dead after fork.
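The "dead thread" part can be seen with a stdlib-only sketch: only the thread that calls `fork()` continues in the child, so a background worker started by the parent is not running there. The helper names below are hypothetical, and the idle worker merely stands in for a periodic exporter thread.

```python
import os
import threading
import warnings


def start_background_worker():
    """Start a daemon thread that idles, standing in for a periodic exporter."""
    worker = threading.Thread(target=threading.Event().wait, daemon=True)
    worker.start()
    return worker


def worker_alive_in_child(worker):
    """Fork and report whether the parent's worker thread is alive in the child."""
    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Newer Python versions warn when a multi-threaded process forks.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was carried over.
        try:
            os.write(write_fd, b"1" if worker.is_alive() else b"0")
        finally:
            os._exit(0)
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return answer == b"1"
```

In the parent, `worker.is_alive()` stays `True`, while `worker_alive_in_child(worker)` returns `False`. A provider object inherited by the child therefore still exists, but nothing is left to run its periodic export.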

Fix

Reset the OTel SDK's _METER_PROVIDER_SET_ONCE._done and _METER_PROVIDER in get_otel_logger() before calling set_meter_provider(). Since get_otel_logger() always intends to create and register a new provider, this is safe:

  • First call (no fork): _done is already False, so the reset is a no-op.
  • Re-init after fork: _done is True (inherited from parent), so the reset allows the new provider to be registered.
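The two cases above can be sketched against a stand-in for the SDK's global state. This is an illustration under stated assumptions, not the patch itself: `sdk_state`, the simplified `Once` class, and `register_fresh_provider` are hypothetical names, and only the attribute names `_METER_PROVIDER_SET_ONCE`, `_done`, and `_METER_PROVIDER` are taken from the description.

```python
import threading
from types import SimpleNamespace


class Once:
    """Stand-in for the SDK's run-once guard; only the parts used here."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do_once(self, func):
        with self._lock:
            if self._done:
                return False
            func()
            self._done = True
            return True


# Stand-in for the SDK module that holds the global provider state.
sdk_state = SimpleNamespace(_METER_PROVIDER_SET_ONCE=Once(), _METER_PROVIDER=None)


def set_meter_provider(provider):
    """Guarded setter: only the first call per guard takes effect."""

    def _set():
        sdk_state._METER_PROVIDER = provider

    return sdk_state._METER_PROVIDER_SET_ONCE.do_once(_set)


def register_fresh_provider(provider):
    """Reset the guard and the cached provider, then register the new one."""
    sdk_state._METER_PROVIDER_SET_ONCE._done = False  # no-op on the first call
    sdk_state._METER_PROVIDER = None
    return set_meter_provider(provider)
```

Calling `register_fresh_provider` twice registers the second provider, whereas a plain second call to `set_meter_provider` is rejected. The trade-off is that the reset reaches into private attributes, so it depends on those names staying stable across SDK releases.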

Changes

  • shared/observability/src/.../otel_logger.py — reset Once() guard before set_meter_provider()
  • shared/observability/tests/.../test_otel_logger.py — add test_reinit_after_fork_exports_metrics that calls get_otel_logger() twice and verifies metrics from the second initialization are exported

Test plan

  • Existing test_atexit_flush_on_process_exit continues to pass (single init path unchanged)
  • New test_reinit_after_fork_exports_metrics passes — verifies that calling get_otel_logger() twice (simulating post-fork re-init) correctly exports metrics from the second provider
  • Manual verification: with LocalExecutor or CeleryExecutor and otel_on = True, ti.finish metrics appear in the OTel collector after task completion

Reset the OTel SDK's Once() guard on _METER_PROVIDER_SET_ONCE before
calling set_meter_provider() in get_otel_logger(). When a forked child
process re-initializes Stats (detected via PID mismatch in stats.py),
the inherited Once._done = True flag prevents the new MeterProvider from
being registered. The child falls back to the parent's stale provider
whose PeriodicExportingMetricReader thread is dead after fork, causing
task-level metrics like ti.finish to be silently dropped.

The fix resets _done and _METER_PROVIDER before each set_meter_provider()
call. On first initialization (no fork), _done is already False so this
is a no-op. On re-initialization after fork, it allows the new provider
to be set correctly.

Closes: apache#64690
@potiuk added the backport-to-v3-2-test label Apr 4, 2026
@potiuk added this to the Airflow 3.2.1 milestone Apr 4, 2026
@potiuk merged commit ff77bd2 into apache:main Apr 4, 2026
66 checks passed
@potiuk
Member

potiuk commented Apr 4, 2026

Thanks!

github-actions bot pushed a commit that referenced this pull request Apr 4, 2026
(cherry picked from commit ff77bd2)

Co-authored-by: Michael Black <4128408+MichaelRBlack@users.noreply.github.com>
Closes: #64690
@github-actions

github-actions bot commented Apr 4, 2026

Backport successfully created: v3-2-test

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

Status Branch Result
v3-2-test PR Link

Development

Successfully merging this pull request may close these issues.

OTel task-level metrics (ti.finish, ti.start) lost — forked processes and KubernetesExecutor
