Fix OTel metrics lost in forked task processes #64703
Merged
potiuk merged 1 commit into apache:main on Apr 4, 2026
Reset the OTel SDK's Once() guard on _METER_PROVIDER_SET_ONCE before calling set_meter_provider() in get_otel_logger().

When a forked child process re-initializes Stats (detected via PID mismatch in stats.py), the inherited Once._done = True flag prevents the new MeterProvider from being registered. The child falls back to the parent's stale provider, whose PeriodicExportingMetricReader thread is dead after fork, so task-level metrics like ti.finish are silently dropped.

The fix resets _done and _METER_PROVIDER before each set_meter_provider() call. On first initialization (no fork), _done is already False, so this is a no-op. On re-initialization after fork, it allows the new provider to be set correctly.

Closes: apache#64690
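The guard behavior described above can be illustrated with a small stand-in model. This is only a sketch: the `Once` class, `set_provider`, `reset_and_set_provider`, and `current_provider` below are hypothetical names written for illustration, not the OTel SDK's own code or Airflow's actual change. It assumes the guard is a plain boolean flag that, once set, rejects later registrations until it is cleared.

```python
import threading


class Once:
    """Hypothetical run-once guard, modeled loosely on the behavior described above."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = False

    def do_once(self, func) -> bool:
        # Run func only if the guard has not fired yet; report whether it ran.
        with self._lock:
            if self._done:
                return False
            func()
            self._done = True
            return True


# Module-level state, standing in for the SDK's global provider and its guard.
_PROVIDER_SET_ONCE = Once()
_PROVIDER = None


def set_provider(provider) -> bool:
    """Register a provider; silently ignored if one was already registered."""

    def _set() -> None:
        global _PROVIDER
        _PROVIDER = provider

    return _PROVIDER_SET_ONCE.do_once(_set)


def reset_and_set_provider(provider) -> bool:
    """Clear the guard and the stored provider, then register the new one."""
    global _PROVIDER
    _PROVIDER_SET_ONCE._done = False  # no-op when the guard has never fired
    _PROVIDER = None
    return set_provider(provider)


def current_provider():
    return _PROVIDER
```

In this model, a second `set_provider()` call leaves the first provider in place, which mirrors the state a forked child inherits, while `reset_and_set_provider()` succeeds both on first use and after an earlier registration.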
potiuk approved these changes on Apr 4, 2026
Member
Thanks!
github-actions bot pushed a commit that referenced this pull request on Apr 4, 2026
(cherry picked from commit ff77bd2)
Co-authored-by: Michael Black <4128408+MichaelRBlack@users.noreply.github.com>
Closes: #64690
Backport successfully created: v3-2-test
Note: As of Merging PRs targeted for Airflow 3.X
If in doubt, please ask in the #release-management Slack channel.
Summary
Fixes #64690: task-level OTel metrics (ti.finish, ti.start) are silently dropped in forked task subprocesses (LocalExecutor, CeleryExecutor).

Root cause
stats.py correctly detects PID mismatches after fork and re-initializes the Stats instance by calling get_otel_logger(). This creates a fresh MeterProvider and calls metrics.set_meter_provider().

However, the OTel Python SDK uses a Once() guard on set_meter_provider() that allows the provider to be set only once per process. The Once._done = True flag from the parent survives fork(), so the child's set_meter_provider() call silently fails. The child ends up using the parent's stale MeterProvider, whose PeriodicExportingMetricReader background thread is dead after fork.

Fix
Reset the OTel SDK's _METER_PROVIDER_SET_ONCE._done and _METER_PROVIDER in get_otel_logger() before calling set_meter_provider(). Since get_otel_logger() always intends to create and register a new provider, this is safe:
- On first initialization (no fork), _done is already False, so the reset is a no-op.
- On re-initialization after fork, _done is True (inherited from parent), so the reset allows the new provider to be registered.

Changes
- shared/observability/src/.../otel_logger.py: reset Once() guard before set_meter_provider()
- shared/observability/tests/.../test_otel_logger.py: add test_reinit_after_fork_exports_metrics, which calls get_otel_logger() twice and verifies metrics from the second initialization are exported

Test plan
- test_atexit_flush_on_process_exit continues to pass (single init path unchanged)
- test_reinit_after_fork_exports_metrics passes: verifies that calling get_otel_logger() twice (simulating post-fork re-init) correctly exports metrics from the second provider
- With otel_on = True, ti.finish metrics appear in the OTel collector after task completion
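The root-cause claim that a background thread does not survive fork() can be checked with the standard library alone. The sketch below is not part of this PR and does not use the OTel SDK: `thread_alive_after_fork` is a hypothetical helper, the idle worker thread merely stands in for an exporter thread, and the code assumes a POSIX platform where os.fork() is available.

```python
import os
import threading


def thread_alive_after_fork() -> tuple[bool, bool]:
    """Return (alive_in_parent, alive_in_child) for a background thread.

    POSIX only: relies on os.fork(). The worker is a stand-in for a
    periodic exporter thread, not the SDK's reader.
    """
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so the worker
        # object is inherited but no longer backed by a running thread.
        os.close(read_fd)
        os.write(write_fd, b"1" if worker.is_alive() else b"0")
        os._exit(0)

    # Parent: collect the child's answer, then clean up.
    os.close(write_fd)
    child_answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)

    alive_in_parent = worker.is_alive()
    stop.set()
    worker.join()
    return alive_in_parent, child_answer == b"1"
```

Because the child inherits the thread object without a running thread behind it, anything that depends on that thread to flush data periodically has to be rebuilt in the child, which is why re-registering a fresh provider matters. Recent Python versions may emit a DeprecationWarning when forking a multi-threaded process.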