Retry PET refresh on process crash (ConnectionError), not just timeout #1447
Open
StellaHuang95 wants to merge 1 commit into microsoft:main from
## Problem

Telemetry from Kusto shows that ~99% of all manager registration failures happen at the `nativeFinderRefresh` stage. These failures break down into two categories: `spawn_timeout` and `process_crash`.

The `spawn_timeout` failures are already retried: when an `RpcTimeoutError` is caught, the code kills the hung process, sets `processExited = true`, and retries on a fresh PET instance. This retry path has existed since the retry logic was introduced.

However, `process_crash` failures get zero retries. When PET crashes mid-request, the `vscode-jsonrpc` library throws a `ConnectionError`. The existing retry condition only checks for `RpcTimeoutError`, so `ConnectionError` falls straight through to `throw ex`, and the request fails permanently even though the exact same restart infrastructure could recover it.

This means ~20% of all failures are unrecoverable for no architectural reason. The restart machinery (`ensureProcessRunning` → `restart()` with exponential backoff) is already built and works; it just wasn't wired up for crash recovery.
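The failure mode above can be sketched with a minimal, self-contained retry loop. The error classes and function bodies here are illustrative stand-ins, not the extension's actual code (the real `ConnectionError` comes from `vscode-jsonrpc`):

```typescript
// Illustrative stand-ins for the real error types: RpcTimeoutError is the
// extension's timeout error, ConnectionError is thrown by vscode-jsonrpc.
class RpcTimeoutError extends Error {}
class ConnectionError extends Error {}

// Simplified pre-fix retry loop: only timeouts qualify for a retry.
async function doRefresh(
  attempt: () => Promise<string>,
  maxRetries: number,
): Promise<string> {
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (ex) {
      // ConnectionError is not listed here, so a PET crash falls straight
      // through to `throw ex` with zero retries.
      if (ex instanceof RpcTimeoutError && tries < maxRetries) {
        continue;
      }
      throw ex;
    }
  }
}

// A crash on the first call fails permanently, even though a second
// attempt on a fresh process would have succeeded.
async function demo(): Promise<string> {
  let calls = 0;
  try {
    return await doRefresh(async () => {
      calls++;
      if (calls === 1) throw new ConnectionError('PET process crashed');
      return 'ok';
    }, 1);
  } catch {
    return `failed after ${calls} call(s)`;
  }
}
```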
## Fix

Extend the retry condition in three locations to also handle `rpc.ConnectionError`:

1. **`doRefresh()` retry loop**: the outer loop that decides whether to `continue` to the next attempt. Adding `ConnectionError` here lets crashed requests get a retry with a fresh PET process, identical to how timeouts are already handled.
2. **`doRefreshAttempt()` catch block**: the inner catch that marks the process as exited before rethrowing. Without this, a `ConnectionError` could propagate with `processExited` still `false` (if the async `exit` event handler hasn't fired yet), causing `ensureProcessRunning()` on the retry to skip the restart.
3. **`resolve()` catch block**: `resolve()` has no retry loop, but it needs to mark the process as exited on `ConnectionError` so the next request triggers a restart instead of trying to use the dead connection.
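The extended condition can be sketched as a shared predicate plus a simplified retry loop. Again, these are hedged stand-ins, not the extension's actual code; the `onRecoverable` callback models the `killProcess()` + `processExited = true` step:

```typescript
// Stand-ins for the real error types; in the extension, ConnectionError
// is vscode-jsonrpc's rpc.ConnectionError.
class RpcTimeoutError extends Error {}
class ConnectionError extends Error {}

// Post-fix retry condition: both a hung request (timeout) and a crashed
// process (connection error) are recoverable.
function isRecoverable(ex: unknown): boolean {
  return ex instanceof RpcTimeoutError || ex instanceof ConnectionError;
}

// Simplified post-fix retry loop: on a recoverable error, kill the
// (possibly already dead) process, mark it exited so the next attempt
// restarts it, and continue.
async function doRefresh(
  attempt: () => Promise<string>,
  maxRetries: number,
  onRecoverable: () => void, // models killProcess() + processExited = true
): Promise<string> {
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (ex) {
      if (isRecoverable(ex) && tries < maxRetries) {
        onRecoverable();
        continue;
      }
      throw ex;
    }
  }
}

// The same crash-on-first-call scenario now succeeds on the retry.
async function demo(): Promise<string> {
  let calls = 0;
  let markedExited = false;
  const result = await doRefresh(
    async () => {
      calls++;
      if (calls === 1) throw new ConnectionError('PET process crashed');
      return 'ok';
    },
    1,
    () => { markedExited = true; },
  );
  return `${result} after ${calls} calls, exited=${markedExited}`;
}
```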
## How the retry works

The retry doesn't reuse the crashed process; it starts a brand new one:

1. `doRefresh` catches `ConnectionError` → `killProcess()` (a no-op if the process is already dead) + `processExited = true` → `continue`
2. `doRefreshAttempt()` → `ensureProcessRunning()`
3. `ensureProcessRunning` sees `processExited === true` → calls `restart()`
4. `restart()` spawns a fresh PET child process with a new JSON-RPC connection (with exponential backoff)

This is the exact same path already taken for timeouts. The `MAX_RESTART_ATTEMPTS = 3` limit still applies: if PET keeps crashing, we don't retry forever.
## Safety

- `killProcess()` is idempotent: it checks `proc.exitCode === null` before killing, so it is a no-op on an already-dead process.
- `ConnectionError` can still propagate to callers: the only consumer is `classifyError()` in the telemetry error classifier, which still works correctly since errors only reach callers when retries are exhausted.
- Retries remain bounded by the existing `MAX_REFRESH_RETRIES = 1` (one retry) and `MAX_RESTART_ATTEMPTS = 3` limits. No change to worst-case timing.

## Expected impact