Retry PET refresh on process crash (ConnectionError), not just timeout #1447

Open
StellaHuang95 wants to merge 1 commit into microsoft:main from StellaHuang95:retryPET

Conversation

@StellaHuang95
Contributor

Problem

Telemetry from Kusto shows that ~99% of all manager registration failures happen at the nativeFinderRefresh stage. These failures break down into two categories:

Error Type      % of Failures   Meaning
spawn_timeout   ~70%            PET process started but didn't respond within the 30s timeout
process_crash   ~20%            PET process died mid-request (JSON-RPC connection dropped)

The spawn_timeout failures are already retried — when a RpcTimeoutError is caught, the code kills the hung process, sets processExited = true, and retries on a fresh PET instance. This retry path has existed since the retry logic was introduced.

However, process_crash failures get zero retries. When PET crashes mid-request, the vscode-jsonrpc library throws a ConnectionError. The existing retry condition only checks for RpcTimeoutError, so ConnectionError falls straight through to throw ex — the request fails permanently even though the exact same restart infrastructure could recover it.
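To make the gap concrete, here is a minimal sketch of the pre-fix retry predicate. The class names mirror the ones the description uses (RpcTimeoutError from the extension, ConnectionError from vscode-jsonrpc), but the classes are defined locally here as stand-ins; the real types come from the respective libraries.

```typescript
// Local stand-ins for the real error types referenced in the PR description.
class RpcTimeoutError extends Error {}
class ConnectionError extends Error {} // stands in for rpc.ConnectionError

// Pre-fix retry condition: only timeouts qualify for a retry,
// so a mid-request crash falls straight through to `throw ex`.
function shouldRetryOld(err: unknown): boolean {
    return err instanceof RpcTimeoutError;
}

console.log(shouldRetryOld(new RpcTimeoutError('hung')));   // true  -> retried
console.log(shouldRetryOld(new ConnectionError('crashed'))); // false -> fails permanently
```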

This means ~20% of all failures are unrecoverable for no architectural reason. The restart machinery (ensureProcessRunning() calling restart() with exponential backoff) is already built and works — it just wasn't wired up for crash recovery.

Fix

Extend the retry condition in three locations to also handle rpc.ConnectionError:

  1. doRefresh() retry loop — The outer loop that decides whether to continue to the next attempt. Adding ConnectionError here lets crashed requests get a retry with a fresh PET process, identical to how timeouts are already handled.

  2. doRefreshAttempt() catch block — The inner catch that marks the process as exited before rethrowing. Without this, a ConnectionError could propagate with processExited still false (if the async exit event handler hasn't fired yet), causing ensureProcessRunning() on the retry to skip the restart.

  3. resolve() catch block — resolve() has no retry loop, but it needs to mark the process as exited on ConnectionError so the next request triggers a restart instead of trying to use the dead connection.

How the retry works

The retry doesn't reuse the crashed process — it starts a brand new one:

  1. doRefresh catches ConnectionError → killProcess() (no-op if already dead) + processExited = true → continue
  2. Next iteration → doRefreshAttempt() → ensureProcessRunning()
  3. ensureProcessRunning sees processExited === true → calls restart()
  4. restart() spawns a fresh PET child process with a new JSON-RPC connection (with exponential backoff)
  5. The refresh request runs against the new process

This is the exact same path already taken for timeouts. The MAX_RESTART_ATTEMPTS = 3 limit still applies — if PET keeps crashing, we don't retry forever.

Safety

  • killProcess() is idempotent — checks proc.exitCode === null before killing; no-op on an already-dead process.
  • No callers depend on ConnectionError propagating — the only consumer is classifyError() in the telemetry error classifier, which still works correctly since errors only reach callers when retries are exhausted.
  • No new retry budget — uses the existing MAX_REFRESH_RETRIES = 1 (one retry) and MAX_RESTART_ATTEMPTS = 3 limits. No change to worst-case timing.

Expected impact

  • ~20% of currently-unrecoverable failures become recoverable (process crash → restart → retry succeeds)
  • No change to timeout behavior (existing path unchanged)
  • Improved log messages distinguish "crashed" vs "timed out" for easier diagnostics
