Retry PET refresh on process crash (ConnectionError), not just timeout #1447
Open
StellaHuang95 wants to merge 1 commit into microsoft:main from
## Problem

Telemetry from Kusto shows that ~99% of all manager registration failures happen at the `nativeFinderRefresh` stage. These failures break down into two categories: `spawn_timeout` and `process_crash`.

The `spawn_timeout` failures are already retried: when an `RpcTimeoutError` is caught, the code kills the hung process, sets `processExited = true`, and retries on a fresh PET instance. This retry path has existed since the retry logic was introduced.

However, `process_crash` failures get zero retries. When PET crashes mid-request, the `vscode-jsonrpc` library throws a `ConnectionError`. The existing retry condition only checks for `RpcTimeoutError`, so `ConnectionError` falls straight through to `throw ex`, and the request fails permanently even though the exact same restart infrastructure could recover it.

This means ~20% of all failures are unrecoverable for no architectural reason. The restart machinery (`ensureProcessRunning` → `restart()` with exponential backoff) is already built and works; it just wasn't wired up for crash recovery.
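The failure mode above can be sketched with a minimal, self-contained retry loop. The error classes and function bodies here are illustrative stand-ins, not the extension's actual code (the real `ConnectionError` comes from `vscode-jsonrpc`):

```typescript
// Illustrative stand-ins for the real error types: RpcTimeoutError is the
// extension's timeout error, ConnectionError is thrown by vscode-jsonrpc.
class RpcTimeoutError extends Error {}
class ConnectionError extends Error {}

// Simplified pre-fix retry loop: only timeouts qualify for a retry.
async function doRefresh(
  attempt: () => Promise<string>,
  maxRetries: number,
): Promise<string> {
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (ex) {
      // ConnectionError is not listed here, so a PET crash falls straight
      // through to `throw ex` with zero retries.
      if (ex instanceof RpcTimeoutError && tries < maxRetries) {
        continue;
      }
      throw ex;
    }
  }
}

// A crash on the first call fails permanently, even though a second
// attempt on a fresh process would have succeeded.
async function demo(): Promise<string> {
  let calls = 0;
  try {
    return await doRefresh(async () => {
      calls++;
      if (calls === 1) throw new ConnectionError('PET process crashed');
      return 'ok';
    }, 1);
  } catch {
    return `failed after ${calls} call(s)`;
  }
}
```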
## Fix

Extend the retry condition in three locations to also handle `rpc.ConnectionError`:

1. **`doRefresh()` retry loop**: the outer loop that decides whether to `continue` to the next attempt. Adding `ConnectionError` here lets crashed requests get a retry with a fresh PET process, identical to how timeouts are already handled.
2. **`doRefreshAttempt()` catch block**: the inner catch that marks the process as exited before rethrowing. Without this, a `ConnectionError` could propagate with `processExited` still `false` (if the async `exit` event handler hasn't fired yet), causing `ensureProcessRunning()` on the retry to skip the restart.
3. **`resolve()` catch block**: `resolve()` has no retry loop, but it needs to mark the process as exited on `ConnectionError` so the next request triggers a restart instead of trying to use the dead connection.
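The extended condition can be sketched as a shared predicate plus a simplified retry loop. Again, these are hedged stand-ins, not the extension's actual code; the `onRecoverable` callback models the `killProcess()` + `processExited = true` step:

```typescript
// Stand-ins for the real error types; in the extension, ConnectionError
// is vscode-jsonrpc's rpc.ConnectionError.
class RpcTimeoutError extends Error {}
class ConnectionError extends Error {}

// Post-fix retry condition: both a hung request (timeout) and a crashed
// process (connection error) are recoverable.
function isRecoverable(ex: unknown): boolean {
  return ex instanceof RpcTimeoutError || ex instanceof ConnectionError;
}

// Simplified post-fix retry loop: on a recoverable error, kill the
// (possibly already dead) process, mark it exited so the next attempt
// restarts it, and continue.
async function doRefresh(
  attempt: () => Promise<string>,
  maxRetries: number,
  onRecoverable: () => void, // models killProcess() + processExited = true
): Promise<string> {
  for (let tries = 0; ; tries++) {
    try {
      return await attempt();
    } catch (ex) {
      if (isRecoverable(ex) && tries < maxRetries) {
        onRecoverable();
        continue;
      }
      throw ex;
    }
  }
}

// The same crash-on-first-call scenario now succeeds on the retry.
async function demo(): Promise<string> {
  let calls = 0;
  let markedExited = false;
  const result = await doRefresh(
    async () => {
      calls++;
      if (calls === 1) throw new ConnectionError('PET process crashed');
      return 'ok';
    },
    1,
    () => { markedExited = true; },
  );
  return `${result} after ${calls} calls, exited=${markedExited}`;
}
```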
## How the retry works

The retry doesn't reuse the crashed process; it starts a brand new one:

1. `doRefresh` catches `ConnectionError` → `killProcess()` (a no-op if the process is already dead) + `processExited = true` → `continue`
2. `doRefreshAttempt()` → `ensureProcessRunning()`
3. `ensureProcessRunning` sees `processExited === true` → calls `restart()`
4. `restart()` spawns a fresh PET child process with a new JSON-RPC connection (with exponential backoff)

This is the exact same path already taken for timeouts. The `MAX_RESTART_ATTEMPTS = 3` limit still applies: if PET keeps crashing, we don't retry forever.
## Safety

- `killProcess()` is idempotent: it checks `proc.exitCode === null` before killing, so it is a no-op on an already-dead process.
- `ConnectionError` can still propagate to callers: the only consumer is `classifyError()` in the telemetry error classifier, which still works correctly since errors only reach callers when retries are exhausted.
- Retries remain bounded by the existing `MAX_REFRESH_RETRIES = 1` (one retry) and `MAX_RESTART_ATTEMPTS = 3` limits. No change to worst-case timing.

## Expected impact