fix(ci): retry npm publish on transient registry errors#833
Conversation
npm publish with OIDC provenance occasionally fails with transient E404 errors from the registry. Wrap both platform and main package publish calls with a retry helper (3 attempts, exponential backoff starting at 10s) to handle these without requiring a manual workflow re-run.
|
Claude finished @carlos-alm's task —— View job PR Review: Retry Logic for npm PublishTodo List
Phase 1: Root Cause Analysis ✅Problem Being Solved: The 3.9.0 release failed with a transient Why This Approach: Adding retry logic with exponential backoff (3 attempts, starting at 10s) directly addresses transient registry failures. The approach preserves existing idempotency guards ( Risk Assessment: LOW - This is purely defensive code that doesn't change core functionality, only adds resilience to known transient failures. Phase 2: Configuration Changes ✅✅ No high-risk configuration changes detected
Phase 3: Test Integrity ✅✅ No test integrity issues
Phase 4: Code Quality AnalysisStrengths:
Technical Issues:
Minor Concerns:
Phase 5: Backlog Compliance AssessmentZero-dep: ✅ - No new runtime dependencies added, uses existing bash/npm tooling Foundation-aligned: ✅ - Fully compliant with all principles:
Problem-fit: 2/5 - This is build infrastructure reliability, not core "AI navigation" functionality. While important for releases, it doesn't directly address the core problem of AI coding assistants wasting tokens or missing blast radius. Breaking: No - Purely additive, no behavior changes for successful publishes Tier: Tier 1 (zero-dep ✓ + foundation-aligned ✓) Root Cause AnalysisProblem Being Solved: Transient npm registry failures during stable releases (specifically E404 PUT failures after successful OIDC signing) Backlog Compliance
Critical Concerns
Recommendations
Final Recommendation
Suggested follow-up: Consider refactoring to eliminate duplication and add error-specific retry logic in a future PR. Job Run: |
Greptile SummaryThis PR fixes transient npm registry publish failures (the specific trigger was an Two previous review concerns have been fully addressed:
Key changes:
Confidence Score: 5/5Safe to merge — both previous P1 concerns (missing between-attempt idempotency and duplicated retry function) are fully resolved. All prior review findings are addressed. The retry logic correctly handles the motivating ETIMEDOUT/slow-success scenario: idempotency is checked after every failed attempt (including the last), preventing E403 self-defeat. No remaining P0 or P1 issues were found. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[npm_publish_retry called
pkg_spec + publish cmd] --> B[attempt N]
B --> C{npm publish
succeeds?}
C -->|yes| D[return 0 ✓]
C -->|no| E{npm view pkg_spec
version succeeds?}
E -->|yes — slow-success
package already landed| F[notice: treating as success
return 0 ✓]
E -->|no| G{attempt < max_attempts?}
G -->|yes| H[warning: retrying
sleep delay
delay *= 2]
H --> B
G -->|no — all attempts exhausted| I[error: failed after N attempts
return 1 ✗]
Reviews (2): Last reviewed commit: "fix(ci): add between-attempt idempotency..." | Re-trigger Greptile |
.github/workflows/publish.yml
Outdated
| npm_publish_retry() { | ||
| local max_attempts=3 | ||
| local delay=10 | ||
| for attempt in $(seq 1 $max_attempts); do | ||
| if "$@"; then | ||
| return 0 | ||
| fi | ||
| if [ "$attempt" -lt "$max_attempts" ]; then | ||
| echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..." | ||
| sleep "$delay" | ||
| delay=$((delay * 2)) | ||
| fi | ||
| done | ||
| echo "::error::npm publish failed after $max_attempts attempts" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
Retry loop has no between-attempt idempotency check
The PR description states that "existing idempotency guards (npm view skip check) prevent double-publishing if a slow success is followed by a retry," but those guards are one-shot pre-checks executed before the retry loop begins — not between retry attempts.
Consider the exact failure mode that motivated this PR: npm publish uploads the tarball successfully server-side, but the connection drops before the 200 OK response arrives (ETIMEDOUT). The CLI exits non-zero, so the retry fires. Attempt 2 calls npm publish for a version that already exists, which fails with E403 You cannot publish over the previously published versions — a non-transient, permanent error. Attempts 2 and 3 both fail for the same reason, and the step exits 1, reproducing the outage.
The same issue applies to the platform-packages loop (lines 511–518): the npm view skip check runs once before calling npm_publish_retry, not between retries.
A robust fix is to npm view the package before each retry (not before attempt 1, since the pre-check covers that):
npm_publish_retry() {
local pkg_spec="$1" # e.g. "@optave/codegraph@3.9.0"
shift # remaining args are the full publish command
local max_attempts=3
local delay=10
for attempt in $(seq 1 $max_attempts); do
if "$@"; then
return 0
fi
# Before retrying, check if the publish already landed (slow-success / ETIMEDOUT)
if npm view "$pkg_spec" version 2>/dev/null; then
echo "::notice::$pkg_spec is now visible on the registry — treating as success"
return 0
fi
if [ "$attempt" -lt "$max_attempts" ]; then
echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..."
sleep "$delay"
delay=$((delay * 2))
fi
done
echo "::error::npm publish failed after $max_attempts attempts"
return 1
}There was a problem hiding this comment.
Fixed. The retry function now accepts a pkg_spec argument (e.g. @optave/codegraph@3.9.0) and runs npm view "$pkg_spec" version between retry attempts. If the package is already visible on the registry (slow-success / ETIMEDOUT scenario), it returns 0 instead of retrying into an E403. See scripts/npm-publish-retry.sh and both call sites in publish.yml.
.github/workflows/publish.yml
Outdated
| npm_publish_retry() { | ||
| local max_attempts=3 | ||
| local delay=10 | ||
| for attempt in $(seq 1 $max_attempts); do | ||
| if "$@"; then | ||
| return 0 | ||
| fi | ||
| if [ "$attempt" -lt "$max_attempts" ]; then | ||
| echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..." | ||
| sleep "$delay" | ||
| delay=$((delay * 2)) | ||
| fi | ||
| done | ||
| echo "::error::npm publish failed after $max_attempts attempts" | ||
| return 1 | ||
| } | ||
| npm_publish_retry npm publish --access public --provenance --tag "$NPM_TAG" |
There was a problem hiding this comment.
Duplicate function definition across two steps
npm_publish_retry is defined identically in both the "Publish platform packages" step (lines 444–459) and the "Publish main package" step (lines 527–542). Each GitHub Actions run: block spawns a fresh shell, so the function can't be shared — but the duplication means any future change to the retry policy (e.g. changing max_attempts or delay) must be applied in both places.
Consider extracting it to a checked-in shell script, e.g. scripts/npm-publish-retry.sh, and sourcing it in each step:
. scripts/npm-publish-retry.sh
npm_publish_retry "${pkg_name}@${VERSION}" npm publish "${PKG_DIR}/$platform" --access public --provenance --tag "$NPM_TAG"This eliminates the duplication and makes the retry policy a single, reviewable, testable artifact.
There was a problem hiding this comment.
Fixed. Extracted npm_publish_retry to scripts/npm-publish-retry.sh and both publish steps now source it. The retry policy is a single reviewable artifact — no more duplication.
…logic Extract npm_publish_retry to scripts/npm-publish-retry.sh to eliminate duplication between the two publish steps. Add npm view check between retry attempts to handle the slow-success scenario where the server accepts the upload but the client times out — without this, retries would fail with E403 on an already-published version.
Codegraph Impact Analysis1 functions changed → 0 callers affected across 0 files
|
Summary
npm publishcalls in the stable publish job with a retry helper (3 attempts, exponential backoff starting at 10s)npm viewskip check) prevent double-publishing if a slow success is followed by a retryContext
The 3.9.0 publish failed with a transient
E404 Not FoundonPUT @optave/codegraph-darwin-arm64— the OIDC provenance signed successfully but the registry rejected the PUT. This is a known intermittent npm issue.Test plan