fix(ci): retry npm publish on transient registry errors by carlos-alm · Pull Request #833 · optave/ops-codegraph-tool

carlos-alm · 2026-04-04T11:06:51Z

Summary

Wraps npm publish calls in the stable publish job with a retry helper (3 attempts, exponential backoff starting at 10s)
Handles transient npm registry errors (E404, E500, ETIMEDOUT) that caused the 3.9.0 release to fail
Existing idempotency guards (npm view skip check) prevent double-publishing if a slow success is followed by a retry

Context

The 3.9.0 publish failed with a transient E404 Not Found on PUT @optave/codegraph-darwin-arm64 — the OIDC provenance signed successfully but the registry rejected the PUT. This is a known intermittent npm issue.

Test plan

Verify workflow YAML is valid (CI will parse it)
Next stable release should auto-retry on transient failures instead of failing the whole workflow

npm publish with OIDC provenance occasionally fails with transient E404 errors from the registry. Wrap both platform and main package publish calls with a retry helper (3 attempts, exponential backoff starting at 10s) to handle these without requiring a manual workflow re-run.

claude · 2026-04-04T11:07:05Z

Claude finished @carlos-alm's task —— View job

PR Review: Retry Logic for npm Publish

Todo List

Read and analyze the changed workflow file
Conduct root cause analysis
Check for configuration changes and test integrity
Assess code quality
Evaluate backlog compliance
Provide final assessment with rating and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The 3.9.0 release failed with a transient E404 Not Found error on PUT @optave/codegraph-darwin-arm64. This is a documented npm registry issue where OIDC provenance signs successfully but the registry rejects the PUT operation due to temporary issues.

Why This Approach: Adding retry logic with exponential backoff (3 attempts, starting at 10s) directly addresses transient registry failures. The approach preserves existing idempotency guards (npm view checks) to prevent double-publishing.

Risk Assessment: LOW - This is purely defensive code that doesn't change core functionality, only adds resilience to known transient failures.

Phase 2: Configuration Changes ✅

✅ No high-risk configuration changes detected

Changes are purely additive (retry wrapper function)
No rules disabled or test thresholds lowered
Workflow structure preserved with existing safety checks

Phase 3: Test Integrity ✅

✅ No test integrity issues

No test files modified
No assertions weakened
Existing idempotency guards maintained

Phase 4: Code Quality Analysis

Strengths:

Clean implementation: Lines 444-459 and 527-542 implement identical retry logic
Proper error handling: Exponential backoff (10s → 20s → 40s) with clear logging
Preserves safety: Existing npm view checks prevent double-publishing
Good separation: Only wraps the actual publish commands, not setup logic

Technical Issues:

Code duplication (lines 444-459 vs 527-542): Identical retry functions defined twice
Shell function scope: Functions are defined inside bash heredoc which could be fragile
Missing error context: Retry logic doesn't capture/log the specific error type that triggered retry

Minor Concerns:

No explicit check that failures are actually transient (E404, E500, ETIMEDOUT) vs permanent errors
Could potentially retry on authentication failures or network issues unrelated to registry

Phase 5: Backlog Compliance Assessment

Zero-dep: ✅ - No new runtime dependencies added, uses existing bash/npm tooling

Foundation-aligned: ✅ - Fully compliant with all principles:

P1 (always-current): Doesn't affect core pipeline
P4 (zero-cost core): Pure CI/build reliability improvement
P8 (not an agent): Defensive infrastructure code only
P5 (embeddable-first): No API contract changes

Problem-fit: 2/5 - This is build infrastructure reliability, not core "AI navigation" functionality. While important for releases, it doesn't directly address the core problem of AI coding assistants wasting tokens or missing blast radius.

Breaking: No - Purely additive, no behavior changes for successful publishes

Tier: Tier 1 (zero-dep ✓ + foundation-aligned ✓)

Root Cause Analysis

Problem Being Solved: Transient npm registry failures during stable releases (specifically E404 PUT failures after successful OIDC signing)
Why This Approach: Retry with exponential backoff is the standard solution for transient service failures
Risk Assessment: Low risk - purely defensive with existing idempotency safeguards

Backlog Compliance

Zero-dep: ✅ - uses existing bash/npm tooling
Foundation-aligned: ✅ - pure infrastructure reliability improvement
Problem-fit: 2/5 - build reliability, not core AI navigation problem
Breaking: No - purely additive
Tier: 1

Critical Concerns

Code duplication: Identical npm_publish_retry function defined twice (.github/workflows/publish.yml:444-459, 527-542)
Missing error discrimination: Should ideally only retry on specific transient errors (E404, E500, ETIMEDOUT) rather than any failure
Function scope: Bash functions defined in heredoc could be fragile across shell environments

Recommendations

Extract retry function: Move npm_publish_retry to a shared script to eliminate duplication
Add error filtering: Parse npm error output to only retry on known transient errors
Enhanced logging: Log the specific error that triggered each retry attempt

Final Recommendation

Rating: ⭐⭐⭐☆☆ (3/5)
Action: APPROVE WITH CONDITIONS
Reasoning: Solves a real problem with appropriate technique, but has code quality issues (duplication) that should be addressed. The retry logic is sound and the existing safety mechanisms prevent double-publishing. While not high-impact for core functionality, it's important for release reliability.

Suggested follow-up: Consider refactoring to eliminate duplication and add error-specific retry logic in a future PR.

Job Run:
Branch: fix/publish-retry

greptile-apps · 2026-04-04T11:09:39Z

Greptile Summary

This PR fixes transient npm registry publish failures (the specific trigger was an E404 during the 3.9.0 release of @optave/codegraph-darwin-arm64) by introducing a retry wrapper with exponential backoff for all npm publish calls in the stable publish job.

Two previous review concerns have been fully addressed:

Between-attempt idempotency — npm_publish_retry now calls npm view \"$pkg_spec\" version after each failed attempt (not just as a one-shot pre-check), so a slow-success/ETIMEDOUT scenario where the registry accepted the upload but the client never got the 200 OK will be detected and treated as success rather than triggering a self-defeating E403 retry.
Duplicated function definition — the retry function has been extracted to scripts/npm-publish-retry.sh and both publish steps source it; the retry policy is now a single reviewable artifact.

Key changes:

scripts/npm-publish-retry.sh: New script defining npm_publish_retry(pkg_spec, ...cmd) — 3 attempts, 10 s / 20 s backoff, with idempotency check between each attempt.
.github/workflows/publish.yml: "Publish platform packages" and "Publish main package" steps now source scripts/npm-publish-retry.sh and call npm_publish_retry with the appropriate package spec and publish command.

Confidence Score: 5/5

Safe to merge — both previous P1 concerns (missing between-attempt idempotency and duplicated retry function) are fully resolved.

All prior review findings are addressed. The retry logic correctly handles the motivating ETIMEDOUT/slow-success scenario: idempotency is checked after every failed attempt (including the last), preventing E403 self-defeat. No remaining P0 or P1 issues were found.

No files require special attention.

Important Files Changed

Filename	Overview
scripts/npm-publish-retry.sh	New retry helper that correctly handles slow-success/ETIMEDOUT via per-attempt idempotency checks; logic and exit-code handling are sound.
.github/workflows/publish.yml	Both stable publish steps now source the shared retry script and pass the correct pkg_spec; previous duplication and missing between-attempt idempotency are resolved.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[npm_publish_retry called
pkg_spec + publish cmd] --> B[attempt N]
    B --> C{npm publish
succeeds?}
    C -->|yes| D[return 0 ✓]
    C -->|no| E{npm view pkg_spec
version succeeds?}
    E -->|yes — slow-success
package already landed| F[notice: treating as success
return 0 ✓]
    E -->|no| G{attempt < max_attempts?}
    G -->|yes| H[warning: retrying
sleep delay
delay *= 2]
    H --> B
    G -->|no — all attempts exhausted| I[error: failed after N attempts
return 1 ✗]

_{Reviews (2): Last reviewed commit: "fix(ci): add between-attempt idempotency..." | Re-trigger Greptile}

greptile-apps · 2026-04-04T11:09:43Z

.github/workflows/publish.yml

+          npm_publish_retry() {
+            local max_attempts=3
+            local delay=10
+            for attempt in $(seq 1 $max_attempts); do
+              if "$@"; then
+                return 0
+              fi
+              if [ "$attempt" -lt "$max_attempts" ]; then
+                echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..."
+                sleep "$delay"
+                delay=$((delay * 2))
+              fi
+            done
+            echo "::error::npm publish failed after $max_attempts attempts"
+            return 1
+          }


Retry loop has no between-attempt idempotency check

The PR description states that "existing idempotency guards (npm view skip check) prevent double-publishing if a slow success is followed by a retry," but those guards are one-shot pre-checks executed before the retry loop begins — not between retry attempts.

Consider the exact failure mode that motivated this PR: npm publish uploads the tarball successfully server-side, but the connection drops before the 200 OK response arrives (ETIMEDOUT). The CLI exits non-zero, so the retry fires. Attempt 2 calls npm publish for a version that already exists, which fails with E403 You cannot publish over the previously published versions — a non-transient, permanent error. Attempts 2 and 3 both fail for the same reason, and the step exits 1, reproducing the outage.

The same issue applies to the platform-packages loop (lines 511–518): the npm view skip check runs once before calling npm_publish_retry, not between retries.

A robust fix is to npm view the package before each retry (not before attempt 1, since the pre-check covers that):

npm_publish_retry() { local pkg_spec="$1" # e.g. "@optave/codegraph@3.9.0" shift # remaining args are the full publish command local max_attempts=3 local delay=10 for attempt in $(seq 1 $max_attempts); do if "$@"; then return 0 fi # Before retrying, check if the publish already landed (slow-success / ETIMEDOUT) if npm view "$pkg_spec" version 2>/dev/null; then echo "::notice::$pkg_spec is now visible on the registry — treating as success" return 0 fi if [ "$attempt" -lt "$max_attempts" ]; then echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..." sleep "$delay" delay=$((delay * 2)) fi done echo "::error::npm publish failed after $max_attempts attempts" return 1 }

Fixed. The retry function now accepts a pkg_spec argument (e.g. @optave/codegraph@3.9.0) and runs npm view "$pkg_spec" version between retry attempts. If the package is already visible on the registry (slow-success / ETIMEDOUT scenario), it returns 0 instead of retrying into an E403. See scripts/npm-publish-retry.sh and both call sites in publish.yml.

greptile-apps · 2026-04-04T11:09:44Z

.github/workflows/publish.yml

+          npm_publish_retry() {
+            local max_attempts=3
+            local delay=10
+            for attempt in $(seq 1 $max_attempts); do
+              if "$@"; then
+                return 0
+              fi
+              if [ "$attempt" -lt "$max_attempts" ]; then
+                echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..."
+                sleep "$delay"
+                delay=$((delay * 2))
+              fi
+            done
+            echo "::error::npm publish failed after $max_attempts attempts"
+            return 1
+          }
+          npm_publish_retry npm publish --access public --provenance --tag "$NPM_TAG"


Duplicate function definition across two steps

npm_publish_retry is defined identically in both the "Publish platform packages" step (lines 444–459) and the "Publish main package" step (lines 527–542). Each GitHub Actions run: block spawns a fresh shell, so the function can't be shared — but the duplication means any future change to the retry policy (e.g. changing max_attempts or delay) must be applied in both places.

Consider extracting it to a checked-in shell script, e.g. scripts/npm-publish-retry.sh, and sourcing it in each step:

. scripts/npm-publish-retry.sh npm_publish_retry "${pkg_name}@${VERSION}" npm publish "${PKG_DIR}/$platform" --access public --provenance --tag "$NPM_TAG"

This eliminates the duplication and makes the retry policy a single, reviewable, testable artifact.

Fixed. Extracted npm_publish_retry to scripts/npm-publish-retry.sh and both publish steps now source it. The retry policy is a single reviewable artifact — no more duplication.

…logic Extract npm_publish_retry to scripts/npm-publish-retry.sh to eliminate duplication between the two publish steps. Add npm view check between retry attempts to handle the slow-success scenario where the server accepts the upload but the client times out — without this, retries would fail with E403 on an already-published version.

carlos-alm · 2026-04-04T20:45:34Z

@greptileai

github-actions · 2026-04-04T20:51:46Z

Codegraph Impact Analysis

1 functions changed → 0 callers affected across 0 files

npm_publish_retry in scripts/npm-publish-retry.sh:13 (0 transitive callers)

greptile-apps bot reviewed Apr 4, 2026

View reviewed changes

Merge branch 'main' into fix/publish-retry

d691cd6

carlos-alm merged commit 11871c6 into main Apr 4, 2026
12 checks passed

carlos-alm deleted the fix/publish-retry branch April 4, 2026 22:03

github-actions bot locked and limited conversation to collaborators Apr 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): retry npm publish on transient registry errors#833

fix(ci): retry npm publish on transient registry errors#833
carlos-alm merged 3 commits intomainfrom
fix/publish-retry

carlos-alm commented Apr 4, 2026

Uh oh!

claude bot commented Apr 4, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Apr 4, 2026 •

edited

Loading

Uh oh!

greptile-apps bot Apr 4, 2026

Uh oh!

carlos-alm Apr 4, 2026

Uh oh!

greptile-apps bot Apr 4, 2026

Uh oh!

carlos-alm Apr 4, 2026

Uh oh!

carlos-alm commented Apr 4, 2026

Uh oh!

github-actions bot commented Apr 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented Apr 4, 2026

Summary

Context

Test plan

Uh oh!

claude bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Retry Logic for npm Publish

Todo List

Phase 1: Root Cause Analysis ✅

Phase 2: Configuration Changes ✅

Phase 3: Test Integrity ✅

Phase 4: Code Quality Analysis

Strengths:

Technical Issues:

Minor Concerns:

Phase 5: Backlog Compliance Assessment

Root Cause Analysis

Backlog Compliance

Critical Concerns

Recommendations

Final Recommendation

Uh oh!

greptile-apps bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm commented Apr 4, 2026

Uh oh!

github-actions bot commented Apr 4, 2026

Codegraph Impact Analysis

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Apr 4, 2026 •

edited

Loading

greptile-apps bot commented Apr 4, 2026 •

edited

Loading