Skip to content

fix(ci): retry npm publish on transient registry errors#833

Merged
carlos-alm merged 3 commits intomainfrom
fix/publish-retry
Apr 4, 2026
Merged

fix(ci): retry npm publish on transient registry errors#833
carlos-alm merged 3 commits intomainfrom
fix/publish-retry

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Wraps npm publish calls in the stable publish job with a retry helper (3 attempts, exponential backoff starting at 10s)
  • Handles transient npm registry errors (E404, E500, ETIMEDOUT) that caused the 3.9.0 release to fail
  • Existing idempotency guards (npm view skip check) prevent double-publishing if a slow success is followed by a retry

Context

The 3.9.0 publish failed with a transient E404 Not Found on PUT @optave/codegraph-darwin-arm64 — the OIDC provenance signed successfully but the registry rejected the PUT. This is a known intermittent npm issue.

Test plan

  • Verify workflow YAML is valid (CI will parse it)
  • Next stable release should auto-retry on transient failures instead of failing the whole workflow

npm publish with OIDC provenance occasionally fails with transient E404
errors from the registry. Wrap both platform and main package publish
calls with a retry helper (3 attempts, exponential backoff starting at
10s) to handle these without requiring a manual workflow re-run.
@claude
Copy link
Copy Markdown

claude bot commented Apr 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Retry Logic for npm Publish

Todo List

  • Read and analyze the changed workflow file
  • Conduct root cause analysis
  • Check for configuration changes and test integrity
  • Assess code quality
  • Evaluate backlog compliance
  • Provide final assessment with rating and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The 3.9.0 release failed with a transient E404 Not Found error on PUT @optave/codegraph-darwin-arm64. This is a documented npm registry issue where OIDC provenance signs successfully but the registry rejects the PUT operation due to temporary issues.

Why This Approach: Adding retry logic with exponential backoff (3 attempts, starting at 10s) directly addresses transient registry failures. The approach preserves existing idempotency guards (npm view checks) to prevent double-publishing.

Risk Assessment: LOW - This is purely defensive code that doesn't change core functionality, only adds resilience to known transient failures.

Phase 2: Configuration Changes ✅

No high-risk configuration changes detected

  • Changes are purely additive (retry wrapper function)
  • No rules disabled or test thresholds lowered
  • Workflow structure preserved with existing safety checks

Phase 3: Test Integrity ✅

No test integrity issues

  • No test files modified
  • No assertions weakened
  • Existing idempotency guards maintained

Phase 4: Code Quality Analysis

Strengths:

  • Clean implementation: Lines 444-459 and 527-542 implement identical retry logic
  • Proper error handling: Exponential backoff (10s → 20s → 40s) with clear logging
  • Preserves safety: Existing npm view checks prevent double-publishing
  • Good separation: Only wraps the actual publish commands, not setup logic

Technical Issues:

  1. Code duplication (lines 444-459 vs 527-542): Identical retry functions defined twice
  2. Shell function scope: Functions are defined inside bash heredoc which could be fragile
  3. Missing error context: Retry logic doesn't capture/log the specific error type that triggered retry

Minor Concerns:

  • No explicit check that failures are actually transient (E404, E500, ETIMEDOUT) vs permanent errors
  • Could potentially retry on authentication failures or network issues unrelated to registry

Phase 5: Backlog Compliance Assessment

Zero-dep: ✅ - No new runtime dependencies added, uses existing bash/npm tooling

Foundation-aligned: ✅ - Fully compliant with all principles:

  • P1 (always-current): Doesn't affect core pipeline
  • P4 (zero-cost core): Pure CI/build reliability improvement
  • P8 (not an agent): Defensive infrastructure code only
  • P5 (embeddable-first): No API contract changes

Problem-fit: 2/5 - This is build infrastructure reliability, not core "AI navigation" functionality. While important for releases, it doesn't directly address the core problem of AI coding assistants wasting tokens or missing blast radius.

Breaking: No - Purely additive, no behavior changes for successful publishes

Tier: Tier 1 (zero-dep ✓ + foundation-aligned ✓)

Root Cause Analysis

Problem Being Solved: Transient npm registry failures during stable releases (specifically E404 PUT failures after successful OIDC signing)
Why This Approach: Retry with exponential backoff is the standard solution for transient service failures
Risk Assessment: Low risk - purely defensive with existing idempotency safeguards

Backlog Compliance

  • Zero-dep: ✅ - uses existing bash/npm tooling
  • Foundation-aligned: ✅ - pure infrastructure reliability improvement
  • Problem-fit: 2/5 - build reliability, not core AI navigation problem
  • Breaking: No - purely additive
  • Tier: 1

Critical Concerns

  1. Code duplication: Identical npm_publish_retry function defined twice (.github/workflows/publish.yml:444-459, 527-542)
  2. Missing error discrimination: Should ideally only retry on specific transient errors (E404, E500, ETIMEDOUT) rather than any failure
  3. Function scope: Bash functions defined in heredoc could be fragile across shell environments

Recommendations

  1. Extract retry function: Move npm_publish_retry to a shared script to eliminate duplication
  2. Add error filtering: Parse npm error output to only retry on known transient errors
  3. Enhanced logging: Log the specific error that triggered each retry attempt

Final Recommendation

  • Rating: ⭐⭐⭐☆☆ (3/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: Solves a real problem with appropriate technique, but has code quality issues (duplication) that should be addressed. The retry logic is sound and the existing safety mechanisms prevent double-publishing. While not high-impact for core functionality, it's important for release reliability.

Suggested follow-up: Consider refactoring to eliminate duplication and add error-specific retry logic in a future PR.


Job Run:
Branch: fix/publish-retry

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 4, 2026

Greptile Summary

This PR fixes transient npm registry publish failures (the specific trigger was an E404 during the 3.9.0 release of @optave/codegraph-darwin-arm64) by introducing a retry wrapper with exponential backoff for all npm publish calls in the stable publish job.

Two previous review concerns have been fully addressed:

  • Between-attempt idempotencynpm_publish_retry now calls npm view \"$pkg_spec\" version after each failed attempt (not just as a one-shot pre-check), so a slow-success/ETIMEDOUT scenario where the registry accepted the upload but the client never got the 200 OK will be detected and treated as success rather than triggering a self-defeating E403 retry.
  • Duplicated function definition — the retry function has been extracted to scripts/npm-publish-retry.sh and both publish steps source it; the retry policy is now a single reviewable artifact.

Key changes:

  • scripts/npm-publish-retry.sh: New script defining npm_publish_retry(pkg_spec, ...cmd) — 3 attempts, 10 s / 20 s backoff, with idempotency check between each attempt.
  • .github/workflows/publish.yml: "Publish platform packages" and "Publish main package" steps now source scripts/npm-publish-retry.sh and call npm_publish_retry with the appropriate package spec and publish command.

Confidence Score: 5/5

Safe to merge — both previous P1 concerns (missing between-attempt idempotency and duplicated retry function) are fully resolved.

All prior review findings are addressed. The retry logic correctly handles the motivating ETIMEDOUT/slow-success scenario: idempotency is checked after every failed attempt (including the last), preventing E403 self-defeat. No remaining P0 or P1 issues were found.

No files require special attention.

Important Files Changed

Filename Overview
scripts/npm-publish-retry.sh New retry helper that correctly handles slow-success/ETIMEDOUT via per-attempt idempotency checks; logic and exit-code handling are sound.
.github/workflows/publish.yml Both stable publish steps now source the shared retry script and pass the correct pkg_spec; previous duplication and missing between-attempt idempotency are resolved.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[npm_publish_retry called
pkg_spec + publish cmd] --> B[attempt N]
    B --> C{npm publish
succeeds?}
    C -->|yes| D[return 0 ✓]
    C -->|no| E{npm view pkg_spec
version succeeds?}
    E -->|yes — slow-success
package already landed| F[notice: treating as success
return 0 ✓]
    E -->|no| G{attempt < max_attempts?}
    G -->|yes| H[warning: retrying
sleep delay
delay *= 2]
    H --> B
    G -->|no — all attempts exhausted| I[error: failed after N attempts
return 1 ✗]
Loading

Reviews (2): Last reviewed commit: "fix(ci): add between-attempt idempotency..." | Re-trigger Greptile

Comment on lines +444 to +459
npm_publish_retry() {
local max_attempts=3
local delay=10
for attempt in $(seq 1 $max_attempts); do
if "$@"; then
return 0
fi
if [ "$attempt" -lt "$max_attempts" ]; then
echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..."
sleep "$delay"
delay=$((delay * 2))
fi
done
echo "::error::npm publish failed after $max_attempts attempts"
return 1
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Retry loop has no between-attempt idempotency check

The PR description states that "existing idempotency guards (npm view skip check) prevent double-publishing if a slow success is followed by a retry," but those guards are one-shot pre-checks executed before the retry loop begins — not between retry attempts.

Consider the exact failure mode that motivated this PR: npm publish uploads the tarball successfully server-side, but the connection drops before the 200 OK response arrives (ETIMEDOUT). The CLI exits non-zero, so the retry fires. Attempt 2 calls npm publish for a version that already exists, which fails with E403 You cannot publish over the previously published versions — a non-transient, permanent error. Attempts 2 and 3 both fail for the same reason, and the step exits 1, reproducing the outage.

The same issue applies to the platform-packages loop (lines 511–518): the npm view skip check runs once before calling npm_publish_retry, not between retries.

A robust fix is to npm view the package before each retry (not before attempt 1, since the pre-check covers that):

npm_publish_retry() {
    local pkg_spec="$1"   # e.g. "@optave/codegraph@3.9.0"
    shift                  # remaining args are the full publish command
    local max_attempts=3
    local delay=10
    for attempt in $(seq 1 $max_attempts); do
      if "$@"; then
        return 0
      fi
      # Before retrying, check if the publish already landed (slow-success / ETIMEDOUT)
      if npm view "$pkg_spec" version 2>/dev/null; then
        echo "::notice::$pkg_spec is now visible on the registry — treating as success"
        return 0
      fi
      if [ "$attempt" -lt "$max_attempts" ]; then
        echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..."
        sleep "$delay"
        delay=$((delay * 2))
      fi
    done
    echo "::error::npm publish failed after $max_attempts attempts"
    return 1
  }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The retry function now accepts a pkg_spec argument (e.g. @optave/codegraph@3.9.0) and runs npm view "$pkg_spec" version between retry attempts. If the package is already visible on the registry (slow-success / ETIMEDOUT scenario), it returns 0 instead of retrying into an E403. See scripts/npm-publish-retry.sh and both call sites in publish.yml.

Comment on lines +527 to +543
npm_publish_retry() {
local max_attempts=3
local delay=10
for attempt in $(seq 1 $max_attempts); do
if "$@"; then
return 0
fi
if [ "$attempt" -lt "$max_attempts" ]; then
echo "::warning::npm publish attempt $attempt/$max_attempts failed — retrying in ${delay}s..."
sleep "$delay"
delay=$((delay * 2))
fi
done
echo "::error::npm publish failed after $max_attempts attempts"
return 1
}
npm_publish_retry npm publish --access public --provenance --tag "$NPM_TAG"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicate function definition across two steps

npm_publish_retry is defined identically in both the "Publish platform packages" step (lines 444–459) and the "Publish main package" step (lines 527–542). Each GitHub Actions run: block spawns a fresh shell, so the function can't be shared — but the duplication means any future change to the retry policy (e.g. changing max_attempts or delay) must be applied in both places.

Consider extracting it to a checked-in shell script, e.g. scripts/npm-publish-retry.sh, and sourcing it in each step:

. scripts/npm-publish-retry.sh
npm_publish_retry "${pkg_name}@${VERSION}" npm publish "${PKG_DIR}/$platform" --access public --provenance --tag "$NPM_TAG"

This eliminates the duplication and makes the retry policy a single, reviewable, testable artifact.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Extracted npm_publish_retry to scripts/npm-publish-retry.sh and both publish steps now source it. The retry policy is a single reviewable artifact — no more duplication.

…logic

Extract npm_publish_retry to scripts/npm-publish-retry.sh to eliminate
duplication between the two publish steps. Add npm view check between
retry attempts to handle the slow-success scenario where the server
accepts the upload but the client times out — without this, retries
would fail with E403 on an already-published version.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Codegraph Impact Analysis

1 functions changed0 callers affected across 0 files

  • npm_publish_retry in scripts/npm-publish-retry.sh:13 (0 transitive callers)

@carlos-alm carlos-alm merged commit 11871c6 into main Apr 4, 2026
12 checks passed
@carlos-alm carlos-alm deleted the fix/publish-retry branch April 4, 2026 22:03
@github-actions github-actions bot locked and limited conversation to collaborators Apr 4, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant