Skip to content

fix: add npm auth debug step and fix publish retry logic#835

Merged
carlos-alm merged 3 commits intomainfrom
fix/publish-retry-and-debug
Apr 4, 2026
Merged

fix: add npm auth debug step and fix publish retry logic#835
carlos-alm merged 3 commits intomainfrom
fix/publish-retry-and-debug

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Debug step: Adds npm whoami before publish to surface OIDC auth failures early instead of getting a cryptic E404
  • Fix retry logic: GitHub Actions sets set -eo pipefail by default for bash steps, which kills the script on the first npm publish failure before the retry wrapper can fire. Now disables errexit and captures exit codes explicitly
  • Grouped output: Each retry attempt is wrapped in ::group:: for cleaner CI logs

Context

The 3.9.0 publish failed with E404 on platform packages. The retry logic (3 attempts, exponential backoff) never fired — the entire step completed in 6s, confirming errexit killed the script immediately. This PR fixes the retries and adds diagnostic output to identify the root cause of the auth failure.

Test plan

  • Merge and re-trigger the 3.9.0 publish workflow
  • Verify the debug step shows npm whoami output (confirms OIDC auth works or surfaces the failure)
  • If publish still fails, verify retry warnings appear in the logs (confirms retry logic fires)

GitHub Actions sets errexit by default, which kills the bash script
on first npm publish failure before the retry wrapper can fire.
Disable errexit and capture exit codes explicitly.

Also adds a debug step that runs npm whoami before publish to surface
OIDC auth issues early.
@claude
Copy link
Copy Markdown

claude bot commented Apr 4, 2026

Claude finished @carlos-alm's task —— View job


Reviewing PR: Fix publish retry logic and add debug step

Todo List:

  • Read and analyze the workflow changes
  • Apply Phase 1: Root Cause Analysis
  • Apply Phase 2: Configuration Change Detection
  • Apply Phase 3: Test Integrity
  • Apply Phase 4: Code Quality
  • Apply Phase 5: Backlog Compliance
  • Provide final assessment and rating

Analysis of Changes:
The PR modifies .github/workflows/publish.yml with three main changes:

  1. Debug step (lines 436-444): Adds npm whoami and .npmrc inspection to surface OIDC auth issues early
  2. Retry logic fix (lines 453, 545): Disables bash errexit mode with set +e so failed npm publish calls don't immediately kill the script
  3. Grouped output (lines 459, 462, 550, 553): Wraps retry attempts in ::group:: for cleaner CI logs

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The 3.9.0 release failed with E404 errors on platform packages, and the retry logic (3 attempts with exponential backoff) never executed because GitHub Actions' default set -eo pipefail killed the script on the first failure.

Why This Approach: This addresses the actual root cause rather than symptoms. The E404 suggests an authentication issue (OIDC token problems), and the retry logic failing to execute is a legitimate bash behavior issue. The debug step will surface auth problems early, and disabling errexit allows the retry wrapper to function as intended.

Risk Assessment: ✅ LOW RISK - Changes are surgical and justified. Adding debug output and fixing broken retry logic are both defensive improvements. No functionality is being suppressed or hidden.

Phase 2: Configuration Change Detection ✅

Configuration Changes: None that weaken safety. The changes are:

  • Debug addition: Purely additive diagnostic output
  • Shell behavior fix: Changes errexit handling to allow retry logic to work as designed (this was a configuration bug, not a safety rule)
  • Log grouping: Cosmetic output improvement

Assessment: ✅ NO HIGH-RISK CHANGES - No linting rules disabled, no test thresholds lowered, no strict mode flags relaxed. The set +e change fixes broken functionality rather than silencing errors.

Phase 3: Test Integrity ✅

Test Changes: None. This PR only modifies the publish workflow.

RED FLAGS: None detected. No test files modified, no assertions weakened, no test logic changed.

Phase 4: Code Quality ✅

Code Analysis:

  • Security: No command injection risks. The debug step uses safe commands (npm whoami, cat) with no user input
  • Logic: Retry logic is sound - captures exit codes explicitly, implements exponential backoff, respects max attempts
  • Error handling: Proper - preserves exit codes and propagates final failure appropriately
  • Style: Clean bash scripting with clear variable names and appropriate quoting

Potential Issues: None significant. The implementation follows bash best practices.

Phase 5: Backlog Compliance

Zero-dep: ✓ - No new runtime dependencies added
Foundation-aligned: ✓ - Improves CI/build reliability without violating core principles
Problem-fit: 2 - This is infrastructure/tooling improvement that doesn't directly help AI agents navigate codebases, but it's necessary for reliable releases
Breaking: No - Pure workflow fix, no API or behavior changes

Tier: 1 (zero-dep ✓ + foundation-aligned ✓)


Final Assessment

Root Cause Analysis

Problem Being Solved: Failed 3.9.0 publish due to E404 auth errors + non-functional retry logic caused by bash errexit behavior
Why This Approach: Addresses actual technical debt in CI/CD - adds debugging for auth issues and fixes broken retry mechanism
Risk Assessment: Low risk defensive improvements with no functionality suppression

Backlog Compliance

  • Zero-dep: ✓ - No new dependencies
  • Foundation-aligned: ✓ - CI/build improvement, no principle violations
  • Problem-fit: 2 - Infrastructure fix, doesn't directly help AI code navigation
  • Breaking: No - Workflow-only changes
  • Tier: 1

Critical Concerns

None. This is a straightforward technical fix addressing legitimate CI/CD issues.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Well-engineered fix for real infrastructure problems. Clean implementation with proper error handling. While not directly advancing the core mission of helping AI agents navigate code, reliable CI/CD is foundational for any software project.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 4, 2026

Greptile Summary

This PR fixes the 3.9.0 publish failure by disabling both errexit and pipefail (set +eo pipefail) so the retry wrapper can fire on transient npm errors, and adds a Debug npm auth step that runs npm whoami and greps registry lines from .npmrc to surface OIDC failures early. All three previous review concerns (dual pipefail, duplicate fix for the main-package step, and .npmrc credential exposure) have been correctly addressed in 7a47aea.

Confidence Score: 5/5

Safe to merge — all prior P1 concerns resolved, no new issues found.

All three previously flagged issues (pipefail in platform publish, pipefail in main publish, and .npmrc credential exposure) are correctly fixed. The retry logic now correctly captures exit codes with local rc=$? under set +eo pipefail, and the debug step is well-scoped. No remaining P0 or P1 findings.

No files require special attention.

Important Files Changed

Filename Overview
.github/workflows/publish.yml Adds npm auth debug step, fixes set +eo pipefail in both publish steps so retry logic fires correctly, and replaces cat ~/.npmrc with registry-only grep to avoid credential exposure.

Sequence Diagram

sequenceDiagram
    participant CI as GitHub Actions
    participant Debug as Debug npm auth
    participant Plat as Publish platform packages
    participant Main as Publish main package

    CI->>Debug: if not already_published
    Debug->>Debug: npm --version / node --version
    Debug->>Debug: npm whoami (OIDC check)
    Debug->>Debug: grep registry ~/.npmrc

    CI->>Plat: if not already_published (set +eo pipefail)
    loop Each platform artifact
        Plat->>Plat: npm view pkg@version (skip if exists)
        loop npm_publish_retry (max 3 attempts)
            Plat->>Plat: ::group:: attempt N/3
            Plat->>Plat: npm publish --provenance
            Plat->>Plat: capture rc / ::endgroup::
            alt rc == 0
                Plat-->>Plat: return 0
            else rc != 0 and attempt < 3
                Plat->>Plat: warn + sleep (15s → 30s → 60s)
            end
        end
        Plat->>Plat: FAILED=1 on retry exhaustion
    end
    Plat->>CI: exit $FAILED

    CI->>Main: if not already_published (set +eo pipefail)
    loop npm_publish_retry (max 3 attempts)
        Main->>Main: ::group:: attempt N/3
        Main->>Main: npm publish --provenance
        Main->>Main: capture rc / ::endgroup::
    end
    Main->>CI: exit 1 on retry exhaustion
Loading

Reviews (2): Last reviewed commit: "fix: address review feedback on publish ..." | Re-trigger Greptile

NPM_TAG: ${{ needs.compute-version.outputs.npm_tag }}
shell: bash
run: |
set +e # Disable errexit so retry logic works
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 pipefail still active after set +e

set +e disables errexit but leaves pipefail on (GitHub Actions enables both by default). The publish commands themselves are fine since they don't use pipes, but the platform=$(basename "$artifact_dir" | sed 's/^native-//') line uses a pipe: if basename or sed fails, pipefail would set the loop body's exit status to non-zero. With set +e that won't kill the step, but it could silently produce an empty $platform and trigger the "Unknown platform artifact" warning. Easy to prevent by also disabling pipefail.

Suggested change
set +e # Disable errexit so retry logic works
set +eo pipefail # Disable errexit and pipefail so retry logic works

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7a47aea — both publish steps now use set +eo pipefail to disable both errexit and pipefail, preventing silent empty-variable issues from the piped basename | sed command.

run: npm publish --access public --provenance --tag "$NPM_TAG"
shell: bash
run: |
set +e # Disable errexit so retry logic works
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Same pipefail caveat applies here

The "Publish main package" step should mirror the fix above for consistency. The npm_publish_retry function doesn't use pipes, but matching the header makes the intent explicit.

Suggested change
set +e # Disable errexit so retry logic works
set +eo pipefail # Disable errexit and pipefail so retry logic works

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7a47aea — the main package publish step now also uses set +eo pipefail to match the platform publish step.

Comment on lines +443 to +444
cat ~/.npmrc 2>/dev/null || echo "no ~/.npmrc"
cat .npmrc 2>/dev/null || echo "no .npmrc"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 .npmrc contents exposed in logs

cat ~/.npmrc dumps the full file to GitHub Actions logs, visible to anyone with repo read access. With OIDC the file holds only an env-var reference (not a literal token), so the current risk is low. If auth is ever switched to a static token the credential would be logged in plain text. Consider filtering to registry-URL lines rather than printing the whole file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7a47aea — replaced cat ~/.npmrc / cat .npmrc with grep 'registry' to only show registry-URL lines, avoiding potential credential exposure if auth is ever switched to static tokens.

- Disable pipefail alongside errexit in both publish steps so piped
  commands (basename | sed) cannot silently produce empty variables
- Filter .npmrc output to registry lines only instead of dumping the
  full file, avoiding potential credential exposure
- Guard mkdir/cp of .node artifacts with an explicit file check,
  setting FAILED=1 and skipping the platform if the copy fails
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed the outside-diff comment about silent cp failure (lines 500-501) in 7a47aea — added an explicit file-existence check after the mkdir -p/cp sequence. If the .node artifact copy fails, the loop now emits an ::error::, sets FAILED=1, and skips to the next platform instead of silently continuing to publish a broken package.

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 8b32503 into main Apr 4, 2026
10 of 13 checks passed
@carlos-alm carlos-alm deleted the fix/publish-retry-and-debug branch April 4, 2026 22:24
@github-actions github-actions bot locked and limited conversation to collaborators Apr 4, 2026
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Resolved merge conflicts with main. The inline retry logic from this PR was replaced by main's shared scripts/npm-publish-retry.sh (merged via #833), which provides the same functionality with an improved API (package-spec idempotency checks). The PR's debug step (npm whoami + filtered .npmrc output) is preserved. Node version upgraded to 24 per main's #850.

@greptileai

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant