SPIKE Bootstrap should fail fast for bad keeper configuration #254

@v0lkan

SPIKE Bootstrap currently blocks indefinitely when a keeper is unreachable, because it calls retry.Forever for each keeper in sequence. This makes for a poor operator experience during day-zero setup. Bootstrap should instead fail fast with a clear error message after a configurable timeout, so the operator can fix the issue and rerun.

Current Behavior

// app/bootstrap/internal/net/broadcast.go:48-72
for keeperID := range env.KeepersVal() {
    keeperShare := state.KeeperShare(rs, keeperID)

    _, err := retry.Forever(ctx, func() (bool, *sdkErrors.SDKError) {
        err := api.Contribute(keeperShare, keeperID)
        if err != nil {
            // Retries forever on this single keeper
            // Never attempts other keepers
            return false, failErr
        }
        return true, nil
    })
}

If keeper 1 is unreachable:

  • Bootstrap blocks forever on keeper 1
  • Keepers 2 and 3 are never attempted
  • Operator has no clear indication which keeper is the problem
  • Operator must manually kill the process

Expected Behavior

Bootstrap should fail fast after a configurable timeout:

$ spike-bootstrap
Sending shard to keeper 1 at https://keeper1:8443... (attempt 1/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 2/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 3/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 4/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 5/5)
FATAL: Failed to reach keeper 1 at https://keeper1:8443 after 5 attempts (30s timeout).
       Ensure all keepers are running and rerun bootstrap.

Rationale

  • Bootstrap is a day-zero operation: The operator is actively watching and can fix issues
  • Bootstrap is idempotent: Safe to rerun after fixing the problem
  • Fail-fast philosophy: Consistent with how bootstrap handles other failures (SPIFFE source unavailable, verification failure)
  • Clear diagnostics: Operator knows exactly which keeper failed and why
  • Blocking forever wastes time: Obscures the problem rather than surfacing it

This differs from Nexus's sendShardsToKeepers, which correctly uses continue on errors because Nexus is a long-running service where transient failures are expected and periodic retries will eventually succeed.

Proposed Environment Variables

Add to docs-src/content/usage/configuration.md:

| Component | Environment Variable | Description | Default Value |
|-----------|----------------------|-------------|---------------|
| SPIKE Bootstrap | SPIKE_BOOTSTRAP_KEEPER_TIMEOUT | Total timeout for reaching each keeper during bootstrap | "30s" |
| SPIKE Bootstrap | SPIKE_BOOTSTRAP_KEEPER_MAX_RETRIES | Maximum retry attempts per keeper before failing | 5 |

Suggested Implementation

func BroadcastKeepers(ctx context.Context, api *spike.API) {
    const fName = "BroadcastKeepers"

    validation.CheckContext(ctx, fName)
    rs := state.RootShares()

    timeout := env.BootstrapKeeperTimeoutVal()   // default: 30s
    maxRetries := env.BootstrapKeeperMaxRetriesVal() // default: 5

    for keeperID, keeperURL := range env.KeepersVal() {
        keeperShare := state.KeeperShare(rs, keeperID)

        keeperCtx, cancel := context.WithTimeout(ctx, timeout)

        _, err := retry.WithMaxAttempts(keeperCtx, maxRetries,
            func() (bool, *sdkErrors.SDKError) {
                log.Info(fName,
                    "message", "sending shard to keeper",
                    "keeper_id", keeperID,
                    "keeper_url", keeperURL,
                )

                err := api.Contribute(keeperShare, keeperID)
                if err != nil {
                    warnErr := sdkErrors.ErrAPIPostFailed.Wrap(err)
                    warnErr.Msg = "failed to send shard: will retry"
                    log.WarnErr(fName, *warnErr)
                    return false, warnErr
                }
                return true, nil
            })

        // Release the per-keeper context right away. A defer inside the
        // loop would keep every timer alive until the function returns.
        cancel()

        if err != nil {
            failErr := sdkErrors.ErrBootstrapKeeperUnreachable.Clone()
            failErr.Msg = fmt.Sprintf(
                "failed to reach keeper %s at %s after %d attempts; "+
                "ensure all keepers are running and rerun bootstrap",
                keeperID, keeperURL, maxRetries,
            )
            log.FatalErr(fName, *failErr)
            return
        }
    }
}

Files to Modify

  • app/bootstrap/internal/net/broadcast.go - Replace retry.Forever with bounded retry
  • spike-sdk-go/config/env/bootstrap.go - Add new environment variable readers (create the file if it does not exist)
  • docs-src/content/usage/configuration.md - Document new environment variables

Notes

  • May need to add retry.WithMaxAttempts or similar to the SDK if it doesn't exist
  • Consider adding a new SDK error sentinel ErrBootstrapKeeperUnreachable
