SPIKE Bootstrap should fail fast for bad keeper configuration #254

SPIKE Bootstrap currently blocks indefinitely when a keeper is unreachable, because it calls retry.Forever for each keeper in sequence. This gives operators a poor experience during day-zero setup. Bootstrap should instead fail fast with a clear error message after a configurable timeout, so the operator can fix the issue and rerun.

Current Behavior

```
// app/bootstrap/internal/net/broadcast.go:48-72
for keeperID := range env.KeepersVal() {
    keeperShare := state.KeeperShare(rs, keeperID)

    _, err := retry.Forever(ctx, func() (bool, *sdkErrors.SDKError) {
        err := api.Contribute(keeperShare, keeperID)
        if err != nil {
            // Retries forever on this single keeper
            // Never attempts other keepers
            return false, failErr
        }
        return true, nil
    })
}
```

If keeper 1 is unreachable:
- Bootstrap blocks forever on keeper 1
- Keepers 2 and 3 are never attempted
- Operator has no clear indication of which keeper is the problem
- Operator must manually kill the process

Expected Behavior

Bootstrap should fail fast after a configurable timeout:

```
$ spike-bootstrap
Sending shard to keeper 1 at https://keeper1:8443... (attempt 1/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 2/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 3/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 4/5)
Sending shard to keeper 1 at https://keeper1:8443... (attempt 5/5)
FATAL: Failed to reach keeper 1 at https://keeper1:8443 after 5 attempts (30s timeout).
       Ensure all keepers are running and rerun bootstrap.
```

Rationale

- Bootstrap is a day-zero operation: The operator is actively watching and can fix issues
- Bootstrap is idempotent: Safe to rerun after fixing the problem
- Fail-fast philosophy: Consistent with how bootstrap handles other failures (SPIFFE source unavailable, verification failure)
- Clear diagnostics: Operator knows exactly which keeper failed and why
- Blocking forever wastes time: Obscures the problem rather than surfacing it

This differs from Nexus's sendShardsToKeepers, which correctly uses continue on errors because Nexus is a long-running service where transient failures are expected and periodic retries will eventually succeed.

Proposed Environment Variables

Add to docs-src/content/usage/configuration.md:


| Component       | Environment Variable               | Description                                             | Default Value |
|-----------------|------------------------------------|---------------------------------------------------------|---------------|
| SPIKE Bootstrap | SPIKE_BOOTSTRAP_KEEPER_TIMEOUT     | Total timeout for reaching each keeper during bootstrap | "30s"         |
| SPIKE Bootstrap | SPIKE_BOOTSTRAP_KEEPER_MAX_RETRIES | Maximum retry attempts per keeper before failing        | 5             |
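A minimal sketch of how the two readers could parse these variables, assuming the reader names BootstrapKeeperTimeoutVal and BootstrapKeeperMaxRetriesVal that the suggested implementation in this issue calls. The lookupFunc type, the keeperTimeout and keeperMaxRetries helpers, and the fallback-to-default policy are illustrative assumptions, not the SDK's existing conventions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

const (
	envKeeperTimeout    = "SPIKE_BOOTSTRAP_KEEPER_TIMEOUT"
	envKeeperMaxRetries = "SPIKE_BOOTSTRAP_KEEPER_MAX_RETRIES"

	defaultKeeperTimeout    = 30 * time.Second
	defaultKeeperMaxRetries = 5
)

// lookupFunc has the same shape as os.LookupEnv so the parsing logic can be
// exercised without touching the process environment (hypothetical helper).
type lookupFunc func(key string) (string, bool)

// keeperTimeout returns the per-keeper timeout. Unset, malformed, or
// non-positive values fall back to the default (an assumed policy).
func keeperTimeout(lookup lookupFunc) time.Duration {
	raw, ok := lookup(envKeeperTimeout)
	if !ok {
		return defaultKeeperTimeout
	}
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultKeeperTimeout
	}
	return d
}

// keeperMaxRetries returns the per-keeper attempt limit. Unset, malformed,
// or values below 1 fall back to the default (an assumed policy).
func keeperMaxRetries(lookup lookupFunc) int {
	raw, ok := lookup(envKeeperMaxRetries)
	if !ok {
		return defaultKeeperMaxRetries
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return defaultKeeperMaxRetries
	}
	return n
}

// BootstrapKeeperTimeoutVal reads SPIKE_BOOTSTRAP_KEEPER_TIMEOUT.
func BootstrapKeeperTimeoutVal() time.Duration { return keeperTimeout(os.LookupEnv) }

// BootstrapKeeperMaxRetriesVal reads SPIKE_BOOTSTRAP_KEEPER_MAX_RETRIES.
func BootstrapKeeperMaxRetriesVal() int { return keeperMaxRetries(os.LookupEnv) }

func main() {
	fmt.Println("timeout:", BootstrapKeeperTimeoutVal())
	fmt.Println("max retries:", BootstrapKeeperMaxRetriesVal())
}
```

Falling back to the default on a malformed value keeps the sketch short. Given the fail-fast goal of this issue, rejecting a malformed value with a fatal error at startup may be the better choice.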

Suggested Implementation

```
func BroadcastKeepers(ctx context.Context, api *spike.API) {
    const fName = "BroadcastKeepers"

    validation.CheckContext(ctx, fName)
    rs := state.RootShares()

    timeout := env.BootstrapKeeperTimeoutVal()   // default: 30s
    maxRetries := env.BootstrapKeeperMaxRetriesVal() // default: 5

    for keeperID, keeperURL := range env.KeepersVal() {
        keeperShare := state.KeeperShare(rs, keeperID)

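        // Bound the total time spent on this keeper. cancel is called
        // explicitly after the retry loop: a defer inside the for loop
        // would not run until BroadcastKeepers returns, so every
        // per-keeper timer would stay alive until then.
        keeperCtx, cancel := context.WithTimeout(ctx, timeout)

        _, err := retry.WithMaxAttempts(keeperCtx, maxRetries,
            func() (bool, *sdkErrors.SDKError) {
                log.Info(fName,
                    "message", "sending shard to keeper",
                    "keeper_id", keeperID,
                    "keeper_url", keeperURL,
                )

                err := api.Contribute(keeperShare, keeperID)
                if err != nil {
                    warnErr := sdkErrors.ErrAPIPostFailed.Wrap(err)
                    warnErr.Msg = "failed to send shard: will retry"
                    log.WarnErr(fName, *warnErr)
                    return false, warnErr
                }
                return true, nil
            })
        cancel() // release this keeper's timer before moving on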

        if err != nil {
            failErr := sdkErrors.ErrBootstrapKeeperUnreachable.Clone()
            failErr.Msg = fmt.Sprintf(
                "failed to reach keeper %s at %s after %d attempts; "+
                "ensure all keepers are running and rerun bootstrap",
                keeperID, keeperURL, maxRetries,
            )
            log.FatalErr(fName, *failErr)
            return
        }
    }
}
```

Files to Modify

- app/bootstrap/internal/net/broadcast.go - Replace retry.Forever with bounded retry
- spike-sdk-go/config/env/bootstrap.go - Add new environment variable readers (create the file if it does not exist)
- docs-src/content/usage/configuration.md - Document new environment variables

Notes

- May need to add retry.WithMaxAttempts or similar to the SDK if it doesn't exist
- Consider adding a new SDK error sentinel ErrBootstrapKeeperUnreachable
