Description
SPIKE Bootstrap currently blocks indefinitely when a keeper is unreachable, because it calls retry.Forever for each keeper in sequence. This makes for a poor operator experience during day-zero setup. Bootstrap should instead fail fast with a clear error message after a configurable timeout, so the operator can fix the issue and rerun.
Current Behavior
    // app/bootstrap/internal/net/broadcast.go:48-72
    for keeperID := range env.KeepersVal() {
        keeperShare := state.KeeperShare(rs, keeperID)
        _, err := retry.Forever(ctx, func() (bool, *sdkErrors.SDKError) {
            err := api.Contribute(keeperShare, keeperID)
            if err != nil {
                // Retries forever on this single keeper
                // Never attempts other keepers
                return false, failErr
            }
            return true, nil
        })
    }
If keeper 1 is unreachable:
- Bootstrap blocks forever on keeper 1
- Keepers 2 and 3 are never attempted
- Operator has no clear indication which keeper is the problem
- Operator must manually kill the process
Expected Behavior
Bootstrap should fail fast after a configurable timeout:
    $ spike-bootstrap
    Sending shard to keeper 1 at https://keeper1:8443... (attempt 1/5)
    Sending shard to keeper 1 at https://keeper1:8443... (attempt 2/5)
    Sending shard to keeper 1 at https://keeper1:8443... (attempt 3/5)
    Sending shard to keeper 1 at https://keeper1:8443... (attempt 4/5)
    Sending shard to keeper 1 at https://keeper1:8443... (attempt 5/5)
    FATAL: Failed to reach keeper 1 at https://keeper1:8443 after 5 attempts (30s timeout).
    Ensure all keepers are running and rerun bootstrap.
Rationale
- Bootstrap is a day-zero operation: The operator is actively watching and can fix issues
- Bootstrap is idempotent: Safe to rerun after fixing the problem
- Fail-fast philosophy: Consistent with how bootstrap handles other failures (SPIFFE source unavailable, verification failure)
- Clear diagnostics: Operator knows exactly which keeper failed and why
- Blocking forever wastes time: Obscures the problem rather than surfacing it
This differs from Nexus's sendShardsToKeepers, which correctly uses continue on errors because Nexus is a long-running service where transient failures are expected and periodic retries will eventually succeed.
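The difference between the two policies can be sketched in a few lines of Go. This is only an illustration: failFast, bestEffort, and demoSend are hypothetical names, the keeper IDs are made up, and a slice is used instead of the map returned by env.KeepersVal() so that the iteration order is deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// sendFunc is a stand-in for delivering a shard to one keeper.
type sendFunc func(keeperID string) error

// failFast stops at the first keeper that cannot be reached (the policy
// proposed for bootstrap). It returns the keepers reached so far and the error.
func failFast(ids []string, send sendFunc) ([]string, error) {
	var sent []string
	for _, id := range ids {
		if err := send(id); err != nil {
			return sent, fmt.Errorf("keeper %s: %w", id, err)
		}
		sent = append(sent, id)
	}
	return sent, nil
}

// bestEffort skips failing keepers and moves on (the Nexus-style policy).
// A later periodic run is expected to retry the skipped keepers.
func bestEffort(ids []string, send sendFunc) (sent, skipped []string) {
	for _, id := range ids {
		if err := send(id); err != nil {
			skipped = append(skipped, id)
			continue
		}
		sent = append(sent, id)
	}
	return sent, skipped
}

// demoSend pretends that keeper "2" is unreachable.
func demoSend(keeperID string) error {
	if keeperID == "2" {
		return errors.New("connection refused")
	}
	return nil
}

func main() {
	ids := []string{"1", "2", "3"}

	sent, err := failFast(ids, demoSend)
	fmt.Println("fail-fast:", sent, err)

	ok, skipped := bestEffort(ids, demoSend)
	fmt.Println("best-effort:", ok, skipped)
}
```

With fail-fast, the caller learns immediately which keeper failed; with best-effort, the failure is only recorded, which is acceptable when something will retry later but not for a one-shot command.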
Proposed Environment Variables
Add to docs-src/content/usage/configuration.md:
| Component | Environment Variable | Description | Default Value |
|---|---|---|---|
| SPIKE Bootstrap | SPIKE_BOOTSTRAP_KEEPER_TIMEOUT | Total timeout for reaching each keeper during bootstrap | "30s" |
| SPIKE Bootstrap | SPIKE_BOOTSTRAP_KEEPER_MAX_RETRIES | Maximum retry attempts per keeper before failing | 5 |
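A minimal sketch of how the two readers might look, using only the standard library. The reader names match the ones used in the suggested implementation; the parse helpers and the fallback rule (unset, unparsable, or non-positive values yield the default) are assumptions, and the real SDK may prefer to reject invalid values instead.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

const (
	envKeeperTimeout    = "SPIKE_BOOTSTRAP_KEEPER_TIMEOUT"
	envKeeperMaxRetries = "SPIKE_BOOTSTRAP_KEEPER_MAX_RETRIES"

	defaultKeeperTimeout    = 30 * time.Second
	defaultKeeperMaxRetries = 5
)

// parseKeeperTimeout (hypothetical helper) converts a raw value such as
// "45s" into a duration. Assumption: anything empty, unparsable, or
// non-positive falls back to the default.
func parseKeeperTimeout(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultKeeperTimeout
	}
	return d
}

// parseKeeperMaxRetries (hypothetical helper) converts a raw value such as
// "8" into an attempt count, with the same fallback rule.
func parseKeeperMaxRetries(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultKeeperMaxRetries
	}
	return n
}

// BootstrapKeeperTimeoutVal reads the total per-keeper timeout.
func BootstrapKeeperTimeoutVal() time.Duration {
	return parseKeeperTimeout(os.Getenv(envKeeperTimeout))
}

// BootstrapKeeperMaxRetriesVal reads the maximum attempts per keeper.
func BootstrapKeeperMaxRetriesVal() int {
	return parseKeeperMaxRetries(os.Getenv(envKeeperMaxRetries))
}

func main() {
	fmt.Println("keeper timeout:", BootstrapKeeperTimeoutVal())
	fmt.Println("keeper max retries:", BootstrapKeeperMaxRetriesVal())
}
```

Keeping the parsing separate from the os.Getenv call makes the defaults easy to unit test without touching the process environment.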
Suggested Implementation
    func BroadcastKeepers(ctx context.Context, api *spike.API) {
        const fName = "BroadcastKeepers"
        validation.CheckContext(ctx, fName)

        rs := state.RootShares()
        timeout := env.BootstrapKeeperTimeoutVal()       // default: 30s
        maxRetries := env.BootstrapKeeperMaxRetriesVal() // default: 5

        for keeperID, keeperURL := range env.KeepersVal() {
            keeperShare := state.KeeperShare(rs, keeperID)

            // Each keeper gets its own deadline, shared by all of its attempts.
            keeperCtx, cancel := context.WithTimeout(ctx, timeout)

            // retry.WithMaxAttempts is a proposed helper; see Notes.
            _, retryErr := retry.WithMaxAttempts(keeperCtx, maxRetries,
                func() (bool, *sdkErrors.SDKError) {
                    log.Info(fName,
                        "message", "sending shard to keeper",
                        "keeper_id", keeperID,
                        "keeper_url", keeperURL,
                    )
                    err := api.Contribute(keeperShare, keeperID)
                    if err != nil {
                        warnErr := sdkErrors.ErrAPIPostFailed.Wrap(err)
                        warnErr.Msg = "failed to send shard: will retry"
                        log.WarnErr(fName, *warnErr)
                        return false, warnErr
                    }
                    return true, nil
                })

            // Release the timer now. A defer inside the loop would not run
            // until BroadcastKeepers returns, so contexts would pile up.
            cancel()

            if retryErr != nil {
                failErr := sdkErrors.ErrBootstrapKeeperUnreachable.Clone()
                failErr.Msg = fmt.Sprintf(
                    "failed to reach keeper %s at %s after %d attempts (%s timeout); "+
                        "ensure all keepers are running and rerun bootstrap",
                    keeperID, keeperURL, maxRetries, timeout,
                )
                log.FatalErr(fName, *failErr)
                return
            }
        }
    }
Files to Modify
- app/bootstrap/internal/net/broadcast.go - Replace retry.Forever with bounded retry
- spike-sdk-go/config/env/bootstrap.go - Add new environment variable readers (create the file if it does not exist)
- docs-src/content/usage/configuration.md - Document new environment variables
Notes
- May need to add retry.WithMaxAttempts or similar to the SDK if it doesn't exist
- Consider adding a new SDK error sentinel ErrBootstrapKeeperUnreachable
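If the SDK has no bounded retry helper yet, one possible shape is sketched below. This is not the SDK's API: the signature, the fixed backoff parameter, the attempt number passed to the callback, and the use of plain error values instead of *sdkErrors.SDKError are all assumptions made so the example is self-contained. ErrKeeperUnreachable is a local stand-in for the proposed ErrBootstrapKeeperUnreachable sentinel, and simulateKeeper is a hypothetical test double.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrKeeperUnreachable is a local stand-in for the proposed
// ErrBootstrapKeeperUnreachable sentinel.
var ErrKeeperUnreachable = errors.New("bootstrap: keeper unreachable")

// WithMaxAttempts calls op until it succeeds, maxAttempts is exhausted, or
// ctx is done, whichever comes first. It returns the number of attempts made
// and, on failure, an error that wraps ErrKeeperUnreachable and the last cause.
func WithMaxAttempts(
	ctx context.Context, maxAttempts int, backoff time.Duration,
	op func(attempt int) error,
) (attempts int, err error) {
	var lastErr error
	for attempts < maxAttempts {
		// Stop early if the per-keeper deadline has already passed.
		if ctxErr := ctx.Err(); ctxErr != nil {
			return attempts, errors.Join(ErrKeeperUnreachable, ctxErr, lastErr)
		}
		attempts++
		if lastErr = op(attempts); lastErr == nil {
			return attempts, nil
		}
		// Wait before the next attempt, but wake up if ctx is cancelled.
		if attempts < maxAttempts {
			select {
			case <-ctx.Done():
			case <-time.After(backoff):
			}
		}
	}
	return attempts, errors.Join(ErrKeeperUnreachable, lastErr)
}

// simulateKeeper drives WithMaxAttempts against a fake keeper that rejects
// the first `failures` attempts and accepts any attempt after that.
func simulateKeeper(maxAttempts, failures int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	return WithMaxAttempts(ctx, maxAttempts, time.Millisecond,
		func(attempt int) error {
			fmt.Printf("Sending shard to keeper 1... (attempt %d/%d)\n",
				attempt, maxAttempts)
			if attempt <= failures {
				return errors.New("connection refused")
			}
			return nil
		})
}

func main() {
	if attempts, err := simulateKeeper(5, 10); errors.Is(err, ErrKeeperUnreachable) {
		fmt.Printf("FATAL: failed to reach keeper 1 after %d attempts: %v\n",
			attempts, err)
	}
}
```

Wrapping the sentinel with errors.Join lets the caller test for it with errors.Is while still seeing the underlying cause, such as a refused connection or an expired deadline.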