A failover system for Solana validators written in Go. Consists of two programs:
- Manager - runs on manager server, monitors validators and triggers failover
- Validator Agent - runs on each validator server, responds to health checks and executes failover commands
Before setting up the failover system, ensure the following requirements are met on each validator server:
etcd must be installed and running as a service on at least 3 servers for tower file synchronization (manager and both validators).
See https://etcd.io/docs/v2.3/clustering/ for cluster setup.
Snapshots must be enabled on your Solana validator to allow faster restarts and state recovery.
All services (Solana validator, failover agent/manager, etcd) should be configured to start automatically after a system restart.
The failover agent needs to execute certain commands with sudo without password prompts:
Run `sudo visudo` and add the following line (replace `solana` with your user):
solana ALL=(ALL) NOPASSWD: /usr/bin/systemctl stop failover-agent, /usr/bin/systemctl restart solana
In secure mode the identity keypair is located only on the manager server. This requires passwordless SSH access from the manager to each agent server. See the secure identity mode configuration below.

                        MANAGER SERVER
    +-----------------------------------------------------+
    | Manager Program                                     |
    | - Pings both validators every 5 seconds             |
    | - Auto-detects active/passive from gossip           |
    | - Checks process running + slot difference          |
    | - Triggers failover if active is unhealthy          |
    +------------------------+----------------------------+
                             |
                             |
                       HTTP  |  HTTP
                             |
      +----------------------+---------------------+
      |                                            |
      v                                            v
    +--------------+                             +--------------+
    | VALIDATOR 1  |                             | VALIDATOR 2  |
    |              |                             |              |
    | Agent        |                             | Agent        |
    | agave-valid  |                             | agave-valid  |
    | (active)     |                             | (passive)    |
    +--------------+                             +--------------+

- Auto-detects active/passive validators from gossip
- Pings two validator agents regularly
- Switches to passive if active doesn't respond (after N misses)
- Checks if validator process is running
- Checks slot difference (validator behind network)
- Telegram notifications for critical events
- Dry-run mode (test without actual failover)
- Remote agent shutdown command (`--shutdown-agent`)
- Manual failover trigger (`--trigger-failover`)
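
The miss counting and slot check in the list above could be sketched roughly as follows. This is a minimal illustration, not the manager's actual code: the `Probe`, `Thresholds` and `Monitor` types are hypothetical, and it assumes every kind of unhealthy result (no response, process down, slot lag) counts toward the same consecutive-miss counter.

```go
package main

import "fmt"

// Thresholds mirrors two manager-config.json fields: "misses_before_failover"
// and "slot_diff_threshold".
type Thresholds struct {
	MissesBeforeFailover int
	SlotDiffThreshold    int64
}

// Probe is a hypothetical summary of one health check of the active agent.
type Probe struct {
	Responded      bool
	ProcessRunning bool
	ValidatorSlot  int64
	ClusterSlot    int64
}

// Monitor counts consecutive unhealthy probes.
type Monitor struct {
	th     Thresholds
	misses int
}

// unhealthyReason returns "" when the probe looks healthy.
func (m *Monitor) unhealthyReason(p Probe) string {
	switch {
	case !p.Responded:
		return "agent unreachable"
	case !p.ProcessRunning:
		return "validator process not running"
	case p.ClusterSlot-p.ValidatorSlot > m.th.SlotDiffThreshold:
		return "validator behind network"
	}
	return ""
}

// Observe records one probe and reports whether failover should be triggered.
// A healthy probe resets the counter (an assumption of this sketch).
func (m *Monitor) Observe(p Probe) (bool, string) {
	reason := m.unhealthyReason(p)
	if reason == "" {
		m.misses = 0
		return false, ""
	}
	m.misses++
	return m.misses >= m.th.MissesBeforeFailover, reason
}

func main() {
	m := &Monitor{th: Thresholds{MissesBeforeFailover: 3, SlotDiffThreshold: 100}}
	for i := 1; i <= 3; i++ {
		trigger, reason := m.Observe(Probe{Responded: false})
		fmt.Println(i, trigger, reason)
	}
}
```

Requiring several consecutive misses keeps a single dropped request from causing an identity switch.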
- Auto-detects active state from gossip on startup
- Responds to manager health checks
- Reports process status and slot information
- Backs up tower file to etcd on each manager ping
- Monitors manager heartbeat - if manager goes offline, checks external network connectivity
- Executes identity change commands on failover
- Removes tower file when becoming passive (prevents stale tower)
- Remote shutdown endpoint
- Dry-run mode (logs commands without executing)
# Build both programs
go build -o failover-manager ./cmd/manager
go build -o failover-agent ./cmd/validator

Start the passive validator agent first, then the active validator agent, and finally the manager.
Create validator-config.json on each server (replace IDENTITY, MANAGER_IP, LEDGER_PATH, SOLANA_PATH and IDENTITY_PATH):
{
"listen_addr": ":8080",
"allowed_ips": ["MANAGER_IP"],
"local_rpc": "http://127.0.0.1:8899",
"process_name": "agave-validator",
"manager_timeout": "30s",
"tower_backup_command": "etcdctl put /solana/tower/active \"$(base64 -w0 LEDGER_PATH/tower-1_9-*.bin)\"",
"tower_restore_command": "etcdctl get /solana/tower/active --print-value-only | base64 -d > LEDGER_PATH/tower-1_9-IDENTITY.bin",
"identity_change_command": "SOLANA_PATH/agave-validator -l LEDGER_PATH set-identity IDENTITY_PATH/testnet-validator-keypair.json",
"identity_remove_command": "SOLANA_PATH/agave-validator -l LEDGER_PATH set-identity IDENTITY_PATH/unstaked-identity.json",
"dry_run": false,
"tower_file_path": "LEDGER_PATH/tower-1_9-{validator_identity}.bin",
"validator_identity": "IDENTITY",
"gossip_check_command": "SOLANA_PATH/solana -ut gossip | grep {validator_identity}",
"log_file": "/home/solana/failover/agent.log",
"validator_restart_command": "sudo systemctl restart solana",
"agent_stop_command": "sudo systemctl stop failover-agent",
"active_identity_symlink_command": "ln -sf IDENTITY_PATH/testnet-validator-keypair.json IDENTITY_PATH/identity.json",
"passive_identity_symlink_command": "ln -sf IDENTITY_PATH/unstaked-identity.json IDENTITY_PATH/identity.json"
}

In secure mode, the fields `active_identity_symlink_command` and `identity_change_command` are not needed.
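
For readers who want to validate the file before starting the agent, the sketch below decodes a few of the fields above and parses `manager_timeout` as a Go duration. The `AgentConfig` struct and `parseAgentConfig` function are hypothetical names, and the agent's real config loader may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// AgentConfig maps a subset of the validator-config.json fields.
type AgentConfig struct {
	ListenAddr     string   `json:"listen_addr"`
	AllowedIPs     []string `json:"allowed_ips"`
	LocalRPC       string   `json:"local_rpc"`
	ManagerTimeout string   `json:"manager_timeout"`
	DryRun         bool     `json:"dry_run"`
}

// parseAgentConfig decodes raw JSON and returns the config together with
// manager_timeout converted to whole seconds.
func parseAgentConfig(raw string) (AgentConfig, int, error) {
	var cfg AgentConfig
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return cfg, 0, fmt.Errorf("decode config: %w", err)
	}
	d, err := time.ParseDuration(cfg.ManagerTimeout)
	if err != nil {
		return cfg, 0, fmt.Errorf("manager_timeout: %w", err)
	}
	return cfg, int(d / time.Second), nil
}

func main() {
	raw := `{"listen_addr": ":8080", "allowed_ips": ["MANAGER_IP"], "manager_timeout": "30s", "dry_run": false}`
	cfg, secs, err := parseAgentConfig(raw)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.ListenAddr, cfg.AllowedIPs, secs)
}
```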
Run as service:
./failover-agent --config validator-config.json

Create manager-config.json for the case where the identity keypair is stored on the agent servers (this example is for testnet):
{
"validator1": {
"endpoint": "http://AGENT_1_IP:8080",
"ip": "AGENT_1_IP",
"ledger_path": "/home/solana/ledger"
},
"validator2": {
"endpoint": "http://AGENT_2_IP:8080",
"ip": "AGENT_2_IP",
"ledger_path": "/home/solana/ledger"
},
"gossip_check_command": "solana -ut gossip | grep IDENTITY",
"cluster_rpc": "https://api.testnet.solana.com",
"heartbeat_interval": "10s",
"misses_before_failover": 3,
"slot_diff_threshold": 100,
"request_timeout": "8s",
"dry_run": false,
"telegram_bot_token": "BOT_TOKEN",
"telegram_chat_id": "-CHAT_ID",
"log_file": "/home/solana/failover/manager.log",
"staked_identity_pubkey": "IDENTITY",
"vote_account_pubkey": "VOTE_ACCOUNT",
"stale_vote_slot_threshold": 75
}

Run as service:
./failover-manager --config manager-config.json

In secure identity mode, the staked identity keypair is stored only on the manager server and never on the validator servers. When failover occurs, the manager sends the identity via SSH.
Add these fields to manager config:
{
"secure_identity_mode": true,
"identity_keypair_path": "IDENTITY_PATH/identity.json",
"ssh_user": "solana",
"ssh_key_path": "~/.ssh/failover_key",
"ssh_set_identity_command": "SOLANA_PATH/agave-validator --ledger {ledger} set-identity",
"ssh_authorized_voter_command": "SOLANA_PATH/agave-validator --ledger {ledger} authorized-voter add"
}

| Field | Description |
|---|---|
| `secure_identity_mode` | Enable secure mode (default: false) |
| `identity_keypair_path` | Path to staked identity keypair on manager machine |
| `ssh_user` | SSH username for validator servers |
| `ssh_key_path` | Path to SSH private key (supports ~) |
| `ssh_set_identity_command` | Command template for set-identity. Use {ledger} placeholder |
| `ssh_authorized_voter_command` | Command template for authorized-voter. Use {ledger} placeholder |
| `ledger_path` | Ledger path on each validator (in validator1/validator2 config) |
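
As an illustration of the `{ledger}` placeholder, the sketch below substitutes a validator's `ledger_path` into a command template. The function name `expandLedger` is hypothetical, and plain string replacement is an assumption about how the manager expands the template.

```go
package main

import (
	"fmt"
	"strings"
)

// expandLedger replaces every {ledger} placeholder with the given ledger path.
func expandLedger(template, ledgerPath string) string {
	return strings.ReplaceAll(template, "{ledger}", ledgerPath)
}

func main() {
	tmpl := "SOLANA_PATH/agave-validator --ledger {ledger} set-identity"
	fmt.Println(expandLedger(tmpl, "/home/solana/ledger"))
}
```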
# Generate SSH key on manager
ssh-keygen -t ed25519 -f ~/.ssh/failover_key -N ""
# Copy to validator servers
ssh-copy-id -i ~/.ssh/failover_key.pub solana@VALIDATOR1_IP
ssh-copy-id -i ~/.ssh/failover_key.pub solana@VALIDATOR2_IP

- Manager sends `become_active` to agent with `skip_identity=true`
- Agent only restores tower file (skips identity commands)
- Manager SSHs to validator with identity keypair redirected to stdin:

ssh user@host "agave-validator --ledger /path set-identity" < identity.json
ssh user@host "agave-validator --ledger /path authorized-voter add" < identity.json
In this mode, the agent's identity_change_command and active_identity_symlink_command are ignored.
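
The stdin redirection shown in the SSH commands can be reproduced from Go with `os/exec`, roughly as below. This is a sketch under assumptions: `buildSSHArgs` and `sendIdentity` are hypothetical helpers, the `BatchMode=yes` option is an addition of this example so that ssh fails instead of prompting, and the key path must already be absolute because `exec.Command` does not expand `~`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildSSHArgs returns the argument list for the ssh client.
func buildSSHArgs(keyPath, user, host, remoteCmd string) []string {
	return []string{"-i", keyPath, "-o", "BatchMode=yes", user + "@" + host, remoteCmd}
}

// sendIdentity runs remoteCmd on host and feeds the keypair file to its stdin,
// so the keypair is never written to the validator's disk by this step.
func sendIdentity(keyPath, user, host, remoteCmd, keypairPath string) error {
	f, err := os.Open(keypairPath)
	if err != nil {
		return fmt.Errorf("open keypair: %w", err)
	}
	defer f.Close()

	cmd := exec.Command("ssh", buildSSHArgs(keyPath, user, host, remoteCmd)...)
	cmd.Stdin = f // equivalent of "< identity.json" in the shell form
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("ssh %s: %w: %s", host, err, out)
	}
	return nil
}

func main() {
	// Only prints the arguments; call sendIdentity to actually connect.
	args := buildSSHArgs("/home/solana/.ssh/failover_key", "solana", "VALIDATOR1_IP",
		"agave-validator --ledger /home/solana/ledger set-identity")
	fmt.Println(args)
}
```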
- Manager detects unhealthy active (unreachable, process down, behind slots, etc.)
- Manager sends `become_passive` to old active:
  - Agent backs up tower file
  - Agent removes identity (switches to unstaked)
  - Agent deletes tower file
  - Agent marks itself as passive
- Manager sends `become_active` to new active:
  - Agent restores tower file from backup
  - Agent sets voting identity
  - Agent marks itself as active
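
The ordering of the steps above (demote the old active before promoting the new one) can be sketched as follows. The `Agent` interface and `fakeAgent` type are hypothetical, and continuing with promotion when the old active cannot be reached is an assumption of this sketch, not documented manager behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Agent is a hypothetical client for one validator agent.
type Agent interface {
	BecomePassive() error
	BecomeActive() error
}

// failover demotes the old active first, then promotes the new active.
func failover(oldActive, newActive Agent) error {
	if err := oldActive.BecomePassive(); err != nil {
		// The old active is often unreachable when failover starts,
		// so this sketch only logs the error and continues.
		fmt.Println("warning: demotion failed:", err)
	}
	if err := newActive.BecomeActive(); err != nil {
		return fmt.Errorf("promote new active: %w", err)
	}
	return nil
}

// fakeAgent records calls so the ordering can be inspected.
type fakeAgent struct {
	name  string
	calls *[]string
	fail  bool
}

func (a fakeAgent) record(cmd string) error {
	*a.calls = append(*a.calls, a.name+":"+cmd)
	if a.fail {
		return errors.New(a.name + " unreachable")
	}
	return nil
}

func (a fakeAgent) BecomePassive() error { return a.record("become_passive") }
func (a fakeAgent) BecomeActive() error  { return a.record("become_active") }

func main() {
	var calls []string
	v1 := fakeAgent{name: "validator1", calls: &calls, fail: true}
	v2 := fakeAgent{name: "validator2", calls: &calls}
	fmt.Println(failover(v1, v2), calls)
}
```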
- Active agent detects no manager heartbeat for 15s
- Active agent checks external network connectivity (tests Cloudflare, Google, and Quad9 DNS endpoints)
- If network check fails (cannot reach 2+ endpoints): become passive to avoid split-brain
- If network is available: stay active and wait for manager to come back
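
The split-brain decision above can be expressed as a small function. This sketch reads "cannot reach 2+ endpoints" as "two or more probes failed"; the resolver addresses and the TCP dial on port 53 are assumptions chosen for illustration, and the agent may probe differently.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Assumed probe targets: public resolvers run by Cloudflare, Google and Quad9.
var endpoints = []string{"1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"}

// tcpReachable reports whether a TCP connection can be opened within 2 seconds.
func tcpReachable(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// shouldBecomePassive returns true when two or more targets are unreachable,
// which suggests this server, not the manager, has lost connectivity.
func shouldBecomePassive(targets []string, reachable func(string) bool) bool {
	failed := 0
	for _, target := range targets {
		if !reachable(target) {
			failed++
		}
	}
	return failed >= 2
}

func main() {
	if shouldBecomePassive(endpoints, tcpReachable) {
		fmt.Println("network check failed: become passive")
	} else {
		fmt.Println("network available: stay active and wait for manager")
	}
}
```

Passing the probe as a function keeps the decision logic testable without network access.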
To safely shut down the failover system:

- Stop the manager first:

      # Stop the manager service
      sudo systemctl stop failover-manager

- Stop both agents simultaneously:

      # From the manager server, send shutdown command to all agents
      ./failover-manager --config manager-config.json --shutdown-agent

This ensures both validator agents are stopped at the same time, preventing either from detecting the other as unavailable and triggering unnecessary failover logic.
Note: The --shutdown-agent command requires the manager binary but does not start the manager service. It only sends shutdown commands to the configured agent endpoints.
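
One way to send the shutdown requests at (nearly) the same time is to issue them concurrently, as in the sketch below. `shutdownAll` and `httpPost` are hypothetical helpers; whether the real `/shutdown` endpoint expects a request body or extra headers is not covered here.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// shutdownAll calls post for every agent endpoint concurrently and returns
// one error slot per endpoint, in the same order as the input.
func shutdownAll(endpoints []string, post func(url string) error) []error {
	errs := make([]error, len(endpoints))
	var wg sync.WaitGroup
	for i, ep := range endpoints {
		wg.Add(1)
		go func(i int, ep string) {
			defer wg.Done()
			errs[i] = post(ep + "/shutdown")
		}(i, ep)
	}
	wg.Wait()
	return errs
}

// httpPost sends an empty POST request and treats any non-200 status as an error.
func httpPost(url string) error {
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s", url, resp.Status)
	}
	return nil
}

func main() {
	endpoints := []string{"http://AGENT_1_IP:8080", "http://AGENT_2_IP:8080"}
	// A stub sender keeps the example runnable offline; pass httpPost instead
	// to contact real agents.
	errs := shutdownAll(endpoints, func(url string) error { return nil })
	fmt.Println(errs)
}
```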
To trigger a failover manually, run the following on the manager server:
./failover-manager --trigger-failover --reason "manual failover triggered via CLI"

| Endpoint | Method | Description |
|---|---|---|
| `/status` | POST | Returns validator status (used by manager) |
| `/failover` | POST | Execute failover command |
| `/shutdown` | POST | Shutdown the agent |
| `/identity` | GET | Returns current validator identity pubkey |
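
A minimal sketch of how these routes and methods could be wired with Go's standard `net/http` package follows. The handler bodies and JSON field names are placeholders invented for the example; only the paths and methods come from the table.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// methodOnly rejects requests that do not use the expected HTTP method.
func methodOnly(method string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != method {
			w.Header().Set("Allow", method)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		h(w, r)
	}
}

// reply returns a handler that writes a fixed JSON object (placeholder payload).
func reply(v map[string]interface{}) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(v)
	}
}

// newAgentMux registers the four endpoints from the table.
func newAgentMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", methodOnly(http.MethodPost, reply(map[string]interface{}{"process_running": true})))
	mux.HandleFunc("/failover", methodOnly(http.MethodPost, reply(map[string]interface{}{"accepted": true})))
	mux.HandleFunc("/shutdown", methodOnly(http.MethodPost, reply(map[string]interface{}{"stopping": true})))
	mux.HandleFunc("/identity", methodOnly(http.MethodGet, reply(map[string]interface{}{"identity": "IDENTITY"})))
	return mux
}

// probe sends one in-memory request to the mux and returns the status code.
func probe(method, path string) int {
	rec := httptest.NewRecorder()
	newAgentMux().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(probe("POST", "/status"), probe("GET", "/status"), probe("GET", "/identity"))
}
```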
The manager can send notifications to Telegram for critical events:
- Failover complete - when failover succeeds (with reason and validator info)
- Server unreachable - when a validator becomes unreachable (sent only once)
- Server back online - when a validator becomes reachable again
- Server status - sends status every 4 hours
- Create a bot with @BotFather and get the token
- Get your chat ID (send a message to your bot, then visit `https://api.telegram.org/bot<TOKEN>/getUpdates`)
- Add to config:
{
"telegram_bot_token": "123456789:ABCdefGHIjklMNOpqrsTUVwxyz",
"telegram_chat_id": "-1001234567890"
}

For group chats, the chat ID is negative. For private chats, use your user ID.
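
To check the token and chat ID outside the manager, a notification can be sent with the Telegram Bot API `sendMessage` method, which accepts `chat_id` and `text` as form fields. The sketch below is an independent example, not the manager's notifier; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// sendMessageURL builds the Bot API endpoint for the given bot token.
func sendMessageURL(token string) string {
	return "https://api.telegram.org/bot" + token + "/sendMessage"
}

// messageForm builds the form fields expected by sendMessage.
func messageForm(chatID, text string) url.Values {
	return url.Values{"chat_id": {chatID}, "text": {text}}
}

// notify posts one message; avoid logging the returned error verbatim in
// shared logs, because it can contain the request URL including the token.
func notify(client *http.Client, token, chatID, text string) error {
	resp, err := client.PostForm(sendMessageURL(token), messageForm(chatID, text))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("telegram returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Prints the request that would be sent; call notify with real values to send it.
	fmt.Println(sendMessageURL("BOT_TOKEN"))
	fmt.Println(messageForm("-1001234567890", "Failover complete").Encode())
}
```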