Solana Validator Failover System

A failover system for Solana validators written in Go. Consists of two programs:

- Manager: runs on the manager server, monitors validators, and triggers failover
- Validator Agent: runs on each validator server, responds to health checks, and executes failover commands

Prerequisites

Before setting up the failover system, ensure the following requirements are met on each validator server:

1. etcd Installation

etcd must be installed and running as a service on at least 3 servers for tower file synchronization (manager and both validators).

https://etcd.io/docs/v2.3/clustering/

2. Validator Snapshots

Snapshots must be enabled on your Solana validator to allow faster restarts and state recovery.

3. Autostart Configuration

All services (Solana validator, failover agent/manager, etcd) should be configured to start automatically after a system restart.

4. Sudoers Configuration

The failover agent needs to execute certain commands with sudo without password prompts:

sudo visudo

Add the following line (replace solana with your user):

solana ALL=(ALL) NOPASSWD: /usr/bin/systemctl stop failover-agent, /usr/bin/systemctl restart solana

5. Secure Identity Mode (if enabled)

In secure mode, the identity keypair is stored only on the manager server. This requires passwordless SSH access from the manager to each agent server. See the Secure Identity Mode section for configuration details.

Architecture

                                                             
                         MANAGER SERVER                          
                                                             
      +-----------------------------------------------------+   
      |                   Manager Program                   |   
      |  - Pings both validators every 5 seconds            |   
      |  - Auto-detects active/passive from gossip          |
      |  - Checks process running + slot difference         |   
      |  - Triggers failover if active is unhealthy         |   
      +------------------------+----------------------------+   
                               |                 
                               |                 
                          HTTP | HTTP
                               |                 
        +----------------------+---------------------+
        |                                            |
        v                                            v
  +--------------+                           +--------------+
  | VALIDATOR 1  |                           | VALIDATOR 2  |
  |              |                           |              |
  |  Agent       |                           |  Agent       |
  |  agave-valid |                           |  agave-valid |
  |  (active)    |                           |  (passive)   |
  +--------------+                           +--------------+

Features

Manager Program

- Auto-detects active/passive validators from gossip
- Pings two validator agents regularly
- Switches to passive if active doesn't respond (after N misses)
- Checks if validator process is running
- Checks slot difference (validator behind network)
- Telegram notifications for critical events
- Dry-run mode (test without actual failover)
- Remote agent shutdown command (--shutdown-agent)
- Manual failover trigger (--trigger-failover)
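
The "after N misses" rule above could be sketched as follows. This is a minimal illustration, not the project's actual code: the missTracker type and its observe method are hypothetical names, and resetting the counter on a successful check is an assumption.

```go
package main

import "fmt"

// missTracker counts consecutive failed health checks for one validator.
// Hypothetical type; only the "N misses" rule comes from the README.
type missTracker struct {
	threshold int // plays the role of misses_before_failover
	misses    int
}

// observe records one health-check result and reports whether the
// consecutive-miss threshold has been reached.
func (m *missTracker) observe(ok bool) bool {
	if ok {
		m.misses = 0 // assumption: a successful check resets the counter
		return false
	}
	m.misses++
	return m.misses >= m.threshold
}

func main() {
	tracker := &missTracker{threshold: 3}
	for i, ok := range []bool{false, false, true, false, false, false} {
		fmt.Printf("check %d ok=%v failover=%v\n", i+1, ok, tracker.observe(ok))
	}
}
```

With a threshold of 3, a single successful check in the middle delays the trigger until three further consecutive misses have been seen.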

Validator Agent Program

- Auto-detects active state from gossip on startup
- Responds to manager health checks
- Reports process status and slot information
- Backs up tower file to etcd on each manager ping
- Monitors manager heartbeat; if the manager goes offline, checks external network connectivity
- Executes identity change commands on failover
- Removes tower file when becoming passive (prevents stale tower)
- Remote shutdown endpoint
- Dry-run mode (logs commands without executing)

Building

# Build both programs
go build -o failover-manager ./cmd/manager
go build -o failover-agent ./cmd/validator

Quick Start

1. Starting order

Start the passive validator agent first, then the active validator agent, and finally the manager.

2. Configure Agents (on each validator server)

Create validator-config.json on each server (replace IDENTITY, MANAGER_IP, LEDGER_PATH, SOLANA_PATH, and IDENTITY_PATH with your values):

{
  "listen_addr": ":8080",
  "allowed_ips": ["MANAGER_IP"],
  "local_rpc": "http://127.0.0.1:8899",
  "process_name": "agave-validator",
  "manager_timeout": "30s",
  "tower_backup_command": "etcdctl put /solana/tower/active \"$(base64 -w0 LEDGER_PATH/tower-1_9-*.bin)\"",
  "tower_restore_command": "etcdctl get /solana/tower/active --print-value-only | base64 -d > LEDGER_PATH/tower-1_9-IDENTITY.bin",
  "identity_change_command": "SOLANA_PATH/agave-validator  -l LEDGER_PATH set-identity IDENTITY_PATH/testnet-validator-keypair.json",
  "identity_remove_command": "SOLANA_PATH/agave-validator  -l LEDGER_PATH set-identity IDENTITY_PATH/unstaked-identity.json",
  "dry_run": false,
  "tower_file_path": "LEDGER_PATH/tower-1_9-{validator_identity}.bin",
  "validator_identity": "IDENTITY",
  "gossip_check_command": "SOLANA_PATH/solana -ut gossip | grep {validator_identity}",
  "log_file": "/home/solana/failover/agent.log",
  "validator_restart_command": "sudo systemctl restart solana",
  "agent_stop_command": "sudo systemctl stop failover-agent",
  "active_identity_symlink_command": "ln -sf IDENTITY_PATH/testnet-validator-keypair.json IDENTITY_PATH/identity.json",
  "passive_identity_symlink_command": "ln -sf IDENTITY_PATH/unstaked-identity.json IDENTITY_PATH/identity.json"
}

In secure mode, the fields active_identity_symlink_command and identity_change_command are not needed.
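
To illustrate how a config like the one above might be consumed, here is a minimal sketch that decodes a subset of the fields, parses manager_timeout as a Go duration, and substitutes the {validator_identity} placeholder in tower_file_path. The agentConfig struct and parseAgentConfig function are hypothetical, and whether the agent performs the substitution exactly this way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// agentConfig mirrors a subset of validator-config.json.
// Hypothetical struct; only the JSON keys come from the example config.
type agentConfig struct {
	ListenAddr        string   `json:"listen_addr"`
	AllowedIPs        []string `json:"allowed_ips"`
	ManagerTimeout    string   `json:"manager_timeout"`
	TowerFilePath     string   `json:"tower_file_path"`
	ValidatorIdentity string   `json:"validator_identity"`
	DryRun            bool     `json:"dry_run"`
}

// parseAgentConfig decodes the JSON, parses manager_timeout, and expands
// {validator_identity} inside tower_file_path.
func parseAgentConfig(data []byte) (agentConfig, time.Duration, error) {
	var cfg agentConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, 0, fmt.Errorf("parse config: %w", err)
	}
	timeout, err := time.ParseDuration(cfg.ManagerTimeout)
	if err != nil {
		return cfg, 0, fmt.Errorf("manager_timeout: %w", err)
	}
	cfg.TowerFilePath = strings.ReplaceAll(cfg.TowerFilePath, "{validator_identity}", cfg.ValidatorIdentity)
	return cfg, timeout, nil
}

func main() {
	raw := []byte(`{
  "listen_addr": ":8080",
  "allowed_ips": ["MANAGER_IP"],
  "manager_timeout": "30s",
  "tower_file_path": "LEDGER_PATH/tower-1_9-{validator_identity}.bin",
  "validator_identity": "IDENTITY",
  "dry_run": false
}`)
	cfg, timeout, err := parseAgentConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.TowerFilePath, timeout)
}
```

Parsing the duration at load time means a malformed value such as "30 sec" is rejected at startup rather than during a failover.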

Run as a service:

./failover-agent --config validator-config.json

3. Configure Manager (on manager server)

Create manager-config.json. The example below is for testnet and assumes the identity keypair is stored on the agent servers:

{
  "validator1": {
    "endpoint": "http://AGENT_1_IP:8080",
    "ip": "AGENT_1_IP",
    "ledger_path": "/home/solana/ledger"
  },
  "validator2": {
    "endpoint": "http://AGENT_2_IP:8080",
    "ip": "AGENT_2_IP",
    "ledger_path": "/home/solana/ledger"
  },
  "gossip_check_command": "solana -ut gossip | grep IDENTITY",
  "cluster_rpc": "https://api.testnet.solana.com",
  "heartbeat_interval": "10s",
  "misses_before_failover": 3,
  "slot_diff_threshold": 100,
  "request_timeout": "8s",
  "dry_run": false,
  "telegram_bot_token": "BOT_TOKEN",
  "telegram_chat_id": "-CHAT_ID",
  "log_file": "/home/solana/failover/manager.log",
  "staked_identity_pubkey": "IDENTITY",
  "vote_account_pubkey": "VOTE_ACCOUNT",
  "stale_vote_slot_threshold": 75
}
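
The health criteria the manager applies (agent reachable, process running, slot difference within slot_diff_threshold) could be combined roughly as in the sketch below. The healthInput type and unhealthyReason function are hypothetical, and using a strict "greater than" comparison against the threshold is an assumption.

```go
package main

import "fmt"

// healthInput is a hypothetical snapshot of one health check.
type healthInput struct {
	Reachable      bool
	ProcessRunning bool
	LocalSlot      uint64 // slot reported by the validator
	ClusterSlot    uint64 // slot reported by cluster_rpc
}

// unhealthyReason returns an empty string for a healthy validator,
// otherwise a short description of the first failed criterion.
func unhealthyReason(in healthInput, slotDiffThreshold uint64) string {
	switch {
	case !in.Reachable:
		return "agent unreachable"
	case !in.ProcessRunning:
		return "validator process not running"
	// Guard against unsigned underflow when the validator is ahead.
	case in.ClusterSlot > in.LocalSlot && in.ClusterSlot-in.LocalSlot > slotDiffThreshold:
		return fmt.Sprintf("validator is %d slots behind", in.ClusterSlot-in.LocalSlot)
	default:
		return ""
	}
}

func main() {
	in := healthInput{Reachable: true, ProcessRunning: true, LocalSlot: 1000, ClusterSlot: 1201}
	fmt.Printf("%q\n", unhealthyReason(in, 100))
}
```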

Run as a service:

./failover-manager --config manager-config.json

Secure Identity Mode

In secure identity mode, the staked identity keypair is stored only on the manager server and never on the validator servers. When failover occurs, the manager sends the identity via SSH.

Configuration

Add these fields to manager config:

{
  "secure_identity_mode": true,
  "identity_keypair_path": "IDENTITY_PATH/identity.json",
  "ssh_user": "solana",
  "ssh_key_path": "~/.ssh/failover_key",
  "ssh_set_identity_command": "SOLANA_PATH/agave-validator --ledger {ledger} set-identity",
  "ssh_authorized_voter_command": "SOLANA_PATH/agave-validator --ledger {ledger} authorized-voter add"
}

| Field | Description |
|-------|-------------|
| `secure_identity_mode` | Enable secure mode (default: false) |
| `identity_keypair_path` | Path to staked identity keypair on manager machine |
| `ssh_user` | SSH username for validator servers |
| `ssh_key_path` | Path to SSH private key (supports `~`) |
| `ssh_set_identity_command` | Command template for set-identity. Use `{ledger}` placeholder |
| `ssh_authorized_voter_command` | Command template for authorized-voter. Use `{ledger}` placeholder |
| `ledger_path` | Ledger path on each validator (in validator1/validator2 config) |
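
The `~` support in ssh_key_path and the `{ledger}` placeholder in the command templates could be handled along these lines. Both helper functions are hypothetical names used only for illustration; the actual expansion logic in the manager may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// expandHome replaces a leading "~" with the given home directory.
// Paths such as "~otheruser/..." are left untouched.
func expandHome(path, home string) string {
	if path == "~" {
		return home
	}
	if strings.HasPrefix(path, "~/") {
		return home + path[1:]
	}
	return path
}

// renderLedger substitutes the {ledger} placeholder in a command template
// with the ledger_path configured for the target validator.
func renderLedger(template, ledgerPath string) string {
	return strings.ReplaceAll(template, "{ledger}", ledgerPath)
}

func main() {
	fmt.Println(expandHome("~/.ssh/failover_key", "/home/solana"))
	fmt.Println(renderLedger("SOLANA_PATH/agave-validator --ledger {ledger} set-identity", "/home/solana/ledger"))
}
```

In a real program the home directory would typically come from os.UserHomeDir rather than a literal.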

SSH Setup

# Generate SSH key on manager
ssh-keygen -t ed25519 -f ~/.ssh/failover_key -N ""

# Copy to validator servers
ssh-copy-id -i ~/.ssh/failover_key.pub solana@VALIDATOR1_IP
ssh-copy-id -i ~/.ssh/failover_key.pub solana@VALIDATOR2_IP

How It Works

1. Manager sends become_active to the agent with skip_identity=true.
2. Agent only restores the tower file (skips identity commands).
3. Manager connects to the validator over SSH with the identity keypair redirected to stdin:
   - ssh user@host "agave-validator --ledger /path set-identity" < identity.json
   - ssh user@host "agave-validator --ledger /path authorized-voter add" < identity.json

In this mode, the agent's identity_change_command and active_identity_symlink_command are ignored.
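
The SSH step above, with the keypair fed through stdin, might look like this with Go's os/exec package. This is a sketch under assumptions: buildSSHCommand is a hypothetical helper, the key path and ledger path are example values, and the command is only constructed and printed here, not executed.

```go
package main

import (
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// buildSSHCommand prepares (but does not run) an ssh invocation whose stdin
// is the identity keypair, mirroring: ssh user@host "cmd" < identity.json
func buildSSHCommand(keyPath, user, host, remoteCmd string, identity io.Reader) *exec.Cmd {
	cmd := exec.Command("ssh", "-i", keyPath, user+"@"+host, remoteCmd)
	cmd.Stdin = identity // the keypair never touches the remote disk
	return cmd
}

func main() {
	// A strings.Reader stands in for the opened identity keypair file.
	keypair := strings.NewReader("[placeholder keypair bytes]")
	cmd := buildSSHCommand(
		"/home/solana/.ssh/failover_key", "solana", "VALIDATOR1_IP",
		"agave-validator --ledger /home/solana/ledger set-identity",
		keypair,
	)
	fmt.Println(strings.Join(cmd.Args, " "))
	// To execute for real, one would call cmd.CombinedOutput() and check the error.
}
```

Passing the remote command as a single argument lets the remote shell parse it, which matches the quoted form shown in the steps above.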

Failover Process

When Active Becomes Unhealthy

1. Manager detects an unhealthy active validator (unreachable, process down, behind on slots, etc.).
2. Manager sends become_passive to the old active:
   - Agent backs up tower file
   - Agent removes identity (switches to unstaked)
   - Agent deletes tower file
   - Agent marks itself as passive
3. Manager sends become_active to the new active:
   - Agent restores tower file from backup
   - Agent sets voting identity
   - Agent marks itself as active
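
The ordering of the two commands above (demote the old active first, then promote the new one) can be sketched as below. The agentClient interface, recordingAgent type, and runFailover function are hypothetical. Continuing with promotion when the demotion call fails is an assumption, made here because an unreachable active validator is one of the listed failover triggers.

```go
package main

import (
	"errors"
	"fmt"
)

// agentClient is a hypothetical abstraction over one validator agent.
type agentClient interface {
	Send(command string) error
}

// recordingAgent is a fake agent that records every command it receives.
type recordingAgent struct {
	name  string
	fail  bool
	calls *[]string
}

func (a recordingAgent) Send(command string) error {
	*a.calls = append(*a.calls, a.name+":"+command)
	if a.fail {
		return errors.New(a.name + " unreachable")
	}
	return nil
}

// runFailover demotes the old active validator, then promotes the new one.
func runFailover(oldActive, newActive agentClient) error {
	if err := oldActive.Send("become_passive"); err != nil {
		// Assumption: a failed demotion is logged and failover continues.
		fmt.Println("warning: demotion failed:", err)
	}
	if err := newActive.Send("become_active"); err != nil {
		return fmt.Errorf("promote new active: %w", err)
	}
	return nil
}

func main() {
	var calls []string
	old := recordingAgent{name: "validator1", fail: true, calls: &calls}
	standby := recordingAgent{name: "validator2", calls: &calls}
	err := runFailover(old, standby)
	fmt.Println(err, calls)
}
```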

When Manager Goes Offline

1. Active agent detects no manager heartbeat for 15s.
2. Active agent checks external network connectivity (tests Cloudflare, Google, and Quad9 DNS endpoints).
3. If the network check fails (2 or more endpoints unreachable), the agent becomes passive to avoid split-brain.
4. If the network is available, the agent stays active and waits for the manager to come back.
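
The split-brain decision above reduces to counting unreachable endpoints. The sketch below shows one way to express it. The probe addresses (public resolver IPs on TCP port 53), the 3-second dial timeout, and all function names are assumptions; the README only names the three providers and the "2+" rule.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTargets are assumed addresses for the Cloudflare, Google, and Quad9
// public DNS resolvers. The agent's actual endpoints may differ.
var probeTargets = []string{"1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"}

// tcpReachable reports whether a TCP connection to addr succeeds.
func tcpReachable(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// shouldBecomePassive returns true when two or more targets are unreachable,
// meaning this server is more likely isolated than the manager is down.
func shouldBecomePassive(targets []string, reachable func(string) bool) bool {
	failed := 0
	for _, target := range targets {
		if !reachable(target) {
			failed++
		}
	}
	return failed >= 2
}

func main() {
	// A stub probe keeps the example runnable without network access.
	offline := map[string]bool{"1.1.1.1:53": true, "9.9.9.9:53": true}
	stub := func(addr string) bool { return !offline[addr] }
	fmt.Println("become passive:", shouldBecomePassive(probeTargets, stub))
	// With real probes: shouldBecomePassive(probeTargets, tcpReachable)
}
```

Injecting the probe as a function keeps the decision rule testable without depending on live connectivity.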

Stopping the Failover System

To safely shut down the failover system:

Stop the manager first:

# Stop the manager service
sudo systemctl stop failover-manager

Stop both agents simultaneously:

# From the manager server, send shutdown command to all agents
./failover-manager --config manager-config.json --shutdown-agent

This ensures both validator agents are stopped at the same time, preventing either from detecting the other as unavailable and triggering unnecessary failover logic.

Note: The --shutdown-agent command requires the manager binary but does not start the manager service. It only sends shutdown commands to the configured agent endpoints.

Manual Failover

On the manager server, run:

./failover-manager --trigger-failover --reason "manual failover triggered via CLI"

API Endpoints

Validator Agent

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status` | POST | Returns validator status (used by manager) |
| `/failover` | POST | Execute failover command |
| `/shutdown` | POST | Shutdown the agent |
| `/identity` | GET | Returns current validator identity pubkey |
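
As a rough illustration of calling the agent, the sketch below issues a POST to /status with a request timeout. The request and response schemas are not documented here, so the client sends an empty body and returns the raw response. The fake server, its placeholder JSON, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// newFakeAgent returns a local server standing in for a validator agent.
// The JSON it returns is a placeholder, not the agent's real schema.
func newFakeAgent() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"placeholder":true}`)
	})
	return httptest.NewServer(mux)
}

// postStatus sends POST /status and returns the HTTP status code and raw body.
func postStatus(baseURL string, timeoutSeconds int) (int, string, error) {
	client := &http.Client{Timeout: time.Duration(timeoutSeconds) * time.Second}
	resp, err := client.Post(baseURL+"/status", "application/json", nil)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(body), nil
}

func main() {
	srv := newFakeAgent()
	defer srv.Close()
	code, body, err := postStatus(srv.URL, 8) // 8s mirrors request_timeout in the example config
	fmt.Println(code, body, err)
}
```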

Telegram Notifications

The manager can send notifications to Telegram for critical events:

- 🔄 Failover complete: sent when failover succeeds (with reason and validator info)
- 🔴 Server unreachable: sent when a validator becomes unreachable (sent only once)
- 🟢 Server back online: sent when a validator becomes reachable again
- 🟢 Server status: sent every 4 hours

Setup

1. Create a bot with @BotFather and get the token
2. Get your chat ID (send a message to your bot, then visit https://api.telegram.org/bot<TOKEN>/getUpdates)
3. Add to config:

{
  "telegram_bot_token": "123456789:ABCdefGHIjklMNOpqrsTUVwxyz",
  "telegram_chat_id": "-1001234567890"
}

For group chats, the chat ID is negative. For private chats, use your user ID.
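
For reference, sending a notification with these two values uses the Telegram Bot API sendMessage method. The sketch below is not the manager's actual implementation: sendTelegram, newFakeTelegram, and the captured type are hypothetical, and the API base URL is a parameter so the example can run against a local fake server instead of https://api.telegram.org.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// captured records what the fake Telegram server received.
type captured struct{ Path, ChatID, Text string }

// newFakeTelegram returns a local stand-in for the Telegram Bot API.
func newFakeTelegram(c *captured) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		c.Path = r.URL.Path
		c.ChatID = r.PostFormValue("chat_id")
		c.Text = r.PostFormValue("text")
		w.WriteHeader(http.StatusOK)
	}))
}

// sendTelegram posts chat_id and text to <apiBase>/bot<token>/sendMessage.
func sendTelegram(apiBase, token, chatID, text string) error {
	endpoint := apiBase + "/bot" + token + "/sendMessage"
	resp, err := http.PostForm(endpoint, url.Values{"chat_id": {chatID}, "text": {text}})
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("telegram sendMessage: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	c := &captured{}
	srv := newFakeTelegram(c)
	defer srv.Close()
	err := sendTelegram(srv.URL, "BOT_TOKEN", "-1001234567890", "Failover complete")
	fmt.Println(err, c.Path, c.ChatID, c.Text)
}
```

Against the real service, apiBase would be https://api.telegram.org and the token and chat ID would come from the manager config.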

oleksandrmarkelov/failover