RKE2 High Availability Cluster - Ansible Playbooks

Ansible playbooks for deploying a production-ready RKE2 Kubernetes cluster with High Availability (HA) configuration.

🚀 Quick Start

# 1. Clone repository
git clone <repository-url>
cd kubernetes-cookbook

# 2. Configure environment
cp .env.example .env
vim .env  # Edit configuration

# ⚠️ IMPORTANT: Update these credentials in .env
# - RKE2_TOKEN: Change to a strong random value

# 3. Configure inventory
cp inventory/hosts.yml.example inventory/hosts.yml
vim inventory/hosts.yml  # Add your server IPs and SSH credentials

# 4. Setup SSH keys (if not already done)
ssh-keygen -t rsa -b 4096
ssh-copy-id root@<master-ip>
ssh-copy-id root@<worker-ip>

# 5. Test connectivity
set -a && source .env && set +a
ansible -i inventory/hosts.yml all -m ping

# 6. Deploy RKE2 cluster
ansible-playbook -i inventory/hosts.yml site.yml

# 7. Get kubeconfig
scp root@<master-ip>:/etc/rancher/rke2/rke2.yaml ~/.kube/rke2-config
# Edit the file and change server: https://127.0.0.1:6443 to your master IP
export KUBECONFIG=~/.kube/rke2-config
kubectl get nodes

Overview

RKE2 (Rancher Kubernetes Engine 2) is a CNCF-certified Kubernetes distribution focused on security and compliance. These playbooks automate the deployment of an HA RKE2 cluster with multiple control plane nodes.

🎯 Features

  • ✅ RKE2 HA Cluster: Multi-master Kubernetes cluster with etcd
  • ✅ Environment-based Configuration: All settings via .env file
  • ✅ Secure Credentials Management: Tokens not in git
  • ✅ Private Registry Support: Multiple registry authentication
  • ✅ Automated Installation: Complete cluster setup with Ansible

System Requirements

Minimum Hardware

Master Nodes (Control Plane):

  • CPU: 2 cores
  • RAM: 4GB
  • Disk: 50GB
  • Quantity: 1 node (minimum) or 3 nodes (for HA)

Worker Nodes:

  • CPU: 2 cores
  • RAM: 4GB
  • Disk: 50GB
  • Quantity: 2+ nodes

Supported Operating Systems

  • RHEL/CentOS 7.x, 8.x
  • Rocky Linux 8.x, 9.x
  • Ubuntu 18.04, 20.04, 22.04, 24.04
  • Debian 10, 11

Network Requirements

  • All nodes must have internet connectivity to download RKE2
  • Nodes must be able to communicate with each other
  • Load Balancer for RKE2 API Server (recommended for production)

Required Ports

Master Nodes:

  • 9345/tcp - RKE2 supervisor API
  • 6443/tcp - Kubernetes API
  • 10250/tcp - Kubelet metrics
  • 2379-2380/tcp - etcd
  • 8472/udp - VXLAN (Canal/Flannel)
  • 4789/udp - VXLAN (Calico/Cilium, when VXLAN encapsulation is enabled)
  • 9098/tcp - Calico Typha health check
  • 9099/tcp - Canal/Calico (Felix) health check

Worker Nodes:

  • 10250/tcp - Kubelet metrics
  • 8472/udp - VXLAN (Canal/Flannel)
  • 4789/udp - VXLAN (Calico/Cilium, when VXLAN encapsulation is enabled)

All Nodes:

  • 30000-32767/tcp - NodePort Services
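
If firewalld is running on the nodes, the ports listed above must be opened (the RKE2 docs generally recommend disabling firewalld, but if you keep it, a sketch like the following should work; run as root and trim the list per node role):

# Example for a master node
firewall-cmd --permanent --add-port=9345/tcp --add-port=6443/tcp --add-port=10250/tcp
firewall-cmd --permanent --add-port=2379-2380/tcp
firewall-cmd --permanent --add-port=8472/udp --add-port=4789/udp
firewall-cmd --permanent --add-port=9098/tcp --add-port=9099/tcp
firewall-cmd --permanent --add-port=30000-32767/tcp
firewall-cmd --reload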

Installing Ansible

macOS

brew install ansible

Ubuntu/Debian

sudo apt update
sudo apt install ansible -y

RHEL/CentOS

sudo yum install ansible -y
# or
sudo dnf install ansible -y

Install Required Ansible Collections

ansible-galaxy collection install ansible.posix
ansible-galaxy collection install community.general

📋 Configuration

1. Configure Environment Variables (.env)

# Copy example file
cp .env.example .env

# Edit important variables
vim .env

Key variables to configure:

# API Server Load Balancer VIP or DNS
RKE2_API_IP="192.168.1.100"

# Cluster token - MUST CHANGE THIS
# Note: Use a simple password string, not K10<hash>::<user>:<pass> format
RKE2_TOKEN="your-secure-random-token-here"
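# One way to generate a strong random token (example): openssl rand -hex 32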

# Network Plugin
RKE2_CNI="canal"  # canal, calico, cilium, flannel

# CIDR Ranges
RKE2_CLUSTER_CIDR="10.42.0.0/16"
RKE2_SERVICE_CIDR="10.43.0.0/16"

# Additional TLS SANs (comma-separated)
RKE2_TLS_SAN_EXTRA="rke2.example.com,kubernetes.example.com"

# Container Registry Configuration (JSON format for multiple registries)
REGISTRIES_JSON='[{"url":"registry.company.com","username":"admin","password":"secret"}]'
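
These values reach Ansible through group_vars/all.yml, which reads them from the shell environment; this is why every example exports the .env file with set -a && source .env before running Ansible. A minimal sketch of the pattern (the variable names here are illustrative, not necessarily the exact ones the playbooks use):

# group_vars/all.yml (illustrative sketch)
rke2_token: "{{ lookup('env', 'RKE2_TOKEN') }}"
rke2_api_ip: "{{ lookup('env', 'RKE2_API_IP') }}"
rke2_cni: "{{ lookup('env', 'RKE2_CNI') | default('canal', true) }}"
rke2_cluster_cidr: "{{ lookup('env', 'RKE2_CLUSTER_CIDR') | default('10.42.0.0/16', true) }}"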

2. Configure Inventory (hosts.yml)

# Copy example file
cp inventory/hosts.yml.example inventory/hosts.yml

# Edit with your node information
vim inventory/hosts.yml

Example configuration:

all:
  children:
    master:
      hosts:
        k8s-master-01:
          ansible_host: 192.168.1.101
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/id_rsa
    worker:
      hosts:
        k8s-worker-01:
          ansible_host: 192.168.1.111
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/id_rsa
        k8s-worker-02:
          ansible_host: 192.168.1.112
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/id_rsa

SSH Options:

# Option 1: SSH Key (Recommended)
ansible_ssh_private_key_file: ~/.ssh/id_rsa

# Option 2: Password
ansible_ssh_pass: your_password

# Option 3: Sudo password (if non-root user)
ansible_become_pass: your_sudo_password

3. Setup SSH Key-based Authentication (Recommended)

# Generate SSH key if not exists
ssh-keygen -t rsa -b 4096

# Copy SSH key to all nodes
ssh-copy-id root@192.168.1.101
ssh-copy-id root@192.168.1.111
ssh-copy-id root@192.168.1.112

# Test SSH connectivity
ssh root@192.168.1.101

4. Test Ansible Connectivity

# Load environment variables
set -a && source .env && set +a

# Test with SSH key
ansible -i inventory/hosts.yml all -m ping

# Or test with password
ansible -i inventory/hosts.yml all -m ping --ask-pass

🔧 Installing RKE2 Cluster

1. Verify Configuration

# Check syntax
ansible-playbook -i inventory/hosts.yml site.yml --syntax-check

# Dry run (no changes made)
ansible-playbook -i inventory/hosts.yml site.yml --check

# List tasks to be executed
ansible-playbook -i inventory/hosts.yml site.yml --list-tasks

2. Run Installation

# Load environment variables first
set -a && source .env && set +a

# Full cluster installation
ansible-playbook -i inventory/hosts.yml site.yml

# With verbose output (for debugging)
ansible-playbook -i inventory/hosts.yml site.yml -v
# or -vv, -vvv, -vvvv for more details

3. Install Specific Components

# Install master nodes only
ansible-playbook -i inventory/hosts.yml site.yml --limit master

# Install worker nodes only
ansible-playbook -i inventory/hosts.yml site.yml --limit worker

# Install specific node only
ansible-playbook -i inventory/hosts.yml site.yml --limit k8s-master-01

📊 Cluster Management

Check Cluster Status

# SSH to master node
ssh root@<master-ip>

# Check cluster nodes
kubectl get nodes -o wide

# Check all pods across namespaces
kubectl get pods -A -o wide

# Check RKE2 service status
systemctl status rke2-server  # on master
systemctl status rke2-agent   # on worker

Verify Installation

Check cluster status:

# Check nodes
kubectl get nodes -o wide

# Check system pods
kubectl get pods -A

# Check RKE2 version
kubectl version

Load Balancer Setup (Important for HA)

For a production HA setup, configure a load balancer in front of the RKE2 API server and supervisor ports:

HAProxy Configuration

# /etc/haproxy/haproxy.cfg
frontend rke2_api_frontend
    bind *:6443
    mode tcp
    option tcplog
    default_backend rke2_api_backend

frontend rke2_supervisor_frontend
    bind *:9345
    mode tcp
    option tcplog
    default_backend rke2_supervisor_backend

backend rke2_api_backend
    mode tcp
    option tcp-check
    balance roundrobin
    server master-01 192.168.1.101:6443 check
    server master-02 192.168.1.102:6443 check
    server master-03 192.168.1.103:6443 check

backend rke2_supervisor_backend
    mode tcp
    option tcp-check
    balance roundrobin
    server master-01 192.168.1.101:9345 check
    server master-02 192.168.1.102:9345 check
    server master-03 192.168.1.103:9345 check

Nginx Load Balancer (Alternative)

# /etc/nginx/nginx.conf
stream {
    upstream rke2_api {
        least_conn;
        server 192.168.1.101:6443 max_fails=3 fail_timeout=5s;
        server 192.168.1.102:6443 max_fails=3 fail_timeout=5s;
        server 192.168.1.103:6443 max_fails=3 fail_timeout=5s;
    }

    upstream rke2_supervisor {
        least_conn;
        server 192.168.1.101:9345 max_fails=3 fail_timeout=5s;
        server 192.168.1.102:9345 max_fails=3 fail_timeout=5s;
        server 192.168.1.103:9345 max_fails=3 fail_timeout=5s;
    }

    server {
        listen 6443;
        proxy_pass rke2_api;
    }

    server {
        listen 9345;
        proxy_pass rke2_supervisor;
    }
}
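
The HAProxy or Nginx host itself is a single point of failure unless you run two load balancer nodes and float the API VIP (RKE2_API_IP, 192.168.1.100 in these examples) between them. keepalived is a common way to do that; a minimal sketch, where the interface name, router ID, and priority are assumptions for your environment:

# /etc/keepalived/keepalived.conf on the primary LB (use state BACKUP and a lower priority on the second)
vrrp_instance RKE2_API {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24
    }
}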

🎯 Using the Cluster

1. Get kubeconfig

# Copy from master node
scp root@192.168.1.101:/etc/rancher/rke2/rke2.yaml ~/.kube/rke2-config

# Update server address in kubeconfig (replace 127.0.0.1 with Load Balancer IP)
sed -i '' 's/127.0.0.1/192.168.1.100/g' ~/.kube/rke2-config   # macOS (BSD sed)
# On Linux (GNU sed), drop the empty string: sed -i 's/127.0.0.1/192.168.1.100/g' ~/.kube/rke2-config

# Export kubeconfig
export KUBECONFIG=~/.kube/rke2-config

# Or merge into main kubeconfig
KUBECONFIG=~/.kube/config:~/.kube/rke2-config kubectl config view --flatten > ~/.kube/config.new
mv ~/.kube/config.new ~/.kube/config

2. Verify Cluster

# Check nodes
kubectl get nodes -o wide

# Check system pods
kubectl get pods -A

# Cluster info
kubectl cluster-info

# Check RKE2 version
kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}'

# Check etcd health
kubectl get pods -n kube-system -l component=etcd

# View component status
kubectl get componentstatuses  # deprecated in newer versions
kubectl get --raw='/readyz?verbose'  # preferred method

3. Deploy Test Application

# Create test deployment
kubectl create deployment nginx --image=nginx --replicas=3

# Expose as service
kubectl expose deployment nginx --port=80 --type=NodePort

# Check deployment
kubectl get deployments
kubectl get pods -l app=nginx
kubectl get svc nginx
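# The NodePort assigned above can be read with, for example:
# kubectl get svc nginx -o jsonpath='{.spec.ports[0].nodePort}'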

# Access the service
curl http://<node-ip>:<node-port>

# Cleanup
kubectl delete deployment nginx
kubectl delete service nginx

πŸ“ Cluster Management

Add New Worker Node

  1. Add node to inventory/hosts.yml:
worker:
  hosts:
    k8s-worker-03:
      ansible_host: 192.168.1.113
      ansible_user: root
      ansible_ssh_private_key_file: ~/.ssh/id_rsa
  2. Run playbook for new node only:
set -a && source .env && set +a
ansible-playbook -i inventory/hosts.yml site.yml --limit k8s-worker-03
  3. Verify node joined:
kubectl get nodes

Add New Master Node

  1. Add to inventory under master section
  2. Update Load Balancer configuration to include new master
  3. Run playbook:
set -a && source .env && set +a
ansible-playbook -i inventory/hosts.yml site.yml --limit k8s-master-02
  4. Verify etcd cluster:
kubectl get pods -n kube-system -l component=etcd

Remove Node

# Drain node first
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Remove from cluster
kubectl delete node <node-name>

# Uninstall RKE2 on the node
ansible-playbook -i inventory/hosts.yml uninstall.yml --limit <node-name>

# Remove from inventory
# Edit inventory/hosts.yml and remove the node entry

📂 Project Structure

kubernetes-cookbook/
├── .env.example              # Environment variables template
├── .env                      # Actual configuration (gitignored)
├── .gitignore                # Git ignore rules
├── ansible.cfg               # Ansible configuration
├── site.yml                  # Main installation playbook
├── uninstall.yml             # Uninstall playbook
├── README.md                 # This file
├── inventory/
│   ├── hosts.yml.example     # Inventory template
│   └── hosts.yml             # Actual inventory (gitignored)
├── group_vars/
│   └── all.yml               # Ansible variables (reads from environment)
├── vars/
│   └── registry.yml          # Complex variable parsing (registries, taints)
└── roles/
    ├── prereq/               # System preparation
    │   └── tasks/
    │       └── main.yml
    ├── rke2-server/          # RKE2 master nodes installation
    │   ├── tasks/
    │   │   ├── main.yml
    │   │   └── registry.yml
    │   └── templates/
    │       ├── config.yaml.j2
    │       └── registries.yaml.j2
    └── rke2-agent/           # RKE2 worker nodes installation
        ├── tasks/
        │   ├── main.yml
        │   └── registry.yml
        └── templates/
            ├── config.yaml.j2
            └── registries.yaml.j2
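
The config.yaml.j2 templates render /etc/rancher/rke2/config.yaml on each node. As a rough sketch of what the server template might produce for the example values in this README (the exact variable mapping is an assumption; check the template itself):

# /etc/rancher/rke2/config.yaml on a master node (sketch)
token: your-secure-random-token-here
tls-san:
  - 192.168.1.100
  - rke2.example.com
  - kubernetes.example.com
cni: canal
cluster-cidr: 10.42.0.0/16
service-cidr: 10.43.0.0/16
# The second and third masters, and all agents, additionally point at the first server or the LB:
# server: https://192.168.1.100:9345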

πŸ› Troubleshooting

1. Check RKE2 Service Status

# From Ansible control node
ansible -i inventory/hosts.yml master -m shell -a "systemctl status rke2-server"
ansible -i inventory/hosts.yml worker -m shell -a "systemctl status rke2-agent"

# View logs
ansible -i inventory/hosts.yml master -m shell -a "journalctl -u rke2-server -n 50"
ansible -i inventory/hosts.yml worker -m shell -a "journalctl -u rke2-agent -n 50"

# Or SSH directly to node
ssh root@192.168.1.101
systemctl status rke2-server
journalctl -u rke2-server -f

2. Node Not Ready

# Check node conditions
kubectl describe node <node-name>

# Check kubelet logs on worker
ssh root@<worker-ip>
journalctl -u rke2-agent -n 100 --no-pager

# Check CNI pods
kubectl get pods -n kube-system -o wide | grep -E 'canal|calico|cilium'

# Restart RKE2 service
systemctl restart rke2-agent  # on worker
systemctl restart rke2-server # on master

3. etcd Issues

# Check etcd pods
kubectl get pods -n kube-system -l component=etcd

# Check etcd health (from master node)
ETCDCTL_API=3 /var/lib/rancher/rke2/bin/etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint health

# List etcd members
ETCDCTL_API=3 /var/lib/rancher/rke2/bin/etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list

4. Certificate Issues

# Check certificate validity
openssl x509 -in /var/lib/rancher/rke2/server/tls/server-ca.crt -text -noout | grep -A2 Validity

# Verify TLS connection
openssl s_client -connect <master-ip>:6443 -servername kubernetes

# Check kubelet certificates
ls -la /var/lib/rancher/rke2/agent/*.crt

5. Network Issues

# Test connectivity between nodes
ansible -i inventory/hosts.yml all -m ping

# Test port connectivity
ansible -i inventory/hosts.yml all -m wait_for -a "host=192.168.1.101 port=6443 timeout=5"

# Check iptables rules
ansible -i inventory/hosts.yml all -m shell -a "iptables -L -n | grep 10.42"

# Test RKE2 API endpoint
curl -k https://192.168.1.100:6443/version

# Check pod network connectivity
kubectl run test-pod --image=busybox --rm -it -- ping 10.42.0.1

6. SSH Issues

# Test SSH connectivity with verbose output
ssh -vvv root@192.168.1.101

# Copy SSH key again if needed
ssh-copy-id -i ~/.ssh/id_rsa.pub root@192.168.1.101

# Test with password authentication
ansible -i inventory/hosts.yml all -m ping --ask-pass

# Check SSH configuration
cat ~/.ssh/config

7. Ansible Issues

# Verbose mode to see detailed errors
ansible-playbook -i inventory/hosts.yml site.yml -vvv

# Check inventory parsing
ansible-inventory -i inventory/hosts.yml --list
ansible-inventory -i inventory/hosts.yml --graph

# Test fact gathering
ansible -i inventory/hosts.yml all -m setup

# Validate environment variables
set -a && source .env && set +a
env | grep RKE2

# Test RKE2 API
curl -k https://<master-ip>:6443

8. Registry Issues

# Check registry configuration
cat /etc/rancher/rke2/registries.yaml

# Test registry connectivity
curl -v https://<registry-url>/v2/_catalog

# Check pod image pull status
kubectl describe pod <pod-name> | grep -A5 Events

# View containerd logs
journalctl -u containerd -n 100 --no-pager
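
For reference, RKE2's registries.yaml uses containerd's mirrors/configs layout. The REGISTRIES_JSON example from the configuration section would render to roughly the following (the exact template output may differ):

# /etc/rancher/rke2/registries.yaml (sketch)
mirrors:
  registry.company.com:
    endpoint:
      - "https://registry.company.com"
configs:
  registry.company.com:
    auth:
      username: admin
      password: secret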

🔄 Upgrade RKE2

Method 1: Using Ansible

  1. Update RKE2_VERSION in .env:
RKE2_VERSION="v1.35.0+rke2r1"
  2. Load environment and run playbook:
set -a && source .env && set +a
ansible-playbook -i inventory/hosts.yml site.yml

Method 2: Manual Upgrade

# On each node, run:
curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.35.0+rke2r1 sh -

# Restart service
systemctl restart rke2-server  # on master
systemctl restart rke2-agent   # on worker

# Verify upgrade
kubectl get nodes

Note: Always upgrade master nodes first, then worker nodes.
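
For minimal disruption, each node can also be cordoned and drained around its upgrade. A per-node sketch for a worker, run from a machine with kubectl access (node name and IP are placeholders):

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
ssh root@<node-ip> "curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.35.0+rke2r1 sh - && systemctl restart rke2-agent"
kubectl uncordon <node-name>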

💾 Backup and Restore

Backup etcd

# SSH to master node
ssh root@<master-ip>

# Create etcd snapshot using RKE2 built-in command
rke2 etcd-snapshot save --name snapshot-$(date +%Y%m%d-%H%M%S)

# List snapshots
rke2 etcd-snapshot list

# Snapshots are stored in: /var/lib/rancher/rke2/server/db/snapshots/

# Copy snapshot to backup location
scp /var/lib/rancher/rke2/server/db/snapshots/snapshot-*.db backup-server:/backups/
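
RKE2 can also take snapshots on a schedule. One option is to add something like the following to /etc/rancher/rke2/config.yaml on the master nodes and restart rke2-server (values are just an example):

# Automatic etcd snapshots
etcd-snapshot-schedule-cron: "0 3 * * *"   # daily at 03:00
etcd-snapshot-retention: 7                 # keep the last 7 snapshots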

Restore from Backup

# Stop RKE2 on all master nodes
systemctl stop rke2-server

# On first master node, restore snapshot
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/snapshot-xxx.db

# Start RKE2 on first master
systemctl start rke2-server

# Wait for cluster to be ready
kubectl get nodes

# Re-join other master nodes (if multi-master setup)
# They will sync from the first master; if a node fails to rejoin, you may need to
# remove its stale etcd data first (e.g. rm -rf /var/lib/rancher/rke2/server/db)
systemctl start rke2-server

✨ Best Practices

  1. Use at least 3 master nodes for production HA setup
  2. Setup Load Balancer before cluster installation
  3. Backup etcd regularly (recommended: daily automated backups)
  4. Monitor cluster health with Prometheus/Grafana
  5. Update RKE2 regularly to patch security vulnerabilities
  6. Use persistent storage for production workloads
  7. Configure resource limits for pods (requests/limits)
  8. Implement network policies for security isolation (see the example after this list)
  9. Enable audit logging to track cluster activities
  10. Test disaster recovery procedures periodically
  11. Use private registries with authentication for production images
  12. Implement proper RBAC for user access control
  13. Enable Pod Security Standards (PSS) for workload security
  14. Use secrets management (e.g., Sealed Secrets, External Secrets)
  15. Regular security scanning of container images
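
As a concrete starting point for item 8, a namespace-wide default-deny ingress policy looks like the following (the namespace name is a placeholder; the default Canal CNI enforces NetworkPolicy):

# default-deny-ingress.yaml - blocks all ingress to pods in the namespace unless another policy allows it
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app
spec:
  podSelector: {}
  policyTypes:
    - Ingress

# Apply with: kubectl apply -f default-deny-ingress.yaml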

🗑️ Uninstall Cluster

To completely remove RKE2 from all nodes:

set -a && source .env && set +a
ansible-playbook -i inventory/hosts.yml uninstall.yml

This will:

  • Stop RKE2 services on all nodes
  • Remove RKE2 binaries and configuration files
  • Clean up container images and volumes
  • Reset network interfaces
  • Remove firewall rules

πŸ“ License

This project is licensed under the MIT License.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

💬 Support

If you encounter any issues or have questions:

  1. Check the Troubleshooting section above
  2. Review the RKE2 logs: journalctl -u rke2-server -f
  3. Review the RKE2 official documentation
  4. Open an issue in this repository with full details about the error

Author: XDEV Asia Labs
Last Updated: December 2025
RKE2 Version: v1.34.2+rke2r1 (Kubernetes v1.34.2)

