Think of your setup as three layers:
┌────────────────────────────┐
│        Laptop (UI)         │
│    Browser → Rancher UI    │
└─────────────▲──────────────┘
              │ HTTPS
┌─────────────┴──────────────┐
│    Rancher Node (dcn0)     │
│  Management + Monitoring   │
└─────────────▲──────────────┘
              │ Kubernetes API
┌─────────────┴──────────────┐
│    K3s Cluster (7 Pis)     │
│  Control Plane + Workers   │
└────────────────────────────┘
Nothing is random. Everything is deterministic and role-based.
The repository is designed for fast, deterministic cluster bring-up, not for fully managed HA cloud replacement.
This poster documents the iterative evolution of the EdgeStack-K3s cluster – from an early multi-node Raspberry Pi 3B+ setup to a compact, performance-optimized Raspberry Pi 4 deployment.
The architecture, workloads, and orchestration layer (K3s + Rancher) remained consistent across phases, enabling practical performance comparison rather than synthetic benchmarking.
| Dimension | Pi 3B+ Cluster (7–8 nodes) | Pi 4 Cluster (2 nodes) |
|---|---|---|
| Total usable RAM | ~7 GB | 4–8 GB |
| CPU class | Cortex-A53 | Cortex-A72 |
| Network | USB 2.0–backed Ethernet | Native Gigabit |
| Pod density | Low | High |
| Control-plane overhead | Higher | Lower |
| Power consumption | ~2.5–3× higher | Reduced |
| Physical footprint | Large | Compact |
| Cable complexity | High | Minimal |
Key takeaway:
For K3s-based edge workloads, a 2-node Raspberry Pi 4 cluster delivers comparable application-level performance to a much larger Raspberry Pi 3B+ cluster while significantly reducing cost, power draw, and infrastructure complexity.
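The "~2.5–3× higher" power figure can be sanity-checked with back-of-envelope arithmetic. The wattages below are assumed typical idle-to-moderate-load figures for each board, not measurements from this cluster:

```python
# Back-of-envelope power comparison of the two cluster phases.
# Per-node wattages are ASSUMPTIONS (typical published figures),
# not measurements: ~3.5 W per Pi 3B+, ~4.5 W per Pi 4.
PI3_WATTS = 3.5
PI4_WATTS = 4.5

def cluster_watts(nodes: int, watts_per_node: float) -> float:
    """Total draw for a homogeneous cluster."""
    return nodes * watts_per_node

pi3_draw = cluster_watts(7, PI3_WATTS)   # 7-node Pi 3B+ phase
pi4_draw = cluster_watts(2, PI4_WATTS)   # 2-node Pi 4 phase

print(f"Pi 3B+ cluster: {pi3_draw:.1f} W")   # 24.5 W
print(f"Pi 4 cluster:   {pi4_draw:.1f} W")   # 9.0 W
print(f"Ratio: {pi3_draw / pi4_draw:.2f}x")  # ~2.7x
```

Under these assumed wattages the ratio lands around 2.7×, consistent with the table's ~2.5–3× range.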
This is the brain of the cluster.
It runs:
- Kubernetes API server
- Scheduler (decides where pods run)
- Controller manager (keeps desired vs actual state)
- Embedded datastore (SQLite by default in K3s)
It does NOT run your apps by default.
It decides where apps should run.
This setup intentionally uses a single control-plane node.
Rationale:
- Edge / lab / in-house micro-cloud use case
- Resource-constrained hardware (Raspberry Pi 3B+)
- Preference for operational simplicity over full HA
Behavior:
- Existing workloads continue during control-plane downtime
- New scheduling and cluster mutations pause until recovery
This mirrors many real-world edge Kubernetes deployments, where
control-plane availability is a tradeoff rather than a requirement.
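The behavior split above can be sketched as a tiny decision table. This is an illustrative model, not K3s source: the operation names and the up/down rules are my own labels for the behaviors listed above.

```python
# Minimal sketch (NOT K3s code) of the single-control-plane tradeoff:
# which operations survive a control-plane outage.
RUNS_WITHOUT_CONTROL_PLANE = {
    "serve-existing-pod": True,       # kubelet keeps containers running
    "restart-local-container": True,  # restart policy is node-local
    "schedule-new-pod": False,        # needs the scheduler
    "apply-manifest": False,          # needs the API server
    "node-join": False,               # needs the join endpoint
}

def operation_proceeds(op: str, control_plane_up: bool) -> bool:
    """Everything proceeds when the control plane is up; otherwise
    only node-local operations continue."""
    if control_plane_up:
        return True
    return RUNS_WITHOUT_CONTROL_PLANE[op]

assert operation_proceeds("serve-existing-pod", control_plane_up=False)
assert not operation_proceeds("schedule-new-pod", control_plane_up=False)
```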
These are execution engines.
Each worker:
- Registers itself to the master
- Advertises CPU + RAM capacity
- Pulls container images
- Runs application pods
Workers do zero coordination themselves.
They strictly obey the control plane.
This project intentionally optimizes for:
- Determinism over abstraction
- Node resilience over control-plane HA
- Edge reliability over cloud-scale elasticity
Not implemented by design:
- Multi-control-plane quorum
- External etcd
- Service mesh
- Cloud-managed load balancers
These trade-offs reflect real constraints in edge and in-house micro-cloud environments.
This is not Kubernetes infrastructure.
It is:
- A Kubernetes manager
- A UI + policy layer
- A cluster lifecycle controller
Rancher:
- Talks to K3s API
- Does NOT replace Kubernetes
- Does NOT schedule workloads
Think of Rancher as:
“Kubernetes’s remote control + observability layer”
Sequence
- OS updated
- Linux kernel cgroups enabled (mandatory for containers)
- K3s server starts
- API server opens on port 6443
- Join token generated
Result
- Cluster exists
- Ready to accept workers
- Control plane is live
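A quick way to confirm the control plane is live is a plain TCP probe of port 6443. This is a generic reachability sketch, not part of the project's tooling; the hostname in the example is an assumption:

```python
import socket

# Post-bootstrap check sketch: is the K3s API server listening on 6443?
def api_reachable(host: str, port: int = 6443, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (assumes the control-plane node resolves as "dcn1"):
# print(api_reachable("dcn1"))
```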
Sequence
- Worker enables cgroups
- Fetches join token
- Connects to master API
- Registers itself
Handshake
Worker → Master:
"Here is my token, CPU, RAM, IP"
Master → Worker:
"You are node dcnX, accepted"
Result
- Node appears as Ready
- Scheduler can now use it
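The handshake above can be sketched as a toy registration function. This is NOT the real K3s join protocol (which runs over TLS with certificate bootstrapping); the token value and naming scheme here are hypothetical:

```python
# Toy sketch of the join handshake (NOT the real K3s protocol):
# the master accepts a worker only if the token matches, then
# replies with a node identity, as in "You are node dcnX, accepted".
CLUSTER_TOKEN = "K10-hypothetical-token"  # placeholder, not a real token

def register_worker(token: str, cpu: int, ram_mb: int, ip: str,
                    existing_nodes: list[str]) -> str:
    """Validate the join token and assign the next dcnX name."""
    if token != CLUSTER_TOKEN:
        raise PermissionError("join token rejected")
    name = f"dcn{len(existing_nodes) + 1}"
    existing_nodes.append(name)
    return name

nodes = ["dcn1"]  # control plane already registered
print(register_worker(CLUSTER_TOKEN, cpu=4, ram_mb=1024,
                      ip="192.168.1.12", existing_nodes=nodes))  # → dcn2
```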
Sequence
- Docker installed
- Rancher container launched
- Rancher UI exposed on HTTPS
Important
Rancher is completely separate from K3s binaries.
This is the most misunderstood part.
- Rancher generates an import manifest
- You apply it once on dcn1
- That manifest:
  - Creates a Rancher agent inside the cluster
  - Opens a reverse tunnel to Rancher
Rancher UI
▲
│ secure websocket
▼
Rancher Agent Pod (inside K3s)
│
▼
Kubernetes API (dcn1)
After this:
- Rancher never SSHs into nodes
- Rancher never touches OS
- Everything is Kubernetes-native
Let’s say you deploy NGINX from Rancher UI.
- You click “Deploy”
- Rancher sends YAML to Kubernetes API
- Scheduler evaluates:
  - Which node has resources
  - Node selectors, labels, taints
- Pod assigned to a worker (say dcn4)
- dcn4:
  - Pulls image
  - Starts container
- Status flows back:
  - Worker → Master → Rancher → UI
You never log into dcn4.
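Concretely, the "YAML" Rancher sends is just a Deployment object. Here is a minimal sketch of that manifest built as a plain dict; the name, image tag, and labels are illustrative, not taken from the actual cluster:

```python
# Minimal sketch of the Deployment manifest Rancher submits to the
# Kubernetes API when you click "Deploy". Names/labels are illustrative.
def nginx_deployment(name: str = "nginx", replicas: int = 1) -> dict:
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": "nginx:stable",
                        "ports": [{"containerPort": 80}],
                    }],
                },
            },
        },
    }
```

The scheduler only ever sees objects like this; which worker (dcn4 or otherwise) runs the pod is decided from capacity, labels, and taints, never by you.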
The decision to use static IPs was critical.
Without static IPs (plain DHCP):
- Workers disconnect after reboot
- Rancher agents lose trust
- Cluster breaks silently
With static IPs:
- Node identity is stable
- TLS certs remain valid
- Zero reconfiguration needed
The Python IP automation was the correct engineering call.
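The idea behind that automation can be sketched as a deterministic name-to-address map. The subnet and host offset below are assumptions for illustration, not the cluster's real addressing:

```python
import ipaddress

# Sketch of the static-IP idea: map node names to fixed addresses so
# node identity survives reboots. The 192.168.1.0/24 subnet and .10
# offset are ASSUMPTIONS, not the cluster's actual plan.
def ip_plan(nodes: list[str], subnet: str = "192.168.1.0/24",
            first_host: int = 10) -> dict[str, str]:
    net = ipaddress.ip_network(subnet)
    base = int(net.network_address)
    return {name: str(ipaddress.ip_address(base + first_host + i))
            for i, name in enumerate(nodes)}

plan = ip_plan([f"dcn{i}" for i in range(8)])
print(plan["dcn0"], plan["dcn7"])  # → 192.168.1.10 192.168.1.17
```

Because the mapping is pure function of the node list, rerunning it always yields the same addresses: determinism over abstraction.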
The Python monitoring script operates at Layer 0 (below Kubernetes).
It checks:
- ICMP
- SSH
- Node reachability
- Kubernetes node state
Kubernetes monitoring operates at Layer 1+:
- Pod health
- Container restarts
- Resource pressure
They complement, not replace each other.
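One way the Layer-0 probes can collapse into a single node verdict is sketched below. The check names mirror the list above; the status labels and precedence order are my own, not taken from the actual script:

```python
# Sketch of combining Layer-0 probe results into one node status.
# Labels ("offline", "degraded", ...) are illustrative assumptions.
def classify_node(icmp_ok: bool, ssh_ok: bool, k8s_ready: bool) -> str:
    """Collapse ICMP / SSH / Kubernetes-state probes into one verdict,
    checking from the lowest layer up."""
    if not icmp_ok:
        return "offline"      # unreachable at the network level
    if not ssh_ok:
        return "degraded"     # alive but unmanageable over SSH
    if not k8s_ready:
        return "not-ready"    # OS fine, kubelet not reporting Ready
    return "healthy"

assert classify_node(True, True, True) == "healthy"
assert classify_node(False, True, True) == "offline"
```

Checking from the lowest layer up is the point: a failed ICMP probe explains every failure above it, so Kubernetes-level state is only consulted once the host itself is proven reachable.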
To address frequent OOM conditions on low-memory nodes, the cluster uses
a custom node-side component ("Edge-Pulse") that provides:
- Hybrid memory management (zram + disk-backed swap)
- SD-card wear protection via USB-backed IO
- Systemd-managed lifecycle
- Runtime validation and rollback
- Node-local observability API
This operates below Kubernetes, complementing pod-level memory limits
with OS-level stability guarantees.
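The hybrid memory layout can be sketched as a sizing function. This is illustrative logic in the spirit of Edge-Pulse, not its actual code; the 50% zram fraction and the priority values are assumptions:

```python
# Illustrative sizing in the spirit of Edge-Pulse (NOT its real code):
# size zram at ~50% of RAM and give it higher swap priority than the
# USB-backed disk swap, so compressed RAM absorbs pressure first and
# the SD card is spared write wear.
def swap_plan(total_ram_mb: int, zram_fraction: float = 0.5,
              disk_swap_mb: int = 1024) -> dict:
    return {
        "zram_mb": int(total_ram_mb * zram_fraction),
        "zram_priority": 100,        # preferred: fast, RAM-backed
        "disk_swap_mb": disk_swap_mb,
        "disk_priority": 10,         # fallback: USB-backed device
    }

print(swap_plan(1024))  # a Pi 3B+ node with 1 GB RAM
```

The priority gap is what makes it "hybrid": the kernel fills the high-priority zram device first and only spills to the slower disk-backed swap under sustained pressure.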
Worker node failure:
- Scheduler marks node NotReady
- Pods rescheduled to other workers
- When node returns, it rejoins automatically
Rancher failure:
- Cluster continues running
- No workloads affected
- UI unavailable only
Control-plane failure:
- No new scheduling
- Existing pods continue running
- Recovery needed for changes
- Lightweight (K3s)
- Deterministic (static IPs)
- Centralized control (Rancher)
- No SSH dependency
- Survives node loss
- Scales horizontally
This is exactly how production edge clusters are built, just on smaller hardware, deliberately designed and tested end-to-end.