
Script-driven K3s cluster baseline for edge and in-house micro-clouds, emphasizing OS hardening, deterministic networking, and operational clarity over cloud-scale abstraction.


End-to-End K3s Cluster Architecture with Rancher Management


1 Big Picture (Mental Model)

Think of your setup as three layers:

┌────────────────────────────┐
│        Laptop (UI)         │
│  Browser → Rancher UI      │
└─────────────▲──────────────┘
              │ HTTPS
┌─────────────┴──────────────┐
│      Rancher Node (dcn0)   │
│  Management + Monitoring   │
└─────────────▲──────────────┘
              │ Kubernetes API
┌─────────────┴──────────────┐
│   K3s Cluster (7 Pis)      │
│  Control Plane + Workers   │
└────────────────────────────┘

Nothing is random. Everything is deterministic and role-based.

The repository is designed for fast, deterministic cluster bring-up, not as a fully managed, highly available cloud replacement.


2 Cluster Evolution & Performance Context

Unified Poster

This poster documents the iterative evolution of the EdgeStack-K3s cluster – from an early multi-node Raspberry Pi 3B+ setup to a compact, performance-optimized Raspberry Pi 4 deployment.

The architecture, workloads, and orchestration layer (K3s + Rancher) remained consistent across phases, enabling practical performance comparison rather than synthetic benchmarking.

Observed Outcomes (Same Workloads, Different Hardware)

| Dimension | Pi 3B+ Cluster (7–8 nodes) | Pi 4 Cluster (2 nodes) |
|---|---|---|
| Total usable RAM | ~7 GB | 4–8 GB |
| CPU class | Cortex-A53 | Cortex-A72 |
| Network | USB 2.0–backed Ethernet | Native Gigabit Ethernet |
| Pod density | Low | High |
| Control-plane overhead | Higher | Lower |
| Power consumption | ~2.5–3× higher | Reduced |
| Physical footprint | Large | Compact |
| Cable complexity | High | Minimal |

Key takeaway:
For K3s-based edge workloads, a 2-node Raspberry Pi 4 cluster delivers comparable application-level performance to a much larger Raspberry Pi 3B+ cluster while significantly reducing cost, power draw, and infrastructure complexity.


3 What Each Node Actually Does

Control Plane (dcn1)

This is the brain of the cluster.

It runs:

  • Kubernetes API server
  • Scheduler (decides where pods run)
  • Controller manager (reconciles desired vs. actual state)
  • Embedded datastore (SQLite by default in K3s)

It does NOT run your apps by default.
It decides where apps should run.
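
A minimal sketch of how to check this from dcn1 (standard k3s/kubectl commands; the optional taint at the end is an illustrative assumption, not necessarily what the repo's scripts configure):

```bash
# View the control-plane node and its roles (k3s bundles kubectl)
sudo k3s kubectl get nodes -o wide

# K3s defaults to an embedded SQLite datastore under the server data directory
ls -lh /var/lib/rancher/k3s/server/db/

# Optional (assumption): taint dcn1 so application pods stay on the workers
sudo k3s kubectl taint node dcn1 node-role.kubernetes.io/control-plane=true:NoSchedule
```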

Control Plane Design Choice

This setup intentionally uses a single control-plane node.

Rationale:

  • Edge / lab / in-house micro-cloud use case
  • Resource-constrained hardware (Raspberry Pi 3B+)
  • Preference for operational simplicity over full HA

Behavior:

  • Existing workloads continue during control-plane downtime
  • New scheduling and cluster mutations pause until recovery

This mirrors many real-world edge Kubernetes deployments, where
control-plane availability is a tradeoff rather than a requirement.


Worker Nodes (dcn2–dcn7)

These are execution engines.

Each worker:

  • Registers itself to the master
  • Advertises CPU + RAM capacity
  • Pulls container images
  • Runs application pods

Workers do zero coordination themselves.
They strictly obey the control plane.
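
What a worker has advertised can be read straight from the API, as in this illustrative check (assuming a worker named dcn4):

```bash
# CPU and memory capacity the worker registered with the control plane
sudo k3s kubectl get node dcn4 -o jsonpath='{.status.capacity}{"\n"}'

# Full picture: addresses, labels, allocatable resources, running pods
sudo k3s kubectl describe node dcn4
```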


Design Trade-offs

This project intentionally optimizes for:

  • Determinism over abstraction
  • Node resilience over control-plane HA
  • Edge reliability over cloud-scale elasticity

Not implemented by design:

  • Multi-control-plane quorum
  • External etcd
  • Service mesh
  • Cloud-managed load balancers

These trade-offs reflect real constraints in edge and in-house micro-cloud environments.


Rancher Node (dcn0)

This is not Kubernetes infrastructure.

It is:

  • A Kubernetes manager
  • A UI + policy layer
  • A cluster lifecycle controller

Rancher:

  • Talks to K3s API
  • Does NOT replace Kubernetes
  • Does NOT schedule workloads

Think of Rancher as:

“Kubernetes’s remote control + observability layer”


4 What Happens When We Run the Scripts

Script 1 - Master Bootstrap

Sequence

  1. OS updated
  2. Linux kernel cgroups enabled
    (mandatory for containers)
  3. K3s server starts
  4. API server opens on port 6443
  5. Join token generated

Result

  • Cluster exists
  • Ready to accept workers
  • Control plane is live
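
The sequence above maps roughly to commands like these on Raspberry Pi OS (a sketch only; the actual bootstrap script in this repo handles more detail):

```bash
# 1-2. Update the OS and enable memory cgroups (required for containers), then reboot
sudo apt-get update && sudo apt-get -y upgrade
# (on newer Raspberry Pi OS images the file is /boot/firmware/cmdline.txt)
sudo sed -i '$ s/$/ cgroup_enable=memory cgroup_memory=1/' /boot/cmdline.txt
sudo reboot

# 3-4. Install K3s in server mode; the API server listens on port 6443
curl -sfL https://get.k3s.io | sh -

# 5. The join token the workers will present
sudo cat /var/lib/rancher/k3s/server/node-token
```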

Script 2 - Worker Bootstrap

Sequence

  1. Worker enables cgroups
  2. Fetches join token
  3. Connects to master API
  4. Registers itself

Handshake

Worker → Master:
"Here is my token, CPU, RAM, IP"

Master → Worker:
"You are node dcnX, accepted"

Result

  • Node appears as Ready
  • Scheduler can now use it
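
The same handshake from the command line looks roughly like this (<master-ip> and <join-token> are placeholders for the values produced on dcn1):

```bash
# Enable cgroups exactly as on the master, then join the cluster in agent mode
curl -sfL https://get.k3s.io | \
  K3S_URL="https://<master-ip>:6443" \
  K3S_TOKEN="<join-token>" sh -

# Back on dcn1: the new worker should show up as Ready
sudo k3s kubectl get nodes
```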

Script 3 - Rancher Bootstrap

Sequence

  1. Docker installed
  2. Rancher container launched
  3. Rancher UI exposed on HTTPS

Important
Rancher is completely separate from K3s binaries.
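
For orientation, a single-container Rancher install generally looks like this (flags and version tag are illustrative; the repo script and the Rancher docs are authoritative):

```bash
# Install Docker on dcn0, then run Rancher as one container with persistent data
curl -fsSL https://get.docker.com | sh
sudo docker run -d --name rancher --restart=unless-stopped \
  --privileged \
  -p 80:80 -p 443:443 \
  -v /opt/rancher:/var/lib/rancher \
  rancher/rancher:latest
```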


5 How Rancher Connects to the Cluster

This is the most misunderstood part.

Import Flow

  1. Rancher generates an import manifest

  2. You apply it once on dcn1

  3. That manifest:

    • Creates a Rancher agent inside the cluster
    • Opens a reverse tunnel to Rancher

Rancher UI
   ▲
   │ secure websocket
   ▼
Rancher Agent Pod (inside K3s)
   │
   ▼
Kubernetes API (dcn1)

After this:

  • Rancher never SSHs into nodes
  • Rancher never touches OS
  • Everything is Kubernetes-native
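
In practice the one-time import on dcn1 is a single apply of the manifest Rancher generates (the URL shape shown here is illustrative):

```bash
# Apply the Rancher-generated import manifest once on the control-plane node
sudo k3s kubectl apply -f https://<rancher-host>/v3/import/<generated-id>.yaml

# The agent lives in the cattle-system namespace and dials out to Rancher
sudo k3s kubectl get pods -n cattle-system
```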

6 How a Deployment Actually Runs (Example)

Let’s say you deploy NGINX from Rancher UI.

Step-by-step:

  1. You click “Deploy”

  2. Rancher sends YAML to Kubernetes API

  3. Scheduler evaluates:

    • Which node has resources
    • Node selectors, labels, taints
  4. Pod assigned to a worker (say dcn4)

  5. dcn4:

    • Pulls image
    • Starts container
  6. Status flows back:

    • Worker → Master → Rancher → UI

You never log into dcn4.
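
The same flow can be driven from the API side instead of the UI (illustrative commands, equivalent to the button click):

```bash
# What the UI does under the hood: submit a Deployment to the API server
sudo k3s kubectl create deployment nginx --image=nginx --replicas=2

# Watch the scheduler's placement decisions and the rollout
sudo k3s kubectl get pods -o wide
sudo k3s kubectl rollout status deployment/nginx
```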


7 Networking Flow (Why Static IPs Matter)

The decision to use static IPs was critical.

Without static IPs:

  • Workers disconnect after reboot
  • Rancher agents lose trust
  • Cluster breaks silently

With static IPs:

  • Node identity is stable
  • TLS certs remain valid
  • Zero reconfiguration needed

The Python IP-automation script was the correct engineering call.
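
One common way to pin addresses on Raspberry Pi OS is via dhcpcd, as in this sketch (the repo automates this with a Python helper; the interface and addresses below are placeholders):

```bash
# Pin a static address for eth0 so node identity survives reboots
sudo tee -a /etc/dhcpcd.conf <<'EOF'
interface eth0
static ip_address=192.168.1.11/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1
EOF
sudo systemctl restart dhcpcd
```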


8 Monitoring & Health (Where Your Python Script Fits)

The repo's Python monitoring script operates at Layer 0 (below Kubernetes).

It checks:

  • ICMP
  • SSH
  • Node reachability
  • Kubernetes node state

Kubernetes monitoring operates at Layer 1+:

  • Pod health
  • Container restarts
  • Resource pressure

They complement, rather than replace, each other.
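
A shell sketch of the same split (the actual monitor in this repo is a Python script; the dcnX names are assumed to resolve, e.g. via /etc/hosts):

```bash
# Layer 0: is the hardware/OS reachable at all?
for host in dcn0 dcn1 dcn2 dcn3 dcn4 dcn5 dcn6 dcn7; do
  ping -c 1 -W 2 "$host" >/dev/null 2>&1 && echo "$host: reachable" || echo "$host: DOWN"
done

# Layer 1+: what Kubernetes itself reports about node and pod health
sudo k3s kubectl get nodes
sudo k3s kubectl get pods -A --field-selector=status.phase!=Running
```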

8.1 Node-Level Memory Resilience (Edge-Pulse)

To address frequent OOM conditions on low-memory nodes, the cluster uses
a custom node-side component ("Edge-Pulse") that provides:

  • Hybrid memory management (zram + disk-backed swap)
  • SD-card wear protection via USB-backed I/O
  • Systemd-managed lifecycle
  • Runtime validation and rollback
  • Node-local observability API

This operates below Kubernetes, complementing pod-level memory limits
with OS-level stability guarantees.
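
The resulting memory layout can be inspected directly on a node with standard util-linux tools (no Edge-Pulse internals assumed):

```bash
# zram devices, sizes and compression algorithm
zramctl

# Active swap devices (zram vs. disk/USB-backed) and their priorities
swapon --show

# Overall memory and swap pressure on the node
free -h
```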


9 Failure Scenarios (And What Happens)

Worker Power Loss

  • Control plane marks the node NotReady
  • Pods are rescheduled to other workers (after the eviction timeout)
  • When the node returns, it rejoins automatically

Rancher Down

  • Cluster continues running
  • No workloads affected
  • UI unavailable only

Control Plane Down

  • No new scheduling
  • Existing pods continue running
  • Recovery needed for changes
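
The worker-loss scenario is easy to observe live (illustrative commands, run from dcn1 or any machine with cluster access):

```bash
# Watch a worker's status flip to NotReady when it loses power
sudo k3s kubectl get nodes -w

# Watch pods being rescheduled onto the remaining workers
sudo k3s kubectl get pods -A -o wide -w
```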

10 Why This Design Is Correct for Edge / IoT

  • Lightweight (K3s)
  • Deterministic (static IPs)
  • Centralized control (Rancher)
  • No SSH dependency
  • Survives node loss
  • Scales horizontally

This mirrors how production edge clusters are built, just on smaller hardware, deliberately designed and tested end-to-end.

