Node Doctor - Real Cluster Deployment Guide

This guide walks through deploying Node Doctor to a real Kubernetes cluster.

Prerequisites

Kubernetes cluster (1.19+) with kubectl access
Docker installed for building images (for manual deployment)
Access to Docker Hub registry (for manual deployment)
Cluster admin privileges (for RBAC and DaemonSet deployment)

Deployment Options

Method	Recommended For	Description
Helm Chart	Production	Easiest installation with configurable values
Automated RC	Development/Testing	One-command build and deploy for RC releases
Manual	Custom/Advanced	Full control over each deployment step

Option 1: Helm Chart Deployment (Recommended)

The Helm chart is the recommended method for deploying Node Doctor to production clusters.

Add the Helm Repository

helm repo add supporttools https://charts.support.tools
helm repo update

Basic Installation

helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace

Installation with Custom Values

# Create custom values file
cat > custom-values.yaml << 'EOF'
settings:
  logLevel: info
  logFormat: json
  updateInterval: 30s
  enableRemediation: true
  dryRunMode: false

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

serviceMonitor:
  enabled: true
  interval: 30s
EOF

# Install with custom values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  -f custom-values.yaml

Verify Helm Deployment

# Check DaemonSet status
kubectl get daemonset -n node-doctor node-doctor

# View pods
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor

# Check logs
kubectl logs -n node-doctor -l app.kubernetes.io/name=node-doctor --tail=50

# View node conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,HEALTHY:.status.conditions[?(@.type=="NodeDoctorHealthy")].status'

Helm Upgrade and Rollback

# Upgrade to latest version
helm repo update
helm upgrade node-doctor supporttools/node-doctor -n node-doctor

# Upgrade with new values
helm upgrade node-doctor supporttools/node-doctor \
  -n node-doctor \
  --set settings.logLevel=debug

# Rollback to previous release
helm rollback node-doctor -n node-doctor

Uninstall via Helm

helm uninstall node-doctor -n node-doctor
kubectl delete namespace node-doctor

For complete Helm chart configuration options, see ../helm/node-doctor/README.md.

Option 2: Automated RC Deployment (Development/Testing)

For rapid deployment of release candidates to the a1-ops-prd cluster:

# One command to build, push, and deploy
make bump-rc

This single command will:

✅ Validate the pipeline (run all tests)
✅ Increment the RC version (e.g., v0.1.0-rc.1 → v0.1.0-rc.2)
✅ Build Docker image with RC tag
✅ Push to Docker Hub registry: supporttools/node-doctor
✅ Deploy to a1-ops-prd cluster in node-doctor namespace
✅ Commit and tag the release

Configuration:

Cluster: a1-ops-prd (from ~/.kube/config)
Namespace: node-doctor (auto-created if needed)
Registry: docker.io/supporttools/node-doctor
Version tracking: .version-rc file

Verify Deployment:

# Check pod status
kubectl --context=a1-ops-prd -n node-doctor get pods -l app=node-doctor

# View logs
kubectl --context=a1-ops-prd -n node-doctor logs -l app=node-doctor

# Monitor health
kubectl --context=a1-ops-prd -n node-doctor get pods -l app=node-doctor -w

Option 3: Manual Deployment (Advanced)

If you prefer manual control over each step, follow the detailed instructions below.

Step 1: Build and Push Container Image

# Build the Docker image with version tag
docker build -t supporttools/node-doctor:v0.1.0 \
  --build-arg VERSION=v0.1.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  --build-arg BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  .

# Tag as latest
docker tag supporttools/node-doctor:v0.1.0 \
  supporttools/node-doctor:latest

# Login to Docker Hub (if not already logged in)
docker login

# Push both tags
docker push supporttools/node-doctor:v0.1.0
docker push supporttools/node-doctor:latest

Image Details:

Size: ~85 MB (optimized multi-stage Alpine build)
Base: Alpine Linux 3.19
Go: 1.24.4
Binary location: /usr/local/bin/node-doctor

Step 2: Deploy to Kubernetes Cluster

2.1 Deploy RBAC Resources (MUST BE FIRST)

# Apply RBAC: ServiceAccount, ClusterRole, ClusterRoleBinding
kubectl apply -f deployment/rbac.yaml

# Verify RBAC resources created
kubectl get serviceaccount -n kube-system node-doctor
kubectl get clusterrole node-doctor
kubectl get clusterrolebinding node-doctor

2.2 Deploy Node Doctor DaemonSet

# Apply DaemonSet (includes ConfigMap and Service)
kubectl apply -f deployment/daemonset.yaml

# Watch rollout
kubectl rollout status daemonset/node-doctor -n kube-system

# Verify pods running on all nodes
kubectl get pods -n kube-system -l app=node-doctor -o wide

2.3 Quick Deployment (Both Files)

# Apply both in correct order
kubectl apply -f deployment/rbac.yaml -f deployment/daemonset.yaml

Step 3: Verify Deployment

3.1 Check Pod Status

# View all Node Doctor pods
kubectl get pods -n kube-system -l app=node-doctor

# Check pod logs
POD_NAME=$(kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $POD_NAME --tail=50 -f

3.2 Test Health Endpoints

# Test health endpoint
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/healthz

# Test readiness
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/ready

# Test detailed status
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/status | jq .

3.3 Check Metrics

# View Prometheus metrics
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics | head -20

# Check for Node Doctor specific metrics
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics | grep node_doctor

3.4 Verify Node Conditions

# Check if NodeDoctorHealthy condition is set
kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type == "NodeDoctorHealthy")'

# View full node conditions
kubectl describe nodes | grep -A 10 "Conditions:"

3.5 Check Kubernetes Events

# View Node Doctor events
kubectl get events -n kube-system --field-selector involvedObject.name=node-doctor --sort-by='.lastTimestamp'

Step 4: Run Validation Script

# Run comprehensive validation
cd /home/mmattox/go/src/github.com/supporttools/node-doctor
bash deployment/validate.sh

Expected Output:

✅ kubectl available
✅ Cluster connection OK
✅ RBAC manifest exists
✅ DaemonSet manifest exists
✅ DaemonSet deployed
✅ All pods running
✅ ServiceAccount exists
✅ ClusterRole exists
✅ ClusterRoleBinding exists
✅ ConfigMap exists
✅ Service exists
✅ Health endpoints responding
✅ Metrics endpoint working

What's Monitored

CPU Monitor (system-cpu)

Load average thresholds: 80% warning, 95% critical
Thermal throttling detection
CPU usage monitoring

Memory Monitor (system-memory)

Memory usage: 85% warning, 95% critical
Swap usage: 50% warning, 80% critical
OOM kill detection via /dev/kmsg

Disk Monitor (system-disk)

Monitored paths: /, /var/lib/kubelet, /var/lib/docker, /var/lib/containerd
Space thresholds: 85% warning, 95% critical
Inode thresholds: 85% warning, 95% critical
Readonly filesystem detection
I/O health monitoring

Monitoring Configuration

Configuration is embedded in the DaemonSet ConfigMap (lines 498-569 of daemonset.yaml).

To update configuration:

# Edit ConfigMap
kubectl edit configmap -n kube-system node-doctor-config

# Or apply updated manifest
kubectl apply -f deployment/daemonset.yaml

# Restart pods to pick up changes
kubectl rollout restart daemonset/node-doctor -n kube-system

Troubleshooting

Pods Not Starting

# Check pod events
kubectl describe pod -n kube-system $POD_NAME

# Common issues:
# - ImagePullBackOff: Check Harbor registry access
# - CrashLoopBackOff: Check logs for errors
# - Pending: Check node selectors and tolerations

Permission Errors

# Verify RBAC is applied
kubectl get serviceaccount -n kube-system node-doctor
kubectl get clusterrole node-doctor
kubectl get clusterrolebinding node-doctor

# Check pod service account
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].spec.serviceAccountName}'

Monitors Not Working

# Check pod has host access
kubectl exec -n kube-system $POD_NAME -- ls -la /host
kubectl exec -n kube-system $POD_NAME -- ls -la /dev/kmsg
kubectl exec -n kube-system $POD_NAME -- cat /proc/meminfo | head -5

# Verify security context
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].spec.containers[0].securityContext}'

No Metrics Appearing

# Check Prometheus endpoint is accessible
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics

# Verify Service is created
kubectl get service -n kube-system node-doctor-metrics

# Check if Prometheus can scrape
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].metadata.annotations}'

Uninstalling

# Remove DaemonSet (includes ConfigMap and Service)
kubectl delete -f deployment/daemonset.yaml

# Remove RBAC resources
kubectl delete -f deployment/rbac.yaml

# Verify cleanup
kubectl get all -n kube-system -l app=node-doctor

Production Considerations

Resource Limits: Adjust CPU/memory limits based on cluster size
Image Pull Secrets: Add imagePullSecrets if Harbor requires authentication
Node Selectors: Add nodeSelector to limit deployment to specific nodes
Monitoring Integration: Configure Prometheus ServiceMonitor or PodMonitor
Alerting: Set up alerts for NodeDoctorHealthy condition changes
Backup: Include ConfigMap in cluster backup procedures
Updates: Use rolling updates with maxUnavailable: 1 for safety

Support

GitHub: https://github.com/supporttools/node-doctor
Documentation: See /docs directory
Issues: Report at GitHub issues

Summary of Components

Completed Tasks:

✅ Task #3129: DaemonSet manifest
✅ Task #3130: RBAC manifests
✅ Task #3131: ConfigMap manifest (embedded in daemonset.yaml)
✅ Task #3132: Service manifest (embedded in daemonset.yaml)
✅ Dockerfile for container image

What's Working:

CPU Monitor: 93.1% test coverage
Memory Monitor: 95.7% test coverage
Disk Monitor: 93.0% test coverage
All tests passing
Binary builds successfully
Docker image builds (84.8 MB)
Ready for production deployment

Image Registry:

Docker Hub: supporttools/node-doctor:v0.1.0
Image: node-doctor:v0.1.0 (84.8 MB Alpine-based)

You're ready to deploy to a real cluster! 🚀

FilesExpand file tree

deployment.md

Latest commit

History