Skip to content

Latest commit

 

History

History
432 lines (320 loc) · 11.1 KB

File metadata and controls

432 lines (320 loc) · 11.1 KB

Node Doctor - Real Cluster Deployment Guide

This guide walks through deploying Node Doctor to a real Kubernetes cluster.

Prerequisites

  • Kubernetes cluster (1.19+) with kubectl access
  • Docker installed for building images (for manual deployment)
  • Access to Docker Hub registry (for manual deployment)
  • Cluster admin privileges (for RBAC and DaemonSet deployment)

Deployment Options

Method Recommended For Description
Helm Chart Production Easiest installation with configurable values
Automated RC Development/Testing One-command build and deploy for RC releases
Manual Custom/Advanced Full control over each deployment step

Option 1: Helm Chart Deployment (Recommended)

The Helm chart is the recommended method for deploying Node Doctor to production clusters.

Add the Helm Repository

helm repo add supporttools https://charts.support.tools
helm repo update

Basic Installation

helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace

Installation with Custom Values

# Create custom values file
cat > custom-values.yaml << 'EOF'
settings:
  logLevel: info
  logFormat: json
  updateInterval: 30s
  enableRemediation: true
  dryRunMode: false

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

serviceMonitor:
  enabled: true
  interval: 30s
EOF

# Install with custom values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  -f custom-values.yaml

Verify Helm Deployment

# Check DaemonSet status
kubectl get daemonset -n node-doctor node-doctor

# View pods
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor

# Check logs
kubectl logs -n node-doctor -l app.kubernetes.io/name=node-doctor --tail=50

# View node conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,HEALTHY:.status.conditions[?(@.type=="NodeDoctorHealthy")].status'

Helm Upgrade and Rollback

# Upgrade to latest version
helm repo update
helm upgrade node-doctor supporttools/node-doctor -n node-doctor

# Upgrade with new values
helm upgrade node-doctor supporttools/node-doctor \
  -n node-doctor \
  --set settings.logLevel=debug

# Rollback to previous release
helm rollback node-doctor -n node-doctor

Uninstall via Helm

helm uninstall node-doctor -n node-doctor
kubectl delete namespace node-doctor

For complete Helm chart configuration options, see ../helm/node-doctor/README.md.


Option 2: Automated RC Deployment (Development/Testing)

For rapid deployment of release candidates to the a1-ops-prd cluster:

# One command to build, push, and deploy
make bump-rc

This single command will:

  1. ✅ Validate the pipeline (run all tests)
  2. ✅ Increment the RC version (e.g., v0.1.0-rc.1 → v0.1.0-rc.2)
  3. ✅ Build Docker image with RC tag
  4. ✅ Push to Docker Hub registry: supporttools/node-doctor
  5. ✅ Deploy to a1-ops-prd cluster in node-doctor namespace
  6. ✅ Commit and tag the release

Configuration:

  • Cluster: a1-ops-prd (from ~/.kube/config)
  • Namespace: node-doctor (auto-created if needed)
  • Registry: docker.io/supporttools/node-doctor
  • Version tracking: .version-rc file

Verify Deployment:

# Check pod status
kubectl --context=a1-ops-prd -n node-doctor get pods -l app=node-doctor

# View logs
kubectl --context=a1-ops-prd -n node-doctor logs -l app=node-doctor

# Monitor health
kubectl --context=a1-ops-prd -n node-doctor get pods -l app=node-doctor -w

Option 3: Manual Deployment (Advanced)

If you prefer manual control over each step, follow the detailed instructions below.

Step 1: Build and Push Container Image

# Build the Docker image with version tag
docker build -t supporttools/node-doctor:v0.1.0 \
  --build-arg VERSION=v0.1.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  --build-arg BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  .

# Tag as latest
docker tag supporttools/node-doctor:v0.1.0 \
  supporttools/node-doctor:latest

# Login to Docker Hub (if not already logged in)
docker login

# Push both tags
docker push supporttools/node-doctor:v0.1.0
docker push supporttools/node-doctor:latest

Image Details:

  • Size: ~85 MB (optimized multi-stage Alpine build)
  • Base: Alpine Linux 3.19
  • Go: 1.24.4
  • Binary location: /usr/local/bin/node-doctor

Step 2: Deploy to Kubernetes Cluster

2.1 Deploy RBAC Resources (MUST BE FIRST)

# Apply RBAC: ServiceAccount, ClusterRole, ClusterRoleBinding
kubectl apply -f deployment/rbac.yaml

# Verify RBAC resources created
kubectl get serviceaccount -n kube-system node-doctor
kubectl get clusterrole node-doctor
kubectl get clusterrolebinding node-doctor

2.2 Deploy Node Doctor DaemonSet

# Apply DaemonSet (includes ConfigMap and Service)
kubectl apply -f deployment/daemonset.yaml

# Watch rollout
kubectl rollout status daemonset/node-doctor -n kube-system

# Verify pods running on all nodes
kubectl get pods -n kube-system -l app=node-doctor -o wide

2.3 Quick Deployment (Both Files)

# Apply both in correct order
kubectl apply -f deployment/rbac.yaml -f deployment/daemonset.yaml

Step 3: Verify Deployment

3.1 Check Pod Status

# View all Node Doctor pods
kubectl get pods -n kube-system -l app=node-doctor

# Check pod logs
POD_NAME=$(kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $POD_NAME --tail=50 -f

3.2 Test Health Endpoints

# Test health endpoint
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/healthz

# Test readiness
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/ready

# Test detailed status
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/status | jq .

3.3 Check Metrics

# View Prometheus metrics
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics | head -20

# Check for Node Doctor specific metrics
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics | grep node_doctor

3.4 Verify Node Conditions

# Check if NodeDoctorHealthy condition is set
kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type == "NodeDoctorHealthy")'

# View full node conditions
kubectl describe nodes | grep -A 10 "Conditions:"

3.5 Check Kubernetes Events

# View Node Doctor events
kubectl get events -n kube-system --field-selector involvedObject.name=node-doctor --sort-by='.lastTimestamp'

Step 4: Run Validation Script

# Run comprehensive validation
cd /home/mmattox/go/src/github.com/supporttools/node-doctor
bash deployment/validate.sh

Expected Output:

  • ✅ kubectl available
  • ✅ Cluster connection OK
  • ✅ RBAC manifest exists
  • ✅ DaemonSet manifest exists
  • ✅ DaemonSet deployed
  • ✅ All pods running
  • ✅ ServiceAccount exists
  • ✅ ClusterRole exists
  • ✅ ClusterRoleBinding exists
  • ✅ ConfigMap exists
  • ✅ Service exists
  • ✅ Health endpoints responding
  • ✅ Metrics endpoint working

What's Monitored

CPU Monitor (system-cpu)

  • Load average thresholds: 80% warning, 95% critical
  • Thermal throttling detection
  • CPU usage monitoring

Memory Monitor (system-memory)

  • Memory usage: 85% warning, 95% critical
  • Swap usage: 50% warning, 80% critical
  • OOM kill detection via /dev/kmsg

Disk Monitor (system-disk)

  • Monitored paths: /, /var/lib/kubelet, /var/lib/docker, /var/lib/containerd
  • Space thresholds: 85% warning, 95% critical
  • Inode thresholds: 85% warning, 95% critical
  • Readonly filesystem detection
  • I/O health monitoring

Monitoring Configuration

Configuration is embedded in the DaemonSet ConfigMap (lines 498-569 of daemonset.yaml).

To update configuration:

# Edit ConfigMap
kubectl edit configmap -n kube-system node-doctor-config

# Or apply updated manifest
kubectl apply -f deployment/daemonset.yaml

# Restart pods to pick up changes
kubectl rollout restart daemonset/node-doctor -n kube-system

Troubleshooting

Pods Not Starting

# Check pod events
kubectl describe pod -n kube-system $POD_NAME

# Common issues:
# - ImagePullBackOff: Check Harbor registry access
# - CrashLoopBackOff: Check logs for errors
# - Pending: Check node selectors and tolerations

Permission Errors

# Verify RBAC is applied
kubectl get serviceaccount -n kube-system node-doctor
kubectl get clusterrole node-doctor
kubectl get clusterrolebinding node-doctor

# Check pod service account
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].spec.serviceAccountName}'

Monitors Not Working

# Check pod has host access
kubectl exec -n kube-system $POD_NAME -- ls -la /host
kubectl exec -n kube-system $POD_NAME -- ls -la /dev/kmsg
kubectl exec -n kube-system $POD_NAME -- cat /proc/meminfo | head -5

# Verify security context
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].spec.containers[0].securityContext}'

No Metrics Appearing

# Check Prometheus endpoint is accessible
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics

# Verify Service is created
kubectl get service -n kube-system node-doctor-metrics

# Check if Prometheus can scrape
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].metadata.annotations}'

Uninstalling

# Remove DaemonSet (includes ConfigMap and Service)
kubectl delete -f deployment/daemonset.yaml

# Remove RBAC resources
kubectl delete -f deployment/rbac.yaml

# Verify cleanup
kubectl get all -n kube-system -l app=node-doctor

Production Considerations

  1. Resource Limits: Adjust CPU/memory limits based on cluster size
  2. Image Pull Secrets: Add imagePullSecrets if Harbor requires authentication
  3. Node Selectors: Add nodeSelector to limit deployment to specific nodes
  4. Monitoring Integration: Configure Prometheus ServiceMonitor or PodMonitor
  5. Alerting: Set up alerts for NodeDoctorHealthy condition changes
  6. Backup: Include ConfigMap in cluster backup procedures
  7. Updates: Use rolling updates with maxUnavailable: 1 for safety

Support

Summary of Components

Completed Tasks:

  • ✅ Task #3129: DaemonSet manifest
  • ✅ Task #3130: RBAC manifests
  • ✅ Task #3131: ConfigMap manifest (embedded in daemonset.yaml)
  • ✅ Task #3132: Service manifest (embedded in daemonset.yaml)
  • ✅ Dockerfile for container image

What's Working:

  • CPU Monitor: 93.1% test coverage
  • Memory Monitor: 95.7% test coverage
  • Disk Monitor: 93.0% test coverage
  • All tests passing
  • Binary builds successfully
  • Docker image builds (84.8 MB)
  • Ready for production deployment

Image Registry:

  • Docker Hub: supporttools/node-doctor:v0.1.0
  • Image: node-doctor:v0.1.0 (84.8 MB Alpine-based)

You're ready to deploy to a real cluster! 🚀