This guide walks through deploying Node Doctor to a real Kubernetes cluster.

Prerequisites:
- Kubernetes cluster (1.19+) with kubectl access
- Docker installed for building images (for manual deployment)
- Access to Docker Hub registry (for manual deployment)
- Cluster admin privileges (for RBAC and DaemonSet deployment)
| Method | Recommended For | Description |
|---|---|---|
| Helm Chart | Production | Easiest installation with configurable values |
| Automated RC | Development/Testing | One-command build and deploy for RC releases |
| Manual | Custom/Advanced | Full control over each deployment step |
The Helm chart is the recommended method for deploying Node Doctor to production clusters.
```shell
# Add the Helm repository
helm repo add supporttools https://charts.support.tools
helm repo update

# Basic installation
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace
```

```shell
# Create custom values file
cat > custom-values.yaml << 'EOF'
settings:
  logLevel: info
  logFormat: json
  updateInterval: 30s
  enableRemediation: true
  dryRunMode: false
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi
serviceMonitor:
  enabled: true
  interval: 30s
EOF
```
```shell
# Install with custom values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  -f custom-values.yaml
```

```shell
# Check DaemonSet status
kubectl get daemonset -n node-doctor node-doctor

# View pods
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor

# Check logs
kubectl logs -n node-doctor -l app.kubernetes.io/name=node-doctor --tail=50

# View node conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,HEALTHY:.status.conditions[?(@.type=="NodeDoctorHealthy")].status'
```

```shell
# Upgrade to latest version
helm repo update
helm upgrade node-doctor supporttools/node-doctor -n node-doctor

# Upgrade with new values
helm upgrade node-doctor supporttools/node-doctor \
  -n node-doctor \
  --set settings.logLevel=debug

# Rollback to previous release
helm rollback node-doctor -n node-doctor
```

```shell
# Uninstall the release and remove the namespace
helm uninstall node-doctor -n node-doctor
kubectl delete namespace node-doctor
```

For complete Helm chart configuration options, see ../helm/node-doctor/README.md.
For rapid deployment of release candidates to the `a1-ops-prd` cluster:
```shell
# One command to build, push, and deploy
make bump-rc
```

This single command will:
- ✅ Validate the pipeline (run all tests)
- ✅ Increment the RC version (e.g., v0.1.0-rc.1 → v0.1.0-rc.2)
- ✅ Build the Docker image with the RC tag
- ✅ Push to the Docker Hub registry: `supporttools/node-doctor`
- ✅ Deploy to the `a1-ops-prd` cluster in the `node-doctor` namespace
- ✅ Commit and tag the release
Configuration:
- Cluster: `a1-ops-prd` (from `~/.kube/config`)
- Namespace: `node-doctor` (auto-created if needed)
- Registry: `docker.io/supporttools/node-doctor`
- Version tracking: `.version-rc` file
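The version bump performed by `make bump-rc` can be sketched in plain shell. This is a hypothetical illustration of the "increment the RC counter" step; the Makefile's actual implementation may differ.

```shell
# Hypothetical sketch: bump the trailing "rc.N" counter of an RC tag.
current="v0.1.0-rc.1"            # in the real flow this is read from .version-rc
prefix="${current%rc.*}"         # strip the "rc.N" suffix -> "v0.1.0-"
n="${current##*rc.}"             # extract the counter      -> "1"
next="${prefix}rc.$((n + 1))"    # reassemble               -> "v0.1.0-rc.2"
echo "$next"
```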
Verify Deployment:
```shell
# Check pod status
kubectl --context=a1-ops-prd -n node-doctor get pods -l app=node-doctor

# View logs
kubectl --context=a1-ops-prd -n node-doctor logs -l app=node-doctor

# Monitor health
kubectl --context=a1-ops-prd -n node-doctor get pods -l app=node-doctor -w
```

If you prefer manual control over each step, follow the detailed instructions below.
```shell
# Build the Docker image with version tag
docker build -t supporttools/node-doctor:v0.1.0 \
  --build-arg VERSION=v0.1.0 \
  --build-arg GIT_COMMIT=$(git rev-parse --short HEAD) \
  --build-arg BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
  .

# Tag as latest
docker tag supporttools/node-doctor:v0.1.0 \
  supporttools/node-doctor:latest

# Login to Docker Hub (if not already logged in)
docker login

# Push both tags
docker push supporttools/node-doctor:v0.1.0
docker push supporttools/node-doctor:latest
```

Image Details:
- Size: ~85 MB (optimized multi-stage Alpine build)
- Base: Alpine Linux 3.19
- Go: 1.24.4
- Binary location: `/usr/local/bin/node-doctor`
```shell
# Apply RBAC: ServiceAccount, ClusterRole, ClusterRoleBinding
kubectl apply -f deployment/rbac.yaml

# Verify RBAC resources created
kubectl get serviceaccount -n kube-system node-doctor
kubectl get clusterrole node-doctor
kubectl get clusterrolebinding node-doctor
```

```shell
# Apply DaemonSet (includes ConfigMap and Service)
kubectl apply -f deployment/daemonset.yaml

# Watch rollout
kubectl rollout status daemonset/node-doctor -n kube-system

# Verify pods running on all nodes
kubectl get pods -n kube-system -l app=node-doctor -o wide
```

```shell
# Apply both in correct order
kubectl apply -f deployment/rbac.yaml -f deployment/daemonset.yaml
```

```shell
# View all Node Doctor pods
kubectl get pods -n kube-system -l app=node-doctor

# Check pod logs
POD_NAME=$(kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $POD_NAME --tail=50 -f
```

```shell
# Test health endpoint
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/healthz

# Test readiness
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/ready

# Test detailed status
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:8080/status | jq .
```

```shell
# View Prometheus metrics
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics | head -20

# Check for Node Doctor specific metrics
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics | grep node_doctor
```

```shell
# Check if NodeDoctorHealthy condition is set
kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type == "NodeDoctorHealthy")'

# View full node conditions
kubectl describe nodes | grep -A 10 "Conditions:"
```

```shell
# View Node Doctor events
kubectl get events -n kube-system --field-selector involvedObject.name=node-doctor --sort-by='.lastTimestamp'
```

```shell
# Run comprehensive validation
cd /home/mmattox/go/src/github.com/supporttools/node-doctor
bash deployment/validate.sh
```

Expected Output:
- ✅ kubectl available
- ✅ Cluster connection OK
- ✅ RBAC manifest exists
- ✅ DaemonSet manifest exists
- ✅ DaemonSet deployed
- ✅ All pods running
- ✅ ServiceAccount exists
- ✅ ClusterRole exists
- ✅ ClusterRoleBinding exists
- ✅ ConfigMap exists
- ✅ Service exists
- ✅ Health endpoints responding
- ✅ Metrics endpoint working
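The real `deployment/validate.sh` is the source of truth for these checks. As a hedged sketch of how checks of this kind can be written, a small helper prints a pass/fail marker per check:

```shell
# Hypothetical sketch of the style of checks validate.sh performs;
# the actual script's checks may differ.
check() {
  # Print ✅/❌ for a named check based on the exit status of the command.
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "✅ $name"
  else
    echo "❌ $name"
  fi
}

check "kubectl available" command -v kubectl
check "RBAC manifest exists" test -f deployment/rbac.yaml
check "DaemonSet manifest exists" test -f deployment/daemonset.yaml
```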
CPU Monitor:
- Load average thresholds: 80% warning, 95% critical
- Thermal throttling detection
- CPU usage monitoring

Memory Monitor:
- Memory usage: 85% warning, 95% critical
- Swap usage: 50% warning, 80% critical
- OOM kill detection via /dev/kmsg

Disk Monitor:
- Monitored paths: /, /var/lib/kubelet, /var/lib/docker, /var/lib/containerd
- Space thresholds: 85% warning, 95% critical
- Inode thresholds: 85% warning, 95% critical
- Read-only filesystem detection
- I/O health monitoring
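To make the shape of this configuration concrete, here is an illustrative YAML fragment. The key names below are assumptions for illustration only; the authoritative schema is the ConfigMap embedded in deployment/daemonset.yaml.

```yaml
# Illustrative only: key names are assumed, not the actual Node Doctor schema.
monitors:
  cpu:
    loadAverage:
      warningPercent: 80
      criticalPercent: 95
  memory:
    usage:
      warningPercent: 85
      criticalPercent: 95
    swap:
      warningPercent: 50
      criticalPercent: 80
  disk:
    paths:
      - /
      - /var/lib/kubelet
      - /var/lib/docker
      - /var/lib/containerd
    space:
      warningPercent: 85
      criticalPercent: 95
    inodes:
      warningPercent: 85
      criticalPercent: 95
```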
Configuration is embedded in the DaemonSet ConfigMap (lines 498-569 of daemonset.yaml).
To update configuration:
```shell
# Edit ConfigMap
kubectl edit configmap -n kube-system node-doctor-config

# Or apply updated manifest
kubectl apply -f deployment/daemonset.yaml

# Restart pods to pick up changes
kubectl rollout restart daemonset/node-doctor -n kube-system
```

```shell
# Check pod events
kubectl describe pod -n kube-system $POD_NAME

# Common issues:
# - ImagePullBackOff: Check image registry access
# - CrashLoopBackOff: Check logs for errors
# - Pending: Check node selectors and tolerations
```

```shell
# Verify RBAC is applied
kubectl get serviceaccount -n kube-system node-doctor
kubectl get clusterrole node-doctor
kubectl get clusterrolebinding node-doctor

# Check pod service account
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].spec.serviceAccountName}'
```

```shell
# Check pod has host access
kubectl exec -n kube-system $POD_NAME -- ls -la /host
kubectl exec -n kube-system $POD_NAME -- ls -la /dev/kmsg
kubectl exec -n kube-system $POD_NAME -- head -5 /proc/meminfo

# Verify security context
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].spec.containers[0].securityContext}'
```

```shell
# Check Prometheus endpoint is accessible
kubectl exec -n kube-system $POD_NAME -- curl -s localhost:9100/metrics

# Verify Service is created
kubectl get service -n kube-system node-doctor-metrics

# Check if Prometheus can scrape
kubectl get pods -n kube-system -l app=node-doctor -o jsonpath='{.items[0].metadata.annotations}'
```

```shell
# Remove DaemonSet (includes ConfigMap and Service)
kubectl delete -f deployment/daemonset.yaml

# Remove RBAC resources
kubectl delete -f deployment/rbac.yaml

# Verify cleanup
kubectl get all -n kube-system -l app=node-doctor
```

- Resource Limits: Adjust CPU/memory limits based on cluster size
- Image Pull Secrets: Add imagePullSecrets if your registry requires authentication
- Node Selectors: Add nodeSelector to limit deployment to specific nodes
- Monitoring Integration: Configure Prometheus ServiceMonitor or PodMonitor
- Alerting: Set up alerts for NodeDoctorHealthy condition changes
- Backup: Include ConfigMap in cluster backup procedures
- Updates: Use rolling updates with maxUnavailable: 1 for safety
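For the alerting point above, one option is to alert on the node condition via kube-state-metrics, which exposes each node condition as a 0/1 `kube_node_status_condition` series. The rule below is a sketch, not part of the Node Doctor chart; it assumes the Prometheus Operator CRDs and kube-state-metrics are installed in your cluster.

```yaml
# Illustrative PrometheusRule: fires when any node's NodeDoctorHealthy
# condition is no longer True for 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-doctor-alerts
  namespace: node-doctor
spec:
  groups:
    - name: node-doctor
      rules:
        - alert: NodeDoctorUnhealthy
          expr: kube_node_status_condition{condition="NodeDoctorHealthy",status="true"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} reports NodeDoctorHealthy != True"
```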
- GitHub: https://github.com/supporttools/node-doctor
- Documentation: See the `/docs` directory
- Issues: Report at GitHub issues
Completed Tasks:
- ✅ Task #3129: DaemonSet manifest
- ✅ Task #3130: RBAC manifests
- ✅ Task #3131: ConfigMap manifest (embedded in daemonset.yaml)
- ✅ Task #3132: Service manifest (embedded in daemonset.yaml)
- ✅ Dockerfile for container image
What's Working:
- CPU Monitor: 93.1% test coverage
- Memory Monitor: 95.7% test coverage
- Disk Monitor: 93.0% test coverage
- All tests passing
- Binary builds successfully
- Docker image builds (84.8 MB)
- Ready for production deployment
Image Registry:
- Docker Hub: `supporttools/node-doctor:v0.1.0`
- Image: `node-doctor:v0.1.0` (84.8 MB, Alpine-based)
You're ready to deploy to a real cluster! 🚀