Thank you for your interest in contributing to Node Doctor! This document provides guidelines and instructions for contributing to this Kubernetes node health monitoring and auto-remediation system.
- Code of Conduct
- Getting Started
- Development Workflow
- Coding Standards
- Testing Requirements
- Pull Request Process
- Commit Message Guidelines
- Issue Reporting
- Security Disclosure Policy
- Additional Resources
This project adheres to the Contributor Covenant Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to the project maintainers through GitHub Issues.
- Be Respectful: Treat all contributors with respect and consideration
- Be Collaborative: Work together to improve the project
- Be Professional: Focus on constructive feedback and solutions
- Be Inclusive: Welcome contributors of all backgrounds and experience levels
Before you begin, ensure you have the following installed:
- Go 1.21 or higher - Required for building and testing
- Docker - For building container images
- kubectl - For interacting with Kubernetes
- kind or minikube - For local Kubernetes testing (optional but recommended)
- make - For using project build targets
1. Fork and Clone the Repository

   ```bash
   # Fork the repository on GitHub first, then:
   git clone https://github.com/YOUR_USERNAME/node-doctor.git
   cd node-doctor

   # Add upstream remote
   git remote add upstream https://github.com/SupportTools/node-doctor.git
   ```

2. Install Dependencies

   ```bash
   # Download Go dependencies
   go mod download

   # Verify Go version
   go version  # Should be 1.21 or higher
   ```

3. Verify Build

   ```bash
   # Build the binary
   make build-node-doctor-local

   # Verify it works
   ./bin/node-doctor --version
   ```

4. Run Tests

   ```bash
   # Run all tests
   make test

   # Run specific package tests
   go test -v ./pkg/monitors/...
   ```
```
node-doctor/
├── cmd/node-doctor/     # Main entry point
├── pkg/                 # Core packages
│   ├── monitors/        # Health monitoring implementations
│   ├── remediators/     # Auto-remediation implementations
│   ├── exporters/       # Kubernetes/Prometheus/HTTP exporters
│   ├── detector/        # Problem detection orchestration
│   └── types/           # Core interfaces and types
├── deployment/          # Kubernetes manifests
├── docs/                # Architecture and development guides
├── config/              # Example configurations
└── test/                # Integration and E2E tests
```
Use descriptive branch names that indicate the type of change:
- `feature/<description>` - For new features
- `fix/<description>` - For bug fixes
- `docs/<description>` - For documentation changes
- `refactor/<description>` - For code refactoring
- `test/<description>` - For test improvements

Examples:

- `feature/cpu-throttling-monitor`
- `fix/memory-leak-in-exporter`
- `docs/update-architecture-guide`
- `refactor/simplify-monitor-interface`
1. Create a Feature Branch

   ```bash
   git checkout -b feature/your-feature-name
   ```

2. Make Your Changes
   - Write code following the Coding Standards
   - Write tests alongside your code (not after)
   - Run validation locally before pushing

3. Validate Locally

   ```bash
   # Quick validation (format, vet, lint)
   make validate-quick

   # Full validation (mirrors CI/CD pipeline)
   make validate-pipeline-local
   ```

4. Commit Your Changes
   - Follow Commit Message Guidelines
   - Make atomic commits with clear descriptions

5. Push and Create Pull Request

   ```bash
   git push origin feature/your-feature-name
   ```

   - Create a PR on GitHub with a clear description
   - Reference any related issues

6. Address Review Feedback
   - Make requested changes
   - Push updates to the same branch
   - Re-request review when ready
Follow standard Go conventions and best practices:
- Formatting: Use `gofmt` or `goimports` (enforced by CI)
- Linting: Code must pass `go vet` and `staticcheck`
- Naming: Follow Go naming conventions
  - Exported names: `CamelCase`
  - Unexported names: `camelCase`
  - Package names: lowercase, single word
- Documentation: All exported functions, types, and packages must have documentation comments
- Package Documentation: Every package should have a package-level comment
- Interface Design: Prefer interface-based design for testability and extensibility
- Error Handling: Always handle errors; don't ignore them
  - Use `fmt.Errorf` with `%w` for error wrapping
  - Provide context in error messages
- Concurrency: Use channels for communication between goroutines
- Resource Cleanup: Always use `defer` for cleanup operations
When implementing a new monitor, follow this pattern:
```go
package monitors

import (
	"fmt"
	"time"

	"github.com/supporttools/node-doctor/pkg/types"
)

// CPUMonitor monitors CPU health metrics.
type CPUMonitor struct {
	config   *CPUMonitorConfig
	stopCh   chan struct{}
	statusCh chan *types.Status
}

// NewCPUMonitor creates a new CPU monitor instance.
func NewCPUMonitor(config *CPUMonitorConfig) (*CPUMonitor, error) {
	if err := config.Validate(); err != nil {
		return nil, fmt.Errorf("invalid config: %w", err)
	}
	return &CPUMonitor{
		config:   config,
		stopCh:   make(chan struct{}),
		statusCh: make(chan *types.Status, 10),
	}, nil
}

// Start begins monitoring (runs in a goroutine).
func (m *CPUMonitor) Start() (<-chan *types.Status, error) {
	go m.monitorLoop()
	return m.statusCh, nil
}

// Stop halts monitoring.
func (m *CPUMonitor) Stop() {
	close(m.stopCh)
}

func (m *CPUMonitor) monitorLoop() {
	ticker := time.NewTicker(m.config.Interval)
	defer ticker.Stop()
	defer close(m.statusCh)
	for {
		select {
		case <-m.stopCh:
			return
		case <-ticker.C:
			if status := m.check(); status != nil {
				m.statusCh <- status
			}
		}
	}
}
```

CRITICAL: All remediators MUST implement safety mechanisms:
```go
package remediators

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/supporttools/node-doctor/pkg/types"
)

// ServiceRemediator handles service restart remediation.
type ServiceRemediator struct {
	config          *ServiceRemediatorConfig
	circuitBreaker  *CircuitBreaker
	rateLimiter     *RateLimiter
	lastRemediation map[string]time.Time
	mu              sync.Mutex
}

// CanRemediate checks whether this remediator handles the problem.
func (r *ServiceRemediator) CanRemediate(problem types.Problem) bool {
	return problem.Type == "service-failed"
}

// Remediate attempts to fix the problem with safety checks.
func (r *ServiceRemediator) Remediate(ctx context.Context, problem types.Problem) error {
	// Check circuit breaker.
	if r.circuitBreaker.IsOpen() {
		return fmt.Errorf("circuit breaker open, too many failures")
	}

	// Check rate limiter.
	if !r.rateLimiter.Allow() {
		return fmt.Errorf("rate limit exceeded")
	}

	// Check cooldown period.
	r.mu.Lock()
	lastTime, exists := r.lastRemediation[problem.Resource]
	r.mu.Unlock()
	if exists && time.Since(lastTime) < r.GetCooldown() {
		return fmt.Errorf("cooldown period not elapsed")
	}

	// Perform remediation and update tracking.
	if err := r.performRemediation(ctx, problem); err != nil {
		r.circuitBreaker.RecordFailure()
		return err
	}

	r.circuitBreaker.RecordSuccess()
	r.mu.Lock()
	r.lastRemediation[problem.Resource] = time.Now()
	r.mu.Unlock()
	return nil
}

// GetCooldown returns the cooldown period.
func (r *ServiceRemediator) GetCooldown() time.Duration {
	return r.config.CooldownPeriod
}
```

For Remediators (MANDATORY):
- Cooldown Period: Minimum time between attempts (default: 5 minutes)
- Attempt Limiting: Maximum attempts per time window (default: 3 per hour)
- Circuit Breaker: Stop after consecutive failures (default: 5 failures)
- Rate Limiting: Global rate limit (default: 10 per minute)
- Input Validation: Validate ALL inputs before system calls
- Audit Logging: Log ALL remediation attempts with outcomes
- Dry-Run Mode: Support testing without actual changes
Never:
- Execute system commands without validation
- Disable safety mechanisms
- Skip cooldown periods
- Bypass rate limiting
- Ignore circuit breaker state
- Minimum Coverage: 80% overall code coverage
- Target Coverage: 85%+ for production code
- New Code: Must include tests for all new functionality
All code must have corresponding unit tests:
```bash
# Run all unit tests
make test

# Run specific package tests
go test -v ./pkg/monitors/...
go test -v ./pkg/remediators/...

# Run with coverage
go test -cover -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
```

Unit Test Requirements:
- Test happy path scenarios
- Test error cases and edge conditions
- Test boundary conditions
- Use table-driven tests for multiple scenarios
- Mock external dependencies
- Test concurrent operations for thread safety
Example Table-Driven Test:
```go
func TestCPUMonitor_Check(t *testing.T) {
	tests := []struct {
		name        string
		cpuUsage    float64
		threshold   float64
		wantProblem bool
	}{
		{
			name:        "CPU usage below threshold",
			cpuUsage:    50.0,
			threshold:   80.0,
			wantProblem: false,
		},
		{
			name:        "CPU usage above threshold",
			cpuUsage:    85.0,
			threshold:   80.0,
			wantProblem: true,
		},
		{
			name:        "CPU usage at threshold",
			cpuUsage:    80.0,
			threshold:   80.0,
			wantProblem: false,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Test implementation
		})
	}
}
```

Test component integration:
```bash
# Run integration tests
make test-integration

# Run with build tag
go test -tags=integration -v ./...
```

Integration tests should verify:
- Monitor integration with Problem Detector
- Remediator safety mechanisms under load
- Exporter publishing to Prometheus/Kubernetes
- Channel communication patterns
- Configuration loading and validation
For major features, provide end-to-end tests:
```bash
# Run E2E tests (requires Kubernetes cluster)
make test-e2e

# Run with build tag
go test -tags=e2e -v ./test/e2e/...
```

E2E test scenarios:
- Complete monitoring workflow (detect → export)
- Remediation workflow (detect → remediate → verify)
- DaemonSet deployment and configuration
- Multi-node scenarios
All tests must pass the race detector:

```bash
go test -race -v ./...
```

1. Ensure tests pass:

   ```bash
   make validate-pipeline-local
   ```

2. Check code coverage (should not decrease):

   ```bash
   go test -cover ./...
   ```

3. Update documentation if needed:
   - Update relevant docs in `docs/`
   - Update README.md if adding new features
   - Add examples if introducing new configuration
1. Write a Clear Title
   - Use the same format as commit messages
   - Examples: `feat: add disk space monitor`, `fix: memory leak in prometheus exporter`

2. Provide a Detailed Description
   - What problem does this solve?
   - What changes were made?
   - How was it tested?
   - Any breaking changes or migration notes?

3. Link Related Issues
   - Use keywords: `Fixes #123`, `Closes #456`, `Related to #789`

4. Automated Checks
   - All CI checks must pass (build, tests, linting)
   - Code coverage must be maintained or improved
   - Security scanning must pass

5. Code Review
   - At least one approval required from maintainers
   - Address all review comments
   - Re-request review after making changes

6. Merge Criteria
   - All CI checks passing
   - Approved by at least one maintainer
   - No unresolved review comments
   - Up-to-date with main branch
   - Squash merge preferred for clean history
```
<type>: <short description>

<optional longer description>

<optional footer>
```

- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `refactor`: Code refactoring (no functional changes)
- `test`: Adding or updating tests
- `chore`: Maintenance tasks, dependency updates
- `perf`: Performance improvements
- `style`: Code style changes (formatting, no logic changes)
Simple commit:

```
feat: add CPU throttling monitor

Implements monitor to detect CPU throttling on nodes.
Includes unit tests with 85% coverage.
```

Bug fix with issue reference:

```
fix: resolve memory leak in prometheus exporter

The exporter was not properly cleaning up metric collectors
when monitors were stopped, causing gradual memory growth.

Fixes #234
```

Breaking change:

```
refactor: update Monitor interface signature

BREAKING CHANGE: Monitor.Start() now returns error as second
return value to allow initialization failures to be propagated.

Migration: Update all Monitor implementations to return (channel, error)
instead of just channel.
```
- Keep the first line under 72 characters
- Use imperative mood ("add feature" not "added feature")
- Explain the "why" not just the "what" in the body
- Reference issues and pull requests where relevant
- Use bullet points for multiple changes
When reporting a bug, please include:
Required Information:
- Description: Clear description of the issue
- Expected Behavior: What you expected to happen
- Actual Behavior: What actually happened
- Steps to Reproduce: Detailed steps to reproduce the issue
- Environment:
- Node Doctor version
- Kubernetes version
- Node OS and version
- Go version (if building from source)
- Logs: Relevant logs from node-doctor pods
- Configuration: Sanitized configuration (remove secrets)
Example:

```markdown
### Description
Memory monitor incorrectly reports OOM when swap is enabled

### Expected Behavior
Monitor should account for swap space when calculating available memory

### Actual Behavior
Monitor triggers OOM alert even when swap space is available

### Steps to Reproduce
1. Deploy node-doctor on node with swap enabled
2. Set memory threshold to 80%
3. Fill memory to 75% with swap available
4. Observe false OOM alert

### Environment
- Node Doctor: v0.3.0
- Kubernetes: v1.28.2
- Node OS: Ubuntu 22.04
- Swap: 4GB enabled

### Logs
[Paste relevant logs]

### Configuration
[Paste sanitized config]
```

For feature requests, please provide:
- Use Case: Why is this feature needed?
- Proposed Solution: How should it work?
- Alternatives Considered: Other approaches considered
- Additional Context: Any other relevant information
DO NOT open public issues for security vulnerabilities.
If you discover a security vulnerability in Node Doctor, please report it responsibly:
-
Private Disclosure: Use GitHub Security Advisory:
- Go to the Security tab
- Click "Report a vulnerability"
- Provide detailed information about the vulnerability
-
Alternative Contact: If GitHub Security Advisory is not available:
- Open a GitHub Issue with the label "Security" (for non-critical issues)
- For critical vulnerabilities, contact maintainers directly
When reporting a security issue, please provide:
- Description: Detailed description of the vulnerability
- Impact: What can an attacker accomplish?
- Steps to Reproduce: Detailed reproduction steps
- Affected Versions: Which versions are affected?
- Proposed Fix: If you have suggestions for fixing
- Public Disclosure Timeline: Your intended disclosure timeline
- Acknowledgment: We will acknowledge receipt within 48 hours
- Initial Assessment: We will provide an initial assessment within 5 business days
- Fix Timeline: We aim to release fixes for critical vulnerabilities within 30 days
- Credit: We will credit security researchers (unless they prefer to remain anonymous)
- Coordinated Disclosure: We follow a 90-day responsible disclosure policy
When contributing to Node Doctor:
- Never commit secrets: No credentials, API keys, or tokens in code
- Validate all inputs: Especially for remediators executing system commands
- Follow least privilege: Request only necessary Kubernetes permissions
- Use secure defaults: Safe defaults in configurations
- Audit logging: Log security-relevant operations
- Dependency scanning: Keep dependencies up-to-date
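As one illustration of the input-validation point for remediators that run system commands, a target can be checked against a strict pattern and an allowlist before it ever reaches a shell. The allowlist contents and function name here are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// validServiceName rejects anything that could smuggle shell
// metacharacters or path tricks into a systemctl invocation.
var validServiceName = regexp.MustCompile(`^[a-zA-Z0-9_.@-]+$`)

// allowedServices is a hypothetical allowlist: remediators should only
// ever restart services they were explicitly configured to manage.
var allowedServices = map[string]bool{
	"kubelet":    true,
	"containerd": true,
}

// validateServiceTarget runs both checks before any system call.
func validateServiceTarget(name string) error {
	if !validServiceName.MatchString(name) {
		return fmt.Errorf("service name %q contains invalid characters", name)
	}
	if !allowedServices[name] {
		return fmt.Errorf("service %q is not in the remediation allowlist", name)
	}
	return nil
}

func main() {
	fmt.Println(validateServiceTarget("kubelet"))                  // <nil>
	fmt.Println(validateServiceTarget("kubelet; rm -rf /") != nil) // true
}
```

Validating against an allowlist (rather than trying to blocklist dangerous characters) also satisfies the least-privilege point above: the remediator can only ever touch services it was configured for.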
- Architecture Guide - System design and component overview
- Monitors Guide - Monitor implementation details
- Remediation Guide - Remediator patterns and safety mechanisms
- Task Execution Workflow - Development process
- Validation Quick Reference - Testing and validation
```bash
# Build node-doctor binary
make build-node-doctor-local

# Run all tests
make test

# Run validation pipeline (mirrors CI/CD)
make validate-pipeline-local

# Quick validation (format, vet, lint only)
make validate-quick

# Build and push Docker image
make build-image-node-doctor VERSION=v1.0.0
make push-image-node-doctor VERSION=v1.0.0

# Deploy to Kubernetes
kubectl apply -f deployment/
```

```bash
# Format code
go fmt ./...

# Run linter
staticcheck ./...

# Run vet
go vet ./...

# Check test coverage
go test -cover ./...

# Run race detector
go test -race ./...

# Update dependencies
go mod tidy
```

- GitHub Issues: For bug reports and feature requests
- Pull Requests: For code contributions and discussions
- GitHub Discussions: For general questions and community discussion
If you need help:
- Check existing documentation in the `docs/` directory
- Search existing GitHub Issues
- Ask in GitHub Discussions
- Open a new issue with the "question" label
Thank you for contributing to Node Doctor! Your contributions help improve Kubernetes node health monitoring and make infrastructure more reliable for everyone.
We appreciate your time and effort in following these guidelines. Together, we can build a robust, safe, and maintainable project that serves the Kubernetes community.
Happy coding! 🚀