Background
Discovered during review of #159 (Fix metrics).
The MetricsCollector.Collect() method calls Reset() on gauge vectors before re-populating them. This creates a brief window during a Prometheus scrape where condition/count metrics temporarily disappear — potentially causing transient alert fires for operators with high scrape frequency or tight alerting rules.
As noted by @xrstf in the review, the correct fix is to guard the collector with a sync.RWMutex:
- Hold a write lock during the
Reset() + repopulate sequence inside updateObjectCounts and updateCacheServerCounts
- Hold a read lock in
Collect() while metrics are being scraped
This ensures that Prometheus scrapes see a consistent, atomic snapshot and never observe a partially-reset state.
Proposed Fix
type MetricsCollector struct {
mu sync.RWMutex
// ... existing fields
}
func (c *MetricsCollector) updateObjectCounts(...) {
c.mu.Lock()
defer c.mu.Unlock()
// Reset + repopulate
}
func (c *MetricsCollector) Collect(ch chan<- prometheus.Metric) {
c.mu.RLock()
defer c.mu.RUnlock()
// collect
}
Scope
This is intentionally out of scope for #159 which focuses on the Set(1) → accumulate correctness fix. Tracked separately here as a good first issue.
/kind bug
/good-first-issue
Background
Discovered during review of #159 (Fix metrics).
The
MetricsCollector.Collect()method callsReset()on gauge vectors before re-populating them. This creates a brief window during a Prometheus scrape where condition/count metrics temporarily disappear — potentially causing transient alert fires for operators with high scrape frequency or tight alerting rules.As noted by @xrstf in the review, the correct fix is to guard the collector with a
sync.RWMutex:Reset()+ repopulate sequence insideupdateObjectCountsandupdateCacheServerCountsCollect()while metrics are being scrapedThis ensures that Prometheus scrapes see a consistent, atomic snapshot and never observe a partially-reset state.
Proposed Fix
Scope
This is intentionally out of scope for #159 which focuses on the
Set(1)→ accumulate correctness fix. Tracked separately here as a good first issue./kind bug
/good-first-issue