
[BUG] Race condition on reloader deployment restart #1016

@ordity

Describe the bug
Helm chart:

    reloadOnCreate: false

Workload's annotation:

    deployment.reloader.stakater.com/pause-period: <anything above a few seconds>

With the above configuration, a race condition is created: if a ConfigMap or Secret changes and the Reloader pod is then restarted before the pause period ends, the workload is left with its rollout paused indefinitely, until the rollout is unpaused manually or the ConfigMap is changed again.

Running in single-replica mode keeps the vulnerable window short, but the race still occurs. Running in leader election mode with multiple replicas prolongs the timeframe during which the race condition can occur.

Enabling syncAfterRestart does not solve the problem.

Enabling both syncAfterRestart and reloadOnCreate technically solves the problem, but only by restarting every workload across all monitored namespaces whenever Reloader restarts, which is not a viable solution.

The problem is more likely to occur with longer pause-period values.

To Reproduce

  1. Set reloadOnCreate: false
  2. Set syncAfterRestart to any value
  3. Add the deployment.reloader.stakater.com/pause-period annotation to any workload, with a value such as one minute.
  4. Change any ConfigMap / Secret that the above workload references.
  5. Before the pause period ends, manually restart the Reloader deployment.

Expected behavior
Restarts of the Reloader pod should not impact pending timed reloads.

Possible solutions
Reloader already adds a temporary annotation recording the time at which the pause was activated, and another annotation contains the pause period duration. With syncAfterRestart enabled, it should therefore be possible for Reloader to check for pending reloads on startup, without the nuclear option of forcing reloads across all monitored workloads; a rough sketch of this idea follows.
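A minimal sketch of such a startup sweep using client-go. The pause-period annotation key is the one from this report; the paused-at key and its RFC 3339 timestamp format are assumptions about how Reloader records the pause start, not its actual internals:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const (
	pausePeriodAnnotation = "deployment.reloader.stakater.com/pause-period" // key from this report
	pausedAtAnnotation    = "deployment.reloader.stakater.com/paused-at"    // hypothetical key
)

// resumePendingRollouts finds deployments that were paused for a timed reload
// before the restart and unpauses each one once its pause period has elapsed.
func resumePendingRollouts(ctx context.Context, client kubernetes.Interface) error {
	deployments, err := client.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, d := range deployments.Items {
		period, ok := d.Annotations[pausePeriodAnnotation]
		if !ok || !d.Spec.Paused {
			continue // not under a timed pause, or already running
		}
		dur, err := time.ParseDuration(period)
		if err != nil {
			continue
		}
		pausedAt, err := time.Parse(time.RFC3339, d.Annotations[pausedAtAnnotation])
		if err != nil {
			continue // no usable pause-start record; leave for manual handling
		}
		// Wait out whatever remains of the original pause period. In Reloader
		// proper this would be a scheduled task rather than a blocking sleep.
		if remaining := time.Until(pausedAt.Add(dur)); remaining > 0 {
			time.Sleep(remaining)
		}
		patch := []byte(`{"spec":{"paused":false}}`)
		if _, err := client.AppsV1().Deployments(d.Namespace).Patch(
			ctx, d.Name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return fmt.Errorf("unpausing %s/%s: %w", d.Namespace, d.Name, err)
		}
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := resumePendingRollouts(context.Background(), client); err != nil {
		panic(err)
	}
}
```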

Alternatively, in leader election mode, the information about pending reloads could be handed over to the new leader so it can finish handling the affected workloads.

Another option is to terminate the old Reloader pod gracefully, letting it finish all pending pause periods while the new pod handles all future incoming changes; see the sketch below.
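A minimal sketch of that shutdown sequence; schedulePause and the demo unpause callback are hypothetical stand-ins for Reloader's internals, not its actual code:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

var pending sync.WaitGroup // tracks rollouts paused but not yet unpaused

// schedulePause stands in for the handler that pauses a rollout and
// unpauses it once the configured pause-period has elapsed.
func schedulePause(period time.Duration, unpause func()) {
	pending.Add(1)
	go func() {
		defer pending.Done()
		time.Sleep(period)
		unpause()
	}()
}

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	// Demo: one change event arrives with a 30s pause-period.
	schedulePause(30*time.Second, func() { log.Println("rollout unpaused") })

	<-ctx.Done()   // SIGTERM: stop accepting new ConfigMap/Secret changes here...
	pending.Wait() // ...but drain every pending unpause before the process exits
	log.Println("all pending pause periods handled; exiting")
}
```

For this to work, the pod's terminationGracePeriodSeconds would have to exceed the longest configured pause-period; otherwise the kubelet kills the old pod before its timers fire.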
