
OCPBUGS-81476: Fix timeout in PinnedImages GC test #30962

Open
isabella-janssen wants to merge 1 commit into openshift:main from isabella-janssen:ocpbugs-81476

Conversation

@isabella-janssen
Member

@isabella-janssen isabella-janssen commented Apr 6, 2026

This increases the timeout for the process of a node joining an MCP to prevent MCP degrades.

Summary by CodeRabbit

  • Tests
    • Updated test timing parameters to improve reliability of node configuration state verification during testing.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Apr 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Apr 6, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This fixes a race condition in the "All Nodes in a custom Pool should have the PinnedImages even after Garbage Collection" test that caused nodes to get stuck in degraded state with missing MachineConfig.

The Problem:
The test was using defers in the wrong order, causing cleanup to happen like this:

  1. Delete KubeletConfig
  2. Delete PinnedImageSet (triggers rendered-custom deletion)
  3. Unlabel node (triggers transition to worker pool)
  4. Wait for worker config

When step 3 triggered the transition, the node would reboot to apply the worker config. However, because the rendered-custom config was already deleted in step 2, the node would come back up with a reference to a non-existent config on disk and get stuck in degraded state:

currentConfig: rendered-custom-d356ed29481f2de2bb31c6443e1d29ca
desiredConfig: rendered-worker-82faad7319f9e10715adbfd98a4b67ba
state: Degraded
reason: "machineconfig 'rendered-custom-d356ed29481f2de2bb31c6443e1d29ca' not found"
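
The ordering problem described above follows from Go running deferred calls in last-in, first-out order. The following is a minimal sketch, assuming each cleanup was deferred right after the setup step it undoes; the function name and step labels are hypothetical and are not taken from the test file.

```go
package main

import "fmt"

// cleanupOrderWithDefers registers four hypothetical cleanup steps with defer,
// each right after the setup step it undoes, and returns the order in which
// they actually ran. Step names are illustrative labels, not real helpers.
func cleanupOrderWithDefers() (order []string) {
	record := func(step string) { order = append(order, step) }

	// Setup (assumed): label the node into the custom pool.
	defer record("wait for worker config")
	defer record("unlabel node")

	// Setup (assumed): create the PinnedImageSet.
	defer record("delete PinnedImageSet")

	// Setup (assumed): create the KubeletConfig.
	defer record("delete KubeletConfig")

	// Deferred calls run in last-in, first-out order once the function returns.
	return order
}

func main() {
	for i, step := range cleanupOrderWithDefers() {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

Because the configs are created last, their deletes run first, which is the sequence listed under "The Problem".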

The Fix:
Changed cleanup order to:

  1. Unlabel node (triggers transition)
  2. Wait for worker config transition to complete
  3. Delete KubeletConfig
  4. Delete PinnedImageSet

This ensures the node successfully transitions back to the worker pool BEFORE we delete any configs, eliminating the race condition.
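
As a rough illustration of the corrected order, the sketch below runs cleanup steps explicitly and in sequence. The `cleanupStep` type, `runCleanup`, and `errNodeNotReady` are hypothetical names, and stopping at the first failed step is an assumption made for the example, not something this description states.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotReady is a hypothetical error used only by this sketch.
var errNodeNotReady = errors.New("node has not reached the worker config")

// cleanupStep pairs a label with the action that performs the step.
type cleanupStep struct {
	name string
	run  func() error
}

// runCleanup executes steps in the given order and stops at the first failure,
// so configs are never deleted while the node is still transitioning.
func runCleanup(steps []cleanupStep) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("cleanup step %q failed: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	steps := []cleanupStep{
		{"unlabel node", ok},
		{"wait for worker config", func() error { return errNodeNotReady }},
		{"delete KubeletConfig", ok},
		{"delete PinnedImageSet", ok},
	}
	done, err := runCleanup(steps)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

In this simulated run the wait step fails, so neither delete is attempted and the rendered config the node still references stays in place.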

Changes:

  • Removed defers for unlabelNode, waitTillNodeReadyWithConfig, deletePIS, and deleteKC
  • Added explicit cleanup after GCPISTest completes that performs operations in the correct order
  • Added logging to track cleanup progress
  • Removed defer deleteKC from GCPISTest function

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Apr 6, 2026

Walkthrough

Timeout increased from 5 minutes to 10 minutes in the waitTillNodeReadyWithConfig function within machine config test file, affecting node readiness polling behavior.
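
To show why a longer timeout matters for a polling check, here is a minimal sketch using only the standard library. `pollUntil` is a hypothetical, simplified stand-in for a polling assertion and is not how Gomega implements `Eventually`; the durations are scaled down from minutes and seconds to milliseconds so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil calls check immediately and then once per interval until check
// returns true or the next poll would exceed the timeout.
func pollUntil(timeout, interval time.Duration, check func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		// Give up if the next poll would land past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated node that reports the desired config on the fourth poll.
	polls := 0
	nodeReady := func() bool {
		polls++
		return polls >= 4
	}

	short := pollUntil(20*time.Millisecond, 10*time.Millisecond, nodeReady)
	polls = 0
	long := pollUntil(100*time.Millisecond, 10*time.Millisecond, nodeReady)
	fmt.Println("short timeout succeeded:", short)
	fmt.Println("long timeout succeeded:", long)
}
```

With the same polling interval, the simulated node is only observed as ready when the overall timeout leaves room for enough polls.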

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Test Timeout Configuration: test/extended/machine_config/pinnedimages.go | Increased timeout duration for node readiness polling from 5 to 10 minutes. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Structure And Quality | ⚠️ Warning | Documentation comment states 5 minutes but implementation uses 10-minute timeout, creating a documentation-code mismatch. | Update documentation comment at lines 601-603 from '5 minutes' to '10 minutes' to align with actual implementation. |
| Ipv6 And Disconnected Network Test Compatibility | ⚠️ Warning | Six new Ginkgo e2e tests pull images from quay.io without Disconnected skip markers or IPv6 handling, causing failures on disconnected non-metal or IPv6-only clusters. | Add [Skipped:Disconnected] to all 6 test names or update logic to use internal registry, and wrap external pulls in InIPv4ClusterContext() checks. |
✅ Passed checks (8 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: increasing a timeout in the PinnedImages GC test to fix a related issue (OCPBUGS-81476). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Stable And Deterministic Test Names | ✅ Passed | All test names are stable, descriptive static strings with no dynamic values, generated identifiers, timestamps, or node names. The PR timeout change affects implementation, not test names. |
| Microshift Test Compatibility | ✅ Passed | This PR only adjusts an existing timeout value from 5 to 10 minutes and does not add any new Ginkgo e2e tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR only modifies timeout in existing helper function; no new Ginkgo e2e tests added, so custom check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only test code adjusting a timeout value; no topology-aware scheduling constraints introduced. |
| Ote Binary Stdout Contract | ✅ Passed | The change modifies a timeout value in a test helper function with no stdout operations or process-level code violations. |


@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Apr 6, 2026
@isabella-janssen
Member Author

/payload-aggregate periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive 5

@openshift-ci
Contributor

openshift-ci bot commented Apr 6, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d7be16f0-31ca-11f1-9d47-a6fb2a91cc22-0

@isabella-janssen
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) Apr 6, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Apr 6, 2026
@isabella-janssen
Member Author

/payload-aggregate periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive 5

@openshift-ci
Contributor

openshift-ci bot commented Apr 9, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/874ddc00-3454-11f1-9703-469c6b3fb240-0

@isabella-janssen isabella-janssen changed the title from "OCPBUGS-81476: Fix race condition in PinnedImages GC test" to "OCPBUGS-81476: Fix timeout in PinnedImages GC test" Apr 10, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This increases the timeout for the process of a node joining an MCP to prevent MCP degrades.


@isabella-janssen
Member Author

/payload-aggregate periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive 5

@openshift-ci
Contributor

openshift-ci bot commented Apr 10, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4728db40-34de-11f1-96d4-ab4c33428563-0

@isabella-janssen
Member Author

/payload-aggregate periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive 5

@openshift-ci
Contributor

openshift-ci bot commented Apr 10, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-gcp-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f71abeb0-3510-11f1-84e8-8eb280e2392d-0

@isabella-janssen isabella-janssen marked this pull request as ready for review April 13, 2026 14:23
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Apr 13, 2026
@openshift-ci openshift-ci bot requested review from pablintino and umohnani8 April 13, 2026 14:23
@openshift-ci-robot openshift-ci-robot removed the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) Apr 13, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.23.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

This increases the timeout for the process of a node joining an MCP to prevent MCP degrades.

Summary by CodeRabbit

  • Tests
    • Updated test timing parameters to improve reliability of node configuration state verification during testing.


@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Apr 13, 2026
@isabella-janssen
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Apr 13, 2026
@openshift-ci-robot

@isabella-janssen: This pull request references Jira Issue OCPBUGS-81476, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
test/extended/machine_config/pinnedimages.go (1)

601-618: Update stale timeout docs and make the failure message generic.

Line 601 still says “up to 5 minutes,” but Line 618 now waits 10 minutes. Also, Line 618 hardcodes rendered-worker even though this helper is used with multiple config prefixes.

Proposed patch
-// `waitTillNodeReadyWithConfig` loops for up to 5 minutes to check whether the input node reaches
+// `waitTillNodeReadyWithConfig` loops for up to 10 minutes to check whether the input node reaches
 // the desired rendered config version. The config version is determined by checking if the config
 // version prefix matches the stardard format of `rendered-<desired-mcp-name>`.
 func waitTillNodeReadyWithConfig(kubeClient *kubernetes.Clientset, nodeName, currentConfigPrefix string) {
 	o.Eventually(func() bool {
 		node, err := kubeClient.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
@@
-	}, 10*time.Minute, 10*time.Second).Should(o.BeTrue(), "Timed out waiting for Node '%s' to have rendered-worker config.", nodeName)
+	}, 10*time.Minute, 10*time.Second).Should(
+		o.BeTrue(),
+		"Timed out waiting for Node '%s' to have config prefix '%s'.",
+		nodeName,
+		currentConfigPrefix,
+	)
 }
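
The idea behind the proposed message change can be sketched in isolation. This is an illustrative example only: `hasConfigPrefix` and `timeoutMessage` are hypothetical helpers that do not exist in the test file, and the prefix check via `strings.HasPrefix` is an assumption about how a `rendered-<desired-mcp-name>` match could be done.

```go
package main

import (
	"fmt"
	"strings"
)

// hasConfigPrefix reports whether a node's current rendered config name starts
// with the expected prefix, for example "rendered-custom" or "rendered-worker".
func hasConfigPrefix(currentConfig, prefix string) bool {
	return strings.HasPrefix(currentConfig, prefix)
}

// timeoutMessage builds a failure message that names the prefix actually being
// waited for instead of hardcoding "rendered-worker".
func timeoutMessage(nodeName, prefix string) string {
	return fmt.Sprintf("Timed out waiting for Node '%s' to have config prefix '%s'.", nodeName, prefix)
}

func main() {
	current := "rendered-custom-d356ed29481f2de2bb31c6443e1d29ca"
	for _, prefix := range []string{"rendered-custom", "rendered-worker"} {
		if hasConfigPrefix(current, prefix) {
			fmt.Printf("node config matches %s\n", prefix)
			continue
		}
		// "example-node" is a placeholder node name for this sketch.
		fmt.Println(timeoutMessage("example-node", prefix))
	}
}
```

Passing the prefix into the message keeps the helper's output accurate whichever pool the caller is waiting on.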
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/machine_config/pinnedimages.go` around lines 601 - 618, The
function waitTillNodeReadyWithConfig has a stale comment stating "up to 5
minutes" and a hardcoded failure message referencing `rendered-worker`; update
the comment to reflect the actual 10 minute wait used in the o.Eventually call
and change the Should(...) failure message to be generic and use the
`currentConfigPrefix` parameter instead of "rendered-worker" so the helper works
for any config prefix (refer to function name waitTillNodeReadyWithConfig and
the currentConfigPrefix variable).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 099b22e1-5b25-4fd0-af1a-863a8cb97579

📥 Commits

Reviewing files that changed from the base of the PR and between 7da3e1c and 51e6fb4.

📒 Files selected for processing (1)
  • test/extended/machine_config/pinnedimages.go

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2026

@isabella-janssen: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

2 participants