Skip to content

separate pvc retain/delete into a separate test case to control timeline better.#1139

Open
kannon92 wants to merge 1 commit intokubernetes-sigs:mainfrom
kannon92:volume-claim-policy-flake
Open

separate pvc retain/delete into a separate test case to control timeline better.#1139
kannon92 wants to merge 1 commit intokubernetes-sigs:mainfrom
kannon92:volume-claim-policy-flake

Conversation

@kannon92
Copy link
Contributor

@kannon92 kannon92 commented Jan 26, 2026

What type of PR is this?

/kind flake

What this PR does / why we need it:

Separate integration test with PVC to a separate test case as there is a lot of orchestration needed.

Moving to a separate test case makes it clear what we need to run in order and hopefully deflakes the test.

Which issue(s) this PR fixes:

Fixes #1133

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Copilot AI review requested due to automatic review settings January 26, 2026 21:32
@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Jan 26, 2026
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g January 26, 2026 21:32
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 26, 2026
@netlify
Copy link

netlify bot commented Jan 26, 2026

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 3f38d21
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-jobset/deploys/6977dd7f7b18520008038611

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the PVC retention/reuse integration test for the JobSet controller into its own dedicated Ginkgo test case to better control test ordering and reduce flakiness around PVC lifecycle handling.

Changes:

  • Removed the ginkgo.Entry("jobset with VolumeClaimPolicies should create two PVC with Delete and Retain policies", ...) from the generic DescribeTable-driven controller tests.
  • Added a standalone ginkgo.It("JobSet with VolumeClaimPolicies should create, retain, and reuse PVCs", ...) that explicitly creates a namespace, a first JobSet with Delete/Retain PVC policies, verifies completion, deletes and cleans up PVCs, and then creates a second JobSet that reuses the retained PVC.
  • Kept the existing suspended-JobSet-with-VolumeClaimPolicies test and the controller helper functions, while reusing them where appropriate (e.g., completeAllJobs, JobSetCompleted).

Comment on lines +2519 to +2530
ginkgo.By("manually deleting PVCs with JobSet OwnerReference")
var pvcList corev1.PersistentVolumeClaimList
gomega.Expect(k8sClient.List(ctx, &pvcList, client.InNamespace(namespace))).To(gomega.Succeed())
for i := range pvcList.Items {
pvc := &pvcList.Items[i]
// Envtest doesn't support garbage collection.
if len(pvc.OwnerReferences) > 0 {
// Remove finalizers to allow deletion in envtest.
pvc.Finalizers = []string{}
gomega.Expect(k8sClient.Update(ctx, pvc)).To(gomega.Succeed())
gomega.Expect(k8sClient.Delete(ctx, pvc)).To(gomega.Succeed())
}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manual PVC cleanup relies solely on len(pvc.OwnerReferences) > 0 to decide which PVCs to delete. Because the JobSet controller removes the owner reference from the retained PVC asynchronously when the JobSet is deleted, this loop can race with that reconciliation and end up deleting the PVC that is supposed to be retained if the controller has not yet cleared its owner reference. That makes the "retain" semantics under test non-deterministic and can cause flakes when the retained PVC is deleted before the later assertion. To avoid this, the cleanup should deterministically target only the PVCs that are expected to be deleted (for example by matching on the known test-volume name or retention policy) rather than using the presence of an owner reference as the discriminator.

Copilot uses AI. Check for mistakes.
},
},
}),
ginkgo.Entry("suspended jobset with VolumeClaimPolicies should create PVCs", &testCase{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to move suspend tests as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check today but I didn't see any flakes on that job.

@kannon92
Copy link
Contributor Author

/test pull-jobset-test-integration-main

@k8s-ci-robot
Copy link
Contributor

@kannon92: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-jobset-test-integration-main 3f38d21 link true /test pull-jobset-test-integration-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kannon92
Copy link
Contributor Author

/hold

Looks to still be flaky.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/flake Categorizes issue or PR as related to a flaky test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

3 participants

Comments