Add federation to skmo#3766

Open
vakwetu wants to merge 16 commits into openstack-k8s-operators:main from vakwetu:add-federation-to-skmo

Conversation

@vakwetu
Contributor

@vakwetu vakwetu commented Mar 13, 2026

Add multi-namespace SKMO scenario and playbooks

This PR contains playbooks supporting the Single Keystone Multi-region OpenStack (SKMO)
scenario, which is further defined in openstack-k8s-operators/architecture#716.
This scenario is a modification of the multi-namespace VA, with the addition of federation and cinder
volume support.

In addition, I've added a few small patches to fix issues I encountered while repeatedly testing this scenario. They are small fixes that make the code more idempotent or robust so that it can be re-run.

There are a lot more details in each of the commit messages.

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3f0763b046e041c18b43ec998692e6d3

openstack-k8s-operators-content-provider FAILURE in 10m 52s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal-minor-update SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
✔️ cifmw-pod-zuul-files SUCCESS in 4m 34s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 05s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 53s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 55s
✔️ cifmw-architecture-validate-hci SUCCESS in 3m 46s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 21s
✔️ cifmw-molecule-federation SUCCESS in 1m 59s

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/735d0c0530b44e039353be5e0993611a

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 46m 16s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 45s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 33m 32s
cifmw-crc-podified-edpm-baremetal-minor-update RETRY_LIMIT in 24m 48s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 46s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 01s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 5m 26s
✔️ cifmw-pod-pre-commit SUCCESS in 9m 53s
✔️ cifmw-architecture-validate-hci SUCCESS in 3m 51s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 19s
✔️ cifmw-molecule-federation SUCCESS in 1m 33s

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/f424a1444f9247a78d0afc7cb1f4660f

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 11m 03s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 24m 05s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 23m 46s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 55m 56s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 54s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 37s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 52s
✔️ cifmw-pod-pre-commit SUCCESS in 8m 18s
cifmw-architecture-validate-hci FAILURE in 3m 34s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 52s
✔️ cifmw-molecule-federation SUCCESS in 2m 04s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 30s

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch 5 times, most recently from 7b69e43 to b0ed8a7 Compare March 20, 2026 19:37
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9a5b3bdb290346f4afb91921e37419c7

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 13m 36s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 26m 28s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 35m 50s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 2h 00m 16s
✔️ cifmw-pod-zuul-files SUCCESS in 29m 17s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 59s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 25s
✔️ cifmw-pod-pre-commit SUCCESS in 9m 09s
cifmw-architecture-validate-hci FAILURE in 3m 54s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 22s
✔️ cifmw-molecule-federation SUCCESS in 2m 03s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 04s

@vakwetu
Contributor Author

vakwetu commented Mar 20, 2026

recheck

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d1f9efba90624c1595998f89fea46d3e

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 13m 34s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 19m 04s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 29m 43s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 58m 41s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 45s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 50s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 32s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 45s
cifmw-architecture-validate-hci FAILURE in 4m 02s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 13s
✔️ cifmw-molecule-federation SUCCESS in 2m 00s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 09s

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch from b0ed8a7 to 7451a7a Compare March 23, 2026 16:05
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d5079756cc094c5391a494ef5d15c918

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 08m 48s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 24s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 26m 13s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 53m 58s
✔️ cifmw-pod-zuul-files SUCCESS in 5m 32s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 10m 10s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 6m 00s
✔️ cifmw-pod-pre-commit SUCCESS in 9m 07s
cifmw-architecture-validate-hci FAILURE in 4m 10s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 7m 35s
✔️ cifmw-molecule-federation SUCCESS in 1m 59s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 5m 33s

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch from fab3b0f to a05e410 Compare March 23, 2026 19:04
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2a65a8569d4646ea8661838fe39c5419

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 06m 10s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 08s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 23m 23s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 53m 49s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 44s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 21s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 5m 03s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 47s
cifmw-architecture-validate-hci FAILURE in 3m 49s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 44s
✔️ cifmw-molecule-federation SUCCESS in 1m 34s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 14s
✔️ cifmw-molecule-openshift_adm SUCCESS in 1m 57s

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch from a05e410 to 46da368 Compare March 23, 2026 21:29
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b66f2e8f173d469781c4c58a5f42cfc3

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 05m 38s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 19m 03s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 27m 03s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 1h 52m 37s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 46s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 37s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 47s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 53s
cifmw-architecture-validate-hci FAILURE in 3m 22s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 15s
✔️ cifmw-molecule-federation SUCCESS in 2m 00s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 24s
✔️ cifmw-molecule-openshift_adm SUCCESS in 1m 57s

vakwetu and others added 4 commits March 25, 2026 21:20
MachineConfigs applied during devscripts install trigger an MCO update
cycle that runs asynchronously after the cluster becomes reachable.  On
compact 3-master clusters the MCO controller can enter a permanent
deadlock: all nodes reboot, apply the new config, and report
state=Done with desiredDrain=lastAppliedDrain=uncordon-*, but the
controller never issues the final kubectl uncordon.  This leaves all
nodes SchedulingDisabled indefinitely, causing every subsequent cluster
operator to degrade and the deployment to time out.

Add a retry loop in wait_for_cluster.yml (run as part of the
openshift_adm 'stable' operation after devscripts post-install) that:

- Polls MachineConfigPool status every 30 s for up to 30 minutes.
- If a pool is updating normally (nodes being drained/rebooted in
  sequence) it waits without interrupting the MCO mid-cycle.
- If it detects the stuck state (updatedMachineCount == machineCount
  but readyMachineCount == 0) it runs 'oc adm uncordon' on all nodes
  to break the deadlock, then continues polling.
- Only proceeds to 'oc adm wait-for-stable-cluster' once all pools
  report Updated=True.
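
The decision made on each poll can be sketched in Python as below. This is only an illustration of the rule described above, not the task in wait_for_cluster.yml: the function name `classify_pool` is hypothetical, the input is assumed to be the `.status` object of one MachineConfigPool, and the `machineCount > 0` guard for empty pools is an added assumption.

```python
from typing import Any, Dict


def classify_pool(status: Dict[str, Any]) -> str:
    """Classify one MachineConfigPool .status dict.

    Returns 'updated', 'stuck', or 'updating'.
    """
    conditions = status.get("conditions", [])
    # A pool is finished once it reports the condition Updated=True.
    if any(c.get("type") == "Updated" and c.get("status") == "True"
           for c in conditions):
        return "updated"

    machine = status.get("machineCount", 0)
    updated = status.get("updatedMachineCount", 0)
    ready = status.get("readyMachineCount", 0)

    # Stuck state from the commit message: every node has the new config,
    # but none is ready because the nodes were never uncordoned.
    # Assumption: empty pools (machineCount == 0) are never treated as stuck.
    if machine > 0 and updated == machine and ready == 0:
        return "stuck"

    # Otherwise the MCO is still draining/rebooting nodes; keep waiting.
    return "updating"
```

A caller would presumably uncordon the nodes only when a pool is classified as 'stuck', and keep polling until every pool is 'updated'.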

Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

Refactor how the CA bundle secret is managed across federation hooks
to avoid relying on kustomize timing and make the logic self-healing:

- federation/hook_controlplane_config.yml: Dynamically resolve the CA
  bundle secret name by reading the live OSCP state (using the existing
  caBundleSecretName if set, falling back to cifmw_custom_ca_certs_secret_name
  or 'custom-ca-certs'). Create or update the secret with the Keycloak CA,
  and patch the OSCP to set caBundleSecretName only when it is not yet set.

- federation/run_openstack_auth_setup.yml: Build the full CA list used
  for auth testing by fetching the openstackclient pod's own system CA
  bundle as the base (which already trusts RHOSO internal CAs), then
  appending the ingress-operator CA. This avoids trust mismatches between
  controller-0 and the pod.

- federation/defaults/main.yml: Rename cifmw_federation_ca_bundle_secret_name
  to cifmw_custom_ca_certs_secret_name to reflect that the variable is not
  federation-specific.

- hooks/playbooks/skmo/update-central-ca-bundle.yaml: Merge the two stage-6
  post-deploy playbooks (trust-leaf-ca.yaml and ensure-central-ca-bundle.yaml)
  into a single idempotent playbook that resolves the secret name dynamically,
  creates or updates the bundle with leaf region root CAs, patches the OSCP
  when caBundleSecretName is unset, and waits for the leaf CA fingerprint to
  appear in combined-ca-bundle before continuing.

- kustomize_deploy/execute_step.yml: Add | string filters to OSDPD suffix
  handling so that YAML integer interpretation does not cause a TypeError
  when the timestamp suffix is checked or concatenated.

Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

Ansible's default() filter (without boolean=True) only substitutes
Undefined values, not empty strings.  cifmw_custom_ca_certs_secret_name
is defined as "" in defaults/main.yml, so:

  | default(cifmw_custom_ca_certs_secret_name | default('custom-ca-certs'))

evaluated the inner default() to "" (defined, not undefined), and the
outer default() then received "" instead of Undefined, leaving the
secret name empty and causing the kubernetes.core.k8s task to fail
with "metadata.name: Required value".

Fix by passing true as the second argument to both default() calls so
that falsy values (including empty strings) are also replaced.
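
The difference can be reproduced with a small pure-Python model of the filter. This is a sketch of the semantics only, not Jinja2 itself: the `Undefined` class and the `resolve_secret_name` helper are stand-ins, and the left-hand value of the outer default() (called `live_name` here) is hypothetical because the expression quoted above does not show it.

```python
class Undefined:
    """Stand-in for Jinja2's Undefined type (illustrative model only)."""

    def __bool__(self) -> bool:
        return False


def default(value, fallback, boolean=False):
    """Model of default(): replace Undefined, and falsy values if boolean."""
    if isinstance(value, Undefined) or (boolean and not value):
        return fallback
    return value


def resolve_secret_name(live_name, configured_name, boolean):
    """Model of: live_name | default(configured_name | default('custom-ca-certs'))."""
    inner = default(configured_name, "custom-ca-certs", boolean)
    return default(live_name, inner, boolean)


# configured_name is "" (defined but empty), as in defaults/main.yml.
broken = resolve_secret_name(Undefined(), "", boolean=False)
fixed = resolve_secret_name(Undefined(), "", boolean=True)
print(repr(broken), repr(fixed))
```

With `boolean=False` the empty string survives both calls, which matches the empty metadata.name described above; with `boolean=True` the literal fallback is used.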

Affects hook_controlplane_config.yml and update-central-ca-bundle.yaml.

Made-with: Cursor
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

rpm-ostree usroverlay returns exit code 1 with the message
"Deployment is already in unlocked state: development" when the
CoreOS node is already in the unlocked overlay state from a previous
run. This caused the pcp_metrics hook to abort the entire deployment
on re-runs without a full node reboot.

Register the result and only treat non-zero exit codes as failures
when the stderr does not contain the "already in unlocked state"
message, making the task idempotent across multiple deploy attempts.
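
The failure condition amounts to the following predicate, sketched here in Python. The function name `usroverlay_failed` is hypothetical, and the matched text is taken from the message quoted above.

```python
ALREADY_UNLOCKED = "already in unlocked state"


def usroverlay_failed(rc: int, stderr: str) -> bool:
    """Return True only for failures other than 'already unlocked'."""
    # A non-zero exit code is tolerated when the node is already in the
    # unlocked overlay state left behind by a previous run.
    return rc != 0 and ALREADY_UNLOCKED not in stderr
```

In an Ansible task this would presumably correspond to a registered result checked in a failed_when expression.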

Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch from 0ac166c to 7f3b9b2 Compare March 25, 2026 21:21
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a1097f76a96640c7a201d239e8639afc

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 15m 42s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 24m 01s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 40m 30s
✔️ cifmw-crc-podified-edpm-baremetal-minor-update SUCCESS in 2h 02m 22s
✔️ cifmw-pod-zuul-files SUCCESS in 6m 04s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 30s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 4m 47s
cifmw-pod-pre-commit FAILURE in 9m 26s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 08s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 5m 30s
✔️ cifmw-molecule-federation SUCCESS in 2m 25s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 20s
✔️ cifmw-molecule-openshift_adm SUCCESS in 2m 21s
✔️ cifmw-molecule-pcp_metrics SUCCESS in 2m 09s

Replace python3 -c JSON parsing in wait_for_cluster.yml with jq expressions.
Move the inline python3 heredoc for OSDPD renaming in execute_step.yml to a
standalone script (roles/kustomize_deploy/files/uniquify_osdpd.py) invoked via
ansible.builtin.script. Replace the shell+openssl+python fingerprint loop in
update-central-ca-bundle.yaml with a kubernetes.core.k8s_info until task that
checks for the leaf cert PEM as a substring of the combined bundle using Jinja2.
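
The substring check can be illustrated in Python as below. This is a sketch under assumptions, not the Jinja2 expression used in the playbook: the helper names are hypothetical, and the whitespace and line-ending normalisation is an added precaution that the commit message does not mention.

```python
def normalize_pem(pem: str) -> str:
    """Strip surrounding whitespace and normalise CRLF line endings to LF."""
    return pem.replace("\r\n", "\n").strip()


def leaf_cert_in_bundle(leaf_pem: str, bundle_pem: str) -> bool:
    """Return True if the leaf certificate PEM text appears in the bundle."""
    leaf = normalize_pem(leaf_pem)
    # An empty leaf would trivially match any bundle, so reject it.
    return bool(leaf) and leaf in normalize_pem(bundle_pem)
```

A plain substring match like this only succeeds when both PEM encodings wrap the base64 body identically, which is worth keeping in mind if the certificates come from different tools.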

Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Made-with: Cursor

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch from 7f3b9b2 to 6767ae6 Compare March 26, 2026 01:14
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8a51bd5728e7496f9f1135c10dd0bb71

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 49m 39s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 15m 58s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 36m 28s
cifmw-crc-podified-edpm-baremetal-minor-update RETRY_LIMIT in 31m 02s
✔️ cifmw-pod-zuul-files SUCCESS in 6m 07s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 9m 39s
✔️ cifmw-pod-k8s-snippets-source SUCCESS in 5m 36s
✔️ cifmw-pod-pre-commit SUCCESS in 9m 22s
✔️ cifmw-architecture-validate-hci SUCCESS in 4m 29s
✔️ cifmw-molecule-ci_gen_kustomize_values SUCCESS in 6m 00s
✔️ cifmw-molecule-federation SUCCESS in 2m 02s
✔️ cifmw-molecule-kustomize_deploy SUCCESS in 4m 11s
✔️ cifmw-molecule-openshift_adm SUCCESS in 2m 02s
✔️ cifmw-molecule-pcp_metrics SUCCESS in 1m 58s

@abays
Contributor

abays commented Mar 26, 2026

recheck

@vakwetu
Contributor Author

vakwetu commented Mar 26, 2026

Just pinging folks who seem to have merged stuff here before.

I'll update the description with a little more detail.

@abays
Contributor

abays commented Mar 26, 2026

@danpawlik @amartyasinha @evallesp Could we humbly request your review on this?

Contributor

@fultonj fultonj left a comment

/lgtm

(I have one kustomize question below but it's non-blocking -- I know this has been tested)

@openshift-ci openshift-ci bot added lgtm and removed lgtm labels Mar 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from fultonj. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fultonj
Contributor

fultonj commented Mar 26, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 26, 2026
…tion

Replace the JSON Patch (op/path/value) entries in the kustomize file
written by hook_controlplane_config.yml with a single strategic merge
patch. The JSON Patch approach was fragile: `add /spec/tls/caBundleSecretName`
would fail if spec.tls had no parent yet, and adding the parent first as
an empty dict would clobber existing TLS fields. A strategic merge patch
merges at each level, so it works regardless of whether spec.tls already
exists and leaves any pre-existing TLS fields untouched.
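
The contrast between the two patch styles can be modelled in Python as below. This is a simplified sketch, not kustomize's implementation: `json_patch_add` models only the rule that a JSON Patch `add` needs an existing parent object, `merge_patch` models only the recursive merging of maps (not list merge keys or deletions), and the `ingress` field in the usage lines is a hypothetical pre-existing TLS setting.

```python
import copy


def json_patch_add(doc: dict, path: str, value) -> None:
    """Minimal model of a JSON Patch 'add' on object members.

    Raises KeyError when any parent of the target does not exist.
    """
    keys = path.strip("/").split("/")
    target = doc
    for key in keys[:-1]:
        if key not in target:
            raise KeyError(f"parent '{key}' does not exist")
        target = target[key]
    target[keys[-1]] = value


def merge_patch(base: dict, patch: dict) -> dict:
    """Recursively merge patch into a copy of base, level by level."""
    result = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are maps: merge instead of overwriting.
            result[key] = merge_patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


patch = {"spec": {"tls": {"caBundleSecretName": "custom-ca-certs"}}}

# Works when spec.tls is absent...
no_tls = merge_patch({"spec": {}}, patch)

# ...and preserves existing TLS fields when it is present.
with_tls = merge_patch({"spec": {"tls": {"ingress": {"enabled": True}}}}, patch)
```

Calling `json_patch_add({"spec": {}}, "/spec/tls/caBundleSecretName", "custom-ca-certs")` raises KeyError in this model, which mirrors the fragility described above.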

Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Made-with: Cursor

@vakwetu vakwetu force-pushed the add-federation-to-skmo branch from 435f77f to 49572b5 Compare March 26, 2026 20:22
@openshift-ci openshift-ci bot removed the lgtm label Mar 26, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 26, 2026

New changes are detected. LGTM label has been removed.


3 participants