Add federation to skmo #3766
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3f0763b046e041c18b43ec998692e6d3 ❌ openstack-k8s-operators-content-provider FAILURE in 10m 52s |
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/735d0c0530b44e039353be5e0993611a ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 46m 16s |
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/f424a1444f9247a78d0afc7cb1f4660f ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 11m 03s |
7b69e43 to b0ed8a7
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9a5b3bdb290346f4afb91921e37419c7 ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 13m 36s |
|
recheck |
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d1f9efba90624c1595998f89fea46d3e ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 13m 34s |
b0ed8a7 to 7451a7a
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d5079756cc094c5391a494ef5d15c918 ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 08m 48s |
fab3b0f to a05e410
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2a65a8569d4646ea8661838fe39c5419 ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 06m 10s |
a05e410 to 46da368
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b66f2e8f173d469781c4c58a5f42cfc3 ✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 05m 38s |
08a735f to 763eca7
763eca7 to 0ebed80
0ebed80 to e487fcc
MachineConfigs applied during devscripts install trigger an MCO update
cycle that runs asynchronously after the cluster becomes reachable. On
compact 3-master clusters the MCO controller can enter a permanent
deadlock: all nodes reboot, apply the new config, and report state=Done
with desiredDrain=lastAppliedDrain=uncordon-*, but the controller never
issues the final kubectl uncordon. This leaves all nodes
SchedulingDisabled indefinitely, causing every subsequent cluster
operator to degrade and the deployment to time out.
Add a retry loop in wait_for_cluster.yml (run as part of the
openshift_adm 'stable' operation after devscripts post-install) that:
- Polls MachineConfigPool status every 30 s for up to 30 minutes.
- If a pool is updating normally (nodes being drained/rebooted in
  sequence) it waits without interrupting the MCO mid-cycle.
- If it detects the stuck state (updatedMachineCount == machineCount
  but readyMachineCount == 0) it runs 'oc adm uncordon' on all nodes to
  break the deadlock, then continues polling.
- Only proceeds to 'oc adm wait-for-stable-cluster' once all pools
  report Updated=True.
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
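For reviewers who want the decision rule from the commit above in one place, here is a minimal Python sketch of how a single MachineConfigPool status could be classified. It is only an illustration, not the code in wait_for_cluster.yml (which uses `oc` and `jq`). The function name `classify_pool` and the three return labels are hypothetical. The field names are the ones the commit message mentions.

```python
def classify_pool(status: dict) -> str:
    """Classify one MachineConfigPool `.status` dict (hypothetical helper).

    Returns "stuck", "updated", or "updating".
    """
    machine = status.get("machineCount", 0)
    updated = status.get("updatedMachineCount", 0)
    ready = status.get("readyMachineCount", 0)
    # Map condition type -> status string, e.g. {"Updated": "True"}.
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}

    # Stuck state from the commit message: every machine has the new
    # config, but none is ready (all nodes are still cordoned).
    if machine > 0 and updated == machine and ready == 0:
        return "stuck"
    # Only a pool that reports Updated=True counts as finished.
    if conditions.get("Updated") == "True":
        return "updated"
    # Anything else is treated as a normal in-progress rollout: keep waiting.
    return "updating"
```

In this sketch only the "stuck" result would lead to `oc adm uncordon`. An "updating" pool is left alone so the MCO is not interrupted mid-cycle.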
Refactor how the CA bundle secret is managed across federation hooks to
avoid relying on kustomize timing and make the logic self-healing:
- federation/hook_controlplane_config.yml: Dynamically resolve the CA
  bundle secret name by reading the live OSCP state (using the existing
  caBundleSecretName if set, falling back to
  cifmw_custom_ca_certs_secret_name or 'custom-ca-certs'). Create or
  update the secret with the Keycloak CA, and patch the OSCP to set
  caBundleSecretName only when it is not yet set.
- federation/run_openstack_auth_setup.yml: Build the full CA list used
  for auth testing by fetching the openstackclient pod's own system CA
  bundle as the base (which already trusts RHOSO internal CAs), then
  appending the ingress-operator CA. This avoids trust mismatches
  between controller-0 and the pod.
- federation/defaults/main.yml: Rename
  cifmw_federation_ca_bundle_secret_name to
  cifmw_custom_ca_certs_secret_name to reflect that the variable is not
  federation-specific.
- hooks/playbooks/skmo/update-central-ca-bundle.yaml: Merge the two
  stage-6 post-deploy playbooks (trust-leaf-ca.yaml and
  ensure-central-ca-bundle.yaml) into a single idempotent playbook that
  resolves the secret name dynamically, creates or updates the bundle
  with leaf region root CAs, patches the OSCP when caBundleSecretName is
  unset, and waits for the leaf CA fingerprint to appear in
  combined-ca-bundle before continuing.
- kustomize_deploy/execute_step.yml: Add | string filters to OSDPD
  suffix handling so that YAML integer interpretation does not cause a
  TypeError when the timestamp suffix is checked or concatenated.
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
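The last bullet of the commit above (the `| string` filters) addresses a failure mode that is easy to reproduce outside Ansible. The Python sketch below is an analogy only: it assumes a purely numeric timestamp suffix gets parsed from YAML as an integer, and it shows why coercing to a string first avoids the TypeError. The helper `with_suffix` and the name `edpm-deployment` are hypothetical and do not come from execute_step.yml.

```python
def with_suffix(name: str, suffix) -> str:
    """Append `suffix` to `name` unless it is already there.

    `suffix` may arrive as an int when YAML parses a purely numeric
    timestamp, so coerce it first (the Python analogue of `| string`).
    """
    suffix = str(suffix)
    if name.endswith(suffix):
        return name  # already suffixed, leave unchanged
    return f"{name}-{suffix}"
```

Without the `str()` call, both `name.endswith(1700000000)` and `"edpm-deployment-" + 1700000000` raise TypeError in Python, which mirrors the error class the commit describes.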
Ansible's default() filter (without boolean=True) only substitutes
Undefined values, not empty strings. cifmw_custom_ca_certs_secret_name
is defined as "" in defaults/main.yml, so:
| default(cifmw_custom_ca_certs_secret_name | default('custom-ca-certs'))
evaluated the inner default() to "" (defined, not undefined), and the
outer default() then received "" instead of Undefined, leaving the
secret name empty and causing the kubernetes.core.k8s task to fail
with "metadata.name: Required value".
Fix by passing true as the second argument to both default() calls so
that falsy values (including empty strings) are also replaced.
Affects hook_controlplane_config.yml and update-central-ca-bundle.yaml.
Made-with: Cursor
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
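The default() behaviour described in this commit can be modelled in a few lines of plain Python. This is a sketch of the filter's semantics, not Jinja2 itself: the `Undefined` class and `default` function are stand-ins written for illustration. `live_name` is a hypothetical name for whatever sits to the left of the pipe, which the excerpt above does not show.

```python
class Undefined:
    """Stand-in for Jinja2's Undefined (falsy, like the real one)."""

    def __bool__(self) -> bool:
        return False


UNDEFINED = Undefined()


def default(value, fallback="", boolean=False):
    """Model of the default() filter: replace Undefined, and also any
    falsy value when `boolean` is true."""
    if isinstance(value, Undefined) or (boolean and not value):
        return fallback
    return value


def resolve_before(live_name, configured_name):
    # Original expression: both default() calls without the second argument.
    return default(live_name, default(configured_name, "custom-ca-certs"))


def resolve_after(live_name, configured_name):
    # Fixed expression: true passed as the second argument to both calls.
    return default(
        live_name, default(configured_name, "custom-ca-certs", True), True
    )
```

With `configured_name` set to `""` (as in defaults/main.yml) and `live_name` undefined, `resolve_before` yields an empty string, while `resolve_after` yields `custom-ca-certs`.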
rpm-ostree usroverlay returns exit code 1 with the message "Deployment
is already in unlocked state: development" when the CoreOS node is
already in the unlocked overlay state from a previous run. This caused
the pcp_metrics hook to abort the entire deployment on re-runs without a
full node reboot.
Register the result and only treat non-zero exit codes as failures when
the stderr does not contain the "already in unlocked state" message,
making the task idempotent across multiple deploy attempts.
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
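The failure condition from this commit is small enough to state as a predicate. The Python sketch below mirrors the rule as described (non-zero exit code and no "already in unlocked state" in stderr). It is not the Ansible task itself, and the function name `usroverlay_failed` is hypothetical.

```python
ALREADY_UNLOCKED = "already in unlocked state"


def usroverlay_failed(rc: int, stderr: str) -> bool:
    """Return True only for real failures of `rpm-ostree usroverlay`.

    A non-zero exit code is tolerated when stderr says the node is
    already unlocked, which is what makes re-runs idempotent.
    """
    return rc != 0 and ALREADY_UNLOCKED not in stderr
```

Matching on the shorter substring rather than the full message means the check does not depend on the trailing state name ("development" in the message quoted above).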
0ac166c to 7f3b9b2
Replace python3 -c JSON parsing in wait_for_cluster.yml with jq
expressions.
Move the inline python3 heredoc for OSDPD renaming in execute_step.yml
to a standalone script (roles/kustomize_deploy/files/uniquify_osdpd.py)
invoked via ansible.builtin.script.
Replace the shell+openssl+python fingerprint loop in
update-central-ca-bundle.yaml with a kubernetes.core.k8s_info until task
that checks for the leaf cert PEM as a substring of the combined bundle
using Jinja2.
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Made-with: Cursor
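The substring check that replaces the fingerprint loop can be pictured with the Python sketch below. It only illustrates the idea. The playbook does this in a Jinja2 `until` expression, and the function name `leaf_ca_in_bundle` is hypothetical.

```python
def leaf_ca_in_bundle(leaf_pem: str, bundle_pem: str) -> bool:
    """Return True when the leaf certificate's PEM text appears verbatim
    inside the combined CA bundle."""
    leaf = leaf_pem.strip()  # ignore leading/trailing whitespace only
    # An empty leaf would trivially match any bundle, so reject it.
    return bool(leaf) and leaf in bundle_pem
```

One caveat of any verbatim comparison: it assumes the bundle stores the certificate with the same line wrapping and line endings as the leaf PEM. A fingerprint comparison would not depend on that, which is the trade-off for the simpler task.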
7f3b9b2 to 6767ae6
|
recheck |
|
Just pinging folks who seem to have merged stuff here before. I'll update the description with a little more detail. |
|
@danpawlik @amartyasinha @evallesp Could we humbly request your review on this? |
fultonj
left a comment
/lgtm
(I have one kustomize question below but it's non-blocking -- I know this has been tested)
|
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
…tion
Replace the JSON Patch (op/path/value) entries in the kustomize file
written by hook_controlplane_config.yml with a single strategic merge
patch.
The JSON Patch approach was fragile: `add /spec/tls/caBundleSecretName`
would fail if spec.tls had no parent yet, and adding the parent first as
an empty dict would clobber existing TLS fields. A strategic merge patch
merges at each level, so it works regardless of whether spec.tls already
exists and leaves any pre-existing TLS fields untouched.
Signed-off-by: Ade Lee <alee@redhat.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Made-with: Cursor
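To make the difference between the two patch styles concrete, here is a small Python sketch. It models a JSON Patch `add` on object members and a plain recursive map merge. Both functions are simplified illustrations written for this comment, not kustomize's implementation. The merge ignores the list-handling rules of real strategic merge patches, and the path handling ignores JSON Pointer escaping. The `ingress`/`enabled` fields in the usage note are placeholders.

```python
import copy


def json_patch_add(doc: dict, path: str, value) -> dict:
    """Minimal model of a JSON Patch 'add' on object members.

    Fails when a parent of the target does not exist, and overwrites the
    target when it does exist.
    """
    result = copy.deepcopy(doc)
    *parents, leaf = path.lstrip("/").split("/")
    target = result
    for key in parents:
        if key not in target:
            raise KeyError(f"parent '{key}' of {path} does not exist")
        target = target[key]
    target[leaf] = value
    return result


def merge_patch(doc: dict, patch: dict) -> dict:
    """Recursive map merge: descend into dicts, set everything else."""
    result = copy.deepcopy(doc)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result
```

With `{"spec": {}}` as input, `json_patch_add(doc, "/spec/tls/caBundleSecretName", "custom-ca-certs")` raises because `tls` is missing, while `merge_patch(doc, {"spec": {"tls": {"caBundleSecretName": "custom-ca-certs"}}})` creates it. With `{"spec": {"tls": {"ingress": {"enabled": True}}}}` as input, adding `/spec/tls` as `{}` first wipes `ingress`, whereas the merge keeps it.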
435f77f to 49572b5
|
New changes are detected. LGTM label has been removed. |
Add multi-namespace SKMO scenario and playbooks
This PR contains playbooks in support of the Single Keystone Multi-region OpenStack (SKMO)
scenario - which is further defined in openstack-k8s-operators/architecture#716
This scenario is a modification of the multi-namespace VA, with the addition of federation and cinder
volume support.
In addition, I've added a few small patches to fix issues that I encountered while repeatedly testing this scenario. They are small fixes that make the code more idempotent or robust so that it can be re-run.
There are a lot more details in each of the commit messages.