Description
1. Environment:
- CrateDB Operator Version: `crate/crate-operator:2.47.1` (upgraded from `2.42.0` during troubleshooting)
- CrateDB Version (Targeted in CR): `5.8.1`
- Kubernetes Version:
  - Client: `v1.31.6` (gitCommit: `6b3560758b37680cb713dfc71da03c04cadd657c`)
  - Server: `v1.31.3` (gitCommit: `c83cbee114ddb732cdc06d3d1b62c9eb9220726f`)
- Kubernetes Distribution: kubeadm-deployed cluster (control plane on Ubuntu 24.04.2 LTS, worker nodes on Ubuntu 22.04.5 LTS, some with Azure kernels)
- Number of Kubernetes Nodes: 4 (1 control plane: `iot-api`; 3 worker nodes: `vmhygenco`, `vmhygenco2`, `vmhygenco3`)
2. Observed Behavior & Symptoms:
Even after upgrading to CrateDB Operator v2.47.1 and its corresponding CRDs, we continue to encounter critical issues when attempting to manage a CrateDB cluster with more than one data node pool, specifically when updating an existing second pool or when trying to add a second pool to a single-pool cluster via a CR update.
- `KeyError: 1` during Updates to the Second Node Pool: After successfully creating a two-pool cluster (`hot` and `hotter`) via a "clean slate" (delete the CR, restart the operator, then create a CR with both pools defined from the start), any subsequent attempt to modify the configuration of the second pool (`hotter`) in the `CrateDB` CR (e.g., changing `spec.nodes.data.resources.heapRatio` from `0.25` to `0.26`) results in the operator failing with a `KeyError: 1`. The traceback consistently points to `handle_update_cratedb.py` when accessing `old_spec[node_spec_idx]` (or the equivalent `old_spec_val[node_spec_idx]`):

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once  # Path from 2.47.1 logs
    result = await invoke_handler(
  # ... (elided for brevity) ...
  File "/etc/cloud/main.py", line 173, in cluster_update  # Path from 2.47.1 logs
    await update_cratedb(namespace, name, patch, status, diff, started)
  File "/usr/local/lib/python3.12/site-packages/crate/operator/handlers/handle_update_cratedb.py", line 126, in update_cratedb
    old_spec = old_spec[node_spec_idx]
KeyError: 1
```

- Initial Failure to Add a Second Pool via Update (observed with v2.42.0; behavior with v2.47.1 on this specific path needs re-confirmation but is suspected to be similar given the persistent `KeyError`): When starting with a single `hot` pool and then applying a CR update to add the `hotter` pool, the operator often failed to create the StatefulSet for the `hotter` pool. The `operator.cloud.crate.io/last` annotation would update to reflect the two-pool spec (a sketch for inspecting this annotation follows this list), but no resources for `hotter` would be provisioned. Subsequent changes would then often trigger the `KeyError: 1`.
- "Updating is superseded by..." Loop: During troubleshooting with v2.42.0, the operator frequently entered a state where it logged "Updating is superseded by resuming..." and then stalled, failing to progress. This loop was not re-tested extensively with v2.47.1 after the `KeyError` resurfaced during the pool update attempt.
- `nodepool` Functionality Works on Create: When a two-pool cluster is successfully created via the "clean slate" method, the `nodepool` directives are correctly translated into pod affinity/selectors, and pods schedule on the appropriate nodes. This suggests the create path handles node pool attributes correctly.
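To double-check the annotation symptom, the snippet below is a minimal sketch (our actual troubleshooting used `kubectl`) that reads the `operator.cloud.crate.io/last` annotation and counts the pools it stores. It assumes the annotation holds the last-handled spec as JSON, which matches what we saw when inspecting it manually; the cluster and namespace names are taken from the reproduction steps below:

```python
# Minimal sketch: inspect the operator's "last" annotation on the CR.
# Assumes the annotation stores the last-handled spec as JSON (as it
# appeared when inspected via kubectl); names match the repro steps.
import json
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cr = api.get_namespaced_custom_object(
    group="cloud.crate.io",
    version="v1",
    namespace="cratedb-ns",
    plural="cratedbs",
    name="my-cluster",
)

last = json.loads(cr["metadata"]["annotations"]["operator.cloud.crate.io/last"])
pools = last.get("spec", {}).get("nodes", {}).get("data", [])
print(f"{len(pools)} pool(s) stored in 'last':", [p.get("name") for p in pools])
```

If this prints both pools (as the `kubectl` inspection did for us), the annotation itself is not the source of the stale "old state".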
3. Expected Behavior:
- The CrateDB Operator should allow reliable modification of any data node pool's configuration in an existing multi-pool cluster.
- The operator should seamlessly add a new data node pool when it is defined in the `spec.nodes.data` array of an existing `CrateDB` CR.
- The operator should not encounter `KeyError` exceptions when diffing or applying changes to `spec.nodes.data`.
4. Steps to Reproduce (Illustrating `KeyError` on Update with Operator v2.47.1):
- Ensure CrateDB Operator `v2.47.1` and its corresponding CRDs (`cratedbs.cloud.crate.io` CRD at `generation: 2` or later) are installed.
- Perform a "clean slate" deployment:
  a. Ensure no `CrateDB` CRs or related resources exist in the target namespace (e.g., `cratedb-ns`).
  b. Restart the operator pod for a clean state.
  c. Apply a `CrateDB` CR that defines both `hot` and `hotter` data node pools from the outset (e.g., `hotter` with `heapRatio: 0.25`):

```yaml
# Initial two-pool CR for clean create
apiVersion: cloud.crate.io/v1
kind: CrateDB
metadata:
  name: my-cluster
  namespace: cratedb-ns
spec:
  cluster: {imageRegistry: crate, name: crate-dev, version: "5.8.1"}
  nodes:
    data:
      - name: hot
        replicas: 1
        nodepool: default-pool
        resources: {limits: {cpu: 4, memory: 8Gi}, disk: {count: 1, size: 50GiB, storageClass: local-path}, heapRatio: 0.25}
      - name: hotter
        replicas: 1
        nodepool: hotter-pool
        resources: {limits: {cpu: 4, memory: 13Gi}, disk: {count: 1, size: 100GiB, storageClass: local-path}, heapRatio: 0.25}
```

- Wait for both pools (`hot` and `hotter`) to be stable and running.
- Modify the `CrateDB` CR to change a property of the second pool (`hotter`), e.g., `heapRatio` from `0.25` to `0.26`:

```yaml
# ... (spec remains the same except for hotter.heapRatio) ...
      - name: hotter
        replicas: 1
        nodepool: hotter-pool
        resources: {limits: {cpu: 4, memory: 13Gi}, disk: {count: 1, size: 100GiB, storageClass: local-path}, heapRatio: 0.26} # Changed
```

- Apply the updated CR.
- Observation: The operator logs show the `cluster_update` handler failing with the `KeyError: 1` stack trace (a log-scan sketch follows below).
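For the final observation step, a small sketch like the following can scan the operator pod's logs for the failing handler; the operator namespace and label selector here are illustrative assumptions, so adjust them to wherever the operator deployment actually runs:

```python
# Sketch: scan operator logs for the failing cluster_update handler.
# The namespace and label selector below are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="crate-operator",
    label_selector="app.kubernetes.io/name=crate-operator",
)
for pod in pods.items:
    log = core.read_namespaced_pod_log(pod.metadata.name, pod.metadata.namespace)
    for line in log.splitlines():
        if "KeyError" in line or "update_cratedb" in line:
            print(f"{pod.metadata.name}: {line}")
```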
5. Analysis/Hypothesis of the Root Cause:
The fundamental issue persists in v2.47.1 and lies in `handle_update_cratedb.py`. The loop `for node_spec_idx in range(len(old_spec_val)):` attempts to compare elements of `old_spec_val` (representing `spec.nodes.data` from the previous state) and `new_spec_val` at matching indices.

The `KeyError: 1` when accessing `old_spec_val[node_spec_idx]` (where `node_spec_idx` is `1` for the second pool) indicates that the `old_spec_val` list used by the operator for that specific diff operation does not contain an element at index 1. This occurs even though the `operator.cloud.crate.io/last` annotation (when inspected via `kubectl`) appears to correctly store the two-pool configuration.

This suggests a problem with how the operator's internal cache or Kopf's diff mechanism retrieves and presents the "old state" for the `spec.nodes.data` array, particularly when modifications are made to pools other than the first one in the list. The operator seems to be diffing against an "old state" that contains only the first data pool, leading to the error when it processes the second pool.
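To make the hypothesis concrete, here is a deliberately simplified, hypothetical reconstruction of the failure mode. The variable names mirror the traceback, but the data shapes (in particular the dict-like "old state", which would explain a `KeyError` rather than an `IndexError`) and the loop bound are our assumptions:

```python
# Hypothetical, simplified reconstruction of the failure mode. Variable
# names mirror the traceback; the data shapes are our assumption.

new_spec_val = [  # spec.nodes.data as submitted in the updated CR
    {"name": "hot", "resources": {"heapRatio": 0.25}},
    {"name": "hotter", "resources": {"heapRatio": 0.26}},  # the modified pool
]

# "Old state" as the operator appears to see it during the diff: a mapping
# that only carries the first pool. Getting KeyError (not IndexError) is
# what suggests dict-like access against a sparse old state.
old_spec_val = {0: {"name": "hot", "resources": {"heapRatio": 0.25}}}

for node_spec_idx in range(len(new_spec_val)):
    old_spec = old_spec_val[node_spec_idx]  # raises KeyError: 1 at index 1
    print(old_spec)
```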
6. Successful Workaround (Only for Initial Create, Not for Updates):
The only method that achieved the desired two-pool configuration was a "clean slate" create: deleting any existing `CrateDB` CR and its resources, restarting the operator, and then applying a CR manifest that defines both pools from the very beginning. Subsequent attempts to update the second pool in this successfully created two-pool cluster trigger the `KeyError: 1`.
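For reference, the clean-slate sequence sketched in Python (we actually performed it with `kubectl`; the operator namespace, label selector, and manifest filename below are illustrative assumptions):

```python
# Sketch of the "clean slate" sequence using the Python Kubernetes client.
import yaml
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
core = client.CoreV1Api()

# 1. Delete the existing CR (then wait for its resources to be cleaned up).
api.delete_namespaced_custom_object(
    "cloud.crate.io", "v1", "cratedb-ns", "cratedbs", "my-cluster"
)

# 2. Restart the operator pod for a clean state (assumed label selector).
for pod in core.list_namespaced_pod(
    "crate-operator", label_selector="app.kubernetes.io/name=crate-operator"
).items:
    core.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

# 3. Apply the two-pool CR from the reproduction steps above.
with open("two-pool-cr.yaml") as f:
    api.create_namespaced_custom_object(
        "cloud.crate.io", "v1", "cratedb-ns", "cratedbs", yaml.safe_load(f)
    )
```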
7. Request:
We urgently request an investigation and fix for the `spec.nodes.data` diffing and update logic within the CrateDB operator (v2.47.1). The operator must reliably:
a. Allow the addition of new data node pools to an existing cluster via CR update.
b. Allow modifications to any existing data node pool in a multi-pool cluster without encountering `KeyError` exceptions.
c. Maintain a consistent and correct "previous state" representation for `spec.nodes.data` during diff operations.
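As an illustration of (c), a bounds-safe comparison loop could look roughly like the sketch below; `diff_node_pools` is a hypothetical helper, not the operator's actual code, and assumes both pool lists arrive as plain Python lists:

```python
from itertools import zip_longest

def diff_node_pools(old_pools, new_pools):
    """Compare spec.nodes.data pairwise; treat missing old entries as additions."""
    for idx, (old, new) in enumerate(zip_longest(old_pools, new_pools)):
        if old is None:
            print(f"pool[{idx}] {new['name']!r}: newly added, create its resources")
        elif new is None:
            print(f"pool[{idx}] {old['name']!r}: removed, scale down and clean up")
        elif old != new:
            print(f"pool[{idx}] {new['name']!r}: changed, reconcile in place")

# Example with the add-a-second-pool scenario from this report:
diff_node_pools(
    [{"name": "hot", "heapRatio": 0.25}],
    [{"name": "hot", "heapRatio": 0.25}, {"name": "hotter", "heapRatio": 0.25}],
)
```

With `zip_longest`, an entry present only in the new list is surfaced as an addition instead of raising on a missing old index.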
8. Additional Information:
- CRD Used: `cratedbs.cloud.crate.io` version `v1` (CRD `metadata.generation: 2`, `creationTimestamp: "2024-11-27T05:45:08Z"`).