
CrateDB Operator v2.47.1 Fails to Update or Reliably Add Second Data Node Pool Due to spec.nodes.data Diffing Issues (KeyError: 1) #747

@pySage

Description

1. Environment:

  • CrateDB Operator Version: crate/crate-operator:2.47.1 (Upgraded from 2.42.0 during troubleshooting)
  • CrateDB Version (Targeted in CR): 5.8.1
  • Kubernetes Version:
    • Client: v1.31.6 (gitCommit: 6b3560758b37680cb713dfc71da03c04cadd657c)
    • Server: v1.31.3 (gitCommit: c83cbee114ddb732cdc06d3d1b62c9eb9220726f)
  • Kubernetes Distribution: kubeadm-deployed cluster (Control plane on Ubuntu 24.04.2 LTS, Worker nodes on Ubuntu 22.04.5 LTS, some with Azure kernels).
  • Number of Kubernetes Nodes: 4 (1 control-plane: iot-api, 3 worker nodes: vmhygenco, vmhygenco2, vmhygenco3)

2. Observed Behavior & Symptoms:

Even after upgrading to CrateDB Operator v2.47.1 and its corresponding CRDs, we continue to encounter critical issues when attempting to manage a CrateDB cluster with more than one data node pool, specifically when updating an existing second pool or when trying to add a second pool to a single-pool cluster via a CR update.

  • KeyError: 1 during Updates to Second Node Pool:
    After successfully creating a two-pool cluster (hot and hotter) via a "clean slate" (delete the CR, restart the operator, then create a CR with both pools defined from the start), any subsequent attempt to modify the configuration of the second pool (hotter) in the CrateDB CR (e.g., changing spec.nodes.data[1].resources.heapRatio from 0.25 to 0.26) makes the operator fail with a KeyError: 1. The traceback consistently points to handle_update_cratedb.py, at the access old_spec[node_spec_idx] (or the equivalent old_spec_val[node_spec_idx]).
    Traceback (most recent call last):
      File "/usr/local/lib/python3.12/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once # Path from 2.47.1 logs
        result = await invoke_handler(
      # ... (elided for brevity) ...
      File "/etc/cloud/main.py", line 173, in cluster_update # Path from 2.47.1 logs
        await update_cratedb(namespace, name, patch, status, diff, started)
      File "/usr/local/lib/python3.12/site-packages/crate/operator/handlers/handle_update_cratedb.py", line 126, in update_cratedb
        old_spec = old_spec[node_spec_idx]
    KeyError: 1
    
  • Initial Failure to Add Second Pool via Update (observed with v2.42.0; behavior with v2.47.1 on this specific path still needs re-confirmation, but is suspected to be similar given the persistent KeyError): When starting with a single hot pool and then applying a CR update that adds the hotter pool, the operator often failed to create the StatefulSet for the hotter pool. The operator.cloud.crate.io/last annotation would update to reflect the two-pool spec (it can be inspected as in the sketch after this list), yet no resources for hotter would be provisioned. Subsequent changes would then often trigger the KeyError: 1.
  • "Updating is superseded by..." Loop: During troubleshooting with v2.42.0, the operator frequently entered a state logging "Updating is superseded by resuming..." followed by inactivity, failing to progress. This specific loop was not re-tested extensively with v2.47.1 after the KeyError resurfaced during the pool update attempt.
  • nodepool Functionality: When a two-pool cluster is successfully created via the "clean slate" method, the nodepool directives are correctly translated into pod affinity/selectors, and pods schedule on the appropriate nodes. This suggests the create path handles node pool attributes correctly.
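
For reference, the operator.cloud.crate.io/last annotation mentioned above can be inspected programmatically as well as with kubectl. Below is a minimal sketch using the kubernetes Python client; the assumption that the annotation value is the JSON-serialized last-handled spec (as with Kopf's diff-base storage) is ours, and the CR name/namespace match the reproduction in section 4 below.

    # Sketch: inspect the operator's recorded "old state" on the CR.
    # Assumes a valid kubeconfig and that the annotation stores JSON.
    import json

    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()
    cr = api.get_namespaced_custom_object(
        group="cloud.crate.io",
        version="v1",
        namespace="cratedb-ns",
        plural="cratedbs",
        name="my-cluster",
    )
    last = json.loads(cr["metadata"]["annotations"]["operator.cloud.crate.io/last"])
    # How many data pools does the recorded old state actually contain?
    print(len(last["spec"]["nodes"]["data"]))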

3. Expected Behavior:

  • The CrateDB Operator should allow reliable modification of any data node pool's configuration in an existing multi-pool cluster.
  • The operator should seamlessly add a new data node pool when it's defined in the spec.nodes.data array of an existing CrateDB CR.
  • The operator should not encounter KeyError exceptions when diffing or applying changes to spec.nodes.data.

4. Steps to Reproduce (Illustrating KeyError on Update with Operator v2.47.1):

  1. Ensure CrateDB Operator v2.47.1 and its corresponding CRDs (cratedbs.cloud.crate.io CRD at generation: 2 or later) are installed.
  2. Perform a "clean slate" deployment:
    a. Ensure no CrateDB CRs or related resources exist in the target namespace (e.g., cratedb-ns).
    b. Restart the operator pod for a clean state.
    c. Apply a CrateDB CR that defines both hot and hotter data node pools from the outset (e.g., hotter with heapRatio: 0.25).
    # Initial two-pool CR for clean create
    apiVersion: cloud.crate.io/v1
    kind: CrateDB
    metadata:
      name: my-cluster
      namespace: cratedb-ns
    spec:
      cluster: {imageRegistry: crate, name: crate-dev, version: "5.8.1"}
      nodes:
        data:
          - name: hot
            replicas: 1
            nodepool: default-pool
            resources: {limits: {cpu: 4, memory: 8Gi}, disk: {count: 1, size: 50GiB, storageClass: local-path}, heapRatio: 0.25}
          - name: hotter
            replicas: 1
            nodepool: hotter-pool
            resources: {limits: {cpu: 4, memory: 13Gi}, disk: {count: 1, size: 100GiB, storageClass: local-path}, heapRatio: 0.25}
  3. Wait for both pools (hot and hotter) to be stable and running.
  4. Modify the CrateDB CR to change a property of the second pool (hotter), e.g., heapRatio from 0.25 to 0.26:
    # ... (spec remains the same except for hotter.heapRatio) ...
              - name: hotter
                replicas: 1
                nodepool: hotter-pool
                resources: {limits: {cpu: 4, memory: 13Gi}, disk: {count: 1, size: 100GiB, storageClass: local-path}, heapRatio: 0.26} # Changed
  5. Apply the updated CR.
  6. Observation: The operator logs will show the cluster_update handler failing with the KeyError: 1 stack trace.
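
For completeness, steps 4-5 can also be performed programmatically. The following is a minimal sketch using the kubernetes Python client (the read-modify-replace pattern is ours; it assumes a valid kubeconfig and that hotter is the second entry in spec.nodes.data, as in the CR above):

    # Sketch: bump the second pool's heapRatio and write the CR back,
    # equivalent to editing and re-applying the manifest in steps 4-5.
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    cr = api.get_namespaced_custom_object(
        group="cloud.crate.io",
        version="v1",
        namespace="cratedb-ns",
        plural="cratedbs",
        name="my-cluster",
    )
    cr["spec"]["nodes"]["data"][1]["resources"]["heapRatio"] = 0.26
    api.replace_namespaced_custom_object(
        group="cloud.crate.io",
        version="v1",
        namespace="cratedb-ns",
        plural="cratedbs",
        name="my-cluster",
        body=cr,
    )

Applying this (or the kubectl equivalent) triggers the cluster_update handler and, in our environment, the KeyError: 1 traceback shown above.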

5. Analysis/Hypothesis of the Root Cause:

The fundamental issue persists in v2.47.1 and lies in handle_update_cratedb.py. The loop for node_spec_idx in range(len(old_spec_val)): attempts to compare elements of old_spec_val (representing spec.nodes.data from the previous state) and new_spec_val at matching indices.

The KeyError: 1 raised when accessing old_spec_val[node_spec_idx] (with node_spec_idx equal to 1, i.e., the second pool) indicates that the old_spec_val used for that particular diff operation contains no element at index 1. This happens even though the operator.cloud.crate.io/last annotation, when inspected via kubectl, appears to correctly store the two-pool configuration. This suggests a problem with how the operator's internal cache or Kopf's diff mechanism retrieves and presents the "old state" of the spec.nodes.data array, particularly when a pool other than the first in the list is modified: the operator appears to diff against an "old state" that contains only the first data pool, and therefore fails as soon as it processes the second (see the sketch below).
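
To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is our reconstruction, not the operator's actual code: only the names old_spec_val, new_spec_val, and node_spec_idx come from the traceback, the old state is modelled as a mapping because a KeyError (rather than an IndexError) is raised, and the defensive variant at the end is a hypothetical fix, not a patch.

    # Reconstruction: the recorded "old state" only contains pool 0,
    # while the new spec defines pools 0 and 1.
    old_spec_val = {0: {"name": "hot", "resources": {"heapRatio": 0.25}}}
    new_spec_val = {
        0: {"name": "hot", "resources": {"heapRatio": 0.25}},
        1: {"name": "hotter", "resources": {"heapRatio": 0.26}},
    }

    # Index-based access, as in the traceback, crashes on the second pool.
    # (The operator's exact loop bound may differ; the effect is the same.)
    try:
        for node_spec_idx in range(len(new_spec_val)):
            old_spec = old_spec_val[node_spec_idx]
    except KeyError as err:
        print(f"KeyError: {err}")  # -> KeyError: 1

    # Hypothetical defensive variant: treat a missing old entry as a
    # newly added pool instead of crashing.
    for node_spec_idx, new_pool in new_spec_val.items():
        old_pool = old_spec_val.get(node_spec_idx)
        if old_pool is None:
            print(f"pool {new_pool['name']!r} added at index {node_spec_idx}")
        elif old_pool != new_pool:
            print(f"pool {new_pool['name']!r} changed at index {node_spec_idx}")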

6. Successful Workaround (Only for Initial Create, Not for Updates):

The only method to achieve the desired two-pool configuration was a "clean slate" create: deleting any existing CrateDB CR and its resources, restarting the operator, and then applying a CR manifest that defines both pools from the very beginning. Subsequent attempts to update the second pool in this successfully created two-pool cluster trigger the KeyError: 1.
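
For reference, here is a sketch of the "clean slate" procedure using the kubernetes Python client (the operator's namespace crate-operator and its label selector are assumptions on our side; adapt them to the actual deployment):

    # Sketch of the "clean slate" procedure described above.
    import yaml

    from kubernetes import client, config

    config.load_kube_config()
    custom = client.CustomObjectsApi()
    core = client.CoreV1Api()

    # 1. Delete the existing CR; its resources are garbage-collected
    #    via owner references.
    custom.delete_namespaced_custom_object(
        group="cloud.crate.io", version="v1",
        namespace="cratedb-ns", plural="cratedbs", name="my-cluster",
    )

    # 2. Restart the operator by deleting its pod (the Deployment
    #    recreates it); namespace and label selector are assumed.
    pods = core.list_namespaced_pod(
        "crate-operator", label_selector="app=crate-operator"
    )
    for pod in pods.items:
        core.delete_namespaced_pod(pod.metadata.name, "crate-operator")

    # 3. Re-apply the two-pool CR from section 4 (stored in a local file).
    with open("two-pool-cr.yaml") as f:
        two_pool_cr = yaml.safe_load(f)
    custom.create_namespaced_custom_object(
        group="cloud.crate.io", version="v1",
        namespace="cratedb-ns", plural="cratedbs", body=two_pool_cr,
    )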

7. Request:

We urgently request an investigation and fix for the spec.nodes.data diffing and update logic within the CrateDB operator (v2.47.1). The operator must reliably:
a. Allow the addition of new data node pools to an existing cluster via CR update.
b. Allow modifications to any existing data node pool in a multi-pool cluster without encountering KeyError exceptions.
c. Maintain a consistent and correct "previous state" representation for spec.nodes.data during diff operations.

8. Additional Information:

  • CRD Used: cratedbs.cloud.crate.io version v1 (CRD metadata.generation: 2, creationTimestamp: "2024-11-27T05:45:08Z").
