
K8SPG-943: check operator panic in e2e test #1441

Draft
mayankshah1607 wants to merge 4 commits into main from K8SPG-943-e2e-test

Conversation

@mayankshah1607
Member

CHANGE DESCRIPTION

Problem:
An operator panic that happens during an E2E test run can go unnoticed, because nothing inspects the operator logs before the test finishes and the operator is destroyed.

Cause:
The E2E test cleanup steps call destroy_operator without checking the operator logs for panic messages.

Solution:
Add a check_operator_panic function to e2e-tests/functions that greps the operator logs for "Observed a panic" and fails the test if a panic is found, and call it before destroy_operator in the test cleanup steps.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

Signed-off-by: Mayank Shah <mayank.shah@percona.com>
Copilot AI review requested due to automatic review settings on February 13, 2026 09:59

Copilot AI left a comment


Pull request overview

This PR adds operator panic detection to the E2E test cleanup phase. It introduces a new check_operator_panic function that searches the operator logs for panic messages, and the test cleanup scripts now call this function before destroying the operator.

Changes:

  • Added check_operator_panic function to e2e-tests/functions that checks operator logs for "Observed a panic" messages
  • Updated 29 E2E test cleanup files to call check_operator_panic before destroy_operator

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 1 comment.

File Description
e2e-tests/functions Adds new check_operator_panic function to detect panics in operator logs
e2e-tests/tests/*/99-*.yaml Updates 29 test cleanup scripts to check for operator panics before destroying the operator
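
For context, a cleanup step that follows this pattern could look roughly like the sketch below. This is an illustration only: it assumes the step sources e2e-tests/functions (which provides check_operator_panic, destroy_operator, and get_operator_pod) and omits the YAML test-step wrapping used by the real 99-*.yaml files; the exact contents of the PR's cleanup scripts may differ.

    # Hypothetical cleanup fragment, not the literal contents of the PR's 99-*.yaml steps.
    source ../../functions

    # Fail the test run if the operator logged a panic at any point;
    # this must happen while the operator pod (and its logs) still exists.
    check_operator_panic

    # Tear the operator down only after the log check has passed.
    destroy_operator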

Comment on lines +40 to +46
check_operator_panic() {
    local operator_pod=$(get_operator_pod)
    if kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"; then
        echo "Detected panic in operator"
        exit 1
    fi
}

Copilot AI Feb 13, 2026


The function doesn't verify that the operator pod exists before checking for panics. If get_operator_pod returns an empty string (no operator pod found), the kubectl logs command will fail, but the error will be masked by the pipeline to grep, causing the function to silently succeed without checking for panics.

Add validation that operator_pod is non-empty before attempting to retrieve logs. Consider also handling the case where kubectl logs fails due to the pod not existing or not being ready yet.
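
One minimal way to address the empty-result case might be a guard like the sketch below (it reuses the helper and variable names from the snippet above and is not the fix adopted in this PR):

    check_operator_panic() {
        local operator_pod
        operator_pod=$(get_operator_pod)
        # Fail loudly instead of silently skipping the check when no pod is found.
        if [[ -z "$operator_pod" ]]; then
            echo "check_operator_panic: no operator pod found in namespace '${OPERATOR_NS:-$NAMESPACE}'" >&2
            exit 1
        fi
        if kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"; then
            echo "Detected panic in operator"
            exit 1
        fi
    }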

Copilot AI review requested due to automatic review settings on February 16, 2026 08:06

Copilot AI left a comment


Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 1 comment.

Comment on lines +41 to +42
local operator_pod=$(get_operator_pod)
if kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"; then

Copilot AI Feb 16, 2026


  1. Problem: check_operator_panic can silently succeed even when it fails to actually check logs (e.g., get_operator_pod returns empty/multiple pod names, or kubectl logs fails), because the kubectl logs | grep -q pipeline is used without pipefail and without validating $operator_pod.
  2. Why it matters: This creates false negatives, which defeats the purpose of adding a panic check to E2E cleanup steps.
  3. Fix: Validate that exactly one operator pod name is resolved; if none (or more than one), fail with a clear message. Also treat log retrieval failures as errors (enable set -o pipefail for the pipeline or check kubectl logs exit status before grepping).
Suggested change
-local operator_pod=$(get_operator_pod)
-if kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"; then
+local operator_pod
+operator_pod=$(get_operator_pod)
+# Ensure we have exactly one operator pod to check.
+if [[ -z "$operator_pod" ]]; then
+    echo "Failed to detect operator pod: no operator pod found in namespace '${OPERATOR_NS:-$NAMESPACE}'" >&2
+    exit 1
+fi
+# Split the result into an array to detect multiple pods.
+local pods_array=()
+# shellcheck disable=SC2206
+pods_array=($operator_pod)
+if [[ ${#pods_array[@]} -ne 1 ]]; then
+    echo "Failed to detect operator pod: expected 1 pod, got ${#pods_array[@]}: $operator_pod" >&2
+    exit 1
+fi
+operator_pod="${pods_array[0]}"
+# Retrieve logs and fail if we cannot get them.
+local logs
+if ! logs=$(kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator 2>&1); then
+    echo "Failed to retrieve logs for operator pod '$operator_pod' in namespace '${OPERATOR_NS:-$NAMESPACE}'" >&2
+    echo "$logs" >&2
+    exit 1
+fi
+if grep -q "Observed a panic" <<<"$logs"; then

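If buffering the full log output in a shell variable is undesirable, a similar effect (treat log-retrieval failures as errors, then check for the panic marker) can be obtained by inspecting bash's PIPESTATUS array after the original pipeline. The following is a rough sketch, not part of this PR or of the suggestion above, and it keeps only the empty-pod guard rather than the full single-pod validation:

    check_operator_panic() {
        local operator_pod
        operator_pod=$(get_operator_pod)
        if [[ -z "$operator_pod" ]]; then
            echo "check_operator_panic: no operator pod found in namespace '${OPERATOR_NS:-$NAMESPACE}'" >&2
            exit 1
        fi
        # Run the original pipeline, then look at both stages' exit codes.
        kubectl logs -n "${OPERATOR_NS:-$NAMESPACE}" "$operator_pod" -c operator | grep -q "Observed a panic"
        local statuses=("${PIPESTATUS[@]}")
        # Check grep first: with -q it may stop reading once it finds a match,
        # which can make kubectl exit non-zero from SIGPIPE even though a panic was found.
        if ((statuses[1] == 0)); then
            echo "Detected panic in operator"
            exit 1
        fi
        # grep found nothing, so it consumed all input; a non-zero kubectl status
        # here means log retrieval itself failed rather than "no panic".
        if ((statuses[0] != 0)); then
            echo "check_operator_panic: failed to read logs from pod '$operator_pod'" >&2
            exit 1
        fi
    }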
@JNKPercona
Collaborator

Test Name Result Time
backup-enable-disable passed 00:07:01
builtin-extensions passed 00:05:16
custom-envs passed 00:18:00
custom-extensions passed 00:14:33
custom-tls passed 00:05:16
database-init-sql passed 00:02:16
demand-backup passed 00:23:19
demand-backup-offline-snapshot passed 00:13:31
finalizers failure 00:03:51
init-deploy passed 00:03:12
huge-pages passed 00:03:02
monitoring passed 00:07:13
monitoring-pmm3 passed 00:08:14
one-pod passed 00:05:51
operator-self-healing passed 00:08:10
pitr passed 00:11:52
scaling passed 00:04:57
scheduled-backup passed 00:28:42
self-healing passed 00:08:52
sidecars passed 00:02:38
standby-pgbackrest passed 00:12:07
standby-streaming passed 00:09:31
start-from-backup passed 00:10:56
tablespaces passed 00:07:07
telemetry-transfer passed 00:03:31
upgrade-consistency passed 00:05:43
upgrade-minor passed 00:05:09
users passed 00:04:46
Summary Value
Tests Run 28/28
Job Duration 01:19:32
Total Test Time 04:04:50

commit: 1418e39
image: perconalab/percona-postgresql-operator:PR-1441-1418e3914

