
Validate AWS account consistency before Karpenter operations#2892

Draft
L3n41c wants to merge 5 commits into main from lenaic/validate-aws-k8s-account-match

Conversation


@L3n41c L3n41c commented Apr 10, 2026

What does this PR do?

Adds a guard to verify that the AWS credentials being used belong to the same AWS account as the target EKS cluster before any kubectl datadog autoscaling cluster operations (install/uninstall). This prevents accidentally creating CloudFormation stacks in one AWS account while deploying Helm charts to an EKS cluster in another.

Motivation

The install and uninstall commands create both AWS resources (CloudFormation stacks, IAM roles) and Kubernetes resources (Helm chart, Karpenter CRDs). The AWS clients are configured from the default AWS credential chain, while the Kubernetes clients come from kubeconfig. Without validation, a misconfigured environment could silently target different AWS accounts for each.

Additional Notes

  • Install path: Validation is a hard error — the cluster must exist and accounts must match.
  • Uninstall path: A confirmed mismatch is a hard error, but if the cluster cannot be described (e.g. already deleted), it logs a warning and proceeds with cleanup.
  • Also extracts a shared `GetAWSAccountID` helper to deduplicate the STS `GetCallerIdentity` boilerplate that was repeated in install and uninstall.
  • Introduces an `AccountMismatchError` typed error so the uninstall flow can distinguish "confirmed mismatch" from "cannot verify".

Minimum Agent Versions

N/A — this is a kubectl-datadog plugin change, not an agent change.

Describe your test plan

  • Unit tests added for the `validateAccountIDs` pure logic covering matching accounts, mismatched accounts, GovCloud/China partitions, and invalid ARN format.
  • `go build ./cmd/kubectl-datadog/...` passes.
  • `go test ./cmd/kubectl-datadog/autoscaling/cluster/common/clients/...` passes.
  • `go vet ./cmd/kubectl-datadog/...` passes.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

🤖 Generated with Claude Code

Prevent accidental cross-account resource deployment by verifying that
the current AWS credentials and the target EKS cluster belong to the
same AWS account. This check runs in both `install` and `uninstall`
commands right after clients are built.

In the install path, a mismatch is a hard error. In the uninstall path,
a confirmed mismatch is a hard error but unreachable clusters (e.g.
already deleted) produce a warning so cleanup can still proceed.

Also extract a shared GetAWSAccountID helper to deduplicate the
STS GetCallerIdentity boilerplate that was repeated in install and
uninstall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c added the enhancement New feature or request label Apr 10, 2026

L3n41c commented Apr 10, 2026

@codex review


codecov-commenter commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 14.92537% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.05%. Comparing base (5adfc81) to head (8daa146).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...adog/autoscaling/cluster/common/clients/clients.go 17.85% 44 Missing and 2 partials ⚠️
...datadog/autoscaling/cluster/uninstall/uninstall.go 0.00% 7 Missing ⚠️
...ctl-datadog/autoscaling/cluster/install/install.go 0.00% 4 Missing ⚠️

❌ Your patch status has failed because the patch coverage (14.92%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2892      +/-   ##
==========================================
- Coverage   40.06%   40.05%   -0.01%     
==========================================
  Files         319      319              
  Lines       28039    28676     +637     
==========================================
+ Hits        11233    11487     +254     
- Misses      15983    16360     +377     
- Partials      823      829       +6     
Flag Coverage Δ
unittests 40.05% <14.92%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...ctl-datadog/autoscaling/cluster/install/install.go 15.25% <0.00%> (+0.06%) ⬆️
...datadog/autoscaling/cluster/uninstall/uninstall.go 0.00% <0.00%> (ø)
...adog/autoscaling/cluster/common/clients/clients.go 13.20% <17.85%> (+13.20%) ⬆️

... and 9 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5adfc81...8daa146.



datadog-prod-us1-4 bot commented Apr 10, 2026

🎯 Code Coverage (details)
Patch Coverage: 13.33%
Overall Coverage: 40.08% (-0.07%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 8daa146


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29130a60da


Comment on lines +147 to +149

```go
cluster, err := cli.EKS.DescribeCluster(ctx, &eks.DescribeClusterInput{
	Name: awssdk.String(clusterName),
})
```

P1: Validate cluster identity from kubeconfig, not by name lookup

This check can return a false “accounts match” result because it resolves the cluster via DescribeCluster using the same AWS credentials being validated. If the wrong AWS account has a cluster with the same name (a common pattern like prod/staging), DescribeCluster returns that account’s cluster ARN, so the comparison always passes and cross-account install/uninstall can still proceed against the kubeconfig-selected Kubernetes cluster. To actually prevent cross-account operations, the cluster account must be derived from kubeconfig context data (for example, the context ARN when present) or another identity source independent of the current AWS credentials.


Member Author

Fixed in a3b59f4.

The validation now extracts the AWS account ID from the kubeconfig context ARN (via GetAccountIDFromKubeconfig) instead of relying on DescribeCluster with the credentials being validated. This is independent of the AWS credentials and cannot be fooled by same-named clusters across accounts.

When the kubeconfig context is not an ARN (eksctl format, plain name), the validation falls back to `DescribeCluster`, with the known limitation documented in the code.
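
The kubeconfig ARN strategy can be sketched as follows. This is a minimal, stdlib-only sketch based on the documented EKS ARN format `arn:<partition>:eks:<region>:<account-id>:cluster/<name>`; the helper name and exact parsing are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// getAccountIDFromContextARN extracts the AWS account ID when the
// kubeconfig context name is an EKS cluster ARN. The boolean reports
// whether the context was ARN-shaped at all; when it is not (eksctl
// format, plain name), the caller must fall back to another strategy.
func getAccountIDFromContextARN(contextName string) (string, bool) {
	parts := strings.Split(contextName, ":")
	// Expected layout: arn, partition, service, region, account, resource.
	if len(parts) < 6 || parts[0] != "arn" || parts[2] != "eks" {
		return "", false
	}
	return parts[4], true
}

func main() {
	id, ok := getAccountIDFromContextARN("arn:aws:eks:us-east-1:123456789012:cluster/prod")
	fmt.Println(id, ok)
}
```

Because the account ID comes from the kubeconfig itself, this path cannot be fooled by a same-named cluster in the credentials' account, and it also works for GovCloud/China partitions (`arn:aws-us-gov:...`, `arn:aws-cn:...`) since the partition field is not inspected.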

@L3n41c L3n41c added this to the v1.27.0 milestone Apr 10, 2026
L3n41c and others added 4 commits April 10, 2026 20:42
The previous approach used DescribeCluster with the same AWS credentials
being validated, which gives a false positive when both accounts have a
cluster with the same name (e.g. "prod").

Now extract the account ID directly from the kubeconfig context when it
is an EKS ARN (arn:aws:eks:region:account:cluster/name). This source is
independent of the AWS credentials and cannot be fooled by same-named
clusters. Falls back to DescribeCluster only when the kubeconfig context
is not an ARN (eksctl format, plain name).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract shared helper to avoid duplicating kubeconfig loading and
context resolution between GetClusterNameFromKubeconfig and
GetAccountIDFromKubeconfig. Also rename ctx variable to kubeCtx
to avoid shadowing context.Context convention, and remove inline
comments that repeated the function's docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nsistency

Make getAccountIDFromKubeconfig private and call it from inside
ValidateAWSAccountConsistency instead of requiring callers to
pre-compute and pass the kubeconfig account ID. This simplifies the
call sites and encapsulates the two-strategy validation logic
(kubeconfig ARN vs DescribeCluster fallback).

Also inline the trivial validateAccountIDs helper and remove its
now-unnecessary tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
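
The encapsulated two-strategy flow this commit describes might be sketched as below. The parameter shapes (a precomputed kubeconfig account ID and a fallback lookup function) are illustrative assumptions, not the real signatures.

```go
package main

import (
	"fmt"
)

// validateAWSAccountConsistency prefers the credential-independent account
// ID taken from the kubeconfig context ARN, and only falls back to a
// DescribeCluster-style lookup when the context was not an ARN.
func validateAWSAccountConsistency(
	credsAccountID string,
	kubeconfigAccountID string, // "" when the context is not an EKS ARN
	describeClusterAccountID func() (string, error), // fallback lookup
) error {
	clusterAccountID := kubeconfigAccountID
	if clusterAccountID == "" {
		// Fallback path: a same-named cluster in the wrong account can fool
		// this lookup, which is the documented limitation.
		id, err := describeClusterAccountID()
		if err != nil {
			return fmt.Errorf("cannot verify account consistency: %w", err)
		}
		clusterAccountID = id
	}
	if clusterAccountID != credsAccountID {
		return fmt.Errorf("account mismatch: credentials %s vs cluster %s",
			credsAccountID, clusterAccountID)
	}
	return nil
}

func main() {
	err := validateAWSAccountConsistency("111111111111", "222222222222", nil)
	fmt.Println(err != nil)
}
```

Folding the kubeconfig lookup inside the validator, as the commit does, means call sites only pass their clients and cluster name; the sketch above keeps the inputs explicit purely for readability.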
