Add Fargate profile for Karpenter controller #2891

Draft
L3n41c wants to merge 2 commits into main from lenaic/karpenter-fargate-profile

@L3n41c (Member) commented Apr 10, 2026

What does this PR do?

Adds AWS Fargate support to the kubectl datadog autoscaling cluster install command so the Karpenter controller runs on dedicated Fargate nodes instead of regular EC2 nodes.

Motivation

Karpenter's Helm chart includes a default node affinity (karpenter.sh/nodepool: DoesNotExist) that prevents the controller from running on nodes it manages. This creates a migration problem: when migrating all workloads from pre-existing nodes to Karpenter-managed nodes, the controller itself cannot be migrated, forcing users to keep some pre-existing nodes.

Running Karpenter on Fargate solves this by providing dedicated serverless compute, separate from both pre-existing and Karpenter-managed EC2 nodes. Users can then fully decommission their pre-existing node groups.

Additional Notes

Changes:

  • Private subnet discovery from cluster VPC route tables (Fargate requires private subnets)
  • FargatePodExecutionRole and FargateProfile added to CloudFormation template (conditional)
  • --no-fargate flag for opt-out
  • Controller resource requests set when running on Fargate (for proper Fargate sizing)
  • Existing Fargate profiles are preserved on re-run if subnet discovery fails transiently
  • Fargate profile displayed in uninstall resource summary

Fargate compatibility:

  • Karpenter is a regular Deployment (not DaemonSet) — works on Fargate
  • No privileged containers or persistent volumes needed
  • Fargate nodes don't have karpenter.sh/nodepool label, so default affinity is satisfied
  • Resource needs (1 vCPU, 1Gi memory) are within Fargate limits
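
To make the affinity point above concrete, here is a small sketch of how a `DoesNotExist` requirement is evaluated against node labels: it matches whenever the label key is absent, whatever other labels the node carries. The function name `satisfiesDoesNotExist` and the example label sets are illustrative assumptions; the real evaluation happens in the Kubernetes scheduler.

```go
package main

import "fmt"

// satisfiesDoesNotExist mirrors the semantics of a node-affinity requirement
// with operator DoesNotExist: the node matches only if the key is absent.
// Hypothetical helper for illustration.
func satisfiesDoesNotExist(nodeLabels map[string]string, key string) bool {
	_, present := nodeLabels[key]
	return !present
}

func main() {
	const nodePoolLabel = "karpenter.sh/nodepool"

	// Example label sets (assumed for illustration).
	fargateNode := map[string]string{"eks.amazonaws.com/compute-type": "fargate"}
	karpenterNode := map[string]string{nodePoolLabel: "default"}

	fmt.Println(satisfiesDoesNotExist(fargateNode, nodePoolLabel))   // true
	fmt.Println(satisfiesDoesNotExist(karpenterNode, nodePoolLabel)) // false
}
```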

Minimum Agent Versions

N/A — this change affects the kubectl plugin only, not the operator or agent.

Describe your test plan

  • Unit tests: go test ./cmd/kubectl-datadog/autoscaling/cluster/install/guess/... (12 test cases for subnet filtering)
  • Build: go build ./cmd/kubectl-datadog/...
  • Manual: Run kubectl datadog autoscaling cluster install --cluster-name <name> on a test EKS cluster and verify:
    • Fargate profile is created in the EKS console
    • Karpenter pods run on Fargate nodes (kubectl get pods -n dd-karpenter -o wide shows fargate-* node names)
    • kubectl datadog autoscaling cluster uninstall cleans up the Fargate profile via CloudFormation

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

🤖 Generated with Claude Code

Karpenter's Helm chart includes a default node affinity
(karpenter.sh/nodepool: DoesNotExist) that prevents the controller from
running on nodes it manages. This creates a migration problem: users
must keep pre-existing nodes just for the Karpenter controller.

Create an AWS Fargate profile for the Karpenter namespace so the
controller runs on dedicated serverless compute, separate from both
pre-existing and Karpenter-managed EC2 nodes. This allows users to
fully migrate workloads to Karpenter-managed nodes.

Changes:
- Add private subnet discovery from cluster VPC route tables
- Add FargatePodExecutionRole and FargateProfile to CloudFormation
- Add --no-fargate flag for opt-out
- Set controller resource requests when running on Fargate
- Preserve existing Fargate profile on re-run if subnet discovery fails
- Display Fargate profile in uninstall resource summary

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@L3n41c added the enhancement label Apr 10, 2026
- Fix govet shadow error: use = instead of := for err in
  createCloudFormationStacks to avoid shadowing named return value
- Fix gci import ordering in uninstall.go: ec2/types before eks
- Remove trailing blank line in privatesubnets.go

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
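
For readers unfamiliar with the govet shadow fix mentioned in the commit above, the following sketch shows the general pattern. Inside a nested block, `:=` declares a new `err` that hides the named return value, while `=` assigns to it. The functions `step`, `withShadow`, and `withoutShadow` are hypothetical and are not taken from createCloudFormationStacks.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical operation that always fails.
func step() error { return errors.New("stack creation failed") }

// withShadow loses the error: := declares a new err scoped to the if block,
// so the named return value stays nil.
func withShadow(run bool) (err error) {
	if run {
		err := step() // new variable, shadows the named return value
		_ = err
	}
	return
}

// withoutShadow propagates the error: = assigns to the named return value.
func withoutShadow(run bool) (err error) {
	if run {
		err = step()
	}
	return
}

func main() {
	fmt.Println(withShadow(true))    // <nil>
	fmt.Println(withoutShadow(true)) // stack creation failed
}
```

In this sketch the shadowed variant silently drops the failure, which is the class of bug the shadow analyzer is meant to surface.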

@L3n41c (Member, Author) commented Apr 10, 2026

@codex review

@codecov-commenter

Codecov Report

❌ Patch coverage is 23.38710% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.00%. Comparing base (5adfc81) to head (5b9a129).
⚠️ Report is 1 commit behind head on main.

Files with missing lines                                 Patch %   Lines
...ctl-datadog/autoscaling/cluster/install/install.go    0.00%     46 Missing ⚠️
...utoscaling/cluster/install/guess/privatesubnets.go    42.64%    39 Missing ⚠️
...datadog/autoscaling/cluster/uninstall/uninstall.go    0.00%     10 Missing ⚠️

❌ Your patch status has failed because the patch coverage (23.38%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
- Coverage   40.06%   40.00%   -0.06%     
==========================================
  Files         319      320       +1     
  Lines       28039    28153     +114     
==========================================
+ Hits        11233    11262      +29     
- Misses      15983    16068      +85     
  Partials      823      823              

Flag        Coverage Δ
unittests   40.00% <23.38%> (-0.06%) ⬇️

Files with missing lines                                 Coverage Δ
...datadog/autoscaling/cluster/uninstall/uninstall.go    0.00% <0.00%> (ø)
...utoscaling/cluster/install/guess/privatesubnets.go    42.64% <42.64%> (ø)
...ctl-datadog/autoscaling/cluster/install/install.go    13.18% <0.00%> (-2.01%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@datadog-prod-us1-3

🎯 Code Coverage (details)
Patch Coverage: 22.88%
Overall Coverage: 40.08% (-0.07%)

🔗 Commit SHA: 5b9a129

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b9a12952d

"DeployPodIdentityAddon": strconv.FormatBool(!isUnmanagedEKSPIAInstalled),
"DeployNodeAccessEntry": strconv.FormatBool(supportsAPIAuth),
"DeployFargateProfile": strconv.FormatBool(deployFargate),
"FargateSubnets": fargateSubnets,

P1: Keep Fargate subnets stable across re-installs

This always sends a freshly discovered FargateSubnets value during stack updates, which can break idempotency: in AWS::EKS::FargateProfile, Subnets is a replacement-only property, while the template fixes FargateProfileName to ${ClusterName}-karpenter. If the discovered subnet set/order changes after first install (for example after VPC/subnet changes), CloudFormation needs to replace a custom-named profile and the update fails, so kubectl datadog autoscaling cluster install can fail on re-run. To avoid this, keep using the existing profile subnets whenever the profile already exists (not only on discovery errors), or stop pinning the profile name.
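
One way to read the first suggestion above is as a small selection rule: if a profile already exists, pass its current subnets back to CloudFormation unchanged, and use the freshly discovered set only on first install. The sketch below is an illustration under stated assumptions, not the PR's code: `chooseFargateSubnets` and its arguments are hypothetical, and it assumes the `FargateSubnets` parameter is a comma-separated list.

```go
package main

import (
	"fmt"
	"strings"
)

// chooseFargateSubnets returns the value to send as the FargateSubnets
// stack parameter. existing holds the subnets of an already deployed
// profile (empty when there is none); discovered holds the fresh result.
// Hypothetical helper; the comma-separated format is an assumption.
func chooseFargateSubnets(existing, discovered []string) string {
	if len(existing) > 0 {
		// Keep the original list and order so the replacement-only
		// Subnets property never changes on re-run.
		return strings.Join(existing, ",")
	}
	return strings.Join(discovered, ",")
}

func main() {
	// Re-install: the existing profile wins even though discovery changed.
	fmt.Println(chooseFargateSubnets(
		[]string{"subnet-a", "subnet-b"},
		[]string{"subnet-b", "subnet-c"},
	))
	// First install: no profile yet, so use the discovered subnets.
	fmt.Println(chooseFargateSubnets(nil, []string{"subnet-b", "subnet-c"}))
}
```

The trade-off in this sketch is that a profile keeps its original subnets even after the VPC layout changes; picking up new subnets would then require deleting and recreating the profile deliberately.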

@L3n41c added this to the v1.27.0 milestone Apr 10, 2026