@simrankaurb simrankaurb commented Dec 29, 2025

This PR adds a new cleanup script designed to identify and remove stale GCP resources while strictly protecting active or explicitly exempted infrastructure. The solution is split into two primary components:

  1. tools/cleanup.sh: A Bash script that orchestrates the discovery and deletion of various GCP resources (Compute Engine, GKE, TPU, VPC, etc.) based on age and protection rules.
  2. tools/cloud-build/project-cleanup.yaml: A Cloud Build configuration to automate the execution of the cleanup script with built-in retry logic and environment variable management.

Key Features

  • Hierarchical Resource Protection: The script doesn't just check for names; it proactively identifies "protected" resources. For example, if a GKE cluster is marked for preservation, the script automatically identifies and protects its underlying node pools, Managed Instance Groups (MIGs), specific VM instances, disks, and even the VPC networks they inhabit.
  • Flexible Exclusions: Supports two methods for whitelisting resources:
    • Global Exclusion File: A text file (stored in GCS) containing names of resources to ignore.
    • Label-based Exemption: Resources that carry the label cleanup-exemption-date with a valid YYYY-MM-DD value are skipped until that date has passed.
  • Multi-Phase Cleanup: Resources are deleted in an order that respects dependencies (e.g., clusters and VMs are deleted before the networks and subnets they use).
  • Safety Mechanisms:
    • Dry Run Mode: Shows what would be deleted without taking any action.
    • Time Cutoffs: Defaults to deleting resources older than 2 hours, with a longer 60-day window for VM images to preserve base environment states.
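
The label-based exemption and dry-run behavior described above could be sketched roughly as follows. This is a minimal illustration, not the code in tools/cleanup.sh: the helper names `is_exempt` and `run_or_print`, the `true` default for `DRY_RUN`, and treating the exemption date itself as still protected are all assumptions.

```bash
set -euo pipefail

# Assumption: dry-run is the default unless DRY_RUN is set to something else.
DRY_RUN="${DRY_RUN:-true}"

# is_exempt LABEL_VALUE TODAY
# Succeeds when LABEL_VALUE looks like YYYY-MM-DD and is not earlier than TODAY.
# Only the shape of the date is checked, not calendar validity.
is_exempt() {
  local label_value="$1" today="$2"
  [[ "$label_value" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || return 1
  # ISO dates sort lexicographically, so a string comparison is sufficient.
  [[ ! "$label_value" < "$today" ]]
}

# run_or_print COMMAND [ARGS...]
# Prints the command in dry-run mode, otherwise executes it.
run_or_print() {
  if [[ "$DRY_RUN" == "true" ]]; then
    printf 'DRY RUN:'
    printf ' %q' "$@"
    printf '\n'
  else
    "$@"
  fi
}
```

A caller would pass the label value read from the resource and today's date, for example `is_exempt "$label" "$(date -u +%F)"`, and wrap each deletion in `run_or_print`.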

What this Pull Request is doing

This PR establishes a "garbage collection" system for the project's GCP environment. It solves the problem of resource leakage (and subsequent costs) by:

  1. Scanning the project for nearly all major resource types (GKE, Compute, TPU, Filestore, Images, VPCs, IAM bindings).
  2. Filtering out anything that is too new (less than 2 hours old), explicitly listed in an exclusion file, or labeled for protection.
  3. Mapping Dependencies to ensure that if a core resource (like a VM) is being kept, its required infrastructure (like its IP address or network) is also kept, even if that infrastructure wasn't explicitly protected.
  4. Executing Deletion via gcloud commands in a specific order to avoid dependency errors.
  5. Automating the entire process via Cloud Build, which can be triggered on a schedule or manually to keep the project clean.
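
Steps 2 and 3 above can be sketched as two small filters. This is an illustrative sketch under stated assumptions, not the PR's implementation: the function names, the tab-separated input layout, the use of epoch seconds for creation time, and treating a resource of exactly 2 hours as stale are all hypothetical choices.

```bash
set -euo pipefail

CUTOFF_SECONDS=$((2 * 60 * 60))   # 2-hour default from the PR description

# select_stale NOW_EPOCH EXCLUSIONS
# Reads "name<TAB>created_epoch" lines on stdin and prints the names that are
# at least CUTOFF_SECONDS old and not listed in EXCLUSIONS (one name per line).
select_stale() {
  local now="$1" exclusions="$2" name created
  while IFS=$'\t' read -r name created; do
    (( now - created >= CUTOFF_SECONDS )) || continue
    if grep -Fxq -- "$name" <<<"$exclusions"; then
      continue
    fi
    printf '%s\n' "$name"
  done
}

# protected_networks PROTECTED_VMS
# Reads "vm<TAB>network" lines on stdin and prints each network used by a
# protected VM, so it can be added to the exclusions for the network phase.
protected_networks() {
  local protected="$1" vm network
  while IFS=$'\t' read -r vm network; do
    if grep -Fxq -- "$vm" <<<"$protected"; then
      printf '%s\n' "$network"
    fi
  done | sort -u
}
```

The output of `protected_networks` would be appended to the exclusion list before `select_stale` runs over the networks, which is one way to keep infrastructure that was never explicitly protected.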

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Summary of Changes

Hello @simrankaurb, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, automated system for cleaning up unused or aged resources within a GCP project. It provides a robust bash script capable of targeting a wide array of resource types, complemented by a Cloud Build pipeline for scheduled execution and a flexible exclusion mechanism to prevent accidental deletion of important assets. The primary goal is to help manage cloud costs and maintain a tidy project environment by regularly removing ephemeral resources.

Highlights

  • New Cleanup Script: A comprehensive bash script (cleanup.sh) has been added to identify and delete various aged Google Cloud Platform (GCP) resources, including GKE clusters, TPU VMs, Compute Instances, Filestore instances, VM Images, Docker Images, Instance Templates, Addresses, Disks, Subnetworks, VPC Networks, Cloud Routers, and IAM bindings for deleted service accounts.
  • Intelligent Resource Protection: The script incorporates robust protection mechanisms, allowing resources to be excluded by name via a GCS-hosted file or by a cleanup-exemption-date label. It also automatically identifies and protects sub-resources (like disks, networks, instance templates, and IPs) associated with protected GKE clusters, Compute Instances, and TPU VMs, as well as instance templates that utilize protected networks.
  • Automated Cloud Build Execution: A new Cloud Build configuration (project-cleanup.yaml) is introduced to automate the execution of the cleanup script. This configuration sets time cutoffs for resource age (2 hours for most, 60 days for images) and includes a retry mechanism with exponential backoff to enhance reliability.
  • Centralized Exclusion List: A dedicated exclusions.txt file is added, providing a clear and centralized list of specific GCP resource names that are explicitly exempt from the automated cleanup process, ensuring critical resources are not inadvertently deleted.
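
For reference, a retry with exponential backoff of the kind mentioned for project-cleanup.yaml can be sketched in Bash as below. The `retry` helper, its argument order, and the doubling factor are assumptions made for illustration; the actual pipeline may implement this differently.

```bash
set -euo pipefail

# retry MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

A build step could then call something like `retry 3 30 ./tools/cleanup.sh`, where the attempt count and base delay are placeholder values.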

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive cleanup script for GCP resources, along with a Cloud Build configuration to run it periodically. The script is well-structured with good safety mechanisms like a dry-run mode, an exclusion file, and protection of dependent resources. My review focuses on improving the script's robustness, security, and maintainability. I've identified an area in the TPU resource handling that uses fragile parsing and can be significantly improved. I've also made suggestions to enhance the retry logic in the Cloud Build job and to refactor duplicated code for better long-term maintainability.

@simrankaurb simrankaurb changed the title from "Cleanup Script" to "Add automated GCP resource cleanup script and Cloud Build pipeline" on Dec 29, 2025
@simrankaurb

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive cleanup script for GCP resources and a Cloud Build pipeline to automate it. The script is well-structured with good dependency checking and safety mechanisms like dry-run mode and an exclusion file. The hierarchical protection logic, which protects dependencies of excluded resources, is a great feature.

My review focuses on improving the script's robustness and maintainability. I've identified a few areas for improvement:

  • Correcting potential bugs related to word-splitting when processing resource lists.
  • Making the parsing of gcloud output more reliable by switching from CSV to TSV.
  • Enhancing script safety with set -euo pipefail.
  • Improving the reusability of the Cloud Build pipeline by parameterizing the hardcoded GCS bucket name.

These changes should make the cleanup process more reliable and the automation easier to adapt for different environments.
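
The word-splitting and TSV points above can be illustrated with a short sketch. It is not taken from cleanup.sh: `list_instances` is a hypothetical stand-in that emits tab-separated rows so the example runs offline, where the real script would presumably read from a gcloud listing instead.

```bash
set -euo pipefail   # exit on errors, unset variables, and failed pipeline stages

# Hypothetical stand-in for a resource listing that prints "name<TAB>zone".
list_instances() {
  printf 'vm-a\tus-central1-a\n'
  printf 'vm-b\teurope-west4-b\n'
}

names=()
zones=()
# Reading line by line with IFS set to a tab avoids the word-splitting that
# an unquoted "for x in $(...)" loop would apply to each row.
# Caveat: tab is IFS whitespace, so empty fields collapse.
while IFS=$'\t' read -r name zone; do
  names+=("$name")
  zones+=("$zone")
done < <(list_instances)

for i in "${!names[@]}"; do
  printf 'would delete %s in %s\n' "${names[$i]}" "${zones[$i]}"
done
```

Keeping each row in arrays and quoting every expansion means a value containing spaces or glob characters stays a single argument.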

@simrankaurb

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive and much-needed automated cleanup system for GCP resources. The script is well-structured, with clear phasing and robust dependency handling to prevent errors during deletion. The hierarchical protection logic, which automatically protects sub-resources of excluded items, is particularly well-designed. The inclusion of a Cloud Build pipeline with retry logic and a safe-by-default DRY_RUN mode is excellent.

My review focuses on improving the script's security and robustness by addressing the use of eval and potential race conditions in IAM modifications. I've also suggested a minor improvement to the Cloud Build configuration for better maintainability. Overall, this is a fantastic contribution to the project's operational health.
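
One common way to avoid eval is to build each command as a Bash array and execute it directly, as in the sketch below. The `delete_resource` helper and its parameters are hypothetical and only illustrate the pattern; they are not the functions used in cleanup.sh.

```bash
set -euo pipefail

# delete_resource KIND NAME [ZONE]
# Assembles the gcloud command as an array, so NAME is always passed as a
# single argument and is never re-parsed by the shell as eval would do.
delete_resource() {
  local kind="$1" name="$2" zone="${3:-}"
  local cmd=(gcloud compute "$kind" delete "$name" --quiet)
  if [[ -n "$zone" ]]; then
    cmd+=("--zone=$zone")
  fi
  # Assumption: dry-run is the default unless DRY_RUN is set otherwise.
  if [[ "${DRY_RUN:-true}" == "true" ]]; then
    printf '%q ' "${cmd[@]}"
    printf '\n'
  else
    "${cmd[@]}"
  fi
}
```

With this shape, a resource name containing spaces or shell metacharacters is printed escaped in dry-run mode and passed through unchanged otherwise.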

@simrankaurb simrankaurb force-pushed the cleanup branch 2 times, most recently from e9fc707 to 2ea8d26 on December 29, 2025 10:56