
Conversation

@SinaChavoshi (Member)

This PR introduces new example blueprints to facilitate the deployment of GKE clusters optimized for the Inference Gateway on A3 Mega and A3 Ultra machine types.

These changes replicate the existing pattern from the gke-a3-highgpu-inference-gateway.yaml blueprint.

Key Changes:

  • New Blueprint for A3 Mega: Added examples/gke-a3-megagpu/gke-a3-megagpu-inference-gateway.yaml and a corresponding deployment file. This blueprint configures a GKE cluster with the necessary REGIONAL_MANAGED_PROXY subnet and enables the enable_inference_gateway flag (both sketched below).
  • New Blueprint for A3 Ultra: Added examples/gke-a3-ultragpu/gke-a3-ultragpu-inference-gateway.yaml and a corresponding deployment file, following the same pattern as the A3 Mega blueprint.
  • Documentation: Updated examples/README.md to include descriptions and links for the new gke-a3-megagpu-inference-gateway.yaml and gke-a3-ultragpu-inference-gateway.yaml blueprints.

These additions make it easier for users to get started with Inference Gateway on the latest A3 Mega and A3 Ultra hardware.
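
The inference-gateway-specific configuration in both new blueprints boils down to two pieces, sketched here for orientation (a minimal illustration assembled from this PR's diffs; subnet names and IP ranges differ between the Mega and Ultra files):

    # In the VPC module: a dedicated proxy-only subnet for the gateway load balancer
    subnetworks:
    - subnet_name: $(vars.deployment_name)-proxy-subnet
      subnet_region: $(vars.region)
      subnet_ip: 10.129.0.0/24
      purpose: REGIONAL_MANAGED_PROXY
      role: ACTIVE

    # In the gke-cluster module: enable the Inference Gateway integration
    enable_inference_gateway: true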

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch (see the example commands after this list)
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines
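
For the pre-commit and test steps, a typical local workflow is `pre-commit install` once, then `pre-commit run --all-files` and `make tests` from the repository root; defer to the contribution guidelines if the repository's targets differ.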

@SinaChavoshi SinaChavoshi requested review from a team and samskillman as code owners December 16, 2025 00:24
@gemini-code-assist (Contributor)

Summary of Changes

Hello @SinaChavoshi, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the GKE Inference Gateway capabilities within the toolkit by introducing dedicated blueprints for A3 Mega and A3 Ultra GPU machine types. These new examples streamline the deployment of GKE clusters pre-configured for high-performance inference workloads, including necessary networking components and Inference Gateway enablement, making it easier for users to leverage the latest A3 hardware.

Highlights

  • A3 Mega Inference Gateway Blueprint: A new example blueprint and deployment file have been added for deploying GKE clusters optimized for Inference Gateway on A3 Mega machine types (H100 GPUs).
  • A3 Ultra Inference Gateway Blueprint: A new example blueprint and deployment file have been added for deploying GKE clusters optimized for Inference Gateway on A3 Ultra machine types (H200 GPUs).
  • Documentation Updates: The examples/README.md has been updated to include descriptions and links for these new A3 Mega and A3 Ultra Inference Gateway blueprints.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.
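
For example, commenting /gemini review on this pull request requests a fresh code review of its current state.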

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
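
As an illustration only (the field name below is an assumption from memory; the linked documentation defines the actual schema), a repository-level configuration might look like:

    # .gemini/config.yaml — illustrative sketch, not a verified schema
    code_review:
      comment_severity_threshold: MEDIUM  # assumed option: surface only medium-or-higher findings

A natural-language style guide could then live alongside it in .gemini/styleguide.md, e.g. "Follow PEP-8 for Python code."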

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces new example blueprints for GKE Inference Gateway on A3 Mega and A3 Ultra hardware, which is a great addition. The changes correctly update the documentation in examples/README.md. However, I've identified several inconsistencies between the new megagpu and ultragpu examples in both the main blueprint files and their corresponding deployment files. A significant point of feedback is that the gke-a3-ultragpu-inference-gateway.yaml blueprint is substantially more complex than its megagpu counterpart, including many modules related to training workloads that seem out of scope for an inference-focused example. Addressing these points by simplifying the ultragpu blueprint and aligning styles across all new files will significantly improve the consistency and usability of these new examples.

Comment on lines +153 to +475
  - id: training_bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      local_mount: /training-data
      name_prefix: training
      random_suffix: true
      force_destroy: false
      enable_hierarchical_namespace: true

  - id: checkpoint_bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      local_mount: /checkpoint-data
      name_prefix: checkpoint
      random_suffix: true
      force_destroy: false
      enable_hierarchical_namespace: true

  - id: a3-ultragpu-cluster
    source: modules/scheduler/gke-cluster
    use: [gke-a3-ultra-net-0, workload_service_account]
    settings:
      system_node_pool_machine_type: "e2-standard-16"
      system_node_pool_disk_size_gb: $(vars.system_node_pool_disk_size_gb)
      system_node_pool_taints: []
      enable_dcgm_monitoring: true
      enable_gcsfuse_csi: true
      enable_managed_lustre_csi: true  # Enable Managed Lustre for the cluster
      enable_private_endpoint: false  # Allows access from authorized public IPs
      configure_workload_identity_sa: true
      master_authorized_networks:
      - cidr_block: $(vars.authorized_cidr)  # Allows your machine to run the kubectl command. Required for multi network setup.
        display_name: "kubectl-access-network"
      additional_networks:
        $(concat(
          [{
            network=gke-a3-ultra-net-1.network_name,
            subnetwork=gke-a3-ultra-net-1.subnetwork_name,
            subnetwork_project=vars.project_id,
            nic_type="GVNIC",
            queue_count=null,
            network_ip=null,
            stack_type=null,
            access_config=[{nat_ip=null, public_ptr_domain_name=null, network_tier=null}],
            ipv6_access_config=[],
            alias_ip_range=[]
          }],
          gke-a3-ultra-rdma-net.subnetwork_interfaces_gke
        ))
      # Cluster versions cannot be updated through the toolkit after creation
      # Please manage cluster version from the Google Cloud Console directly
      version_prefix: $(vars.version_prefix)
      release_channel: RAPID
      maintenance_exclusions:
      - name: no-minor-or-node-upgrades-indefinite
        start_time: "2024-12-01T00:00:00Z"
        end_time: "2025-12-22T00:00:00Z"
        exclusion_scope: NO_MINOR_OR_NODE_UPGRADES
      enable_inference_gateway: true
    outputs: [instructions]

  # # --- MANAGED LUSTRE ADDITIONS ---
  # # Private Service Access (PSA) requires the compute.networkAdmin role which is
  # # included in the Owner role, but not Editor.
  # # PSA is required for all Managed Lustre functionality.
  # # https://cloud.google.com/vpc/docs/configure-private-services-access#permissions
  # - id: private_service_access
  #   source: community/modules/network/private-service-access
  #   use: [gke-a3-ultra-net-0]
  #   settings:
  #     prefix_length: 24

  # # Firewall to allow Managed Lustre connection
  # - id: lustre_firewall_rule
  #   source: modules/network/firewall-rules
  #   use: [gke-a3-ultra-net-0]
  #   settings:
  #     ingress_rules:
  #     - name: $(vars.deployment_name)-allow-lustre-traffic
  #       description: Allow Managed Lustre traffic
  #       source_ranges:
  #       - $(private_service_access.cidr_range)
  #       allow:
  #       - protocol: tcp
  #         ports:
  #         - "988"

  # - id: managed-lustre
  #   source: modules/file-system/managed-lustre
  #   use: [gke-a3-ultra-net-0, private_service_access]
  #   settings:
  #     name: $(vars.lustre_instance_id)
  #     local_mount: /lustre
  #     remote_mount: lustrefs
  #     size_gib: $(vars.lustre_size_gib)
  #     per_unit_storage_throughput: $(vars.per_unit_storage_throughput)

  # - id: lustre-pv
  #   source: modules/file-system/gke-persistent-volume
  #   use: [managed-lustre, a3-ultragpu-cluster]
  #   settings:
  #     capacity_gib: $(vars.lustre_size_gib)

  - id: a3-ultragpu-pool
    source: modules/compute/gke-node-pool
    use: [a3-ultragpu-cluster, node_pool_service_account]
    settings:
      machine_type: a3-ultragpu-8g
      auto_upgrade: true
      zones: [$(vars.zone)]
      disk_size_gb: $(vars.a3ultra_node_pool_disk_size_gb)
      static_node_count: $(vars.static_node_count)
      guest_accelerator:
      - type: $(vars.accelerator_type)
        count: 8
      reservation_affinity:
        consume_reservation_type: SPECIFIC_RESERVATION
        specific_reservations:
        - name: $(vars.reservation)
      additional_networks:
        $(concat(
          [{
            network=gke-a3-ultra-net-1.network_name,
            subnetwork=gke-a3-ultra-net-1.subnetwork_name,
            subnetwork_project=vars.project_id,
            nic_type="GVNIC",
            queue_count=null,
            network_ip=null,
            stack_type=null,
            access_config=[{nat_ip=null, public_ptr_domain_name=null, network_tier=null}],
            ipv6_access_config=[],
            alias_ip_range=[]
          }],
          gke-a3-ultra-rdma-net.subnetwork_interfaces_gke
        ))
    outputs: [instructions]

  - id: workload-manager-install
    source: modules/management/kubectl-apply
    use: [a3-ultragpu-cluster]
    settings:
      apply_manifests:
      - source: $(vars.permissions_file_staged_path)
        enable: $(vars.enable_periodic_health_checks)
        template_vars:
          project_id: $(vars.project_id)
          deployment_name: $(vars.deployment_name)
      - source: $(vars.chs_pvc_rendered_path)
        enable: $(vars.enable_periodic_health_checks)
        template_vars:
          pvc_name: $(vars.chs_pvc_claim_name)
          access_mode: ReadWriteOnce
          capacity: 1Gi
          storage_class_name: standard-rwo
      - source: $(vars.chs_cronjob_rendered_path)
        enable: $(vars.enable_periodic_health_checks)
        template_vars:
          project_id: $(vars.project_id)
          deployment_name: $(vars.deployment_name)
          region: $(vars.region)
          machine_type: a3-ultragpu-8g
          gcs_bucket: $(vars.chs_output_bucket_name)
          gcs_pvc: $(vars.chs_pvc_claim_name)
          cronjob_schedule: $(vars.health_check_schedule)
      kueue:
        install: true
        config_path: $(vars.kueue_configuration_path)
        config_template_vars:
          num_gpus: $(a3-ultragpu-pool.static_gpu_count)
          accelerator_type: $(vars.accelerator_type)
      jobset:
        install: true
      gib:
        install: true  # NCCL gIB plugin via DaemonSet initContainer
        path: $(vars.gib_installer_path)
        template_vars:
          version: v1.1.0
          accelerator_count: 8

  - id: job-template
    source: modules/compute/gke-job-template
    use: [a3-ultragpu-pool]
    settings:
      image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
      command:
      - nvidia-smi
      node_count: 2
      name: run-nvidia-smi
      k8s_service_account_name: workload-identity-k8s-sa
    outputs: [instructions]

  # Create a remote mount of training_bucket using
  # mount options optimized for reading training data.
  # Based on source of truth: https://github.com/GoogleCloudPlatform/gcsfuse/blob/d1373b665b7f60e98856d2181f1193396ef16427/samples/gke-csi-yaml/gpu/training-pv.yaml#L15
  # Some of the options might be available only on the latest GKE version; please check that the cluster version meets the required version: https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf
  - id: gcs-training
    source: modules/file-system/pre-existing-network-storage
    settings:
      remote_mount: $(training_bucket.gcs_bucket_name)
      local_mount: /training-data
      fs_type: gcsfuse
      mount_options: >-
        implicit-dirs,
        metadata-cache:ttl-secs:-1,
        metadata-cache:stat-cache-max-size-mb:-1,
        metadata-cache:type-cache-max-size-mb:-1,
        file-cache:max-size-mb:-1,
        file-cache:cache-file-for-range-read:true

  # Create a remote mount of checkpoint_bucket using mount
  # options optimized for writing and reading checkpoint data.
  # Based on source of truth: https://github.com/GoogleCloudPlatform/gcsfuse/blob/d1373b665b7f60e98856d2181f1193396ef16427/samples/gke-csi-yaml/gpu/checkpointing-pv.yaml#L15
  # Some of the options might be available only on the latest GKE version; please check that the cluster version meets the required version: https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf
  - id: gcs-checkpointing
    source: modules/file-system/pre-existing-network-storage
    settings:
      remote_mount: $(checkpoint_bucket.gcs_bucket_name)
      local_mount: /checkpoint-data
      fs_type: gcsfuse
      mount_options: >-
        implicit-dirs,
        metadata-cache:ttl-secs:-1,
        metadata-cache:stat-cache-max-size-mb:-1,
        metadata-cache:type-cache-max-size-mb:-1,
        file-cache:max-size-mb:-1,
        file-cache:cache-file-for-range-read:true,
        file-cache:enable-parallel-downloads:true,
        rename-dir-limit=200000

  # Persistent Volume for training data
  - id: training-pv
    source: modules/file-system/gke-persistent-volume
    use: [gcs-training, a3-ultragpu-cluster]
    settings:
      gcs_bucket_name: $(training_bucket.gcs_bucket_name)
      capacity_gib: 1000000

  # Persistent Volume for checkpoint data
  - id: checkpointing-pv
    source: modules/file-system/gke-persistent-volume
    use: [gcs-checkpointing, a3-ultragpu-cluster]
    settings:
      gcs_bucket_name: $(checkpoint_bucket.gcs_bucket_name)
      capacity_gib: 1000000

  # This is an example job that will install and run an `fio`
  # benchmark against the training and checkpointing buckets.
  - id: fio-bench-job-template
    source: modules/compute/gke-job-template
    use: [checkpointing-pv, training-pv, a3-ultragpu-pool]
    settings:
      security_context:  # to make sure the job has enough access to install the fio packages
      - key: runAsUser
        value: 0
      - key: runAsGroup
        value: 100
      - key: fsGroup
        value: 100
      # By adding an ephemeral volume, this will ensure that the job adds:
      #   nodeSelector:
      #     cloud.google.com/gke-ephemeral-storage-local-ssd: "true"
      # which is the best practice for using local-ssd for ephemeral storage.
      ephemeral_volumes:
      - type: local-ssd
        mount_path: /scratch-data
        size_gb: 1000  # Use 1 out of 12 TB for local scratch

      k8s_service_account_name: workload-identity-k8s-sa
      image: ubuntu:latest

      command:
      - bash
      - -c
      - |
        set -eux
        export DEBIAN_FRONTEND=noninteractive

        # Install fio
        apt update -y && apt install -y fio

        # Use a tag to create a unique path for tests
        TAG=`date +%s`

        # Verify mountpoints
        df -h
        mountpoint /scratch-data
        mountpoint /checkpoint-data
        mountpoint /training-data

        # Create temporary directory for fio benchmarks
        mkdir -p /{scratch,training,checkpoint}-data/fio-benchmarks-${TAG}

        # The following will take roughly 10 minutes to complete

        # Perform scratch data write performance test
        fio --ioengine=libaio --filesize=10G --ramp_time=2s --runtime=1m \
          --numjobs=32 --create_serialize=0 --direct=1 --verify=0 \
          --randrepeat=0 --group_reporting --directory=/scratch-data/fio-benchmarks-${TAG} \
          --name=scratch --blocksize=100m --iodepth=64 --readwrite=write

        # Perform training data reading performance test
        fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m \
          --numjobs=32 --create_serialize=0 --direct=1 --verify=0 \
          --randrepeat=0 --group_reporting --directory=/training-data/fio-benchmarks-${TAG} \
          --name=training --blocksize=1m --iodepth=64 --readwrite=randread

        # Perform checkpoint data writing performance test
        fio --ioengine=libaio --filesize=10G --ramp_time=2s --runtime=1m \
          --numjobs=32 --create_serialize=0 --direct=1 --verify=0 \
          --randrepeat=0 --group_reporting --directory=/checkpoint-data/fio-benchmarks-${TAG} \
          --name=checkpoint --blocksize=100m --iodepth=64 --readwrite=write

        # Perform checkpoint data reading performance test
        fio --ioengine=libaio --filesize=10G --ramp_time=2s --runtime=1m \
          --numjobs=32 --create_serialize=0 --direct=1 --verify=0 \
          --randrepeat=0 --group_reporting --directory=/checkpoint-data/fio-benchmarks-${TAG} \
          --name=checkpoint --blocksize=100m --iodepth=64 --readwrite=read

        # Clean up temporary directories for fio benchmarks
        rm -rf /{scratch,training,checkpoint}-data/fio-benchmarks-${TAG}

    outputs: [instructions]

Severity: high

This blueprint is significantly more complex than the other inference gateway examples (megagpu and highgpu). It includes numerous modules related to training workloads, such as training_bucket, checkpoint_bucket, fio-bench-job-template, and various persistent volume configurations. The purpose of an "inference-gateway" example should be to provide a minimal, focused configuration for deploying inference services. This complexity can be confusing for users and deviates from the stated goal of replicating the existing pattern. Please simplify this blueprint by removing the modules that are not essential for setting up the GKE Inference Gateway, aligning it with the structure of gke-a3-megagpu-inference-gateway.yaml.
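
For reference, a trimmed module list in that spirit might look like the sketch below (module ids are taken from the files in this PR; which modules are strictly required is ultimately the authors' call):

    deployment_groups:
    - group: primary
      modules:
      - id: gke-a3-ultra-net-0        # VPC, including the REGIONAL_MANAGED_PROXY proxy subnet
      - id: gke-a3-ultra-net-1        # additional GVNIC network
      - id: gke-a3-ultra-rdma-net     # RDMA networks for the GPU NICs
      - id: workload_service_account
      - id: node_pool_service_account
      - id: a3-ultragpu-cluster       # gke-cluster with enable_inference_gateway: true
      - id: a3-ultragpu-pool          # a3-ultragpu-8g node pool
      - id: workload-manager-install  # Kueue / JobSet / NCCL gIB plugin install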

Comment on lines +73 to +78
        subnet_ip: 10.128.0.0/20
      - subnet_name: $(vars.deployment_name)-proxy-subnet
        subnet_region: $(vars.region)
        subnet_ip: "10.129.0.0/24"
        purpose: "REGIONAL_MANAGED_PROXY"
        role: "ACTIVE"

Severity: medium

There's an inconsistency in how subnet_ip values are defined. On line 73, the IP is unquoted and has a trailing space, while on line 76, it's quoted. For consistency and correctness, it's best to use a single style. I recommend removing the quotes and the trailing space, as they are not required for these string values.

        subnet_ip: 10.128.0.0/20
      - subnet_name: $(vars.deployment_name)-proxy-subnet
        subnet_region: $(vars.region)
        subnet_ip: 10.129.0.0/24
        purpose: "REGIONAL_MANAGED_PROXY"
        role: "ACTIVE"

Comment on lines +23 to +28
  project_id: PROJECT_ID
  region: COMPUTE_REGION
  zone: COMPUTE_ZONE
  # Cidr block containing the IP of the machine calling terraform.
  # The following line must be updated for this example to work.
  authorized_cidr: IP_ADDRESS/SUFFIX

Severity: medium

The placeholder values and comments in this deployment file are inconsistent with the newly added gke-a3-megagpu-inference-gateway-deployment.yaml and other examples in the repository. For instance, this file uses COMPUTE_REGION and IP_ADDRESS/SUFFIX, whereas the megagpu counterpart provides a default region with a comment and uses <IP_ADDRESS>/<SUFFIX>. Aligning these makes the examples more consistent and easier for users to understand and modify.

  project_id: PROJECT_ID

  # The GCP Region used for this deployment.
  region: us-central1

  # The GCP Zone used for this deployment.
  zone: us-central1-c

  # Cidr block containing the IP of the machine calling terraform.
  # The following line must be updated for this example to work.
  authorized_cidr: <IP_ADDRESS>/<SUFFIX>

Comment on lines +18 to +32
  project_id: # add this
  deployment_name: # add this
  region: # add this
  zone: # add this
  # Cidr block containing the IP of the machine calling terraform.
  # The following line must be updated for this example to work.
  authorized_cidr: # add this
  # The name of the compute engine reservation in the form of
  # <reservation-name>
  # To target a BLOCK_NAME, the name of the extended reservation
  # can be inputted as <reservation-name>/reservationBlocks/<reservation-block-name>
  reservation: # add this
  kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl"))
  gib_installer_path: $(ghpc_stage("./nccl-installer.yaml.tftpl"))
  static_node_count: # add this

Severity: medium

The placeholder comments here (e.g., # add this) and lack of default values are inconsistent with the style used in the gke-a3-megagpu-inference-gateway.yaml blueprint (e.g., ## Set GCP Project ID Here ## and default values provided). To maintain consistency across the example blueprints, please use the same placeholder style and provide sensible defaults where applicable.

  # The following variables should be over-written in the deployment.yaml file.
  # Your GCP Project ID
  project_id: ## Set GCP Project ID Here ##

  # This should be unique across all of your Cluster
  # Toolkit Deployments.
  deployment_name: gke-a3-ultra-ig

  # The GCP Region used for this deployment.
  region: us-central1

  # The GCP Zone used for this deployment.
  zone: us-central1-c

  # Cidr block containing the IP of the machine calling terraform.
  # The following line must be updated for this example to work.
  authorized_cidr:

  # The name of the compute engine reservation in the form of
  # <reservation-name>
  # To target a BLOCK_NAME, the name of the extended reservation
  # can be inputted as <reservation-name>/reservationBlocks/<reservation-block-name>
  reservation:
  kueue_configuration_path: $(ghpc_stage("./kueue-configuration.yaml.tftpl"))
  gib_installer_path: $(ghpc_stage("./nccl-installer.yaml.tftpl"))
  static_node_count: 2

Comment on lines +74 to +79
        subnet_ip: 192.168.0.0/18
      - subnet_name: $(vars.deployment_name)-proxy-subnet
        subnet_region: $(vars.region)
        subnet_ip: "192.168.192.0/24"
        purpose: "REGIONAL_MANAGED_PROXY"
        role: "ACTIVE"

Severity: medium

There's an inconsistency in how subnet_ip values are quoted. On line 74, the IP is unquoted, while on line 77, it's quoted. For consistency within this file and across other YAML blueprints, it's best to use a single style. I recommend removing the quotes for consistency, as they are not required for these string values.

        subnet_ip: 192.168.0.0/18
      - subnet_name: $(vars.deployment_name)-proxy-subnet
        subnet_region: $(vars.region)
        subnet_ip: 192.168.192.0/24
        purpose: "REGIONAL_MANAGED_PROXY"
        role: "ACTIVE"
