| 1 | +# Prow CI/CD Cluster Documentation |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document covers the long-running, reserved GKE cluster that runs Prow CI/CD jobs. It explains how to access the cluster, inspect it, update it, and remove it if needed. |
| 6 | + |
| 7 | +- **Cluster Name**: `hyperfleet-dev-prow` |
| 8 | +- **GCP Project**: `hcm-hyperfleet` |
| 9 | +- **Connect Command**: `gcloud container clusters get-credentials hyperfleet-dev-prow --zone us-central1-a --project hcm-hyperfleet` |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## Usage Policy |
| 14 | + |
| 15 | +**This cluster is dedicated to running Prow CI/CD jobs for the team.** |
| 16 | + |
| 17 | +- **Read-only operations** (viewing cluster info, logs, etc.) can be performed by all team members |
| 18 | +- **Modifications** (updates, deletions, configuration changes) to the cluster or the `prow-hyperfleet` namespace should follow these best practices: |
| 19 | + 1. Get **explicit approval** from team leaders |
| 20 | + 2. Send a **team-wide broadcast via Slack** before taking action to ensure everyone is aware of potential impacts |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +## Prerequisites for Viewing Cluster |
| 25 | + |
| 26 | +```bash |
| 27 | +# Install required tools |
| 28 | +gcloud components install kubectl gke-gcloud-auth-plugin |
| 29 | +``` |
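Before moving on, it can help to confirm that the tools are actually on your `PATH`. The snippet below is a minimal sketch; the `check_tools` helper is hypothetical and not part of the hyperfleet-infra repo.

```bash
# Hypothetical helper: report any required CLI that is not on PATH.
# Prints one line per missing tool and returns non-zero if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools gcloud kubectl gke-gcloud-auth-plugin \
  || echo "Install the tools listed above before continuing."
```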
| 30 | + |
| 31 | +## Prerequisites for Terraform Operations |
| 32 | + |
| 33 | +**Only needed if you want to view Terraform state, update, or remove the cluster.** |
| 34 | + |
| 35 | +```bash |
| 36 | +# Install Terraform |
| 37 | +brew install terraform # via Homebrew; Terraform >= 1.5 is required |
| 38 | + |
| 39 | +# Clone the infrastructure repository |
| 40 | +git clone https://github.com/openshift-hyperfleet/hyperfleet-infra.git |
| 41 | +cd hyperfleet-infra |
| 42 | +``` |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## How to Access the Cluster |
| 47 | + |
| 48 | +### 1. Authenticate with GCP |
| 49 | + |
| 50 | +```bash |
| 51 | +gcloud auth login |
| 52 | +gcloud config set project hcm-hyperfleet |
| 53 | +``` |
| 54 | + |
| 55 | +### 2. Get Cluster Credentials |
| 56 | + |
| 57 | +```bash |
| 58 | +gcloud container clusters get-credentials hyperfleet-dev-prow \ |
| 59 | + --zone us-central1-a \ |
| 60 | + --project hcm-hyperfleet |
| 61 | +``` |
| 62 | + |
| 63 | +### 3. Verify Access |
| 64 | + |
| 65 | +```bash |
| 66 | +kubectl get namespaces |
| 67 | +kubectl get pods -n prow-hyperfleet |
| 68 | +``` |
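If you work with several clusters, you may also want to check that `kubectl` is pointed at this one. `gcloud container clusters get-credentials` names the kubeconfig context `gke_<project>_<location>_<cluster>`. The sketch below builds that name and compares it with the current context. The `gke_context_name` helper is hypothetical.

```bash
# Hypothetical helper: build the kubeconfig context name that
# get-credentials creates for a GKE cluster.
# Arguments: project, location (zone or region), cluster name.
gke_context_name() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

expected=$(gke_context_name hcm-hyperfleet us-central1-a hyperfleet-dev-prow)
# Empty if kubectl is missing or no context is set.
current=$(kubectl config current-context 2>/dev/null || true)

if [ "$current" = "$expected" ]; then
  echo "kubectl is pointed at $expected"
else
  echo "current context is '${current:-none}', expected '$expected'"
fi
```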
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## How to Get Cluster Information |
| 73 | + |
| 74 | +### View Cluster Details |
| 75 | + |
| 76 | +```bash |
| 77 | +# Cluster status and configuration |
| 78 | +gcloud container clusters describe hyperfleet-dev-prow \ |
| 79 | + --zone us-central1-a \ |
| 80 | + --project hcm-hyperfleet |
| 81 | + |
| 82 | +# Node information |
| 83 | +kubectl get nodes -o wide |
| 84 | + |
| 85 | +# Running workloads |
| 86 | +kubectl get all -n prow-hyperfleet |
| 87 | +``` |
| 88 | + |
| 89 | +### View Terraform State and Pub/Sub Outputs |
| 90 | + |
| 91 | +**First, clone the repo if you haven't already** (see [Prerequisites for Terraform Operations](#prerequisites-for-terraform-operations)). |
| 92 | + |
| 93 | +```bash |
| 94 | +cd hyperfleet-infra/terraform # run from the directory that contains your clone |
| 95 | + |
| 96 | +# Initialize with Prow backend |
| 97 | +terraform init -backend-config=envs/gke/dev-prow.tfbackend |
| 98 | + |
| 99 | +# View all managed resources |
| 100 | +terraform state list |
| 101 | + |
| 102 | +# View outputs (includes Pub/Sub config, etc.) |
| 103 | +terraform output |
| 104 | + |
| 105 | +# View Pub/Sub resources |
| 106 | +terraform output pubsub_config |
| 107 | +terraform output pubsub_resources |
| 108 | +``` |
| 109 | + |
| 110 | +--- |
| 111 | + |
| 112 | +## How to Update the Cluster |
| 113 | + |
| 114 | +**⚠️ REMINDER**: Review the [Usage Policy](#usage-policy) before proceeding. Leader approval and a team-wide Slack broadcast are recommended. |
| 115 | + |
| 116 | +**First, clone the repo if you haven't already** (see [Prerequisites for Terraform Operations](#prerequisites-for-terraform-operations)). |
| 117 | + |
| 118 | +### 1. Navigate to Terraform Directory |
| 119 | + |
| 120 | +```bash |
| 121 | +cd hyperfleet-infra/terraform # run from the directory that contains your clone |
| 122 | +``` |
| 123 | + |
| 124 | +### 2. Initialize Terraform with Prow Backend |
| 125 | + |
| 126 | +```bash |
| 127 | +terraform init -backend-config=envs/gke/dev-prow.tfbackend |
| 128 | +``` |
| 129 | + |
| 130 | +### 3. Edit Configuration |
| 131 | + |
| 132 | +Edit `envs/gke/dev-prow.tfvars` with your changes: |
| 133 | + |
| 134 | +```hcl |
| 135 | +# Common changes: |
| 136 | +node_count = 2 # Scale up/down |
| 137 | +machine_type = "e2-standard-8" # Change VM size |
| 138 | +use_spot_vms = false # Switch to regular VMs |
| 139 | +``` |
| 140 | + |
| 141 | +### 4. Preview and Apply Changes |
| 142 | + |
| 143 | +```bash |
| 144 | +# Review what will change |
| 145 | +terraform plan -var-file=envs/gke/dev-prow.tfvars |
| 146 | + |
| 147 | +# Coordinate with team before applying |
| 148 | +# Then apply changes |
| 149 | +terraform apply -var-file=envs/gke/dev-prow.tfvars |
| 150 | +``` |
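To tell quickly whether a plan contains any changes, `terraform plan` accepts `-detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below turns that status into a short message. The `describe_plan_status` helper is hypothetical, and the plan only runs if `terraform` is installed.

```bash
# Hypothetical helper: translate the exit status of
# `terraform plan -detailed-exitcode` into a message.
describe_plan_status() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending" ;;
    *) echo "unexpected exit status: $1" ;;
  esac
}

# Run the plan only when terraform is available on PATH.
if command -v terraform >/dev/null 2>&1; then
  status=0
  terraform plan -detailed-exitcode -var-file=envs/gke/dev-prow.tfvars || status=$?
  describe_plan_status "$status"
fi
```

A "changes pending" result is the cue to coordinate with the team before running `terraform apply`.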
| 151 | + |
| 152 | +### 5. Verify Changes |
| 153 | + |
| 154 | +```bash |
| 155 | +kubectl get nodes |
| 156 | +kubectl get pods -n prow-hyperfleet |
| 157 | +``` |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +## How to Remove the Cluster |
| 162 | + |
| 163 | +**⚠️ WARNING**: This destroys the entire Prow cluster. Review the [Usage Policy](#usage-policy) before proceeding. Leader approval and team-wide Slack coordination are strongly recommended. |
| 164 | + |
| 165 | +**First, clone the repo if you haven't already** (see [Prerequisites for Terraform Operations](#prerequisites-for-terraform-operations)). |
| 166 | + |
| 167 | +### 1. Disable Deletion Protection |
| 168 | + |
| 169 | +Edit `envs/gke/dev-prow.tfvars`: |
| 170 | + |
| 171 | +```hcl |
| 172 | +enable_deletion_protection = false |
| 173 | +``` |
| 174 | + |
| 175 | +Apply the change: |
| 176 | + |
| 177 | +```bash |
| 178 | +cd hyperfleet-infra/terraform # run from the directory that contains your clone |
| 179 | +terraform init -backend-config=envs/gke/dev-prow.tfbackend |
| 180 | +terraform apply -var-file=envs/gke/dev-prow.tfvars |
| 181 | +``` |
| 182 | + |
| 183 | +### 2. Destroy the Cluster |
| 184 | + |
| 185 | +```bash |
| 186 | +terraform destroy -var-file=envs/gke/dev-prow.tfvars |
| 187 | +``` |
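Because the destroy step is irreversible, some teams add a typed confirmation in front of it. The following is a minimal sketch of such a guard, not something the hyperfleet-infra repo provides. The `confirm_cluster_name` helper is hypothetical.

```bash
# Hypothetical guard: succeeds only if the line read from standard input
# matches the expected cluster name exactly.
confirm_cluster_name() {
  expected="$1"
  # Prompt on stderr so stdout stays clean.
  printf 'Type the cluster name (%s) to continue: ' "$expected" >&2
  IFS= read -r answer || answer=""
  [ "$answer" = "$expected" ]
}

# Usage (wrapping the destroy command shown above):
#   confirm_cluster_name hyperfleet-dev-prow \
#     && terraform destroy -var-file=envs/gke/dev-prow.tfvars
```

This guard complements the approval and Slack broadcast described in the Usage Policy and does not replace them.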
| 188 | + |
| 189 | +### 3. Recreate (if needed) |
| 190 | + |
| 191 | +```bash |
| 192 | +# First re-enable deletion protection by setting this in envs/gke/dev-prow.tfvars: |
| 193 | +#   enable_deletion_protection = true |
| 194 | + |
| 195 | +# Create cluster |
| 196 | +terraform apply -var-file=envs/gke/dev-prow.tfvars |
| 197 | +``` |
| 198 | + |
| 199 | +--- |
| 200 | + |
| 201 | +## Key Configuration Files in hyperfleet-infra Repo |
| 202 | + |
| 203 | +| File | Purpose | |
| 204 | +|------|---------| |
| 205 | +| `terraform/envs/gke/dev-prow.tfvars` | Cluster configuration (nodes, machine type, etc.) | |
| 206 | +| `terraform/envs/gke/dev-prow.tfbackend` | Remote state configuration | |
| 207 | +| `terraform/main.tf` | Main Terraform module | |
| 208 | + |
| 209 | +--- |
| 210 | + |
| 211 | +## Troubleshooting |
| 212 | + |
| 213 | +### Can't Connect to Cluster |
| 214 | + |
| 215 | +```bash |
| 216 | +# Re-authenticate |
| 217 | +gcloud auth login |
| 218 | +gcloud container clusters get-credentials hyperfleet-dev-prow \ |
| 219 | + --zone us-central1-a \ |
| 220 | + --project hcm-hyperfleet |
| 221 | +``` |
| 222 | + |
| 223 | +### Terraform State Lock Issues |
| 224 | + |
| 225 | +**Note**: Terraform automatically locks the state file when using the remote backend (GCS) to prevent concurrent modifications. This is already enabled and working. |
| 226 | + |
| 227 | +If a Terraform operation is interrupted (crashed, network issue, etc.), the lock may remain stuck. To resolve: |
| 228 | + |
| 229 | +```bash |
| 230 | +# First, confirm no one is currently running terraform operations |
| 231 | +# Then force-unlock using the lock ID from the error message |
| 232 | +terraform force-unlock <LOCK_ID> |
| 233 | +``` |
| 234 | + |
| 235 | +**⚠️ WARNING**: Only use `force-unlock` after confirming no one else is actively running Terraform operations, as this can cause state corruption if multiple people modify state simultaneously. |
| 236 | + |
| 237 | +--- |
| 238 | + |
| 239 | +## Additional Documentation |
| 240 | + |
| 241 | +- **Detailed infrastructure docs**: `terraform/README.md` (in the cloned repo) |
| 242 | +- **Shared VPC setup**: `terraform/shared/README.md` (in the cloned repo) |
| 243 | + |
| 244 | +--- |
| 245 | + |
| 246 | +**Last Updated**: 2026-01-23 |