A production-ready Terraform infrastructure template for deploying data discovery systems on Google Cloud Platform. This repository provides a complete, reusable infrastructure setup for building data catalog and metadata discovery solutions.
This Terraform configuration deploys a comprehensive data discovery infrastructure on GCP, including:
- GKE Cluster (optional): Standard mode cluster with Workload Identity and private nodes
- Cloud Composer: Managed Apache Airflow for orchestrating data discovery workflows
- GCS Buckets: Storage for JSONL files (Vertex AI Search) and Markdown reports
- Service Accounts: Least-privilege accounts with appropriate IAM roles
- Artifact Registry: Docker image repository for container workloads
- Monitoring & Logging: Cloud Monitoring dashboards and log sinks
- Dataplex Profiling: Automated data quality and profiling scans (optional module)
- Vertex AI Search: Infrastructure for semantic search over metadata (optional module)
- ✅ Production-Ready: Security best practices, private networking, workload identity
- ✅ Modular Design: Enable/disable GKE, use subdirectories for optional features (see the toggle sketch after this list)
- ✅ Cost-Optimized: Autoscaling, lifecycle policies, configurable resource sizes
- ✅ Fully Parameterized: No hardcoded values, all configuration via variables
- ✅ Read-Only by Design: Discovery service accounts cannot modify source data
- ✅ Well Documented: Comprehensive README, quickstart guide, and inline comments
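
To illustrate the GKE toggle mentioned above, the optional cluster follows the standard Terraform `count` pattern. A minimal sketch, assuming a cluster resource named `discovery` (only `enable_gke` and the variables documented below come from this template):

```hcl
resource "google_container_cluster" "discovery" {
  count    = var.enable_gke ? 1 : 0 # cluster is skipped entirely when disabled
  name     = var.cluster_name
  location = var.region

  network            = var.network
  subnetwork         = var.subnetwork
  initial_node_count = 1

  # Workload Identity pool for keyless GCP API access from pods
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
}
```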
```
┌────────────────────────────────────────────────────────────┐
│                        GCP Project                         │
│                                                            │
│  ┌──────────────┐         ┌─────────────────┐              │
│  │   Cloud      │         │  GKE Cluster    │ (optional)   │
│  │   Composer   │         │  - Workload ID  │              │
│  │   (Airflow)  │         │  - Private Nodes│              │
│  └──────┬───────┘         └────────┬────────┘              │
│         │                          │                       │
│         │        ┌─────────────────┴─────────┐             │
│         │        │                           │             │
│  ┌──────▼────────▼──┐               ┌────────▼──────────┐  │
│  │ Service Accounts │               │ Artifact Registry │  │
│  │ - Discovery (RO) │               │ - Docker Images   │  │
│  │ - Metadata Writer│               └───────────────────┘  │
│  │ - GKE Node       │                                      │
│  │ - Composer       │                                      │
│  └──────┬───────────┘                                      │
│         │                                                  │
│  ┌──────▼──────────────────────────────────┐               │
│  │               GCS Buckets               │               │
│  │ - JSONL files (Vertex AI Search input)  │               │
│  │ - Reports (Human-readable docs)         │               │
│  └─────────────────────────────────────────┘               │
│                                                            │
│  Optional Modules:                                         │
│  ├─ Dataplex Profiling (data quality scans)                │
│  └─ Vertex AI Search (semantic search infrastructure)      │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
- GCP Project: Active GCP project with billing enabled
- Terraform: Version >= 1.5.0 (Install Guide)
- gcloud CLI: Authenticated and configured (Install Guide)
- Permissions: Owner or Editor role on the project (for initial setup)
- Existing Network: VPC and subnet already configured, or use "default" VPC
- Subnet must have secondary IP ranges for GKE pods and services
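
For reference, a subnet that satisfies the secondary-range requirement might be defined as follows (the subnet name and CIDR ranges are placeholders, not values from this template):

```hcl
resource "google_compute_subnetwork" "gke" {
  name          = "data-discovery-subnet" # placeholder
  network       = "default"
  region        = "us-central1"
  ip_cidr_range = "10.10.0.0/20"

  # Alias ranges consumed by GKE; the names must match the
  # pods/services secondary range variables documented below.
  secondary_ip_range {
    range_name    = "podcloud"
    ip_cidr_range = "10.20.0.0/16"
  }
  secondary_ip_range {
    range_name    = "servicecloud"
    ip_cidr_range = "10.30.0.0/20"
  }
}
```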
This infrastructure supports both same-project VPC and cross-project Shared VPC deployments:
- Same-Project VPC: Network and resources in the same GCP project (simpler setup)
- Cross-Project Shared VPC: Network in a host project, resources in a service project (enterprise setup)
- ✅ Flexible Network References: Supports both short names and self-link formats
- ✅ Automatic IAM Configuration: Grants required `compute.networkUser` permissions
- ✅ Configurable Secondary Ranges: Customize GKE pod and service IP ranges
- ✅ Validation: Built-in validation for network format and configuration
| Variable | Description | Required | Default |
|---|---|---|---|
| `network_project_id` | Host project ID (for Shared VPC) | No | Same as `project_id` |
| `network` | VPC network (name or self-link) | Yes | - |
| `subnetwork` | VPC subnet (name or self-link) | Yes | - |
| `pods_secondary_range_name` | Secondary range for GKE pods | No | `"podcloud"` |
| `services_secondary_range_name` | Secondary range for GKE services | No | `"servicecloud"` |
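
A cross-project Shared VPC deployment is then configured entirely through `terraform.tfvars`; a sketch with placeholder project and network names:

```hcl
project_id         = "my-service-project" # where resources are created
network_project_id = "my-host-project"    # Shared VPC host project
network            = "projects/my-host-project/global/networks/shared-vpc"
subnetwork         = "projects/my-host-project/regions/us-central1/subnetworks/shared-subnet"

pods_secondary_range_name     = "podcloud"
services_secondary_range_name = "servicecloud"
```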
```bash
git clone https://github.com/YOUR_ORG/data-discovery-infrastructure-gcp.git
cd data-discovery-infrastructure-gcp

cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project details
```

Minimum required variables:

```hcl
project_id          = "your-gcp-project-id"
network             = "projects/your-gcp-project-id/global/networks/default"
subnetwork          = "projects/your-gcp-project-id/regions/us-central1/subnetworks/default"
jsonl_bucket_name   = "your-gcp-project-id-data-discovery-jsonl"
reports_bucket_name = "your-gcp-project-id-data-discovery-reports"
vertex_datastore_id = "your-vertex-datastore-id"
```

Authenticate with GCP:

```bash
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
gcloud auth application-default login
```

Deploy:

```bash
terraform init
terraform plan
terraform apply
```

This will create all infrastructure components. The process takes approximately 10-15 minutes.
If you enabled GKE (`enable_gke = true`), fetch cluster credentials and bind the Kubernetes service accounts:

```bash
gcloud container clusters get-credentials data-discovery-cluster \
  --region us-central1 \
  --project YOUR_PROJECT_ID

# Create namespace and service accounts
kubectl create namespace data-discovery
kubectl create serviceaccount discovery-agent -n data-discovery
kubectl create serviceaccount metadata-writer -n data-discovery

# Annotate with GCP service accounts
kubectl annotate serviceaccount discovery-agent -n data-discovery \
  iam.gke.io/gcp-service-account=data-discovery-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com
kubectl annotate serviceaccount metadata-writer -n data-discovery \
  iam.gke.io/gcp-service-account=data-discovery-metadata@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

See QUICKSTART.md for detailed step-by-step instructions.
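
The `kubectl annotate` commands above cover the Kubernetes half of Workload Identity; the GCP half is a `roles/iam.workloadIdentityUser` binding on each service account, which this template's Terraform is expected to manage. A minimal sketch (the resource address `google_service_account.discovery` is hypothetical):

```hcl
# Lets the discovery-agent KSA in the data-discovery namespace
# impersonate the data-discovery-agent GCP service account.
resource "google_service_account_iam_member" "discovery_wi" {
  service_account_id = google_service_account.discovery.name # hypothetical address
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[data-discovery/discovery-agent]"
}
```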
| Variable | Description | Default | Required |
|---|---|---|---|
| `project_id` | GCP Project ID | - | ✅ |
| `region` | GCP region | `us-central1` | ❌ |
| `network` | VPC network path | - | ✅ |
| `subnetwork` | VPC subnet path | - | ✅ |
| `jsonl_bucket_name` | JSONL bucket name | - | ✅ |
| `reports_bucket_name` | Reports bucket name | - | ✅ |
| `vertex_datastore_id` | Vertex AI Search datastore ID | - | ✅ |
| Variable | Description | Default |
|---|---|---|
| `enable_gke` | Enable GKE cluster deployment | `true` |
| `cluster_name` | GKE cluster name | `data-discovery-cluster` |
| `machine_type` | Node machine type | `e2-standard-2` |
| `min_node_count` | Minimum nodes | `1` |
| `max_node_count` | Maximum nodes | `5` |

Set `enable_gke = false` to skip GKE deployment and use Cloud Composer only.
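
When GKE is enabled, these variables feed the node pool's autoscaling and node configuration. A minimal sketch (resource addresses are illustrative):

```hcl
resource "google_container_node_pool" "discovery" {
  name    = "discovery-pool" # illustrative
  cluster = google_container_cluster.discovery[0].id

  autoscaling {
    min_node_count = var.min_node_count # 1 by default
    max_node_count = var.max_node_count # 5 by default
  }

  node_config {
    machine_type    = var.machine_type # e2-standard-2 by default
    service_account = google_service_account.gke.email # hypothetical address
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}
```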
| Variable | Description | Default |
|---|---|---|
| `composer_env_name` | Composer environment name | `data-discovery-agent-composer` |
| `composer_image_version` | Airflow version | `composer-3-airflow-2.10.5` |
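
For orientation, a Composer environment wired to these variables looks roughly like the following sketch (the `environment_size` value and resource addresses are assumptions, not taken from this template):

```hcl
resource "google_composer_environment" "discovery" {
  name   = var.composer_env_name
  region = var.region

  config {
    environment_size = "ENVIRONMENT_SIZE_SMALL" # assumption; see the cost estimate below

    software_config {
      image_version = var.composer_image_version # e.g. composer-3-airflow-2.10.5
    }

    node_config {
      service_account = google_service_account.composer.email # hypothetical address
    }
  }
}
```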
See `terraform.tfvars.example` for all available configuration options.
Four service accounts are created with least-privilege permissions:
- Discovery Service Account (`data-discovery-agent`)
  - Purpose: Read-only data discovery operations
  - Permissions: BigQuery metadata viewer, Data Catalog viewer, Logging viewer, DLP reader
- Metadata Write Service Account (`data-discovery-metadata`)
  - Purpose: Write enriched metadata to Data Catalog only
  - Permissions: Data Catalog entry group owner
- GKE Service Account (`data-discovery-gke`) - optional
  - Purpose: GKE node operations
  - Permissions: Logging and monitoring
- Composer Service Account (`data-discovery-composer`)
  - Purpose: Airflow workflow orchestration
  - Permissions: BigQuery read/write, Data Catalog viewer, Vertex AI user, Dataplex admin
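
Each role above is granted as an individual project-level IAM binding. A sketch showing two of the discovery account's read-only roles (resource addresses are illustrative):

```hcl
resource "google_project_iam_member" "discovery_bq_metadata" {
  project = var.project_id
  role    = "roles/bigquery.metadataViewer" # metadata only, no table data access
  member  = "serviceAccount:${google_service_account.discovery.email}"
}

resource "google_project_iam_member" "discovery_catalog_viewer" {
  project = var.project_id
  role    = "roles/datacatalog.viewer"
  member  = "serviceAccount:${google_service_account.discovery.email}"
}
```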
Two regional buckets with lifecycle policies:
- JSONL Bucket: Stores JSONL files for Vertex AI Search ingestion
  - Lifecycle: Nearline after 30 days, delete after 90 days
- Reports Bucket: Stores Markdown reports for human consumption
  - Lifecycle: Nearline after 60 days, delete after 180 days
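
The lifecycle policies translate to `lifecycle_rule` blocks on each bucket. A sketch of the JSONL bucket (the `uniform_bucket_level_access` setting is an assumption):

```hcl
resource "google_storage_bucket" "jsonl" {
  name                        = var.jsonl_bucket_name
  location                    = var.region
  uniform_bucket_level_access = true # assumption

  # Move objects to Nearline after 30 days...
  lifecycle_rule {
    condition {
      age = 30
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  # ...and delete them after 90 days.
  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type = "Delete"
    }
  }
}
```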
Automated data quality and profiling scans for BigQuery tables.
```bash
cd dataplex-profiling/
terraform init
terraform apply
```

See `dataplex-profiling/README.md` for details.
Infrastructure for semantic search over metadata.
```bash
cd vertex-ai-search/
terraform init
terraform apply
```

See `vertex-ai-search/README.md` for details.
Estimated monthly costs (us-central1 region):
| Component | Monthly Cost (USD) |
|---|---|
| Cloud Composer (Small) | $300-400 |
| GKE Cluster (if enabled) | $123 |
| GCS Storage | $5-20 |
| Vertex AI Search | Variable (based on queries) |
| Total | ~$428-543 + Vertex AI Search usage |

Costs vary based on usage. Set `enable_gke = false` to reduce costs.
- ✅ Private GKE Cluster: Nodes have no external IPs
- ✅ Workload Identity: Secure GCP API access without service account keys
- ✅ Least Privilege IAM: Minimal permissions for each service account
- ✅ Read-Only Discovery: Discovery service account cannot modify source data
- ✅ Audit Logging: All operations are logged to Cloud Logging
- ✅ VPC Integration: Uses existing VPC networks
```bash
# Initialize Terraform
terraform init

# Validate configuration
terraform validate

# Plan changes
terraform plan

# Apply changes
terraform apply

# View outputs
terraform output

# Destroy infrastructure (⚠️ careful!)
terraform destroy
```

If you see "API not enabled" errors, wait 2-3 minutes and retry:

```bash
terraform apply
```

APIs take time to propagate after initial enablement.

Verify your network paths in `terraform.tfvars`:

```bash
gcloud compute networks list
gcloud compute networks subnets list --network=YOUR_NETWORK
```

Verify IAM bindings:

```bash
gcloud iam service-accounts get-iam-policy \
  data-discovery-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com
```

See QUICKSTART.md for more troubleshooting steps.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Terraform best practices
- Use variables for all configurable values
- Never hardcode project-specific values
- Update documentation for any changes
- Test changes in a dev environment first
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2025 Contributors to the Data Discovery Infrastructure GCP project
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
For issues, questions, or contributions:
This infrastructure template follows Google Cloud best practices for security, cost optimization, and operational excellence.
GitHub Topics: terraform, gcp, google-cloud, infrastructure-as-code, bigquery, data-discovery, gke, cloud-composer, vertex-ai, apache-2