Data Discovery Infrastructure - Google Cloud Platform


A production-ready Terraform infrastructure template for deploying data discovery systems on Google Cloud Platform. This repository provides a complete, reusable infrastructure setup for building data catalog and metadata discovery solutions.

Overview

This Terraform configuration deploys a comprehensive data discovery infrastructure on GCP, including:

  • GKE Cluster (optional): Standard mode cluster with Workload Identity and private nodes
  • Cloud Composer: Managed Apache Airflow for orchestrating data discovery workflows
  • GCS Buckets: Storage for JSONL files (Vertex AI Search) and Markdown reports
  • Service Accounts: Least-privilege accounts with appropriate IAM roles
  • Artifact Registry: Docker image repository for container workloads
  • Monitoring & Logging: Cloud Monitoring dashboards and log sinks
  • Dataplex Profiling: Automated data quality and profiling scans (optional module)
  • Vertex AI Search: Infrastructure for semantic search over metadata (optional module)

Features

  • Production-Ready: Security best practices, private networking, workload identity
  • Modular Design: Enable/disable GKE, use subdirectories for optional features
  • Cost-Optimized: Autoscaling, lifecycle policies, configurable resource sizes
  • Fully Parameterized: No hardcoded values, all configuration via variables
  • Read-Only by Design: Discovery service accounts cannot modify source data
  • Well Documented: Comprehensive README, quickstart guide, and inline comments

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         GCP Project                         │
│                                                             │
│  ┌──────────────┐      ┌─────────────────┐                  │
│  │ Cloud        │      │ GKE Cluster     │ (optional)       │
│  │ Composer     │      │ - Workload ID   │                  │
│  │ (Airflow)    │      │ - Private Nodes │                  │
│  └──────┬───────┘      └────────┬────────┘                  │
│         │                       │                           │
│         │    ┌──────────────────┴─────────────┐             │
│         │    │                                │             │
│  ┌──────▼────▼──────┐             ┌───────────▼─────────┐   │
│  │ Service Accounts │             │ Artifact Registry   │   │
│  │ - Discovery (RO) │             │ - Docker Images     │   │
│  │ - Metadata Writer│             └─────────────────────┘   │
│  │ - GKE Node       │                                       │
│  │ - Composer       │                                       │
│  └──────┬───────────┘                                       │
│         │                                                   │
│  ┌──────▼─────────────────────────────────┐                 │
│  │ GCS Buckets                            │                 │
│  │ - JSONL files (Vertex AI Search input) │                 │
│  │ - Reports (Human-readable docs)        │                 │
│  └────────────────────────────────────────┘                 │
│                                                             │
│  Optional Modules:                                          │
│  ├─ Dataplex Profiling (data quality scans)                 │
│  └─ Vertex AI Search (semantic search infrastructure)       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Prerequisites

  • GCP Project: Active GCP project with billing enabled
  • Terraform: Version >= 1.5.0 (Install Guide)
  • gcloud CLI: Authenticated and configured (Install Guide)
  • Permissions: Owner or Editor role on the project (for initial setup)
  • Existing Network: VPC and subnet already configured, or use "default" VPC
    • Subnet must have secondary IP ranges for GKE pods and services (see the sketch below)
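
If your subnet does not yet have secondary ranges, they can be added in Terraform. A minimal sketch, assuming hypothetical subnet name and CIDR values; the range names match this module's defaults:

resource "google_compute_subnetwork" "discovery" {
  name          = "discovery-subnet" # hypothetical subnet name
  network       = "default"
  region        = "us-central1"
  ip_cidr_range = "10.0.0.0/20"      # hypothetical primary range

  # Secondary ranges consumed by GKE pods and services;
  # the names match this module's defaults.
  secondary_ip_range {
    range_name    = "podcloud"
    ip_cidr_range = "10.4.0.0/14"    # hypothetical pod range
  }

  secondary_ip_range {
    range_name    = "servicecloud"
    ip_cidr_range = "10.8.0.0/20"    # hypothetical service range
  }
}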

Network Configuration

This infrastructure supports both same-project VPC and cross-project Shared VPC deployments:

  • Same-Project VPC: Network and resources in the same GCP project (simpler setup)
  • Cross-Project Shared VPC: Network in a host project, resources in a service project (enterprise setup)

Key Features

  • Flexible Network References: Supports both short names and self-link formats
  • Automatic IAM Configuration: Grants required compute.networkUser permissions
  • Configurable Secondary Ranges: Customize GKE pod and service IP ranges
  • Validation: Built-in validation for network format and configuration

Configuration Variables

| Variable | Description | Required | Default |
|----------|-------------|----------|---------|
| network_project_id | Host project ID (for Shared VPC) | No | Same as project_id |
| network | VPC network (name or self-link) | Yes | - |
| subnetwork | VPC subnet (name or self-link) | Yes | - |
| pods_secondary_range_name | Secondary range for GKE pods | No | "podcloud" |
| services_secondary_range_name | Secondary range for GKE services | No | "servicecloud" |
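
For a cross-project Shared VPC deployment, the variables combine roughly as in this hypothetical terraform.tfvars fragment (all project and network names are placeholders):

# Hypothetical Shared VPC configuration
network_project_id            = "shared-vpc-host-project"
network                       = "projects/shared-vpc-host-project/global/networks/corp-vpc"
subnetwork                    = "projects/shared-vpc-host-project/regions/us-central1/subnetworks/corp-subnet"
pods_secondary_range_name     = "podcloud"
services_secondary_range_name = "servicecloud"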

Quick Start

1. Clone the Repository

git clone https://github.com/YOUR_ORG/data-discovery-infrastructure-gcp.git
cd data-discovery-infrastructure-gcp

2. Configure Variables

cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project details

Minimum required variables:

project_id          = "your-gcp-project-id"
network             = "projects/your-gcp-project-id/global/networks/default"
subnetwork          = "projects/your-gcp-project-id/regions/us-central1/subnetworks/default"
jsonl_bucket_name   = "your-gcp-project-id-data-discovery-jsonl"
reports_bucket_name = "your-gcp-project-id-data-discovery-reports"
vertex_datastore_id = "your-vertex-datastore-id"

3. Authenticate with GCP

gcloud auth login
gcloud config set project YOUR_PROJECT_ID
gcloud auth application-default login

4. Deploy Infrastructure

terraform init
terraform plan
terraform apply

This will create all infrastructure components. The process takes approximately 10-15 minutes.
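
If you prefer to apply exactly the plan you reviewed, the standard Terraform plan-file workflow works here as well:

terraform plan -out=tfplan   # save the reviewed plan to a file
terraform apply tfplan       # apply that exact plan, with no re-planning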

5. (Optional) Configure GKE Access

If you enabled GKE (enable_gke = true):

gcloud container clusters get-credentials data-discovery-cluster \
  --region us-central1 \
  --project YOUR_PROJECT_ID

# Create namespace and service accounts
kubectl create namespace data-discovery
kubectl create serviceaccount discovery-agent -n data-discovery
kubectl create serviceaccount metadata-writer -n data-discovery

# Annotate with GCP service accounts
kubectl annotate serviceaccount discovery-agent -n data-discovery \
  iam.gke.io/gcp-service-account=data-discovery-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com

kubectl annotate serviceaccount metadata-writer -n data-discovery \
  iam.gke.io/gcp-service-account=data-discovery-metadata@YOUR_PROJECT_ID.iam.gserviceaccount.com
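
To sanity-check the Workload Identity wiring, one option is to run a throwaway pod under the annotated Kubernetes service account and list the active credentials; if the binding works, gcloud should report the GCP service account rather than the node identity. The pod name and image below are arbitrary choices:

kubectl run wi-check -n data-discovery --rm -it \
  --image=google/cloud-sdk:slim \
  --overrides='{"spec": {"serviceAccountName": "discovery-agent"}}' \
  -- gcloud auth list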

See QUICKSTART.md for detailed step-by-step instructions.

Configuration Options

Core Configuration

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| project_id | GCP Project ID | - | Yes |
| region | GCP region | us-central1 | No |
| network | VPC network path | - | Yes |
| subnetwork | VPC subnet path | - | Yes |
| jsonl_bucket_name | JSONL bucket name | - | Yes |
| reports_bucket_name | Reports bucket name | - | Yes |
| vertex_datastore_id | Vertex AI Search datastore ID | - | Yes |

GKE Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| enable_gke | Enable GKE cluster deployment | true |
| cluster_name | GKE cluster name | data-discovery-cluster |
| machine_type | Node machine type | e2-standard-2 |
| min_node_count | Minimum nodes | 1 |
| max_node_count | Maximum nodes | 5 |

Set enable_gke = false to skip GKE deployment and use Cloud Composer only.

Cloud Composer Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| composer_env_name | Composer environment name | data-discovery-agent-composer |
| composer_image_version | Airflow version | composer-3-airflow-2.10.5 |

Full Configuration Reference

See terraform.tfvars.example for all available configuration options.

Infrastructure Components

Service Accounts

Four service accounts are created with least-privilege permissions (a Workload Identity binding sketch follows the list):

  1. Discovery Service Account (data-discovery-agent)

    • Purpose: Read-only data discovery operations
    • Permissions: BigQuery metadata viewer, Data Catalog viewer, Logging viewer, DLP reader
  2. Metadata Write Service Account (data-discovery-metadata)

    • Purpose: Write enriched metadata to Data Catalog only
    • Permissions: Data Catalog entry group owner
  3. GKE Service Account (data-discovery-gke) - Optional

    • Purpose: GKE node operations
    • Permissions: Logging and monitoring
  4. Composer Service Account (data-discovery-composer)

    • Purpose: Airflow workflow orchestration
    • Permissions: BigQuery read/write, Data Catalog viewer, Vertex AI user, Dataplex admin
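
For the accounts used from GKE, the link between a Kubernetes service account and its GCP counterpart is a roles/iam.workloadIdentityUser binding. A minimal Terraform sketch, assuming hypothetical resource names and the data-discovery namespace used in the Quick Start:

resource "google_service_account_iam_member" "discovery_workload_identity" {
  service_account_id = google_service_account.discovery.name # assumed resource name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[data-discovery/discovery-agent]"
}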

GCS Buckets

Two regional buckets with lifecycle policies (sketched in Terraform after this list):

  • JSONL Bucket: Stores JSONL files for Vertex AI Search ingestion

    • Lifecycle: Nearline after 30 days, delete after 90 days
  • Reports Bucket: Stores Markdown reports for human consumption

    • Lifecycle: Nearline after 60 days, delete after 180 days
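
As an illustration, the JSONL bucket's policy (Nearline after 30 days, delete after 90) maps onto Terraform lifecycle rules roughly like this; the resource name and location argument are assumptions:

resource "google_storage_bucket" "jsonl" {
  name     = var.jsonl_bucket_name
  location = var.region

  # Move objects to Nearline storage after 30 days
  lifecycle_rule {
    condition {
      age = 30
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  # Delete objects after 90 days
  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type = "Delete"
    }
  }
}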

Optional Modules

Dataplex Profiling

Automated data quality and profiling scans for BigQuery tables.

cd dataplex-profiling/
terraform init
terraform apply

See dataplex-profiling/README.md for details.

Vertex AI Search

Infrastructure for semantic search over metadata.

cd vertex-ai-search/
terraform init
terraform apply

See vertex-ai-search/README.md for details.

Cost Estimation

Estimated monthly costs (us-central1 region):

| Component | Monthly Cost (USD) |
|-----------|--------------------|
| Cloud Composer (Small) | $300-400 |
| GKE Cluster (if enabled) | $123 |
| GCS Storage | $5-20 |
| Vertex AI Search | Variable (based on queries) |
| Total | $430-550 with GKE; $305-430 without GKE |

Costs vary based on usage. Use enable_gke = false to reduce costs.

Security Features

  • Private GKE Cluster: Nodes have no external IPs
  • Workload Identity: Secure GCP API access without service account keys
  • Least Privilege IAM: Minimal permissions for each service account
  • Read-Only Discovery: Discovery service account cannot modify source data
  • Audit Logging: All operations are logged to Cloud Logging
  • VPC Integration: Uses existing VPC networks

Terraform Commands

# Initialize Terraform
terraform init

# Validate configuration
terraform validate

# Plan changes
terraform plan

# Apply changes
terraform apply

# View outputs
terraform output

# Destroy infrastructure (⚠️ careful!)
terraform destroy

Troubleshooting

API Not Enabled

If you see "API not enabled" errors, wait 2-3 minutes and retry:

terraform apply

APIs take time to propagate after initial enablement.
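
If retries keep failing, you can enable the APIs explicitly. Given the components this template deploys, the services below are the likely candidates (adjust to match your configuration):

gcloud services enable \
  composer.googleapis.com \
  container.googleapis.com \
  artifactregistry.googleapis.com \
  dataplex.googleapis.com \
  discoveryengine.googleapis.com \
  --project=YOUR_PROJECT_ID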

Network Not Found

Verify your network paths in terraform.tfvars:

gcloud compute networks list
gcloud compute networks subnets list --network=YOUR_NETWORK
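
A missing secondary range causes a similar class of error for GKE; you can confirm the ranges configured on your subnet directly:

gcloud compute networks subnets describe YOUR_SUBNET \
  --region=us-central1 \
  --format="value(secondaryIpRanges)"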

Workload Identity Issues

Verify IAM bindings:

gcloud iam service-accounts get-iam-policy \
  data-discovery-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com
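
For the discovery account, the policy should include a roles/iam.workloadIdentityUser binding along these lines (illustrative output; the namespace and Kubernetes service account come from the Quick Start):

bindings:
- members:
  - serviceAccount:YOUR_PROJECT_ID.svc.id.goog[data-discovery/discovery-agent]
  role: roles/iam.workloadIdentityUser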

See QUICKSTART.md for more troubleshooting steps.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow Terraform best practices
  • Use variables for all configurable values
  • Never hardcode project-specific values
  • Update documentation for any changes
  • Test changes in a dev environment first

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright 2025 Contributors to the Data Discovery Infrastructure GCP project

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Support

For issues, questions, or contributions, please open an issue or pull request on this repository.

Related Projects

Acknowledgments

This infrastructure template follows Google Cloud best practices for security, cost optimization, and operational excellence.


GitHub Topics: terraform, gcp, google-cloud, infrastructure-as-code, bigquery, data-discovery, gke, cloud-composer, vertex-ai, apache-2
