- Learn how to provision computing resources for running Big Data analyses using the Infrastructure as Code (IaC) approach.
- Learn how to set up opinionated CI/CD pipelines to deploy cloud infrastructure.
- Learn how to utilize linters for detecting security vulnerabilities in cloud infrastructure.
- Learn how to run Apache Spark code in a distributed way on a Hadoop cluster using Vertex AI notebooks and the Dataproc service on GCP.
- Learn how to use Workload Identity Federation for secure authentication from GitHub Actions to Google Cloud.

## Prerequisites

- Google Cloud SDK
- gsutil
- pre-commit (optional)
- Terraform (see the [Requirements](#requirements) section below)
- Python ~>3.8
- Linux/MacOS
- pre-commit-terraform dependencies (optional)
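
A quick way to sanity-check that the core tools are installed and on your `PATH` (a sketch; these are the standard version flags for each tool):

```bash
gcloud --version        # Google Cloud SDK
gsutil --version
terraform -version      # should satisfy ~> 1.5.0
python3 --version       # should be ~3.8
pre-commit --version    # optional
```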
## Setup

- Redeem a GCP coupon to create a billing account
- Authenticate to GCP to obtain the application default credentials used for running the code:
```bash
# first remove the stored credentials if they exist
gcloud auth application-default revoke
# log in and obtain new application default credentials
gcloud auth application-default login
```
- Export shared environment variables:
```bash
export TF_VAR_tbd_semester=2023Z
# format: 20xx for teachers, student ID number for students
export TF_VAR_user_id=9900
# use your own billing account id
export TF_VAR_billing_account=01D435-06DD59-9A00B5
```
- Enter the `bootstrap` folder, then init the project and the Terraform state bucket:
```bash
cd bootstrap
terraform init
terraform apply
cd ..
```
- CI/CD (GitHub Actions setup using Workload Identity Federation)
- Edit the `env/backend.tfvars` file and set the `bucket` variable to the Terraform state bucket created in the `bootstrap` phase, e.g. as in the sketch below:
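A sketch of what `env/backend.tfvars` could look like; the bucket name below is a hypothetical placeholder:

```hcl
# hypothetical bucket name -- use the state bucket created by bootstrap
bucket = "tbd-2023z-9900-state"
```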
- Edit the `env/project.tfvars` file and set the `project_name` and `iac_service_account` variables using the output from the `bootstrap` phase, e.g. as in the sketch below:
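A sketch of what `env/project.tfvars` could look like; the values below are hypothetical placeholders, so use the actual `bootstrap` outputs instead:

```hcl
# hypothetical values -- replace with the outputs of the bootstrap phase
project_name        = "tbd-2023z-9900"
iac_service_account = "tbd-2023z-9900-lab@tbd-2023z-9900.iam.gserviceaccount.com"
```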
- Edit `cicd_bootstrap/conf/github_actions.tfvars` to set `github_org` and `github_repo`, e.g.:
```hcl
github_org  = "mwiewior"
github_repo = "tbd-2023z-phase1"
```
- Init the state file and set env variables:
```bash
cd cicd_bootstrap
terraform init -backend-config=../env/backend.tfvars
```
- Apply:
```bash
# authenticate the Docker backend with GCP
gcloud auth configure-docker
# create the CI/CD integration using Workload Identity
terraform apply -var-file ../env/project.tfvars -var-file conf/github_actions.tfvars -compact-warnings
cd ..
```
- Use the output variables to configure the GitHub Actions workflow `.github/workflows/pull-request.yml`, e.g. as in the sketch below.
  Please do not edit and hardcode these values in the YAML file but set GitHub Actions secrets instead,
  preserving the secret names, i.e. `GCP_WORKLOAD_IDENTITY_PROVIDER_NAME` and `GCP_WORKLOAD_IDENTITY_SA_EMAIL`.
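A rough sketch of how such a workflow can consume those secrets; the job name and the `checkout`/`auth` action versions here are illustrative assumptions, not taken from the repository:

```yaml
# hypothetical excerpt from .github/workflows/pull-request.yml
jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # required for Workload Identity Federation
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v1
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER_NAME }}
          service_account: ${{ secrets.GCP_WORKLOAD_IDENTITY_SA_EMAIL }}
```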
- Install and configure `pre-commit` (optional):
```bash
pre-commit install
```
- Commit changes, push them to a branch, and open a PR against YOUR repository's main/master branch.
  If you see a warning that the workflows have been disabled, please enable them and re-push your changes!
- Once all pull request checks have passed, please merge your PR and wait until your release job finishes.
- Navigate to the Vertex AI Workbench menu item, find your notebook on the list, press CONNECT and follow the instructions.

- Check if the `pyspark` kernel exists; if not, add a Python 3.8 kernel in your JupyterLab environment:
```bash
python3.8 -m ipykernel install --user --name pyspark
```
- Run a `Hello-world` PySpark application in YARN client mode, e.g. as in the sketch below:
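A minimal sketch of such an application, assuming the notebook is attached to a Dataproc cluster where YARN is the resource manager; the app name and sample data are arbitrary:

```python
from pyspark.sql import SparkSession

# create a Spark session that submits work to the cluster via YARN
# (client deploy mode is the default for interactive sessions)
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("hello-world")
    .getOrCreate()
)

# build a tiny DataFrame and run a distributed action on it
df = spark.createDataFrame([("hello", 1), ("world", 2)], ["word", "count"])
df.show()

spark.stop()
```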
- IMPORTANT ❗ ❗ ❗ Please remember to destroy all the resources after finishing your work:
```bash
terraform init -backend-config=env/backend.tfvars
terraform destroy -no-color -var-file env/project.tfvars
```

## Requirements

| Name | Version |
|---|---|
| terraform | ~> 1.5.0 |
| docker | 3.0.2 |
| google | ~> 4.84.0 |

## Providers

No providers.

## Modules

| Name | Source | Version |
|---|---|---|
| composer | ./modules/composer | n/a |
| data-pipelines | ./modules/data-pipeline | n/a |
| dataproc | ./modules/dataproc | n/a |
| gcr | ./modules/gcr | n/a |
| jupyter_docker_image | ./modules/docker_image | n/a |
| vertex_ai_workbench | ./modules/vertex-ai-workbench | n/a |
| vpc | ./modules/vpc | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| ai_notebook_instance_owner | Vertex AI Workbench owner | `string` | n/a | yes |
| project_name | Project name | `string` | n/a | yes |
| region | GCP region | `string` | `"europe-west1"` | no |

## Outputs

No outputs.