- Learn how to provision computing resources for running Big Data analyses using the Infrastructure as Code (IaC) approach.
- Learn how to set up opinionated CI/CD pipelines to deploy cloud infrastructure.
- Learn how to utilize linters for detecting security vulnerabilities in cloud infrastructure.
- Learn how to run Apache Spark code in a distributed way on a Hadoop cluster using Vertex AI notebooks and the Dataproc service on GCP.
- Learn how to use Workload Identity Federation for secure authentication from GitHub Actions to Google Cloud.

## Prerequisites

- Google Cloud SDK
- gsutil
- pre-commit (optional)
- Terraform (see the [Requirements](#requirements) section below)
- Python ~>3.8
- Linux/MacOS
- pre-commit-terraform dependencies (optional)
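
A quick way to sanity-check that the core tools are installed and on your `PATH` (a sketch; these are the standard version flags for each tool):

```bash
gcloud --version        # Google Cloud SDK
gsutil --version
terraform -version      # should satisfy ~> 1.5.0
python3 --version       # should be ~3.8
pre-commit --version    # optional
```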
## Setup

- Redeem a GCP coupon to create a billing account
- Authenticate to GCP to obtain the application default credentials used for running the code:
```bash
# first remove the stored credentials if they exist
gcloud auth application-default revoke
# log in and obtain new application default credentials
gcloud auth application-default login
```
- Export shared environment variables:
```bash
export TF_VAR_tbd_semester=2023Z
# format: 20xx for teachers, student ID number for students
export TF_VAR_user_id=9900
# use your own billing account id
export TF_VAR_billing_account=01D435-06DD59-9A00B5
```
- Enter the `bootstrap` folder, then init the project and the Terraform state bucket:
```bash
cd bootstrap
terraform init
terraform apply
cd ..
```
- CI/CD (GitHub Actions setup using Workload Identity Federation)
- Edit the `env/backend.tfvars` file and set the `bucket` variable to the Terraform state bucket created in the `bootstrap` phase, e.g. as in the sketch below:
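A sketch of what `env/backend.tfvars` could look like; the bucket name below is a hypothetical placeholder:

```hcl
# hypothetical bucket name -- use the state bucket created by bootstrap
bucket = "tbd-2023z-9900-state"
```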
- Edit the `env/project.tfvars` file and set the `project_name` and `iac_service_account` variables using the output from the `bootstrap` phase, e.g. as in the sketch below:
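A sketch of what `env/project.tfvars` could look like; the values below are hypothetical placeholders, so use the actual `bootstrap` outputs instead:

```hcl
# hypothetical values -- replace with the outputs of the bootstrap phase
project_name        = "tbd-2023z-9900"
iac_service_account = "tbd-2023z-9900-lab@tbd-2023z-9900.iam.gserviceaccount.com"
```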
- Edit `cicd_bootstrap/conf/github_actions.tfvars` to set `github_org` and `github_repo`, e.g.:
```hcl
github_org  = "mwiewior"
github_repo = "tbd-2023z-phase1"
```
- Init the state file and set env variables:
```bash
cd cicd_bootstrap
terraform init -backend-config=../env/backend.tfvars
```
- Apply:
```bash
# authenticate the Docker backend with GCP
gcloud auth configure-docker
# create the CI/CD integration using Workload Identity
terraform apply -var-file ../env/project.tfvars -var-file conf/github_actions.tfvars -compact-warnings
cd ..
```
- Use the output variables to configure the GitHub Actions workflow `.github/workflows/pull-request.yml`, e.g. as in the sketch below.
  Please do not edit and hardcode these values in the YAML file but set GitHub Actions secrets instead,
  preserving the secret names, i.e. `GCP_WORKLOAD_IDENTITY_PROVIDER_NAME` and `GCP_WORKLOAD_IDENTITY_SA_EMAIL`.
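A rough sketch of how such a workflow can consume those secrets; the job name and the `checkout`/`auth` action versions here are illustrative assumptions, not taken from the repository:

```yaml
# hypothetical excerpt from .github/workflows/pull-request.yml
jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # required for Workload Identity Federation
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v1
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER_NAME }}
          service_account: ${{ secrets.GCP_WORKLOAD_IDENTITY_SA_EMAIL }}
```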
- Install and configure `pre-commit` (optional):
```bash
pre-commit install
```
- Commit changes, push them to a branch, and open a PR against YOUR repository's main/master branch.
  If you see a warning that the workflows have been disabled, please enable them and re-push your changes!
- Once all pull request checks have passed, please merge your PR and wait until your release job finishes.
- Navigate to the Vertex AI Workbench menu item, find your notebook on the list, press CONNECT and follow the instructions.

- Check if the `pyspark` kernel exists; if not, add a Python 3.8 kernel in your JupyterLab environment:
```bash
python3.8 -m ipykernel install --user --name pyspark
```
- Run a `Hello-world` PySpark application in YARN client mode, e.g. as in the sketch below:
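A minimal sketch of such an application, assuming the notebook is attached to a Dataproc cluster where YARN is the resource manager; the app name and sample data are arbitrary:

```python
from pyspark.sql import SparkSession

# create a Spark session that submits work to the cluster via YARN
# (client deploy mode is the default for interactive sessions)
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("hello-world")
    .getOrCreate()
)

# build a tiny DataFrame and run a distributed action on it
df = spark.createDataFrame([("hello", 1), ("world", 2)], ["word", "count"])
df.show()

spark.stop()
```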
- IMPORTANT ❗ ❗ ❗ Please remember to destroy all the resources after finishing your work:
```bash
terraform init -backend-config=env/backend.tfvars
terraform destroy -no-color -var-file env/project.tfvars
```

## Requirements

| Name | Version |
|---|---|
| terraform | ~> 1.5.0 |
| docker | 3.0.2 |
| google | ~> 4.84.0 |

## Providers

No providers.

## Modules

| Name | Source | Version |
|---|---|---|
| composer | ./modules/composer | n/a |
| data-pipelines | ./modules/data-pipeline | n/a |
| dataproc | ./modules/dataproc | n/a |
| gcr | ./modules/gcr | n/a |
| jupyter_docker_image | ./modules/docker_image | n/a |
| vertex_ai_workbench | ./modules/vertex-ai-workbench | n/a |
| vpc | ./modules/vpc | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| ai_notebook_instance_owner | Vertex AI Workbench owner | `string` | n/a | yes |
| project_name | Project name | `string` | n/a | yes |
| region | GCP region | `string` | `"europe-west1"` | no |

## Outputs

No outputs.