
Commit e787b19

fix CI/CD pipe (#6)
1 parent 6c9d32d commit e787b19

File tree: 12 files changed (+150 −121 lines)


.github/workflows/onpush.yml

Lines changed: 9 additions & 5 deletions

@@ -12,7 +12,7 @@ jobs:
 
     runs-on: ubuntu-latest
     strategy:
-      max-parallel: 4
+      max-parallel: 1
 
     env:
       DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
@@ -33,10 +33,14 @@ jobs:
           pipenv run curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
           pipenv run databricks --version
 
-      - name: Package and Deployment
+      - name: Deploy on staging
        run: |
-          make deploy-dev
+          make deploy-staging
 
-      - name: Run
+      - name: Run on staging
        run: |
-          make deploy-ci
+          make run-staging
+
+      - name: Deploy on prod
+        run: |
+          make deploy-prod

Makefile

Lines changed: 12 additions & 6 deletions

@@ -1,7 +1,9 @@
 install:
+	python3 -m pip install --upgrade pip
+	pip install pipenv
 	pipenv install packages
 	pipenv run pytest tests/
-	pipenv shell
+	pipenv run pip list
 
 pre-commit:
 	pre-commit autoupdate
@@ -11,9 +13,13 @@ deploy-dev:
 	python ./scripts/generate_template_workflow.py dev
 	databricks bundle deploy --target dev
 
-run-dev:
-	databricks bundle run default_python_job --target dev
+deploy-staging:
+	pipenv run python ./scripts/generate_template_workflow.py staging
+	pipenv run databricks bundle deploy --target staging
 
-deploy-ci:
-	pipenv run python ./scripts/generate_template_workflow.py ci
-	pipenv run databricks bundle deploy --target ci
+run-staging:
+	pipenv run databricks bundle run default_python_job --target staging
+
+deploy-prod:
+	pipenv run python ./scripts/generate_template_workflow.py prod
+	pipenv run databricks bundle deploy --target prod
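
The new staging and prod targets call scripts/generate_template_workflow.py before deploying the bundle. That script is not included in this commit, so the snippet below is only a minimal sketch of what such a renderer could look like, assuming a Jinja2 template in conf/wf_template.yml; the template variable name (env) and the output path (conf/workflow.yml) are assumptions, not facts taken from the diff.

# Hypothetical sketch of scripts/generate_template_workflow.py (not part of this commit).
import sys
from pathlib import Path

from jinja2 import Template


def main() -> None:
    # Target name passed by the Makefile: dev, staging or prod.
    target = sys.argv[1] if len(sys.argv) > 1 else "dev"
    template = Template(Path("conf/wf_template.yml").read_text())
    rendered = template.render(env=target)  # assumed template variable
    # Assumed output location; the bundle would have to reference the rendered file.
    Path("conf/workflow.yml").write_text(rendered)
    print(f"Rendered workflow for target '{target}'")


if __name__ == "__main__":
    main()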

Pipfile

Lines changed: 1 addition & 2 deletions

@@ -5,12 +5,10 @@ name = "pypi"
 
 [packages]
 funcy = "==2.0"
-packages = "*"
 numpy = "==1.23.5"
 pandas = "==1.5.3"
 pyarrow = "8.0.0"
 pydantic = "==2.7.4"
-unidecode = "==1.3.8"
 wheel = "==0.44.0"
 coverage = "==7.6.1"
 setuptools = "==72.1.0"
@@ -19,6 +17,7 @@ pytest = "==8.3.2"
 jinja2 = "==3.1.4"
 pyspark = "==3.5.1"
 pytest-cov = "==5.0.0"
+packages = "*"
 
 [dev-packages]
 

README.md

Lines changed: 14 additions & 14 deletions

@@ -1,13 +1,14 @@
 
-# Project Template for Spark/Databricks with Python packaging and CI/CD automation
+# Databricks template project with Asset Bundles, Python packaging and CI/CD automation
 
 This project template provides a structured approach to enhance your productivity when delivering ETL pipelines on Databricks. Feel free to customize it based on your project's specific nuances and the audience you are targeting.
 
 This project template demonstrates how to:
 
 - structure your PySpark code inside classes/packages.
-- package your code and move it on different environments on a CI/CD pipeline.
+- package your code and move it on different environments (dev, staging, prod) on a CI/CD pipeline.
 - configure your workflow to run in different environments with different configurations with [jinja package](https://pypi.org/project/jinja2/)
+- configure your workflow to selectively run tasks, preventing collisions and interference between developers working in parallel.
 - use a [medallion architecture](https://www.databricks.com/glossary/medallion-architecture) pattern by improving the data quality as it goes through more refinement.
 - use a Make file to automate repetitive tasks on local env.
 - lint and format the code with [ruff](https://docs.astral.sh/ruff/) and [pre-commit](https://pre-commit.com/).
@@ -16,11 +17,12 @@ This project template demonstrates how to:
 - utilize [pytest package](https://pypi.org/project/pytest/) to run unit tests on transformations.
 - utilize [argparse package](https://pypi.org/project/argparse/) to build a flexible command line interface to start your jobs.
 - utilize [funcy package](https://pypi.org/project/funcy/) to log the execution time of each transformation.
-- utilize [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html) and (the new!!!) [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) to package/deploy/run a Python wheel package on Databricks.
+- utilize [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html) and [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) to package/deploy/run a Python wheel package on Databricks.
 - utilize [Databricks SDK for Python](https://docs.databricks.com/en/dev-tools/sdk-python.html) to manage workspaces and accounts. This script enables your metastore system tables that have [relevant data about billing, usage, lineage, prices, and access](https://www.youtube.com/watch?v=LcRWHzk8Wm4).
 - utilize [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) instead of Hive as your data catalog and earn for free data lineage for your tables and columns and a simplified permission model for your data.
 - utilize [Databricks Workflows](https://docs.databricks.com/en/workflows/index.html) to execute a DAG and [task parameters](https://docs.databricks.com/en/workflows/jobs/parameter-value-references.html) to share context information between tasks (see [Task Parameters section](#task-parameters)). Yes, you don't need Airflow to manage your DAGs here!!!
 - utilize [Databricks job clusters](https://docs.databricks.com/en/workflows/jobs/use-compute.html#use-databricks-compute-with-your-jobs) to reduce costs.
+- define clusters on AWS and Azure.
 - execute a CI/CD pipeline with [Github Actions](https://docs.github.com/en/actions) after a repo push.
 
 For a debate about the use of notebooks x Python packages, please refer to:
@@ -74,37 +76,35 @@ For a debate about the use of notebooks x Python packages, please refer to:
 
 # Instructions
 
-### 1) install and configure Databricks CLI
+### 1) (optional) create a Databricks Workspace with Terraform
+
+Follow instructions [here](https://github.com/databricks/terraform-databricks-examples)
+
+
+### 2) install and configure Databricks CLI on your local machine
 
 Follow instructions [here](https://docs.databricks.com/en/dev-tools/cli/install.html)
 
 
-### 2) build python env and execute unit tests
+### 3) build python env and execute unit tests on your local machine
 
 make install
 
 You can also execute unit tests from your preferred IDE. Here's a screenshot from [VS Code](https://code.visualstudio.com/) with [Microsoft's Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) installed.
 
 <img src="docs/vscode.png" width="30%" height="30%">
 
-### 3) deploy and execute on dev and prod workspaces.
+### 4) deploy and execute on dev workspace.
 
 Update "job_clusters" properties on wf_template.yml file. There are different properties for AWS and Azure.
 
 make deploy-dev
 
 
-### 4) configure CI/CD automation
+### 5) configure CI/CD automation
 
 Configure [Github Actions repository secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) DATABRICKS_HOST and DATABRICKS_TOKEN.
 
-### 5) enable system tables on Catalog Explorer
-
-python sdk_system_tables.py
-
-
-... and now you can code the transformations for each task and run unit and integration tests.
-
 
 # Task parameters
 
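
The unchanged README bullet above says that the funcy package logs the execution time of each transformation. As an illustration only (the template's transformation classes are not part of this diff, and the class, method and column names below are invented), funcy's log_durations decorator can wrap a PySpark transformation like this:

# Illustrative sketch, not code from this repository.
import logging

from funcy import log_durations
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


class OrdersTransformation:
    # Hypothetical transformation; funcy logs how long the call took.
    @log_durations(logging.info)
    def generate_orders_agg(self, orders: DataFrame) -> DataFrame:
        return orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))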

conf/wf_template.yml

Lines changed: 25 additions & 11 deletions

@@ -17,7 +17,7 @@ resources:
       tasks:
 
         - task_key: extract_source1
-          job_cluster_key: cluster-dev
+          job_cluster_key: cluster-dev-aws
           max_retries: 0
           python_wheel_task:
             package_name: template
@@ -29,7 +29,7 @@ resources:
             - whl: ../dist/*.whl
 
         - task_key: extract_source2
-          job_cluster_key: cluster-dev
+          job_cluster_key: cluster-dev-aws
           max_retries: 0
           python_wheel_task:
             package_name: template
@@ -44,7 +44,7 @@ resources:
           depends_on:
             - task_key: extract_source1
             - task_key: extract_source2
-          job_cluster_key: cluster-dev
+          job_cluster_key: cluster-dev-aws
           max_retries: 0
           python_wheel_task:
             package_name: template
@@ -58,7 +58,7 @@ resources:
         - task_key: generate_orders_agg
           depends_on:
             - task_key: generate_orders
-          job_cluster_key: cluster-dev
+          job_cluster_key: cluster-dev-aws
           max_retries: 0
           python_wheel_task:
             package_name: template
@@ -70,12 +70,26 @@ resources:
             - whl: ../dist/*.whl
 
       job_clusters:
-        - job_cluster_key: cluster-dev
+        # - job_cluster_key: cluster-dev-azure
+        #   new_cluster:
+        #     spark_version: 15.3.x-scala2.12
+        #     node_type_id: Standard_D8as_v5
+        #     num_workers: 1
+        #     azure_attributes:
+        #       first_on_demand: 1
+        #       availability: SPOT_AZURE
+        #     data_security_mode: SINGLE_USER
+
+        - job_cluster_key: cluster-dev-aws
           new_cluster:
-            spark_version: 15.3.x-scala2.12
-            node_type_id: Standard_D8as_v5
-            num_workers: 2
-            azure_attributes:
+            spark_version: 14.2.x-scala2.12
+            node_type_id: c5d.xlarge
+            num_workers: 1
+            aws_attributes:
               first_on_demand: 1
-              availability: SPOT_AZURE
-            data_security_mode: SINGLE_USER
+              availability: SPOT_WITH_FALLBACK
+              zone_id: auto
+              spot_bid_price_percent: 100
+              ebs_volume_count: 0
+              policy_id: 001934F3ABD02D4A
+            data_security_mode: SINGLE_USER
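
Each task above runs as a python_wheel_task from the template package, and the README lists argparse as the command line interface used to start those jobs. The entry point sketched below is purely hypothetical; the flag names and task wiring in the real wheel may differ.

# Hypothetical wheel entry point for a task such as extract_source1.
import argparse


def main() -> None:
    # Invented flags, only to illustrate an argparse-based task CLI.
    parser = argparse.ArgumentParser(description="Run one task of the template job")
    parser.add_argument("--task", required=True, help="task name, e.g. extract_source1")
    parser.add_argument("--env", default="dev", help="target environment: dev, staging or prod")
    args = parser.parse_args()
    print(f"Running task {args.task} in {args.env}")


if __name__ == "__main__":
    main()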

conf/workflow.yml

Lines changed: 0 additions & 75 deletions
This file was deleted.

databricks.yml

Lines changed: 7 additions & 5 deletions

@@ -27,13 +27,15 @@ targets:
     default: true
     workspace:
       profile: dev
+    run_as:
+      user_name: [email protected]
 
   # Optionally, there could be a 'staging' target here.
   # (See Databricks docs on CI/CD at https://docs.databricks.com/dev-tools/bundles/index.html.)
-  #
-  # staging:
-  #   workspace:
-  #     host: https://myworkspace.databricks.com
+
+  staging:
+    workspace:
+      profile: dev
 
   # The 'prod' target, used for production deployment.
   prod:
@@ -49,4 +51,4 @@ targets:
       # This runs as [email protected] in production. Alternatively,
       # a service principal could be used here using service_principal_name
       # (see Databricks documentation).
-      user_name: username@company.com
+      user_name: user.two@domain.com

0 commit comments