# Databricks project template with Asset Bundles, Python packaging, and CI/CD automation
This project template provides a structured approach to enhance your productivity when delivering ETL pipelines on Databricks. Feel free to customize it based on your project's specific nuances and the audience you are targeting.

This project template demonstrates how to:

- structure your PySpark code inside classes/packages (see the first sketch below).
- package your code and move it across different environments (dev, staging, prod) via a CI/CD pipeline.
- configure your workflow to run in different environments with environment-specific configurations using the [jinja package](https://pypi.org/project/jinja2/) (sketch below).
- configure your workflow to selectively run tasks, preventing collisions and interference between developers working in parallel.
- use a [medallion architecture](https://www.databricks.com/glossary/medallion-architecture) pattern, improving data quality as the data goes through successive layers of refinement.
- use a Makefile to automate repetitive tasks in the local environment.
- lint and format the code with [ruff](https://docs.astral.sh/ruff/) and [pre-commit](https://pre-commit.com/).
- utilize [pytest package](https://pypi.org/project/pytest/) to run unit tests on transformations (see the first sketch below).
- utilize [argparse package](https://pypi.org/project/argparse/) to build a flexible command line interface to start your jobs (sketch below).
- utilize [funcy package](https://pypi.org/project/funcy/) to log the execution time of each transformation (sketch below).
- utilize [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html) and [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) to package/deploy/run a Python wheel package on Databricks.
- utilize [Databricks SDK for Python](https://docs.databricks.com/en/dev-tools/sdk-python.html) to manage workspaces and accounts. The included sdk_system_tables.py script enables your metastore system tables, which hold [relevant data about billing, usage, lineage, prices, and access](https://www.youtube.com/watch?v=LcRWHzk8Wm4).
- utilize [Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) instead of Hive as your data catalog and get, for free, data lineage for your tables and columns plus a simplified permission model for your data.
- utilize [Databricks Workflows](https://docs.databricks.com/en/workflows/index.html) to execute a DAG and [task parameters](https://docs.databricks.com/en/workflows/jobs/parameter-value-references.html) to share context information between tasks (see the [Task Parameters section](#task-parameters) and the sketch below). Yes, you don't need Airflow to manage your DAGs here!
- utilize [Databricks job clusters](https://docs.databricks.com/en/workflows/jobs/use-compute.html#use-databricks-compute-with-your-jobs) to reduce costs.
- define clusters on AWS and Azure.
- execute a CI/CD pipeline with [GitHub Actions](https://docs.github.com/en/actions) after a repo push.

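
The sketches that follow illustrate a few of the items above; every class, file, column, and parameter name in them is a hypothetical stand-in rather than the template's actual code. First, a transformation packaged in a class, together with a pytest unit test that runs it on a local SparkSession:

```python
# Hypothetical sketch: a transformation packaged in a class, plus a pytest unit
# test that exercises it on a local SparkSession. Names are illustrative only.
import pytest
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession


class BronzeToSilver:
    """Stand-in for one of the project's transformation classes."""

    @staticmethod
    def transform(df: DataFrame) -> DataFrame:
        # Pure DataFrame-in / DataFrame-out logic is straightforward to unit test.
        return df.dropDuplicates(["id"]).withColumn("ingested_at", F.current_timestamp())


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_bronze_to_silver_deduplicates(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "value"])
    result = BronzeToSilver.transform(df)
    assert result.count() == 2
    assert "ingested_at" in result.columns
```
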
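Next, a minimal sketch of rendering environment-specific settings with jinja2; the template string and values here are invented, while the project itself templates its workflow definition (wf_template.yml):

```python
# Hypothetical sketch: render environment-specific settings with jinja2.
from jinja2 import Template

workflow_template = Template(
    "catalog: {{ catalog }}\n"
    "max_workers: {{ max_workers }}\n"
)

settings = {
    "dev": {"catalog": "dev_catalog", "max_workers": 1},
    "prod": {"catalog": "prod_catalog", "max_workers": 8},
}

env = "dev"  # typically passed in by the CLI or the CI/CD pipeline
print(workflow_template.render(**settings[env]))
```
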
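A minimal argparse sketch for a job entry point; the flags shown are illustrative, not the template's actual interface:

```python
# Hypothetical sketch: a command line entry point for jobs built with argparse.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one ETL task")
    parser.add_argument("--env", choices=["dev", "staging", "prod"], default="dev")
    parser.add_argument("--task", required=True, help="task name, e.g. bronze_to_silver")
    args = parser.parse_args()
    # Dispatch to the real transformation here.
    print(f"Running task '{args.task}' in environment '{args.env}'")


if __name__ == "__main__":
    main()
```
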
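A minimal sketch of logging a transformation's execution time with funcy's log_durations decorator:

```python
# Hypothetical sketch: log how long a transformation takes with funcy.
import logging

from funcy import log_durations

logging.basicConfig(level=logging.INFO)


@log_durations(logging.info)
def double_values(rows):
    # Stand-in for a real transformation.
    return [r * 2 for r in rows]


double_values(range(1_000_000))  # logs the elapsed time of the call
```
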
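And a minimal sketch of sharing context between Workflows tasks through task values; it assumes the code runs on a Databricks cluster (where dbutils is available), and the task key and values are invented:

```python
# Hypothetical sketch: share context between Workflows tasks via task values.
# Assumes a Databricks runtime, where dbutils is available.
from databricks.sdk.runtime import dbutils

# In an upstream task: publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="run_date", value="2024-01-31")

# In a downstream task: read it back (debugValue is used outside of a job run).
run_date = dbutils.jobs.taskValues.get(
    taskKey="ingest_task", key="run_date", debugValue="2024-01-31"
)
print(run_date)
```
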
For a debate about the use of notebooks vs. Python packages, please refer to:
# Instructions
### 1) (optional) create a Databricks Workspace with Terraform
### 3) build Python env and execute unit tests on your local machine
make install

You can also execute unit tests from your preferred IDE. Here's a screenshot from [VS Code](https://code.visualstudio.com/) with [Microsoft's Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python) installed.

<img src="docs/vscode.png" width="30%" height="30%">
### 4) deploy and execute on dev workspace.
Update "job_clusters" properties on wf_template.yml file. There are different properties for AWS and Azure.
93
100
94
101
make deploy-dev
### 5) configure CI/CD automation
Configure the [GitHub Actions repository secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) DATABRICKS_HOST and DATABRICKS_TOKEN.
### 6) enable system tables on Catalog Explorer

python sdk_system_tables.py

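
For reference, a hedged sketch of what a script along these lines can do with the Databricks SDK for Python: enable system schemas on the current metastore. The schema names are only an illustrative subset, and the real sdk_system_tables.py may differ.

```python
# Hypothetical sketch: enable system schemas on the current metastore with the
# Databricks SDK for Python. The real sdk_system_tables.py may differ.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # authenticates via DATABRICKS_HOST / DATABRICKS_TOKEN, etc.
metastore_id = w.metastores.current().metastore_id

for schema in ("access", "billing", "lineage"):  # illustrative subset
    try:
        w.system_schemas.enable(metastore_id=metastore_id, schema_name=schema)
        print(f"enabled system schema: {schema}")
    except Exception as err:  # e.g. the schema is already enabled
        print(f"skipping {schema}: {err}")
```
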
... and now you can code the transformations for each task and run unit and integration tests.