```
dbt-pharmaceutical-project/
├── models/
│   ├── staging/
│   │   ├── stg_hourly.sql
│   │   ├── stg_daily.sql
│   │   ├── stg_weekly.sql
│   │   └── stg_monthly.sql
│   ├── marts/
│   │   └── dim_pharma.sql
│   └── marts.yml
├── seeds/
├── snapshots/
├── tests/
├── macros/
├── dbt_project.yml
├── README.md
└── .gitignore
```
- Project Explanation
- Overview
- Objectives
- Project Structure
- Documentation
- Final Outcome
- Skills Demonstrated
- The DAG
- models/staging/: cleans and standardizes raw data
- models/marts/: final business-ready tables (marts)
- dbt_project.yml: defines dbt config (a minimal sketch follows this list)
- tests/: contains custom SQL tests for validating business rules (e.g. no negative totals, unique keys)
- README.md: explains the project to viewers
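For reference, a dbt_project.yml for a layout like this usually pins the project name, the connection profile, and per-folder materializations. The sketch below is hypothetical: the project name, profile name, and materialization settings are assumptions inferred from the examples later in this README, not the contents of the real file.

```yaml
# Hypothetical dbt_project.yml sketch -- names here are assumptions, not the actual file
name: 'dbt_pharmaceutical_project'
version: '1.0.0'
profile: 'snowflake_pharma'        # assumed profile name

model-paths: ["models"]
seed-paths: ["seeds"]
snapshot-paths: ["snapshots"]
test-paths: ["tests"]
macro-paths: ["macros"]

models:
  dbt_pharmaceutical_project:
    staging:
      +materialized: view          # matches the staging example below
    marts:
      +materialized: table         # matches the dim_pharma example below
```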
This project demonstrates a modular data transformation pipeline using dbt (data build tool) on Snowflake, designed to model and analyze raw pharmaceutical data. It applies staging, transformation, testing, and documentation best practices aligned with the dbt development workflow.
- Clean and stage raw pharmaceutical dataset
- Create reusable models and apply transformations
- Build a simple data mart for business consumption
- Add tests for data quality
- Deploy in a structured dev → prod Snowflake environment
I created three schemas:
| Environment | Schema |
|---|---|
| Raw data | SNOWFLAKE_LEARNING_DB.ROW_DATA |
| Development | SNOWFLAKE_LEARNING_DB.DBT_DTADELE |
| Production | SNOWFLAKE_LEARNING_DB.ANALYTICS_PRODUCTION |
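With dbt Core, a dev → prod split like the one above is typically encoded as two targets in profiles.yml (in dbt Cloud the equivalent settings live in the connection and environment configuration). The following is only a hypothetical sketch; the profile name, role, warehouse, and credentials are placeholders, not the project's real configuration.

```yaml
# Hypothetical profiles.yml sketch -- all names and credentials are placeholders
snowflake_pharma:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <username>
      password: <password>
      role: <role>
      warehouse: <warehouse>
      database: SNOWFLAKE_LEARNING_DB
      schema: DBT_DTADELE
      threads: 4
    prod:
      type: snowflake
      account: <account_identifier>
      user: <username>
      password: <password>
      role: <role>
      warehouse: <warehouse>
      database: SNOWFLAKE_LEARNING_DB
      schema: ANALYTICS_PRODUCTION
      threads: 4
```

With a setup like this, running dbt run --target prod builds into the ANALYTICS_PRODUCTION schema instead of the development schema.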
Defined raw source tables for daily, hourly, monthly, and weekly sales:
```yaml
version: 2

sources:
  - name: pharma
    description: "{{ doc('drugs_classification_system') }}"
    database: ANALYTICS
    schema: ROW_DATA
    tables:
      - name: stg_daily
        description: 'Raw table loaded from CSV'
        identifier: SALES_DAILY
        config:
          freshness:
            warn_after: {count: 24, period: hour}
            error_after: {count: 1, period: day}
          loaded_at_field: LOADED_AT_TIMESTAMP
      - name: stg_hourly
        description: 'Raw table loaded from CSV'
        identifier: SALES_HOURLY
      - name: stg_monthly
        description: 'Raw table loaded from CSV'
        identifier: SALES_MONTHLY
      - name: stg_weekly
        description: 'Raw table loaded from CSV'
        identifier: SALES_WEEKLY
        columns:
          - name: M01AB
            description: "Drug type"
            tests:
              - not_null
```
Defined tests at both the column and model level:
```yaml
version: 2

models:
  - name: stg_daily
    description: 'Daily sales activity'
    columns:
      - name: DATUM
        tests:
          - not_null
```
Standardized and cleaned the data:
Example (stg_hourly.sql):
```sql
{{
  config(
    materialized='view'
  )
}}

SELECT *
FROM {{ source('pharma', 'stg_hourly') }}
```
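A pass-through view like this keeps the staging layer thin. If more standardization were needed (the README mentions cleaning), the same model could also rename and cast columns. The variant below is purely illustrative; apart from DATUM and the ATC drug-code columns that appear elsewhere in this README, the names and types are assumptions.

```sql
-- Hypothetical variant of stg_hourly.sql: cast and rename raw columns.
-- The target types are assumptions about the raw CSV, not confirmed by the project.
{{
  config(
    materialized='view'
  )
}}

SELECT
    CAST(DATUM AS TIMESTAMP)      AS sold_at,
    CAST(M01AB AS NUMBER(10, 2))  AS m01ab_qty,
    CAST(M01AE AS NUMBER(10, 2))  AS m01ae_qty,
    CAST(N02BA AS NUMBER(10, 2))  AS n02ba_qty
FROM {{ source('pharma', 'stg_hourly') }}
```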
Built a final dim_pharma table using staged data.
Example (dim_pharma.sql):
```sql
{{
  config(
    materialized='table'
  )
}}

WITH PHARMA_DATA AS (
    SELECT *
    FROM {{ ref('stg_daily') }}
),

HOURLY AS (
    SELECT *
    FROM {{ ref('stg_hourly') }}
),

MONTHLY AS (
    SELECT *
    FROM {{ ref('stg_monthly') }}
),

WEEKLY AS (
    SELECT *
    FROM {{ ref('stg_weekly') }}
)

SELECT *
FROM PHARMA_DATA
```
Created data quality tests at both the source and model level. The model-level checks validate rules such as uniqueness and non-null values on my models. The source-level tests check that the raw data ingested from the outside source fits our data test logic.
Here is an example of a data test at the model level:
```yaml
models:
  - name: stg_daily
    columns:
      - name: DATUM
        tests:
          - not_null
```
I also created a custom SQL test with business logic to check the hourly activity column:
```sql
WITH HOURLY_CHECK AS (
    SELECT *
    FROM {{ ref('stg_hourly') }}
)

SELECT HOUR
FROM HOURLY_CHECK
WHERE HOUR < 0
```
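A singular test like this lives in the tests/ directory and fails whenever it returns rows. If the same no-negative rule were needed on several columns, it could also be written as a custom generic test; the sketch below is a hypothetical alternative, not something that exists in this project.

```sql
-- Hypothetical generic test (e.g. macros/test_non_negative.sql), not part of the repo
{% test non_negative(model, column_name) %}

SELECT {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
```

It could then be attached to any column in a schema file by listing non_negative under that column's tests.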
I created documentation at the model, column, and source level. Here is an example of my documentation at the source level:
{% docs drugs_classification_system %}
| Code | Definition |
|------|------------|
| M01AB | Anti-inflammatory and antirheumatic products (non-steroids, Acetic acid derivatives and related substances) |
| M01AE | Anti-inflammatory and antirheumatic products (non-steroids, Propionic acid derivatives) |
| N02BA | Other analgesics and antipyretics (Salicylic acid and derivatives) |
| N02BE/B | Other analgesics and antipyretics (Pyrazolones and Anilides) |
| N05B | Psycholeptics (Anxiolytic drugs) |
| N05C | Psycholeptics (Hypnotics and sedatives drugs) |
| R03 | Drugs for obstructive airway diseases |
| R06 | Antihistamines for systemic use |
{% enddocs %}
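Only the source-level example is shown above; a column-level description can reuse the same doc block. The snippet below is illustrative, not the project's actual schema file:

```yaml
# Hypothetical column-level documentation reusing the doc block above
models:
  - name: stg_daily
    description: 'Daily sales activity'
    columns:
      - name: M01AB
        description: "{{ doc('drugs_classification_system') }}"
```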
I used dbt run and dbt build to compile and execute the models against the target Snowflake warehouse. I also configured a production environment in dbt and scheduled jobs to run at specific hours.
After execution, I successfully verified that the models were materialized in the ANALYTICS_PRODUCTION schema in Snowflake.
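One way to make that verification repeatable is to query Snowflake's information schema. This is a hypothetical check, not necessarily how it was done in the project:

```sql
-- Hypothetical verification query: list the objects dbt materialized in production
SELECT TABLE_NAME, TABLE_TYPE
FROM SNOWFLAKE_LEARNING_DB.INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'ANALYTICS_PRODUCTION';
```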
- dbt Cloud & CLI
- SQL transformation logic
- Snowflake warehouse integration
- Jinja templating & YAML config
- Testing & documentation in dbt
- Data modeling best practices (modular layers)
9. The DAG