
🧱 dbt Pharmaceutical Dataset

πŸ“ Project Folder Structure

dbt-pharmaceutical-project/
├── models/
│   ├── staging/
│   │   ├── stg_hourly.sql
│   │   ├── stg_daily.sql
│   │   ├── stg_weekly.sql
│   │   └── stg_monthly.sql
│   ├── marts/
│   │   └── dim_pharma.sql
│   └── marts.yml
├── seeds/
├── snapshots/
├── tests/
├── macros/
├── dbt_project.yml
├── README.md
└── .gitignore

Key files and folders

  • models/staging/: cleans and standardizes the raw data

  • models/marts/: final business-ready tables (marts)

  • dbt_project.yml: defines the dbt project configuration

  • tests/: custom SQL tests that validate business rules (e.g. no negative totals, unique keys)

  • README.md: explains the project to viewers

This project demonstrates a modular data transformation pipeline using dbt (data build tool) on Snowflake, designed to model and analyze raw pharmaceutical data. It applies staging, transformation, testing, and documentation best practices aligned with the dbt development workflow.


Project goals:

  • Clean and stage the raw pharmaceutical dataset
  • Create reusable models and apply transformations
  • Build a simple data mart for business consumption
  • Add tests for data quality
  • Deploy in a structured dev → prod Snowflake environment

1. Snowflake Setup

I created three schemas:

| Environment | Schema                                      |
|-------------|---------------------------------------------|
| Raw data    | SNOWFLAKE_LEARNING_DB.ROW_DATA              |
| Development | SNOWFLAKE_LEARNING_DB.DBT_DTADELE           |
| Production  | SNOWFLAKE_LEARNING_DB.ANALYTICS_PRODUCTION  |
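
For reference, a minimal sketch of how these schemas could be created in Snowflake, assuming the database already exists and the active role has CREATE SCHEMA privileges:

-- Hypothetical one-time setup; run with a role that owns SNOWFLAKE_LEARNING_DB
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_LEARNING_DB.ROW_DATA;              -- raw landing area
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_LEARNING_DB.DBT_DTADELE;           -- development
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_LEARNING_DB.ANALYTICS_PRODUCTION;  -- production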

2. Source Layer (source.yml)

Defined raw source tables for daily, hourly, monthly, and weekly sales:

version: 2

sources:
  - name: pharma
    description: "{{ doc('drugs_classification_system') }}"
    database: ANALYTICS
    schema: ROW_DATA
    tables:
      - name: stg_daily
        description: 'Raw table loaded from CSV'
        identifier: SALES_DAILY
        config:
          freshness:
            warn_after: {count: 24, period: hour}
            error_after: {count: 1, period: day}
          loaded_at_field: LOADED_AT_TIMESTAMP
      - name: stg_hourly
        description: 'Raw table loaded from CSV'
        identifier: SALES_HOURLY
      - name: stg_monthly
        description: 'Raw table loaded from CSV'
        identifier: SALES_MONTHLY
      - name: stg_weekly
        description: 'Raw table loaded from CSV'
        identifier: SALES_WEEKLY
        columns:
          - name: M01AB
            description: "Drug type"
            tests:
              - not_null
3. Model Layer (model.yml)

Defined tests at the column and model level:

version: 2


models:
  - name: stg_daily
    description: 'Daily sales activity'
    columns:
      - name: DATUM
        tests:
          - not_null
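
These schema tests can be run for a single model during development, for example:

dbt test --select stg_daily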

4. Staging Layer

Standardized and cleaned the data:

Example (stg_hourly.sql):

{{
    config(
        materialized='view'
    )
}}

SELECT *
FROM {{ source('pharma', 'stg_hourly') }}
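
The staging models currently pass rows through unchanged. As the project grows, a staging model would typically rename and type columns explicitly; a hedged sketch for the hourly model, assuming raw columns named DATUM, HOUR, and M01AB (the renamed aliases are illustrative):

{{
    config(
        materialized='view'
    )
}}

SELECT
    DATUM AS sale_date,    -- assumed raw date column
    HOUR  AS sale_hour,    -- hour of day, checked by the custom test in section 6
    M01AB AS m01ab_units   -- one of the ATC drug-class columns
FROM {{ source('pharma', 'stg_hourly') }}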

5. Marts Layer

Built a final dim_pharma table using staged data.

Example (dim_pharma.sql):

{{
    config(
        materialized='table'
    )
}}

-- Referencing every staging model registers it as an upstream
-- dependency in the dbt DAG, even though only the daily data is
-- selected below.
WITH PHARMA_DATA AS (
    SELECT *
    FROM {{ ref('stg_daily') }}
),

HOURLY AS (
    SELECT *
    FROM {{ ref('stg_hourly') }}
),

MONTHLY AS (
    SELECT *
    FROM {{ ref('stg_monthly') }}
),

WEEKLY AS (
    SELECT *
    FROM {{ ref('stg_weekly') }}
)

SELECT *
FROM PHARMA_DATA
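
Because of those ref() calls, the mart and everything upstream of it can be built in one command:

dbt run --select +dim_pharma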

6. Data Quality Tests at the Source and Model Levels

Created data quality tests at both the source and model levels. The model-level tests check properties such as uniqueness and non-null values on my models, while the source-level tests check that raw data ingested from an outside source meets the same expectations.

Here is an example for a data test at the model level:

models:
  - name: stg_daily
    columns:
      - name: DATUM
        tests:
          - not_null
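
The uniqueness checks mentioned above follow the same pattern. For example, assuming DATUM uniquely identifies a daily row, the built-in unique test could sit alongside not_null:

models:
  - name: stg_daily
    columns:
      - name: DATUM
        tests:
          - not_null
          - unique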

I also created a custom SQL test (a singular test) to check the hourly activity column:

-- A singular test fails if this query returns any rows.
WITH HOURLY_CHECK AS (
    SELECT *
    FROM {{ ref('stg_hourly') }}
)

SELECT
    HOUR
FROM HOURLY_CHECK
WHERE HOUR < 0
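
The same rule could also be expressed as a generic test in YAML, assuming the dbt_utils package is installed:

models:
  - name: stg_hourly
    columns:
      - name: HOUR
        tests:
          - dbt_utils.accepted_range:
              min_value: 0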

I created documentation at the model, column, and source levels. Here is an example of my documentation at the source level:

{% docs drugs_classification_system %}

| code     | definition                                                                                                     |
|----------|----------------------------------------------------------------------------------------------------------------|
| M01AB    | Anti-inflammatory and antirheumatic products (non-steroids, acetic acid derivatives and related substances)    |
| M01AE    | Anti-inflammatory and antirheumatic products (non-steroids, propionic acid derivatives)                        |
| N02BA    | Other analgesics and antipyretics (salicylic acid and derivatives)                                             |
| N02BE/B  | Other analgesics and antipyretics (pyrazolones and anilides)                                                   |
| N05B     | Psycholeptics (anxiolytic drugs)                                                                               |
| N05C     | Psycholeptics (hypnotics and sedatives)                                                                        |
| R03      | Drugs for obstructive airway diseases                                                                          |
| R06      | Antihistamines for systemic use                                                                                |
{% enddocs %}
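
These doc blocks feed the rendered documentation site, which can be generated and browsed locally:

dbt docs generate
dbt docs serve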

I used dbt run and dbt build to compile and execute the models in the target Snowflake warehouse. I also configured a production environment in dbt and scheduled jobs to run at specific hours.
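
For CLI runs, a dev → prod split like this is typically handled with targets in profiles.yml. A minimal sketch with placeholder credentials (the profile name and all bracketed values are illustrative, not the project's actual settings, and the profile name must match dbt_project.yml):

pharma:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <user>
      password: <password>
      role: <role>
      database: SNOWFLAKE_LEARNING_DB
      warehouse: <warehouse>
      schema: DBT_DTADELE            # development schema from section 1
      threads: 4
    prod:
      type: snowflake
      account: <account_identifier>
      user: <user>
      password: <password>
      role: <role>
      database: SNOWFLAKE_LEARNING_DB
      warehouse: <warehouse>
      schema: ANALYTICS_PRODUCTION   # production schema from section 1
      threads: 4

With that in place, dbt build --target prod materializes the models into the production schema.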

After execution, I successfully verified that the models were materialized in the ANALYTICS_PRODUCTION schema in Snowflake.

Skills demonstrated:

  • dbt Cloud & CLI

  • SQL transformation logic

  • Snowflake warehouse integration

  • Jinja templating & YAML config

  • Testing & documentation in dbt

  • Data modeling best practices (modular layers)

[Screenshot: dbt DAG (dag2)]
