
🧱 dbt Pharmaceutical Dataset

πŸ“ Project Folder Structure

dbt-pharmaceutical-project/
├── models/
│   ├── staging/
│   │   ├── stg_hourly.sql
│   │   ├── stg_daily.sql
│   │   ├── stg_weekly.sql
│   │   └── stg_monthly.sql
│   ├── marts/
│   │   └── dim_pharma.sql
│   └── marts.yml
├── seeds/
├── snapshots/
├── tests/
├── macros/
├── dbt_project.yml
├── README.md
└── .gitignore

Key files and folders

  • models/staging/: cleans and standardizes the raw data

  • models/marts/: final business-ready tables (marts)

  • dbt_project.yml: defines the dbt project configuration

  • tests/: custom SQL tests that validate business rules (e.g. no negative totals, unique keys)

  • README.md: explains the project to viewers

This project demonstrates a modular data transformation pipeline using dbt (data build tool) on Snowflake, designed to model and analyze raw pharmaceutical data. It applies staging, transformation, testing, and documentation best practices aligned with the dbt development workflow.


Project goals:

  • Clean and stage the raw pharmaceutical dataset
  • Create reusable models and apply transformations
  • Build a simple data mart for business consumption
  • Add tests for data quality
  • Deploy in a structured dev → prod Snowflake environment

1. Snowflake Setup

I created three schemas:

| Environment | Schema                                      |
|-------------|---------------------------------------------|
| Raw data    | SNOWFLAKE_LEARNING_DB.ROW_DATA              |
| Development | SNOWFLAKE_LEARNING_DB.DBT_DTADELE           |
| Production  | SNOWFLAKE_LEARNING_DB.ANALYTICS_PRODUCTION  |
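
For reference, a minimal sketch of how these schemas could be created in Snowflake, assuming the database already exists and the active role has CREATE SCHEMA privileges:

-- Hypothetical one-time setup; run with a role that owns SNOWFLAKE_LEARNING_DB
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_LEARNING_DB.ROW_DATA;              -- raw landing area
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_LEARNING_DB.DBT_DTADELE;           -- development
CREATE SCHEMA IF NOT EXISTS SNOWFLAKE_LEARNING_DB.ANALYTICS_PRODUCTION;  -- production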

2. Source Layer (source.yml)

Defined raw source tables for daily, hourly, monthly, and weekly sales:

version: 2

sources:
  - name: pharma
    description: "{{ doc('drugs_classification_system') }}"
    database: ANALYTICS
    schema: ROW_DATA
    tables:
      - name: stg_daily
        description: 'Raw table loaded from CSV'
        identifier: SALES_DAILY
        config:
          freshness:
            warn_after: {count: 24, period: hour}
            error_after: {count: 1, period: day}
          loaded_at_field: LOADED_AT_TIMESTAMP
      - name: stg_hourly
        description: 'Raw table loaded from CSV'
        identifier: SALES_HOURLY
      - name: stg_monthly
        description: 'Raw table loaded from CSV'
        identifier: SALES_MONTHLY
      - name: stg_weekly
        description: 'Raw table loaded from CSV'
        identifier: SALES_WEEKLY
        columns:
          - name: M01AB
            description: "Drug type"
            tests:
              - not_null
3. Model Layer (model.yml)

Defined tests at the column and model level:

version: 2


models:
  - name: stg_daily
    description: 'Daily sales activity'
    columns:
      - name: DATUM
        tests:
          - not_null
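
These schema tests can be run for a single model during development, for example:

dbt test --select stg_daily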

4. Staging Layer

Standardized and cleaned the data:

Example (stg_hourly.sql):

{{
    config(
        materialized='view'
    )
}}

SELECT *
FROM {{ source('pharma', 'stg_hourly') }}
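
The staging models currently pass rows through unchanged. As the project grows, a staging model would typically rename and type columns explicitly; a hedged sketch for the hourly model, assuming raw columns named DATUM, HOUR, and M01AB (the renamed aliases are illustrative):

{{
    config(
        materialized='view'
    )
}}

SELECT
    DATUM AS sale_date,    -- assumed raw date column
    HOUR  AS sale_hour,    -- hour of day, checked by the custom test in section 6
    M01AB AS m01ab_units   -- one of the ATC drug-class columns
FROM {{ source('pharma', 'stg_hourly') }}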

5. Marts Layer

Built a final dim_pharma table using staged data.

Example (dim_pharma.sql):

{{
    config(
        materialized='table'
    )
}}

-- Referencing every staging model registers it as an upstream
-- dependency in the dbt DAG, even though only the daily data is
-- selected below.
WITH PHARMA_DATA AS (
    SELECT *
    FROM {{ ref('stg_daily') }}
),

HOURLY AS (
    SELECT *
    FROM {{ ref('stg_hourly') }}
),

MONTHLY AS (
    SELECT *
    FROM {{ ref('stg_monthly') }}
),

WEEKLY AS (
    SELECT *
    FROM {{ ref('stg_weekly') }}
)

SELECT *
FROM PHARMA_DATA
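
Because of those ref() calls, the mart and everything upstream of it can be built in one command:

dbt run --select +dim_pharma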

6. Data Quality Tests at the Source and Model Levels

Created data quality tests at both the source and model levels. The model-level tests check properties such as uniqueness and non-null values on my models, while the source-level tests check that raw data ingested from an outside source meets the same expectations.

Here is an example for a data test at the model level:

models:
  - name: stg_daily
    columns:
      - name: DATUM
        tests:
          - not_null
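
The uniqueness checks mentioned above follow the same pattern. For example, assuming DATUM uniquely identifies a daily row, the built-in unique test could sit alongside not_null:

models:
  - name: stg_daily
    columns:
      - name: DATUM
        tests:
          - not_null
          - unique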

I also created a custom SQL test (a singular test) to check the hourly activity column:

-- A singular test fails if this query returns any rows.
WITH HOURLY_CHECK AS (
    SELECT *
    FROM {{ ref('stg_hourly') }}
)

SELECT
    HOUR
FROM HOURLY_CHECK
WHERE HOUR < 0
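
The same rule could also be expressed as a generic test in YAML, assuming the dbt_utils package is installed:

models:
  - name: stg_hourly
    columns:
      - name: HOUR
        tests:
          - dbt_utils.accepted_range:
              min_value: 0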

I created documentation at the model, column, and source levels. Here is an example of my documentation at the source level:

{% docs drugs_classification_system %}

| code     | definition                                                                                                     |
|----------|----------------------------------------------------------------------------------------------------------------|
| M01AB    | Anti-inflammatory and antirheumatic products (non-steroids, acetic acid derivatives and related substances)    |
| M01AE    | Anti-inflammatory and antirheumatic products (non-steroids, propionic acid derivatives)                        |
| N02BA    | Other analgesics and antipyretics (salicylic acid and derivatives)                                             |
| N02BE/B  | Other analgesics and antipyretics (pyrazolones and anilides)                                                   |
| N05B     | Psycholeptics (anxiolytic drugs)                                                                               |
| N05C     | Psycholeptics (hypnotics and sedatives)                                                                        |
| R03      | Drugs for obstructive airway diseases                                                                          |
| R06      | Antihistamines for systemic use                                                                                |
{% enddocs %}
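
These doc blocks feed the rendered documentation site, which can be generated and browsed locally:

dbt docs generate
dbt docs serve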

I used dbt run and dbt build to compile and execute the models in the target Snowflake warehouse. I also configured a production environment in dbt and scheduled jobs to run at specific hours.
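
For CLI runs, a dev → prod split like this is typically handled with targets in profiles.yml. A minimal sketch with placeholder credentials (the profile name and all bracketed values are illustrative, not the project's actual settings, and the profile name must match dbt_project.yml):

pharma:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <user>
      password: <password>
      role: <role>
      database: SNOWFLAKE_LEARNING_DB
      warehouse: <warehouse>
      schema: DBT_DTADELE            # development schema from section 1
      threads: 4
    prod:
      type: snowflake
      account: <account_identifier>
      user: <user>
      password: <password>
      role: <role>
      database: SNOWFLAKE_LEARNING_DB
      warehouse: <warehouse>
      schema: ANALYTICS_PRODUCTION   # production schema from section 1
      threads: 4

With that in place, dbt build --target prod materializes the models into the production schema.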

After execution, I successfully verified that the models were materialized in the ANALYTICS_PRODUCTION schema in Snowflake.

Skills demonstrated:

  • dbt Cloud & CLI

  • SQL transformation logic

  • Snowflake warehouse integration

  • Jinja templating & YAML config

  • Testing & documentation in dbt

  • Data modeling best practices (modular layers)

[Screenshot: dbt DAG (dag2)]
