The Trail Trekker growth team needs reliable, timely insight into how customers change their subscriptions over time. Without a warehouse and scheduled transformations, they could not answer questions such as which plans customers upgrade to, when churn events occur, and what the revenue impact of plan changes is.
Source files live in data/ and are ingested into a persistent DuckDB file trail_trekker.db.
- Tables: `features`, `plans`, `customers`, `plan_features`, `subscriptions`
- Row counts after ingestion: features 8, plans 6, customers 25, plan_features 24, subscriptions 41
- Notable data quality issue: some date fields contained the literal string "NULL" rather than SQL NULL
Schemas are inferred by DuckDB at ingest and validated in staging models.
This project was built incrementally in three layers.
- Ingestion (Python + DuckDB)
  - Persistent database connection using `duckdb.connect('trail_trekker.db')`
  - Idempotent table creation using `CREATE OR REPLACE TABLE`
  - Minimal, clear CLI script for repeatable runs
- Transformation (SQLMesh)
  - Staging views standardize types, names, and audits
  - Dimensional layer with SCD Type 2 dimensions for customers and plans
  - Transaction fact table for subscription change events
- Orchestration (SQLMesh run + cron)
  - Hourly execution configured on the fact model
  - Local cron job invokes `sqlmesh run prod` and writes logs
```text
trail-trekker-demo/
  data/                           # CSV sources
  models/                         # SQLMesh models
    staging_*.sql                 # staging layer views
    dim_customers.sql             # SCD Type 2
    dim_plans.sql                 # SCD Type 2
    dim_time.sql                  # static date dimension
    fct_subscription_changes.sql  # transaction fact
  ingest_data.py                  # DuckDB ingestion via CLI
  config.yml                      # SQLMesh configuration (DuckDB gateway)
  README.md                       # this document
```
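For reference, a sketch of what the `config.yml` gateway configuration might contain. The gateway name `local` is an assumption; the `database: trail_trekker.db` setting is the one this README calls out as required.

```yaml
# Sketch of a SQLMesh config.yml for a local DuckDB gateway
gateways:
  local:                          # assumed gateway name
    connection:
      type: duckdb
      database: trail_trekker.db  # explicit path, not the default db.db

default_gateway: local

model_defaults:
  dialect: duckdb
```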
- DuckDB: fast local analytics engine with excellent CSV parsing and persistence
- SQLMesh: dependency-aware, audit-enabled transformations with environment isolation
- Cron: simple, reliable local scheduling
- `staging.features`, `staging.plans`, `staging.customers`, `staging.subscriptions`
- `dim.customers` (SCD Type 2 with `effective_date`, `expiration_date`, `is_current`, `updated_at`)
- `dim.plans` (SCD Type 2 with tier mapping and pricing)
- `dim.dates` (calendar spine for analysis)
- `fct.subscription_changes` (one row per customer per subscription per plan change event)
- NULL string in dates
  - Problem: DuckDB casting failed on values like "NULL"
  - Root cause: string literal instead of SQL NULL in source CSV
  - Fix in `staging_subscriptions.sql` using CASE logic:
```sql
CASE
  WHEN subscription_end_date IS NULL OR subscription_end_date = 'NULL' THEN NULL
  ELSE subscription_end_date::timestamp
END AS subscription_ended_at,
CASE
  WHEN next_billing_date IS NULL OR next_billing_date = 'NULL' THEN NULL
  ELSE next_billing_date::timestamp
END AS next_billing_at
```

- SQLMesh configuration did not see models
  - Cause: file named `config.yaml` instead of `config.yml`, and defaulting to a separate `db.db`
  - Resolution: rename to `config.yml` and explicitly set `database: trail_trekker.db`
- SCD Type 2 requirements in SQLMesh
  - SQLMesh required an `updated_at` column for change detection
  - Resolution: add `updated_at` to SCD2 selects and configure `valid_from_name` and `valid_to_name`
- Date dimension scope
  - Strategy: generate a date spine covering the data window plus a buffer
  - Final choice: 2022-08-01 through 2025-10-31 to cover subscriptions and near-term planning
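Putting the SCD Type 2 pieces together, the `dim.customers` model header might look roughly like this. This is a sketch based on SQLMesh's `SCD_TYPE_2_BY_TIME` model kind; the `unique_key` choice is an assumption, while the `updated_at`, `effective_date`, and `expiration_date` names come from this README.

```sql
MODEL (
  name dim.customers,
  kind SCD_TYPE_2_BY_TIME (
    unique_key customer_id,          -- assumed natural key
    updated_at_name updated_at,      -- column SQLMesh watches for changes
    valid_from_name effective_date,  -- maps SQLMesh's validity columns
    valid_to_name expiration_date    -- onto the names used in dim.customers
  )
);
```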
- `dim.customers` and `dim.plans` use SCD Type 2 to preserve history of attributes that may change
- `dim.dates` provides a complete day-level calendar for joins and trend analysis
- Grain: one row per customer per subscription per change
- Change detection with window functions (simplified for learning):

```sql
LAG(plan_id) OVER (
  PARTITION BY customer_id
  ORDER BY subscription_started_at
) AS previous_plan_id
```

- Primary metrics: `plan_change_amount`, `days_on_previous_plan`, boolean flags for upgrade, downgrade, cancellation
- 13 plan changes detected
- Most common transitions:
- Basic Monthly to Pro Annual: 5
- Basic Monthly to Premium Monthly: 3
- Premium Monthly to Basic Monthly: 3
These outputs directly support the growth team in targeting upgrade opportunities and addressing downgrade risks.
Add a schedule to the fact model:
```sql
MODEL (
  name fct.subscription_changes,
  cron '@hourly'
);
```

Local cron job to run hourly:

```shell
0 * * * * cd /Users/satkarkarki/Desktop/portfolio/trail-trekker-demo && \
  source .venv/bin/activate && \
  sqlmesh run prod >> ~/trail_trekker_cron.log 2>&1
```

- Create and activate a virtualenv, then install SQLMesh
```shell
python -m venv .venv && source .venv/bin/activate
pip install sqlmesh
```

- Ingest CSVs into DuckDB

```shell
python3 ingest_data.py
```

- Plan and apply in a dev environment

```shell
sqlmesh plan dev
```

- Query models

```shell
sqlmesh fetchdf "SELECT * FROM fct__dev.subscription_changes LIMIT 5"
```

- Guard against string representations of NULL when casting timestamps
- Prefer idempotent ingestion with `CREATE OR REPLACE` to support repeatable runs
- Use SCD Type 2 for entities where attributes change over time
- Define a calendar spine explicitly and scope it to the observed data window plus buffer
- Favor simple, testable steps with tight feedback loops
- Add weekly periodic snapshot and lifecycle accumulating snapshot facts
- Introduce incremental materialization for large datasets
- Add additional audits and referential integrity checks
- Connect to a BI tool for dashboards
I am an analytics engineer focused on clear documentation, pragmatic modeling, and reliable pipelines. This project demonstrates hands-on skills in DuckDB, SQLMesh, dimensional modeling, data quality, and orchestration. If you are reviewing this as a hiring manager or collaborator, the repository is organized for reproducibility and clarity, and the README documents tradeoffs and reasoning so the work can be maintained by any future owner.