A scalable, end-to-end retail data processing and analytics system built for the HCLTech Hackathon.
This project demonstrates the complete lifecycle of retail data — from ingestion and quality validation to analytics, customer segmentation, and automated notifications — using clean data engineering and analytics best practices.
---

## 🧭 Problem Statement

The hackathon problem focuses on designing a robust retail data platform that can:
- Ingest high-volume transactional data
- Enforce strong data quality rules
- Generate actionable business insights
- Enable customer-centric analytics and engagement
The solution is structured around 6 real-world retail use cases, each mapped to a clean, modular pipeline.
---

## 🎯 Objectives

- Build a reliable data ingestion and validation pipeline
- Generate meaningful, high-quality synthetic datasets suitable for ML training
- Enable promotion effectiveness analysis
- Implement loyalty point calculation and notifications
- Perform customer segmentation using RFM analysis
- Analyze inventory impact on sales performance
- Maintain data quality, traceability, and scalability
---

## 📦 Use Cases

### 1. Data Ingestion & Validation

- Ingest sales header and line item data from CSV
- Load raw data without validation
- Apply business data quality rules
- Route clean data to staging and bad data to quarantine
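
A minimal sketch of the raw-load step, assuming pandas + SQLAlchemy; the file paths, `raw` schema name, and connection string are illustrative assumptions, not project settings. The quality routing itself is sketched under Data Quality Rules below.

```python
# Illustrative raw-load sketch; paths, schema name, and connection URL are assumptions.
import pandas as pd
from sqlalchemy import create_engine

# Assumed local PostgreSQL instance; swap the URL for your environment.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/retail")

def load_raw(csv_path: str, table: str) -> int:
    """Load a CSV into the raw schema as-is, with no validation applied."""
    df = pd.read_csv(csv_path)
    df.to_sql(table, engine, schema="raw", if_exists="append", index=False)
    return len(df)

if __name__ == "__main__":
    n_headers = load_raw("data/raw/sales_header.csv", "sales_header")
    n_lines = load_raw("data/raw/sales_line_item.csv", "sales_line_item")
    print(f"Ingested {n_headers} header rows and {n_lines} line-item rows into raw schema")
```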
### 2. Promotion Effectiveness Analysis

- Compare promoted vs non-promoted sales
- Measure sales lift by promotion and category
- Rank top-performing promotions
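
A pandas sketch of the lift calculation, assuming illustrative column names (`promo_id`, `category`, `line_item_amount`):

```python
# Promotion lift sketch; column names are illustrative assumptions.
import pandas as pd

def promotion_lift(sales: pd.DataFrame) -> pd.DataFrame:
    """Average sale value of promoted vs. non-promoted lines, per product category."""
    sales = sales.copy()
    sales["promoted"] = sales["promo_id"].notna()
    avg = (
        sales.groupby(["category", "promoted"])["line_item_amount"]
        .mean()
        .unstack("promoted")
        .rename(columns={False: "base_avg", True: "promo_avg"})
    )
    # Lift = relative increase of the promoted average over the non-promoted baseline.
    avg["lift_pct"] = (avg["promo_avg"] - avg["base_avg"]) / avg["base_avg"] * 100
    return avg.sort_values("lift_pct", ascending=False)
```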
### 3. Loyalty Point Calculation

- Apply rule-based loyalty logic per transaction
- Accrue and update customer loyalty points
- Maintain transaction-level traceability
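
A sketch of the accrual logic, assuming a simple 1-point-per-100-spent rule and illustrative column names; the actual rules would come from the Loyalty Rules entity:

```python
# Loyalty accrual sketch; the 1-point-per-100 rule and column names are assumptions.
import pandas as pd

POINTS_PER_BLOCK = 1     # points earned ...
BLOCK_AMOUNT = 100.0     # ... per this much spend (assumed rule)

def accrue_points(headers: pd.DataFrame) -> pd.DataFrame:
    """Build a transaction-level ledger, keeping transaction_id so every point is traceable."""
    ledger = headers[["transaction_id", "customer_id", "total_amount"]].copy()
    ledger["points_earned"] = (ledger["total_amount"] // BLOCK_AMOUNT).astype(int) * POINTS_PER_BLOCK
    return ledger

def customer_balances(ledger: pd.DataFrame) -> pd.Series:
    """Roll the ledger up into an accrued balance per customer."""
    return ledger.groupby("customer_id")["points_earned"].sum()
```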
### 4. Customer Segmentation (RFM)

- Compute RFM (Recency, Frequency, Monetary) metrics
- Segment customers into:
  - High-Spenders
  - At-Risk Customers
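
A pandas sketch of the RFM computation, assuming header-level columns (`customer_id`, `transaction_id`, `transaction_date`, `total_amount`) and illustrative quantile cut-offs for the two segments:

```python
# RFM sketch; quantile cut-offs and segment labels are illustrative, not project rules.
import pandas as pd

def rfm_segments(headers: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Recency/Frequency/Monetary per customer, plus two simple segment tags."""
    headers = headers.assign(transaction_date=pd.to_datetime(headers["transaction_date"]))
    rfm = headers.groupby("customer_id").agg(
        recency_days=("transaction_date", lambda d: (as_of - d.max()).days),
        frequency=("transaction_id", "nunique"),
        monetary=("total_amount", "sum"),
    )
    rfm["segment"] = "Regular"
    rfm.loc[rfm["monetary"] >= rfm["monetary"].quantile(0.75), "segment"] = "High-Spender"
    # At-Risk takes priority: a big spender who stopped buying still needs re-engagement.
    rfm.loc[rfm["recency_days"] >= rfm["recency_days"].quantile(0.75), "segment"] = "At-Risk"
    return rfm
```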
### 5. Loyalty Notifications

- Generate personalized loyalty emails
- Simulate email delivery via logs or tables
- Close the customer engagement loop
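
A sketch of simulated delivery via Python's `logging` module; the template text and log destination are illustrative choices:

```python
# Simulated email delivery; the message template and log sink are assumptions.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("notifications")

TEMPLATE = (
    "Hi {name}, you have {points} loyalty points. "
    "You are in our {segment} segment. Here is an offer picked for you!"
)

def send_loyalty_email(name: str, points: int, segment: str) -> None:
    """'Send' an email by writing it to the log instead of an SMTP gateway."""
    body = TEMPLATE.format(name=name, points=points, segment=segment)
    log.info("EMAIL to %s: %s", name, body)

send_loyalty_email("Asha", 420, "High-Spender")
```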
### 6. Inventory Impact on Sales

- Identify top-selling products
- Analyze stock-out impact
- Estimate potential lost sales
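
A sketch of a simple lost-sales proxy (average in-stock daily sales times days out of stock); both the table layouts and the proxy itself are assumptions:

```python
# Lost-sales estimate sketch; input layouts and the averaging rule are assumptions.
import pandas as pd

def estimate_lost_sales(daily_sales: pd.DataFrame, stockouts: pd.DataFrame) -> pd.DataFrame:
    """
    daily_sales: product_id, sale_date, units_sold for days the product was in stock.
    stockouts:   product_id, stockout_days (days with zero on-hand inventory).
    """
    avg_daily = (
        daily_sales.groupby("product_id", as_index=False)["units_sold"]
        .mean()
        .rename(columns={"units_sold": "avg_daily_units"})
    )
    impact = stockouts.merge(avg_daily, on="product_id", how="left")
    # Simple proxy: lost units = average in-stock daily sales x days out of stock.
    impact["estimated_lost_units"] = impact["avg_daily_units"] * impact["stockout_days"]
    return impact.sort_values("estimated_lost_units", ascending=False)
```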
---

## 🏗️ Architecture

```text
CSV Files
    ↓
Raw Schema (No Validation)
    ↓
Data Quality Rules Engine
    ├── Clean Data → Staging Schema
    └── Bad Data  → Quarantine Schema
    ↓
Analytics & Business Logic Layer
    ├── Promotion Analysis
    ├── Loyalty Engine
    ├── Customer Segmentation
    └── Inventory Correlation
    ↓
Reporting / Dashboards / Notifications
```
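
A hypothetical orchestration script mirroring this flow; the module and function names are placeholders for whatever each pipeline package actually exposes, not the project's real API:

```python
# Orchestration sketch; every imported name below is a hypothetical placeholder.
from ingestion import load_raw_files
from data_quality import apply_rules
from analytics.promotion_analysis import run_promotion_analysis
from analytics.loyalty_engine import run_loyalty_engine
from analytics.segmentation import run_segmentation
from analytics.inventory_analysis import run_inventory_analysis
from notifications import send_notifications

def run_pipeline() -> None:
    load_raw_files()             # CSV -> raw schema
    apply_rules()                # raw -> staging / quarantine
    run_promotion_analysis()     # staging -> promotion lift reports
    run_loyalty_engine()         # staging -> loyalty ledger
    run_segmentation()           # staging -> RFM segments
    run_inventory_analysis()     # staging -> stock-out impact
    send_notifications()         # segments + ledger -> simulated emails

if __name__ == "__main__":
    run_pipeline()
```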
---

## 🗄️ Database Design

- Fully normalized 3NF relational schema
- Clear separation of concerns:
  - Raw schema – raw ingested data
  - Staging schema – clean, validated data
  - Quarantine schema – rejected records with error reasons
- Standard retail entities:
  - Stores, Products, Customers
  - Sales Header & Line Items
  - Promotions & Loyalty Rules
Design follows industry-standard retail transaction modeling.
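
A stand-alone sketch of the two core transaction tables, written in SQLite syntax so it runs without a database server; the column names and types are illustrative (PostgreSQL/MySQL would use SERIAL/AUTO_INCREMENT and NUMERIC instead):

```python
# Core transaction tables, sketched in SQLite syntax; names and types are illustrative.
import sqlite3

DDL = """
CREATE TABLE sales_header (
    transaction_id   INTEGER PRIMARY KEY,
    store_id         INTEGER NOT NULL,
    customer_id      INTEGER NOT NULL,
    transaction_date TEXT    NOT NULL,
    total_amount     REAL    NOT NULL
);

CREATE TABLE sales_line_item (
    line_item_id     INTEGER PRIMARY KEY,
    transaction_id   INTEGER NOT NULL REFERENCES sales_header(transaction_id),
    product_id       INTEGER NOT NULL,
    quantity         INTEGER NOT NULL,
    line_item_amount REAL    NOT NULL
);
"""

with sqlite3.connect(":memory:") as conn:
    conn.executescript(DDL)
    print("core transaction tables created")
```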
---

## 🧪 Synthetic Data Generation

- Synthetic data generated using LLM-assisted pipelines
- Ensures:
  - Referential integrity across entities
  - Realistic sales, promotion, and customer behavior
  - Controlled injection of data quality issues
- Simulates:
  - Multiple stores
  - Multiple promotions
  - 7+ days of transactional data

📄 Detailed planning is documented separately in `dataset_generation.md`.
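
A minimal generation sketch with controlled error injection; the row counts, value ranges, and error rate are illustrative and not the values documented in `dataset_generation.md`:

```python
# Synthetic line-item sketch with deliberate quality issues; all parameters are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def generate_line_items(n_rows: int = 1_000, error_rate: float = 0.02) -> pd.DataFrame:
    df = pd.DataFrame({
        "transaction_id": rng.integers(1, 300, n_rows),
        "product_id": rng.integers(1, 50, n_rows).astype(float),
        "quantity": rng.integers(1, 5, n_rows),
        "line_item_amount": rng.uniform(10, 500, n_rows).round(2),
    })
    # Deliberately inject quality issues so the rules engine has something to catch.
    bad = rng.random(n_rows) < error_rate
    df.loc[bad, "product_id"] = np.nan         # NULL product_id
    df.loc[bad, "line_item_amount"] *= -1      # negative amounts
    return df

print(generate_line_items().head())
```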
---

## ✅ Data Quality Rules

- `product_id` must not be NULL
- `line_item_amount` must be positive
- `store_id` and `customer_id` must exist
- `transaction_date` must be valid
- Header `total_amount` must match the sum of its line items
Rejected records are preserved for auditing and analysis.
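
A pandas sketch of the rules engine's split into staging and quarantine. It shows the line-item rules above; header-level checks (valid store/customer, valid date, total reconciliation) would follow the same pattern, and the reference sets would come from the reference tables in practice:

```python
# Rules-engine sketch; column names mirror the rules above, reference data is assumed.
import pandas as pd

def validate_line_items(df: pd.DataFrame, valid_products: set) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split line items into clean rows (staging) and rejected rows (quarantine)."""
    checks = {
        "product_id_null": df["product_id"].isna(),
        "non_positive_amount": df["line_item_amount"] <= 0,
        "unknown_product": ~df["product_id"].isin(valid_products),
    }
    failures = pd.DataFrame(checks)
    bad_mask = failures.any(axis=1)

    clean = df[~bad_mask].copy()          # -> staging schema
    quarantine = df[bad_mask].copy()      # -> quarantine schema
    # Keep the reason with every rejected row so nothing is silently lost.
    quarantine["error_reason"] = [
        ", ".join(failures.columns[row]) for row in failures[bad_mask].to_numpy()
    ]
    return clean, quarantine
```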
---

## 🛠️ Tech Stack

| Layer | Tools |
|---|---|
| Language | Python 3 |
| Data Processing | pandas, SQL |
| Database | PostgreSQL / MySQL |
| Visualization | matplotlib / plotly |
| Orchestration | Python scripts |
| Data Format | CSV → Database |
| ML Readiness | Clean, labeled datasets |
---

## 📁 Project Structure

```text
├── data/
│ ├── raw/
│ ├── generated/
│ └── reference/
├── ingestion/
├── data_quality/
├── analytics/
│ ├── promotion_analysis/
│ ├── loyalty_engine/
│ ├── segmentation/
│ └── inventory_analysis/
├── notifications/
├── dataset_generation.md
├── README.md
└── requirements.txt
```
---
## 👥 Team Responsibilities
| Team Member | Responsibility |
|-----------|---------------|
| **Prateek** | Data generation pipeline implementation |
| **Surabhi** | Model training & ML experimentation |
| **Ravindar** | Hosting, deployment & infrastructure |
| **Manoranjan** | End-to-end pipeline orchestration |
---
## 🧩 Design Principles
- **Simplicity first** – clear, readable pipelines
- **Modularity** – each use case independently runnable
- **Traceability** – no silent data loss
- **Scalability** – easy to extend for ML & real-time use
- **Industry-aligned modeling** – real retail data patterns
---
## 🚀 Future Enhancements
- Real-time streaming ingestion (Kafka)
- ML-based customer churn prediction
- Promotion recommendation engine
- Cloud-native orchestration (Airflow)
- Real email/SMS notification integration
---
## 📜 License
This project is developed as part of the **HCLTech Hackathon** and is intended for educational and demonstration purposes.
---
## 🙌 Acknowledgements
- HCLTech Hackathon Team
- Open-source Python ecosystem
- Retail data modeling best practices