ETL Pipeline - Fashion Studio Data Processing

This is an ETL (Extract, Transform, Load) application that scrapes product data from the Fashion Studio website, cleans it, and stores the results across multiple repositories.

What Does It Do?

Basically, the app:

  • Extract: Scrapes 1,000+ products from Fashion Studio (50 pages)
  • Transform: Cleans the data (removes duplicates; validates prices, ratings, etc.)
  • Load: Saves to CSV, PostgreSQL, and Google Sheets

From roughly 1,000 raw products, the pipeline yields around 867 clean, validated products ready to use.
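
To make the flow concrete, the three stages might be wired together roughly like this (the function names are assumptions inferred from the utils/ layout, not the project's actual API):

```python
# Hypothetical wiring of the three stages; the real main.py may differ.
from utils.extract import scrape_products    # assumed function name
from utils.transform import clean_products   # assumed function name
from utils.load import save_all              # assumed function name

def run_pipeline() -> None:
    raw = scrape_products(pages=50)    # Extract: ~1,000 raw products
    clean = clean_products(raw)        # Transform: validate and deduplicate
    save_all(clean)                    # Load: CSV, PostgreSQL, Google Sheets

if __name__ == "__main__":
    run_pipeline()
```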

Features

  • Web Scraping: Extracts product data from a dynamic website with proper error handling
  • Data Validation: Validates prices, ratings, colors, and other fields with specific rules
  • Multi-Repository Storage: Saves cleaned data to CSV, PostgreSQL, and Google Sheets simultaneously
  • Currency Conversion: Converts USD prices to IDR automatically
  • Duplicate Detection: Identifies and removes duplicate products
  • Timestamp Tracking: Records extraction time for each product
  • Error Handling: Comprehensive try-catch blocks throughout the pipeline
  • Unit Testing: 95% code coverage with 58 test cases

Techniques & Technologies Used

Data Extraction

  • Web Scraping: BeautifulSoup4 for HTML parsing
  • HTTP Requests: requests library for fetching web pages
  • CSS Selectors: Pattern matching to locate product elements (see the sketch below)
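
As a rough illustration of this stage (the URL and the CSS selectors here are assumptions, not the site's actual markup):

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[dict]:
    """Fetch one catalog page and parse product cards via CSS selectors."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # fail fast on HTTP errors
    soup = BeautifulSoup(resp.text, "html.parser")
    products = []
    for card in soup.select("div.product-card"):  # selector is an assumption
        products.append({
            "Title": card.select_one("h3").get_text(strip=True),
            "Price": card.select_one(".price").get_text(strip=True),
        })
    return products
```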

Data Transformation

  • Data Cleaning: Regex patterns for extracting prices, ratings, and sizes
  • Validation Rules: Custom validators for data quality checks
  • Pandas Operations: DataFrame manipulation for data processing
  • Deduplication: SQL-like operations for removing duplicates (see the sketch below)
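
A condensed sketch of what this stage could look like (column names and the regex pattern are assumptions):

```python
import re
import pandas as pd

PRICE_RE = re.compile(r"\$([\d,.]+)")  # e.g. "$120.50" -> "120.50"

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Parse raw fields, drop invalid rows, and remove duplicates."""
    df = df.copy()
    # Regex extraction: pull the numeric part out of the raw price string.
    df["Price"] = (df["Price"].str.extract(PRICE_RE, expand=False)
                              .str.replace(",", "", regex=False)
                              .astype(float))
    # Validation: drop rows with missing fields or a zero price.
    df = df.dropna(subset=["Title", "Price"])
    df = df[df["Price"] > 0]
    # Deduplication, pandas-style rather than SQL.
    df = df.drop_duplicates(subset=["Title", "Price", "Size", "Gender"])
    return df.reset_index(drop=True)
```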

Data Storage

  • CSV Export: Pandas to_csv with UTF-8 encoding
  • PostgreSQL: SQLAlchemy ORM for database operations with AUTOCOMMIT isolation level
  • Google Sheets API: gspread library for API integration with service-account authentication (see the sketch below)
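
A rough sketch of the CSV and PostgreSQL writers (the connection URL is a placeholder; the real load.py may differ):

```python
import pandas as pd
from sqlalchemy import create_engine

def save_csv(df: pd.DataFrame, path: str = "products.csv") -> None:
    # UTF-8 CSV export, as described above.
    df.to_csv(path, index=False, encoding="utf-8")

def save_postgres(
    df: pd.DataFrame,
    url: str = "postgresql://user:pass@localhost/productsdb",  # placeholder
) -> None:
    # AUTOCOMMIT matches the isolation level mentioned above.
    engine = create_engine(url, isolation_level="AUTOCOMMIT")
    df.to_sql("products", engine, if_exists="append", index=False)
```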

Testing & Quality

  • Unit Testing: pytest framework with mocking for external dependencies
  • Test Coverage: coverage.py for measuring code coverage
  • Error Scenarios: Tests for edge cases, missing data, connection failures
  • Mock Testing: Simulating database and API calls without real connections (see the example below)
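
For example, a scraping test can stub out the network entirely (this assumes the hypothetical scrape_page helper from the extraction sketch above lives in utils/extract.py):

```python
from unittest.mock import MagicMock, patch

from utils.extract import scrape_page  # hypothetical helper, see sketch above

SAMPLE_HTML = (
    '<div class="product-card">'
    '<h3>T-Shirt</h3><span class="price">$10</span></div>'
)

@patch("utils.extract.requests.get")
def test_scrape_page_parses_products(mock_get):
    # No real HTTP request happens; the mock returns canned HTML.
    mock_get.return_value = MagicMock(status_code=200, text=SAMPLE_HTML)
    products = scrape_page("https://example.com/page/1")
    assert products == [{"Title": "T-Shirt", "Price": "$10"}]
```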

Setup

1. Install Dependencies

pip install -r requirements.txt

2. Database Setup (PostgreSQL)

python setup_db.py

This creates the productsdb database and products table.
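
Conceptually, setup_db.py might do something like this (credentials and column definitions are placeholders, not the script's actual contents):

```python
from sqlalchemy import create_engine, text

# Connect to the default database first; CREATE DATABASE cannot run
# inside a transaction block, hence AUTOCOMMIT.
admin = create_engine(
    "postgresql://postgres:postgres@localhost/postgres",
    isolation_level="AUTOCOMMIT",
)
with admin.connect() as conn:
    conn.execute(text("CREATE DATABASE productsdb"))

db = create_engine("postgresql://postgres:postgres@localhost/productsdb")
with db.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS products (
            title TEXT, price NUMERIC, rating REAL,
            colors INTEGER, size TEXT, gender TEXT,
            timestamp TIMESTAMP
        )
    """))
```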

3. Google Sheets (Optional)

If you want to save to Google Sheets:

  • Place client_secret.json in root folder
  • The Sheet ID is already configured: 1FJm35pZXkW_xVuVzGQLoXFQz8GfBTomS-c9RUY6JB3Q (see the sketch below)
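
A sketch of how the Sheets connection might look with gspread (the worksheet choice and header row are assumptions):

```python
import gspread

SHEET_ID = "1FJm35pZXkW_xVuVzGQLoXFQz8GfBTomS-c9RUY6JB3Q"

# Service-account auth using the key file in the project root.
gc = gspread.service_account(filename="client_secret.json")
worksheet = gc.open_by_key(SHEET_ID).sheet1

# Write the header row in one call; data rows would follow the same shape.
worksheet.update(
    [["Title", "Price", "Rating", "Colors", "Size", "Gender", "Timestamp"]]
)
```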

Usage

Run Pipeline

python main.py

Data gets saved to:

  • products.csv - CSV file
  • PostgreSQL database
  • Google Sheets (if credentials present)

Run Tests

pytest tests -v

Check Test Coverage

coverage run -m pytest tests
coverage report

Project Structure

.
├── main.py                # Entry point
├── utils/
│   ├── extract.py         # Web scraping
│   ├── transform.py       # Data cleaning
│   ├── load.py            # Save to repositories
│   └── __init__.py
├── tests/
│   ├── test_extract.py
│   ├── test_load.py
│   └── test_transform.py
├── products.csv           # Output file
├── client_secret.json     # Google credentials
├── requirements.txt
└── setup_db.py            # Database initialization

Output

After running python main.py, you get:

CSV File (products.csv):

  • Title
  • Price (in IDR)
  • Rating
  • Colors
  • Size
  • Gender
  • Timestamp

PostgreSQL Database:

  • Database: productsdb
  • Table: products

Google Sheets:

  • Same format as CSV, but in Google Sheets

Notes

  • Currency conversion: USD → IDR (1 USD = Rp 16,000)
  • Invalid products are removed (zero price, missing data, etc.)
  • Duplicates are removed based on Title + Price + Size + Gender (illustrated below)
  • All data includes an extraction timestamp
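
As a toy illustration of the conversion and the deduplication key:

```python
import pandas as pd

USD_TO_IDR = 16_000  # fixed rate used by the pipeline

df = pd.DataFrame({
    "Title":  ["T-Shirt", "T-Shirt", "Jacket"],
    "Price":  [10.0, 10.0, 0.0],          # USD; the zero-price row is invalid
    "Size":   ["M", "M", "L"],
    "Gender": ["Men", "Men", "Women"],
})

df = df[df["Price"] > 0].copy()           # drop invalid products
df["Price"] = df["Price"] * USD_TO_IDR    # 10 USD -> 160,000 IDR
df = df.drop_duplicates(subset=["Title", "Price", "Size", "Gender"])
# One T-Shirt row at Price=160000.0 remains; the Jacket was dropped.
```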

Testing

The project includes 58 test cases with 95% code coverage.

Tests cover:

  • Web scraping with edge cases
  • Data transformation
  • Error handling
  • CSV/PostgreSQL/Google Sheets operations
  • Data validation

Run: pytest tests -v

Results

This process gives you:

  • ✅ Clean, validated dataset
  • ✅ Multi-repository backup (CSV + PostgreSQL + Google Sheets)
  • ✅ Extraction timestamp tracking
  • ✅ Comprehensive test suite
  • ✅ Production-ready error handling

Author & Disclaimer

This project was created by Aditiya Ihzar Eka Prayogo as a submission for Dicoding's "Belajar Fundamental Pemrosesan Data dengan Python" (Fundamentals of Data Processing with Python) course.

Important: This is original work created for educational purposes. Plagiarism or copying this code without proper attribution is a serious violation of academic integrity and intellectual property rights. Please cite this project appropriately if you use any part of it.

License: Use this project as a learning reference only. For commercial or redistribution purposes, proper credits and permissions are required.
