ETL Pipeline - Fashion Studio Data Processing

This is an ETL (Extract, Transform, Load) application that scrapes product data from the Fashion Studio website, cleans it, and stores the results across multiple repositories.

What Does It Do?

Basically, the app:

  • Extract: Scrapes 1,000+ products from Fashion Studio (50 pages)
  • Transform: Cleans the data (removes duplicates; validates prices, ratings, etc.)
  • Load: Saves to CSV, PostgreSQL, and Google Sheets

From roughly 1,000 raw products, the pipeline yields around 867 clean, validated products ready to use.
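
To make the flow concrete, the three stages might be wired together roughly like this (the function names are assumptions inferred from the utils/ layout, not the project's actual API):

```python
# Hypothetical wiring of the three stages; the real main.py may differ.
from utils.extract import scrape_products    # assumed function name
from utils.transform import clean_products   # assumed function name
from utils.load import save_all              # assumed function name

def run_pipeline() -> None:
    raw = scrape_products(pages=50)    # Extract: ~1,000 raw products
    clean = clean_products(raw)        # Transform: validate and deduplicate
    save_all(clean)                    # Load: CSV, PostgreSQL, Google Sheets

if __name__ == "__main__":
    run_pipeline()
```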

Features

  • Web Scraping: Extracts product data from a dynamic website with proper error handling
  • Data Validation: Validates prices, ratings, colors, and other fields with specific rules
  • Multi-Repository Storage: Saves cleaned data to CSV, PostgreSQL, and Google Sheets simultaneously
  • Currency Conversion: Converts USD prices to IDR automatically
  • Duplicate Detection: Identifies and removes duplicate products
  • Timestamp Tracking: Records extraction time for each product
  • Error Handling: Comprehensive try-catch blocks throughout the pipeline
  • Unit Testing: 95% code coverage with 58 test cases

Techniques & Technologies Used

Data Extraction

  • Web Scraping: BeautifulSoup4 for HTML parsing
  • HTTP Requests: requests library for fetching web pages
  • CSS Selectors: Pattern matching to locate product elements (see the sketch below)
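
As a rough illustration of this stage (the URL and the CSS selectors here are assumptions, not the site's actual markup):

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[dict]:
    """Fetch one catalog page and parse product cards via CSS selectors."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # fail fast on HTTP errors
    soup = BeautifulSoup(resp.text, "html.parser")
    products = []
    for card in soup.select("div.product-card"):  # selector is an assumption
        products.append({
            "Title": card.select_one("h3").get_text(strip=True),
            "Price": card.select_one(".price").get_text(strip=True),
        })
    return products
```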

Data Transformation

  • Data Cleaning: Regex patterns for extracting prices, ratings, and sizes
  • Validation Rules: Custom validators for data quality checks
  • Pandas Operations: DataFrame manipulation for data processing
  • Deduplication: SQL-like operations for removing duplicates (see the sketch below)
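
A condensed sketch of what this stage could look like (column names and the regex pattern are assumptions):

```python
import re
import pandas as pd

PRICE_RE = re.compile(r"\$([\d,.]+)")  # e.g. "$120.50" -> "120.50"

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Parse raw fields, drop invalid rows, and remove duplicates."""
    df = df.copy()
    # Regex extraction: pull the numeric part out of the raw price string.
    df["Price"] = (df["Price"].str.extract(PRICE_RE, expand=False)
                              .str.replace(",", "", regex=False)
                              .astype(float))
    # Validation: drop rows with missing fields or a zero price.
    df = df.dropna(subset=["Title", "Price"])
    df = df[df["Price"] > 0]
    # Deduplication, pandas-style rather than SQL.
    df = df.drop_duplicates(subset=["Title", "Price", "Size", "Gender"])
    return df.reset_index(drop=True)
```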

Data Storage

  • CSV Export: Pandas to_csv with UTF-8 encoding
  • PostgreSQL: SQLAlchemy ORM for database operations with AUTOCOMMIT isolation level
  • Google Sheets API: gspread library for API integration with service-account authentication (see the sketch below)
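
A rough sketch of the CSV and PostgreSQL writers (the connection URL is a placeholder; the real load.py may differ):

```python
import pandas as pd
from sqlalchemy import create_engine

def save_csv(df: pd.DataFrame, path: str = "products.csv") -> None:
    # UTF-8 CSV export, as described above.
    df.to_csv(path, index=False, encoding="utf-8")

def save_postgres(
    df: pd.DataFrame,
    url: str = "postgresql://user:pass@localhost/productsdb",  # placeholder
) -> None:
    # AUTOCOMMIT matches the isolation level mentioned above.
    engine = create_engine(url, isolation_level="AUTOCOMMIT")
    df.to_sql("products", engine, if_exists="append", index=False)
```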

Testing & Quality

  • Unit Testing: pytest framework with mocking for external dependencies
  • Test Coverage: coverage.py for measuring code coverage
  • Error Scenarios: Tests for edge cases, missing data, connection failures
  • Mock Testing: Simulating database and API calls without real connections (see the example below)
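
For example, a scraping test can stub out the network entirely (this assumes the hypothetical scrape_page helper from the extraction sketch above lives in utils/extract.py):

```python
from unittest.mock import MagicMock, patch

from utils.extract import scrape_page  # hypothetical helper, see sketch above

SAMPLE_HTML = (
    '<div class="product-card">'
    '<h3>T-Shirt</h3><span class="price">$10</span></div>'
)

@patch("utils.extract.requests.get")
def test_scrape_page_parses_products(mock_get):
    # No real HTTP request happens; the mock returns canned HTML.
    mock_get.return_value = MagicMock(status_code=200, text=SAMPLE_HTML)
    products = scrape_page("https://example.com/page/1")
    assert products == [{"Title": "T-Shirt", "Price": "$10"}]
```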

Setup

1. Install Dependencies

pip install -r requirements.txt

2. Database Setup (PostgreSQL)

python setup_db.py

This creates the productsdb database and products table.
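
Conceptually, setup_db.py might do something like this (credentials and column definitions are placeholders, not the script's actual contents):

```python
from sqlalchemy import create_engine, text

# Connect to the default database first; CREATE DATABASE cannot run
# inside a transaction block, hence AUTOCOMMIT.
admin = create_engine(
    "postgresql://postgres:postgres@localhost/postgres",
    isolation_level="AUTOCOMMIT",
)
with admin.connect() as conn:
    conn.execute(text("CREATE DATABASE productsdb"))

db = create_engine("postgresql://postgres:postgres@localhost/productsdb")
with db.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS products (
            title TEXT, price NUMERIC, rating REAL,
            colors INTEGER, size TEXT, gender TEXT,
            timestamp TIMESTAMP
        )
    """))
```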

3. Google Sheets (Optional)

If you want to save to Google Sheets:

  • Place client_secret.json in root folder
  • The Sheet ID is already configured: 1FJm35pZXkW_xVuVzGQLoXFQz8GfBTomS-c9RUY6JB3Q (see the sketch below)
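
A sketch of how the Sheets connection might look with gspread (the worksheet choice and header row are assumptions):

```python
import gspread

SHEET_ID = "1FJm35pZXkW_xVuVzGQLoXFQz8GfBTomS-c9RUY6JB3Q"

# Service-account auth using the key file in the project root.
gc = gspread.service_account(filename="client_secret.json")
worksheet = gc.open_by_key(SHEET_ID).sheet1

# Write the header row in one call; data rows would follow the same shape.
worksheet.update(
    [["Title", "Price", "Rating", "Colors", "Size", "Gender", "Timestamp"]]
)
```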

Usage

Run Pipeline

python main.py

Data gets saved to:

  • products.csv - CSV file
  • PostgreSQL database
  • Google Sheets (if credentials present)

Run Tests

pytest tests -v

Check Test Coverage

coverage run -m pytest tests
coverage report

Project Structure

.
├── main.py                # Entry point
├── utils/
│   ├── extract.py         # Web scraping
│   ├── transform.py       # Data cleaning
│   ├── load.py            # Save to repositories
│   └── __init__.py
├── tests/
│   ├── test_extract.py
│   ├── test_load.py
│   └── test_transform.py
├── products.csv           # Output file
├── client_secret.json     # Google credentials
├── requirements.txt
└── setup_db.py            # Database initialization

Output

After running python main.py, you get:

CSV File (products.csv):

  • Title
  • Price (in IDR)
  • Rating
  • Colors
  • Size
  • Gender
  • Timestamp

PostgreSQL Database:

  • Database: productsdb
  • Table: products

Google Sheets:

  • Same format as CSV, but in Google Sheets

Notes

  • Currency conversion: USD → IDR (1 USD = Rp 16,000)
  • Invalid products are removed (zero price, missing data, etc.)
  • Duplicates are removed based on Title + Price + Size + Gender (illustrated below)
  • All data includes an extraction timestamp
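
As a toy illustration of the conversion and the deduplication key:

```python
import pandas as pd

USD_TO_IDR = 16_000  # fixed rate used by the pipeline

df = pd.DataFrame({
    "Title":  ["T-Shirt", "T-Shirt", "Jacket"],
    "Price":  [10.0, 10.0, 0.0],          # USD; the zero-price row is invalid
    "Size":   ["M", "M", "L"],
    "Gender": ["Men", "Men", "Women"],
})

df = df[df["Price"] > 0].copy()           # drop invalid products
df["Price"] = df["Price"] * USD_TO_IDR    # 10 USD -> 160,000 IDR
df = df.drop_duplicates(subset=["Title", "Price", "Size", "Gender"])
# One T-Shirt row at Price=160000.0 remains; the Jacket was dropped.
```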

Testing

The project includes 58 test cases with 95% code coverage.

Tests cover:

  • Web scraping with edge cases
  • Data transformation
  • Error handling
  • CSV/PostgreSQL/Google Sheets operations
  • Data validation

Run: pytest tests -v

Results

This process gives you:

  • ✅ Clean, validated dataset
  • ✅ Multi-repository backup (CSV + PostgreSQL + Google Sheets)
  • ✅ Extraction timestamp tracking
  • ✅ Comprehensive test suite
  • ✅ Production-ready error handling

Author & Disclaimer

This project was created by Aditiya Ihzar Eka Prayogo as a submission for Dicoding's "Belajar Fundamental Pemrosesan Data dengan Python" (Fundamentals of Data Processing with Python) course.

Important: This is original work created for educational purposes. Plagiarism or copying this code without proper attribution is a serious violation of academic integrity and intellectual property rights. Please cite this project appropriately if you use any part of it.

License: Use this project as a learning reference only. For commercial or redistribution purposes, proper credits and permissions are required.
