This is an ETL (Extract, Transform, Load) application that scrapes product data from the Fashion Studio website, cleans the data, and stores it across multiple repositories.
Basically, the app:
- Extract: Scrapes 1000+ products from Fashion Studio (50 pages)
- Transform: Cleans data (removes duplicates, validates prices, ratings, etc)
- Load: Saves to CSV, PostgreSQL, and Google Sheets
From roughly 1,000 raw products, around 867 clean, validated products remain, ready to use.
- Web Scraping: Extracts product data from dynamic website with proper error handling
- Data Validation: Validates prices, ratings, colors, and other fields with specific rules
- Multi-Repository Storage: Saves cleaned data to CSV, PostgreSQL, and Google Sheets simultaneously
- Currency Conversion: Converts USD prices to IDR automatically
- Duplicate Detection: Identifies and removes duplicate products
- Timestamp Tracking: Records extraction time for each product
- Error Handling: Comprehensive try-catch blocks throughout the pipeline
- Unit Testing: 95% code coverage with 58 test cases
- Web Scraping: BeautifulSoup4 for HTML parsing
- HTTP Requests: requests library for fetching web pages
- CSS Selectors: Pattern matching to locate product elements
- Data Cleaning: Regex patterns for extracting prices, ratings, sizes
- Validation Rules: Custom validators for data quality checks
- Pandas Operations: DataFrame manipulation for data processing
- Deduplication: SQL-like operations for removing duplicates
- CSV Export: Pandas to_csv with UTF-8 encoding
- PostgreSQL: SQLAlchemy ORM for database operations with AUTOCOMMIT isolation level
- Google Sheets API: gspread library for API integration with service account authentication
- Unit Testing: pytest framework with mocking for external dependencies
- Test Coverage: coverage.py for measuring code coverage
- Error Scenarios: Tests for edge cases, missing data, connection failures
- Mock Testing: Simulating database and API calls without real connections
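As a rough illustration of the extraction stack above, the sketch below combines requests, BeautifulSoup4, and CSS selectors with basic error handling. The base URL and the `div.product-card` / `.price` selectors are placeholders, not the site's real markup:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://fashion-studio.example.com"  # placeholder, not the real URL

def parse_products(html: str) -> list[dict]:
    """Pull title/price pairs out of one listing page with CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product-card"):  # selector is an assumption
        title = card.select_one("h3")
        price = card.select_one(".price")
        products.append({
            "Title": title.get_text(strip=True) if title else None,
            "Price": price.get_text(strip=True) if price else None,
        })
    return products

def scrape_page(page: int) -> list[dict]:
    """Fetch one of the 50 listing pages; return [] on any HTTP failure."""
    try:
        response = requests.get(f"{BASE_URL}/page{page}", timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Failed to fetch page {page}: {exc}")
        return []
    return parse_products(response.text)
```

Splitting fetching (`scrape_page`) from parsing (`parse_products`) keeps the parser testable without any network connection, which is how the mock-based tests below avoid real requests.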
```
pip install -r requirements.txt
python setup_db.py
```
This creates the `productsdb` database and the `products` table.
If you want to save to Google Sheets:
- Place `client_secret.json` in the root folder
- Sheet ID is already configured: `1FJm35pZXkW_xVuVzGQLoXFQz8GfBTomS-c9RUY6JB3Q`
```
python main.py
```
Data gets saved to:
- `products.csv` - CSV file
- PostgreSQL database
- Google Sheets (if credentials present)
```
pytest tests -v
coverage run -m pytest tests
coverage report
```
├── main.py # Entry point
├── utils/
│ ├── extract.py # Web scraping
│ ├── transform.py # Data cleaning
│ ├── load.py # Save to repositories
│ └── __init__.py
├── tests/
│ ├── test_extract.py
│ ├── test_load.py
│ └── test_transform.py
├── products.csv # Output file
├── client_secret.json # Google credentials
├── requirements.txt
└── setup_db.py # Database initialization
After running `python main.py`, you get:
CSV File (products.csv):
- Title
- Price (in IDR)
- Rating
- Colors
- Size
- Gender
- Timestamp
PostgreSQL Database:
- Database: `productsdb`
- Table: `products`
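A minimal sketch of the PostgreSQL load step, assuming the SQLAlchemy engine with AUTOCOMMIT isolation mentioned earlier. The connection string, credentials, and the `load_products` helper name are illustrative, not the project's actual code:

```python
import pandas as pd
from sqlalchemy import create_engine

def load_products(df: pd.DataFrame, engine) -> None:
    """Append the cleaned DataFrame to the products table."""
    df.to_sql("products", engine, if_exists="append", index=False)

# Example wiring -- user/password are placeholders, adjust to your setup:
# engine = create_engine(
#     "postgresql://postgres:password@localhost:5432/productsdb",
#     isolation_level="AUTOCOMMIT",
# )
# load_products(clean_df, engine)
```

Taking the engine as a parameter keeps the loader database-agnostic, so tests can swap in an in-memory SQLite engine instead of a live PostgreSQL connection.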
Google Sheets:
- Same format as CSV, but in Google Sheets
- Currency conversion: USD → IDR (1 USD = Rp 16,000)
- Invalid products removed (zero price, missing fields, etc.)
- Duplicates removed based on Title + Price + Size + Gender
- All data includes extraction timestamp
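The transformations above can be sketched in pandas roughly as follows. Column names follow the output format described earlier, but the `transform` helper itself is an assumption, not the project's actual implementation:

```python
import pandas as pd

USD_TO_IDR = 16_000  # fixed rate stated above

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean raw scraped rows: convert currency, drop invalid rows, dedupe."""
    df = df.copy()
    # Parse price strings like "$10.00" into floats, then convert USD -> IDR.
    df["Price"] = df["Price"].astype(str).str.replace(r"[^\d.]", "", regex=True)
    df = df[df["Price"] != ""]
    df["Price"] = df["Price"].astype(float) * USD_TO_IDR
    # Drop invalid products: zero price or missing title.
    df = df[(df["Price"] > 0) & df["Title"].notna()]
    # Remove duplicates on the composite key used by the pipeline.
    df = df.drop_duplicates(subset=["Title", "Price", "Size", "Gender"])
    return df.reset_index(drop=True)
```

The real pipeline also validates ratings and colors and attaches the extraction timestamp; those steps are omitted here for brevity.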
The project includes 58 test cases with 95% code coverage.
Tests cover:
- Web scraping with edge cases
- Data transformation
- Error handling
- CSV/PostgreSQL/Google Sheets operations
- Data validation
Run: pytest tests -v
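As an illustration of the mock-testing approach, the toy test below patches `requests.get` so no real network connection is made. `fetch_status` is a stand-in helper, not a function from this project:

```python
from unittest.mock import Mock, patch

import requests

def fetch_status(url: str) -> int:
    """Tiny helper standing in for the real extract call."""
    return requests.get(url, timeout=10).status_code

def test_fetch_status_mocked():
    # Replace requests.get with a mock so the test needs no network.
    with patch("requests.get") as mock_get:
        mock_get.return_value = Mock(status_code=200)
        assert fetch_status("https://example.com") == 200
        mock_get.assert_called_once()

test_fetch_status_mocked()
```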
This process gives you:
- ✅ Clean, validated dataset
- ✅ Multiple repository backup (CSV + PostgreSQL + Google Sheets)
- ✅ Extraction timestamp tracking
- ✅ Comprehensive test suite
- ✅ Production-ready error handling
This project was created by Aditiya Ihzar Eka Prayogo as a submission for Dicoding's "Belajar Fundamental Pemrosesan Data dengan Python" course.
Important: This is original work created for educational purposes. Plagiarism or copying this code without proper attribution is a serious violation of academic integrity and intellectual property rights. Please cite this project appropriately if you use any part of it.
License: Use this project as a learning reference only. For commercial or redistribution purposes, proper credits and permissions are required.