This project demonstrates a complete data pipeline to extract structured product information from e-commerce web pages using Python, web scraping, and OpenAI's GPT.
It automatically scrapes product pages, uses GPT to parse product attributes like name, GTIN, ingredients, and allergen info, validates the output using Pydantic, and stores the result as JSON.
- Python
- Selenium & BeautifulSoup β For web scraping
- OpenAI GPT API β For AI-based structured extraction
- n8n (planned integration) β Workflow automation
- Pydantic β Data validation
- JSON / Local File Storage
project_root/ βββ src/ β βββ scrapers/ # Web scraping logic β βββ gpt_parser/ # GPT prompt and API call β βββ validators/ # Pydantic-based data validation β βββ main.py # Orchestration script βββ data/ β βββ sample_input/ # Example HTML inputs β βββ output_json/ # Extracted JSON data βββ requirements.txt # Python dependencies βββ README.md
pip install -r requirements.txtexport OPENAI_API_KEY=your-key-herepython src/main.pyYouβll be prompted to enter a product URL.
{
"Product Name": "Almond Banana Chips",
"GTIN": "9312345678901",
"Description": "Delicious crispy banana chips with almond flavor...",
"Ingredients": "Banana, Almond Oil, Salt",
"Allergen Info": "May contain peanuts"
}This extractor pipeline was built to automate data collection and QA for a client scraping 500+ retailers weekly. It reduced manual QA efforts by 60%, improved speed 3x, and enhanced accuracy through AI.
- Add support for image-based OCR + GPT
- Integrate n8n for full workflow scheduling & alerting
- Extend to store data in PostgreSQL or AWS S3
Built by Biswajit Biswal β contributions welcome!