Skip to content

Scrapes product pages using Python & Selenium, extracts structured attributes using OpenAI GPT, validates fields via Pydantic, and outputs clean JSON.

Notifications You must be signed in to change notification settings

MJS-Tech-Ventures/product-data-extractor-gpt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

2 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ›’ AI-Powered Product Data Extractor

This project demonstrates a complete data pipeline to extract structured product information from e-commerce web pages using Python, web scraping, and OpenAI's GPT.

It automatically scrapes product pages, uses GPT to parse product attributes like name, GTIN, ingredients, and allergen info, validates the output using Pydantic, and stores the result as JSON.


πŸ”§ Tech Stack

  • Python
  • Selenium & BeautifulSoup – For web scraping
  • OpenAI GPT API – For AI-based structured extraction
  • n8n (planned integration) – Workflow automation
  • Pydantic – Data validation
  • JSON / Local File Storage

πŸ“‚ Project Structure

project_root/ β”œβ”€β”€ src/ β”‚ β”œβ”€β”€ scrapers/ # Web scraping logic β”‚ β”œβ”€β”€ gpt_parser/ # GPT prompt and API call β”‚ β”œβ”€β”€ validators/ # Pydantic-based data validation β”‚ └── main.py # Orchestration script β”œβ”€β”€ data/ β”‚ β”œβ”€β”€ sample_input/ # Example HTML inputs β”‚ └── output_json/ # Extracted JSON data β”œβ”€β”€ requirements.txt # Python dependencies └── README.md


πŸš€ How to Run

1. Clone the repo & install dependencies

pip install -r requirements.txt

2. Set your OpenAI API key

export OPENAI_API_KEY=your-key-here

3. Run the pipeline

python src/main.py

You’ll be prompted to enter a product URL.


πŸ’‘ Sample Output

{
  "Product Name": "Almond Banana Chips",
  "GTIN": "9312345678901",
  "Description": "Delicious crispy banana chips with almond flavor...",
  "Ingredients": "Banana, Almond Oil, Salt",
  "Allergen Info": "May contain peanuts"
}

πŸ“Š Business Use Case

This extractor pipeline was built to automate data collection and QA for a client scraping 500+ retailers weekly. It reduced manual QA efforts by 60%, improved speed 3x, and enhanced accuracy through AI.


🧠 Future Enhancements

  • Add support for image-based OCR + GPT
  • Integrate n8n for full workflow scheduling & alerting
  • Extend to store data in PostgreSQL or AWS S3

πŸ“¬ Contact

Built by Biswajit Biswal β€” contributions welcome!

About

Scrapes product pages using Python & Selenium, extracts structured attributes using OpenAI GPT, validates fields via Pydantic, and outputs clean JSON.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages