🛒 AI-Powered Product Data Extractor

This project demonstrates a complete data pipeline to extract structured product information from e-commerce web pages using Python, web scraping, and OpenAI's GPT.

It automatically scrapes product pages, uses GPT to parse product attributes like name, GTIN, ingredients, and allergen info, validates the output using Pydantic, and stores the result as JSON.

🔧 Tech Stack

Python
Selenium & BeautifulSoup – For web scraping
OpenAI GPT API – For AI-based structured extraction
n8n (planned integration) – Workflow automation
Pydantic – Data validation
JSON / Local File Storage

📂 Project Structure

project_root/ ├── src/ │ ├── scrapers/ # Web scraping logic │ ├── gpt_parser/ # GPT prompt and API call │ ├── validators/ # Pydantic-based data validation │ └── main.py # Orchestration script ├── data/ │ ├── sample_input/ # Example HTML inputs │ └── output_json/ # Extracted JSON data ├── requirements.txt # Python dependencies └── README.md

🚀 How to Run

1. Clone the repo & install dependencies

pip install -r requirements.txt

2. Set your OpenAI API key

export OPENAI_API_KEY=your-key-here

3. Run the pipeline

python src/main.py

You’ll be prompted to enter a product URL.

💡 Sample Output

{
  "Product Name": "Almond Banana Chips",
  "GTIN": "9312345678901",
  "Description": "Delicious crispy banana chips with almond flavor...",
  "Ingredients": "Banana, Almond Oil, Salt",
  "Allergen Info": "May contain peanuts"
}

📊 Business Use Case

This extractor pipeline was built to automate data collection and QA for a client scraping 500+ retailers weekly. It reduced manual QA efforts by 60%, improved speed 3x, and enhanced accuracy through AI.

🧠 Future Enhancements

Add support for image-based OCR + GPT
Integrate n8n for full workflow scheduling & alerting
Extend to store data in PostgreSQL or AWS S3

📬 Contact

Built by Biswajit Biswal — contributions welcome!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🛒 AI-Powered Product Data Extractor

🔧 Tech Stack

📂 Project Structure

🚀 How to Run

1. Clone the repo & install dependencies

2. Set your OpenAI API key

3. Run the pipeline

💡 Sample Output

📊 Business Use Case

🧠 Future Enhancements

📬 Contact

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src		src
.gitattributes		.gitattributes
README.md		README.md
requirements.txt		requirements.txt

MJS-Tech-Ventures/product-data-extractor-gpt

Folders and files

Latest commit

History

Repository files navigation

🛒 AI-Powered Product Data Extractor

🔧 Tech Stack

📂 Project Structure

🚀 How to Run

1. Clone the repo & install dependencies

2. Set your OpenAI API key

3. Run the pipeline

💡 Sample Output

📊 Business Use Case

🧠 Future Enhancements

📬 Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages