SmartScraper is an AI-powered tool designed to generate, execute, and refine Python web scrapers. It leverages LLMs (GPT-5.x/Codex) to bridge the gap between human intent and executable Playwright/BeautifulSoup code.
SmartScraper implements an iterative generation process to handle complex DOM structures:
- Generate: AI analyzes the webpage structure (HTML + Screenshot) and writes an initial script.
- Execute: The script is run in a secure, isolated sandbox.
- Refine: If execution fails or data is missing, the AI analyzes the error log and user feedback to patch the code automatically.
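The Generate / Execute / Refine cycle above can be sketched as a small control loop. This is only an illustration, not the project's actual code: `RunResult`, `refine_loop`, the three callables, and the `max_rounds` limit are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Outcome of one sandboxed run (hypothetical shape)."""
    ok: bool             # True when the script ran and produced data
    output: str = ""     # scraped data or stdout on success
    error_log: str = ""  # traceback or diagnostics on failure


def refine_loop(
    generate: Callable[[str], str],
    execute: Callable[[str], RunResult],
    fix: Callable[[str, str], str],
    page_context: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a script that executed successfully, or None if every round failed."""
    script = generate(page_context)              # Generate
    for _ in range(max_rounds):
        result = execute(script)                 # Execute
        if result.ok:
            return script
        script = fix(script, result.error_log)   # Refine using the error log
    return None
```

Capping the number of rounds keeps a script that never converges from consuming model calls indefinitely.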
- Visual Analysis: Uses GPT-Vision to understand page layout and identify target data tables/lists.
- Secure Sandbox: Executes generated code in a restricted environment (whitelisted imports only, no FS/Network access beyond scraping).
- Recursive Correction: The `/fix` endpoint accepts error logs and user feedback to patch the script intelligently.
- Portable Export: Download the finalized scraper as a ZIP package with auto-scheduling scripts (`setup_task.ps1`) for deployment.
The system follows a 3-stage funnel designed for stability and cost-efficiency:
- Browser Layer (Playwright)
  - Injects JavaScript to simplify the DOM, removing noise (`script`, `svg`, `style`) and limiting element depth (StockQ optimized: Limit 300 / Depth 8).
  - Reduces token context by ~90% while retaining structural integrity.
- Analyzer Agent (GPT-5.2)
  - Reads the simplified HTML + Screenshot.
  - Outputs a JSON specification (Selectors, Data Structure) for the generator.
- Generator Agent (Codex)
  - Translates the specification into robust Python code (`requests` + `BeautifulSoup` + `urllib`).
  - Includes `User-Agent` rotation and reliable error handling.
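To make the Browser Layer's DOM simplification concrete, here is a rough Python stand-in. The project performs this step with JavaScript injected through Playwright; the sketch below instead uses the standard library's `html.parser`, so it shows the idea (drop noise tags, cap depth and element count), not the actual implementation. `DomSimplifier`, `simplify_html`, and the choice to keep only `id` and `class` attributes are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

NOISE_TAGS = {"script", "svg", "style"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
KEPT_ATTRS = {"id", "class"}  # assumption: enough to build selectors


class DomSimplifier(HTMLParser):
    def __init__(self, max_elements=300, max_depth=8):
        super().__init__(convert_charrefs=True)
        self.max_elements = max_elements
        self.max_depth = max_depth
        self.out = []
        self.stack = []       # (tag, kept) for every open element
        self.kept_count = 0

    def _parent_kept(self):
        return not self.stack or self.stack[-1][1]

    def handle_starttag(self, tag, attrs):
        keep = (
            self._parent_kept()
            and tag not in NOISE_TAGS
            and len(self.stack) < self.max_depth
            and self.kept_count < self.max_elements
        )
        if keep:
            self.kept_count += 1
            attr_text = "".join(
                f' {k}="{escape(v)}"' for k, v in attrs if k in KEPT_ATTRS and v
            )
            self.out.append(f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.stack.append((tag, keep))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray end tag, or end tag of a void element
        while self.stack:
            open_tag, kept = self.stack.pop()
            if kept:
                self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text inside a dropped element (noise, too deep, over limit) is dropped too.
        if self._parent_kept() and data.strip():
            self.out.append(escape(data.strip(), quote=False))


def simplify_html(html, max_elements=300, max_depth=8):
    parser = DomSimplifier(max_elements, max_depth)
    parser.feed(html)
    parser.close()
    while parser.stack:  # close anything the page left open
        open_tag, kept = parser.stack.pop()
        if kept:
            parser.out.append(f"</{open_tag}>")
    return "".join(parser.out)
```

The defaults mirror the "Limit 300 / Depth 8" setting mentioned above; how much context this saves depends on the page.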
- Python 3.10+
- Azure OpenAI Endpoint (GPT-5.2-Chat & GPT-5.1-Codex)
- Clone repository: `git clone https://github.com/breezy89757/SmartScraper.git`, then `cd SmartScraper`
- Install dependencies (using `uv` is recommended): `uv sync`
- Configure environment:
  - Copy `.env.example` to `.env`
  - Set `AZURE_OPENAI_API_KEY`, `ENDPOINT`, `DEPLOYMENT_NAME`.
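Before starting the server, it can help to confirm that the settings are actually present in the environment. The check below is a minimal sketch: `missing_settings` is a hypothetical helper, and the variable names are copied as written in this README, so the exact keys in `.env.example` take precedence if they differ. It does not load the `.env` file itself.

```python
import os

# Names as listed in this README; treat them as assumptions and
# defer to the keys actually present in .env.example.
REQUIRED_SETTINGS = ("AZURE_OPENAI_API_KEY", "ENDPOINT", "DEPLOYMENT_NAME")


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the required names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```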
Run `uv run python main.py`, then open a browser at http://localhost:8081.
This tool executes AI-generated code. While the SandboxExecutor restricts imports to a safe list (requests, bs4, json, urllib), never run this server on a public-facing network without additional authentication layers.
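As an illustration of what an import allowlist can look like, the sketch below statically scans a generated script with Python's `ast` module and reports any imported top-level module outside the safe list named above. It is not the `SandboxExecutor` implementation; `disallowed_imports` and `ALLOWED_IMPORTS` are hypothetical names. A static scan like this does not catch dynamic imports such as `__import__("os")`, so it is not enough on its own.

```python
import ast

ALLOWED_IMPORTS = {"requests", "bs4", "json", "urllib"}


def disallowed_imports(source: str, allowed=ALLOWED_IMPORTS) -> list[str]:
    """Return the imported top-level modules in `source` that are not allowed."""
    blocked = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports cannot be mapped to an allowed package, so flag them.
            if node.level == 0 and node.module:
                names = [node.module]
            else:
                names = ["<relative import>"]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level not in allowed:
                blocked.add(top_level)
    return sorted(blocked)
```

A caller could refuse to execute any script for which this function returns a non-empty list.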