This tool helps automate the process of scraping job listings from various job boards by intelligently detecting pagination and job listing patterns using OpenAI's GPT models.
The name "Scrythe" is a blend of scrying, scythe, and scrape.
- Python 3.8+
- OpenAI API Key
- Selenium WebDriver
- Required Python packages (install via `pip install -r requirements.txt`)
- Set your OpenAI API key as a system environment variable:

  ```bash
  export OPENAI_API_KEY='your_api_key_here'
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

Run `build_scraper.py` with the URL of a job board:

```bash
python build_scraper.py https://example.com/jobs [-v]
```

Options:

- `-v, --verbose`: Enable verbose output showing detailed progress and timing information
This script will:
- Navigate to the job board
- Detect job listing and pagination patterns using GPT-4o-mini
- Analyze HTML structure and generate XPath patterns
- Verify pagination functionality
- Write the configuration to `sites_to_scrape.csv`
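To make the output concrete, here is a rough sketch of the kind of row the build step might append to `sites_to_scrape.csv`. The column names (`site_name`, `job_link_xpath`, `pagination_url_pattern`, `page_increment`) are illustrative assumptions, not the file's actual schema:

```python
import csv

# Hypothetical column layout -- the real sites_to_scrape.csv may use
# different names and ordering.
FIELDS = ["site_name", "job_link_xpath", "pagination_url_pattern", "page_increment"]

row = {
    "site_name": "example",
    "job_link_xpath": "//a[contains(@class, 'job-title')]",
    "pagination_url_pattern": "https://example.com/jobs?page={}",
    "page_increment": "1",
}

with open("sites_to_scrape.csv", "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    if f.tell() == 0:  # brand-new file: write the header first
        writer.writeheader()
    writer.writerow(row)
```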
The build process typically costs around $0.05 per site in OpenAI API usage.
Run `run_scraper.py` to scrape all configured job boards:

```bash
python run_scraper.py
```

This script will:
- Read configurations from `sites_to_scrape.csv`
- Randomize the order of sites to scrape
- Scrape job listings from each configured job board
- Download job descriptions with caching
- Handle pagination automatically
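A minimal sketch of that loop, assuming the same hypothetical CSV columns as above (the real `run_scraper.py` also layers in caching, delays, and error handling):

```python
import csv
import random
import time

with open("sites_to_scrape.csv", newline="") as f:
    sites = list(csv.DictReader(f))

random.shuffle(sites)  # visit configured sites in a random order each run

for site in sites:
    print(f"Scraping {site['site_name']}")
    # ... drive Selenium through each results page, collect job links,
    #     and download any job descriptions not already cached ...
    time.sleep(random.uniform(2, 6))  # randomized pause between sites
```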
- Scraped job descriptions are saved in a `cache` directory
- Each cached file includes:
  - Original job listing URL as a comment in the first line
  - Full HTML content of the job description
- Cache is maintained for 28 days by default
- Files are named using a combination of site name and URL hash
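The naming and expiry behaviour can be pictured roughly like this; the helper names and the use of an MD5 hash are assumptions for illustration, not necessarily Scrythe's exact implementation:

```python
import hashlib
import os
import time

CACHE_DIR = "cache"
CACHE_MAX_AGE = 28 * 24 * 60 * 60  # 28 days, in seconds

def cache_path(site_name: str, url: str) -> str:
    """Combine the site name with a hash of the URL for a stable filename."""
    url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, f"{site_name}_{url_hash}.html")

def is_fresh(path: str) -> bool:
    """Reuse a cached file only while it is younger than the expiry window."""
    return os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_MAX_AGE

def write_cache(path: str, url: str, html: str) -> None:
    """Store the source URL as a comment on the first line, then the raw HTML."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"<!-- {url} -->\n{html}")
```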
The scraper supports:
- Numbered pagination (e.g., page 1, page 2, page 3)
- Offset pagination (e.g., starting at item 0, item 10, item 20)
- A variety of URL patterns and increments
- Both relative and absolute URLs
- Intelligent detection of pagination patterns using GPT models
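As an illustration, the two pagination styles differ only in the increment applied to the page value; the pattern strings and helper below are made up for the example:

```python
from urllib.parse import urljoin

def page_url(base_url: str, pattern: str, page_index: int, increment: int) -> str:
    """Build the URL for a given page.

    pattern   -- URL template with a {} placeholder, e.g. "?page={}" (numbered)
                 or "?start={}" (offset); may be relative or absolute
    increment -- 1 for numbered pagination, or the page size (10, 20, ...)
                 for offset pagination
    """
    # urljoin resolves relative patterns against the base URL and leaves
    # absolute patterns untouched
    return urljoin(base_url, pattern.format(page_index * increment))

print(page_url("https://example.com/jobs", "?page={}", 2, 1))    # ...?page=2
print(page_url("https://example.com/jobs", "?start={}", 2, 10))  # ...?start=20
```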
- Built-in rate limiting and randomized delays
- Smart caching with configurable expiration
- Anti-detection measures via Selenium stealth
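For reference, this is roughly how `selenium-stealth` is typically wired up together with a randomized delay; the exact options Scrythe passes may differ:

```python
import random
import time

from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Mask common automation fingerprints (navigator.webdriver, vendor strings, ...)
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://example.com/jobs")
time.sleep(random.uniform(2, 5))  # randomized delay between requests
```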
- Ensure you have the appropriate Selenium WebDriver installed
- Some job boards may have anti-scraping measures that could interrupt the process, but Selenium stealth helps mitigate this
- The scraper automatically cleans and processes HTML content to optimize token usage (see the sketch after this list)
- Built-in error handling and retry mechanisms for robustness
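One common way to trim HTML before handing it to a model is to drop the tags that carry no useful text; the snippet below is an illustrative approach, not necessarily the exact cleanup Scrythe performs:

```python
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    """Strip non-content tags and blank lines to reduce the token count."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup(["script", "style", "noscript", "svg"]):
        tag.decompose()  # remove the tag and everything inside it
    # Keep the remaining markup (structure helps the model) but drop empty lines
    return "\n".join(line.strip() for line in str(soup).splitlines() if line.strip())
```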