Production-grade Python lead generation pipeline:
- Load structured search queries from JSON.
- Execute queries via an abstracted search client (Google Custom Search API compatible).
- Collect organic result URLs.
- Fetch and scrape target websites for contact information.
- Normalize, validate, and deduplicate lead data.
- Output clean, deterministic CSV results.
- Fetches each Google result URL, then (if needed) also fetches the domain homepage and up to 1 contact-like page (max LEADGEN_MAX_PAGES_PER_DOMAIN total).
- Discovers contact-like pages by parsing internal links containing contact, support/help, about/team, plus a small fixed list (e.g. /contact, /contact-us).
- Retries fetches on 429 (rate limit; respects Retry-After when present) and 5xx errors; does not retry on 403.
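The retry rules above can be sketched as a small decision function. This is a minimal illustration, not the project's actual code: the name retry_delay, the exponential backoff, and the max_attempts cap are assumptions, and only the numeric (delta-seconds) form of Retry-After is handled.

```python
from __future__ import annotations


def retry_delay(
    status: int,
    headers: dict[str, str],
    attempt: int,
    base_delay: float = 1.0,   # assumed backoff base, not from the README
    max_attempts: int = 3,     # assumed retry cap, not from the README
) -> float | None:
    """Return seconds to wait before the next retry, or None to give up.

    `attempt` is the zero-based count of retries already made.
    `headers` is assumed to use the canonical "Retry-After" key spelling.
    """
    if attempt >= max_attempts:
        return None
    backoff = base_delay * 2 ** attempt
    if status == 429:
        # Honor Retry-After when it is a plain number of seconds.
        retry_after = headers.get("Retry-After", "").strip()
        if retry_after.isdigit():
            return float(retry_after)
        return backoff
    if 500 <= status <= 599:
        return backoff
    # Everything else, including 403, is treated as non-retryable.
    return None
```

A caller would sleep for the returned number of seconds and repeat the request, stopping as soon as the function returns None.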
- Python 3.11+
    python -m pip install -e ".[dev]"
Validate a queries JSON file:
    leadgen validate-queries inputs/queries.example.json
Run the pipeline:
    leadgen run --queries inputs/queries.example.json --output outputs/leads.csv
Set these to use the google-cse search client:
- LEADGEN_GOOGLE_API_KEY
- LEADGEN_GOOGLE_CSE_ID
Optional:
- LEADGEN_GOOGLE_BASE_URL (default: https://www.googleapis.com/customsearch/v1)
- LEADGEN_RATE_LIMIT_PER_SECOND (default: 1.0)
- LEADGEN_REQUEST_TIMEOUT_SECONDS (default: 15.0)
- LEADGEN_MAX_RESULTS_PER_QUERY (default: 20)
- LEADGEN_MAX_PAGES_PER_DOMAIN (default: 3)
- LEADGEN_SCRAPE_DELAY_SECONDS (default: 1.0)
- LEADGEN_MAX_OUTPUT_LEADS (default: 1000)
- LEADGEN_CHECKPOINT_DIR (default: unset)
- LEADGEN_DOMAIN_DENYLIST (default: common directories/social sites; comma-separated)
- LEADGEN_RESUME (default: false; set to true to resume from checkpoints/events.jsonl)
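As a rough illustration of how these variables and defaults fit together, the sketch below reads them into a frozen dataclass. The Settings class, its field names, and load_settings are hypothetical and not taken from the project; the built-in denylist is not reproduced here, so an unset LEADGEN_DOMAIN_DENYLIST is represented as None.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    google_base_url: str
    rate_limit_per_second: float
    request_timeout_seconds: float
    max_results_per_query: int
    max_pages_per_domain: int
    scrape_delay_seconds: float
    max_output_leads: int
    checkpoint_dir: str | None
    denylist_override: tuple[str, ...] | None  # None: use the built-in default
    resume: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from LEADGEN_* variables, using the documented defaults."""

    def get(name: str, default: str) -> str:
        return env.get(f"LEADGEN_{name}", default)

    raw_denylist = env.get("LEADGEN_DOMAIN_DENYLIST")
    return Settings(
        google_base_url=get(
            "GOOGLE_BASE_URL", "https://www.googleapis.com/customsearch/v1"
        ),
        rate_limit_per_second=float(get("RATE_LIMIT_PER_SECOND", "1.0")),
        request_timeout_seconds=float(get("REQUEST_TIMEOUT_SECONDS", "15.0")),
        max_results_per_query=int(get("MAX_RESULTS_PER_QUERY", "20")),
        max_pages_per_domain=int(get("MAX_PAGES_PER_DOMAIN", "3")),
        scrape_delay_seconds=float(get("SCRAPE_DELAY_SECONDS", "1.0")),
        max_output_leads=int(get("MAX_OUTPUT_LEADS", "1000")),
        checkpoint_dir=env.get("LEADGEN_CHECKPOINT_DIR") or None,
        denylist_override=(
            tuple(d.strip() for d in raw_denylist.split(",") if d.strip())
            if raw_denylist is not None
            else None
        ),
        resume=get("RESUME", "false").strip().lower() == "true",
    )
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests, for example load_settings({"LEADGEN_RESUME": "true"}).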
Logging is enabled via Python's logging module (stdout/stderr). Control verbosity with:
- LEADGEN_LOG_LEVEL (default: INFO; common values: DEBUG, INFO, WARNING, ERROR)
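One plausible way to wire this variable into the standard logging module is shown below. The function name configure_logging, the log format, and the fallback to INFO for unrecognized values are assumptions made for illustration.

```python
from __future__ import annotations

import logging
import os


def configure_logging(level_name: str | None = None) -> int:
    """Configure root logging from LEADGEN_LOG_LEVEL and return the level used."""
    name = (level_name or os.environ.get("LEADGEN_LOG_LEVEL", "INFO")).upper()
    # getLevelName maps a known level name to its number; for an unknown
    # name it returns a string, in which case this sketch falls back to INFO.
    level = logging.getLevelName(name)
    if not isinstance(level, int):
        level = logging.INFO
    # basicConfig writes to stderr by default and is a no-op if the root
    # logger already has handlers.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return level
```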
To persist intermediate pipeline data for verification/debugging, set:
- LEADGEN_CHECKPOINT_DIR (example: checkpoints)
With that example value, events are appended as newline-delimited JSON to checkpoints/events.jsonl (an append-only event log).
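An append-only JSONL log of this kind can be written with very little code. The sketch below is illustrative only: the append_event helper and the record fields (ts, type, payload) are hypothetical, since the actual event schema is not documented here.

```python
import json
import time
from pathlib import Path
from typing import Any


def append_event(checkpoint_dir: str, event_type: str, payload: dict[str, Any]) -> Path:
    """Append one JSON event as a single line to <checkpoint_dir>/events.jsonl."""
    path = Path(checkpoint_dir) / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical record layout; the real pipeline may use different keys.
    record = {"ts": time.time(), "type": event_type, "payload": payload}
    # Mode "a" never rewrites earlier lines, which keeps the log append-only.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```

Because each event is one line, the file can be inspected with ordinary line-oriented tools while the pipeline is still running.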
To resume after an interruption (e.g. Google 429 rate limit), keep the same LEADGEN_CHECKPOINT_DIR and set:
LEADGEN_RESUME=true
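Resuming from an event log generally means replaying it to find out what already finished. The sketch below shows that idea under assumptions: the query_done event type and the payload.query_id field are hypothetical, and the real resume logic may track more state than completed queries.

```python
import json
from pathlib import Path


def completed_query_ids(checkpoint_dir: str) -> set[str]:
    """Collect IDs of queries that the event log marks as finished."""
    path = Path(checkpoint_dir) / "events.jsonl"
    done: set[str] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                # An interruption can leave a truncated final line; skip it.
                continue
            # Hypothetical schema: {"type": "query_done", "payload": {"query_id": ...}}
            if isinstance(event, dict) and event.get("type") == "query_done":
                done.add(event["payload"]["query_id"])
    return done
```

On a resumed run, queries whose IDs are in the returned set could be skipped so that only unfinished work is sent to the search API again.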
- Example input: inputs/queries.example.json
- Example output: outputs/leads.example.csv
Output CSV columns:
title,url,domain,email,phone,city,state,industry,source_query_id
Notes:
- Leads with neither a valid email nor a valid phone are removed.
- Missing email/phone values are written as the literal none.
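The two notes above can be illustrated with a short writer sketch. It assumes validation has already happened upstream, so an invalid or missing email/phone arrives as None or an empty string; the write_leads function is hypothetical and not the project's implementation.

```python
import csv
import io
from typing import Iterable, Mapping, Optional, TextIO

COLUMNS = [
    "title", "url", "domain", "email", "phone",
    "city", "state", "industry", "source_query_id",
]


def write_leads(leads: Iterable[Mapping[str, Optional[str]]], out: TextIO) -> int:
    """Write leads as CSV in the documented column order; return rows written."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    written = 0
    for lead in leads:
        email = lead.get("email") or ""
        phone = lead.get("phone") or ""
        if not email and not phone:
            continue  # no usable contact channel: drop the lead
        row = {col: lead.get(col) or "" for col in COLUMNS}
        # A single missing contact field is written as the literal "none".
        row["email"] = email or "none"
        row["phone"] = phone or "none"
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    buf = io.StringIO()
    write_leads(
        [{"title": "A", "url": "https://a.example", "domain": "a.example",
          "email": "x@a.example", "phone": None, "source_query_id": "q1"}],
        buf,
    )
    print(buf.getvalue(), end="")
```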