leadgen

Production-grade, Python-based lead generation pipeline that uses programmable search queries and web scraping to extract, normalize, and deduplicate business contact data, with tests and a CI-ready architecture. The pipeline (sketched in code after this list):

  1. Load structured search queries from JSON.
  2. Execute queries via an abstracted search client (Google Custom Search API compatible).
  3. Collect organic result URLs.
  4. Fetch and scrape target websites for contact information.
  5. Normalize, validate, and deduplicate lead data.
  6. Output clean, deterministic CSV results.
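
As a rough picture of how these stages fit together, the following self-contained sketch chains them in plain Python. The helper functions, the deduplication key, and the sort order are illustrative assumptions, not the package's actual API:

import csv
import json
from typing import Iterable

FIELDS = ["title", "url", "domain", "email", "phone",
          "city", "state", "industry", "source_query_id"]


def search_urls(query: dict) -> Iterable[str]:
    """Stand-in for the abstracted search client (Google CSE compatible)."""
    return []  # hypothetical: a real client would call the Custom Search API


def scrape_contacts(url: str, query: dict) -> list[dict]:
    """Stand-in for the scraper that extracts contact data from a result URL."""
    return []  # hypothetical: a real scraper would fetch and parse the page


def run_pipeline(queries_path: str, output_path: str) -> None:
    # 1. Load structured search queries from JSON.
    with open(queries_path, encoding="utf-8") as f:
        queries = json.load(f)

    # 2-4. Execute each query, collect organic result URLs, scrape each site.
    raw = [lead for q in queries for url in search_urls(q)
           for lead in scrape_contacts(url, q)]

    # 5. Deduplicate (illustrative key: domain + email).
    seen: set[tuple] = set()
    leads = []
    for lead in raw:
        key = (lead.get("domain"), lead.get("email"))
        if key not in seen:
            seen.add(key)
            leads.append(lead)

    # 6. Deterministic CSV output: fixed columns, stable row order.
    leads.sort(key=lambda r: (r.get("domain") or "", r.get("url") or ""))
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(leads)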

Scraping behavior

  • Fetches each Google result URL, then (if needed) also fetches the domain homepage and up to one contact-like page, for at most LEADGEN_MAX_PAGES_PER_DOMAIN pages per domain in total.
  • Discovers contact-like pages by parsing internal links containing contact, support/help, or about/team, plus a small fixed list of paths (e.g. /contact, /contact-us).
  • Retries fetches on 429 (rate limit, honoring Retry-After when present) and on 5xx errors; 403 responses are not retried. See the sketch below.
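
The retry rules can be pictured roughly like this, assuming a requests-style HTTP client; the attempt count and backoff delays are illustrative, not the package's actual tuning:

import time

import requests


def fetch_with_retries(url: str, timeout: float = 15.0,
                       max_attempts: int = 3) -> requests.Response | None:
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url, timeout=timeout)
        if response.status_code == 429:
            # Rate limited: honor Retry-After when the server sends it,
            # otherwise fall back to a simple linear backoff.
            retry_after = response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after and retry_after.isdigit() else 2.0 * attempt
            time.sleep(delay)
            continue
        if 500 <= response.status_code < 600:
            # Transient server error: back off and try again.
            time.sleep(2.0 * attempt)
            continue
        # Anything else (including 403) is returned as-is, with no retry.
        return response
    return None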

Requirements

  • Python 3.11+

Install

python -m pip install -e ".[dev]"

CLI

Validate a queries JSON file:

leadgen validate-queries inputs/queries.example.json

Run the pipeline:

leadgen run --queries inputs/queries.example.json --output outputs/leads.csv
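
For orientation, the snippet below approximates the kind of structural check validate-queries performs. The field names (id, query) are assumptions for illustration; the authoritative schema is whatever inputs/queries.example.json uses:

import json
import sys


def check_queries_file(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        queries = json.load(f)
    if not isinstance(queries, list) or not queries:
        sys.exit("queries file must be a non-empty JSON array")
    for i, query in enumerate(queries):
        # "id" and "query" are assumed field names; the CSV column
        # source_query_id suggests each query carries an identifier.
        if not isinstance(query, dict) or not query.get("id") or not query.get("query"):
            sys.exit(f"entry {i} is missing an 'id' or 'query' field")
    print(f"OK: {len(queries)} queries")


if __name__ == "__main__":
    check_queries_file(sys.argv[1])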

Environment variables (Google CSE)

Set these to use the google-cse search client:

  • LEADGEN_GOOGLE_API_KEY
  • LEADGEN_GOOGLE_CSE_ID

Optional:

  • LEADGEN_GOOGLE_BASE_URL (default: https://www.googleapis.com/customsearch/v1)
  • LEADGEN_RATE_LIMIT_PER_SECOND (default: 1.0)
  • LEADGEN_REQUEST_TIMEOUT_SECONDS (default: 15.0)
  • LEADGEN_MAX_RESULTS_PER_QUERY (default: 20)
  • LEADGEN_MAX_PAGES_PER_DOMAIN (default: 3)
  • LEADGEN_SCRAPE_DELAY_SECONDS (default: 1.0)
  • LEADGEN_MAX_OUTPUT_LEADS (default: 1000)
  • LEADGEN_CHECKPOINT_DIR (default: unset)
  • LEADGEN_DOMAIN_DENYLIST (default: common directories/social sites; comma-separated)
  • LEADGEN_RESUME (default: false; set to true to resume from checkpoints/events.jsonl)
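
As a sketch, all of the settings above can be read from the environment with their documented defaults roughly like this; the package's own configuration loader may differ:

import os
from dataclasses import dataclass


@dataclass
class Settings:
    google_api_key: str
    google_cse_id: str
    google_base_url: str
    rate_limit_per_second: float
    request_timeout_seconds: float
    max_results_per_query: int
    max_pages_per_domain: int
    scrape_delay_seconds: float
    max_output_leads: int
    checkpoint_dir: str | None
    domain_denylist: list[str] | None  # None = use the package's built-in default list
    resume: bool

    @classmethod
    def from_env(cls) -> "Settings":
        denylist = os.getenv("LEADGEN_DOMAIN_DENYLIST")
        return cls(
            google_api_key=os.environ["LEADGEN_GOOGLE_API_KEY"],
            google_cse_id=os.environ["LEADGEN_GOOGLE_CSE_ID"],
            google_base_url=os.getenv("LEADGEN_GOOGLE_BASE_URL",
                                      "https://www.googleapis.com/customsearch/v1"),
            rate_limit_per_second=float(os.getenv("LEADGEN_RATE_LIMIT_PER_SECOND", "1.0")),
            request_timeout_seconds=float(os.getenv("LEADGEN_REQUEST_TIMEOUT_SECONDS", "15.0")),
            max_results_per_query=int(os.getenv("LEADGEN_MAX_RESULTS_PER_QUERY", "20")),
            max_pages_per_domain=int(os.getenv("LEADGEN_MAX_PAGES_PER_DOMAIN", "3")),
            scrape_delay_seconds=float(os.getenv("LEADGEN_SCRAPE_DELAY_SECONDS", "1.0")),
            max_output_leads=int(os.getenv("LEADGEN_MAX_OUTPUT_LEADS", "1000")),
            checkpoint_dir=os.getenv("LEADGEN_CHECKPOINT_DIR"),
            domain_denylist=[d.strip() for d in denylist.split(",")] if denylist else None,
            resume=os.getenv("LEADGEN_RESUME", "false").lower() == "true",
        )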

Logging

Logging is enabled via Python's logging module (stdout/stderr). Control verbosity with:

  • LEADGEN_LOG_LEVEL (default: INFO; common values: DEBUG, INFO, WARNING, ERROR)
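
A minimal illustration of how LEADGEN_LOG_LEVEL maps onto Python's standard logging configuration (the package's exact format and handlers may differ):

import logging
import os

# Fall back to INFO if the variable is unset or not a recognized level name.
level_name = os.getenv("LEADGEN_LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)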

Checkpoints

To persist intermediate pipeline data for verification/debugging, set:

  • LEADGEN_CHECKPOINT_DIR (example: checkpoints)

Pipeline events are then appended as newline-delimited JSON to checkpoints/events.jsonl (an append-only event log).

To resume after an interruption (e.g. a Google 429 rate limit), keep the same LEADGEN_CHECKPOINT_DIR and set:

  • LEADGEN_RESUME=true
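
A sketch of how an append-only JSONL event log of this kind works; the actual event fields are internal to the package and the ones shown here are assumptions:

import json
import os
from pathlib import Path


def append_event(kind: str, payload: dict) -> None:
    # No-op unless checkpointing is enabled via LEADGEN_CHECKPOINT_DIR.
    checkpoint_dir = os.getenv("LEADGEN_CHECKPOINT_DIR")
    if not checkpoint_dir:
        return
    path = Path(checkpoint_dir) / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"kind": kind, **payload}) + "\n")


def load_events() -> list[dict]:
    # With LEADGEN_RESUME=true, previously logged events can be replayed
    # so that work recorded as completed is skipped on the next run.
    checkpoint_dir = os.getenv("LEADGEN_CHECKPOINT_DIR")
    if not checkpoint_dir:
        return []
    path = Path(checkpoint_dir) / "events.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]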

Inputs and outputs

  • Example input: inputs/queries.example.json
  • Example output: outputs/leads.example.csv

Output CSV columns:

  • title, url, domain, email, phone, city, state, industry, source_query_id

Notes:

  • Leads with neither a valid email nor a valid phone are removed.
  • Missing email or phone values are written as the literal string none. Both rules are sketched below.
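
Those two rules, expressed as a small illustrative filter step (real email/phone validation is more involved and omitted here):

def finalize(lead: dict) -> dict | None:
    email = lead.get("email") or ""
    phone = lead.get("phone") or ""
    if not email and not phone:
        return None  # leads with neither email nor phone are dropped
    lead["email"] = email or "none"  # missing values become the literal "none"
    lead["phone"] = phone or "none"
    return lead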
