leadgen

Production-grade, Python-based lead generation pipeline that uses programmable search queries and web scraping to extract, normalize, and deduplicate business contact data, with tests and a CI-ready architecture. The pipeline (sketched in code after this list):

  1. Load structured search queries from JSON.
  2. Execute queries via an abstracted search client (Google Custom Search API compatible).
  3. Collect organic result URLs.
  4. Fetch and scrape target websites for contact information.
  5. Normalize, validate, and deduplicate lead data.
  6. Output clean, deterministic CSV results.
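
As a rough picture of how these stages fit together, the following self-contained sketch chains them in plain Python. The helper functions, the deduplication key, and the sort order are illustrative assumptions, not the package's actual API:

import csv
import json
from typing import Iterable

FIELDS = ["title", "url", "domain", "email", "phone",
          "city", "state", "industry", "source_query_id"]


def search_urls(query: dict) -> Iterable[str]:
    """Stand-in for the abstracted search client (Google CSE compatible)."""
    return []  # hypothetical: a real client would call the Custom Search API


def scrape_contacts(url: str, query: dict) -> list[dict]:
    """Stand-in for the scraper that extracts contact data from a result URL."""
    return []  # hypothetical: a real scraper would fetch and parse the page


def run_pipeline(queries_path: str, output_path: str) -> None:
    # 1. Load structured search queries from JSON.
    with open(queries_path, encoding="utf-8") as f:
        queries = json.load(f)

    # 2-4. Execute each query, collect organic result URLs, scrape each site.
    raw = [lead for q in queries for url in search_urls(q)
           for lead in scrape_contacts(url, q)]

    # 5. Deduplicate (illustrative key: domain + email).
    seen: set[tuple] = set()
    leads = []
    for lead in raw:
        key = (lead.get("domain"), lead.get("email"))
        if key not in seen:
            seen.add(key)
            leads.append(lead)

    # 6. Deterministic CSV output: fixed columns, stable row order.
    leads.sort(key=lambda r: (r.get("domain") or "", r.get("url") or ""))
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(leads)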

Scraping behavior

  • Fetches each Google result URL, then (if needed) also fetches the domain homepage and up to one contact-like page, for at most LEADGEN_MAX_PAGES_PER_DOMAIN pages per domain in total.
  • Discovers contact-like pages by parsing internal links containing contact, support/help, or about/team, plus a small fixed list of paths (e.g. /contact, /contact-us).
  • Retries fetches on 429 (rate limit, honoring Retry-After when present) and on 5xx errors; 403 responses are not retried. See the sketch below.
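
The retry rules can be pictured roughly like this, assuming a requests-style HTTP client; the attempt count and backoff delays are illustrative, not the package's actual tuning:

import time

import requests


def fetch_with_retries(url: str, timeout: float = 15.0,
                       max_attempts: int = 3) -> requests.Response | None:
    for attempt in range(1, max_attempts + 1):
        response = requests.get(url, timeout=timeout)
        if response.status_code == 429:
            # Rate limited: honor Retry-After when the server sends it,
            # otherwise fall back to a simple linear backoff.
            retry_after = response.headers.get("Retry-After")
            delay = float(retry_after) if retry_after and retry_after.isdigit() else 2.0 * attempt
            time.sleep(delay)
            continue
        if 500 <= response.status_code < 600:
            # Transient server error: back off and try again.
            time.sleep(2.0 * attempt)
            continue
        # Anything else (including 403) is returned as-is, with no retry.
        return response
    return None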

Requirements

  • Python 3.11+

Install

python -m pip install -e ".[dev]"

CLI

Validate a queries JSON file:

leadgen validate-queries inputs/queries.example.json

Run the pipeline:

leadgen run --queries inputs/queries.example.json --output outputs/leads.csv
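
For orientation, the snippet below approximates the kind of structural check validate-queries performs. The field names (id, query) are assumptions for illustration; the authoritative schema is whatever inputs/queries.example.json uses:

import json
import sys


def check_queries_file(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        queries = json.load(f)
    if not isinstance(queries, list) or not queries:
        sys.exit("queries file must be a non-empty JSON array")
    for i, query in enumerate(queries):
        # "id" and "query" are assumed field names; the CSV column
        # source_query_id suggests each query carries an identifier.
        if not isinstance(query, dict) or not query.get("id") or not query.get("query"):
            sys.exit(f"entry {i} is missing an 'id' or 'query' field")
    print(f"OK: {len(queries)} queries")


if __name__ == "__main__":
    check_queries_file(sys.argv[1])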

Environment variables (Google CSE)

Set these to use the google-cse search client:

  • LEADGEN_GOOGLE_API_KEY
  • LEADGEN_GOOGLE_CSE_ID

Optional:

  • LEADGEN_GOOGLE_BASE_URL (default: https://www.googleapis.com/customsearch/v1)
  • LEADGEN_RATE_LIMIT_PER_SECOND (default: 1.0)
  • LEADGEN_REQUEST_TIMEOUT_SECONDS (default: 15.0)
  • LEADGEN_MAX_RESULTS_PER_QUERY (default: 20)
  • LEADGEN_MAX_PAGES_PER_DOMAIN (default: 3)
  • LEADGEN_SCRAPE_DELAY_SECONDS (default: 1.0)
  • LEADGEN_MAX_OUTPUT_LEADS (default: 1000)
  • LEADGEN_CHECKPOINT_DIR (default: unset)
  • LEADGEN_DOMAIN_DENYLIST (default: common directories/social sites; comma-separated)
  • LEADGEN_RESUME (default: false; set to true to resume from checkpoints/events.jsonl)
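
As a sketch, all of the settings above can be read from the environment with their documented defaults roughly like this; the package's own configuration loader may differ:

import os
from dataclasses import dataclass


@dataclass
class Settings:
    google_api_key: str
    google_cse_id: str
    google_base_url: str
    rate_limit_per_second: float
    request_timeout_seconds: float
    max_results_per_query: int
    max_pages_per_domain: int
    scrape_delay_seconds: float
    max_output_leads: int
    checkpoint_dir: str | None
    domain_denylist: list[str] | None  # None = use the package's built-in default list
    resume: bool

    @classmethod
    def from_env(cls) -> "Settings":
        denylist = os.getenv("LEADGEN_DOMAIN_DENYLIST")
        return cls(
            google_api_key=os.environ["LEADGEN_GOOGLE_API_KEY"],
            google_cse_id=os.environ["LEADGEN_GOOGLE_CSE_ID"],
            google_base_url=os.getenv("LEADGEN_GOOGLE_BASE_URL",
                                      "https://www.googleapis.com/customsearch/v1"),
            rate_limit_per_second=float(os.getenv("LEADGEN_RATE_LIMIT_PER_SECOND", "1.0")),
            request_timeout_seconds=float(os.getenv("LEADGEN_REQUEST_TIMEOUT_SECONDS", "15.0")),
            max_results_per_query=int(os.getenv("LEADGEN_MAX_RESULTS_PER_QUERY", "20")),
            max_pages_per_domain=int(os.getenv("LEADGEN_MAX_PAGES_PER_DOMAIN", "3")),
            scrape_delay_seconds=float(os.getenv("LEADGEN_SCRAPE_DELAY_SECONDS", "1.0")),
            max_output_leads=int(os.getenv("LEADGEN_MAX_OUTPUT_LEADS", "1000")),
            checkpoint_dir=os.getenv("LEADGEN_CHECKPOINT_DIR"),
            domain_denylist=[d.strip() for d in denylist.split(",")] if denylist else None,
            resume=os.getenv("LEADGEN_RESUME", "false").lower() == "true",
        )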

Logging

Logging is enabled via Python's logging module (stdout/stderr). Control verbosity with:

  • LEADGEN_LOG_LEVEL (default: INFO; common values: DEBUG, INFO, WARNING, ERROR)
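
A minimal illustration of how LEADGEN_LOG_LEVEL maps onto Python's standard logging configuration (the package's exact format and handlers may differ):

import logging
import os

# Fall back to INFO if the variable is unset or not a recognized level name.
level_name = os.getenv("LEADGEN_LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)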

Checkpoints

To persist intermediate pipeline data for verification/debugging, set:

  • LEADGEN_CHECKPOINT_DIR (example: checkpoints)

Pipeline events are then appended as newline-delimited JSON to checkpoints/events.jsonl (an append-only event log).

To resume after an interruption (e.g. a Google 429 rate limit), keep the same LEADGEN_CHECKPOINT_DIR and set:

  • LEADGEN_RESUME=true
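
A sketch of how an append-only JSONL event log of this kind works; the actual event fields are internal to the package and the ones shown here are assumptions:

import json
import os
from pathlib import Path


def append_event(kind: str, payload: dict) -> None:
    # No-op unless checkpointing is enabled via LEADGEN_CHECKPOINT_DIR.
    checkpoint_dir = os.getenv("LEADGEN_CHECKPOINT_DIR")
    if not checkpoint_dir:
        return
    path = Path(checkpoint_dir) / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"kind": kind, **payload}) + "\n")


def load_events() -> list[dict]:
    # With LEADGEN_RESUME=true, previously logged events can be replayed
    # so that work recorded as completed is skipped on the next run.
    checkpoint_dir = os.getenv("LEADGEN_CHECKPOINT_DIR")
    if not checkpoint_dir:
        return []
    path = Path(checkpoint_dir) / "events.jsonl"
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]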

Inputs and outputs

  • Example input: inputs/queries.example.json
  • Example output: outputs/leads.example.csv

Output CSV columns:

  • title, url, domain, email, phone, city, state, industry, source_query_id

Notes:

  • Leads with neither a valid email nor a valid phone are removed.
  • Missing email or phone values are written as the literal string none. Both rules are sketched below.
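
Those two rules, expressed as a small illustrative filter step (real email/phone validation is more involved and omitted here):

def finalize(lead: dict) -> dict | None:
    email = lead.get("email") or ""
    phone = lead.get("phone") or ""
    if not email and not phone:
        return None  # leads with neither email nor phone are dropped
    lead["email"] = email or "none"  # missing values become the literal "none"
    lead["phone"] = phone or "none"
    return lead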
