A Python toolkit for converting PDF files to clean, readable EPUB format using AI-powered post-OCR correction and text refinement.
This project provides a complete pipeline for converting PDF files to EPUB format:
- PDF OCR - Extract text from PDF files using OCR when necessary
- OCR Cleanup - Remove common OCR artifacts using regex and heuristics
- AI Refinement - Use OpenAI GPT-4.1 to correct spelling, punctuation, and OCR errors
- EPUB Generation - Convert the cleaned text to properly formatted EPUB files
- Python orchestrator – Single entry point (`python -m pipeline`) manages OCR, cleaning, AI refinement, and EPUB generation with shared configuration.
- Page-aware workflow – Every stage preserves page boundaries so EPUBs mirror the source pagination and plain-text exports stay easy to audit.
- Flexible extraction – Tries direct text extraction first, with optional Surya OCR for page images when the PDF lacks a text layer.
- Robust preprocessing – Removes headers/footers, repairs soft-hyphen breaks, normalises whitespace, and protects list/heading structure.
- OpenAI refinement – Deterministic prompts with concurrency, cost tracking, and skip logic for empty pages; works with any GPT-4.1-compatible key.
- Cost and progress telemetry – Live token counts, cost estimates, and guardrails (`--max-cost`, `--confirm-cost`) for long documents.
- Modular CLI utilities – Each stage remains available as a standalone script for debugging or bespoke workflows.
- Python 3.7+
- OpenAI API key
Install required packages:
```bash
# Install from requirements.txt (recommended)
pip install -r requirements.txt

# Or install individually:
pip install openai python-dotenv ebooklib tqdm tiktoken PyMuPDF pillow surya-ocr
```

System Notes:
- Surya ships large neural checkpoints; the first run will download ~2 GB of weights to your cache.
- GPU acceleration (CUDA or Apple M‑series MPS) dramatically speeds up Surya. CPU-only runs work but can be slow.
- Ensure you have enough disk space in your cache directory (defaults to `~/Library/Caches/datalab` on macOS).
- Clone this repository
- Create a `.env` file in the project root (the orchestrator loads these values automatically):

```
OPENAI_API_KEY=sk-your-key
AI_MODEL=gpt-4.1
OUTPUT_DIR=output
TEMP_DIR=temp
MAX_COST_LIMIT=5
```

Adjust any paths or limits to suit your environment. CLI flags always take precedence over `.env` values.
Run the full workflow with the new CLI:
```bash
python -m pipeline run "document.pdf" --ocr-engine surya --max-cost 5 --confirm-cost
```

Useful flags:

- `--ocr-engine {text|surya}` – choose between direct text extraction or Surya OCR.
- `--page-range 5-120` – limit processing to a 1-based inclusive slice.
- `--skip-ai` – stop after heuristic cleanup (no OpenAI cost).
- `--max-pages N` – cap the number of pages processed after other filters (handy for smoke tests).
- `--force-ocr` – bypass direct extraction and run the chosen OCR engine.
- `--max-cost` / `--confirm-cost` – cost guardrails for OpenAI refinement.
For manual control or debugging, each stage is available as a subcommand:
```bash
# Step 1: Extract text from PDF (OCR if needed)
python -m pipeline ocr document.pdf --output temp/document_ocr.txt --max-pages 5

# Step 2: Clean OCR text with the heuristic pipeline
python -m pipeline clean --input temp/document_ocr.txt --output temp/document_clean.txt

# Step 3: AI refinement (costs money!)
python -m pipeline refine --input temp/document_clean.txt --output temp/document_refined.txt --max-cost 5 --confirm-cost

# Step 4: Convert to EPUB while preserving page boundaries
python -m pipeline epub --input temp/document_refined.txt --output output/document.epub
```

Extracts text from PDFs, automatically falling back to OCR when needed.
```bash
python -m pipeline ocr document.pdf --ocr-engine surya --output document_ocr.txt --page-range 10-25
```

- `--ocr-engine {text|surya}` selects the extraction method.
- `--force-ocr` forces Surya even when a text layer exists.
- `--page-range` and `--max-pages` combine to keep experiments focused.
Applies regex/heuristic cleanup to page-marked text.
```bash
python -m pipeline clean --input document_ocr.txt --output document_clean.txt
```

Runs the OpenAI-based refinement stage.
```bash
python -m pipeline refine --input document_clean.txt --output document_refined.txt --max-cost 5 --confirm-cost
```

- `--ai-model` overrides the model declared in `.env`.
- `--max-cost` / `--confirm-cost` keep spending in check.
- `--pdf` optionally links back to the original PDF for metadata.
Generates an EPUB with page-by-page spine entries.
```bash
python -m pipeline epub --input document_refined.txt --output document.epub
```

- `--pdf` (optional) improves metadata when the original PDF is still available.
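For orientation, a page-per-spine-entry EPUB built with the `ebooklib` dependency looks roughly like the sketch below. This is a simplified illustration, not the project's converter; the `pages` argument (a list of cleaned page strings) and the function name are assumptions.

```python
from ebooklib import epub

def build_epub(pages, title, path):
    # Sketch only: mirrors the "one spine entry per source page" idea described above.
    book = epub.EpubBook()
    book.set_title(title)
    book.set_language("en")

    items = []
    for i, text in enumerate(pages, start=1):
        page = epub.EpubHtml(title=f"Page {i}", file_name=f"page_{i}.xhtml", lang="en")
        page.content = f"<h2>Page {i}</h2><p>{text}</p>"
        book.add_item(page)
        items.append(page)

    book.toc = items
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())
    book.spine = ["nav"] + items  # one spine entry per source page
    epub.write_epub(path, book)
```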
No automated regression suite ships with the repository today. When changing the pipeline, run a targeted Surya-backed conversion with a small `--page-range` to smoke-test the extraction/cleanup flow before processing a full document.
The orchestrator and the per-stage CLIs read their defaults from `.env`:
| Key | Description | Default |
|---|---|---|
| `OPENAI_API_KEY` | API key used by the `openai` library. | required |
| `AI_MODEL` | Chat-completions model for refinement. | `gpt-4.1` |
| `OUTPUT_DIR` | Directory for final EPUB/TXT artefacts. | `output/` |
| `TEMP_DIR` | Scratch space for OCR/intermediate text. | `temp/` |
| `MAX_TOKENS_PER_CHUNK` | Upper bound for AI chunk size. | 3000 |
| `MAX_COST_LIMIT` | USD ceiling before prompting for confirmation. | 10.0 |
Override any of these at runtime via CLI flags (e.g., `python -m pipeline run doc.pdf --max-cost 3`).
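The real resolver lives in `pipeline/config.py`; a minimal sketch of the same idea with `python-dotenv`, using the keys and defaults from the table above, looks like this:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # pulls values from .env in the project root

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]                      # required, no default
AI_MODEL = os.getenv("AI_MODEL", "gpt-4.1")
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "output")
TEMP_DIR = os.getenv("TEMP_DIR", "temp")
MAX_TOKENS_PER_CHUNK = int(os.getenv("MAX_TOKENS_PER_CHUNK", "3000"))
MAX_COST_LIMIT = float(os.getenv("MAX_COST_LIMIT", "10.0"))
```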
The AI refinement tool splits text into chunks for processing:
- Default: 2500 tokens per chunk (optimal for context and cost)
- Chunks are created at paragraph boundaries when possible
- Large paragraphs are split at sentence boundaries
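A minimal sketch of this paragraph-first chunking, counting tokens with `tiktoken`; the project's actual splitter lives in the refinement module and may differ in details such as the tokenizer and sentence handling.

```python
from typing import List

import tiktoken

def chunk_paragraphs(text: str, max_tokens: int = 2500) -> List[str]:
    """Greedy, paragraph-first chunking. Paragraphs that alone exceed the limit
    would additionally be split at sentence boundaries (omitted here for brevity)."""
    enc = tiktoken.get_encoding("cl100k_base")  # rough approximation of the model tokenizer
    chunks, current, used = [], [], 0
    for para in text.split("\n\n"):
        n = len(enc.encode(para))
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```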
Supported OpenAI models:
- `gpt-4.1` (recommended) – Best balance of quality and cost
- `gpt-4` – Higher quality, higher cost
- `gpt-4o` – Alternative model option
The AI refinement tool processes chunks concurrently:
- Default: 5 concurrent requests
- Includes rate limiting and error handling
- Progress tracking with cost estimates
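As a rough illustration of the concurrency model (five workers by default), the sketch below fans chunks out over a thread pool using the `openai` v1 client. The prompt, parameters, and `refine_chunk` name are illustrative, not the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def refine_chunk(chunk: str) -> str:
    # Illustrative request; the project's real prompt and parameters may differ.
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Fix OCR and spelling errors; do not rewrite content."},
            {"role": "user", "content": chunk},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

def refine_all(chunks):
    # max_workers=5 mirrors the default of five concurrent requests described above.
    with ThreadPoolExecutor(max_workers=5) as pool:
        return list(pool.map(refine_chunk, chunks))
```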
```
pdf-to-epub-ai/
├── pipeline/ # Shared infrastructure used by every entry point
│ ├── __init__.py
│ ├── __main__.py # Enables `python -m pipeline`
│ ├── config.py # Env/CLI configuration resolver
│ ├── logging.py # Central logging config
│ ├── main.py # Orchestrator CLI
│ ├── ocr.py # Direct + OCR extraction helpers
│ ├── pages.py # Page dataclasses
│ ├── preprocess.py # Heuristic text cleanup
│ ├── refine.py # OpenAI integration and cost tracking
│ └── storage.py # Page-oriented I/O helpers
├── pdf_ocr.py # Standalone OCR CLI wrapper
├── process_ocr.py # Standalone cleanup CLI wrapper
├── openai_cleaner.py # Standalone AI refinement CLI wrapper
├── convert_to_epub.py # Standalone EPUB conversion CLI wrapper
├── test_*.py / *.bat # Utility scripts for smoke tests
├── requirements.txt # Python dependencies
├── README.md # This documentation
└── LICENSE
```
- API failures: Automatic retry with exponential backoff
- Rate limiting: Built-in delays and concurrency limits
- File errors: Comprehensive error messages and graceful degradation
- Token limits: Automatic text chunking with size validation
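The retry behaviour follows the usual exponential-backoff pattern; a generic sketch (not the project's exact implementation) looks like this:

```python
import random
import time

def with_backoff(call, max_retries: int = 5):
    """Retry a flaky API call with exponential backoff plus jitter (illustrative only)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, 8s, ... plus jitter to avoid synchronized retries
            time.sleep(2 ** attempt + random.random())
```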
The AI refinement tool provides detailed cost tracking:
- Real-time cost accumulation
- Estimated total cost based on progress
- Token usage statistics (input/output)
- Cost per chunk analysis
Rule of thumb: with `gpt-4.1`, expect roughly $0.005–$0.006 per source page (about 600 input / 540 output tokens). Use `--max-cost` and `--confirm-cost` to prevent surprises on very long documents.
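The arithmetic behind that rule of thumb, assuming list pricing of roughly $2 per million input tokens and $8 per million output tokens for `gpt-4.1` (verify current pricing before relying on these numbers):

```python
# Assumed per-token rates; check current OpenAI pricing before relying on them.
INPUT_USD_PER_TOKEN = 2.00 / 1_000_000
OUTPUT_USD_PER_TOKEN = 8.00 / 1_000_000

def estimate_page_cost(input_tokens: int = 600, output_tokens: int = 540) -> float:
    return input_tokens * INPUT_USD_PER_TOKEN + output_tokens * OUTPUT_USD_PER_TOKEN

print(f"${estimate_page_cost():.4f} per page")  # ≈ $0.0055 with the defaults above
```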
- OCR Quality: Start with the highest quality OCR possible
- Preprocessing: Use `process_ocr.py` before AI refinement to reduce costs
- Testing: Use `test_subset.py` to validate results before processing large files
- Model Selection: GPT-4.1 provides excellent results at lower cost than GPT-4
- Chunking: Default settings work well, but adjust if you notice context issues
"OPENAI_API_KEY not found"
- Ensure the `.env` file exists with a valid API key
- Check the file is in the project root directory
"API rate limit exceeded"
- Reduce concurrent processing (modify `batch_size` in code)
- Add delays between requests
"Chunks too large"
- Reduce the `max_tokens` parameter in chunking
- Check for very long paragraphs without proper breaks
Poor chapter detection
- Ensure chapter headings follow standard format (Chapter 1, Part I, etc.)
- Manual heading cleanup may be needed before conversion
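If you want to check how your source's headings will fare, a small pre-pass along these lines can flag candidate headings before conversion. The pattern below is a hypothetical example, not the detector the project actually uses.

```python
import re

# Matches headings such as "Chapter 1", "CHAPTER 12", or "Part IV" at the start of a line.
HEADING_RE = re.compile(r"^(chapter|part)\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)

def looks_like_heading(line: str) -> bool:
    return bool(HEADING_RE.match(line.strip()))
```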
This is a specialized tool for PDF to EPUB conversion. Feel free to fork and adapt for your specific needs.
Contributions, bug reports, and feature requests are welcome!
MIT License - see LICENSE file for details.