PDF Extraction Playground

Current website: https://pdf-playground-8wlek8vg8-mevin-joses-projects.vercel.app/

A full-stack application designed to upload PDF files, process them using various extraction models, and display the results for comparison. It's a monorepo consisting of a Next.js frontend and a FastAPI backend.

The application allows users to upload PDFs and compare the extraction results from three models: Surya, Docling, and MinerU. The system attempts to use these specialized models for optimal extraction, with automatic fallback to PyMuPDF to ensure reliable functionality. The UI displays the extracted markdown content alongside the original PDF pages annotated with bounding boxes for detected elements.

Features

Monorepo Architecture: Clean separation of frontend and backend concerns.
PDF Upload: Drag-and-drop or file-picker interface for uploading PDF documents.
Multi-Model Extraction: Attempts to process PDFs with multiple extraction backends, falling back to PyMuPDF when needed.
Page Range Selection: Extract specific page ranges from PDF documents.
Dual-Pane Viewer: View a model's extracted markdown side-by-side with zoomable, annotated page images.
Comparison View: Compare results from all selected models in a grid layout, including a detailed metrics table.
Zoomable Images: Interactive PDF page viewing with zoom, pan, and pinch controls.
Dark/Light Mode: Themed UI for user preference.
Robust Fallback: PyMuPDF fallback ensures functionality even when specialized models are unavailable.
Deployable: Ready for deployment on Modal and Vercel.

Architecture

The project is a monorepo with two main components: frontend and backend.

Backend (FastAPI)

Framework: Built with FastAPI, providing a robust and fast API service.
Entrypoint: backend/app/main.py creates the FastAPI app, configures CORS, and mounts the API router.
Deployment: backend/modal_app.py contains the configuration for one-command deployment to Modal.
Extraction Pipeline: The POST /api/extract endpoint in backend/app/routers/extract.py handles file uploads and utilizes an adapter pattern.
Model Adapters: Three extraction models are integrated via a common BaseAdapter interface:
- SuryaAdapter - Attempts specialized layout detection with PyMuPDF fallback
- DoclingAdapter - Attempts enhanced text formatting with PyMuPDF fallback
- MinerUAdapter - Attempts optimized fast processing with PyMuPDF fallback
PDF Processing: PyMuPDF (backend/app/utils/pdf.py) is the core engine for PDF parsing and rendering with intelligent fallback modes.
Rate Limiting: Built-in rate limiting (12 requests per minute per IP).
Health Monitoring: /health endpoint reports PDF engine status and availability.
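
The adapter-with-fallback arrangement described in this section can be sketched roughly as follows. This is a minimal illustration, not the project's code: only the name BaseAdapter comes from this README, while the `extract` method, the `ExtractionResult` fields, and the `FallbackAdapter`, `UnavailableModelAdapter`, and `extract_with_fallback` names are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    # Hypothetical result shape; the real adapters may return more fields.
    text_markdown: str
    meta: dict = field(default_factory=dict)


class BaseAdapter(ABC):
    """Common interface; the method name `extract` is an assumption."""

    name = "base"

    @abstractmethod
    def extract(self, pdf_bytes: bytes) -> ExtractionResult:
        ...


class FallbackAdapter(BaseAdapter):
    """Stand-in for the PyMuPDF path; returns a placeholder instead of parsing."""

    name = "pymupdf"

    def extract(self, pdf_bytes: bytes) -> ExtractionResult:
        return ExtractionResult(
            text_markdown=f"(fallback) {len(pdf_bytes)} bytes received",
            meta={"engine": self.name},
        )


class UnavailableModelAdapter(BaseAdapter):
    """Simulates a specialized model whose dependency is not installed."""

    name = "surya"

    def extract(self, pdf_bytes: bytes) -> ExtractionResult:
        raise ImportError("specialized model is not installed")


def extract_with_fallback(primary, fallback, pdf_bytes):
    """Try the specialized adapter first; use the fallback on any failure."""
    try:
        return primary.extract(pdf_bytes)
    except Exception as exc:  # broad on purpose: any model failure triggers fallback
        result = fallback.extract(pdf_bytes)
        result.meta["fallback_reason"] = str(exc)
        return result


result = extract_with_fallback(UnavailableModelAdapter(), FallbackAdapter(), b"%PDF-1.7")
print(result.meta["engine"])  # pymupdf
```

Recording the failure reason in `meta` is one way to let the UI show that a result came from the fallback path rather than from the requested model.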

Frontend (Next.js)

Framework: Built with Next.js 14 (App Router), React, and styled with Tailwind CSS.
Main Interface: frontend/app/page.tsx implements the complete user workflow.
State Management: React state manages file uploads, processing status, and results.
Interactive Components:
- Sidebar - Navigation and file status
- ZoomableImage - Interactive PDF page viewer with zoom controls
- ThemeToggle - Dark/light mode switching
Responsive Design: Mobile-friendly with MobileSidebar for smaller screens.
Results Display:
- Dual-pane view showing annotated images alongside extracted markdown
- Comparison grid with detailed metrics and side-by-side model outputs
- Performance analytics including processing time, element counts, and confidence scores

Getting Started

Prerequisites

Node.js (v18.17 or later)
npm or yarn
Python (v3.10 or later) and pip

Backend Setup

Navigate to the backend directory:
```
cd backend
```

Create and activate a Python virtual environment:

```
# For macOS/Linux
python3 -m venv .venv
source .venv/bin/activate

# For Windows (PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the development server:
```
uvicorn app.main:create_app --factory --host 0.0.0.0 --port 8000 --reload
```
The backend API will be available at http://localhost:8000.

Frontend Setup

Navigate to the frontend directory:
```
cd frontend
```
Install dependencies:
```
npm install
```
Configure environment variables:
```
cp .env.local.example .env.local
```
Set NEXT_PUBLIC_BACKEND_URL in frontend/.env.local to http://localhost:8000.
Run the development server:
```
npm run dev
```
The frontend will be available at http://localhost:3000.

Deployment

Backend on Modal

Install Modal CLI:
```
pip install modal
```
Set up Modal token:
```
modal token new
```
Deploy:
```
modal deploy backend/modal_app.py
```

Frontend on Vercel

The project includes vercel.json for automatic configuration:

Push to GitHub/GitLab/Bitbucket
Import repository in Vercel
Set root directory to frontend
Deploy automatically with pre-configured environment variables

API Reference

Extract Endpoint

POST /api/extract

Upload and process a PDF file with selected models.

Form Data:

file (required): PDF file to process
models (optional): Comma-separated list of models (surya,docling,mineru)
page_start (optional): Starting page number (1-indexed)
page_end (optional): Ending page number
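
As a rough illustration of the form fields listed above, the following sketch builds a multipart request for POST /api/extract using only the Python standard library. The helper name `build_extract_request`, the sample filename, and the placeholder PDF bytes are made up for this example; the field names and the localhost URL are the ones given in this README.

```python
import urllib.request
import uuid


def build_extract_request(base_url, pdf_bytes, filename, models=None,
                          page_start=None, page_end=None):
    """Build (but do not send) a multipart POST request for /api/extract."""
    boundary = uuid.uuid4().hex
    fields = {}
    if models:
        fields["models"] = ",".join(models)  # comma-separated, as documented
    if page_start is not None:
        fields["page_start"] = str(page_start)
    if page_end is not None:
        fields["page_end"] = str(page_end)

    parts = []
    # Plain text fields first, one multipart section each.
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{key}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # The file section; assumes a simple filename with no quotes in it.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'.encode()
        + pdf_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        f"{base_url}/api/extract",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


request = build_extract_request(
    "http://localhost:8000", b"%PDF-1.7 placeholder", "sample.pdf",
    models=["surya", "docling"], page_start=1, page_end=3,
)
print(request.get_method(), request.full_url)
```

With the backend running, the request could then be sent with `urllib.request.urlopen(request)` and the body decoded as JSON.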

Response:

```
{
  "pages": 3,
  "models": {
    "surya": {
      "text_markdown": "# Extracted content...",
      "annotated_images": ["data:image/png;base64,..."],
      "meta": {
        "time_ms": 1250.0,
        "block_count": 45,
        "ocr_box_count": 0,
        "char_count": 5420,
        "word_count": 892,
        "element_counts": {
          "titles": 3,
          "headers": 8,
          "paragraphs": 12,
          "tables": 2,
          "figures": 1
        },
        "confidence": 0.94
      }
    }
  }
}
```
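
To show how a client might consume this response, here is a small sketch that flattens the per-model `meta` objects into comparison rows. The function name `metrics_rows` and the choice of columns are illustrative only; the field names and sample values are taken from the example response above.

```python
def metrics_rows(response):
    """Flatten the documented response into one row per model."""
    rows = []
    for model, result in response["models"].items():
        meta = result["meta"]
        rows.append({
            "model": model,
            "time_ms": meta["time_ms"],
            "blocks": meta["block_count"],
            "words": meta["word_count"],
            "confidence": meta["confidence"],
        })
    # Fastest model first.
    return sorted(rows, key=lambda row: row["time_ms"])


sample = {
    "pages": 3,
    "models": {
        "surya": {"meta": {"time_ms": 1250.0, "block_count": 45,
                           "word_count": 892, "confidence": 0.94}},
    },
}
for row in metrics_rows(sample):
    print(f'{row["model"]:<8} {row["time_ms"]:>8.1f} ms  {row["words"]:>5} words')
```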

Health Check

GET /health

Check backend status and PDF engine availability.

```
{
  "status": "ok",
  "pdf_engine": "fitz",
  "fitz_available": true,
  "fitz_error": null
}
```
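
A client could interpret the health payload along these lines. This is a sketch under assumptions: the README does not spell out the semantics of each field, so the reading below (a false `fitz_available` means PyMuPDF failed to load, with the reason in `fitz_error`) is inferred from the field names, and `describe_health` is a hypothetical helper.

```python
import json


def describe_health(payload):
    """Turn the /health JSON into a one-line status message."""
    if payload.get("status") != "ok":
        return "backend unhealthy"
    if payload.get("fitz_available"):
        return f"ok, PDF engine: {payload.get('pdf_engine')}"
    return f"ok, but PyMuPDF unavailable: {payload.get('fitz_error')}"


sample = json.loads(
    '{"status": "ok", "pdf_engine": "fitz", "fitz_available": true, "fitz_error": null}'
)
print(describe_health(sample))  # ok, PDF engine: fitz
```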

Troubleshooting

PyMuPDF Issues

The backend uses PyMuPDF for PDF processing. If you encounter issues:

Check engine status:
```
curl http://localhost:8000/health
```
Common Windows fixes:
- Install Visual C++ Redistributable
- Use 64-bit Python
- Clear pip cache: pip cache purge
Detailed troubleshooting: See backend/BACKEND_TROUBLESHOOTING.md

Fallback Mode

When specialized extraction models are unavailable, the system automatically uses PyMuPDF fallback:

Provides reliable text extraction using PyMuPDF's core functionality
Generates page images with basic element detection
Maintains full API compatibility
Sets confidence scores based on extraction quality

Common Issues

CORS errors: Ensure backend CORS_ORIGINS includes your frontend domain
File upload fails: Check file size (max 15MB) and format (PDF only)
Slow processing: Large files with many pages take longer to process; try reducing the page range
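
Upload failures caused by size or format can be caught before a request is sent. The sketch below is a hypothetical client-side check, not the backend's validation logic: `upload_problem` is a made-up name, "15MB" is assumed to mean 15 MiB, and the header test relies on the fact that a well-formed PDF normally begins with `%PDF-`.

```python
MAX_UPLOAD_BYTES = 15 * 1024 * 1024  # assumes "15MB" means 15 MiB


def upload_problem(filename, data):
    """Return a reason the upload would likely be rejected, or None."""
    if not filename.lower().endswith(".pdf"):
        return "not a .pdf file"
    if not data.startswith(b"%PDF-"):
        return "missing %PDF- header"
    if len(data) > MAX_UPLOAD_BYTES:
        return "larger than 15MB"
    return None


print(upload_problem("report.pdf", b"%PDF-1.7\n"))  # None
print(upload_problem("notes.txt", b"hello"))        # not a .pdf file
```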

Configuration

Environment Variables

Frontend:

NEXT_PUBLIC_BACKEND_URL: Backend API base URL

Backend:

CORS_ORIGINS: Comma-separated list of allowed origins
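
One way a backend might turn a comma-separated CORS_ORIGINS value into a list is sketched below. How this project actually parses the variable is not shown in this README, so `parse_cors_origins` and the `http://localhost:3000` default are assumptions made for illustration.

```python
import os


def parse_cors_origins(raw):
    """Split a comma-separated origin list, dropping blanks and trailing slashes."""
    origins = []
    for item in raw.split(","):
        origin = item.strip().rstrip("/")
        if origin:
            origins.append(origin)
    return origins


origins = parse_cors_origins(os.environ.get("CORS_ORIGINS", "http://localhost:3000"))
print(origins)

# With FastAPI installed, the list would typically be passed to the CORS middleware:
#   from fastapi.middleware.cors import CORSMiddleware
#   app.add_middleware(CORSMiddleware, allow_origins=origins,
#                      allow_methods=["*"], allow_headers=["*"])
```

Stripping trailing slashes matters because browsers send the Origin header without one, so `https://example.com/` in the variable would never match.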

File Limits

Maximum file size: 15MB
Rate limit: 12 requests per minute per IP
Supported format: PDF only
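
For readers curious how a limit of 12 requests per minute per IP can be enforced, here is a minimal in-memory sliding-window sketch. The README does not say which algorithm or library the backend uses, so this is only one plausible approach; `SlidingWindowLimiter` is a hypothetical name and the IP address is a documentation placeholder.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit=12, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)  # client key -> timestamps of recent requests

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = SlidingWindowLimiter()
results = [limiter.allow("203.0.113.7") for _ in range(13)]
print(results.count(True), results.count(False))  # 12 1
```

An in-memory limiter like this resets on restart and is not shared across processes, which is usually acceptable for a single-instance demo backend.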

Development

The project uses:

TypeScript for type safety
Tailwind CSS for styling
React Zoom Pan Pinch for interactive image viewing
React Markdown for content rendering
Pydantic for API validation
FastAPI for backend framework

License

This project is open source and available under standard terms.

MJenius/PDF-Extractor-With-PymuPDF
