EmanuProds/image2doc

📄 Image2DOC

Python GTK Tesseract License: MIT

A GTK4 application that converts document images to organized PDFs using OCR technology. It automatically detects page numbers, organizes documents, and allows manual corrections when OCR fails.

✨ Features

  • ⚡ Parallel OCR Processing: Uses multiple CPU cores for faster image processing
  • 🔍 Automatic Page Detection: Extracts page numbers using Tesseract OCR
  • ✏️ Manual Correction: Interactive dialog for correcting OCR failures
  • 📚 Smart Organization: Automatically organizes PDFs by page numbers (FL. 001, FL. 002, etc.)
  • 💾 Cache System: Skips already processed images to avoid reprocessing
  • 🎨 Modern UI: Built with GTK4 and Libadwaita for a native Linux experience
  • 📊 Real-time Logs: Live monitoring of processing status and errors
  • ⚙️ Configurable Settings: Adjustable maximum pages and processing threads
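The cache feature listed above can be pictured as a filter over the input folder. The sketch below is an assumption, not the application's actual mechanism: it treats an image as already processed when a hypothetical JSON manifest (`.image2doc_cache.json`, a name invented for this example) records the same file name and modification time.

```python
import json
from pathlib import Path

# Hypothetical manifest name; the real application may store its cache differently.
MANIFEST_NAME = ".image2doc_cache.json"


def load_manifest(output_dir: Path) -> dict:
    """Return the recorded {file name: modification time} map, or an empty one."""
    manifest = output_dir / MANIFEST_NAME
    if manifest.exists():
        return json.loads(manifest.read_text(encoding="utf-8"))
    return {}


def pending_images(input_dir: Path, output_dir: Path) -> list:
    """List JPG/JPEG files whose recorded modification time is missing or stale."""
    seen = load_manifest(output_dir)
    images = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )
    return [p for p in images if seen.get(p.name) != p.stat().st_mtime]


def mark_processed(image: Path, output_dir: Path) -> None:
    """Record an image as processed so the next run skips it."""
    seen = load_manifest(output_dir)
    seen[image.name] = image.stat().st_mtime
    (output_dir / MANIFEST_NAME).write_text(json.dumps(seen), encoding="utf-8")
```

Keying on modification time means an image that is rescanned or edited would be picked up again, while untouched files are skipped.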

Prerequisites

System Requirements

  • Linux operating system
  • Python 3.8 or higher
  • GTK4 development libraries
  • Tesseract OCR engine

Installing System Dependencies

Ubuntu/Debian

sudo apt update
sudo apt install python3 python3-pip tesseract-ocr tesseract-ocr-por libgtk-4-dev libadwaita-1-dev

Fedora

sudo dnf install python3 python3-pip tesseract tesseract-langpack-por gtk4-devel libadwaita-devel

Arch Linux

sudo pacman -S python python-pip tesseract tesseract-data-por gtk4 libadwaita

Installation

  1. Clone the repository:
git clone https://github.com/EmanuProds/image2doc.git
cd image2doc
  2. Create a virtual environment (recommended):
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Python dependencies:
pip install pytesseract pillow pygobject

Usage

  1. Activate the virtual environment (if created):
source venv/bin/activate
  2. Run the application:
python main.py

How to Use

  1. Select Input Directory: Choose the folder containing your document images (JPG/JPEG)
  2. Select Output Directory: Choose where the organized PDFs will be saved
  3. Configure Settings (optional):
    • Maximum pages: Set the total number of pages in your document
    • Number of processes: Adjust parallel processing (0 = auto-detect)
  4. Start Processing: Click "Start Processing" and monitor progress in the Logs tab
  5. Manual Corrections: If OCR fails, the app will prompt for manual page number input
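Steps 3 and 5 imply that a detected number is only accepted when it is plausible for the document. A minimal sketch of that check follows, assuming the OCR text contains the page label as a short run of digits, optionally preceded by "FL."; the function name and the pattern are illustrative, and the application's real detection rules may differ.

```python
import re
from typing import Optional

# Assumption: the page label looks like "FL. 012" or a bare number of up to 4 digits.
PAGE_PATTERN = re.compile(r"(?:FL\.?\s*)?(?<!\d)(\d{1,4})(?!\d)", re.IGNORECASE)


def parse_page_number(ocr_text: str, max_pages: int = 300) -> Optional[int]:
    """Return the first number between 1 and max_pages found in the text.

    None signals that the caller should ask the user for the page number.
    """
    for match in PAGE_PATTERN.finditer(ocr_text):
        number = int(match.group(1))
        if 1 <= number <= max_pages:
            return number
    return None
```

A `None` result is the point at which a manual correction prompt would be shown.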

Output Structure

The application creates organized PDFs with the following naming convention:

  • FL. 001.pdf, FL. 002.pdf, etc. - Regular pages
  • FL. 001-verso.pdf - Back sides of pages
  • TERMO DE ABERTURA.pdf - Opening terms
  • TERMO DE ENCERRAMENTO.pdf - Closing terms
  • ERRO_OCR_filename.pdf - Files that couldn't be processed
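The naming convention above can be expressed as two small helpers. This is a sketch rather than the application's own code: the function names are hypothetical, and it assumes that `filename` in `ERRO_OCR_filename.pdf` is the source image name without its extension.

```python
from pathlib import Path


def page_filename(page: int, verso: bool = False) -> str:
    """Build a name such as 'FL. 001.pdf' or 'FL. 001-verso.pdf'."""
    suffix = "-verso" if verso else ""
    return f"FL. {page:03d}{suffix}.pdf"


def error_filename(source_image: str) -> str:
    """Build the name used for a file that could not be processed."""
    # Assumption: the placeholder 'filename' is the source image stem.
    return f"ERRO_OCR_{Path(source_image).stem}.pdf"
```

Zero-padding to three digits keeps the files in page order when a file manager sorts them by name.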

Configuration

OCR Settings

  • Language: Portuguese (por)
  • Page segmentation mode (PSM): 6 (assume a single uniform block of text)
  • ROI: Configurable region of interest for page number detection
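The settings above map onto a `pytesseract` call roughly as follows. This is a sketch under stated assumptions: the ROI fractions are invented for illustration (here the top-right corner of the page), and `roi_box` and `read_page_label` are hypothetical names, not functions from this project.

```python
from typing import Tuple

OCR_LANGUAGE = "por"    # Portuguese, as listed above
OCR_CONFIG = "--psm 6"  # page segmentation mode 6

# Hypothetical ROI as fractions of the image (left, top, right, bottom).
# The application's real region may differ.
ROI_FRACTIONS = (0.6, 0.0, 1.0, 0.2)


def roi_box(width: int, height: int,
            fractions: Tuple[float, float, float, float] = ROI_FRACTIONS
            ) -> Tuple[int, int, int, int]:
    """Convert fractional ROI bounds into a pixel box for Image.crop()."""
    left, top, right, bottom = fractions
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))


def read_page_label(image_path: str) -> str:
    """Run Tesseract on the ROI only and return the raw recognized text."""
    # Imported here so roi_box() stays usable without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        region = image.crop(roi_box(*image.size))
        return pytesseract.image_to_string(
            region, lang=OCR_LANGUAGE, config=OCR_CONFIG
        )
```

Restricting OCR to a small region tends to be faster than reading the whole page and avoids picking up unrelated numbers from the body text.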

Processing Settings

  • Maximum Pages: Default 300 pages
  • Parallel Processes: Default 4 workers
  • Cache System: Automatically detects and skips already processed files
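One way to read the worker setting is sketched below with `concurrent.futures`. It assumes that 0 means "use every available core", as described under How to Use; `resolve_workers` and `process_all` are hypothetical names chosen for this example.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List

DEFAULT_WORKERS = 4  # default listed above


def resolve_workers(requested: int = DEFAULT_WORKERS) -> int:
    """Map the setting to a worker count; 0 means use every available core."""
    if requested < 0:
        raise ValueError("worker count cannot be negative")
    if requested == 0:
        return os.cpu_count() or 1  # cpu_count() may return None
    return requested


def process_all(task: Callable, items: Iterable,
                requested: int = DEFAULT_WORKERS) -> List:
    """Run task over items in separate processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=resolve_workers(requested)) as pool:
        return list(pool.map(task, items))
```

Because the work runs in separate processes, `task` has to be a picklable, module-level function and each item must be picklable too (for example, a file path string).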

Architecture

The application follows a modern, service-oriented architecture with clear separation of concerns:

src/
├── models.py           # Data models and domain entities (dataclasses & enums)
├── exceptions.py       # Custom exception hierarchy
├── config.py           # Application configuration
├── core.py             # Legacy processing logic (backward compatibility)
├── services/           # Modern service layer
│   ├── file_service.py     # File operations and caching
│   ├── ocr_service.py      # OCR processing and image manipulation
│   └── processing_service.py # Main processing coordination
├── interface/          # GTK4 UI layer
│   ├── entrypoint.py       # Application initialization
│   ├── gui.py              # Main window and navigation
│   ├── home.py             # Processing interface
│   ├── pref.py             # Preferences/settings page
│   ├── logs.py             # Logging interface
│   └── about.py            # About dialog
├── ocr.py              # Legacy OCR functions (deprecated)
└── __init__.py         # Package initialization

Development

Project Structure

  • main.py: Application entry point
  • src/: Main source code (modern architecture)
  • README.md: English documentation
  • README.pt-BR.md: Portuguese documentation

Recent Changes (v1.0.0)

  • 🏗️ Architecture Refactoring: Complete modernization with service-oriented design
  • 📁 File Organization: Renamed interface files for consistency (removed _page/_dialog suffixes)
  • 🏷️ Project Renaming: Changed from "Image2PDF" to "Image2DOC" for clarity
  • 🧹 Code Cleanup: Removed deprecated files and legacy code
  • 📚 Documentation: Updated READMEs with current project structure

Key Technologies

  • GTK4: Modern GUI framework
  • Libadwaita: Adaptive UI components
  • Tesseract: OCR engine
  • Pillow: Image processing
  • concurrent.futures: Parallel processing (Python standard library)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

Troubleshooting

Common Issues

Tesseract not found

Error: Tesseract not found
  • Install Tesseract and the Portuguese language data (Ubuntu/Debian): sudo apt install tesseract-ocr tesseract-ocr-por
  • Ensure it's in PATH: which tesseract
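The same PATH check can be done from Python before starting a long run. This is a small sketch using only the standard library; the function names are illustrative and not part of the application.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


def tesseract_status() -> str:
    """Describe whether the tesseract executable can be located."""
    path = find_tesseract()
    if path is None:
        return "Tesseract not found on PATH"
    return f"Tesseract found at {path}"
```

`shutil.which` follows the same lookup rules as the shell, so it should agree with `which tesseract` when both run in the same environment.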

GTK4 not available

ImportError: GTK4 libraries not found
  • Install the GTK4 and Libadwaita packages listed under Installing System Dependencies
  • Ensure PyGObject is installed in the active environment: pip install pygobject
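To tell a missing PyGObject apart from missing GTK4 or Libadwaita libraries, a quick check along these lines may help. It is a sketch, not the application's startup code: `gtk4_available` is a hypothetical helper, and it only asks PyGObject whether the requested versions can be located.

```python
def gtk4_available() -> bool:
    """Return True when PyGObject can locate the GTK 4 and Libadwaita bindings."""
    try:
        import gi
        gi.require_version("Gtk", "4.0")
        gi.require_version("Adw", "1")
    except (ImportError, ValueError):
        # ImportError: PyGObject itself is missing.
        # ValueError: the namespace or requested version is not installed.
        return False
    return True
```

Running this inside the virtual environment shows whether the interpreter that launches `main.py` can see the system GTK libraries.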

OCR accuracy issues

  • Ensure images are clear and well-lit
  • Check that page numbers are in the expected region
  • Use manual correction when automatic detection fails

Performance Tips

  • Use SSD storage for faster I/O
  • Increase parallel processes for multi-core systems
  • Process images in batches for better cache utilization

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Developed by Emanuel Pereira

Acknowledgments

  • Tesseract OCR project
  • GTK and GNOME communities
  • Python Pillow library contributors
