📄 Image2DOC

A GTK4 application that converts document images to organized PDFs using OCR technology. It automatically detects page numbers, organizes documents, and allows manual corrections when OCR fails.

✨ Features

⚡ Parallel OCR Processing: Uses multiple CPU cores for faster image processing
🔍 Automatic Page Detection: Extracts page numbers using Tesseract OCR
✏️ Manual Correction: Interactive dialog for correcting OCR failures
📚 Smart Organization: Automatically organizes PDFs by page numbers (FL. 001, FL. 002, etc.)
💾 Cache System: Skips already processed images to avoid reprocessing
🎨 Modern UI: Built with GTK4 and Libadwaita for a native Linux experience
📊 Real-time Logs: Live monitoring of processing status and errors
⚙️ Configurable Settings: Adjustable maximum pages and processing threads

Prerequisites

System Requirements

Linux operating system
Python 3.8 or higher
GTK4 and Libadwaita development libraries
Tesseract OCR engine

Installing System Dependencies

Ubuntu/Debian

sudo apt update
sudo apt install python3 python3-pip tesseract-ocr tesseract-ocr-por libgtk-4-dev libadwaita-1-dev

Fedora

sudo dnf install python3 python3-pip tesseract tesseract-langpack-por gtk4-devel libadwaita-devel

Arch Linux

sudo pacman -S python python-pip tesseract tesseract-data-por gtk4 libadwaita

Installation

Clone the repository:

git clone https://github.com/EmanuProds/ncx-book-organizer.git
cd ncx-book-organizer

Create a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate

Install Python dependencies:

pip install pytesseract pillow pygobject

Usage

Activate the virtual environment (if created):

source venv/bin/activate

Run the application:

python main.py

How to Use

Select Input Directory: Choose the folder containing your document images (JPG/JPEG)
Select Output Directory: Choose where the organized PDFs will be saved
Configure Settings (optional):
- Maximum pages: Set the total number of pages in your document
- Number of processes: Adjust parallel processing (0 = auto-detect)
Start Processing: Click "Start Processing" and monitor progress in the Logs tab
Manual Corrections: If OCR fails, the app will prompt for manual page number input

Output Structure

The application creates organized PDFs with the following naming convention:

FL. 001.pdf, FL. 002.pdf, etc. - Regular pages
FL. 001-verso.pdf - Back sides of pages
TERMO DE ABERTURA.pdf - Opening terms
TERMO DE ENCERRAMENTO.pdf - Closing terms
ERRO_OCR_filename.pdf - Files that couldn't be processed
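
The naming convention above can be sketched as a pair of small helpers. This is an illustration only, not the project's code: the function names are hypothetical, and it assumes that "filename" in the error name means the source image name without its extension.

```python
from pathlib import Path


def folio_filename(page: int, verso: bool = False) -> str:
    """Build a name such as 'FL. 001.pdf' or 'FL. 001-verso.pdf'."""
    suffix = "-verso" if verso else ""
    # Page numbers are zero-padded to three digits, as in the listing above.
    return f"FL. {page:03d}{suffix}.pdf"


def error_filename(source_name: str) -> str:
    """Build the name used for an image that could not be processed."""
    # Assumption: the image extension is dropped before adding the prefix.
    return f"ERRO_OCR_{Path(source_name).stem}.pdf"
```

For example, folio_filename(2) gives "FL. 002.pdf" and folio_filename(1, verso=True) gives "FL. 001-verso.pdf".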

Configuration

OCR Settings

Language: Portuguese (por)
PSM Mode: 6 (Uniform block of text)
ROI: Configurable region of interest for page number detection
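
As a rough sketch of how these settings might fit together with pytesseract and Pillow: the language and PSM values come from the list above, while the fractional ROI format, the default ROI (top-right quarter of the page), the function names, and the use of the maximum page count to reject implausible readings are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]


def roi_box(size: Tuple[int, int], roi: Tuple[float, float, float, float]) -> Box:
    """Convert a fractional (left, top, right, bottom) ROI to pixel coordinates."""
    width, height = size
    left, top, right, bottom = roi
    return (int(width * left), int(height * top),
            int(width * right), int(height * bottom))


def parse_page_number(text: str, max_pages: int = 300) -> Optional[int]:
    """Return the first number in the OCR text within 1..max_pages, else None."""
    for digits in re.findall(r"\d+", text):
        number = int(digits)
        if 1 <= number <= max_pages:
            return number
    return None


def read_page_number(image_path: str,
                     roi: Tuple[float, float, float, float] = (0.5, 0.0, 1.0, 0.25),
                     max_pages: int = 300) -> Optional[int]:
    """Crop the ROI, run Tesseract on it, and parse a page number."""
    # Imported lazily so the helpers above work without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        region = image.crop(roi_box(image.size, roi))
        # lang="por" and --psm 6 mirror the OCR settings listed above.
        text = pytesseract.image_to_string(region, lang="por", config="--psm 6")
    return parse_page_number(text, max_pages)
```

A None result is the case in which the application would fall back to manual correction.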

Processing Settings

Maximum Pages: Default 300 pages
Parallel Processes: Default 4 workers
Cache System: Automatically detects and skips already processed files
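
The sketch below shows one way the worker count and the cache check could be combined with concurrent.futures. It is not the project's implementation: the function names are hypothetical, and the cache is assumed to be a plain set of image names that were already processed.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence, Set

IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def resolve_worker_count(requested: int) -> int:
    """Return the number of worker processes; 0 means auto-detect."""
    if requested > 0:
        return requested
    # os.cpu_count() may return None, so fall back to a single worker.
    return os.cpu_count() or 1


def pending_images(input_dir: Path, already_processed: Set[str]) -> List[Path]:
    """List JPG/JPEG files in input_dir whose names are not in the cache."""
    return sorted(
        path for path in input_dir.iterdir()
        if path.suffix.lower() in IMAGE_SUFFIXES
        and path.name not in already_processed
    )


def process_all(paths: Sequence[Path], worker: Callable[[Path], object],
                requested_workers: int = 4) -> list:
    """Run worker(path) for every path in parallel, keeping input order."""
    # worker must be a module-level function so it can be sent to subprocesses.
    workers = resolve_worker_count(requested_workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, paths))
```

A typical call would be process_all(pending_images(input_dir, cache), ocr_worker, 0), where ocr_worker stands for whatever per-image function the application uses.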

Architecture

The application follows a modern, service-oriented architecture with clear separation of concerns:

src/
├── models.py           # Data models and domain entities (dataclasses & enums)
├── exceptions.py       # Custom exception hierarchy
├── config.py           # Application configuration
├── core.py             # Legacy processing logic (backward compatibility)
├── services/           # Modern service layer
│   ├── file_service.py     # File operations and caching
│   ├── ocr_service.py      # OCR processing and image manipulation
│   └── processing_service.py # Main processing coordination
├── interface/          # GTK4 UI layer
│   ├── entrypoint.py       # Application initialization
│   ├── gui.py              # Main window and navigation
│   ├── home.py             # Processing interface
│   ├── pref.py             # Preferences/settings page
│   ├── logs.py             # Logging interface
│   └── about.py            # About dialog
├── ocr.py              # Legacy OCR functions (deprecated)
└── __init__.py         # Package initialization
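
To make the layering more concrete, here is a minimal sketch of how a model, an exception hierarchy, and a coordinating function might relate. Every class and function name below is hypothetical; the real contents of models.py, exceptions.py, and the services may differ.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Callable, Optional


class PageKind(Enum):
    """Kinds of output described in the Output Structure section."""
    FOLIO = "folio"
    OPENING_TERM = "opening_term"
    CLOSING_TERM = "closing_term"
    OCR_ERROR = "ocr_error"


class Image2DocError(Exception):
    """Hypothetical base class for application errors."""


class OCRError(Image2DocError):
    """Raised when no page number can be read from an image."""


@dataclass(frozen=True)
class PageResult:
    """Outcome of processing a single image."""
    source: Path
    kind: PageKind
    number: Optional[int] = None


def classify(source: Path, read_number: Callable[[Path], int]) -> PageResult:
    """Call the OCR layer and map its failure to an error result."""
    try:
        return PageResult(source, PageKind.FOLIO, read_number(source))
    except OCRError:
        # The UI layer could use this result to prompt for manual correction.
        return PageResult(source, PageKind.OCR_ERROR)
```

Passing the OCR function in as an argument keeps the coordinating code independent of Tesseract, which also makes it easy to exercise without real images.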

Development

Project Structure

main.py: Application entry point
src/: Main source code (modern architecture)
README.md: English documentation
README.pt-BR.md: Portuguese documentation

Recent Changes (v1.0.0)

🏗️ Architecture Refactoring: Complete modernization with service-oriented design
📁 File Organization: Renamed interface files for consistency (removed _page/_dialog suffixes)
🏷️ Project Renaming: Changed from "Image2PDF" to "Image2DOC" for clarity
🧹 Code Cleanup: Removed deprecated files and legacy code
📚 Documentation: Updated READMEs with current project structure

Key Technologies

GTK4: Modern GUI framework
Libadwaita: Adaptive UI components
Tesseract: OCR engine
Pillow: Image processing
concurrent.futures: Parallel processing

Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

Troubleshooting

Common Issues

Tesseract not found

Error: Tesseract not found

Install Tesseract: sudo apt install tesseract-ocr (Ubuntu/Debian; see Installing System Dependencies for other distributions)
Ensure it's in PATH: which tesseract
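
The same PATH check can be done from Python with the standard library. This is a small diagnostic sketch, with a hypothetical helper name, rather than anything the application ships.

```python
import shutil
from typing import Optional


def describe_tesseract(path: Optional[str]) -> str:
    """Turn the result of a PATH lookup into a readable status line."""
    if path is None:
        return "tesseract not found on PATH"
    return f"tesseract found at {path}"


# shutil.which returns the full path of the executable, or None if absent.
print(describe_tesseract(shutil.which("tesseract")))
```

If the binary is installed outside PATH, pytesseract also lets you point at it explicitly by setting pytesseract.pytesseract.tesseract_cmd.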

GTK4 not available

ImportError: GTK4 libraries not found

Install the GTK4 and Libadwaita packages listed under Installing System Dependencies
Ensure PyGObject is properly installed
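
To tell a missing PyGObject apart from missing GTK4 or Libadwaita bindings, a short check such as the following can help. The function name is hypothetical; gi.require_version is the standard PyGObject call and raises ValueError when the requested namespace version is unavailable.

```python
def check_gtk4() -> str:
    """Report whether PyGObject, GTK 4, and Libadwaita 1 can be loaded."""
    try:
        import gi
    except ImportError:
        return "PyGObject is not installed in this environment"
    try:
        gi.require_version("Gtk", "4.0")
        gi.require_version("Adw", "1")
    except ValueError as exc:
        # Raised when the GTK 4 or Libadwaita bindings are missing.
        return f"GTK4/Libadwaita bindings not available: {exc}"
    return "ok"


print(check_gtk4())
```

Run it inside the same virtual environment you use for the application, since PyGObject installed system-wide is not visible from a default venv.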

OCR accuracy issues

Ensure images are clear and well-lit
Check that page numbers are in the expected region
Use manual correction when automatic detection fails

Performance Tips

Use SSD storage for faster I/O
Increase parallel processes for multi-core systems
Process images in batches for better cache utilization

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Developed by Emanuel Pereira

Acknowledgments

Tesseract OCR project
GTK and GNOME communities
Python Pillow library contributors
