A GTK4 application that converts document images to organized PDFs using OCR technology. It automatically detects page numbers, organizes documents, and allows manual corrections when OCR fails.
- ⚡ Parallel OCR Processing: Uses multiple CPU cores for faster image processing
- 🔍 Automatic Page Detection: Extracts page numbers using Tesseract OCR
- ✏️ Manual Correction: Interactive dialog for correcting OCR failures
- 📚 Smart Organization: Automatically organizes PDFs by page numbers (FL. 001, FL. 002, etc.)
- 💾 Cache System: Skips already processed images to avoid reprocessing
- 🎨 Modern UI: Built with GTK4 and Libadwaita for a native Linux experience
- 📊 Real-time Logs: Live monitoring of processing status and errors
- ⚙️ Configurable Settings: Adjustable maximum pages and processing threads
- Linux operating system
- Python 3.8 or higher
- GTK4 development libraries
- Tesseract OCR engine
Ubuntu/Debian:
sudo apt update
sudo apt install python3 python3-pip tesseract-ocr tesseract-ocr-por libgtk-4-dev libadwaita-1-dev
Fedora:
sudo dnf install python3 python3-pip tesseract tesseract-langpack-por gtk4-devel libadwaita-devel
Arch Linux:
sudo pacman -S python python-pip tesseract tesseract-data-por gtk4 libadwaita
- Clone the repository:
git clone https://github.com/EmanuProds/ncx-book-organizer.git
cd ncx-book-organizer
- Create a virtual environment (recommended):
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install Python dependencies:
pip install pytesseract pillow pygobject
- Activate the virtual environment (if created):
source venv/bin/activate
- Run the application:
python main.py
- Select Input Directory: Choose the folder containing your document images (JPG/JPEG)
- Select Output Directory: Choose where the organized PDFs will be saved
- Configure Settings (optional):
- Maximum pages: Set the total number of pages in your document
- Number of processes: Adjust parallel processing (0 = auto-detect)
- Start Processing: Click "Start Processing" and monitor progress in the Logs tab
- Manual Corrections: If OCR fails, the app will prompt for manual page number input
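
The "Number of processes" setting (0 = auto-detect) could be resolved roughly as in the sketch below. This is a minimal illustration, not the application's actual code: `resolve_workers` and `run_parallel` are hypothetical names, and falling back to `os.cpu_count()` is an assumption about what "auto-detect" means.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def resolve_workers(requested: int) -> int:
    """Map the 'Number of processes' setting to a concrete worker count."""
    if requested < 0:
        raise ValueError("number of processes cannot be negative")
    if requested == 0:
        # 0 means auto-detect; os.cpu_count() may return None, so fall back to 1.
        return os.cpu_count() or 1
    return requested


def run_parallel(func, items, requested: int = 0):
    """Apply a picklable, module-level function to every item in worker processes."""
    with ProcessPoolExecutor(max_workers=resolve_workers(requested)) as pool:
        # Executor.map() returns results in the same order as the inputs.
        return list(pool.map(func, items))
```

Worker processes suit OCR because the work is CPU-bound; the function passed to `run_parallel` has to be defined at module level so it can be pickled.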
The application creates organized PDFs with the following naming convention:
- FL. 001.pdf, FL. 002.pdf, etc.: Regular pages
- FL. 001-verso.pdf: Back sides of pages
- TERMO DE ABERTURA.pdf: Opening terms
- TERMO DE ENCERRAMENTO.pdf: Closing terms
- ERRO_OCR_filename.pdf: Files that couldn't be processed
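
The naming convention above can be expressed as a small helper. The function names here are hypothetical, and using the source file's stem (name without extension) in the error case is an assumption, since the README only shows `ERRO_OCR_filename.pdf`.

```python
from pathlib import Path


def output_name(page: int, verso: bool = False) -> str:
    """Build the PDF name for a numbered page, zero-padded to three digits."""
    suffix = "-verso" if verso else ""
    return f"FL. {page:03d}{suffix}.pdf"


def error_name(source_filename: str) -> str:
    """Build the PDF name for an image whose page number could not be read."""
    # Assumption: the original extension (.jpg/.jpeg) is dropped.
    return f"ERRO_OCR_{Path(source_filename).stem}.pdf"
```

For example, `output_name(2, verso=True)` gives `FL. 002-verso.pdf`.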
- Language: Portuguese (por)
- PSM Mode: 6 (Uniform block of text)
- ROI: Configurable region of interest for page number detection
- Maximum Pages: Default 300 pages
- Parallel Processes: Default 4 workers
- Cache System: Automatically detects and skips already processed files
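
As a rough sketch of how the OCR settings above fit together, the snippet below crops a region of interest and passes it to Tesseract with the Portuguese language pack and PSM 6. The ROI values and the helper names `roi_box` and `read_roi_text` are assumptions made for illustration; the application's real ROI lives in its configuration and is not documented here.

```python
# Assumed ROI as fractions of the image: (left, top, right, bottom).
# These numbers are placeholders, not the application's real defaults.
DEFAULT_ROI = (0.5, 0.0, 1.0, 0.25)


def roi_box(width, height, roi=DEFAULT_ROI):
    """Convert a fractional ROI into the pixel box expected by Image.crop()."""
    left, top, right, bottom = roi
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def read_roi_text(image_path, roi=DEFAULT_ROI, lang="por", psm=6):
    """Run Tesseract on the ROI of one image and return the raw text."""
    # Imported lazily so roi_box() stays usable without Pillow or pytesseract.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        region = img.crop(roi_box(img.width, img.height, roi))
        return pytesseract.image_to_string(region, lang=lang, config=f"--psm {psm}")
```

Restricting OCR to a small region tends to be faster and less noisy than reading the whole page, which is presumably why the ROI is configurable.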
The application follows a modern, service-oriented architecture with clear separation of concerns:
src/
├── models.py # Data models and domain entities (dataclasses & enums)
├── exceptions.py # Custom exception hierarchy
├── config.py # Application configuration
├── core.py # Legacy processing logic (backward compatibility)
├── services/ # Modern service layer
│ ├── file_service.py # File operations and caching
│ ├── ocr_service.py # OCR processing and image manipulation
│ └── processing_service.py # Main processing coordination
├── interface/ # GTK4 UI layer
│ ├── entrypoint.py # Application initialization
│ ├── gui.py # Main window and navigation
│ ├── home.py # Processing interface
│ ├── pref.py # Preferences/settings page
│ ├── logs.py # Logging interface
│ └── about.py # About dialog
├── ocr.py # Legacy OCR functions (deprecated)
└── __init__.py # Package initialization
- main.py: Application entry point
- src/: Main source code (modern architecture)
- README.md: English documentation
- README.pt-BR.md: Portuguese documentation
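
To give a feel for what `models.py` and `exceptions.py` might contain, here is a purely illustrative sketch using dataclasses, enums, and a small exception hierarchy. Every name below (`PageKind`, `PageResult`, `Img2DocError`, `OCRError`) is hypothetical and not taken from the actual source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PageKind(Enum):
    """Kinds of output documents, mirroring the output naming convention."""
    REGULAR = "regular"
    VERSO = "verso"
    OPENING_TERM = "opening_term"
    CLOSING_TERM = "closing_term"
    OCR_ERROR = "ocr_error"


class Img2DocError(Exception):
    """Hypothetical base class for application errors."""


class OCRError(Img2DocError):
    """Hypothetical error for images whose page number cannot be read."""


@dataclass(frozen=True)
class PageResult:
    """Outcome of processing one input image."""
    source: str
    kind: PageKind
    page_number: Optional[int] = None
```

Keeping such types free of GTK and Tesseract imports lets the service and interface layers share them without depending on each other.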
- 🏗️ Architecture Refactoring: Complete modernization with service-oriented design
- 📁 File Organization: Renamed interface files for consistency (removed _page/_dialog suffixes)
- 🏷️ Project Renaming: Changed from "Image2PDF" to "Image2DOC" for clarity
- 🧹 Code Cleanup: Removed deprecated files and legacy code
- 📚 Documentation: Updated READMEs with current project structure
- GTK4: Modern GUI framework
- Libadwaita: Adaptive UI components
- Tesseract: OCR engine
- Pillow: Image processing
- concurrent.futures: Parallel processing
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
Tesseract not found
Error: Tesseract not found
- Install Tesseract:
sudo apt install tesseract-ocr
- Ensure it's in PATH:
which tesseract
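
The same PATH check can be done from Python before starting a long run. This is a small sketch; `find_tesseract` is a hypothetical helper, not part of the application.

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)
```

A `None` result corresponds to the "Tesseract not found" error above.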
GTK4 not available
ImportError: GTK4 libraries not found
- Install GTK4 development packages
- Ensure PyGObject is properly installed
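
One way to see which piece is missing is to try the imports step by step. The sketch below uses the standard PyGObject calls `gi.require_version` and `gi.repository`; the helper name `check_gtk4` and its messages are illustrative only.

```python
def check_gtk4():
    """Return (ok, message) describing whether GTK4 and Libadwaita can be imported."""
    try:
        import gi
    except ImportError:
        return False, "PyGObject (the 'gi' module) is not installed"
    try:
        # require_version raises ValueError when the typelib is missing.
        gi.require_version("Gtk", "4.0")
        gi.require_version("Adw", "1")
        from gi.repository import Gtk, Adw  # noqa: F401
    except (ValueError, ImportError) as exc:
        return False, f"GTK4/Libadwaita bindings not available: {exc}"
    return True, "GTK4 and Libadwaita are importable"
```

Running this inside the virtual environment helps distinguish a missing `pygobject` install from missing system libraries.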
OCR accuracy issues
- Ensure images are clear and well-lit
- Check that page numbers are in the expected region
- Use manual correction when automatic detection fails
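
The fallback from automatic detection to manual correction might look like the sketch below. It assumes page numbers have at most three digits (as in `FL. 001`) and must lie between 1 and the configured maximum; `parse_page_number`, `resolve_page_number`, and the `ask_user` callback are hypothetical names standing in for the application's dialog.

```python
import re
from typing import Callable, Optional


def parse_page_number(ocr_text: str, max_pages: int = 300) -> Optional[int]:
    """Return the first standalone 1-3 digit number within 1..max_pages, if any."""
    for match in re.finditer(r"(?<!\d)\d{1,3}(?!\d)", ocr_text):
        number = int(match.group())
        if 1 <= number <= max_pages:
            return number
    return None


def resolve_page_number(
    ocr_text: str,
    ask_user: Callable[[], Optional[int]],
    max_pages: int = 300,
) -> Optional[int]:
    """Use the OCR result when it is plausible, otherwise ask for manual input."""
    number = parse_page_number(ocr_text, max_pages)
    if number is not None:
        return number
    # Automatic detection failed: defer to the manual-correction callback.
    return ask_user()
```

Rejecting numbers above the maximum page count catches some OCR misreads before they produce a wrongly named PDF.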
- Use SSD storage for faster I/O
- Increase parallel processes for multi-core systems
- Process images in batches for better cache utilization
This project is licensed under the MIT License - see the LICENSE file for details.
Developed by Emanuel Pereira
- Tesseract OCR project
- GTK and GNOME communities
- Python Pillow library contributors