Skip to content

a python OCR tool that uses tesseract and poppler to extract text from pdf files, both image and text based

Notifications You must be signed in to change notification settings

cozyCodr/python-ocr-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Python OCR Extractor

A web application that extracts text from scanned PDF documents using OCR (Optical Character Recognition) technology. The application consists of a Python Flask backend and a frontend interface.

Features

  • PDF file upload functionality
  • OCR text extraction from scanned PDFs
  • Real-time text extraction processing
  • Cross-Origin Resource Sharing (CORS) enabled
  • Clean and simple API endpoint

Prerequisites

Before running this application, make sure you have the following installed:

Installation

  1. Clone the repository:
git clone https://github.com/cozyCodr/python-ocr-extractor.git
cd python-ocr-extractor
  1. Install backend dependencies
cd backend
pip install -r requirements.txt
  1. Configure Tesseract and Poppler paths:
  • Open backend/app.py
  • Update the following paths according to your system
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    POPPLER_PATH = r"C:\poppler-24.08.0\Library\bin"

Project Structure

python-ocr-extractor/
├── backend/
│   ├── app.py              # Flask application
│   └── requirements.txt    # Python dependencies
├── frontend-app/          # Frontend application
└── .gitignore

API Endpoints

POST /extract_text Extracts text from an uploaded PDF file.

Request:

Method: POST
Content-Type: multipart/form-data
Body: pdf_file (PDF file)

Usage

  1. Start the backend server:
cd backend
python app.py
  1. The server will start running on http://localhost:5000

About

a python OCR tool that uses tesseract and poppler to extract text from pdf files, both image and text based

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published