Add to_searchable_pdf: overlay invisible OCR text layer onto image-based PDFs #29

Draft
Copilot wants to merge 2 commits into main from copilot/research-pdf-annotation-python

Copilot AI commented Mar 6, 2026

Image-only PDF pages have no selectable/searchable text. This adds RapidOCRPDF.to_searchable_pdf() which runs OCR on such pages and writes the results back as an invisible (render_mode=3) text layer, making the PDF fully text-searchable without altering its visual appearance.

Implementation

  • _CJK_FONT_RANGES — module-level constant mapping Unicode ranges to PyMuPDF built-in font names (china-s, japan, korea)
  • _select_font(text) — static helper; picks the appropriate built-in font for the text's script, falling back to helv for Latin
  • _insert_ocr_text(page, box, txt, scale) — static helper that converts pixel-space OCR bounding boxes to PDF points (scale = 72 / dpi) and calls page.insert_text(..., render_mode=3)
  • to_searchable_pdf(content, output_path, force_ocr, page_num_list) → bytes — main new API; reuses extract_texts to identify image-only pages, OCRs them, overlays invisible text, and optionally writes the result to disk
  • CLI: --output_pdf <path> flag added to parse_args; main() routes to to_searchable_pdf when it is set
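The font selection and coordinate conversion described in the list above can be sketched in plain Python. This is a minimal illustration, not the PR's code: the Unicode ranges, the range-priority rule, the four-corner box layout, and the names `CJK_FONT_RANGES`, `select_font`, and `box_to_pdf` are all assumptions. Only the font names (`china-s`, `japan`, `korea`, `helv`) and the `scale = 72 / dpi` relation come from the description.

```python
from typing import List, Sequence, Tuple

# Hypothetical ranges for illustration; the PR's _CJK_FONT_RANGES may differ.
# Order matters: kana and Hangul are checked before Han ideographs so that
# Japanese text mixing kanji and kana resolves to "japan".
CJK_FONT_RANGES: List[Tuple[int, int, str]] = [
    (0x3040, 0x30FF, "japan"),    # Hiragana and Katakana
    (0xAC00, 0xD7AF, "korea"),    # Hangul syllables
    (0x4E00, 0x9FFF, "china-s"),  # CJK Unified Ideographs
]


def select_font(text: str) -> str:
    """Return the first font whose range matches any character, else helv."""
    for lo, hi, font in CJK_FONT_RANGES:
        if any(lo <= ord(ch) <= hi for ch in text):
            return font
    return "helv"


def box_to_pdf(
    box: Sequence[Sequence[float]], dpi: float
) -> Tuple[Tuple[float, float], float]:
    """Convert a pixel-space box (assumed to be four [x, y] corners) to PDF points.

    Returns the bottom-left corner, which can serve as the text insertion
    point, and a font size approximated by the box height.
    """
    scale = 72.0 / dpi  # PDF points per rendered pixel
    xs = [p[0] * scale for p in box]
    ys = [p[1] * scale for p in box]
    fontsize = max(max(ys) - min(ys), 1.0)  # avoid a zero-size font
    return (min(xs), max(ys)), fontsize
```

With PyMuPDF, the returned point and size could then be passed to `page.insert_text(point, txt, fontname=select_font(txt), fontsize=fontsize, render_mode=3)`. Using the box height as the font size is a simplification; the PR may size text differently.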

Usage

from rapidocr_pdf import RapidOCRPDF

engine = RapidOCRPDF()
engine.to_searchable_pdf("scan.pdf", output_path="searchable.pdf")

rapidocr_pdf scan.pdf --output_pdf searchable.pdf
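The CLI routing implied by the command above might look roughly like the following argparse sketch. Only the `--output_pdf` flag and the function names `parse_args` and `to_searchable_pdf` come from the description. The positional `pdf_path` argument, the `choose_action` helper, and the `"extract_texts"` default branch are assumptions made for illustration.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="rapidocr_pdf")
    # Assumed positional argument name; the real parser may call it something else.
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument(
        "--output_pdf",
        default=None,
        help="if set, write a searchable copy of the PDF to this path",
    )
    return parser.parse_args(argv)


def choose_action(args: argparse.Namespace) -> str:
    """Name the code path main() would take (hypothetical helper)."""
    # The flag acts as a switch: present means build the searchable PDF,
    # absent means fall back to the existing behaviour (assumed here to be
    # plain text extraction).
    return "to_searchable_pdf" if args.output_pdf else "extract_texts"
```

Keeping the routing decision in a small pure function makes it testable without running OCR.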

…sed PDFs

Co-authored-by: SWHL <28639377+SWHL@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Research how to annotate PDF results using Python" to "Add to_searchable_pdf: overlay invisible OCR text layer onto image-based PDFs" on Mar 6, 2026
Copilot finished work on behalf of SWHL March 6, 2026 08:25