Add `to_searchable_pdf`: overlay invisible OCR text layer onto image-based PDFs by Copilot · Pull Request #29 · RapidAI/RapidOCRPDF

Copilot · 2026-03-06T08:14:26Z

Image-only PDF pages have no selectable or searchable text. This PR adds `RapidOCRPDF.to_searchable_pdf()`, which runs OCR on such pages and writes the results back as an invisible text layer (`render_mode=3`), making the PDF text-searchable without altering its visual appearance.

Implementation

- `_CJK_FONT_RANGES`: module-level constant mapping Unicode ranges to PyMuPDF built-in font names (`china-s`, `japan`, `korea`)
- `_select_font(text)`: static helper; picks the appropriate built-in font for the text's script, falling back to `helv` for Latin
- `_insert_ocr_text(page, box, txt, scale)`: static helper that converts pixel-space OCR bounding boxes to PDF points (`scale = 72 / dpi`) and calls `page.insert_text(..., render_mode=3)`
- `to_searchable_pdf(content, output_path, force_ocr, page_num_list) → bytes`: main new API; reuses `extract_texts` to identify image-only pages, OCRs them, overlays invisible text, and optionally writes the result to disk
- CLI: `--output_pdf <path>` flag added to `parse_args`; `main()` routes to `to_searchable_pdf` when it is set
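
For readers who want a feel for the two helpers, here is a minimal sketch of script-based font selection and the pixel-to-point conversion. It is not the code from this PR: the names `CJK_FONT_RANGES`, `select_font`, and `box_to_pdf`, the specific Unicode ranges, the priority order, and the choice of the box's bottom-left corner as the text origin are all assumptions made for illustration. Only the font names and `scale = 72 / dpi` come from the description above.

```python
# Illustrative sketch only; ranges, names, and ordering are assumptions,
# not the PR's actual implementation.
CJK_FONT_RANGES = [
    # Kana and Hangul are checked first because Japanese and Korean text
    # may also contain CJK ideographs.
    ((0x3040, 0x30FF), "japan"),    # Hiragana and Katakana
    ((0xAC00, 0xD7A3), "korea"),    # Hangul syllables
    ((0x4E00, 0x9FFF), "china-s"),  # CJK Unified Ideographs
]


def select_font(text):
    """Return a PyMuPDF built-in font name suited to the text's script."""
    for (lo, hi), font in CJK_FONT_RANGES:
        if any(lo <= ord(ch) <= hi for ch in text):
            return font
    return "helv"  # Latin fallback


def box_to_pdf(box, dpi):
    """Convert a four-point pixel-space OCR box to a PDF origin and font size.

    Returns ((x, y), fontsize) in PDF points, where (x, y) is the box's
    bottom-left corner (assumed here to serve as the text insertion point).
    """
    scale = 72 / dpi  # PDF points per rendered pixel
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    origin = (min(xs) * scale, max(ys) * scale)
    fontsize = (max(ys) - min(ys)) * scale
    return origin, fontsize


# With PyMuPDF, the result could then be passed along these lines:
#   page.insert_text(origin, txt, fontsize=fontsize,
#                    fontname=select_font(txt), render_mode=3)
```

Render mode 3 tells the PDF viewer to neither fill nor stroke the glyphs, which is why the text stays selectable and searchable while remaining invisible.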

Usage

from rapidocr_pdf import RapidOCRPDF

engine = RapidOCRPDF()
engine.to_searchable_pdf("scan.pdf", output_path="searchable.pdf")

rapidocr_pdf scan.pdf --output_pdf searchable.pdf



Initial plan

fd4d9fd

Copilot AI assigned Copilot and SWHL Mar 6, 2026

Copilot started work on behalf of SWHL March 6, 2026 08:14 View session

Add to_searchable_pdf: overlay invisible OCR text layer onto image-ba…

adcc8c1

…sed PDFs Co-authored-by: SWHL <28639377+SWHL@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Research how to annotate PDF results using Python~~ Add to_searchable_pdf: overlay invisible OCR text layer onto image-based PDFs Mar 6, 2026

Copilot finished work on behalf of SWHL March 6, 2026 08:25

Copilot wants to merge 2 commits into main from copilot/research-pdf-annotation-python
