Add to_searchable_pdf: overlay invisible OCR text layer onto image-based PDFs #29
Draft
…sed PDFs Co-authored-by: SWHL <28639377+SWHL@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Research how to annotate PDF results using Python" to "Add to_searchable_pdf: overlay invisible OCR text layer onto image-based PDFs" on Mar 6, 2026
Image-only PDF pages have no selectable/searchable text. This adds `RapidOCRPDF.to_searchable_pdf()`, which runs OCR on such pages and writes the results back as an invisible (`render_mode=3`) text layer, making the PDF fully text-searchable without altering its visual appearance.

Implementation

- `_CJK_FONT_RANGES`: module-level constant mapping Unicode ranges to PyMuPDF built-in font names (`china-s`, `japan`, `korea`)
- `_select_font(text)`: static helper; picks the appropriate built-in font for the text's script, falling back to `helv` for Latin
- `_insert_ocr_text(page, box, txt, scale)`: static helper that converts pixel-space OCR bounding boxes to PDF points (`scale = 72 / dpi`) and calls `page.insert_text(..., render_mode=3)`
- `to_searchable_pdf(content, output_path, force_ocr, page_num_list) → bytes`: main new API; reuses `extract_texts` to identify image-only pages, OCRs them, overlays invisible text, and optionally writes the result to disk
- `--output_pdf <path>` flag added to `parse_args`; `main()` routes to `to_searchable_pdf` when it is set

Usage
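
To illustrate the two helper steps listed under Implementation, here is a minimal pure-Python sketch; it is not the PR's actual code. The names `CJK_FONT_RANGES`, `pick_font`, and `box_to_points`, the specific Unicode ranges, the priority order of the checks, and the four-corner box layout are all assumptions made for illustration. Only the font names (`china-s`, `japan`, `korea`, `helv`) and the `scale = 72 / dpi` conversion come from the description.

```python
# Hypothetical range table: (first code point, last code point, built-in font name).
# Kana and Hangul are checked before the shared ideograph block so that
# Japanese text mixing kanji and kana is not classified as Chinese.
CJK_FONT_RANGES = [
    (0x3040, 0x30FF, "japan"),    # Hiragana and Katakana
    (0x1100, 0x11FF, "korea"),    # Hangul Jamo
    (0xAC00, 0xD7AF, "korea"),    # Hangul syllables
    (0x4E00, 0x9FFF, "china-s"),  # CJK Unified Ideographs
]


def pick_font(text):
    """Return a font name for `text`, falling back to 'helv' for Latin text."""
    for start, end, font in CJK_FONT_RANGES:
        if any(start <= ord(ch) <= end for ch in text):
            return font
    return "helv"


def box_to_points(box, dpi):
    """Convert a pixel-space box (assumed: four [x, y] corners) to PDF points.

    Returns (x0, y0, x1, y1) with the origin at the top-left of the page,
    using scale = 72 / dpi because one PDF point is 1/72 inch.
    """
    scale = 72.0 / dpi
    xs = [p[0] * scale for p in box]
    ys = [p[1] * scale for p in box]
    return min(xs), min(ys), max(xs), max(ys)


# Example: one OCR result from a page rendered at 144 dpi (scale = 0.5).
sample_box = [[100, 200], [500, 200], [500, 260], [100, 260]]
x0, y0, x1, y1 = box_to_points(sample_box, dpi=144)
font = pick_font("日本語のテキスト")
font_size = y1 - y0  # box height in points, a rough stand-in for font size
```

In this sketch the text would be anchored at `(x0, y1)`, because PyMuPDF's `page.insert_text` takes the bottom-left position of the first character as its insertion point. From the command line, the description above indicates the same feature is reached by passing `--output_pdf <path>`.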