Add text_image modality support for text elements in multimodal embedder #1305

Open
edknv wants to merge 10 commits into NVIDIA:main from edknv:edwardk/llama-nemotron-embed-vl-1b-v2

@edknv edknv commented Jan 15, 2026

Description

This PR enables the ingestion pipeline to pair extracted text with its source image for multimodal embedding (e.g., using llama-3.2-nemoretriever-1b-vlm-embed-v1). It also adds logic to aggregate all text content from a PDF page and attach it to the full-page image, which yields the best recall on multiple datasets (bo10k, earnings). Although this path improves recall, it is not the default because its throughput is lower.
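The page-level aggregation described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual implementation: the flat list of dicts, the keys `type`, `page`, `text`, and `page_text`, and the lowercase type names are all hypothetical stand-ins for whatever the pipeline uses internally.

```python
from collections import defaultdict


def aggregate_page_content(elements):
    """Attach the concatenated text of each page to that page's image entry.

    `elements` is a hypothetical list of dicts with the keys "type", "page",
    and (for text-bearing entries) "text". The real pipeline schema may differ.
    """
    text_by_page = defaultdict(list)

    # First pass: collect text from text and structured (table/chart) elements.
    for el in elements:
        if el["type"] in ("text", "structured") and el.get("text"):
            text_by_page[el["page"]].append(el["text"])

    # Second pass: attach the joined text to the matching page image entry.
    for el in elements:
        if el["type"] == "page_image":
            el["page_text"] = "\n\n".join(text_by_page.get(el["page"], []))

    return elements
```

Collecting in one pass and attaching in a second keeps the result independent of element order, so a page image that appears before its text elements still receives all of that page's text.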

  • Added nvidia/llama-nemotron-embed-vl-1b-v2 to the list of supported multimodal models.
  • The OCR extractor now preserves the original base64 image in metadata (as source_image) before it gets overwritten by text, allowing the embedding stage to access the visual context.
  • Implemented _aggregate_page_content, which collects all text from TEXT and STRUCTURED (tables/charts) elements on a given page and attaches it to the PAGE_IMAGE entry.
  • Added flags (embed_text_elements, etc.) to allow users to skip embedding individual text snippets if they are already being embedded as part of an aggregated page image.
  • Added logic to automatically enable page aggregation if text_image modality is requested and page images are present.
  • Added logic to detect if a base64 string is JPEG or PNG based on magic bytes, ensuring correctly formatted data URIs for embedding models.
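As an illustration of the magic-byte check in the last item, here is a minimal sketch that assumes the image arrives as a plain base64 string with no data URI prefix. The function names `detect_image_mime` and `to_data_uri` and the PNG fallback are hypothetical and may not match the PR's code.

```python
import base64
import binascii

# Standard file signatures: PNG starts with these 8 bytes, JPEG with FF D8 FF.
_PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
_JPEG_MAGIC = b"\xff\xd8\xff"


def detect_image_mime(b64_image: str, default: str = "image/png") -> str:
    """Guess the MIME type of a base64-encoded image from its leading bytes."""
    try:
        # 16 base64 characters decode to 12 bytes, enough for both signatures.
        header = base64.b64decode(b64_image[:16])
    except (binascii.Error, ValueError):
        return default
    if header.startswith(_PNG_MAGIC):
        return "image/png"
    if header.startswith(_JPEG_MAGIC):
        return "image/jpeg"
    return default


def to_data_uri(b64_image: str) -> str:
    """Build a data URI whose MIME type matches the actual image bytes."""
    return f"data:{detect_image_mime(b64_image)};base64,{b64_image}"
```

Decoding only the first 16 characters avoids decoding the whole image just to read its signature.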

Usage

ingestor = (
    Ingestor(message_client_hostname="nv-ingest-ms-runtime")
    .files(docs)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_infographics=True,
        extract_images=False,
        text_depth="page",
        table_output_format="markdown",
        extract_page_as_image=True,
    )
    .embed(
        image_elements_modality="text_image",
    )
)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured they are mirrored in the Helm values.yaml file?

@edknv edknv requested a review from ChrisJar January 20, 2026 18:52
@edknv edknv marked this pull request as ready for review January 27, 2026 17:36
@edknv edknv requested a review from a team as a code owner January 27, 2026 17:36