Add text_image modality support for text elements in multimodal embedder by edknv · Pull Request #1305 · NVIDIA/nv-ingest

edknv · 2026-01-15T05:55:55Z

Description

This PR enables the ingestion pipeline to pair extracted text with its source image for multimodal embedding (e.g., using llama-3.2-nemoretriever-1b-vlm-embed-v1). It also adds logic to aggregate all text content from a PDF page and attach it to a full-page image, which yields the best recall on multiple datasets (bo10k, earnings). Despite the improved recall of this path, it is not set as default due to lower throughput.
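The page-level aggregation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's actual `_aggregate_page_content`: the flat dictionaries, the `type`/`page`/`content`/`text` keys, and the constant values are all hypothetical stand-ins for the pipeline's real metadata schema.

```python
from collections import defaultdict

# Hypothetical type labels and record layout; the real pipeline's schema differs.
TEXT, STRUCTURED, PAGE_IMAGE = "text", "structured", "page_image"


def aggregate_page_content(elements):
    """Attach the text of each page's TEXT/STRUCTURED elements to its PAGE_IMAGE."""
    # First pass: collect text per page, preserving element order.
    text_by_page = defaultdict(list)
    for el in elements:
        if el["type"] in (TEXT, STRUCTURED) and el.get("content"):
            text_by_page[el["page"]].append(el["content"])

    # Second pass: store the joined text next to each full-page image.
    for el in elements:
        if el["type"] == PAGE_IMAGE:
            el["text"] = "\n\n".join(text_by_page.get(el["page"], []))
    return elements


elements = [
    {"type": TEXT, "page": 0, "content": "Intro paragraph"},
    {"type": STRUCTURED, "page": 0, "content": "| a | b |"},
    {"type": TEXT, "page": 1, "content": "Text on another page"},
    {"type": PAGE_IMAGE, "page": 0, "content": "<base64 image>"},
]
aggregate_page_content(elements)
```

In this sketch the page image keeps its own image payload in `content` and gains the aggregated text in a separate field, so an embedder could consume both together as a text-image pair.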

Added nvidia/llama-nemotron-embed-vl-1b-v2 to the list of supported multimodal models.
The OCR extractor now preserves the original base64 image in metadata (as source_image) before it gets overwritten by text, allowing the embedding stage to access the visual context.
Implemented _aggregate_page_content which collects all text from TEXT and STRUCTURED (tables/charts) elements on a specific page and attaches it to the PAGE_IMAGE entry.
Added flags (embed_text_elements, etc.) to allow users to skip embedding individual text snippets if they are already being embedded as part of an aggregated page image.
Added logic to automatically enable page aggregation if text_image modality is requested and page images are present.
Added logic to detect if a base64 string is JPEG or PNG based on magic bytes, ensuring correctly formatted data URIs for embedding models.
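For the image-format detection in the last item, a minimal sketch of magic-byte sniffing might look like the following. The function names and the PNG fallback are assumptions made for illustration and are not taken from the PR; only the JPEG and PNG file signatures themselves are standard.

```python
import base64
from typing import Optional

JPEG_MAGIC = b"\xff\xd8\xff"       # start-of-image marker plus next marker prefix
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # 8-byte PNG signature


def detect_image_mime(b64_image: str) -> Optional[str]:
    """Return the MIME type implied by the leading bytes, or None if unknown."""
    # 16 base64 characters decode to 12 bytes, enough for both signatures.
    # Assumes the string has no data-URI prefix or embedded whitespace.
    head = base64.b64decode(b64_image[:16])
    if head.startswith(JPEG_MAGIC):
        return "image/jpeg"
    if head.startswith(PNG_MAGIC):
        return "image/png"
    return None


def to_data_uri(b64_image: str, default_mime: str = "image/png") -> str:
    """Build a data URI, falling back to default_mime (an assumed default)."""
    mime = detect_image_mime(b64_image) or default_mime
    return f"data:{mime};base64,{b64_image}"
```

Decoding only the first few characters avoids decoding the whole image just to read its header.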

Usage

ingestor = (
    Ingestor(message_client_hostname="nv-ingest-ms-runtime")
    .files(docs)
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_infographics=True,
        extract_images=False,
        text_depth="page",
        table_output_format="markdown",
        extract_page_as_image=True,
    )
    .embed(
        image_elements_modality="text_image",
    )
)

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

edknv and others added 8 commits January 14, 2026 21:53

Add text_image modality support for text elements in multimodal embedding

4c548e9

Merge branch 'main' into edwardk/llama-nemotron-embed-vl-1b-v2

47c19ce

check image format

d4fb345

aggregate contents for page image

92f907d

simplify page image usage

eaf38c2

Merge branch 'main' into edwardk/llama-nemotron-embed-vl-1b-v2

ebc7745

lint

33fdc0c

Merge branch 'edwardk/llama-nemotron-embed-vl-1b-v2' of github.com:edknv/nv-ingest into edwardk/llama-nemotron-embed-vl-1b-v2

7fa4cb5

edknv requested a review from ChrisJar January 20, 2026 18:52

Merge branch 'main' into edwardk/llama-nemotron-embed-vl-1b-v2

be4dc64

edknv marked this pull request as ready for review January 27, 2026 17:36

edknv requested a review from a team as a code owner January 27, 2026 17:36

Merge branch 'main' into edwardk/llama-nemotron-embed-vl-1b-v2

000a80d

edknv wants to merge 10 commits into NVIDIA:main from edknv:edwardk/llama-nemotron-embed-vl-1b-v2
