Conversation

@qchapp (Member) commented Dec 19, 2025

This pull request adds robust multimodal (image + text) support to MIRAGE, enabling vision-language models (VLMs) to process datasets that contain both images and text. The changes cover configuration, input handling, batch processing, and documentation, making MIRAGE compatible with datasets that include either embedded images or image file paths. The code also now gracefully handles empty shards, and an example configuration for a medical imaging dataset is included.

Multimodal (Image) Support:

  • Added support for image inputs in both configuration and processing, including handling of embedded images (PIL) and path-based images with a configurable image_base_path.
  • Implemented resolve_image_input to robustly resolve image paths, URLs, and embedded objects for SGLang compatibility.
  • Modified the batch processing logic to handle multimodal prompts, with per-example calls for image-containing batches and efficient batching for text-only cases.
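The path-resolution behaviour described above can be sketched as follows. This is an illustrative reimplementation, not MIRAGE's actual resolve_image_input; the argument names and the fallthrough order are assumptions:

```python
import os
from urllib.parse import urlparse


def resolve_image_input(value, image_base_path=None):
    """Sketch: normalize an image input (embedded object, URL, or path)."""
    if not isinstance(value, str):
        # Embedded objects (e.g. a decoded PIL image) pass through untouched.
        return value
    if urlparse(value).scheme in ("http", "https"):
        # Remote URLs are forwarded as-is for the inference backend to fetch.
        return value
    if os.path.isabs(value):
        # Absolute paths are used directly.
        return value
    if image_base_path is not None:
        # Relative paths resolve against the configured base directory.
        return os.path.join(image_base_path, value)
    return value
```

A caller would then pass the resolved value straight into the prompt builder, regardless of which form the dataset provided.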

Configuration and Documentation:

  • Updated the README.md with detailed instructions and examples for configuring and using multimodal (image + text) datasets, covering both embedded and path-based image scenarios.
  • Added a sample configuration file (config_pmc_oa.yaml) for a medical imaging dataset using a vision-language model, demonstrating new multimodal features.

General Improvements and Maintenance:

  • Added graceful handling of empty shards in distributed processing, so a shard with zero samples produces no errors.
  • Removed an unused Markdown prompt from prompts.py.
  • Minor refactoring and import changes for improved type handling and PIL image support.
  • Updated run.sh to simplify configuration and output directory handling.
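The empty-shard guard amounts to an early return before any batch is formed. A minimal, dataset-agnostic sketch (plain Python lists stand in for the actual shard objects):

```python
def iter_batches(samples, batch_size):
    """Sketch: yield fixed-size batches; an empty shard yields nothing."""
    if not samples:
        # Empty shard: return immediately instead of erroring downstream.
        return
    for start in range(0, len(samples), batch_size):
        yield samples[start : start + batch_size]
```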

These changes significantly enhance MIRAGE's flexibility for multimodal data and improve its usability for a wider range of datasets and models.

Copilot AI (Contributor) left a comment
Pull request overview

This pull request adds comprehensive multimodal (image + text) support to MIRAGE, enabling the framework to process datasets containing images alongside text using Vision-Language Models (VLMs). The changes introduce image input handling, path resolution for external image files, and batch processing logic that accommodates both embedded PIL Images and path-based images.

Key changes include:

  • Extended configuration schema with type: image and image_base_path fields for input variables to support both embedded and path-based images
  • Implemented resolve_image_input function to handle various image input formats (PIL Images, URLs, absolute/relative paths)
  • Modified batch processing to detect multimodal inputs and route them through per-example generation for compatibility with VLM APIs
  • Added graceful handling of empty shards in distributed processing
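Under the schema described above, the extended input-variable configuration might look roughly like the following dataclass sketch. Field defaults are assumptions, not the actual src/mirage/config.py:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputVar:
    """Sketch of an input-variable schema with image support."""
    name: str
    # "text" (default) or "image"; assumed values, mirroring `type: image`.
    type: str = "text"
    # Root directory prepended to relative image paths, when set.
    image_base_path: Optional[str] = None

    def is_image(self) -> bool:
        return self.type == "image"
```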

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| src/mirage/config.py | Added type and image_base_path fields to the InputVar dataclass, with an is_image() helper method for image input identification |
| src/mirage/utils.py | Implemented image path resolution logic, added PIL Image imports, and updated template filling to preserve non-string objects such as images |
| src/mirage/shard_process.py | Added a multimodal prompt builder, modified batch processing to handle image inputs with per-example generation, and added empty shard handling |
| src/mirage/prompts.py | Removed the unused ASSISTANT_ONLY_MD_PROMPT constant |
| run.sh | Simplified the script by removing hardcoded output directory variables |
| configs/config_pmc_oa.yaml | Added an example configuration for a medical imaging dataset demonstrating multimodal features with a Qwen3-VL model |
| README.md | Added a documentation section explaining multimodal usage, with examples for both embedded and path-based images |
Comments suppressed due to low confidence (1)

src/mirage/utils.py:6

  • Import of 'Path' is not used: `from pathlib import Path`


    ds_shard.save_to_disk(shard_out_dir)
    try:
        llm.shutdown()
    except Exception:
        pass
Copilot AI commented Dec 19, 2025

'except' clause does nothing but pass and there is no explanatory comment.

Contributor
Yeah add a warning here probably

Member Author
Ok I'll do it
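One way to act on that suggestion is to log the failure instead of silently passing. A sketch, where shutdown_llm and the logger name are illustrative, not MIRAGE's actual code:

```python
import logging

logger = logging.getLogger("mirage.shard_process")  # name is illustrative


def shutdown_llm(llm):
    """Sketch: best-effort shutdown that logs instead of silently passing."""
    try:
        llm.shutdown()
    except Exception as exc:
        # Shutdown failures should not crash the run, but they should be
        # visible in the logs rather than swallowed by a bare `pass`.
        logger.warning("LLM shutdown failed: %s", exc)
```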

Copilot AI commented Dec 19, 2025

@qchapp I've opened a new pull request, #5, to work on those changes. Once the pull request is ready, I'll request review from you.

* Initial plan

* Make chat template configurable for multimodal models

Co-authored-by: qchapp <[email protected]>

* Update documentation for configurable chat template

Co-authored-by: qchapp <[email protected]>

* Address code review feedback: improve chat template validation and inference

Co-authored-by: qchapp <[email protected]>

* Improve chat template inference and add early validation

Co-authored-by: qchapp <[email protected]>

* Remove infer_chat_template method, make chat_template explicit in config

Co-authored-by: qchapp <[email protected]>

* Improve error message and simplify comment

Co-authored-by: qchapp <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: qchapp <[email protected]>
@qchapp qchapp requested review from fabnemEPFL and removed request for MichelDucartier and fabnemEPFL December 19, 2025 14:43

Copilot AI commented Dec 19, 2025

@qchapp I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

* Initial plan

* Optimize batch processing by separating text-only and multimodal samples

Co-authored-by: qchapp <[email protected]>

* Optimize chat template validation to run once per batch

Co-authored-by: qchapp <[email protected]>

* Enable batched processing for multimodal samples

Co-authored-by: qchapp <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: qchapp <[email protected]>
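The text-only/multimodal separation described in the first commit above could be sketched like this; split_batch and the image_keys parameter are hypothetical names, not MIRAGE's actual API:

```python
def split_batch(batch, image_keys):
    """Sketch: partition a batch into text-only and multimodal samples."""
    text_only, multimodal = [], []
    for sample in batch:
        # A sample is multimodal if any configured image field is populated.
        if any(sample.get(key) is not None for key in image_keys):
            multimodal.append(sample)
        else:
            text_only.append(sample)
    return text_only, multimodal
```

Text-only samples can then go through the efficient batched path while multimodal ones take the VLM route.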
@qchapp (Member Author) commented Jan 7, 2026

I will test the new changes before merging.

@qchapp (Member Author) commented Jan 7, 2026

I tested again on my small test and it worked:

    >>> df[0]
    {'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=800x600 at 0x4001A117FCB0>, 'caption': 'A translucent human torso reveals a vividly detailed heart at its core, pulsing with life as intricate networks of arteries and veins—colored in vibrant reds and blues—radiate outward, illustrating the vital circulatory system that sustains the body.', 'original_caption': 'A heart.'}

Here is the test image:

