Feature/multi modal #4
base: main
Conversation
Pull request overview
This pull request adds comprehensive multimodal (image + text) support to MIRAGE, enabling the framework to process datasets containing images alongside text using Vision-Language Models (VLMs). The changes introduce image input handling, path resolution for external image files, and batch processing logic that accommodates both embedded PIL Images and path-based images.
Key changes include:
- Extended the configuration schema with `type: image` and `image_base_path` fields for input variables, supporting both embedded and path-based images
- Implemented a `resolve_image_input` function to handle various image input formats (PIL Images, URLs, absolute/relative paths); a sketch follows this list
- Modified batch processing to detect multimodal inputs and route them through per-example generation for compatibility with VLM APIs
- Added graceful handling of empty shards in distributed processing
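As a rough sketch of the resolution order described above (the real implementation lives in `src/mirage/utils.py` and may differ; the exact signature is an assumption):

```python
from pathlib import Path

from PIL import Image


def resolve_image_input(value, image_base_path=None):
    """Sketch only: resolve an image input to something a VLM backend accepts.

    Covers the cases named in this PR: embedded PIL Images, URLs, and
    absolute/relative file paths (relative ones joined with image_base_path).
    """
    # Embedded PIL Images (e.g. decoded by an HF dataset) pass through as-is.
    if isinstance(value, Image.Image):
        return value
    # Remote images are left untouched for the backend to fetch.
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return value
    # Local paths: absolute paths are used directly, relative paths are
    # resolved against the configured image_base_path when one is set.
    path = Path(value)
    if not path.is_absolute() and image_base_path is not None:
        path = Path(image_base_path) / path
    return str(path)
```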
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/mirage/config.py | Added type and image_base_path fields to InputVar dataclass with is_image() helper method for image input identification |
| src/mirage/utils.py | Implemented image path resolution logic, added PIL Image imports, and updated template filling to preserve non-string objects like images |
| src/mirage/shard_process.py | Added multimodal prompt builder, modified batch processing to handle image inputs with per-example generation, and added empty shard handling |
| src/mirage/prompts.py | Removed unused ASSISTANT_ONLY_MD_PROMPT constant |
| run.sh | Simplified script by removing hardcoded output directory variables |
| configs/config_pmc_oa.yaml | Added example configuration for medical imaging dataset demonstrating multimodal features with Qwen3-VL model |
| README.md | Added comprehensive documentation section explaining multimodal usage with examples for both embedded and path-based images |
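Judging from the `src/mirage/config.py` row above, the extended dataclass plausibly looks like the following (field defaults and typing are assumptions, not taken from the diff):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputVar:
    name: str
    type: str = "text"                     # "image" marks a multimodal input
    image_base_path: Optional[str] = None  # root for resolving relative paths

    def is_image(self) -> bool:
        # Helper named in the file summary for identifying image inputs.
        return self.type == "image"
```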
Comments suppressed due to low confidence (1)
src/mirage/utils.py:6
- Import of 'Path' is not used.
`from pathlib import Path`
src/mirage/shard_process.py (Outdated)

```python
ds_shard.save_to_disk(shard_out_dir)
try:
    llm.shutdown()
except Exception:
    pass
```
Copilot AI · Dec 19, 2025
'except' clause does nothing but pass and there is no explanatory comment.
Yeah add a warning here probably
Ok I'll do it
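For reference, the agreed-upon fix would look something like this (logger setup assumed; `llm` comes from the surrounding shard-processing code):

```python
import logging

logger = logging.getLogger(__name__)

try:
    llm.shutdown()
except Exception as exc:
    # Warn instead of silently swallowing shutdown failures,
    # as suggested in the review.
    logger.warning("llm.shutdown() failed: %s", exc)
```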
Co-authored-by: Copilot <[email protected]>
* Initial plan
* Make chat template configurable for multimodal models
* Update documentation for configurable chat template
* Address code review feedback: improve chat template validation and inference
* Improve chat template inference and add early validation
* Remove infer_chat_template method, make chat_template explicit in config
* Improve error message and simplify comment

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: qchapp <[email protected]>
* Initial plan
* Optimize batch processing by separating text-only and multimodal samples
* Optimize chat template validation to run once per batch
* Enable batched processing for multimodal samples

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: qchapp <[email protected]>
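A rough illustration of the text/multimodal split those commits describe (hypothetical helper; the real logic lives in `src/mirage/shard_process.py`):

```python
def split_batch(examples, input_vars):
    """Sketch: separate text-only samples from ones carrying images.

    Text-only samples can use plain batched generation, while samples
    with image inputs are grouped for the VLM path.
    """
    image_vars = [v for v in input_vars if v.is_image()]
    text_only, multimodal = [], []
    for ex in examples:
        has_image = any(ex.get(v.name) is not None for v in image_vars)
        (multimodal if has_image else text_only).append(ex)
    return text_only, multimodal
```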
I will test the new changes before merging.
I tested again on my small test and it worked:

This pull request adds robust multimodal (image + text) support to MIRAGE, enabling the processing of datasets containing both images and text with vision-language models (VLMs). The changes cover configuration, input handling, batch processing, and documentation, making MIRAGE compatible with datasets containing embedded images or image file paths. Additionally, the code now gracefully handles empty shards and includes an example configuration for a medical imaging dataset.
Multimodal (Image) Support:
- Extended input variable configuration with `type: image` and an optional `image_base_path`. [1] [2] [3]
- Added `resolve_image_input` to robustly resolve image paths, URLs, and embedded objects for SGLang compatibility.

Configuration and Documentation:
- Updated `README.md` with detailed instructions and examples for configuring and using multimodal (image + text) datasets, including both embedded and path-based image scenarios. [1] [2]
- Added an example configuration (`config_pmc_oa.yaml`) for a medical imaging dataset using a vision-language model, demonstrating the new multimodal features.

General Improvements and Maintenance:
- Removed the unused `ASSISTANT_ONLY_MD_PROMPT` constant from `prompts.py`.
- Updated `run.sh` to simplify configuration and output directory handling.

These changes significantly enhance MIRAGE's flexibility for multimodal data and improve its usability for a wider range of datasets and models.