Support MSD dataset #59
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main      #59   +/-   ##
=======================================
  Coverage        ?   72.22%
=======================================
  Files           ?      106
  Lines           ?    11798
  Branches        ?     1054
=======================================
  Hits            ?     8521
  Misses          ?     3032
  Partials        ?      245
```

☔ View full report in Codecov by Sentry.
Pull request overview
This pull request adds support for the Medical Segmentation Decathlon (MSD) dataset to the itkit framework. The MSD dataset includes 10 different medical imaging segmentation tasks across various anatomical structures and imaging modalities.
Changes:
- Added MSD dataset classes for 10 tasks (Brain Tumor, Heart, Liver, Hippocampus, Prostate, Lung, Pancreas, Hepatic Vessel, Spleen, and Colon)
- Provided metadata JSON file with dataset information including labels, modalities, and references
- Included a conversion script to reorganize MSD dataset directory structure
- Removed git submodules configuration
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| itkit/dataset/MSD/mm_dataset.py | Defines dataset classes for all 10 MSD tasks, with both SeriesVolumeDataset and PatchedDataset variants |
| itkit/dataset/MSD/dataset.json | Contains metadata for all 10 MSD tasks including labels, modalities, references, and dataset sizes |
| itkit/dataset/MSD/convert.py | Provides utility script to reorganize MSD dataset from original structure to itkit's expected format |
| itkit/dataset/MSD/__init__.py | Exports all MSD dataset classes for external use |
| .gitmodules | Removes git submodules configuration (mmengine, mmsegmentation, mmpretrain) |
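To make the conversion step concrete, here is a minimal sketch of the kind of directory reorganization convert.py is described as performing, assuming the stock MSD layout (imagesTr/labelsTr) and an image/label target layout. The helper name, directory names, and file pattern are illustrative assumptions, not the script's confirmed interface.

```python
# Hypothetical sketch of an MSD task reorganization; directory names and the
# convert_task helper are assumptions for illustration, not itkit's actual API.
import shutil
from pathlib import Path


def convert_task(task_root: Path) -> None:
    """Move files from the MSD imagesTr/labelsTr folders into image/label folders."""
    for src_name, dst_name in (("imagesTr", "image"), ("labelsTr", "label")):
        src_dir = task_root / src_name
        dst_dir = task_root / dst_name
        dst_dir.mkdir(parents=True, exist_ok=True)
        for item in sorted(src_dir.glob("*.nii.gz")):
            target = dst_dir / item.name
            if target.exists():
                print(f"Warning: {item.name} already exists in '{dst_name}', skipping.")
            else:
                shutil.move(str(item), str(target))


# Example (hypothetical path):
# convert_task(Path("/data/MSD/Task09_Spleen"))
```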
| "tensorImageSize": "3D", | ||
| "reference": "King’s College London", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Left and right hippocampus segmentation", | ||
| "reference": " Vanderbilt University Medical Center", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Prostate transitional zone and peripheral zone segmentation", | ||
| "reference": "Radboud University, Nijmegen Medical Centre", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Lung and cancer segmentation", | ||
| "reference": "The Cancer Imaging Archive", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Pancreas and cancer segmentation", | ||
| "reference": "Memorial Sloan Kettering Cancer Center ", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
```python
    if not JSON_PATH.exists():
        return {}
    with open(JSON_PATH, encoding="utf-8") as f:
        return json.load(f)


MSD_METADATA = _load_metadata()


def _get_classes(task_id: str) -> list[str]:
    labels = MSD_METADATA.get(task_id, {}).get("labels", {})
    return [labels[k] for k in sorted(labels.keys(), key=int)]
```
Copilot AI · Jan 16, 2026
The _get_classes function will fail if MSD_METADATA is empty or if dataset.json is missing. When JSON_PATH doesn't exist, _load_metadata returns an empty dict, causing _get_classes to return an empty list for classes. This could lead to runtime errors when the dataset classes are instantiated with invalid METAINFO. Consider adding validation or error handling to ensure the JSON file exists and contains the expected task data.
Suggested change (replacing the metadata loader body and `_get_classes` shown above):

```python
    """Load dataset metadata from JSON_PATH.

    Raises:
        FileNotFoundError: If the metadata JSON file does not exist.
        ValueError: If the JSON cannot be parsed or does not contain a dict.
    """
    if not JSON_PATH.exists():
        raise FileNotFoundError(f"Dataset metadata file not found: {JSON_PATH}")
    try:
        with open(JSON_PATH, encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse dataset metadata JSON at {JSON_PATH}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"Dataset metadata in {JSON_PATH} must be a JSON object (dict)")
    return data


MSD_METADATA = _load_metadata()


def _get_classes(task_id: str) -> list[str]:
    """Return the ordered list of class labels for a given task.

    Raises:
        KeyError: If the task_id is not present in the metadata.
        ValueError: If labels are missing, empty, or have invalid keys.
    """
    task_meta = MSD_METADATA.get(task_id)
    if task_meta is None:
        raise KeyError(f"Task id {task_id!r} not found in dataset metadata")
    labels = task_meta.get("labels")
    if not isinstance(labels, dict) or not labels:
        raise ValueError(f"Missing or invalid 'labels' for task {task_id!r} in dataset metadata")
    try:
        sorted_keys = sorted(labels.keys(), key=int)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Label keys for task {task_id!r} must be strings convertible to integers"
        ) from exc
    return [labels[k] for k in sorted_keys]
```
```python
def _get_classes(task_id: str) -> list[str]:
    labels = MSD_METADATA.get(task_id, {}).get("labels", {})
    return [labels[k] for k in sorted(labels.keys(), key=int)]
```
Copilot AI · Jan 16, 2026
The sorted call uses key=int which will fail if any label key cannot be converted to an integer. This could raise a ValueError at module import time if the JSON contains non-numeric label keys. Consider adding error handling or validation for the label keys format.
Suggested change:

```python
    try:
        sorted_keys = sorted(labels.keys(), key=int)
    except (ValueError, TypeError):
        # Fallback: sort keys as strings if they cannot be converted to int
        sorted_keys = sorted(labels.keys())
    return [labels[k] for k in sorted_keys]
```
```python
if target.exists():
    print(f" Warning: {item.name} already exists in 'image', skipping.")
else:
    shutil.move(str(item), str(target))
```
Copilot AI · Jan 16, 2026
Using shutil.move on potentially large medical image files without error handling could lead to data loss if the operation fails mid-transfer. Consider copying files first and only removing the source after verification, or at least wrap the move operation in a try-except block with appropriate error handling and recovery.
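A minimal sketch of the copy-then-verify pattern suggested here (the same applies to the label-file move reviewed below); the safe_move helper and the size check are illustrative assumptions, and a checksum or the project's preferred verification could be substituted:

```python
import shutil
from pathlib import Path


def safe_move(item: Path, target: Path) -> None:
    """Copy the file, verify the copy, and only then remove the source."""
    try:
        shutil.copy2(item, target)
        # Cheap sanity check; a checksum would be stricter but slower on large volumes.
        if target.stat().st_size != item.stat().st_size:
            target.unlink()
            print(f" Error: size mismatch copying {item.name}; source left in place.")
            return
        item.unlink()  # remove the source only after the copy checks out
    except OSError as exc:
        print(f" Error moving {item.name}: {exc}; source left in place.")
```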
```python
if target.exists():
    print(f" Warning: {item.name} already exists in 'label', skipping.")
else:
    shutil.move(str(item), str(target))
```
Copilot AI · Jan 16, 2026
Using shutil.move on potentially large medical image files without error handling could lead to data loss if the operation fails mid-transfer. Consider copying files first and only removing the source after verification, or at least wrap the move operation in a try-except block with appropriate error handling and recovery.
```python
from .mm_dataset import (
    Task01_BrainTumour_Mha,
    Task01_BrainTumour_Patch,
    Task02_Heart_Mha,
    Task02_Heart_Patch,
    Task03_Liver_Mha,
    Task03_Liver_Patch,
    Task04_Hippocampus_Mha,
    Task04_Hippocampus_Patch,
    Task05_Prostate_Mha,
    Task05_Prostate_Patch,
    Task06_Lung_Mha,
    Task06_Lung_Patch,
    Task07_Pancreas_Mha,
    Task07_Pancreas_Patch,
    Task08_HepaticVessel_Mha,
    Task08_HepaticVessel_Patch,
    Task09_Spleen_Mha,
    Task09_Spleen_Patch,
    Task10_Colon_Mha,
    Task10_Colon_Patch,
)
```
Copilot AI · Jan 16, 2026
The new MSD dataset classes lack test coverage. Other datasets in the repository have test coverage in tests/dataset/test_dataset_registry.py with parametrized tests for common metainfo validation. Consider adding similar test coverage for the MSD dataset classes.
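A sketch of what such a parametrized test could look like, modeled loosely on the registry tests mentioned above; the METAINFO layout and the "classes" key are assumptions about itkit's dataset convention rather than its confirmed API:

```python
# Hypothetical test sketch; the METAINFO structure and "classes" key are assumptions.
import pytest

from itkit.dataset.MSD import (
    Task02_Heart_Mha,
    Task09_Spleen_Mha,
    Task10_Colon_Mha,
)

MSD_DATASET_CLASSES = [Task02_Heart_Mha, Task09_Spleen_Mha, Task10_Colon_Mha]


@pytest.mark.parametrize("dataset_cls", MSD_DATASET_CLASSES)
def test_msd_metainfo_defines_classes(dataset_cls):
    metainfo = getattr(dataset_cls, "METAINFO", {})
    classes = metainfo.get("classes")
    assert classes, f"{dataset_cls.__name__} should define a non-empty 'classes' list"
    assert all(isinstance(name, str) for name in classes)
```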
As title.