Support MSD dataset #59
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main      #59   +/-   ##
=======================================
  Coverage        ?   72.22%
=======================================
  Files           ?      106
  Lines           ?    11798
  Branches        ?     1054
=======================================
  Hits            ?     8521
  Misses          ?     3032
  Partials        ?      245
```

☔ View full report in Codecov by Sentry.
Pull request overview
This pull request adds support for the Medical Segmentation Decathlon (MSD) dataset to the itkit framework. The MSD dataset includes 10 different medical imaging segmentation tasks across various anatomical structures and imaging modalities.
Changes:
- Added MSD dataset classes for 10 tasks (Brain Tumor, Heart, Liver, Hippocampus, Prostate, Lung, Pancreas, Hepatic Vessel, Spleen, and Colon)
- Provided metadata JSON file with dataset information including labels, modalities, and references
- Included a conversion script to reorganize MSD dataset directory structure
- Removed git submodules configuration
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| itkit/dataset/MSD/mm_dataset.py | Defines dataset classes for all 10 MSD tasks, with both SeriesVolumeDataset and PatchedDataset variants |
| itkit/dataset/MSD/dataset.json | Contains metadata for all 10 MSD tasks including labels, modalities, references, and dataset sizes |
| itkit/dataset/MSD/convert.py | Provides utility script to reorganize MSD dataset from original structure to itkit's expected format |
| itkit/dataset/MSD/__init__.py | Exports all MSD dataset classes for external use |
| .gitmodules | Removes git submodules configuration (mmengine, mmsegmentation, mmpretrain) |
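To make the conversion step concrete, here is a minimal sketch of the kind of directory reorganization convert.py is described as performing, assuming the stock MSD layout (imagesTr/labelsTr) and an image/label target layout. The helper name, directory names, and file pattern are illustrative assumptions, not the script's confirmed interface.

```python
# Hypothetical sketch of an MSD task reorganization; directory names and the
# convert_task helper are assumptions for illustration, not itkit's actual API.
import shutil
from pathlib import Path


def convert_task(task_root: Path) -> None:
    """Move files from the MSD imagesTr/labelsTr folders into image/label folders."""
    for src_name, dst_name in (("imagesTr", "image"), ("labelsTr", "label")):
        src_dir = task_root / src_name
        dst_dir = task_root / dst_name
        dst_dir.mkdir(parents=True, exist_ok=True)
        for item in sorted(src_dir.glob("*.nii.gz")):
            target = dst_dir / item.name
            if target.exists():
                print(f"Warning: {item.name} already exists in '{dst_name}', skipping.")
            else:
                shutil.move(str(item), str(target))


# Example (hypothetical path):
# convert_task(Path("/data/MSD/Task09_Spleen"))
```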
| "tensorImageSize": "3D", | ||
| "reference": "King’s College London", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Left and right hippocampus segmentation", | ||
| "reference": " Vanderbilt University Medical Center", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Prostate transitional zone and peripheral zone segmentation", | ||
| "reference": "Radboud University, Nijmegen Medical Centre", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Lung and cancer segmentation", | ||
| "reference": "The Cancer Imaging Archive", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
| "description": "Pancreas and cancer segmentation", | ||
| "reference": "Memorial Sloan Kettering Cancer Center ", | ||
| "licence": "CC-BY-SA 4.0", | ||
| "relase": "1.0 04/05/2018", |
Copilot AI · Jan 16, 2026
Corrected spelling of 'relase' to 'release'.
```python
    if not JSON_PATH.exists():
        return {}
    with open(JSON_PATH, encoding="utf-8") as f:
        return json.load(f)


MSD_METADATA = _load_metadata()


def _get_classes(task_id: str) -> list[str]:
    labels = MSD_METADATA.get(task_id, {}).get("labels", {})
    return [labels[k] for k in sorted(labels.keys(), key=int)]
```
Copilot AI · Jan 16, 2026
The _get_classes function will fail if MSD_METADATA is empty or if dataset.json is missing. When JSON_PATH doesn't exist, _load_metadata returns an empty dict, causing _get_classes to return an empty list for classes. This could lead to runtime errors when the dataset classes are instantiated with invalid METAINFO. Consider adding validation or error handling to ensure the JSON file exists and contains the expected task data.
Suggested change (replacing the metadata loader body and `_get_classes` shown above):

```python
    """Load dataset metadata from JSON_PATH.

    Raises:
        FileNotFoundError: If the metadata JSON file does not exist.
        ValueError: If the JSON cannot be parsed or does not contain a dict.
    """
    if not JSON_PATH.exists():
        raise FileNotFoundError(f"Dataset metadata file not found: {JSON_PATH}")
    try:
        with open(JSON_PATH, encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Failed to parse dataset metadata JSON at {JSON_PATH}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"Dataset metadata in {JSON_PATH} must be a JSON object (dict)")
    return data


MSD_METADATA = _load_metadata()


def _get_classes(task_id: str) -> list[str]:
    """Return the ordered list of class labels for a given task.

    Raises:
        KeyError: If the task_id is not present in the metadata.
        ValueError: If labels are missing, empty, or have invalid keys.
    """
    task_meta = MSD_METADATA.get(task_id)
    if task_meta is None:
        raise KeyError(f"Task id {task_id!r} not found in dataset metadata")
    labels = task_meta.get("labels")
    if not isinstance(labels, dict) or not labels:
        raise ValueError(f"Missing or invalid 'labels' for task {task_id!r} in dataset metadata")
    try:
        sorted_keys = sorted(labels.keys(), key=int)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Label keys for task {task_id!r} must be strings convertible to integers"
        ) from exc
    return [labels[k] for k in sorted_keys]
```
```python
def _get_classes(task_id: str) -> list[str]:
    labels = MSD_METADATA.get(task_id, {}).get("labels", {})
    return [labels[k] for k in sorted(labels.keys(), key=int)]
```
Copilot AI · Jan 16, 2026
The sorted call uses key=int which will fail if any label key cannot be converted to an integer. This could raise a ValueError at module import time if the JSON contains non-numeric label keys. Consider adding error handling or validation for the label keys format.
Suggested change:

```python
    try:
        sorted_keys = sorted(labels.keys(), key=int)
    except (ValueError, TypeError):
        # Fallback: sort keys as strings if they cannot be converted to int
        sorted_keys = sorted(labels.keys())
    return [labels[k] for k in sorted_keys]
```
```python
if target.exists():
    print(f" Warning: {item.name} already exists in 'image', skipping.")
else:
    shutil.move(str(item), str(target))
```
Copilot AI · Jan 16, 2026
Using shutil.move on potentially large medical image files without error handling could lead to data loss if the operation fails mid-transfer. Consider copying files first and only removing the source after verification, or at least wrap the move operation in a try-except block with appropriate error handling and recovery.
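A minimal sketch of the copy-then-verify pattern suggested here (the same applies to the label-file move reviewed below); the safe_move helper and the size check are illustrative assumptions, and a checksum or the project's preferred verification could be substituted:

```python
import shutil
from pathlib import Path


def safe_move(item: Path, target: Path) -> None:
    """Copy the file, verify the copy, and only then remove the source."""
    try:
        shutil.copy2(item, target)
        # Cheap sanity check; a checksum would be stricter but slower on large volumes.
        if target.stat().st_size != item.stat().st_size:
            target.unlink()
            print(f" Error: size mismatch copying {item.name}; source left in place.")
            return
        item.unlink()  # remove the source only after the copy checks out
    except OSError as exc:
        print(f" Error moving {item.name}: {exc}; source left in place.")
```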
```python
if target.exists():
    print(f" Warning: {item.name} already exists in 'label', skipping.")
else:
    shutil.move(str(item), str(target))
```
Copilot AI · Jan 16, 2026
Using shutil.move on potentially large medical image files without error handling could lead to data loss if the operation fails mid-transfer. Consider copying files first and only removing the source after verification, or at least wrap the move operation in a try-except block with appropriate error handling and recovery.
```python
from .mm_dataset import (
    Task01_BrainTumour_Mha,
    Task01_BrainTumour_Patch,
    Task02_Heart_Mha,
    Task02_Heart_Patch,
    Task03_Liver_Mha,
    Task03_Liver_Patch,
    Task04_Hippocampus_Mha,
    Task04_Hippocampus_Patch,
    Task05_Prostate_Mha,
    Task05_Prostate_Patch,
    Task06_Lung_Mha,
    Task06_Lung_Patch,
    Task07_Pancreas_Mha,
    Task07_Pancreas_Patch,
    Task08_HepaticVessel_Mha,
    Task08_HepaticVessel_Patch,
    Task09_Spleen_Mha,
    Task09_Spleen_Patch,
    Task10_Colon_Mha,
    Task10_Colon_Patch,
)
```
Copilot AI · Jan 16, 2026
The new MSD dataset classes lack test coverage. Other datasets in the repository have test coverage in tests/dataset/test_dataset_registry.py with parametrized tests for common metainfo validation. Consider adding similar test coverage for the MSD dataset classes.
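A sketch of what such a parametrized test could look like, modeled loosely on the registry tests mentioned above; the METAINFO layout and the "classes" key are assumptions about itkit's dataset convention rather than its confirmed API:

```python
# Hypothetical test sketch; the METAINFO structure and "classes" key are assumptions.
import pytest

from itkit.dataset.MSD import (
    Task02_Heart_Mha,
    Task09_Spleen_Mha,
    Task10_Colon_Mha,
)

MSD_DATASET_CLASSES = [Task02_Heart_Mha, Task09_Spleen_Mha, Task10_Colon_Mha]


@pytest.mark.parametrize("dataset_cls", MSD_DATASET_CLASSES)
def test_msd_metainfo_defines_classes(dataset_cls):
    metainfo = getattr(dataset_cls, "METAINFO", {})
    classes = metainfo.get("classes")
    assert classes, f"{dataset_cls.__name__} should define a non-empty 'classes' list"
    assert all(isinstance(name, str) for name in classes)
```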
As title.