A modular dataloader for patch-based whole-body lesion detection. The package is split into small components so the dataset, sampler and loader can be reused in other projects or adapted to different modalities.
This repository accompanies the DEMI 2025 workshop paper:
Instance‑Balanced Patch Sampling for Whole‑Body Lesion Segmentation
Joris Wuts, Jakub Ceranka, Jef Vandemeulebroucke, Frédéric Lecouvet
Open-access version and citation details will be added after the MICCAI 2025 conference.
Whole-body scans often contain many tiny lesions that make up less than 0.01% of the image volume. Conventional positive–negative patch sampling struggles in this setting, over-representing background and large lesions while missing small targets. The instance-balanced strategy samples patches per lesion instance, improving CPU data-loading efficiency, training speed and segmentation accuracy.
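The core idea can be sketched in a few lines. Everything below is illustrative (the function name and the `lesion_voxels` mapping are not the package API): patch centres are drawn per lesion *instance* first, then per voxel, so a tiny lesion is sampled as often as a huge one.

```python
import random

def sample_centers(lesion_voxels, n_patches, rng=None):
    """Instance-balanced sketch: pick a lesion instance uniformly first,
    then a voxel within it, so small lesions are sampled as often as
    large ones.  (Illustrative only; not the package API.)"""
    rng = rng or random.Random(0)
    instances = list(lesion_voxels)
    return [rng.choice(lesion_voxels[rng.choice(instances)])
            for _ in range(n_patches)]

# One 1000-voxel lesion and one 3-voxel lesion: each still receives
# roughly half of the sampled patch centres.
lesions = {1: [(0, 0, i) for i in range(1000)],
           2: [(9, 9, 0), (9, 9, 1), (9, 9, 2)]}
centers = sample_centers(lesions, 200)
```

Sampling voxels directly would give the small lesion only ~0.3% of the centres; sampling instances first gives it ~50%.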
ib_sampling exposes a few key utilities:
- `MedicalPatchDataset` – loads pre-extracted 3D patches and keeps a cache of frequently used patches in memory.
- `BalancedBatchSampler` – draws a balanced mix of positive and negative patches for each epoch.
- `get_loader` – convenience helper that builds training and optional validation dataloaders.
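The in-memory patch cache can be pictured as a small LRU store. The sketch below is a minimal stand-in, not the dataset's internal implementation:

```python
from collections import OrderedDict

class PatchCache:
    """Minimal LRU cache for loaded patches (illustrative sketch,
    not MedicalPatchDataset's actual internals)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key, load_fn):
        if key in self._store:
            self._store.move_to_end(key)      # mark as recently used
            return self._store[key]
        value = load_fn(key)                  # cache miss: read from disk
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least-recently used
        return value

# Track which keys actually hit the (fake) disk loader.
loads = []

def fake_load(key):
    loads.append(key)
    return key.upper()

cache = PatchCache(capacity=2)
cache.get("a", fake_load)
cache.get("a", fake_load)   # second access served from memory
cache.get("b", fake_load)
cache.get("c", fake_load)   # capacity exceeded: "a" is evicted
cache.get("a", fake_load)   # reloaded from "disk"
```

Frequently accessed patches stay in memory; only evicted or never-seen patches touch the disk again.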
The repository also provides a template `prepare_dataset.py` script that
converts raw volumes into patch datasets compatible with the dataloader.
Before running the preparation script, organise the raw data into separate
train and val directories. Each acquisition lives in its own folder:
```
raw_dataset/
├── train/
│   ├── Patient01/
│   │   ├── T1.nii.gz
│   │   ├── b1000.nii.gz
│   │   └── GT.nii.gz
│   └── Patient02/
│       └── ...
└── val/
    ├── Patient03/
    │   └── ...
    └── Patient04/
        └── ...
```
Each case folder contains one NIfTI volume per modality (any names are accepted)
and a ground-truth mask named `GT.nii.gz`.
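Before running the preparation script it can help to sanity-check the layout. The helper below is a hypothetical sketch (not part of the package) that flags cases missing the mask or any modality volume:

```python
import tempfile
from pathlib import Path

def check_raw_layout(root):
    """List cases that are missing GT.nii.gz or have no modality volumes.
    (Illustrative helper; not part of ib_sampling.)"""
    problems = []
    for split in ("train", "val"):
        split_dir = Path(root) / split
        for case in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            vols = {v.name for v in case.glob("*.nii.gz")}
            if "GT.nii.gz" not in vols:
                problems.append(f"{case.name}: missing GT.nii.gz")
            if not vols - {"GT.nii.gz"}:
                problems.append(f"{case.name}: no modality volumes")
    return problems

# Build a tiny fake layout: Patient01 is complete, Patient03 lacks a mask.
root = Path(tempfile.mkdtemp())
for rel in ("train/Patient01/T1.nii.gz",
            "train/Patient01/GT.nii.gz",
            "val/Patient03/b1000.nii.gz"):
    path = root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

problems = check_raw_layout(root)
```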
Use the provided script to extract positive and negative patches. Supply the directory with training volumes and, optionally, a validation directory:
```bash
python prepare_dataset.py raw_dataset/train TRAIN_OUT_DIR \
    --downstream-patch-size 96 96 96 \
    --modalities T1 b1000 \
    --val-dir raw_dataset/val --val-output VAL_OUT_DIR
```
The script performs 1–99 percentile normalisation on every modality. Training
patches are cropped around each connected component with a 10‑voxel margin and
are at least 133 % of the downstream size (96³ → 128³ by default). Larger
lesions expand the patch further. Validation patches are extracted at exactly
the downstream size. Background patches are generated with a sliding window of
the corresponding patch size and 50% overlap. Patches are written to
`OUTPUT_DIR/lesion_patches` and `OUTPUT_DIR/background_patches` using the
following convention:
```
<pid>_<mod>_positive_<idx>.nii.gz and <pid>_label_positive_<idx>.nii.gz
<pid>_<mod>_negative_<idx>.nii.gz and <pid>_label_negative_<idx>.nii.gz
```
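The sliding-window placement for background patches can be sketched as follows; this is a one-axis illustration under the stated 50% overlap, not the script's actual code:

```python
def window_starts(dim, patch, overlap=0.5):
    """Start indices of sliding windows along one axis, clamping the
    final window so it ends flush with the volume border.
    (Illustrative sketch of 50%-overlap tiling.)"""
    stride = max(1, int(patch * (1 - overlap)))
    starts = list(range(0, max(dim - patch, 0) + 1, stride))
    if dim > patch and starts[-1] != dim - patch:
        starts.append(dim - patch)   # final window flush with the border
    return starts

# A 300-voxel axis tiled with 128-voxel windows at 50% overlap.
starts = window_starts(300, 128)
```

Applying this per axis and taking the Cartesian product yields the 3D background patch grid.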
```python
from argparse import Namespace

from ib_sampling.loader import get_loader

args = Namespace(
    train_dir="TRAIN_OUT_DIR",
    val_dir="VAL_OUT_DIR",          # optional
    roi_x=96, roi_y=96, roi_z=96,
    batch_size=4,
    ratio=1.0,                      # negative:positive ratio
    seed=0, rank=0, world_size=1,
    num_workers=4, distributed=False,
    modalities=["T1", "b1000"],
)

train_loader, val_loader = get_loader(args)
```

Each item returned by the loaders is a dictionary with `image` and `label`
keys containing tensors shaped `(C, Z, Y, X)` and `(1, Z, Y, X)`
respectively.
- When instantiated, `MedicalPatchDataset` discovers the available modalities and preloads all positive patches into memory. In a distributed setup the positives are evenly split across workers.
- For every epoch, `BalancedBatchSampler` randomly selects the required number of negative patches to satisfy the desired ratio. The dataset then preloads only those negatives into a cache before iteration begins.
- The sampler reports the number of positives and negatives each epoch and yields a balanced list of indices. The dataloader fetches the cached samples and applies the requested transforms.
- During validation all patches are cached up-front, removing disk I/O during evaluation.
This design keeps GPU utilisation high by avoiding repeated disk reads while still supporting large background pools.
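The per-epoch negative selection can be sketched as follows; the helper name and signature are illustrative, not the sampler's real interface:

```python
import random

def plan_epoch(num_pos, num_neg_pool, ratio, rng=None):
    """Pick which negatives to cache this epoch so that
    negatives ~= ratio * positives.  (Illustrative sketch.)"""
    rng = rng or random.Random(0)
    n_neg = min(int(round(ratio * num_pos)), num_neg_pool)
    neg_ids = rng.sample(range(num_neg_pool), n_neg)  # sample without replacement
    return n_neg, neg_ids

# With ratio=1.0, 40 positives pull in 40 negatives from a 10,000-patch pool.
n_neg, neg_ids = plan_epoch(40, 10_000, ratio=1.0)
```

Because only the selected negatives are preloaded, the background pool can be arbitrarily large without inflating memory use.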
- Joris Wuts
- Jakub Ceranka
- Jef Vandemeulebroucke
- Frédéric Lecouvet