IB-sampling

A modular dataloader for patch-based whole-body lesion detection. The package is split into small components so the dataset, sampler and loader can be reused in other projects or adapted to different modalities.

This repository accompanies the DEMI 2025 workshop paper:

Instance‑Balanced Patch Sampling for Whole‑Body Lesion Segmentation
Joris Wuts, Jakub Ceranka, Jef Vandemeulebroucke, Frédéric Lecouvet
Open-access version and citation details will be added after the MICCAI 2025 conference.

Whole-body scans often contain many tiny lesions that make up less than 0.01% of the image volume. Conventional positive–negative patch sampling struggles in this setting, over-representing background and large lesions while missing small targets. The instance-balanced strategy samples patches per lesion instance, improving CPU data-loading efficiency, training speed and segmentation accuracy.

Package structure

ib_sampling exposes a few key utilities:

  • MedicalPatchDataset – loads pre-extracted 3D patches and keeps a cache of frequently used patches in memory.
  • BalancedBatchSampler – draws a balanced mix of positive and negative patches for each epoch.
  • get_loader – convenience helper that builds training and optional validation dataloaders.

The repository also provides a template prepare_dataset.py script that converts raw volumes into patch datasets compatible with the dataloader.

Raw data layout

Before running the preparation script, organise the raw data into separate train and val directories. Each acquisition lives in its own folder:

raw_dataset/
├── train/
│   ├── Patient01/
│   │   ├── T1.nii.gz
│   │   ├── b1000.nii.gz
│   │   └── GT.nii.gz
│   └── Patient02/
│       └── ...
└── val/
    ├── Patient03/
    │   └── ...
    └── Patient04/
        └── ...

Each case folder contains one NIfTI volume per modality (any names are accepted) and a ground‑truth mask named GT.nii.gz.
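As a minimal sketch (a hypothetical helper, not part of the package), the layout above can be validated before running the preparation script, assuming every volume other than GT.nii.gz is a modality:

```python
from pathlib import Path

def find_cases(split_dir):
    """Return {case_id: [modality filenames]} for one split (train or val).

    Assumes each case folder holds one NIfTI volume per modality plus a
    ground-truth mask named GT.nii.gz; raises if the mask is missing.
    """
    cases = {}
    for case in sorted(Path(split_dir).iterdir()):
        if not case.is_dir():
            continue
        names = sorted(v.name for v in case.glob("*.nii.gz"))
        if "GT.nii.gz" not in names:
            raise FileNotFoundError(f"{case.name}: missing GT.nii.gz")
        # Every volume that is not the mask is treated as a modality.
        cases[case.name] = [n for n in names if n != "GT.nii.gz"]
    return cases
```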

Preparing patches

Use the provided script to extract positive and negative patches. Supply the directory with training volumes and, optionally, a validation directory:

python prepare_dataset.py raw_dataset/train TRAIN_OUT_DIR --downstream-patch-size 96 96 96 \
       --modalities T1 b1000 \
       --val-dir raw_dataset/val --val-output VAL_OUT_DIR

The script performs 1st–99th percentile normalisation on every modality. Training patches are cropped around each connected component with a 10‑voxel margin and are at least 133 % of the downstream size (128³ for the default 96³); larger lesions expand the patch further. Validation patches are extracted at exactly the downstream size. Background patches are generated with a sliding window of the corresponding patch size and 50 % overlap. Patches are written to the output directory under lesion_patches and background_patches using the following convention:

  • <pid>_<mod>_positive_<idx>.nii.gz and <pid>_label_positive_<idx>.nii.gz
  • <pid>_<mod>_negative_<idx>.nii.gz and <pid>_label_negative_<idx>.nii.gz
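As a hedged sketch (hypothetical helpers, not the script's actual code), the per-axis patch-size rule, the 50 % sliding window and the filename convention above can be expressed as:

```python
import math
import re

def positive_patch_size(lesion_extent, downstream=96, margin=10, factor=4 / 3):
    """Per-axis size of a training (positive) patch: at least 133 % of the
    downstream size (128 for the default 96), expanded further when the
    lesion's bounding-box extent plus a 10-voxel margin per side is larger."""
    return max(math.ceil(factor * downstream), lesion_extent + 2 * margin)

def sliding_window_starts(dim, patch, overlap=0.5):
    """Start indices of a sliding window along one axis (assumes dim >= patch)."""
    stride = int(patch * (1 - overlap))
    starts = list(range(0, dim - patch + 1, stride))
    if starts[-1] != dim - patch:  # make sure the last window reaches the border
        starts.append(dim - patch)
    return starts

# <pid>_<mod>_<positive|negative>_<idx>.nii.gz; label files use "label" in the
# modality slot. Assumes modality names contain no underscores (T1, b1000, ...).
PATCH_NAME = re.compile(
    r"(?P<pid>.+)_(?P<mod>[^_]+)_(?P<pol>positive|negative)_(?P<idx>\d+)\.nii\.gz$"
)

def parse_patch_name(name):
    m = PATCH_NAME.match(name)
    if m is None:
        raise ValueError(f"not a patch filename: {name}")
    return m["pid"], m["mod"], m["pol"], int(m["idx"])
```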

Using the dataloader

from argparse import Namespace
from ib_sampling.loader import get_loader

args = Namespace(
    train_dir="TRAIN_OUT_DIR",
    val_dir="VAL_OUT_DIR",  # optional
    roi_x=96, roi_y=96, roi_z=96,
    batch_size=4,
    ratio=1.0,                 # negative:positive ratio
    seed=0, rank=0, world_size=1,
    num_workers=4, distributed=False,
    modalities=["T1", "b1000"],
)

train_loader, val_loader = get_loader(args)

Each item returned by the loaders is a dictionary with image and label keys containing tensors shaped (C, Z, Y, X) and (1, Z, Y, X) respectively.
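A minimal sanity check for these shapes, assuming the loader collates items into batches of shape (B, C, Z, Y, X) for images and (B, 1, Z, Y, X) for labels (the batch dimension and collation behaviour are an assumption here):

```python
def check_batch(batch, n_modalities, roi):
    """Sanity-check one collated batch from the loader.

    Works with anything exposing a .shape attribute (torch tensors,
    numpy arrays): C must equal the number of modalities, labels must be
    single-channel, and spatial dims must match the requested ROI.
    """
    image, label = batch["image"], batch["label"]
    assert image.shape[1] == n_modalities, "one channel per modality expected"
    assert label.shape[1] == 1, "single-channel label expected"
    assert tuple(image.shape[2:]) == tuple(label.shape[2:]) == tuple(roi)

# Hypothetical usage inside a training loop:
# for batch in train_loader:
#     check_batch(batch, len(args.modalities), (args.roi_x, args.roi_y, args.roi_z))
```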

Dataloader mechanics

  • When instantiated, MedicalPatchDataset discovers the available modalities and preloads all positive patches into memory. In a distributed setup the positives are evenly split across workers.
  • For every epoch, BalancedBatchSampler randomly selects the required number of negative patches to satisfy the desired ratio. The dataset then preloads only those negatives into a cache before iteration begins.
  • The sampler reports the number of positives and negatives each epoch and yields a balanced list of indices. The dataloader fetches the cached samples and applies the requested transforms.
  • During validation all patches are cached up-front, removing disk I/O during evaluation.

This design keeps GPU utilisation high by avoiding repeated disk reads while still supporting large background pools.
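The per-epoch index selection described above can be sketched as follows (a simplified, hypothetical version, not the package's BalancedBatchSampler):

```python
import random

def epoch_indices(n_pos, n_neg_pool, ratio, rng):
    """Return a shuffled list of patch indices for one epoch.

    Positives are indices [0, n_pos); negatives are drawn without
    replacement from a larger pool so that n_neg is about ratio * n_pos
    (the negative:positive ratio). Only the selected negatives would then
    need to be preloaded into the cache before iteration begins.
    """
    n_neg = min(int(round(ratio * n_pos)), n_neg_pool)
    negatives = rng.sample(range(n_pos, n_pos + n_neg_pool), n_neg)
    indices = list(range(n_pos)) + negatives
    rng.shuffle(indices)
    return indices
```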

Authors

  • Joris Wuts
  • Jakub Ceranka
  • Jef Vandemeulebroucke
  • Frédéric Lecouvet

About

Source code for our proposed custom dataloader. This work was submitted to the DEMI workshop of MICCAI 2025; the source code will be made publicly available upon paper acceptance.
