A modular dataloader for patch-based whole-body lesion detection. The package is split into small components so the dataset, sampler and loader can be reused in other projects or adapted to different modalities.
This repository accompanies the DEMI 2025 workshop paper:
Instance‑Balanced Patch Sampling for Whole‑Body Lesion Segmentation
Joris Wuts, Jakub Ceranka, Jef Vandemeulebroucke, Frédéric Lecouvet
Open-access version and citation details will be added after the MICCAI 2025 conference.
Whole-body scans often contain many tiny lesions that make up less than 0.01% of the image volume. Conventional positive–negative patch sampling struggles in this setting, over-representing background and large lesions while missing small targets. The instance-balanced strategy samples patches per lesion instance, improving CPU data-loading efficiency, training speed and segmentation accuracy.
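The core idea can be sketched in a few lines. Everything below is illustrative (the function name and the `lesion_voxels` mapping are not the package API): patch centres are drawn per lesion *instance* first, then per voxel, so a tiny lesion is sampled as often as a huge one.

```python
import random

def sample_centers(lesion_voxels, n_patches, rng=None):
    """Instance-balanced sketch: pick a lesion instance uniformly first,
    then a voxel within it, so small lesions are sampled as often as
    large ones.  (Illustrative only; not the package API.)"""
    rng = rng or random.Random(0)
    instances = list(lesion_voxels)
    return [rng.choice(lesion_voxels[rng.choice(instances)])
            for _ in range(n_patches)]

# One 1000-voxel lesion and one 3-voxel lesion: each still receives
# roughly half of the sampled patch centres.
lesions = {1: [(0, 0, i) for i in range(1000)],
           2: [(9, 9, 0), (9, 9, 1), (9, 9, 2)]}
centers = sample_centers(lesions, 200)
```

Sampling voxels directly would give the small lesion only ~0.3% of the centres; sampling instances first gives it ~50%.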
ib_sampling exposes a few key utilities:
- `MedicalPatchDataset` – loads pre-extracted 3D patches and keeps a cache of frequently used patches in memory.
- `BalancedBatchSampler` – draws a balanced mix of positive and negative patches for each epoch.
- `get_loader` – convenience helper that builds training and optional validation dataloaders.
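The in-memory patch cache can be pictured as a small LRU store. The sketch below is a minimal stand-in, not the dataset's internal implementation:

```python
from collections import OrderedDict

class PatchCache:
    """Minimal LRU cache for loaded patches (illustrative sketch,
    not MedicalPatchDataset's actual internals)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key, load_fn):
        if key in self._store:
            self._store.move_to_end(key)      # mark as recently used
            return self._store[key]
        value = load_fn(key)                  # cache miss: read from disk
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least-recently used
        return value

# Track which keys actually hit the (fake) disk loader.
loads = []

def fake_load(key):
    loads.append(key)
    return key.upper()

cache = PatchCache(capacity=2)
cache.get("a", fake_load)
cache.get("a", fake_load)   # second access served from memory
cache.get("b", fake_load)
cache.get("c", fake_load)   # capacity exceeded: "a" is evicted
cache.get("a", fake_load)   # reloaded from "disk"
```

Frequently accessed patches stay in memory; only evicted or never-seen patches touch the disk again.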
The repository also provides a template `prepare_dataset.py` script that
converts raw volumes into patch datasets compatible with the dataloader.
Before running the preparation script, organise the raw data into separate
train and val directories. Each acquisition lives in its own folder:
```
raw_dataset/
├── train/
│   ├── Patient01/
│   │   ├── T1.nii.gz
│   │   ├── b1000.nii.gz
│   │   └── GT.nii.gz
│   └── Patient02/
│       └── ...
└── val/
    ├── Patient03/
    │   └── ...
    └── Patient04/
        └── ...
```
Each case folder contains one NIfTI volume per modality (any names are accepted)
and a ground-truth mask named `GT.nii.gz`.
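Before running the preparation script it can help to sanity-check the layout. The helper below is a hypothetical sketch (not part of the package) that flags cases missing the mask or any modality volume:

```python
import tempfile
from pathlib import Path

def check_raw_layout(root):
    """List cases that are missing GT.nii.gz or have no modality volumes.
    (Illustrative helper; not part of ib_sampling.)"""
    problems = []
    for split in ("train", "val"):
        split_dir = Path(root) / split
        for case in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            vols = {v.name for v in case.glob("*.nii.gz")}
            if "GT.nii.gz" not in vols:
                problems.append(f"{case.name}: missing GT.nii.gz")
            if not vols - {"GT.nii.gz"}:
                problems.append(f"{case.name}: no modality volumes")
    return problems

# Build a tiny fake layout: Patient01 is complete, Patient03 lacks a mask.
root = Path(tempfile.mkdtemp())
for rel in ("train/Patient01/T1.nii.gz",
            "train/Patient01/GT.nii.gz",
            "val/Patient03/b1000.nii.gz"):
    path = root / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()

problems = check_raw_layout(root)
```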
Use the provided script to extract positive and negative patches. Supply the directory with training volumes and, optionally, a validation directory:
```bash
python prepare_dataset.py raw_dataset/train TRAIN_OUT_DIR \
    --downstream-patch-size 96 96 96 \
    --modalities T1 b1000 \
    --val-dir raw_dataset/val --val-output VAL_OUT_DIR
```
The script performs 1–99 percentile normalisation on every modality. Training
patches are cropped around each connected component with a 10‑voxel margin and
are at least 133 % of the downstream size (96³ → 128³ by default). Larger
lesions expand the patch further. Validation patches are extracted at exactly
the downstream size. Background patches are generated with a sliding window of
the corresponding patch size and 50% overlap. Patches are written to
`OUTPUT_DIR/lesion_patches` and `OUTPUT_DIR/background_patches` using the
following convention:
```
<pid>_<mod>_positive_<idx>.nii.gz and <pid>_label_positive_<idx>.nii.gz
<pid>_<mod>_negative_<idx>.nii.gz and <pid>_label_negative_<idx>.nii.gz
```
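The sliding-window placement for background patches can be sketched as follows; this is a one-axis illustration under the stated 50% overlap, not the script's actual code:

```python
def window_starts(dim, patch, overlap=0.5):
    """Start indices of sliding windows along one axis, clamping the
    final window so it ends flush with the volume border.
    (Illustrative sketch of 50%-overlap tiling.)"""
    stride = max(1, int(patch * (1 - overlap)))
    starts = list(range(0, max(dim - patch, 0) + 1, stride))
    if dim > patch and starts[-1] != dim - patch:
        starts.append(dim - patch)   # final window flush with the border
    return starts

# A 300-voxel axis tiled with 128-voxel windows at 50% overlap.
starts = window_starts(300, 128)
```

Applying this per axis and taking the Cartesian product yields the 3D background patch grid.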
```python
from argparse import Namespace

from ib_sampling.loader import get_loader

args = Namespace(
    train_dir="TRAIN_OUT_DIR",
    val_dir="VAL_OUT_DIR",          # optional
    roi_x=96, roi_y=96, roi_z=96,
    batch_size=4,
    ratio=1.0,                      # negative:positive ratio
    seed=0, rank=0, world_size=1,
    num_workers=4, distributed=False,
    modalities=["T1", "b1000"],
)

train_loader, val_loader = get_loader(args)
```

Each item returned by the loaders is a dictionary with `image` and `label`
keys containing tensors shaped `(C, Z, Y, X)` and `(1, Z, Y, X)`
respectively.
- When instantiated, `MedicalPatchDataset` discovers the available modalities and preloads all positive patches into memory. In a distributed setup the positives are evenly split across workers.
- For every epoch, `BalancedBatchSampler` randomly selects the required number of negative patches to satisfy the desired ratio. The dataset then preloads only those negatives into a cache before iteration begins.
- The sampler reports the number of positives and negatives each epoch and yields a balanced list of indices. The dataloader fetches the cached samples and applies the requested transforms.
- During validation all patches are cached up-front, removing disk I/O during evaluation.
This design keeps GPU utilisation high by avoiding repeated disk reads while still supporting large background pools.
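The per-epoch negative selection can be sketched as follows; the helper name and signature are illustrative, not the sampler's real interface:

```python
import random

def plan_epoch(num_pos, num_neg_pool, ratio, rng=None):
    """Pick which negatives to cache this epoch so that
    negatives ~= ratio * positives.  (Illustrative sketch.)"""
    rng = rng or random.Random(0)
    n_neg = min(int(round(ratio * num_pos)), num_neg_pool)
    neg_ids = rng.sample(range(num_neg_pool), n_neg)  # sample without replacement
    return n_neg, neg_ids

# With ratio=1.0, 40 positives pull in 40 negatives from a 10,000-patch pool.
n_neg, neg_ids = plan_epoch(40, 10_000, ratio=1.0)
```

Because only the selected negatives are preloaded, the background pool can be arbitrarily large without inflating memory use.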
- Joris Wuts
- Jakub Ceranka
- Jef Vandemeulebroucke
- Frédéric Lecouvet