CVPRW 2025 – IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
Thanos Delatolas ·
Vicky Kalogeiton ·
Dim Papadopoulos
Webpage ·
Paper
- selecting the appropriate diffusion model
- determining the optimal time step
- identifying the best feature extraction layer
- designing an effective affinity matrix calculation strategy to match the features
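To make the last factor concrete, here is a minimal sketch of affinity-based label propagation, the general idea behind matching features between a reference frame and a query frame. It is only an illustration under stated assumptions: the function names `cosine_affinity` and `propagate_labels`, the `top_k` and `temperature` parameters, and the toy feature vectors are all hypothetical, and this is not the repository's implementation or the filtering strategy studied in the paper. It uses plain Python so that it runs without dependencies; in practice the features would be tensors extracted from a chosen layer and time step of the diffusion model.

```python
import math


def cosine_affinity(ref_feats, qry_feats):
    """Return affinity[i][j]: cosine similarity between query location i
    and reference location j (hypothetical helper, for illustration)."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    ref = [normalize(v) for v in ref_feats]
    qry = [normalize(v) for v in qry_feats]
    return [[sum(a * b for a, b in zip(q, r)) for r in ref] for q in qry]


def propagate_labels(affinity, ref_labels, num_classes, top_k=2, temperature=0.1):
    """For each query location, keep the top_k most similar reference
    locations, softmax their similarities, and vote on the label."""
    predictions = []
    for row in affinity:
        # Indices of the top_k best-matching reference locations.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        # Numerically stable softmax over the retained similarities.
        best = max(row[j] for j in top)
        weights = [math.exp((row[j] - best) / temperature) for j in top]
        total = sum(weights)
        # Accumulate the softmax weight of each match onto its label.
        scores = [0.0] * num_classes
        for j, w in zip(top, weights):
            scores[ref_labels[j]] += w / total
        predictions.append(max(range(num_classes), key=lambda c: scores[c]))
    return predictions


# Toy data (assumed): four reference locations with known labels,
# two query locations whose labels we want to infer.
ref_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ref_labels = [0, 0, 1, 1]
qry_feats = [[0.8, 0.2], [0.2, 0.8]]

affinity = cosine_affinity(ref_feats, qry_feats)
pred = propagate_labels(affinity, ref_labels, num_classes=2)
print(pred)  # [0, 1]
```

Restricting the vote to the top matches and sharpening it with a temperature are common choices in this family of methods; the values used here are arbitrary.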
conda create -n diff-zvos python=3.10.8
conda activate diff-zvos
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
sh scripts/install_adm.sh

To download the datasets, run:

python scripts/download_datasets.py

To run inference, please follow EVALUATION.md.
@article{delatolas2025studying,
title={Studying Image Diffusion Features for Zero-Shot Video Object Segmentation},
author={Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P},
journal={arXiv preprint arXiv:2504.05468},
year={2025}
}

| Model | #Images | #Segmentations (Image) | #Frames | #Segmentations (Video) | Datasets | DAVIS-17 val |
|---|---|---|---|---|---|---|
| Image + Video-level Data | | | | | | |
| XMem | 1.02M | 27K | 150K | 210K | I+S+D+Y | 86.2 |
| Cutie | 1.02M | 27K | 150K | 210K | I+S+D+Y | 88.8 |
| SAM2 | 11M | 1.1B | 4.2M | 35.5M | SA+SAV | 90.7 |
| Image-level masks | | | | | | |
| SegIC | 1.3M | 1.8M | ❌ | ❌ | I+C+A+L | 73.7 |
| SegGPT | 147K | 1.62M | ❌ | ❌ | C+A+V | 75.6 |
| PerSAM-F | 11M | 1.1B | ❌ | ❌ | SA | 76.1 |
| Matcher | 11M | 1.1B | ❌ | ❌ | SA | 79.5 |
| No masks | | | | | | |
| FGVG | 1M | ❌ | 116K | ❌ | I+Y+FT | 72.4 |
| STT | 1M | ❌ | 95K | ❌ | I+Y | 74.1 |
| STC | ❌ | ❌ | 20M | ❌ | K | 67.6 |
| INO | ❌ | ❌ | 20M | ❌ | K | 72.5 |
| Mask-VOS | ❌ | ❌ | 95K | ❌ | Y | 75.6 |
| MoCo | 1M | ❌ | ❌ | ❌ | I | 65.4 |
| SHLS | 10K | ❌ | ❌ | ❌ | M | 68.5 |
| DIFT-SD | 5B | ❌ | ❌ | ❌ | LN | 70.0 |
| DINO | 1M | ❌ | ❌ | ❌ | I | 71.4 |
| DIFT-ADM | 1M | ❌ | ❌ | ❌ | I | 75.7 |
| Training-Free-VOS | 1M | ❌ | ❌ | ❌ | I | 76.3 |
| Ours | | | | | | |
| SD-2.1 + Prompt Learning | 5B | ❌ | ❌ | ❌ | LN | 70.5 |
| ADM + MAGFilter | 1M | ❌ | ❌ | ❌ | I | 76.8 |
We would like to thank the authors of DIFT, DINO and Cutie for making their code publicly available.
