GitHub - thanosDelatolas/diff-zvos: [CVPRW2025] Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

CVPRW 2025 – IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops

Thanos Delatolas · Vicky Kalogeiton · Dim Papadopoulos

Webpage · Paper

We use pre-trained image diffusion models for Zero-Shot Video Object Segmentation, which requires addressing four key challenges:

selecting the appropriate diffusion model
determining the optimal time step
identifying the best feature extraction layer
designing an effective affinity matrix calculation strategy to match the features
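
The last challenge, matching features through an affinity matrix, can be illustrated with a generic label-propagation routine of the kind commonly used when evaluating frozen features on video segmentation. The sketch below is a minimal NumPy illustration and not the repository's implementation: the function name `propagate_labels`, the top-k restriction, and the temperature value are all assumptions made for the example, and it does not reproduce the MAGFilter strategy listed in the results table.

```python
import numpy as np


def propagate_labels(ref_feats, ref_labels, tgt_feats, topk=5, temperature=0.1):
    """Propagate labels from a reference frame to a target frame.

    ref_feats:  (N_ref, C) features, one row per spatial location
    ref_labels: (N_ref, K) one-hot or soft object labels of the reference frame
    tgt_feats:  (N_tgt, C) features of the target frame
    Returns an (N_tgt, K) array of soft labels for the target frame.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + 1e-8)
    tgt = tgt_feats / (np.linalg.norm(tgt_feats, axis=1, keepdims=True) + 1e-8)

    # Affinity matrix: similarity of every target location to every reference location.
    affinity = tgt @ ref.T  # (N_tgt, N_ref)

    # Keep only the k most similar reference locations per target location.
    k = min(topk, affinity.shape[1])
    idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]  # (N_tgt, k)
    top = np.take_along_axis(affinity, idx, axis=1)         # (N_tgt, k)

    # Softmax over the retained neighbours (assumed temperature).
    logits = top / temperature
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each target label is the affinity-weighted average of its neighbours' labels.
    return np.einsum("nk,nkc->nc", weights, ref_labels[idx])
```

In this sketch, the features would come from a chosen layer of the diffusion model at a chosen time step, flattened to one row per spatial location; taking the `argmax` over the returned array gives a hard mask for the target frame.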

Installation

conda create -n diff-zvos python=3.10.8
conda activate diff-zvos
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
sh scripts/install_adm.sh
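
After installation, it can be useful to confirm that the pinned versions were actually resolved. The following is a small, hypothetical helper (it is not part of the repository) that compares installed package versions against the ones pinned in the conda command above. It assumes the conda `pytorch` package registers itself under the distribution name `torch`, which is usually but not necessarily the case.

```python
from importlib import metadata

# Versions pinned in the conda install command; "torch" is the assumed
# distribution name of the conda "pytorch" package.
EXPECTED = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu118" when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} {status}")
```

Run it inside the activated `diff-zvos` environment; a `MISMATCH` line points to a package that is missing or resolved to a different version.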

Datasets

To download the datasets, run:

python scripts/download_datasets.py

To run inference, please follow the instructions in EVALUATION.md.

Citation

@article{delatolas2025studying,
  title={Studying Image Diffusion Features for Zero-Shot Video Object Segmentation},
  author={Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P},
  journal={arXiv preprint arXiv:2504.05468},
  year={2025}
}

State-of-the-art Comparison in Zero-Shot Video Segmentation

Model	#Images	#Segmentations (Image)	#Frames	#Segmentations (Video)	Datasets	DAVIS-17 val
Image + Video-level Data
XMem	1.02M	27K	150K	210K	I+S+D+Y	86.2
Cutie	1.02M	27K	150K	210K	I+S+D+Y	88.8
SAM2	11M	1.1B	4.2M	35.5M	SA+SAV	90.7
Image-Level masks
SegIC	1.3M	1.8M	❌	❌	I+C+A+L	73.7
SegGPT	147K	1.62M	❌	❌	C+A+V	75.6
PerSAM-F	11M	1.1B	❌	❌	SA	76.1
Matcher	11M	1.1B	❌	❌	SA	79.5
No masks
FGVG	1M	❌	116K	❌	I+Y+FT	72.4
STT	1M	❌	95K	❌	I+Y	74.1
STC	✗	❌	20M	❌	K	67.6
INO	✗	❌	20M	❌	K	72.5
Mask-VOS	✗	❌	95K	❌	Y	75.6
MoCo	1M	❌	❌	❌	I	65.4
SHLS	10K	❌	❌	❌	M	68.5
DIFT-SD	5B	❌	❌	❌	LN	70.0
DINO	1M	❌	❌	❌	I	71.4
DIFT-ADM	1M	❌	❌	❌	I	75.7
Training-Free-VOS	1M	❌	❌	❌	I	76.3
Ours
SD-2.1 + Prompt Learning	5B	❌	❌	❌	LN	70.5
ADM + MAGFilter	1M	❌	❌	❌	I	76.8
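
The README does not state which metric the DAVIS-17 val column reports. DAVIS results are conventionally given as J&F, the mean of region similarity (J) and contour accuracy (F). Assuming the table follows that convention, the sketch below shows only the J component, which is the intersection-over-union between a predicted and a ground-truth mask. The function name `jaccard` is chosen for illustration, and the contour term F is omitted.

```python
import numpy as np


def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treated as a perfect match by convention.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```

A per-sequence score would average this value over the annotated frames and objects; the official DAVIS evaluation toolkit should be used for numbers comparable to the table.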

Acknowledgements

We would like to thank the authors of DIFT, DINO and Cutie for making their code publicly available.
