
Can't use distributed processing #13

@SaikaOfficial

Description


Thank you for sharing the code!
This is my script run.sh:

CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=4 --master_port 12345 main_pretrain.py \
    --num_workers 10 \
    --accum_iter 2 \
    --batch_size 128 \
    --model mrm \
    --norm_pix_loss \
    --mask_ratio 0.75 \
    --epochs 200 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --resume ./MRM/mae_pretrain_vit_base.pth \
    --data_path ./MRM \
    --output_dir ./MRM

When I use distributed training, the program always gets stuck at this point:

[screenshot: console output frozen during startup]

and it never continues. But if I restrict training to a single GPU, it trains, although very slowly.

I'm wondering how to deal with it.
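
In case it helps narrow things down, here is a debugging variant of the launch command I could try. This is a sketch, not a confirmed fix: NCCL_DEBUG=INFO and NCCL_P2P_DISABLE=1 are standard NCCL environment variables for diagnosing multi-GPU startup hangs, and torchrun is the maintained replacement for the deprecated torch.distributed.launch (assuming a recent PyTorch version).

# Debugging sketch, assuming PyTorch >= 1.10 (for torchrun).
# NCCL_DEBUG=INFO prints NCCL's initialization log, which shows where the
# rendezvous between ranks stalls. NCCL_P2P_DISABLE=1 turns off peer-to-peer
# GPU transfers, a common cause of silent hangs on some machines.
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 \
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 \
torchrun --nproc_per_node=4 --master_port 12345 main_pretrain.py \
    --num_workers 10 \
    --accum_iter 2 \
    --batch_size 128 \
    # ...remaining arguments as in run.sh above

If the NCCL log shows all four ranks reaching initialization but stalling afterwards, the peer-to-peer path is the likely suspect; if some ranks never appear, it points to the launcher or port configuration instead.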
