[QUESTION] LLaVA model_type, pipeline parallel training #1078
Replies: 2 comments
Hi @KookHoiKim, did you resolve this?
Hey, I guess the PP=2 failure is an architectural constraint in Megatron-LM's multimodal pipeline implementation. LLaVA uses the `encoder_and_decoder` model_type, so the vision encoder is scheduled as a separate encoder stage rather than being split across the language-model pipeline. The correct configuration is to give the encoder its own stage via `encoder_pipeline_model_parallel_size` (with enough ranks to cover it), not a plain PP=2 split of the whole model. Please correct me otherwise. Thank you.
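For context, a minimal sketch of the distinction between the two model types (assuming `megatron.core.enums` is importable; the variable names are illustrative, not Megatron-LM source):

```python
# Illustrative sketch only: what the two model_type values signal to the trainer.
from megatron.core.enums import ModelType

# A decoder-only GPT registers as encoder_or_decoder: every pipeline stage
# holds language-model layers, and p2p buffers carry (seq_len, batch, hidden).
gpt_model_type = ModelType.encoder_or_decoder

# LLaVA registers as encoder_and_decoder: the vision encoder + projector act
# as an "encoder" whose output must be routed into the first language-model
# stage, which constrains how pipeline stages can be laid out.
llava_model_type = ModelType.encoder_and_decoder
```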
I'm trying to apply pipeline parallelism when training LLaVA (TP=1, PP=2).
Although I followed the instructions, the code is not working (TP=2, PP=1 works).
And I found some weird points in the code.
In my understanding, the vision encoder / vision projector is an additional embedding part, which is only used in the pre_process part.
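To illustrate that reading, here is a toy sketch of the pre_process pattern (my own simplification, not the actual `LLaVAModel` code): the input-side modules exist only on the stage constructed with `pre_process=True`, while later stages just consume hidden states.

```python
import torch

# Toy, hypothetical module sketching Megatron's pre_process pattern.
# Input-side modules (text embedding, vision encoder, vision projector)
# live only on the first pipeline stage.
class TinyLLaVAStage(torch.nn.Module):
    def __init__(self, hidden_size, pre_process):
        super().__init__()
        self.pre_process = pre_process
        if pre_process:
            self.embedding = torch.nn.Embedding(32000, hidden_size)
            self.vision_encoder = torch.nn.Linear(1024, hidden_size)   # stand-in
            self.vision_projector = torch.nn.Linear(hidden_size, hidden_size)
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(hidden_size, hidden_size) for _ in range(2)
        )

    def forward(self, inp, image_feats=None):
        if self.pre_process:
            text = self.embedding(inp)                                 # (s, b, h)
            img = self.vision_projector(self.vision_encoder(image_feats))
            hidden = torch.cat([img, text], dim=0)  # image tokens prepended
        else:
            hidden = inp  # hidden states received from the previous stage
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden
```

Under this view, a later pipeline stage never needs the vision modules themselves; it only needs the combined hidden states from the previous stage.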
However, the LLaVA model is initialized with the `encoder_and_decoder` model_type. Why not `encoder_or_decoder`? Furthermore, during PP communication, the recv/send tensor shape is set to `(num_image_token, B, hidden_size)`. It seems the shards give/take the vision embedding, not the intermediate states from the middle of the language model.
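If the intermediate states were being exchanged, the buffer crossing stages should cover the combined sequence, not just the image part. A small arithmetic sketch (all numbers and names are illustrative, not Megatron-LM's; 576 image tokens corresponds to a 336px CLIP ViT-L/14, i.e. (336/14)²):

```python
# Illustrative shape arithmetic for the pipeline send/recv buffer.
# After the first stage splices image embeddings into the text sequence,
# later stages should exchange the combined sequence in Megatron's
# (seq_len, batch, hidden) layout.
num_image_tokens = 576        # e.g. CLIP ViT-L/14 @ 336px: (336 // 14) ** 2
text_seq_len = 1024
micro_batch_size = 2
hidden_size = 4096

combined_seq_len = num_image_tokens + text_seq_len
send_recv_shape = (combined_seq_len, micro_batch_size, hidden_size)
print(send_recv_shape)        # (1600, 2, 4096)
```

A shape of `(num_image_token, B, hidden_size)` alone would indeed look like only the vision embedding is being communicated, which is the concern above.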
P.S. Currently, I do not use `encoder_pipeline_model_parallel_size` / encoder tensor parallel size, because initializing Megatron raises an error when `world_size % total_model_size` is not divisible (see the arithmetic sketch below). So I forced `vision_config.pipeline_model_parallel_size` to be 1. I am not familiar with the Megatron code, and I really hope to get some help with LLaVA training.
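For reference, a sketch of the divisibility arithmetic that appears to be behind that error (a simplification of the check in `megatron/core/parallel_state.py`; the function and argument names are mine):

```python
# Simplified reconstruction of the world-size check, not the exact
# Megatron-LM code: encoder ranks are added on top of the decoder's
# TP x PP x CP grid.
def total_model_size(tp, pp, cp=1, encoder_tp=0, encoder_pp=0):
    decoder_size = tp * pp * cp
    encoder_size = encoder_tp * encoder_pp * cp
    return decoder_size + encoder_size

# TP=1, PP=2 for the language model plus one encoder pipeline stage:
# 1*2*1 + 1*1*1 = 3, so world_size must be a multiple of 3 -- a 2-GPU run
# fails the `world_size % total_model_size` check.
assert total_model_size(tp=1, pp=2, encoder_tp=1, encoder_pp=1) == 3
```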
Thank you.