Description
I'm trying to apply pipeline parallelism when training LLaVA (TP=1, PP=2).
Although I followed the instructions, the code does not work (TP=2, PP=1 works).
I also found a few points in the code that look odd to me.
- LLaVA is basically a decoder-only model.
In my understanding, the vision encoder / vision projector is an additional embedding step that is only used in the pre_process part.
However, the LLaVA model is initialized with the encoder_and_decoder model_type. Why not encoder_or_decoder? (See the sketch below.)
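
For reference, this is the distinction I mean (a minimal sketch; the import path is the one I see in my Megatron checkout and may differ between versions):

```python
# Illustration only: Megatron's ModelType enum distinguishes the two cases.
# Since the vision tower only acts as an extra embedding step in pre_process,
# I would expect the decoder-only variant, not encoder_and_decoder.
from megatron.core.enums import ModelType

expected = ModelType.encoder_or_decoder    # decoder-only pipeline, single send/recv path
observed = ModelType.encoder_and_decoder   # what the LLaVA example seems to use
print("expected:", expected, "| observed:", observed)
```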
Furthermore, during pipeline-parallel communication, the recv/send tensor shape is set to (num_image_tokens, B, hidden_size).
It looks like the stages exchange the vision embedding rather than the intermediate states from the middle of the language model.
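
To make that concrete, here is the shape I would expect to be exchanged between stages versus what I observe (plain Python, just arithmetic; the sizes are placeholder values for illustration):

```python
# Illustration only: expected vs. observed inter-stage tensor shapes.
# In a decoder-only pipeline, each stage normally sends/receives the hidden
# states of the full combined (text + image) sequence: (seq_len, B, hidden).
seq_length = 2048          # placeholder: combined sequence length after merging image tokens
num_image_tokens = 576     # placeholder: number of vision-projector output tokens
micro_batch_size = 2       # placeholder
hidden_size = 4096         # placeholder

expected_shape = (seq_length, micro_batch_size, hidden_size)        # intermediate LM states
observed_shape = (num_image_tokens, micro_batch_size, hidden_size)  # looks like the vision embedding
print("expected:", expected_shape)
print("observed:", observed_shape)
```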
P.S. I am currently not using encoder_pipeline_model_parallel_size / encoder tensor parallel size, because setting them raises an error during Megatron initialization saying that world_size is not divisible by the total model size (world_size % total_model_size).
So I forced vision_config.pipeline_model_parallel_size to 1.
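
For completeness, this is roughly my workaround (a sketch of my own local change, not of the upstream code; vision_config is the config object for the vision tower in the LLaVA example):

```python
# Sketch of my local workaround (not upstream code): keep the vision encoder on
# the first pipeline stage so that Megatron's world-size divisibility check passes.
def patch_vision_config(vision_config):
    vision_config.pipeline_model_parallel_size = 1  # force PP=1 for the vision part
    return vision_config

# The check that fails without this is roughly:
#   total_model_size = TP * PP (+ encoder model-parallel ranks)
#   assert world_size % total_model_size == 0
```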
I am not familiar with the Megatron code base, and I would really appreciate some help with LLaVA training.
Thank you.