[model] fix: slice rolled_labels for Ulysses Sequence Parallel in all model backends #5227
aoshen524 wants to merge 2 commits into verl-project:main
Conversation
Code Review
This pull request correctly fixes a shape mismatch bug when using Qwen2-VL with Ulysses Sequence Parallel by slicing rolled_labels. The fix is applied to both the torch and triton backends. While the fix is correct, it introduces code duplication between the two forward methods. My review includes a suggestion to refactor this duplicated logic into a shared helper method to improve code maintainability.
verl/models/transformers/qwen2_vl.py
Outdated
# When using Ulysses Sequence Parallel, hidden_states is sliced in monkey_patch.py
# but input_ids/labels are not. We need to slice rolled_labels to match hidden_states.
sp_size = get_ulysses_sequence_parallel_world_size()
if sp_size > 1:
    rolled_labels = slice_input_tensor(rolled_labels, dim=-1, padding=False)
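Concretely, the mismatch this slice fixes can be demonstrated with plain torch. The `slice_for_rank` function below is a hypothetical stand-in for verl's `slice_input_tensor` (which additionally handles padding and reads the rank from the SP process group); it just keeps one contiguous `1/sp_size` chunk of the sequence dimension, matching the per-rank `hidden_states`:

```python
import torch

def slice_for_rank(x: torch.Tensor, rank: int, sp_size: int, dim: int = -1) -> torch.Tensor:
    """Illustrative stand-in for verl's slice_input_tensor: each SP rank keeps
    a contiguous 1/sp_size chunk of the sequence dimension."""
    seq_len = x.size(dim)
    chunk = seq_len // sp_size
    return x.narrow(dim, rank * chunk, chunk)

# Full-length rolled labels: batch=1, seq_len=8
rolled_labels = torch.arange(8).unsqueeze(0)

# With sp_size=2, rank 1's hidden_states covers positions 4..7 only,
# so rolled_labels must be sliced the same way before the loss is computed.
local = slice_for_rank(rolled_labels, rank=1, sp_size=2)
print(local.shape)    # torch.Size([1, 4])
print(local.tolist()) # [[4, 5, 6, 7]]
```

Without the slice, the loss would be computed against a `(1, 8)` label tensor and a `(1, 4)` per-rank `hidden_states`, which is exactly the shape mismatch the PR reports.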
This logic for slicing rolled_labels is also present in forward_with_triton_backend (lines 544-548). To improve maintainability and avoid code duplication, consider extracting the common input preparation logic from both forward_with_torch_backend and forward_with_triton_backend into a private helper method.
This would include the logic for creating rolled_labels and then slicing it for sequence parallelism.
For example, you could create a helper method like this:
def _prepare_ppo_inputs(
    self: "Qwen2VLForConditionalGeneration",
    input_ids: torch.LongTensor,
    labels: Optional[torch.LongTensor],
    backend_name: str,
    **kwargs,
):
    outputs = qwen2_vl_forward(self, input_ids, **kwargs)
    hidden_states = outputs[0]

    # Loss calculations
    if labels is not None:
        rolled_labels = torch.roll(labels, shifts=-1, dims=-1)
    elif input_ids is not None:
        rolled_labels = torch.roll(input_ids, shifts=-1, dims=-1)
    else:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")

    # When using Ulysses Sequence Parallel, slice rolled_labels to match hidden_states
    sp_size = get_ulysses_sequence_parallel_world_size()
    if sp_size > 1:
        rolled_labels = slice_input_tensor(rolled_labels, dim=-1, padding=False)

    return outputs, hidden_states, rolled_labels

Then both forward methods can call this helper to get the processed inputs, making the code cleaner and easier to maintain.
Thanks for the suggestion! I've refactored this into a shared compute_rolled_labels() helper in dense_common.py that handles both the label rolling and SP slicing. This also revealed 3 other files with the same bug (glm4v.py, qwen3_vl.py, and dense_common.py itself), which are now all fixed via the shared helper.
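A minimal, self-contained sketch of what that shared helper does (the real `compute_rolled_labels` lives in verl's `dense_common.py`; the two Ulysses utilities are stubbed here so the sketch runs without a distributed SP group, and their behavior under SP is only approximated):

```python
from typing import Optional
import torch

# Stubs standing in for verl's Ulysses utilities (assumed names from the PR).
def get_ulysses_sequence_parallel_world_size() -> int:
    return 1  # pretend no SP group is initialized

def slice_input_tensor(x: torch.Tensor, dim: int, padding: bool) -> torch.Tensor:
    return x  # the real version keeps this rank's 1/sp_size chunk

def compute_rolled_labels(
    input_ids: Optional[torch.LongTensor],
    labels: Optional[torch.LongTensor],
    backend_name: str,
) -> torch.LongTensor:
    """Roll the targets left by one and, under Ulysses SP, slice them to
    match the per-rank hidden_states (sketch of the PR's shared helper)."""
    if labels is not None:
        rolled = torch.roll(labels, shifts=-1, dims=-1)
    elif input_ids is not None:
        rolled = torch.roll(input_ids, shifts=-1, dims=-1)
    else:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")
    if get_ulysses_sequence_parallel_world_size() > 1:
        rolled = slice_input_tensor(rolled, dim=-1, padding=False)
    return rolled

ids = torch.tensor([[1, 2, 3, 4]])
print(compute_rolled_labels(ids, None, "torch backend").tolist())  # [[2, 3, 4, 1]]
```

Centralizing the roll-plus-slice in one function is what makes the fix apply uniformly to `qwen2_vl`, `qwen3_vl`, `glm4v`, and the generic dense path.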
…models

When using Ulysses Sequence Parallel with fused_kernel (torch or triton backend), hidden_states is sliced in monkey_patch.py, but input_ids/labels are not. This causes a shape mismatch in forward_with_torch_backend and forward_with_triton_backend.

Changes:
- Extract common rolled_labels logic into compute_rolled_labels() in dense_common.py, which handles both label rolling and Ulysses SP slicing via slice_input_tensor.
- Fix qwen2_vl.py, glm4v.py, qwen3_vl.py, and dense_common.py to use the shared helper.

Previously only qwen2_vl.py had the SP fix; the other 3 files were also affected.
a5670ef to a92c23f
Add CPU-only tests covering:
- labels branch (labels provided)
- input_ids branch (labels=None)
- both None raises RuntimeError
- output shape preserved
- roll shift-left correctness
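The "roll shift-left correctness" property can be checked with plain-torch assertions (illustrative only, not the PR's actual test file):

```python
import torch

labels = torch.tensor([[10, 20, 30, 40, 50]])
rolled = torch.roll(labels, shifts=-1, dims=-1)

# Position i of the rolled tensor holds the next-token target labels[i + 1];
# the last position wraps around and is ignored/masked in the loss.
assert torch.equal(rolled[:, :-1], labels[:, 1:])
assert rolled[0, -1].item() == labels[0, 0].item()  # wrap-around element
print(rolled.tolist())  # [[20, 30, 40, 50, 10]]
```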
What does this PR do?
Fix a shape mismatch bug when using Ulysses Sequence Parallel with fused kernel backends (torch or triton).
When using Ulysses SP, `hidden_states` is sliced per-rank in `monkey_patch.py`, but `input_ids`/`labels` are not sliced before being rolled into `rolled_labels`. This causes a shape mismatch in `forward_with_torch_backend` and `forward_with_triton_backend`, because `rolled_labels` has the full sequence length while `hidden_states` has `seq_len / sp_size`.

Affected models: all 4 model backends: `qwen2_vl`, `qwen3_vl`, `glm4v`, and `dense_common` (generic text models).

Checklist Before Starting
`[{modules}] {type}: {description}`

Test
Validated by training Qwen2.5-VL-7B with Ulysses SP (sp_size=2, 4) on 8xH100 GPUs. Without this fix, training crashes with shape mismatch error; with this fix, training runs correctly.
Design & Code Changes
- `verl/models/transformers/dense_common.py`: Add `compute_rolled_labels(input_ids, labels, backend_name)`, a shared helper that handles label rolling + Ulysses SP slicing via `slice_input_tensor`; refactor `forward_with_torch_backend` and `forward_with_triton_backend` to use the shared helper.
- `verl/models/transformers/qwen2_vl.py`: Refactor to use `compute_rolled_labels`.
- `verl/models/transformers/glm4v.py`: Refactor to use `compute_rolled_labels` (fixes SP bug).
- `verl/models/transformers/qwen3_vl.py`: Refactor to use `compute_rolled_labels` (fixes SP bug).

This eliminates code duplication across 4 files × 2 backends = 8 call sites, and ensures any future model backends automatically get correct SP handling.
Checklist Before Submitting
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- `ci-request` channel in the `verl` Slack workspace.