[model] fix: slice rolled_labels for Ulysses Sequence Parallel in all model backends #5227

Open
aoshen524 wants to merge 2 commits into verl-project:main from aoshen524:fix/qwen2_vl_rolled_labels_sp_slice

Conversation

Contributor

@aoshen524 aoshen524 commented Feb 7, 2026

What does this PR do?

Fix a shape mismatch bug when using Ulysses Sequence Parallel with fused kernel backends (torch or triton).

When using Ulysses SP, hidden_states is sliced per-rank in monkey_patch.py, but input_ids/labels are not sliced before being rolled into rolled_labels. This causes a shape mismatch in forward_with_torch_backend and forward_with_triton_backend because rolled_labels has the full sequence length while hidden_states has seq_len / sp_size.

Affected models: All 4 model backends — qwen2_vl, qwen3_vl, glm4v, and dense_common (generic text models).
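The mismatch can be illustrated with a toy example. The sketch below is plain Python, not verl code: roll_left and slice_for_rank are hypothetical stand-ins for torch.roll(labels, shifts=-1, dims=-1) and slice_input_tensor, assuming sp_size=2 and a single sequence of 8 token ids.

```python
def roll_left(seq):
    """Shift left by one: position i is paired with the token at i+1 (next-token loss)."""
    return seq[1:] + seq[:1]

def slice_for_rank(seq, rank, sp_size):
    """Mimic slice_input_tensor: give each SP rank one contiguous chunk of the sequence."""
    chunk = len(seq) // sp_size
    return seq[rank * chunk:(rank + 1) * chunk]

labels = [10, 11, 12, 13, 14, 15, 16, 17]
rolled = roll_left(labels)                          # [11, 12, 13, 14, 15, 16, 17, 10]

# Before the fix, rank 0 holds len(labels) // sp_size hidden states but is
# handed all 8 rolled labels -- the shape mismatch that crashes the fused loss.
rank0 = slice_for_rank(rolled, rank=0, sp_size=2)   # [11, 12, 13, 14]
rank1 = slice_for_rank(rolled, rank=1, sp_size=2)   # [15, 16, 17, 10]
assert len(rank0) == len(labels) // 2
```

With the fix, each rank's rolled_labels slice has the same length as its hidden_states slice.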

Checklist Before Starting

  • Search for similar PRs: query
  • Format the PR title as [{modules}] {type}: {description}

Test

Validated by training Qwen2.5-VL-7B with Ulysses SP (sp_size=2, 4) on 8xH100 GPUs. Without this fix, training crashes with a shape mismatch error; with the fix, training runs correctly.

Design & Code Changes

  • verl/models/transformers/dense_common.py:
    • Add compute_rolled_labels(input_ids, labels, backend_name) shared helper that handles label rolling + Ulysses SP slicing via slice_input_tensor
    • Refactor forward_with_torch_backend and forward_with_triton_backend to use the shared helper
  • verl/models/transformers/qwen2_vl.py: Refactor to use compute_rolled_labels
  • verl/models/transformers/glm4v.py: Refactor to use compute_rolled_labels (fixes SP bug)
  • verl/models/transformers/qwen3_vl.py: Refactor to use compute_rolled_labels (fixes SP bug)

This eliminates code duplication across 4 files × 2 backends = 8 call sites, and ensures any future model backends automatically get correct SP handling.
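As a rough sketch of the shape such a shared helper could take (not the actual verl implementation: the stand-ins below for get_ulysses_sequence_parallel_world_size and slice_input_tensor are list-based simplifications, and the real helper operates on torch tensors via torch.roll):

```python
# Stand-ins for verl utilities (assumptions; the real helpers live in verl and
# operate on torch tensors with a distributed SP process group).
def get_ulysses_sequence_parallel_world_size() -> int:
    return 1  # pretend SP is disabled in this sketch

def slice_input_tensor(tensor, dim=-1, padding=False):
    # rank 0's contiguous chunk; the real helper uses the rank index
    chunk = len(tensor) // get_ulysses_sequence_parallel_world_size()
    return tensor[:chunk]

def compute_rolled_labels(input_ids, labels, backend_name: str):
    """Hypothetical shape of the shared helper: pick the target sequence,
    roll it left by one, then slice per SP rank to match hidden_states."""
    if labels is not None:
        source = labels
    elif input_ids is not None:
        source = input_ids
    else:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")
    rolled = source[1:] + source[:1]  # torch.roll(source, shifts=-1, dims=-1) in real code
    if get_ulysses_sequence_parallel_world_size() > 1:
        rolled = slice_input_tensor(rolled, dim=-1, padding=False)
    return rolled
```

Both forward_with_torch_backend and forward_with_triton_backend can then call this one helper, so the SP slicing lives in exactly one place.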

Checklist Before Submitting

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a shape mismatch bug when using Qwen2-VL with Ulysses Sequence Parallel by slicing rolled_labels. The fix is applied to both the torch and triton backends. While the fix is correct, it introduces code duplication between the two forward methods. My review includes a suggestion to refactor this duplicated logic into a shared helper method to improve code maintainability.

Comment on lines 504 to 508
# When using Ulysses Sequence Parallel, hidden_states is sliced in monkey_patch.py
# but input_ids/labels are not. We need to slice rolled_labels to match hidden_states.
sp_size = get_ulysses_sequence_parallel_world_size()
if sp_size > 1:
rolled_labels = slice_input_tensor(rolled_labels, dim=-1, padding=False)

Severity: high

This logic for slicing rolled_labels is also present in forward_with_triton_backend (lines 544-548). To improve maintainability and avoid code duplication, consider extracting the common input preparation logic from both forward_with_torch_backend and forward_with_triton_backend into a private helper method.

This would include the logic for creating rolled_labels and then slicing it for sequence parallelism.

For example, you could create a helper method like this:

def _prepare_ppo_inputs(
    self: "Qwen2VLForConditionalGeneration",
    input_ids: torch.LongTensor,
    labels: Optional[torch.LongTensor],
    backend_name: str,
    **kwargs,
):
    outputs = qwen2_vl_forward(self, input_ids, **kwargs)
    hidden_states = outputs[0]

    # Loss calculations
    if labels is not None:
        rolled_labels = torch.roll(labels, shifts=-1, dims=-1)
    elif input_ids is not None:
        rolled_labels = torch.roll(input_ids, shifts=-1, dims=-1)
    else:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")

    # When using Ulysses Sequence Parallel, slice rolled_labels to match hidden_states
    sp_size = get_ulysses_sequence_parallel_world_size()
    if sp_size > 1:
        rolled_labels = slice_input_tensor(rolled_labels, dim=-1, padding=False)

    return outputs, hidden_states, rolled_labels

Then both forward methods can call this helper to get the processed inputs, making the code cleaner and easier to maintain.

Contributor Author


Thanks for the suggestion! I've refactored this into a shared compute_rolled_labels() helper in dense_common.py that handles both the label rolling and SP slicing. This also revealed 3 other files with the same bug (glm4v.py, qwen3_vl.py, and dense_common.py itself), which are now all fixed via the shared helper.

…models

When using Ulysses Sequence Parallel with fused_kernel (torch or triton backend),
hidden_states is sliced in monkey_patch.py, but input_ids/labels are not.
This causes a shape mismatch in forward_with_torch_backend and forward_with_triton_backend.

Changes:
- Extract common rolled_labels logic into compute_rolled_labels() in dense_common.py,
  which handles both label rolling and Ulysses SP slicing via slice_input_tensor.
- Fix qwen2_vl.py, glm4v.py, qwen3_vl.py, and dense_common.py to use the shared helper.
  Previously only qwen2_vl.py had the SP fix; the other 3 files were also affected.
aoshen524 force-pushed the fix/qwen2_vl_rolled_labels_sp_slice branch from a5670ef to a92c23f on February 7, 2026 at 13:19
aoshen524 changed the title from "[model] fix: slice rolled_labels for Qwen2-VL Ulysses Sequence Parallel" to "[model] fix: slice rolled_labels for Ulysses Sequence Parallel in all model backends" on Feb 7, 2026
Add CPU-only tests covering:
- labels branch (labels provided)
- input_ids branch (labels=None)
- both None raises RuntimeError
- output shape preserved
- roll shift-left correctness
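A minimal sketch of what such CPU-only tests could look like, using a list-based stand-in for the helper (an assumption for illustration; the real tests exercise compute_rolled_labels from dense_common.py on tensors):

```python
def compute_rolled_labels(input_ids, labels, backend_name):
    # List-based stand-in; the real helper uses torch.roll + slice_input_tensor.
    src = labels if labels is not None else input_ids
    if src is None:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")
    return src[1:] + src[:1]

# labels branch (labels provided)
assert compute_rolled_labels([0, 0, 0], [1, 2, 3], "torch") == [2, 3, 1]
# input_ids branch (labels=None)
assert compute_rolled_labels([7, 8, 9], None, "torch") == [8, 9, 7]
# both None raises RuntimeError
try:
    compute_rolled_labels(None, None, "torch")
except RuntimeError:
    pass
else:
    raise AssertionError("expected RuntimeError")
# output shape preserved
assert len(compute_rolled_labels(None, [1, 2, 3, 4], "triton")) == 4
# roll shift-left correctness: first output element is the second input token
assert compute_rolled_labels(None, [10, 20, 30], "torch")[0] == 20
```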