[model] fix: slice rolled_labels for Ulysses Sequence Parallel in all model backends #5227

Open
aoshen524 wants to merge 2 commits into verl-project:main from aoshen524:fix/qwen2_vl_rolled_labels_sp_slice

Conversation

Contributor

@aoshen524 aoshen524 commented Feb 7, 2026

What does this PR do?

Fix a shape mismatch bug when using Ulysses Sequence Parallel with fused kernel backends (torch or triton).

When using Ulysses SP, hidden_states is sliced per-rank in monkey_patch.py, but input_ids/labels are not sliced before being rolled into rolled_labels. This causes a shape mismatch in forward_with_torch_backend and forward_with_triton_backend because rolled_labels has the full sequence length while hidden_states has seq_len / sp_size.

Affected models: All 4 model backends — qwen2_vl, qwen3_vl, glm4v, and dense_common (generic text models).
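The mismatch can be illustrated with a toy example. The sketch below is plain Python, not verl code: roll_left and slice_for_rank are hypothetical stand-ins for torch.roll(labels, shifts=-1, dims=-1) and slice_input_tensor, assuming sp_size=2 and a single sequence of 8 token ids.

```python
def roll_left(seq):
    """Shift left by one: position i is paired with the token at i+1 (next-token loss)."""
    return seq[1:] + seq[:1]

def slice_for_rank(seq, rank, sp_size):
    """Mimic slice_input_tensor: give each SP rank one contiguous chunk of the sequence."""
    chunk = len(seq) // sp_size
    return seq[rank * chunk:(rank + 1) * chunk]

labels = [10, 11, 12, 13, 14, 15, 16, 17]
rolled = roll_left(labels)                          # [11, 12, 13, 14, 15, 16, 17, 10]

# Before the fix, rank 0 holds len(labels) // sp_size hidden states but is
# handed all 8 rolled labels -- the shape mismatch that crashes the fused loss.
rank0 = slice_for_rank(rolled, rank=0, sp_size=2)   # [11, 12, 13, 14]
rank1 = slice_for_rank(rolled, rank=1, sp_size=2)   # [15, 16, 17, 10]
assert len(rank0) == len(labels) // 2
```

With the fix, each rank's rolled_labels slice has the same length as its hidden_states slice.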

Checklist Before Starting

  • Search for similar PRs: query
  • Format the PR title as [{modules}] {type}: {description}

Test

Validated by training Qwen2.5-VL-7B with Ulysses SP (sp_size=2, 4) on 8xH100 GPUs. Without this fix, training crashes with a shape mismatch error; with the fix, training runs correctly.

Design & Code Changes

  • verl/models/transformers/dense_common.py:
    • Add compute_rolled_labels(input_ids, labels, backend_name) shared helper that handles label rolling + Ulysses SP slicing via slice_input_tensor
    • Refactor forward_with_torch_backend and forward_with_triton_backend to use the shared helper
  • verl/models/transformers/qwen2_vl.py: Refactor to use compute_rolled_labels
  • verl/models/transformers/glm4v.py: Refactor to use compute_rolled_labels (fixes SP bug)
  • verl/models/transformers/qwen3_vl.py: Refactor to use compute_rolled_labels (fixes SP bug)

This eliminates code duplication across 4 files × 2 backends = 8 call sites, and ensures any future model backends automatically get correct SP handling.
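As a rough sketch of the shape such a shared helper could take (not the actual verl implementation: the stand-ins below for get_ulysses_sequence_parallel_world_size and slice_input_tensor are list-based simplifications, and the real helper operates on torch tensors via torch.roll):

```python
# Stand-ins for verl utilities (assumptions; the real helpers live in verl and
# operate on torch tensors with a distributed SP process group).
def get_ulysses_sequence_parallel_world_size() -> int:
    return 1  # pretend SP is disabled in this sketch

def slice_input_tensor(tensor, dim=-1, padding=False):
    # rank 0's contiguous chunk; the real helper uses the rank index
    chunk = len(tensor) // get_ulysses_sequence_parallel_world_size()
    return tensor[:chunk]

def compute_rolled_labels(input_ids, labels, backend_name: str):
    """Hypothetical shape of the shared helper: pick the target sequence,
    roll it left by one, then slice per SP rank to match hidden_states."""
    if labels is not None:
        source = labels
    elif input_ids is not None:
        source = input_ids
    else:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")
    rolled = source[1:] + source[:1]  # torch.roll(source, shifts=-1, dims=-1) in real code
    if get_ulysses_sequence_parallel_world_size() > 1:
        rolled = slice_input_tensor(rolled, dim=-1, padding=False)
    return rolled
```

Both forward_with_torch_backend and forward_with_triton_backend can then call this one helper, so the SP slicing lives in exactly one place.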

Checklist Before Submitting

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a shape mismatch bug when using Qwen2-VL with Ulysses Sequence Parallel by slicing rolled_labels. The fix is applied to both the torch and triton backends. While the fix is correct, it introduces code duplication between the two forward methods. My review includes a suggestion to refactor this duplicated logic into a shared helper method to improve code maintainability.

Comment on lines 504 to 508
# When using Ulysses Sequence Parallel, hidden_states is sliced in monkey_patch.py
# but input_ids/labels are not. We need to slice rolled_labels to match hidden_states.
sp_size = get_ulysses_sequence_parallel_world_size()
if sp_size > 1:
rolled_labels = slice_input_tensor(rolled_labels, dim=-1, padding=False)

Severity: high

This logic for slicing rolled_labels is also present in forward_with_triton_backend (lines 544-548). To improve maintainability and avoid code duplication, consider extracting the common input preparation logic from both forward_with_torch_backend and forward_with_triton_backend into a private helper method.

This would include the logic for creating rolled_labels and then slicing it for sequence parallelism.

For example, you could create a helper method like this:

def _prepare_ppo_inputs(
    self: "Qwen2VLForConditionalGeneration",
    input_ids: torch.LongTensor,
    labels: Optional[torch.LongTensor],
    backend_name: str,
    **kwargs,
):
    outputs = qwen2_vl_forward(self, input_ids, **kwargs)
    hidden_states = outputs[0]

    # Loss calculations
    if labels is not None:
        rolled_labels = torch.roll(labels, shifts=-1, dims=-1)
    elif input_ids is not None:
        rolled_labels = torch.roll(input_ids, shifts=-1, dims=-1)
    else:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")

    # When using Ulysses Sequence Parallel, slice rolled_labels to match hidden_states
    sp_size = get_ulysses_sequence_parallel_world_size()
    if sp_size > 1:
        rolled_labels = slice_input_tensor(rolled_labels, dim=-1, padding=False)

    return outputs, hidden_states, rolled_labels

Then both forward methods can call this helper to get the processed inputs, making the code cleaner and easier to maintain.

Contributor Author


Thanks for the suggestion! I've refactored this into a shared compute_rolled_labels() helper in dense_common.py that handles both the label rolling and SP slicing. This also revealed 3 other files with the same bug (glm4v.py, qwen3_vl.py, and dense_common.py itself), which are now all fixed via the shared helper.

…models

When using Ulysses Sequence Parallel with fused_kernel (torch or triton backend),
hidden_states is sliced in monkey_patch.py, but input_ids/labels are not.
This causes a shape mismatch in forward_with_torch_backend and forward_with_triton_backend.

Changes:
- Extract common rolled_labels logic into compute_rolled_labels() in dense_common.py,
  which handles both label rolling and Ulysses SP slicing via slice_input_tensor.
- Fix qwen2_vl.py, glm4v.py, qwen3_vl.py, and dense_common.py to use the shared helper.
  Previously only qwen2_vl.py had the SP fix; the other 3 files were also affected.
aoshen524 force-pushed the fix/qwen2_vl_rolled_labels_sp_slice branch from a5670ef to a92c23f on February 7, 2026 at 13:19
aoshen524 changed the title from "[model] fix: slice rolled_labels for Qwen2-VL Ulysses Sequence Parallel" to "[model] fix: slice rolled_labels for Ulysses Sequence Parallel in all model backends" on Feb 7, 2026
Add CPU-only tests covering:
- labels branch (labels provided)
- input_ids branch (labels=None)
- both None raises RuntimeError
- output shape preserved
- roll shift-left correctness
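A minimal sketch of what such CPU-only tests could look like, using a list-based stand-in for the helper (an assumption for illustration; the real tests exercise compute_rolled_labels from dense_common.py on tensors):

```python
def compute_rolled_labels(input_ids, labels, backend_name):
    # List-based stand-in; the real helper uses torch.roll + slice_input_tensor.
    src = labels if labels is not None else input_ids
    if src is None:
        raise RuntimeError(f"To use {backend_name}, either labels or input_ids must be provided.")
    return src[1:] + src[:1]

# labels branch (labels provided)
assert compute_rolled_labels([0, 0, 0], [1, 2, 3], "torch") == [2, 3, 1]
# input_ids branch (labels=None)
assert compute_rolled_labels([7, 8, 9], None, "torch") == [8, 9, 7]
# both None raises RuntimeError
try:
    compute_rolled_labels(None, None, "torch")
except RuntimeError:
    pass
else:
    raise AssertionError("expected RuntimeError")
# output shape preserved
assert len(compute_rolled_labels(None, [1, 2, 3, 4], "triton")) == 4
# roll shift-left correctness: first output element is the second input token
assert compute_rolled_labels(None, [10, 20, 30], "torch")[0] == 20
```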