Description
I encountered a compatibility issue when attempting to train GLM-4.7 with MTP (multi-token prediction) enabled. Training crashes with an AssertionError in the fill_routing_replay function inside the Ray train actor. Disabling MTP training resolves the issue.
Environment
Version: Latest (using latest patch)
Model: GLM-4.7
Reproduction Steps
Model Conversion: Converted the HF checkpoint to Torch Dist using the following args:
source scripts/models/glm4.5-355B-A32B.sh
MODEL_ARGS+=(
--mtp-num-layers 1
)
PYTHONPATH=/root/slime/Megatron-LM torchrun \
--nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/slime/models/GLM-4.7 \
--save /root/slime/models/GLM-4.7_torch_dist
Training Configuration: Training was launched with MTP enabled via the following flags:
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
Error Log
The following error occurs during train_actor execution:
File "/root/slime/slime/backends/megatron_utils/actor.py", line 367, in train
return self.train_actor(rollout_id, rollout_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/actor.py", line 402, in train_actor
self.fill_routing_replay(data_iterator, num_microbatches, rollout_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/actor.py", line 330, in fill_routing_replay
assert routing_replay_offset == len(RoutingReplay.all_routing_replays)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
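For context on what the assertion seems to check, here is a minimal, hypothetical Python sketch of the kind of bookkeeping mismatch that could trigger it. This is not slime's actual implementation: the assumption is that every MoE forward pass appends an entry to RoutingReplay.all_routing_replays, while the offset computed in fill_routing_replay only accounts for the main decoder layers, so the extra MTP forward per microbatch leaves the two counts out of sync. All names other than RoutingReplay, all_routing_replays, and fill_routing_replay are invented for illustration.

# Hypothetical sketch, not slime's actual code: shows how an extra MTP
# forward pass per microbatch could desynchronize the replay count from
# the offset that fill_routing_replay expects.

class RoutingReplay:
    # Global list that every MoE forward is assumed to append to.
    all_routing_replays = []

    @classmethod
    def record(cls):
        cls.all_routing_replays.append(object())

def forward_microbatch(num_moe_layers: int, mtp_num_layers: int) -> None:
    # Main decoder MoE layers each record one routing replay.
    for _ in range(num_moe_layers):
        RoutingReplay.record()
    # With --enable-mtp-training, the MTP block also runs MoE routing and
    # records additional replays that the offset below does not expect.
    for _ in range(mtp_num_layers):
        RoutingReplay.record()

def fill_routing_replay(num_microbatches: int, num_moe_layers: int) -> None:
    # Offset advances as if only the main decoder layers recorded replays.
    routing_replay_offset = num_microbatches * num_moe_layers
    # Fails whenever mtp_num_layers > 0, because extra replays were recorded.
    assert routing_replay_offset == len(RoutingReplay.all_routing_replays)

if __name__ == "__main__":
    num_microbatches, num_moe_layers, mtp_num_layers = 2, 4, 1
    for _ in range(num_microbatches):
        forward_microbatch(num_moe_layers, mtp_num_layers)
    fill_routing_replay(num_microbatches, num_moe_layers)  # raises AssertionError

If this guess is in the right direction, the fix would be to make the routing-replay accounting aware of the MTP layers when --enable-mtp-training is set.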