[Bug] AssertionError in fill_routing_replay when enabling MTP training for GLM-4.7 #1556

@liujiahua123123

Description

I encountered a compatibility issue when attempting to train GLM-4.7 with MTP enabled. The training crashes with an AssertionError in the fill_routing_replay function within the Ray actor. Disabling MTP training resolves the issue.

Environment

Version: Latest (using latest patch)
Model: GLM-4.7

Reproduction Steps

Model Conversion: Converted the HF checkpoint to Torch Dist using the following args:

source scripts/models/glm4.5-355B-A32B.sh
MODEL_ARGS+=(
    --mtp-num-layers 1
)

PYTHONPATH=/root/slime/Megatron-LM torchrun \
    --nproc-per-node 8 \
    tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/slime/models/GLM-4.7 \
    --save /root/slime/models/GLM-4.7_torch_dist

Training Configuration: Training launched with MTP enabled:

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Error Log: The following error occurs during train_actor execution:

  File "/root/slime/slime/backends/megatron_utils/actor.py", line 367, in train
    return self.train_actor(rollout_id, rollout_data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 402, in train_actor
    self.fill_routing_replay(data_iterator, num_microbatches, rollout_data)
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 330, in fill_routing_replay
    assert routing_replay_offset == len(RoutingReplay.all_routing_replays)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
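
To make the failing check easier to reason about, here is a minimal sketch of the kind of invariant the assertion appears to enforce: every registered routing-replay object must receive recorded routing data. Everything below is a hypothetical stand-in, not slime's actual code. The `RoutingReplay` class body, the `fill_routing_replay` signature, and the idea that enabling MTP registers one more MoE router than the rollout data covers are all assumptions made for illustration; the real cause has not been confirmed in this report.

```python
# Hypothetical stand-in for the real RoutingReplay class (assumption, not slime code).
class RoutingReplay:
    # Class-level registry: one entry per MoE router that expects replayed routing.
    all_routing_replays = []

    def __init__(self):
        self.recorded = None
        RoutingReplay.all_routing_replays.append(self)


def fill_routing_replay(per_layer_routing):
    """Hand one recorded routing entry to each registered replay object.

    `per_layer_routing` is a hypothetical list with one item per layer
    for which the rollout recorded routing decisions.
    """
    routing_replay_offset = 0
    for replay, routing in zip(RoutingReplay.all_routing_replays, per_layer_routing):
        replay.recorded = routing
        routing_replay_offset += 1
    # Same shape as the failing check: the number of filled entries must
    # equal the number of registered replay objects.
    assert routing_replay_offset == len(RoutingReplay.all_routing_replays), (
        f"filled {routing_replay_offset} of "
        f"{len(RoutingReplay.all_routing_replays)} routing replays"
    )
    return routing_replay_offset
```

Under these assumptions, registering three replay objects and passing three routing entries succeeds, while registering a fourth (for example, one belonging to an extra MTP layer) without matching rollout data trips the assertion. Comparing `len(RoutingReplay.all_routing_replays)` with and without `--mtp-num-layers 1` in the real code would show whether this is what happens here.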

run-glm4.5-355B-A32B-ray.sh
