Description
I encountered a compatibility issue when attempting to train GLM-4.7 with MTP (multi-token prediction) enabled. Training crashes with an AssertionError in the fill_routing_replay function inside the Ray train actor. Disabling MTP training resolves the issue.
Environment
Version: Latest (using latest patch)
Model: GLM-4.7
Reproduction Steps
Model Conversion: Converted the HF checkpoint to Torch Dist using the following args:
source scripts/models/glm4.5-355B-A32B.sh
MODEL_ARGS+=(
--mtp-num-layers 1
)
PYTHONPATH=/root/slime/Megatron-LM torchrun \
--nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/slime/models/GLM-4.7 \
--save /root/slime/models/GLM-4.7_torch_dist
Training Configuration: Training was launched with MTP enabled via the following flags:
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
Error Log
The following error occurs during train_actor execution:
File "/root/slime/slime/backends/megatron_utils/actor.py", line 367, in train
return self.train_actor(rollout_id, rollout_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/actor.py", line 402, in train_actor
self.fill_routing_replay(data_iterator, num_microbatches, rollout_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/actor.py", line 330, in fill_routing_replay
assert routing_replay_offset == len(RoutingReplay.all_routing_replays)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
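For context on what the assertion seems to check, here is a minimal, hypothetical Python sketch of the kind of bookkeeping mismatch that could trigger it. This is not slime's actual implementation: the assumption is that every MoE forward pass appends an entry to RoutingReplay.all_routing_replays, while the offset computed in fill_routing_replay only accounts for the main decoder layers, so the extra MTP forward per microbatch leaves the two counts out of sync. All names other than RoutingReplay, all_routing_replays, and fill_routing_replay are invented for illustration.

# Hypothetical sketch, not slime's actual code: shows how an extra MTP
# forward pass per microbatch could desynchronize the replay count from
# the offset that fill_routing_replay expects.

class RoutingReplay:
    # Global list that every MoE forward is assumed to append to.
    all_routing_replays = []

    @classmethod
    def record(cls):
        cls.all_routing_replays.append(object())

def forward_microbatch(num_moe_layers: int, mtp_num_layers: int) -> None:
    # Main decoder MoE layers each record one routing replay.
    for _ in range(num_moe_layers):
        RoutingReplay.record()
    # With --enable-mtp-training, the MTP block also runs MoE routing and
    # records additional replays that the offset below does not expect.
    for _ in range(mtp_num_layers):
        RoutingReplay.record()

def fill_routing_replay(num_microbatches: int, num_moe_layers: int) -> None:
    # Offset advances as if only the main decoder layers recorded replays.
    routing_replay_offset = num_microbatches * num_moe_layers
    # Fails whenever mtp_num_layers > 0, because extra replays were recorded.
    assert routing_replay_offset == len(RoutingReplay.all_routing_replays)

if __name__ == "__main__":
    num_microbatches, num_moe_layers, mtp_num_layers = 2, 4, 1
    for _ in range(num_microbatches):
        forward_microbatch(num_moe_layers, mtp_num_layers)
    fill_routing_replay(num_microbatches, num_moe_layers)  # raises AssertionError

If this guess is in the right direction, the fix would be to make the routing-replay accounting aware of the MTP layers when --enable-mtp-training is set.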