Description
This error occurs on NPU when running Megatron + vLLM, with both verl and MindSpeed on the latest main-branch code. The error message is:
Traceback (most recent call last):
  File "/verl/verl/workers/megatron_workers.py", line 860, in compute_log_prob
    output, entropys, layers_topk_idx = self.actor.compute_log_prob(data=data, calculate_entropy=True)
  File "/verl/verl/utils/profiler/performance.py", line 105, in f
    return self.log(decorated_function, *args, **kwargs)
  File "/verl/verl/utils/profiler/performance.py", line 118, in log
    output = func(*args, **kwargs)
  File "/verl/verl/workers/actor/megatron_actor.py", line 235, in compute_log_prob
    output = self.forward_backward_batch(...)
  File "/verl/verl/workers/actor/megatron_actor.py", line 683, in forward_backward_batch
    losses_reduced = forward_backward_func(...)
  File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1155, in forward_backward_pipelining_with_interleaving
    output_tensor = forward_step_helper(k, microbatch_id, checkpoint_activations_microbatch)
  File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1004, in forward_step_helper
    output_tensor, num_tokens = forward_step(...)
  File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 286, in forward_step
    outputs = loss_func(output_tensor)
  File "/verl/verl/workers/actor/megatron_actor.py", line 463, in loss_func
    stats = post_process_fn(output, data)
  File "/verl/verl/workers/actor/megatron_actor.py", line 213, in compute_logprobs_fn
    log_probs = output["log_probs"][:, -response_length - 1 : -1].contiguous()
IndexError: too many indices for tensor of dimension 3
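The conclusion so far is that mbridge and VPP (virtual pipeline parallelism) cannot be enabled at the same time: enabling both triggers this error. Has anyone tried enabling mbridge and VPP together in a GPU environment?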
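For anyone trying to reproduce this, it may help to log what `compute_logprobs_fn` actually receives. The traceback is consistent with `output` being a plain 3-dimensional tensor rather than a dict with a `"log_probs"` key, but that is only a guess from the error message, not something confirmed in the code. The helper below is a hypothetical debugging sketch (`describe_output` is not part of verl or Megatron-LM); it only relies on the object exposing a `.shape` attribute, so it does not import torch.

```python
def describe_output(output):
    """Summarize the object passed to a post-processing function.

    Hypothetical debugging helper, not part of verl: it reports whether
    `output` is a dict (listing each key with its shape) or a single
    tensor-like object (reporting its type and shape).
    """
    if isinstance(output, dict):
        parts = []
        for key, value in output.items():
            shape = getattr(value, "shape", None)
            # Fall back to the type name for values that have no shape.
            desc = tuple(shape) if shape is not None else type(value).__name__
            parts.append(f"{key}: {desc}")
        return "dict(" + ", ".join(parts) + ")"

    shape = getattr(output, "shape", None)
    if shape is not None:
        return f"{type(output).__name__} with shape {tuple(shape)}"
    return type(output).__name__
```

Calling `print(describe_output(output))` just before the failing line, once with VPP enabled and once without, should show whether the two code paths hand over differently structured outputs.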