Conversation

@yitianlian (Collaborator)

No description provided.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7a40039d4

Comment on lines 104 to +108
GRPO_ARGS=(
-    --advantage-estimator on_policy_distillation
+    --advantage-estimator grpo
+    --use-opd
+    --opd-type sglang
+    --opd-kl-coef 1.0

P2: Avoid GRPO with token-level OPD rewards

With --advantage-estimator grpo enabled here, the rollout reward pipeline expects one scalar reward per sample, but the configured OPD post-process (examples.on_policy_distillation.on_policy_distillation.post_process_rewards) returns per-token log-probability tensors. That mismatch causes compute_advantages_and_returns to call torch.tensor(rewards) on a list of variable-length tensors, which will fail or produce invalid shapes at runtime. This worked previously only because the on_policy_distillation estimator consumed token-level log-probs directly; switching to GRPO without changing the reward post-process breaks the example.

Useful? React with 👍 / 👎.
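
A minimal sketch of the shape mismatch described above (illustrative only; the variable names and shapes are assumptions, not the repo's code):

import torch

# One scalar reward per sample: what a GRPO-style advantage computation expects.
scalar_rewards = [0.7, 0.1, 0.9]
print(torch.tensor(scalar_rewards).shape)  # torch.Size([3])

# One variable-length per-token tensor per sample: what the OPD post-process returns.
token_rewards = [torch.randn(5), torch.randn(8)]
try:
    torch.tensor(token_rewards)
except ValueError as err:
    print("fails as the review describes:", err)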

Comment on lines 50 to 53
RM_ARGS=(
--custom-rm-path examples.on_policy_distillation.on_policy_distillation.reward_func
--custom-reward-post-process-path examples.on_policy_distillation.on_policy_distillation.post_process_rewards
--rm-url http://$TEACHER_IP:$TEACHER_PORT/generate

P2: Megatron OPD script still requires an external RM server

This “megatron” example still wires the SGLang reward function and --rm-url into training, and reward_func posts to args.rm_url. If users run the script as advertised (no external server), it will fail with a connection error because $TEACHER_IP/$TEACHER_PORT are unset and no server is started. If the intent is truly “no external server,” the RM wiring needs to be removed or replaced with a local reward path.

Useful? React with 👍 / 👎.
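
For illustration only, a rough sketch of one possible "local reward path" that would avoid the external server: load the teacher once and score rollout tokens directly instead of posting to http://$TEACHER_IP:$TEACHER_PORT/generate. The function name, signature, and caching here are hypothetical, not the example's actual API:

import torch
from transformers import AutoModelForCausalLM

_teacher = None  # lazily loaded teacher model (hypothetical module-level cache)

def local_teacher_logprobs(token_ids, teacher_path):
    # Return per-token teacher log-probs for one rollout without any HTTP call.
    global _teacher
    if _teacher is None:
        _teacher = AutoModelForCausalLM.from_pretrained(teacher_path, torch_dtype=torch.bfloat16)
        _teacher.eval()
    ids = torch.tensor(token_ids).unsqueeze(0)                  # (1, T)
    with torch.no_grad():
        logits = _teacher(ids).logits                           # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)        # predict token t+1 from prefix t
    targets = ids[:, 1:].unsqueeze(-1)                          # next-token ids, (1, T-1, 1)
    return logprobs.gather(-1, targets).squeeze(-1).squeeze(0)  # (T-1,)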

@zhuzilin (Contributor) left a comment

LGTM. Left some minor change suggestions.

raise NotImplementedError(f"advantage_estimator {args.advantage_estimator} is not supported. ")

# Apply on-policy distillation KL penalty to advantages (orthogonal to advantage estimator)
apply_opd_kl_to_advantages(

It would be better to move the early return out of this function and into the caller, similar to:

if args.use_opd:
    apply_opd_kl_to_advantages(...)
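
A short sketch of the suggested refactor (the helper's current internal guard and its signature are assumed here, not copied from the PR):

# Before (assumed): the guard lives inside the helper, so every caller invokes it.
def apply_opd_kl_to_advantages(args, advantages):
    if not args.use_opd:
        return
    ...  # apply the OPD KL penalty to the advantages

# After: the caller owns the check and the helper body stays unconditional.
if args.use_opd:
    apply_opd_kl_to_advantages(args, advantages)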

)
parser.add_argument(
"--opd-teacher-ckpt-step", type=int, default=None, help="The checkpoint step for OPD teacher model."
)

We can add a new add_on_policy_distillation_arguments function to manage the OPD arguments in one place.
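
A sketch of what such a helper could look like, grouping the OPD flags that appear in this PR (the argument group, defaults, and help strings are assumptions):

def add_on_policy_distillation_arguments(parser):
    group = parser.add_argument_group("on-policy distillation")
    group.add_argument("--use-opd", action="store_true", help="Enable on-policy distillation.")
    group.add_argument("--opd-type", type=str, default="sglang", help="Backend used to query the teacher.")
    group.add_argument("--opd-kl-coef", type=float, default=1.0, help="Coefficient on the OPD KL penalty.")
    group.add_argument(
        "--opd-teacher-ckpt-step", type=int, default=None, help="The checkpoint step for OPD teacher model."
    )
    return parser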

@zhuzilin merged commit 5073e32 into THUDM:main on Feb 3, 2026
9 checks passed