Conversation

@yitianlian (Collaborator)

No description provided.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7a40039d4

Comment on lines 104 to +108
GRPO_ARGS=(
-    --advantage-estimator on_policy_distillation
+    --advantage-estimator grpo
+    --use-opd
+    --opd-type sglang
+    --opd-kl-coef 1.0

P2: Avoid GRPO with token-level OPD rewards

With --advantage-estimator grpo enabled here, the rollout reward pipeline expects one scalar reward per sample, but the configured OPD post-process (examples.on_policy_distillation.on_policy_distillation.post_process_rewards) returns per-token log-probability tensors. That mismatch causes compute_advantages_and_returns to call torch.tensor(rewards) on a list of variable-length tensors, which will fail or produce invalid shapes at runtime. This worked previously only because the on_policy_distillation estimator consumed token-level log-probs directly; switching to GRPO without changing the reward post-process breaks the example.

Useful? React with 👍 / 👎.
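
A minimal sketch of the shape mismatch described above (illustrative only; the variable names and shapes are assumptions, not the repo's code):

import torch

# One scalar reward per sample: what a GRPO-style advantage computation expects.
scalar_rewards = [0.7, 0.1, 0.9]
print(torch.tensor(scalar_rewards).shape)  # torch.Size([3])

# One variable-length per-token tensor per sample: what the OPD post-process returns.
token_rewards = [torch.randn(5), torch.randn(8)]
try:
    torch.tensor(token_rewards)
except ValueError as err:
    print("fails as the review describes:", err)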

Comment on lines 50 to 53
RM_ARGS=(
--custom-rm-path examples.on_policy_distillation.on_policy_distillation.reward_func
--custom-reward-post-process-path examples.on_policy_distillation.on_policy_distillation.post_process_rewards
--rm-url http://$TEACHER_IP:$TEACHER_PORT/generate

P2: Megatron OPD script still requires an external RM server

This “megatron” example still wires the SGLang reward function and --rm-url into training, and reward_func posts to args.rm_url. If users run the script as advertised (no external server), it will fail with a connection error because $TEACHER_IP/$TEACHER_PORT are unset and no server is started. If the intent is truly “no external server,” the RM wiring needs to be removed or replaced with a local reward path.

Useful? React with 👍 / 👎.
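
For illustration only, a rough sketch of one possible "local reward path" that would avoid the external server: load the teacher once and score rollout tokens directly instead of posting to http://$TEACHER_IP:$TEACHER_PORT/generate. The function name, signature, and caching here are hypothetical, not the example's actual API:

import torch
from transformers import AutoModelForCausalLM

_teacher = None  # lazily loaded teacher model (hypothetical module-level cache)

def local_teacher_logprobs(token_ids, teacher_path):
    # Return per-token teacher log-probs for one rollout without any HTTP call.
    global _teacher
    if _teacher is None:
        _teacher = AutoModelForCausalLM.from_pretrained(teacher_path, torch_dtype=torch.bfloat16)
        _teacher.eval()
    ids = torch.tensor(token_ids).unsqueeze(0)                  # (1, T)
    with torch.no_grad():
        logits = _teacher(ids).logits                           # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)        # predict token t+1 from prefix t
    targets = ids[:, 1:].unsqueeze(-1)                          # next-token ids, (1, T-1, 1)
    return logprobs.gather(-1, targets).squeeze(-1).squeeze(0)  # (T-1,)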

@zhuzilin (Contributor) left a comment

LGTM. Left some minor change suggestions.

raise NotImplementedError(f"advantage_estimator {args.advantage_estimator} is not supported. ")

# Apply on-policy distillation KL penalty to advantages (orthogonal to advantage estimator)
apply_opd_kl_to_advantages(

It would be better to move the early return out of this function and into the caller, similar to:

if args.use_opd:
    apply_opd_kl_to_advantages(...)
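
A short sketch of the suggested refactor (the helper's current internal guard and its signature are assumed here, not copied from the PR):

# Before (assumed): the guard lives inside the helper, so every caller invokes it.
def apply_opd_kl_to_advantages(args, advantages):
    if not args.use_opd:
        return
    ...  # apply the OPD KL penalty to the advantages

# After: the caller owns the check and the helper body stays unconditional.
if args.use_opd:
    apply_opd_kl_to_advantages(args, advantages)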

)
parser.add_argument(
"--opd-teacher-ckpt-step", type=int, default=None, help="The checkpoint step for OPD teacher model."
)

We can add a new add_on_policy_distillation_arguments function to manage the OPD arguments in one place.
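
A sketch of what such a helper could look like, grouping the OPD flags that appear in this PR (the argument group, defaults, and help strings are assumptions):

def add_on_policy_distillation_arguments(parser):
    group = parser.add_argument_group("on-policy distillation")
    group.add_argument("--use-opd", action="store_true", help="Enable on-policy distillation.")
    group.add_argument("--opd-type", type=str, default="sglang", help="Backend used to query the teacher.")
    group.add_argument("--opd-kl-coef", type=float, default=1.0, help="Coefficient on the OPD KL penalty.")
    group.add_argument(
        "--opd-teacher-ckpt-step", type=int, default=None, help="The checkpoint step for OPD teacher model."
    )
    return parser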

@zhuzilin merged commit 5073e32 into THUDM:main on Feb 3, 2026
9 checks passed