[Feature] Add megatron version for on policy distillation #1538
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7a40039d4
GRPO_ARGS=(
    --advantage-estimator on_policy_distillation
    --advantage-estimator grpo
    --use-opd
    --opd-type sglang
    --opd-kl-coef 1.0
Avoid GRPO with token-level OPD rewards
With --advantage-estimator grpo enabled here, the rollout reward pipeline expects one scalar reward per sample, but the configured OPD post-process (examples.on_policy_distillation.on_policy_distillation.post_process_rewards) returns per-token log-probability tensors. That mismatch makes compute_advantages_and_returns call torch.tensor(rewards) on a list of variable-length tensors, which fails or produces invalid shapes at runtime. This worked previously only because the on_policy_distillation estimator consumed token-level log-probs directly; switching to GRPO without changing the reward post-process breaks the example.
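A minimal sketch of the failure mode (the tensor lengths and values here are made up for illustration):

import torch

# Token-level rewards, as produced by the OPD post-process:
# one variable-length log-prob tensor per rollout sample.
rewards = [torch.randn(7), torch.randn(12)]

# GRPO-style advantage computation expects one scalar per sample,
# e.g. rewards = [0.3, -1.2]. Stacking token-level tensors instead fails:
torch.tensor(rewards)
# ValueError: only one element tensors can be converted to Python scalars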
RM_ARGS=(
    --custom-rm-path examples.on_policy_distillation.on_policy_distillation.reward_func
    --custom-reward-post-process-path examples.on_policy_distillation.on_policy_distillation.post_process_rewards
    --rm-url http://$TEACHER_IP:$TEACHER_PORT/generate
Megatron OPD script still requires an external RM server
This “megatron” example still wires the SGLang reward function and --rm-url into training, and reward_func posts to args.rm_url. If users run the script as advertised (no external server), it will fail with a connection error because $TEACHER_IP/$TEACHER_PORT are unset and no server is started. If the intent is truly “no external server,” the RM wiring needs to be removed or replaced with a local reward path.
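If the goal really is "no external server," one option is a local reward path along these lines. This is only a rough sketch under assumed names: local_teacher_logprobs is hypothetical, a HuggingFace-style teacher checkpoint is assumed, and the example's actual reward_func / post_process_rewards interfaces may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def local_teacher_logprobs(teacher_path, prompt, response):
    # Hypothetical local reward path: score the rollout with a locally loaded
    # teacher instead of POSTing to an external /generate endpoint.
    # (In practice the tokenizer/model would be loaded once, not per call.)
    tok = AutoTokenizer.from_pretrained(teacher_path)
    model = AutoModelForCausalLM.from_pretrained(teacher_path, torch_dtype=torch.bfloat16)
    ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each realized token under the teacher (shifted by one).
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens.
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    return token_logp[0, n_prompt - 1:]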
zhuzilin left a comment
LGTM. Left some minor change suggestions.
raise NotImplementedError(f"advantage_estimator {args.advantage_estimator} is not supported. ")

# Apply on-policy distillation KL penalty to advantages (orthogonal to advantage estimator)
apply_opd_kl_to_advantages(
It would be better to move the early return in this function out to the call site, similar to:
if args.use_opd:
    apply_opd_kl_to_advantages(...)
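In other words, a sketch of the refactor (the helper's signature here is assumed for illustration):

# Before (assumed): the guard lives inside the helper.
def apply_opd_kl_to_advantages(args, rollout_data):
    if not args.use_opd:
        return
    ...  # apply the OPD KL penalty to the advantages

# After: the helper always applies the penalty; the call site decides.
if args.use_opd:
    apply_opd_kl_to_advantages(args, rollout_data)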
slime/utils/arguments.py (Outdated)
)
parser.add_argument(
    "--opd-teacher-ckpt-step", type=int, default=None, help="The checkpoint step for OPD teacher model."
)
We could add a new add_on_policy_distillation_arguments function to manage the arguments for OPD, as sketched below.
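A possible shape for that helper, assuming the OPD flags seen in this PR (--use-opd, --opd-type, --opd-kl-coef, --opd-teacher-ckpt-step); apart from --opd-teacher-ckpt-step, the defaults and help strings are guesses:

def add_on_policy_distillation_arguments(parser):
    # Group all OPD-related flags in one place instead of scattering them
    # through the main argument setup.
    group = parser.add_argument_group("on-policy distillation")
    group.add_argument("--use-opd", action="store_true",
                       help="Enable the on-policy distillation KL penalty on advantages.")
    group.add_argument("--opd-type", type=str, default="sglang",
                       help="Backend used to query teacher log-probs.")
    group.add_argument("--opd-kl-coef", type=float, default=1.0,
                       help="Coefficient of the OPD KL penalty.")
    group.add_argument("--opd-teacher-ckpt-step", type=int, default=None,
                       help="The checkpoint step for OPD teacher model.")
    return parser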