Conversation

@yxyOo yxyOo commented Jan 26, 2026

Summary

Ensure FP16 training is consistent and safe by setting required optimizer params, aligning fp16/bf16 flags, and avoiding duplicate grad scaling.

Changes

Results

FP16 shows faster convergence and a lower mismatch between training and rollout.

kwargs["min_loss_scale"] = 1
kwargs["use_precision_aware_optimizer"] = True
kwargs["store_param_remainders"] = False
logger.info(f"FP16 mode enabled. Optimizer config: {kwargs}")
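
For context, a minimal sketch of how these kwargs are typically consumed, assuming the surrounding code builds Megatron's OptimizerConfig directly from them (the exact wiring in this PR may differ, and field availability depends on the installed Megatron version):

    from megatron.core.optimizer import OptimizerConfig

    # kwargs is the dict populated above; this assumes every key in it,
    # including store_param_remainders, is a valid OptimizerConfig field
    # in the installed Megatron version.
    config = OptimizerConfig(**kwargs)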
Contributor

I think these params can already be set with the existing loop:

    for f in dataclasses.fields(OptimizerConfig):
        if hasattr(args, f.name):
            kwargs[f.name] = getattr(args, f.name)

For example, you can set the initial_loss_scale with --initial-loss-scale 32768.
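
A runnable illustration of that mechanism, using a stand-in dataclass instead of Megatron's real OptimizerConfig and a Namespace in place of parsed CLI args:

    import dataclasses
    from argparse import Namespace

    # Stand-in for Megatron's OptimizerConfig, only to show the forwarding loop.
    @dataclasses.dataclass
    class OptimizerConfig:
        initial_loss_scale: float = 2**32
        min_loss_scale: float = 1.0

    # As if the user passed --initial-loss-scale 32768 --min-loss-scale 1
    args = Namespace(initial_loss_scale=32768, min_loss_scale=1.0)

    kwargs = {}
    for f in dataclasses.fields(OptimizerConfig):
        if hasattr(args, f.name):
            kwargs[f.name] = getattr(args, f.name)

    print(kwargs)  # {'initial_loss_scale': 32768, 'min_loss_scale': 1.0}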

Author

Yes, they are configurable, but these params must be enabled whenever FP16 is used; otherwise performance degrades, and users may miss them.
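
A minimal sketch of that intent, i.e., forcing the FP16-specific values only when FP16 is enabled (the `args.fp16` guard is assumed here; the actual code path in this PR may differ):

    # Assumed guard: apply FP16-specific optimizer settings only when FP16 is on,
    # so users get safe defaults without having to pass extra flags.
    if getattr(args, "fp16", False):
        kwargs["min_loss_scale"] = 1
        kwargs["use_precision_aware_optimizer"] = True
        kwargs["store_param_remainders"] = False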

Author

Also, store_param_remainders cannot be set this way (it is not exposed in Megatron's public config), while the other params can be.
