[Dev] Fix precision issues when resuming training from a checkpoint with BF16 and optimizer offload enabled #2789

Baidu-AIAK · 2025-12-31T08:23:01Z

Problem Description

When training with BF16 and optimizer CPU offload enabled, resuming from a checkpoint that includes the optimizer state can introduce a small numerical discrepancy at the first training step after loading.
Parameter settings:

--bf16
--use-precision-aware-optimizer
--optimizer-cpu-offload
--optimizer-offload-fraction 1.0

--load $CHECKPOINT_PATH
#--no-load-optim
#--no-load-rng

Our Solution

We found that this discrepancy is caused by errors in the model parameters restored from the checkpoint. To address this, when loading the optimizer state we restore the model parameters from the parameter values stored in the optimizer.
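The idea can be illustrated with a minimal, standard-library-only sketch. It is not the code in this PR: `to_bf16`, `restore_model_params`, and the dict-based parameter containers are hypothetical names chosen for illustration, and the sketch assumes the optimizer holds a higher-precision copy of each parameter that is cast to BF16 when written back to the model.

```python
import struct


def to_bf16(value: float) -> float:
    """Round a value to the nearest BF16-representable number (ties to even).

    Illustrative helper only; NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # BF16 keeps the top 16 bits of float32. Adding this bias before
    # masking turns plain truncation into round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


def restore_model_params(model_params: dict, optimizer_main_params: dict) -> None:
    """Overwrite each model parameter with the BF16 cast of the optimizer's copy.

    Both arguments are hypothetical name -> value mappings standing in for
    the real model and optimizer state.
    """
    for name, main_value in optimizer_main_params.items():
        model_params[name] = to_bf16(main_value)
```

In this sketch, `restore_model_params` would be called right after the optimizer state has been loaded, so that the model weights used in the first resumed step are derived from the optimizer's values rather than from the separately loaded model parameters.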

Experiment

Under BF16 training, optimizer offload exhibits a precision issue when training is resumed. We loaded the checkpoint from the 10th iteration together with the optimizer state and compared the loss at the 11th iteration (i.e., the first iteration after loading). Before the fix, the loss shows a noticeable discrepancy; after the fix, the discrepancy is reduced to the order of 1e-4.
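A check of this kind can be sketched as follows. This is only an illustration: the function names, the dict-based loss logs, and the default tolerance of 1e-3 are assumptions, not part of this PR or of any existing test harness.

```python
def first_step_discrepancy(baseline_losses: dict, resumed_losses: dict,
                           resume_iteration: int) -> float:
    """Absolute loss gap at the first iteration after resuming.

    baseline_losses / resumed_losses map iteration -> loss for an
    uninterrupted run and a resumed run (hypothetical log format).
    resume_iteration is the iteration whose checkpoint was loaded.
    """
    first_step = resume_iteration + 1
    return abs(baseline_losses[first_step] - resumed_losses[first_step])


def resume_is_consistent(baseline_losses: dict, resumed_losses: dict,
                         resume_iteration: int, tol: float = 1e-3) -> bool:
    """True if the first resumed step matches the baseline within tol.

    The default tol is an assumed threshold, loosely based on the
    1e-4-order gap reported after the fix.
    """
    gap = first_step_discrepancy(baseline_losses, resumed_losses, resume_iteration)
    return gap <= tol
```

With the setup described above, `resume_iteration` would be 10, so the comparison is made at iteration 11.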

Summary

Overall, we identified and fixed the precision issue that occurs when resuming BF16 training from a checkpoint together with its optimizer state while optimizer CPU offload is enabled.


copy-pr-bot · 2025-12-31T08:23:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Commit 1860237: Fix precision issues when resuming training from a checkpoint with BF16 and optimizer offload enabled

Baidu-AIAK requested review from a team as code owners December 31, 2025 08:23

github-actions bot added the community-request label Dec 31, 2025

github-actions bot requested a review from Phlip79 December 31, 2025 08:23

BestJuly added the dev branch label Jan 2, 2026

yaox12 added the Expert Review label Jan 4, 2026
