@Baidu-AIAK

Problem Description

When training with BF16 and optimizer CPU offload enabled, resuming from a checkpoint together with its optimizer state can introduce a small accuracy discrepancy at the first training step after loading.
Parameter settings:

--bf16
--use-precision-aware-optimizer
--optimizer-cpu-offload
--optimizer-offload-fraction 1.0

--load $CHECKPOINT_PATH
#--no-load-optim
#--no-load-rng

Our Solution

We traced the discrepancy to the model parameters, which are not restored exactly when the checkpoint is loaded. To fix this, when the optimizer state is loaded we overwrite the model parameters with the parameter values stored in the optimizer.
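The idea can be illustrated with a toy sketch. This is not the code in this PR and not Megatron-LM's API: it assumes the optimizer keeps a higher-precision copy of each parameter, and the names `to_bf16` and `restore_model_params_from_optimizer` are hypothetical helpers introduced only for illustration. BF16 rounding is simulated with the standard library so the example runs without a GPU.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Hypothetical helper: simulates a BF16 cast by keeping the top 16 bits
    of the float32 encoding. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias so that dropping the low 16 bits rounds to nearest, ties to even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def restore_model_params_from_optimizer(model_params, master_params):
    """Overwrite each model parameter with the BF16 cast of the optimizer's copy.

    Assumption: `master_params` maps parameter names to the values held in
    the optimizer state, and `model_params` maps the same names to the
    (possibly inexact) values restored from the checkpoint.
    """
    for name, master_value in master_params.items():
        model_params[name] = to_bf16(master_value)
    return model_params


# Made-up placeholder values, purely for illustration.
master_params = {"w": 0.1234567, "b": -1.000123}
# Pretend the checkpoint load left the model parameters slightly off.
model_params = {"w": 0.125, "b": -1.0078125}

restore_model_params_from_optimizer(model_params, master_params)
```

After the call, every model parameter equals the BF16 cast of the value held by the optimizer, so the first step after resuming starts from parameters that are consistent with the optimizer state.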

Experiment

[Figure: bf16-opt]

Under BF16 training with optimizer offload, resumed training exhibits an accuracy issue. We load the checkpoint from iteration 10 together with the optimizer state. Before the fix, the loss at iteration 11 (the first iteration after loading) shows a noticeable discrepancy; after the fix, the discrepancy is reduced to the order of 1e-4.

Summary

Overall, we identified and fixed the accuracy issue that occurs when loading a checkpoint and its optimizer state with CPU offload enabled during BF16 training.

@Baidu-AIAK Baidu-AIAK requested review from a team as code owners December 31, 2025 08:23
@copy-pr-bot

copy-pr-bot bot commented Dec 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot requested a review from Phlip79 December 31, 2025 08:23
@BestJuly BestJuly added the dev branch Dev branch related issues and development label Jan 2, 2026
@yaox12 yaox12 added the Expert Review Apply this label to indicate that your PR is ready for expert review. label Jan 4, 2026

Labels

community-request dev branch Dev branch related issues and development Expert Review Apply this label to indicate that your PR is ready for expert review.
