-
Notifications
You must be signed in to change notification settings - Fork 3k
[recipe, ckpt] feat: bypass the CPU in the checkpoint engine for fully_async mode. #4654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces an optimization to bypass the CPU in the checkpoint engine for fully_async mode, which should improve performance by reducing data copies between host and device. A new configuration bypass_cpu is added to control this behavior. The changes in checkpoint_engine.py and fsdp_workers.py correctly implement this new path. My main feedback is to improve the naming of some functions and variables that have become misleading with this change, to ensure code clarity and maintainability.
|
The reason I cache params to cpu is to further support parameters version management. But I think it is a good pr. Can you update the relevant experiments/readme as in the commit. |
|
@zpltys Sorry for the delay. I have updated the README and conducted some basic experiments. With a single-node Ascend A2 server and a 4B model, enabling Besides, fully_async, together with the checkpoint engine, can work well on Ascend NPUs with some adaptations, as in #4658. We may first have a discussion internally and then figure out how to support fully_async and the checkpoint engine more gracefully. Many thanks. |
What does this PR do?
Bypass the CPU in the checkpoint engine for fully_async mode.
Currently, in fully_async mode, each time the weights are updated, they are offloaded to the CPU, then copied to the h2d_buffer, and finally copied to the broadcast buffer.
The offload-to-CPU operation seems unnecessary. So this PR adds an additional configuration named
bypass_cputo bypass the CPU cache. Early-stage experiments show benefits.Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)