[perf] fix: fix npu profiling scripts #5226

tongtong0613 wants to merge 1 commit into verl-project:main
Conversation
Code Review
This pull request updates the NPU profiling scripts by increasing gpu_memory_utilization and setting a large update_weights_bucket_megabytes. While these changes can improve performance, setting the weight-update bucket size to 4GB in an example script is risky: it significantly increases memory pressure on the training worker and may lead to out-of-memory errors on hardware with less memory. I've added comments suggesting a warning about the high memory requirement to mitigate this risk for other users.
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
Setting update_weights_bucket_megabytes to 4096MB allocates a 4GB buffer on the training worker's device. This large allocation, combined with memory for the model and optimizer, significantly increases the risk of out-of-memory (OOM) errors, especially on devices with less memory. While it can improve performance, this high value in an example script is risky. It would be beneficial to add a comment to warn users about the high memory requirement.
Suggested change:

-actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
+# NOTE: A large bucket size (4GB) is used for weight updates to improve throughput.
+# This may cause OOM on devices with less memory. Consider lowering if you encounter OOM errors.
+actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
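If the 4GB bucket does trigger OOM, the override can simply be lowered in the affected script. A minimal sketch, assuming a smaller bucket is acceptable for the run (the 512 value is illustrative, not taken from this PR or tuned):

```bash
# Illustrative fallback: trade weight-update throughput for memory headroom.
# 512 (MB) is an assumed example value, not a tested recommendation.
actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=512 \
```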
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.enable_chunked_prefill=False \
actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
The same comment and suggested warning apply to this script as well.
What does this PR do?
Fix the NPU profiling scripts: increase `gpu_memory_utilization` and set `update_weights_bucket_megabytes` for the rollout checkpoint engine.
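For context, these values are passed as Hydra overrides to the trainer entrypoint. A minimal launch sketch, assuming the standard `verl.trainer.main_ppo` entrypoint (the actual profiling scripts carry many more model/data/trainer settings, elided here; only the rollout overrides below come from this PR's diff):

```bash
#!/usr/bin/env bash
# Minimal launch sketch: entrypoint assumed, other required overrides elided.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
    "$@"
```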
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.