【FT】Fix Fault Tolerance in PaddleFleet #4264

Merged
wtmlon merged 9 commits into PaddlePaddle:develop from Xing-lil:fix_ft on Apr 15, 2026
@Xing-lil Xing-lil commented Apr 12, 2026

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

Others

PR changes

Others

Description

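Fix the fault tolerance system.
1. Some parameters set build_skip_comm_buffer, which ZCC did not handle

  • e_score_correction_bias sets build_skip_comm_buffer with "color": "skip_comm" and is stop_gradient, so it is not in the fused buffer used for sharding communication and is never added to param_mappings.
  • Fix: ZCC now identifies unsharded parameters and records them in param_mappings as well, marked with "buffer_index": "unfused"; at save time they are saved directly.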
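
The idea behind this fix can be illustrated with a minimal sketch. It is not the code in zero_cost_checkpoint.py: the `Param` class, `build_param_mappings`, `save_state`, and the `offset`/`numel` fields are hypothetical names chosen for illustration, and tensors are modeled as plain Python lists. Only the `"buffer_index": "unfused"` marker and the parameter name `e_score_correction_bias` come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for a model parameter (data is a flat list)."""
    name: str
    data: list
    color: str = "default"
    stop_gradient: bool = False


def build_param_mappings(fused_buffers, all_params):
    """Map every parameter name to the place its data lives.

    fused_buffers: list of lists of Param, one inner list per fused buffer.
    all_params: every Param of the model, including those outside any buffer.
    """
    mappings = {}
    for index, buffer_params in enumerate(fused_buffers):
        offset = 0
        for p in buffer_params:
            mappings[p.name] = {"buffer_index": index, "offset": offset, "numel": len(p.data)}
            offset += len(p.data)
    # Parameters that skipped the comm buffer would otherwise be missing here,
    # so they are recorded with an explicit "unfused" marker.
    for p in all_params:
        if p.name not in mappings:
            mappings[p.name] = {"buffer_index": "unfused", "numel": len(p.data)}
    return mappings


def save_state(mappings, fused_data, all_params):
    """Collect the values to checkpoint, slicing fused buffers where needed."""
    by_name = {p.name: p for p in all_params}
    state = {}
    for name, info in mappings.items():
        if info["buffer_index"] == "unfused":
            # Unfused parameters are copied directly from the parameter itself.
            state[name] = list(by_name[name].data)
        else:
            buf = fused_data[info["buffer_index"]]
            state[name] = buf[info["offset"]: info["offset"] + info["numel"]]
    return state


w = Param("experts.weight", [1.0, 2.0, 3.0])
g = Param("gate.weight", [4.0, 5.0])
bias = Param("e_score_correction_bias", [0.5, 0.25], color="skip_comm", stop_gradient=True)

fused_buffers = [[w, g]]          # bias is deliberately not part of any fused buffer
fused_data = [w.data + g.data]    # flat storage of the single fused buffer
mappings = build_param_mappings(fused_buffers, [w, g, bias])
state = save_state(mappings, fused_data, [w, g, bias])
```

In this sketch the bias ends up in the saved state even though it never entered a fused buffer, which is the behavior the fix describes.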

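2. EMA issue in grouped_gemm_experts

  • When the EMA ckpt was saved, the shape of grouped_gemm parameters was not restored.
  • Fix: when saving EMA, restore grouped_gemm parameters to their original dimensions.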
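
A rough sketch of what restoring the shape before saving could look like follows. Everything here is an assumption made for illustration: the helper names, the `original_shapes` lookup, the example shapes, and the premise that the grouped layout differs from the original only by a row-major reshape (a real layout might also need a transpose). Tensors are modeled as a flat list plus a shape tuple.

```python
from math import prod


def restore_original_shape(name, flat_values, stored_shape, original_shapes):
    """Return (values, shape) to write for one EMA entry.

    original_shapes is a hypothetical lookup recorded before the grouped
    layout was applied; entries not listed there keep their stored shape.
    """
    target = tuple(original_shapes.get(name, stored_shape))
    if prod(target) != prod(stored_shape) or prod(stored_shape) != len(flat_values):
        raise ValueError(f"element count mismatch for {name}: {stored_shape} -> {target}")
    # Assumption: only the shape metadata changes, element order stays the same.
    return list(flat_values), target


def build_ema_checkpoint(ema_state, original_shapes):
    """ema_state maps name -> (flat_values, stored_shape)."""
    return {
        name: restore_original_shape(name, values, shape, original_shapes)
        for name, (values, shape) in ema_state.items()
    }


# Hypothetical example: 2 experts, each with a 2 x 3 weight, stored as one 4 x 3 matrix.
ema_state = {
    "grouped_gemm_experts.weight1": ([float(i) for i in range(12)], (4, 3)),
    "norm.weight": ([1.0, 1.0, 1.0], (3,)),
}
original_shapes = {"grouped_gemm_experts.weight1": (2, 2, 3)}
ema_ckpt = build_ema_checkpoint(ema_state, original_shapes)
```

The element-count check is there so that a wrong entry in the shape lookup fails loudly at save time rather than producing a checkpoint that cannot be loaded.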

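3. AOA configuration issue with MLA

  • The model has MLA enabled, but the most recently merged AOA configuration did not support MLA.
  • Fix: added the AOA configuration for MLA.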

4. GPTModelProvider is missing to_json_string

  • When the model is configured via GPTModelProvider, the config lacks to_json_string.
  • Fix: added to_json_string.
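
For readers unfamiliar with the method, here is a minimal sketch of a `to_json_string` on a dataclass-style config. `ToyModelProvider` and its fields are hypothetical and are not the real GPTModelProvider; the sketch also assumes the provider can be treated as a dataclass, which the description above does not state.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ToyModelProvider:
    """Hypothetical config object; field names are illustrative only."""
    hidden_size: int = 1024
    num_layers: int = 2
    use_mla: bool = True

    def to_dict(self):
        return asdict(self)

    def to_json_string(self):
        # default=str keeps serialization from failing on values such as
        # dtypes or callables by falling back to their string form.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True, default=str) + "\n"


provider = ToyModelProvider(hidden_size=2048)
text = provider.to_json_string()
```

Code that saves or logs a config through this method can then treat the provider like any other config object.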

Verified that precision is aligned when training resumes under ZCC.

paddle-bot bot commented Apr 12, 2026

Thanks for your contribution!

codecov-commenter commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 4.91803% with 58 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6d7d8fb). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...addleformers/trainer/utils/zero_cost_checkpoint.py | 0.00% | 25 Missing ⚠️ |
| paddleformers/trainer/trainer_utils.py | 0.00% | 14 Missing ⚠️ |
| paddleformers/transformers/gpt_provider.py | 21.42% | 11 Missing ⚠️ |
| paddleformers/transformers/minimax_m2/modeling.py | 0.00% | 8 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (4.91%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4264   +/-   ##
==========================================
  Coverage           ?   33.78%           
==========================================
  Files              ?      474           
  Lines              ?    89370           
  Branches           ?        0           
==========================================
  Hits               ?    30191           
  Misses             ?    59179           
  Partials           ?        0           

@wtmlon wtmlon left a comment

lgtm

@wtmlon wtmlon merged commit 591d0ea into PaddlePaddle:develop Apr 15, 2026
17 checks passed
Minestar6 pushed a commit to Minestar6/PaddleFormers that referenced this pull request Apr 17, 2026