Fix NaN issues and assertion errors in the PPO-Lagrange algorithm #385
Open
Problem Description
Fixes the NaN values and assertion failures that appeared during PPO-Lagrange training.
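This PR does not reproduce its diff here, so the sketch below is only an illustration of the kind of numerical guards commonly used to stop NaN propagation in a PPO-Lagrange update: sanitizing and normalizing advantages with a variance floor, and clamping the Lagrange multiplier during dual ascent. The function names, learning rate, and clamp bound are assumptions, not values from this PR.

```python
# Illustrative sketch only -- not the code from this PR. Function names, the
# dual-ascent learning rate, and the clamp bound are assumptions.
import torch


def sanitize_advantage(adv: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Replace non-finite entries and normalize with a variance floor."""
    adv = torch.nan_to_num(adv, nan=0.0, posinf=0.0, neginf=0.0)
    std = adv.std()
    if not torch.isfinite(std) or std < eps:
        # Degenerate batch (e.g. constant advantages): skip dividing by ~0.
        return adv - adv.mean()
    return (adv - adv.mean()) / (std + eps)


def update_lagrange_multiplier(lagrangian_multiplier: torch.nn.Parameter,
                               mean_ep_cost: float,
                               cost_limit: float,
                               lambda_lr: float = 0.035,
                               lambda_max: float = 100.0) -> None:
    """One projected dual-ascent step; clamping keeps lambda finite and >= 0."""
    with torch.no_grad():
        lagrangian_multiplier += lambda_lr * (mean_ep_cost - cost_limit)
        lagrangian_multiplier.clamp_(min=0.0, max=lambda_max)
```

In PPO-Lagrange the multiplier weights the cost surrogate alongside the reward surrogate, so keeping it finite and non-negative prevents the combined loss from producing NaN gradients in the first place.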
Changes
online_adapter.py:
ppo_lag.py:
policy_gradient.py:
onpolicy_adapter.py:
New scripts:
reproduce_nan_issue.py: script to reproduce and debug the NaN issue (see the sketch below)
train_with_risk.py: script for risk-sensitive training comparison experiments
Testing and Validation
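Purely as a hypothetical illustration (not the contents of reproduce_nan_issue.py), a NaN-debugging script of this kind typically combines explicit finiteness assertions on intermediate tensors with PyTorch's anomaly detection:

```python
# Hypothetical sketch only -- not the contents of reproduce_nan_issue.py.
import torch


def assert_finite(name: str, tensor: torch.Tensor) -> None:
    """Fail fast with a descriptive error as soon as a tensor goes non-finite."""
    if not torch.isfinite(tensor).all():
        raise FloatingPointError(
            f"Non-finite values in {name}: "
            f"nan={torch.isnan(tensor).any().item()}, "
            f"inf={torch.isinf(tensor).any().item()}"
        )


# Anomaly detection makes PyTorch report which backward op produced a NaN gradient.
torch.autograd.set_detect_anomaly(True)
```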
Scope of Impact
These changes mainly affect: