📝 Paper@arXiv | 🤗 HuggingFace | 🐱 GitHub
All experiments from the paper “DCPO: Dynamic Clipping Policy Optimization” can be reproduced with this repo.
DCPO is a reinforcement‑learning‑from‑verifiable‑rewards (RLVR) framework that dramatically improves data utilization and training speed for large language models (LLMs) on reasoning‑heavy tasks.
| Feature | What it does |
|---|---|
| Dynamic Adaptive Clipping | Computes a closed‑form clipping bound that depends on the old probability, reducing the token‑clipping ratio by ~10× compared with fixed clipping. |
| Smooth Advantage Standardization | Standardizes rewards by mixing the current‑step statistics with the cumulative statistics, removing zero‑gradient “dead zones” and increasing non‑zero‑gradient usage by ≈ 28 %. |
| OTM loss | Calculates the loss over the tokens of a single response without batch-level averaging, preserving the relative advantage between responses. |
| Broad‑scale validation | Tested on MATH‑500, AMC‑23, AIME‑24, and AIME‑25 with model sizes from 1.5 B to 14 B. DCPO‑7B reaches 38.8 Avg@32 on AIME‑24 (↑ 21 % over GRPO) while roughly halving wall‑clock GPU hours versus DAPO. |
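The OTM loss row above is only described informally; the sketch below shows one way a loss can be averaged over the tokens of each single response and then over responses, with no batch-level token pooling. The function name `otm_loss` and the tensor layout are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def otm_loss(token_loss: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative per-response token-mean loss.

    token_loss:    (batch, seq_len) per-token surrogate loss, already containing
                   the advantage and the clipped importance ratio.
    response_mask: (batch, seq_len) 1 for generated tokens, 0 for prompt/padding.
    """
    # Mean over the tokens of each individual response.
    per_response = (token_loss * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)
    # Average across responses; because no batch-level token pooling is done,
    # long responses do not dominate and relative advantages between responses are preserved.
    return per_response.mean()
```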
GRPO first samples G responses for each query, assigns rewards R through a rule-based reward function, and estimates token-level advantages from the group of rewards.
GSPO replaces token-level clipping with sequence-level clipping. It discards nonzero-advantage responses with high variance, yet keeps tokens with a high token-level probability ratio, which destabilizes training and wastes many informative tokens in the high-variance responses.
In previous works, including GRPO and DAPO, the advantage of the $i$-th response is standardized using only the rewards sampled at the current step: $\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$.
For importance sampling, the expected value of a function $f(x)$ under the target distribution $p$ can be estimated from samples drawn from the proposal distribution $q$ by reweighting: $\mathbb{E}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim q}\big[\tfrac{p(x)}{q(x)} f(x)\big]$. In RLVR training, $p$ is the current policy $\pi_\theta$ and $q$ is the old policy $\pi_{\theta_{\mathrm{old}}}$.
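As a quick numerical illustration of this identity (not part of the repository), the toy snippet below estimates an expectation under a shifted Gaussian using samples drawn from the original one; the distributions and the function are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_old, mu_new, sigma = 0.0, 0.5, 1.0

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Samples come from the "old" distribution, but we want E_new[x^2].
x = rng.normal(mu_old, sigma, size=200_000)
ratio = np.exp(log_normal_pdf(x, mu_new, sigma) - log_normal_pdf(x, mu_old, sigma))

is_estimate = np.mean(ratio * x**2)                              # importance-weighted estimate
direct = np.mean(rng.normal(mu_new, sigma, size=200_000) ** 2)   # sampling the new distribution
print(is_estimate, direct)  # both close to sigma^2 + mu_new^2 = 1.25
```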
Previous works (e.g., PPO, GRPO) usually set fixed bounds on the importance ratio, clipping $\frac{p(x)}{q(x)}$ to $[1-\varepsilon,\ 1+\varepsilon]$; DAPO decouples the two sides into an asymmetric pair $(\varepsilon_{\mathrm{low}}, \varepsilon_{\mathrm{high}})$.
| clipping method | old probability q(x) | lower bound of p(x) | upper bound of p(x) |
|---|---|---|---|
| symmetric fixed bound (GRPO) | 0.9 | 0.72 | min(1.08, 1) |
| asymmetric fixed bound (DAPO) | 0.9 | 0.72 | min(1.152, 1) |
| dynamic-adaptive bound (ours) | 0.9 | 0.69 | min(1.06, 1) |
| symmetric fixed bound (GRPO) | 0.01 | 0.008 | 0.012 |
| asymmetric fixed bound (DAPO) | 0.01 | 0.008 | 0.0128 |
| dynamic-adaptive bound (ours) | 0.01 | 0.005 | 0.05 |
As the table shows, when the old probability q(x) is small, the upper bound on p(x) allowed by our dynamic-adaptive clipping is much larger than under either fixed method (symmetric or asymmetric), so far more of these low-probability tokens can still participate in the model update.
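The fixed-bound rows of the table follow directly from the clipped ratio range; the snippet below reproduces them, assuming the commonly used settings ε = 0.2 for GRPO and (ε_low, ε_high) = (0.2, 0.28) for DAPO. The dynamic-adaptive bounds come from the closed form in the paper and are not reproduced here.

```python
def fixed_clip_bounds(q_old: float, eps_low: float, eps_high: float) -> tuple[float, float]:
    """New-policy probability range implied by clipping p(x)/q(x) to [1 - eps_low, 1 + eps_high]."""
    low = q_old * (1.0 - eps_low)
    high = min(q_old * (1.0 + eps_high), 1.0)  # a probability can never exceed 1
    return low, high

for q_old in (0.9, 0.01):
    print(f"q(x) = {q_old}")
    print("  GRPO (symmetric, eps = 0.2):  ", fixed_clip_bounds(q_old, 0.20, 0.20))
    print("  DAPO (asymmetric, 0.2 / 0.28):", fixed_clip_bounds(q_old, 0.20, 0.28))
```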
Previous works calculate the advantage using only the current-step rewards of the generated responses. This approach can lead to several issues:
- when randomness in response sampling causes all rewards to be the same at a given step, the advantage becomes zero, preventing the prompt from contributing to parameter updates despite potentially valuable differences in reasoning trajectories.
- randomness in high-entropy sampling can yield highly skewed label counts, causing large fluctuations in standardized advantage values across steps, even reversing signs, thus destabilizing training.
We consider the cumulative rewards for the same prompt to calculate the advantage.
To mitigate fluctuations of the step-specific standardization, we additionally standardize each reward against the cumulative statistics of all rewards observed for the same prompt across training steps.
To reduce the impact that the respective fluctuations of the cumulative standardization and the current-step standardization have on training stability, our final advantage smoothly blends the two.
Once a prompt has participated in model optimization, its responses continue to contribute to model updates in later steps: even when all rewards at the current step are identical, the responses still receive a non-zero advantage from the cumulative statistics, as sketched below.
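A minimal sketch of how such a smooth standardization could be maintained is shown below; the blending weight `alpha` and the per-prompt bookkeeping are illustrative assumptions rather than the exact scheme from the paper.

```python
import numpy as np
from collections import defaultdict

class SmoothAdvantage:
    """Blend current-step group statistics with cumulative per-prompt statistics."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha                 # blending weight (assumed)
        self.history = defaultdict(list)   # prompt_id -> all rewards seen so far

    def __call__(self, prompt_id: str, rewards: np.ndarray) -> np.ndarray:
        self.history[prompt_id].extend(rewards.tolist())
        cum = np.asarray(self.history[prompt_id])

        # Current-step standardization: collapses to zero when all rewards are equal.
        step_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
        # Cumulative standardization over every reward this prompt has produced so far.
        cum_adv = (rewards - cum.mean()) / (cum.std() + 1e-6)

        # The blend keeps a non-zero learning signal even when step_adv is all zeros.
        return self.alpha * step_adv + (1.0 - self.alpha) * cum_adv
```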
- Code base – DCPO extends the open‑source Verl codebase (https://github.com/volcengine/verl). *Mainly the loss formulation and the dynamic adaptive clipping / step‑smooth advantage standardization modules are added.*
- Docker image – The training environment used in the paper is published as a Docker image:
docker pull verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2
[
{
"data_source": "qwen_aime_2024", # start with qwen
"prompt":
[
{
"content": "Please reason step by step, and put your final answer within \\boxed{}.",
"role": "system"
}, # the qwen template
{
"content": "There exist real numbers $x$ and $y$, both greater than 1, such that $\\log_x\\left(y^x\\right)=\\log_y\\left(x^{4y}\\right)=10$. Find $xy$.", # question
"role": "user"
}
],
"ability": "MATH",
"reward_model":
{
"ground_truth": "25", # the label for the question
"style": "rule-lighteval/MATH_v2" # option
},
"extra_info":
{
"index": 24, # must have and be different from each other
"raw_problem": "There exist real numbers $x$ and $y$, both greater than 1, such that $\\log_x\\left(y^x\\right)=\\log_y\\left(x^{4y}\\right)=10$. Find $xy$.",
"split": null
}
}
]
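To convert your own problems into this template, a small script along the following lines can be used; the input format, output path, and the optional parquet export are assumptions, so adapt them to whatever the run scripts in this repo expect.

```python
import json

SYSTEM_PROMPT = "Please reason step by step, and put your final answer within \\boxed{}."

def to_dcpo_record(idx: int, question: str, answer: str, source: str = "my_dataset") -> dict:
    return {
        "data_source": source,
        "prompt": [
            {"content": SYSTEM_PROMPT, "role": "system"},
            {"content": question, "role": "user"},
        ],
        "ability": "MATH",
        "reward_model": {"ground_truth": answer, "style": "rule-lighteval/MATH_v2"},
        "extra_info": {"index": idx, "raw_problem": question, "split": None},
    }

problems = [("What is $1+1$?", "2")]  # toy example: (question, ground-truth answer)
records = [to_dcpo_record(i, q, a) for i, (q, a) in enumerate(problems)]

with open("my_dataset.json", "w") as f:
    json.dump(records, f, indent=2)
# If the scripts expect parquet instead:
# import pandas as pd; pd.DataFrame(records).to_parquet("my_dataset.parquet")
```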
git clone https://github.com/lime-RL/DCPO.git
cd DCPO
pip install -r requirements.txt
requirements.txt includes the additional Python packages.
# checkpoint (e.g. Qwen2.5‑Math‑7B) will be fetched from HuggingFace automatically
# convert your data to the template above and modify the data path in the *.sh scripts
bash ./recipe/dcpo/run_dcpo.sh # DCPO
bash ./recipe/dcpo/run_grpo.sh # GRPO baseline
bash ./recipe/dcpo/run_dapo.sh # DAPO baseline
bash ./recipe/dcpo/run_gspo.sh # GSPO ablation baseline
Each script sets the corresponding hyper-parameters and starts training (the default is 8 GPUs on a single machine; multi-machine setups are recognized automatically).
- Avg@1: This metric represents the standard accuracy achieved using greedy decoding. It measures the performance of the model's single best prediction.
- Avg@32: This metric calculates the average accuracy over 32 sampled responses per problem, using a temperature of 1.0 and top_p of 1.0. This metric provides insight into the robustness and stability of the trained policy distribution.
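Both metrics can be computed from per-sample correctness along these lines; the function names and the shape of the correctness arrays are illustrative assumptions.

```python
import numpy as np

def avg_at_1(correct_greedy: np.ndarray) -> float:
    """Accuracy of the single greedy-decoded answer, correct_greedy shape (num_problems,)."""
    return float(correct_greedy.mean())

def avg_at_k(correct_samples: np.ndarray) -> float:
    """Mean accuracy over k sampled answers per problem (temperature 1.0, top_p 1.0),
    correct_samples shape (num_problems, k) with 0/1 entries."""
    return float(correct_samples.mean(axis=1).mean())

# Toy usage: 3 problems, 32 samples each.
rng = np.random.default_rng(0)
print(avg_at_k(rng.integers(0, 2, size=(3, 32))))
```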
In the result tables below, cells with two numbers report Avg@1/Avg@32.

| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
|---|---|---|---|---|---|
| base | 73.6 | 57.5/49.4 | 10.0/10.0 | 3.3/6.1 | 36.1/21.8 |
| GRPO | 77.2 | 70.0/68.4 | 16.7/14.0 | 20.0/13.5 | 46.0/32.0 |
| DAPO | 76.0 | 80.0/70.6 | 20.0/13.5 | 14.4/12.5 | 46.5/32.4 |
| DCPO | 77.2 | 75.0/70.8 | 20.0/15.6 | 16.7/12.1 | 47.2/32.8 |
| Δ GRPO | +0.0 | +5.0/+2.4 | +3.3/+1.6 | ‑3.3/‑1.4 | +1.2/+0.8 |
| Δ DAPO | +1.2 | ‑5.0/+0.2 | +0.0/+2.1 | +2.3/‑0.4 | +0.7/+0.4 |
| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
|---|---|---|---|---|---|
| base | 46.4 | 27.5/7.8 | 3.3/0.1 | 3.3/0.7 | 20.1/2.9 |
| GRPO | 69.2 | 62.5/51.6 | 10.0/7.5 | 6.7/4.5 | 36.3/21.0 |
| DAPO | 72.4 | 57.5/54.0 | 10.0/8.3 | 6.7/3.9 | 36.6/23.1 |
| DCPO | 71.2 | 62.5/55.8 | 3.3/7.5 | 10.0/4.7 | 37.6/22.7 |
| Δ GRPO | +2.0 | +0.0/+4.2 | ‑6.7/+0.0 | +3.3/+0.2 | +1.3/+1.7 |
| Δ DAPO | ‑1.2 | +5.0/+1.8 | ‑6.7/‑0.8 | +3.3/+0.8 | +1.0/‑0.4 |
| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
|---|---|---|---|---|---|
| base | 50.4 | 40.0/19.5 | 13.3/6.0 | 3.3/1.5 | 28.4/9.3 |
| GRPO | 81.6 | 77.5/75.9 | 36.7/32.1 | 16.7/16.7 | 53.1/41.6 |
| DAPO | 83.0 | 72.5/80.7 | 36.7/31.6 | 23.3/14.9 | 53.9/42.4 |
| GSPO | 84.0 | 80.0/78.8 | 40.0/34.9 | 16.7/16.2 | 55.2/43.3 |
| DCPO | 82.5 | 82.6/79.8 | 46.7/38.8 | 16.7/17.2 | 57.1/45.2 |
| Δ GRPO | +0.9 | +5.1/+4.9 | +10.0/+6.7 | +0.0/+0.5 | +4.0/+3.6 |
| Δ DAPO | ‑0.5 | +10.1/‑0.9 | +10.0/+7.2 | ‑6.6/+2.3 | +3.3/+2.8 |
| Δ GSPO | ‑1.5 | +2.6/+1.0 | +6.7/+3.9 | +0.0/+1.0 | +1.9/+1.9 |
| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
|---|---|---|---|---|---|
| Qwen‑Math‑14B (Common base) | 60.8 | 47.5/16.4 | 3.3/1.3 | 3.3/1.1 | 28.7/6.3 |
| GRPO | 81.2 | 75.0/65.6 | 13.3/17.6 | 13.3/10.5 | 45.7/31.3 |
| DAPO | 83.4 | 87.5/85.1 | 16.7/16.4 | 20.0/15.3 | 51.9/38.9 |
| GSPO | 78.6 | 77.5/75.0 | 23.3/16.0 | 16.7/9.9 | 49.0/33.5 |
| DCPO | 84.6 | 85.0/79.9 | 20.0/18.2 | 23.3/19.0 | 53.2/39.0 |
| Δ GRPO | +3.4 | +10.0/+14.3 | +6.7/+0.6 | +10.0/+8.5 | +6.5/+7.7 |
| Δ DAPO | +1.2 | ‑2.5/‑5.2 | +3.3/+1.8 | +3.3/+3.7 | +1.3/+0.1 |
| Δ GSPO | +6.0 | +7.5/+4.9 | -3.3/+2.2 | +6.6/+10.1 | +4.2/+5.5 |
- Response Utilization Ratio (RUR) ↑ 70 % for DCPO (vs. 44 % for GRPO).
- Token‑Clipping Ratio (TCR) is roughly 10× lower than that of GRPO/DAPO.
- Training wall‑clock time is roughly half of DAPO for the same number of update steps.
- GRPO: for smaller models (1.5B and 3B), the TCR increases with training steps, while for larger models (7B and 14B) it gradually decreases.
- DAPO: the TCR shows an upward trajectory at all model scales, indicating that DAPO uses a growing proportion of partial or truncated responses to update the model.
- GSPO: the TCR exceeds 11% on Qwen2.5-Math-7B and 15% on Qwen2.5-14B, much higher than the token-level clipping methods, so most responses are wasted during training. Although GSPO keeps the sequence-level variance-bias small, it retains tokens with high token-level importance ratios, and these tokens can increase training instability.
- DCPO: unlike both GRPO and DAPO, the TCR for DCPO remains relatively constant and is an order of magnitude lower than theirs. DCPO uses more of the high-entropy tokens, which are more informative, while discarding tokens with excessively abnormal importance weights, freeing up more reasonable space for model exploration.
Taking Qwen2.5-Math-7B as an example, after about 60 steps roughly 95% of generated tokens have a probability above 0.9, after about 100 steps more than 97%, and the proportion keeps increasing in the later stages of training. This indicates that the model generates most tokens with high confidence and only a few with high entropy, so most token-level clipping is concentrated on this small set of low-probability, high-entropy tokens.
| model | GRPO | GSPO | DCPO |
|---|---|---|---|
| Qwen2.5-Math-1.5B-Instruct | 45.6% | - | 67.1% |
| Qwen2.5-3B | 48.3% | - | 74.3% |
| Qwen2.5-Math-7B | 37.4% | 43.5% | 73.2% |
| Qwen2.5-14B | 43.9% | 47.6% | 72.4% |
| Average | 43.8% | 45.6% | 71.8% |
Because the average RUR of GRPO and GSPO is below 50%, both methods waste more than half of the generated responses under current-step standardization. Our method DCPO instead keeps the RUR around 70% after the first epoch, and it keeps slowly increasing over the subsequent training steps.
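For reference, a minimal sketch of how RUR and TCR can be measured during training is shown below. It follows the descriptions in this README (a response counts as utilized when its advantage is non-zero; a token counts as clipped when its importance ratio leaves the clipping range); the tensor names and shapes are assumptions.

```python
import torch

def response_utilization_ratio(advantages: torch.Tensor) -> float:
    """Fraction of responses with a non-zero advantage, advantages shape (num_responses,)."""
    return float((advantages.abs() > 0).float().mean())

def token_clipping_ratio(ratio: torch.Tensor, low: torch.Tensor, high: torch.Tensor,
                         mask: torch.Tensor) -> float:
    """Fraction of generated tokens whose importance ratio falls outside its clipping bounds.
    All tensors have shape (batch, seq_len); low/high may be per-token (dynamic) or constant."""
    clipped = ((ratio < low) | (ratio > high)).float() * mask
    return float(clipped.sum() / mask.sum().clamp(min=1))
```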
We conducted an ablation study on Qwen2.5-Math-7B to assess the contribution of each component in DCPO, using Avg@32 as the evaluation metric. This metric highlights the robustness and stability of the learned policy distribution. To ensure fairness, each experiment modifies a single component of the baseline GRPO framework while keeping all other settings identical and removing the KL divergence term to align with DAPO, GSPO, and the full DCPO.
Each component of DCPO contributes positively to overall performance, and their combination leads to substantial cumulative gains. The results validate the effectiveness of the proposed mechanisms in improving data efficiency and stability in reinforcement learning for LLMs.
The source code is released under the Apache‑2.0 license (the same license as the underlying Verl code). Pre‑trained Qwen‑Math checkpoints or others are provided under their original licenses – please refer to the model cards on HuggingFace for details.
If you use DCPO in your research, please cite the original work:
@misc{yang2025dcpodynamicclippingpolicy,
title={DCPO: Dynamic Clipping Policy Optimization},
author={Shihui Yang and Chengfeng Dou and Peidong Guo and Kai Lu and Qiang Ju and Fei Deng and Rihui Xin},
year={2025},
eprint={2509.02333},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.02333},
}

Happy hacking! If you run into any issues, open a GitHub Issue or start a discussion in the repository.






