
DCPO: Dynamic Clipping Policy Optimization

📝 Paper@arXiv | 🤗 HuggingFace | 🐱 GitHub


1. Paper of this work

All experiments from the paper “DCPO: Dynamic Clipping Policy Optimization” can be reproduced with this repo.

DCPO is a reinforcement learning from verifiable rewards (RLVR) framework that dramatically improves data utilization and training speed for large language models (LLMs) on reasoning‑heavy tasks.


2. Key Contributions

| Feature | What it does |
| --- | --- |
| Dynamic Adaptive Clipping | Computes a closed‑form clipping bound that depends on the old probability, reducing the token‑clipping ratio by ~10× compared with fixed clipping. |
| Smooth Advantage Standardization | Standardizes rewards by mixing the current‑step statistics with the cumulative statistics, removing zero‑gradient "dead zones" and increasing non‑zero‑gradient usage by ≈ 28 %. |
| OTM loss | Calculates the loss over the tokens of a single response without batch‑level averaging, preserving the relative advantage between responses. |
| Broad‑scale validation | Tested on MATH‑500, AMC‑23, AIME‑24, and AIME‑25 with model sizes from 1.5 B to 14 B. DCPO‑7B reaches 38.8 Avg@32 on AIME‑24 (↑ 21 % over GRPO) while halving wall‑clock GPU hours versus DAPO. |

3. Preliminary

GRPO

$$ \mathcal{T}_{\text{GRPO}}(\theta) = \frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\left( r_{i,t}\left(\theta\right)\hat{A}_{i,t}, \text{clip}(r_{i,t}\left(\theta\right),1-\epsilon,1+\epsilon)\hat{A}_{i,t} \right) - \beta \mathbb{D}_{\text{KL}}(\pi_\theta | \pi_{\text{ref}}) $$

GRPO first samples $G$ responses for each query, assigns each a reward $R$ through a rule-based reward function, and estimates the token-level advantages.
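As a concrete reference, here is a minimal NumPy sketch of this token-level clipped surrogate (a sketch only; the names `grpo_surrogate`, `ratios`, and `advantages` are illustrative and not this repo's API, and the KL penalty term is omitted):

```python
import numpy as np

def grpo_surrogate(ratios, advantages, eps=0.2):
    """Token-level clipped surrogate for one group of G responses.

    ratios[i][t]     = pi_theta(o_{i,t}) / pi_theta_old(o_{i,t})
    advantages[i][t] = standardized advantage of token t in response i
    """
    per_response = []
    for r, a in zip(ratios, advantages):
        r, a = np.asarray(r), np.asarray(a)
        unclipped = r * a
        clipped = np.clip(r, 1.0 - eps, 1.0 + eps) * a
        per_response.append(np.minimum(unclipped, clipped).mean())  # mean over |o_i| tokens
    return float(np.mean(per_response))  # mean over the G responses; KL penalty omitted

# toy example: G = 2 responses with 3 and 2 tokens
print(grpo_surrogate([[1.05, 0.7, 1.3], [0.95, 1.1]],
                     [[0.8, 0.8, 0.8], [-1.2, -1.2]]))
```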

DAPO

$$ \begin{aligned} \mathcal{T}_{\mathrm{DAPO}}\left(\theta\right)&=\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\ \mathrm{clip}\left(r_{i,t}(\theta),1-\epsilon_{\mathrm{low}},1+\epsilon_{\mathrm{high}}\right)\hat{A}_{i,t}\right)\\ &\quad\text{where } 0 <\left|\{o_i\mid \mathrm{is\_equivalent}(a,o_i)\}\right|<G \end{aligned} $$

The constraint $0 <\left|\{o_i\mid \mathrm{is\_equivalent}(a,o_i)\}\right|<G$ means that DAPO discards all responses of a prompt when they receive the same reward and regenerates responses to keep the batch size unchanged.

GSPO

$$ \begin{aligned} \mathcal{T}_{\mathrm{GSPO}}\left(\theta\right)=&\frac{1}{G}\sum_{i=1}^G\min\left( s_{i}\left(\theta\right)\hat{A}_{i},\ \text{clip}(s_{i}\left(\theta\right),1-\epsilon,1+\epsilon)\hat{A}_{i} \right) \\ &\text{where } s_i(\theta) = \left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}\right)^{\frac{1}{|o_i|}}= \exp\left(\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\log\frac{\pi_\theta(o_{i,t}\mid q)}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q)}\right) \end{aligned} $$

GSPO replaces token-level clipping with sequence-level clipping and then discards high-variance responses even when their advantage is nonzero, while keeping tokens with a high token-level probability ratio. This can lead to training instability and wastes many informative tokens contained in the high-variance responses.

Prior Advantage Calculation

$$ \hat{A}_{j,t}^i = \frac{\left(R^i_j-\mu^i\right)}{\sigma^i} $$

In previous works, including GRPO and DAPO, the advantage $\hat{A}_{j,t}^i$ for token $t$ of response $j$ is calculated by standardizing the reward $R^i_j$ against the mean $\mu^i$ and standard deviation $\sigma^i$ of the rewards of the $G$ responses generated at the $i$-th step. When the rewards for the same prompt are all identical, every response gets zero advantage and does not contribute to the model update, so those responses are wasted.
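A minimal sketch of this group-wise standardization (illustrative code, not the repo's implementation) makes the dead zone concrete: when all $G$ rewards are equal, every advantage is zero and the prompt contributes no gradient.

```python
import numpy as np

def group_standardize(rewards, eps=1e-6):
    """Standardize the G rewards of one prompt at step i (GRPO/DAPO style)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_standardize([1.0, 0.0, 1.0, 0.0]))  # mixed rewards -> non-zero advantages
print(group_standardize([1.0, 1.0, 1.0, 1.0]))  # identical rewards -> all zeros, prompt is wasted
```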

4. Details of major innovations

4.1 Dynamic Adaptive Clipping (DAC)

In importance sampling, the expected value of a function $f(x)$ under the new distribution $p(x)$ can be rewritten as an expectation under the old distribution $q(x)$ via the importance weight $\frac{p(x)}{q(x)}$. Although this estimator is unbiased, its variance can be significantly inflated, which is a common challenge in importance sampling.

$$ \begin{aligned} {Var}_{x \sim q}\left[f(x)\frac{p(x)}{q(x)}\right]- {Var}_{x \sim p}\left[f(x)\right] =&\mathbb{E}_{x \sim p}\left[f(x)^2(\frac{p(x)}{q(x)}-1)\right] =\int f(x)^2(\frac{p(x)}{q(x)}-1)p(x)\mathrm{d}x \end{aligned} $$

Previous works (e.g., PPO, GRPO) usually set a fixed bound $\epsilon$ on $|\frac{p(x)}{q(x)}-1|$ to limit this variance bias. This does not account for the different probabilities of different tokens, so tokens with smaller probabilities are left a smaller effective absolute range. That is unreasonable, because the less confident the model is about a token (the lower its probability), the more valuable the information it can provide to the model. We therefore constrain the probability ratio $r(x)$ through a dynamic-adaptive mechanism that includes the probability in the restriction, $|(\frac{p(x)}{q(x)}-1)p(x)|\le \epsilon$. This yields dynamic-adaptive clipping bounds that adjust the range of $r(x)$ according to the old probability.

$$ \begin{aligned} 0.5+\frac{1}{2}\sqrt{\max\left(1-\frac{4\epsilon_{low}}{q\left(x\right)},\ 0\right)}\leq&r\left(x\right) \leq 0.5+\frac{1}{2}\sqrt{1+\frac{4\epsilon_{high}}{q\left(x\right)}} \end{aligned} $$

| clipping method | clip thresholds | q(x) | low p(x) | high p(x) |
| --- | --- | --- | --- | --- |
| symmetric fixed bound (GRPO) | $\epsilon=0.2$ | 0.9 | 0.72 | min(1.08, 1) |
| asymmetric fixed bound (DAPO) | $\epsilon_{low}=0.2,\epsilon_{high}=0.28$ | 0.9 | 0.72 | min(1.152, 1) |
| dynamic-adaptive bound (ours) | $\epsilon_{low}=0.16,\epsilon_{high}=0.2$ | 0.9 | 0.69 | min(1.06, 1) |
| symmetric fixed bound (GRPO) | $\epsilon=0.2$ | 0.01 | 0.008 | 0.012 |
| asymmetric fixed bound (DAPO) | $\epsilon_{low}=0.2,\epsilon_{high}=0.28$ | 0.01 | 0.008 | 0.0128 |
| dynamic-adaptive bound (ours) | $\epsilon_{low}=0.16,\epsilon_{high}=0.2$ | 0.01 | 0.005 | 0.05 |

As the table shows, when the old probability $q(x)$ is small, the highest $p(x)$ that our dynamic-adaptive clipping still allows into the model update is much greater than under the fixed methods (whether symmetric or asymmetric).
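The bounds above can be computed per token from the old probability alone. Below is a minimal sketch (the helper name `dac_bounds` is illustrative, not the repo's exact implementation) that reproduces the numbers in the table:

```python
import numpy as np

def dac_bounds(q, eps_low=0.16, eps_high=0.2):
    """Dynamic-adaptive bounds on the ratio r(x) = p(x)/q(x),
    derived from |(p/q - 1) * p| <= eps with p = r * q."""
    r_low = 0.5 + 0.5 * np.sqrt(max(1.0 - 4.0 * eps_low / q, 0.0))
    r_high = 0.5 + 0.5 * np.sqrt(1.0 + 4.0 * eps_high / q)
    return r_low, r_high

for q in (0.9, 0.01):
    r_low, r_high = dac_bounds(q)
    p_low, p_high = r_low * q, min(r_high * q, 1.0)  # p(x) is a probability, so cap at 1
    print(f"q(x)={q}: allowed p(x) in [{p_low:.3f}, {p_high:.3f}]")
# q(x)=0.9 : allowed p(x) in [0.692, 1.000]
# q(x)=0.01: allowed p(x) in [0.005, 0.050]
```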

4.2 Smooth Advantage Standardization (SAS)

Previous works calculate the advantage using only the current-step rewards of the generated responses. This approach can lead to several issues:

  1. When randomness in response sampling causes all rewards to be the same at a given step, the advantage becomes zero, preventing the prompt from contributing to the parameter update despite potentially valuable differences in the reasoning trajectories.
  2. Randomness in high-entropy sampling can yield highly skewed label counts, causing large fluctuations in the standardized advantage values across steps, even reversing their sign, which destabilizes training.

We therefore also take the cumulative rewards collected for the same prompt into account when calculating the advantage:

$$ \begin{aligned} \hat{A}_{total,j}^i=\frac{\left(R^i_j-\mu_{total}^i\right)}{\sigma_{total}^i} \\ \end{aligned} $$

To mitigate fluctuations between the step-specific standardization $\hat{A}^i_{new,j}$ and the cumulative standardization $\hat{A}^i_{total,j}$, we introduce two smoothing functions, $\hat{SA}^i_{new,j}$ and $\hat{SA}^i_{total,j}$, which are weighted averages of the two standardizations with weights that change over the step index $i$:

$$ \hat{SA}^i_{new,j} = \frac{i-1}{i}\hat{A}_{new,j}^i + \frac{1}{i}\hat{A}_{total,j}^i,\ \hat{SA}^i_{total,j} = \frac{1}{i}\hat{A}_{new,j}^i + \frac{i-1}{i}\hat{A}_{total,j}^i $$

To reduce the impact of fluctuations in either the cumulative or the current-step standardization on training stability, the final advantage $\hat{A}^i_j$ is defined as the smoothed advantage with the smaller absolute value:

$$ \hat{A}^i_j=\begin{cases} \hat{SA}^i_{new,j} , & \text{when} \ |\hat{SA}^i_{new,j}| < |\hat{SA}^i_{total,j}|\\ \hat{SA}^i_{total,j} , & \text{otherwise} \end{cases} $$

Once a prompt has participated in model optimization, its responses continue to participate in model updates at later steps. When the rewards at the current step are all the same, the responses still participate with advantage $\frac{1}{i}\hat{A}_{total,j}^i$.
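A minimal sketch of the smoothing rule (illustrative code with assumed names such as `smooth_advantage`, not the repo's implementation): it mixes the current-step and cumulative standardizations and keeps the smoothed value with the smaller magnitude.

```python
import numpy as np

def smooth_advantage(r, step_rewards, all_rewards, step_i, eps=1e-6):
    """Smooth Advantage Standardization for one response with reward r.

    step_rewards: rewards of the G responses generated for this prompt at step i
    all_rewards:  all rewards observed so far for this prompt (cumulative, incl. step i)
    step_i:       1-based count of how many times this prompt has been sampled
    """
    a_new = (r - np.mean(step_rewards)) / (np.std(step_rewards) + eps)
    a_total = (r - np.mean(all_rewards)) / (np.std(all_rewards) + eps)
    sa_new = (step_i - 1) / step_i * a_new + 1.0 / step_i * a_total
    sa_total = 1.0 / step_i * a_new + (step_i - 1) / step_i * a_total
    # keep the smoothed advantage with the smaller absolute value
    return sa_new if abs(sa_new) < abs(sa_total) else sa_total

# identical rewards at the current step: a_new = 0, yet the response still
# contributes with (1/i) * a_total instead of being discarded
print(smooth_advantage(1.0, [1.0, 1.0, 1.0, 1.0],
                       [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0], step_i=2))
```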

5. Code base & Docker

  • Code base – DCPO extends the open‑source verl codebase (https://github.com/volcengine/verl). *Mainly the loss formulation and the dynamic adaptive clipping / smooth advantage standardization modules are added.*
  • Docker image – The training environment used in the paper is published as a Docker image:
docker pull verlai/verl:app-verl0.4-sglang0.4.6.post5-vllm0.8.5-mcore0.12.2

6. Example of a sample in the data parquet

[
    {
        "data_source": "qwen_aime_2024", # start with qwen
        "prompt":
        [
            {
                "content": "Please reason step by step, and put your final answer within \\boxed{}.",
                "role": "system"
            }, # the qwen template
            {
                "content": "There exist real numbers $x$ and $y$, both greater than 1, such that $\\log_x\\left(y^x\\right)=\\log_y\\left(x^{4y}\\right)=10$. Find $xy$.", # question
                "role": "user"
            }
        ],
        "ability": "MATH",
        "reward_model":
        {
            "ground_truth": "25", # the label for the question 
            "style": "rule-lighteval/MATH_v2" # option
        },
        "extra_info":
        {
            "index": 24, # must have and be different from each other 
            "raw_problem": "There exist real numbers $x$ and $y$, both greater than 1, such that $\\log_x\\left(y^x\\right)=\\log_y\\left(x^{4y}\\right)=10$. Find $xy$.",
            "split": null
        }
    }
]
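To produce a parquet file in this layout, a short pandas snippet like the following can be used (a sketch under assumptions: the output path `my_dataset.parquet` is illustrative and the `pyarrow` engine is assumed to be installed; it is not the repo's own data tooling):

```python
import pandas as pd

question = (
    "There exist real numbers $x$ and $y$, both greater than 1, such that "
    "$\\log_x\\left(y^x\\right)=\\log_y\\left(x^{4y}\\right)=10$. Find $xy$."
)

sample = {
    "data_source": "qwen_aime_2024",   # should start with "qwen" to select the Qwen template
    "prompt": [
        {"role": "system",
         "content": "Please reason step by step, and put your final answer within \\boxed{}."},
        {"role": "user", "content": question},
    ],
    "ability": "MATH",
    "reward_model": {"ground_truth": "25", "style": "rule-lighteval/MATH_v2"},
    "extra_info": {"index": 24, "raw_problem": question, "split": None},
}

# one row per sample; nested lists/dicts are stored as parquet list/struct columns
pd.DataFrame([sample]).to_parquet("my_dataset.parquet")
```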

7. Quick Start

1️⃣ Clone the repository

git clone https://github.com/lime-RL/DCPO.git
cd DCPO
pip install -r requirements.txt

requirements.txt includes the additional Python packages.

2️⃣ Download a pre-trained checkpoint & data

# checkpoint (e.g. Qwen2.5‑Math‑7B) will be fetched from HuggingFace automatically
# convert your data to the template above and set the data paths in the *.sh scripts
bash ./recipe/dcpo/run_dcpo.sh   # DCPO
bash ./recipe/dcpo/run_grpo.sh   # GRPO baseline
bash ./recipe/dcpo/run_dapo.sh   # DAPO baseline
bash ./recipe/dcpo/run_gspo.sh   # GSPO ablation baseline

Each script sets the corresponding hyper-parameters and starts training (8 GPUs on a single machine by default; multi-machine setups are recognized automatically).



8. Experimental Results

  • Avg@1: This metric represents the standard accuracy achieved using greedy decoding. It measures the performance of the model's single best prediction.
  • Avg@32: This metric calculates the average accuracy over 32 sampled responses per problem, using a temperature of 1.0 and top_p of 1.0. It provides insight into the robustness and stability of the trained policy distribution (a minimal computation sketch follows below).
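For reference, here is a minimal sketch of how such a metric can be computed from per-problem correctness labels (the helper `avg_at_k` is illustrative, not the repo's evaluation code):

```python
def avg_at_k(correct, k=32):
    """correct[p][s] = 1 if sampled response s for problem p is correct, else 0.
    Avg@1 corresponds to a single greedy response per problem (k = 1)."""
    per_problem = [sum(samples[:k]) / k for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)

# two problems, 4 samples each (k=4 for brevity)
print(avg_at_k([[1, 0, 1, 1], [0, 0, 1, 0]], k=4))  # -> 50.0
```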

1. Qwen2.5‑Math‑1.5B-Instruct

In the tables below, entries with two numbers are reported as Avg@1/Avg@32; MATH‑500 reports a single value.

| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
| --- | --- | --- | --- | --- | --- |
| base | 73.6 | 57.5/49.4 | 10.0/10.0 | 3.3/6.1 | 36.1/21.8 |
| GRPO | 77.2 | 70.0/68.4 | 16.7/14.0 | 20.0/13.5 | 46.0/32.0 |
| DAPO | 76.0 | 80.0/70.6 | 20.0/13.5 | 14.4/12.5 | 46.5/32.4 |
| DCPO | 77.2 | 75.0/70.8 | 20.0/15.6 | 16.7/12.1 | 47.2/32.8 |
| Δ GRPO | +0.0 | +5.0/+2.4 | +3.3/+1.6 | ‑3.3/‑1.4 | +1.2/+0.8 |
| Δ DAPO | +1.2 | ‑5.0/+0.2 | +0.0/+2.1 | +0.0/+2.1 | +0.7/+0.4 |

2. Qwen2.5‑3B (Common base)

| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
| --- | --- | --- | --- | --- | --- |
| base | 46.4 | 27.5/7.8 | 3.3/0.1 | 3.3/0.7 | 20.1/2.9 |
| GRPO | 69.2 | 62.5/51.6 | 10.0/7.5 | 6.7/4.5 | 36.3/21.0 |
| DAPO | 72.4 | 57.5/54.0 | 10.0/8.3 | 6.7/3.9 | 36.6/23.1 |
| DCPO | 71.2 | 62.5/55.8 | 3.3/7.5 | 10.0/4.7 | 37.6/22.7 |
| Δ GRPO | +2.0 | +0.0/+4.2 | ‑6.7/+0.0 | +3.3/+0.2 | +1.3/+1.7 |
| Δ DAPO | ‑1.2 | +5.0/+1.8 | ‑6.7/‑0.8 | +3.3/+0.8 | +1.0/‑0.4 |

3. Qwen2.5‑Math‑7B (Math base)

| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
| --- | --- | --- | --- | --- | --- |
| base | 50.4 | 40.0/19.5 | 13.3/6.0 | 3.3/1.5 | 28.4/9.3 |
| GRPO | 81.6 | 77.5/75.9 | 36.7/32.1 | 16.7/16.7 | 53.1/41.6 |
| DAPO | 83.0 | 72.5/80.7 | 36.7/31.6 | 23.3/14.9 | 53.9/42.4 |
| GSPO | 84.0 | 80.0/78.8 | 40.0/34.9 | 16.7/16.2 | 55.2/43.3 |
| DCPO | 82.5 | 82.6/79.8 | 46.7/38.8 | 16.7/17.2 | 57.1/45.2 |
| Δ GRPO | +0.9 | +5.1/+4.9 | +10.0/+6.7 | +0.0/+0.5 | +4.0/+3.6 |
| Δ DAPO | ‑0.5 | +10.1/‑0.9 | +10.0/+7.2 | ‑6.6/+2.3 | +3.3/+2.8 |
| Δ GSPO | ‑1.5 | +2.6/+1.0 | +6.7/+3.9 | +0.0/+1.0 | +1.9/+1.9 |

4. Qwen2.5‑14B (Common base)

| Model | MATH‑500 | AMC‑23 | AIME‑24 | AIME‑25 | Average |
| --- | --- | --- | --- | --- | --- |
| Qwen‑Math‑14B (Common base) | 60.8 | 47.5/16.4 | 3.3/1.3 | 3.3/1.1 | 28.7/6.3 |
| GRPO | 81.2 | 75.0/65.6 | 13.3/17.6 | 13.3/10.5 | 45.7/31.3 |
| DAPO | 83.4 | 87.5/85.1 | 16.7/16.4 | 20.0/15.3 | 51.9/38.9 |
| GSPO | 78.6 | 77.5/75.0 | 23.3/16.0 | 16.7/9.9 | 49.0/33.5 |
| DCPO | 84.6 | 85.0/79.9 | 20.0/18.2 | 23.3/19.0 | 53.2/39.0 |
| Δ GRPO | +3.4 | +10.0/+14.3 | +6.7/+0.6 | +10.0/+8.5 | +6.5/+7.7 |
| Δ DAPO | +1.2 | ‑2.5/‑5.2 | +3.3/+1.8 | +3.3/+3.7 | +1.3/+0.1 |
| Δ GSPO | +6.0 | +7.5/+4.9 | -3.3/+2.2 | +6.6/+10.1 | +4.2/+5.5 |

Take‑away

  • Response Utilization Ratio (RUR) reaches ≈ 70 % for DCPO (vs. ≈ 44 % for GRPO).
  • Token‑Clipping Ratio (TCR) is roughly 10× lower than with GRPO/DAPO.
  • Training wall‑clock time is roughly half that of DAPO for the same number of update steps.

8.1 Token Clipping Ratio (TCR)

$$ \text{TCR} = \frac{1}{N}\sum_{m=1}^{N}\frac{\text{number of clipped tokens in micro-batch } m}{\text{total number of tokens in micro-batch } m} $$
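As an illustration of how TCR can be tracked during training (a sketch only, assuming a per-token clip mask rather than the repo's actual logging code):

```python
import numpy as np

def token_clipping_ratio(micro_batches):
    """micro_batches: list of (ratios, low, high) per micro-batch, where `ratios`
    holds the token importance ratios and `low`/`high` the per-token clip bounds
    (fixed for GRPO/DAPO, dynamic for DCPO)."""
    per_micro = []
    for ratios, low, high in micro_batches:
        ratios = np.asarray(ratios)
        clipped = (ratios < np.asarray(low)) | (ratios > np.asarray(high))
        per_micro.append(clipped.mean())   # fraction of clipped tokens in this micro-batch
    return float(np.mean(per_micro))       # average over the N micro-batches

# toy example: two micro-batches with fixed bounds [0.8, 1.2]
print(token_clipping_ratio([([0.9, 1.5, 1.0], 0.8, 1.2),
                            ([0.7, 1.1], 0.8, 1.2)]))   # -> (1/3 + 1/2) / 2 ≈ 0.417
```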

  • GRPO: for smaller models (1.5B and 3B), the TCR increases with training steps, while for larger models (7B and 14B) it gradually decreases.
  • DAPO: the TCR shows an upward trajectory at all model scales, indicating that DAPO uses a growing proportion of partial or truncated responses to update the model.
  • GSPO: the TCR based on Qwen2.5-Math-7B exceeds 11 % and that based on Qwen2.5-14B exceeds 15 %, much higher than the token-level clipping methods, so most responses are wasted during training. Although GSPO keeps the sequence-level variance bias small, it retains some tokens with high token-level importance ratios, and these tokens may increase training instability.
  • DCPO: unlike both GRPO and DAPO, the TCR for DCPO remains relatively constant and is an order of magnitude lower than theirs. DCPO uses more high-entropy tokens, which are more informative, while discarding tokens with excessively abnormal importance weights, freeing up more reasonable space for model exploration.

Taking Qwen2.5-Math-7B as an example, we observe that after about 60 steps roughly 95 % of generated tokens have a probability above 0.9, after about 100 steps more than 97 % do, and the share keeps increasing in later stages of training. This indicates that the model generates most tokens with high confidence and only a few with high entropy, so most of the token-level clipping under fixed bounds falls on exactly those few low-probability, high-entropy tokens that the dynamic-adaptive bounds are designed to keep.

8.2 Response Utilization Ratio (RUR)

| model | GRPO | GSPO | DCPO |
| --- | --- | --- | --- |
| Qwen2.5-Math-1.5B-Instruct | 45.6% | - | 67.1% |
| Qwen2.5-3B | 48.3% | - | 74.3% |
| Qwen2.5-Math-7B | 37.4% | 43.5% | 73.2% |
| Qwen2.5-14B | 43.9% | 47.6% | 72.4% |
| Average | 43.8% | 45.6% | 71.8% |

Since the average RUR of GRPO and GSPO is below 50 %, both waste more than half of the generated responses under current-step standardization, whereas our method DCPO keeps the RUR around 70 % after the first epoch and it keeps increasing slowly in the subsequent training steps.

8.3 Ablation Result


We conducted an ablation study on Qwen2.5-Math-7B to assess the contribution of each component in DCPO, using Avg@32 as the evaluation metric. This metric highlights the robustness and stability of the learned policy distribution. To ensure fairness, each experiment modifies a single component of the baseline GRPO framework while keeping all other settings identical and removing the KL divergence term to align with DAPO, GSPO, and the full DCPO.

Each component of DCPO contributes positively to overall performance, and their combination leads to substantial cumulative gains. The results validate the effectiveness of the proposed mechanisms in improving data efficiency and stability in reinforcement learning for LLMs.

9. License

The source code is released under the Apache‑2.0 license (the same license as the underlying verl code). Pre‑trained Qwen‑Math checkpoints and other models are provided under their original licenses – please refer to the model cards on HuggingFace for details.


10. Citation

If you use DCPO in your research, please cite the original work:

@misc{yang2025dcpodynamicclippingpolicy,
      title={DCPO: Dynamic Clipping Policy Optimization}, 
      author={Shihui Yang and Chengfeng Dou and Peidong Guo and Kai Lu and Qiang Ju and Fei Deng and Rihui Xin},
      year={2025},
      eprint={2509.02333},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.02333}, 
}

Happy hacking! If you run into any issues, open a GitHub Issue or start a discussion in the repository.
