[Feature] Trust region rejection sampling #1568
Motivation
Extend rejection sampling (RS) to support sequence-level mismatch-based rejection rules derived from tokenwise policy divergence proxies. SLIME already supports:
These RS mechanisms primarily control weight magnitude (i.e., they target high-variance outliers from extreme importance weights). In long-horizon settings, however, they may not be sufficient to detect harmful off-policy samples: per-token mismatches can cancel when aggregated, allowing problematic sequences to slip through.
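As a toy illustration of this cancellation effect (the numbers below are made up, not from any experiment):

```python
# Toy numbers showing why aggregate IS weights can hide mismatch:
# per-token log-ratios log(pi_theta / pi_rollout) that cancel in the sum.
log_ratios = [0.9, -0.9, 1.1, -1.1, 0.0, 0.8, -0.8]

seq_sum = sum(log_ratios)                  # exponent of the sequence-level IS weight
max_abs = max(abs(r) for r in log_ratios)  # worst tokenwise divergence

print(f"sum of log-ratios: {seq_sum:.2f}")  # 0.00 -> looks on-policy in aggregate
print(f"max |log-ratio|:   {max_abs:.2f}")  # 1.10 -> large local mismatch
```

A weight-magnitude rule sees an importance weight of exp(0.0) = 1 and keeps the sequence, even though individual tokens diverge strongly.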
Solution
Add trust-region RS criteria that reject entire sequences based on token-level KL mismatch aggregated over the trajectory:
Following Trust Region Masking for Long-Horizon LLM Reinforcement Learning, this PR uses:
Both criteria use an upper bound only (`rs_tr_threshold`). The max criterion replaces the existing veto mechanism as a more principled outlier detector.

Implementation
This PR extends the existing RS path with a trust-region option supporting max and mean criteria. The implementation closely follows the paper, but intentionally does not introduce a general “rule engine” for trust-region / RS composition.
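A minimal sketch of the two criteria, assuming per-token log-probability ratios are already available. The function name and signature are hypothetical (not the actual SLIME code); the k1/k2/k3 forms follow the standard KL-estimator notation:

```python
import math

def trust_region_reject(log_ratios, threshold, criterion="max", estimator="k2"):
    """Reject a whole sequence when the aggregated token-level KL proxy
    exceeds `threshold` (upper bound only). Sketch, not the SLIME implementation.

    log_ratios: per-token log(pi_theta / pi_rollout) over the loss tokens.
    """
    per_token = []
    for lr in log_ratios:
        if estimator == "k1":
            per_token.append(-lr)                  # unbiased, may be negative
        elif estimator == "k2":
            per_token.append(0.5 * lr * lr)        # low-variance, non-negative
        elif estimator == "k3":
            per_token.append(math.expm1(lr) - lr)  # unbiased, non-negative
        else:
            raise ValueError(f"unknown estimator: {estimator}")
    if criterion == "max":
        stat = max(per_token)
    elif criterion == "mean":
        stat = sum(per_token) / len(per_token)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return stat > threshold  # True => drop the sequence from the batch
```

For example, with `log_ratios = [0.0, 0.5, -0.5]` the max-k2 statistic is 0.125, so a threshold of 0.1 rejects the sequence.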
A more general alternative would provide a flexible rule framework that can combine:
alongside existing IS-weight-based rejection.
Alternative implementation (rule-based RS)
As a potential follow-up, a configurable rule framework could look like this:
--rs-rule (repeatable and composable)

Format:

--rs-rule "name=<str>;scope=<sequence|token>;stat=<k1|k2|k3>;reduce=<...>;low=<float?>;high=<float?>"
Required keys:
- name: unique identifier
- scope: must support sequence (token scope optional/future; if unsupported, error)
- stat: token statistic; must support k1, k2, k3
- reduce: token-to-scalar reduction; must support max, mean, sum, identity (required only if scope=token; optional otherwise)

In such a framework, the legacy RS modes naturally map to k1 with sum (sequence) or mean (geometric) reduction. This approach is inspired by this verl PR and docs.

The max (k2) and mean (k3) trust-region criteria implemented in this PR are likely the most important variants in practice. That said, a more flexible rule framework would make it easier to experiment with additional criteria and compositions as we iterate. I'm happy to forgo this PR and focus on such a flexible design instead; I opened this narrower change as a concrete baseline to kick off a discussion.
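Such a rule spec could be parsed and evaluated roughly as follows. This is a sketch of the proposed (not yet existing) framework; `parse_rs_rule` and `evaluate_rule` are hypothetical names:

```python
def parse_rs_rule(spec):
    """Parse a hypothetical --rs-rule value into a rule dict (sketch only)."""
    rule = dict(kv.split("=", 1) for kv in spec.split(";") if kv)
    if rule.get("scope") != "sequence":
        raise ValueError("only scope=sequence is sketched here")
    if rule.get("stat") not in {"k1", "k2", "k3"}:
        raise ValueError(f"unsupported stat: {rule.get('stat')}")
    if rule.get("reduce") not in {"max", "mean", "sum", "identity"}:
        raise ValueError(f"unsupported reduce: {rule.get('reduce')}")
    rule["low"] = float(rule["low"]) if "low" in rule else None
    rule["high"] = float(rule["high"]) if "high" in rule else None
    return rule

def evaluate_rule(rule, per_token_stats):
    """Return True if the sequence should be rejected under this rule."""
    # identity is token-scope only, so it is omitted from the sequence reducers.
    reducers = {"max": max, "mean": lambda xs: sum(xs) / len(xs), "sum": sum}
    value = reducers[rule["reduce"]](per_token_stats)
    if rule["high"] is not None and value > rule["high"]:
        return True
    if rule["low"] is not None and value < rule["low"]:
        return True
    return False
```

Under this mapping, `--rs-rule "name=tr_max;scope=sequence;stat=k2;reduce=max;high=0.3"` would reproduce the max trust-region criterion from this PR.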
TODO