---
title: "Inference-Time Scaling for Generalist Reward Modeling"
paper_authors: "Zijun Liu et al."
orgs: "DeepSeek-AI; Dept. of Computer Science and Technology, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University"
paper_link: "https://arxiv.org/abs/2504.02495"
tags:
 - LLMs
 - reward-modeling
 - reinforcement-learning
 - inference-time-compute
 - test-time-compute
potm_year: 2025
potm_month: 4
paper_order: 2
image_dir: "/assets/images/posts/2025-04/potm/inference-scaling-grm/"
review_author:
 name: "Arianna Saracino"
 link: "https://www.linkedin.com/in/arianna-saracino-038889a8/"
hidden: true
---

### The key idea

Recent studies have highlighted the critical role of reward models (RMs) in reinforcement learning (RL) post-training: they provide the high-quality reward signals that help Large Language Models (LLMs) perform well in domains where correctness can be automatically verified, such as coding and mathematics.

However, generating reliable rewards becomes far more challenging in less structured or open-ended domains where answers cannot be automatically verified. At the same time, there is growing interest in making reward quality scale with the available inference-time compute, so that rewards improve as more sampling or computational resources are used.

This paper addresses both challenges by introducing **Self-Principled Critique Tuning (SPCT)**, a novel learning method that enables **Generalist Reward Models (GRMs)** to generate adaptive, high-quality rewards and effectively leverage increased **inference-time compute**.

This approach is implemented in **DeepSeek-GRM-27B**, a Gemma-2-27B-based model post-trained with SPCT and enhanced with a secondary Meta Reward Model (MetaRM) to further improve inference-time scaling performance, as shown in Figure 1.


<img class="constrained_img" src="{{ page.image_dir | append: 'grm-inference-time-scaling-perf.png' | relative_url }}" alt="Inference-time scaling performance tested on RM Bench shows DeepSeek-GRM-27B outperforming strong public models.">
<figcaption><strong>Figure 1.</strong> Inference-time scaling performance with different RMs on all RM benchmarks. Results are shown with up to 8 samples for each method, and are further scaled to 32 samples for ours. Non-italic font indicates models based on Gemma-2-27B.</figcaption>


### Their method

The authors adopt a **pointwise generative** reward modeling paradigm. Pointwise scoring assigns individual rewards to each response, enabling flexibility across diverse input formats, while the generative approach produces textual judgements or *critiques* from which reward scores are derived.
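
To make this concrete, below is a minimal sketch of how a scalar, pointwise reward could be parsed out of a generated critique. The scoring format (one "Response i: x/10" line per candidate) is an illustrative assumption rather than the paper's actual output template.

```python
import re

def parse_pointwise_rewards(critique: str, num_responses: int) -> list[int]:
    """Extract one integer score per candidate response from a generated critique.

    Assumes (hypothetically) that the critique ends with lines such as
    'Response 1: 7/10'; the paper's actual format may differ.
    """
    scores = []
    for i in range(1, num_responses + 1):
        match = re.search(rf"Response\s+{i}\s*:\s*(\d+)\s*/\s*10", critique)
        scores.append(int(match.group(1)) if match else 0)
    return scores
```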

To enhance performance, they apply **sampling-based** aggregation, generating multiple reward sets per query and combining them to produce a final score.
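
A sketch of that aggregation step, under the simple assumption that each sample yields one pointwise score per candidate response and the final reward is the per-response sum:

```python
def aggregate_rewards(sampled_rewards: list[list[int]]) -> list[int]:
    """Combine k sampled reward sets (one inner list per sample, one score per
    response) into a final per-response score by summing across samples."""
    return [sum(scores) for scores in zip(*sampled_rewards)]

# Example: 3 sampled critiques scoring 2 candidate responses.
# aggregate_rewards([[7, 4], [8, 3], [6, 5]]) == [21, 12]
```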

This setup lays the foundation for their core innovation, the **Self-Principled Critique Tuning (SPCT)** method, which further improves reward quality and scalability. As suggested by previous studies, the authors incorporate "*principles*" generated by the GRM to guide the reward model; crucially, they treat these not as a pre-processing step but as part of the reward generation itself. The GRM generates *principles* based on the input query and answers, and then produces *critiques* and assigns rewards according to these generated *principles*. This adaptive approach allows the reward generation process to align with different input contexts and nuances.
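
A minimal sketch of that single rollout, with principles generated first and the critique conditioned on them. Here `generate` stands in for any LLM call, the prompts are illustrative rather than the paper's actual templates, and the hypothetical `parse_pointwise_rewards` helper from above is reused.

```python
def grm_rollout(generate, query: str, responses: list[str]) -> list[int]:
    """One GRM pass: generate principles, then critique and score each response."""
    numbered = "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    principles = generate(
        f"Query:\n{query}\n\nResponses:\n{numbered}\n\n"
        "Write evaluation principles tailored to this query."
    )
    critique = generate(
        f"Principles:\n{principles}\n\nQuery:\n{query}\n\nResponses:\n{numbered}\n\n"
        "Critique each response against the principles and end with one line per "
        "response in the form 'Response i: x/10'."
    )
    return parse_pointwise_rewards(critique, len(responses))
```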

As shown in Figure 3, SPCT begins with rejective fine-tuning to train the model on properly formatted principles and critiques, followed by rule-based online RL (via GRPO) to refine output quality and improve the model’s ability to distinguish between high- and low-quality responses.

<img src="{{ page.image_dir | append: 'SPCT.png' | relative_url }}" alt="Illustration of SPCT including cold-start rejective fine-tuning, rule-based RL and corresponding scalable behaviours during inference.">
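
A sketch of one plausible form of the rule-based reward used in the RL stage: a sampled (principles, critique) rollout is rewarded when its pointwise scores single out the ground-truth best response, and penalised otherwise. The +1/-1 values and the strict uniqueness check are assumptions for illustration; a similar correctness check could also serve as the filter in the rejective fine-tuning stage.

```python
def rule_based_reward(pred_scores: list[int], best_index: int) -> float:
    """Outcome reward for one sampled (principles, critique) rollout during RL.

    Returns +1.0 if the rollout's scores give the ground-truth best response the
    strictly highest score, else -1.0 (reward values are illustrative assumptions).
    """
    top = max(pred_scores)
    picked_best = pred_scores[best_index] == top and pred_scores.count(top) == 1
    return 1.0 if picked_best else -1.0
```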

To scale reward quality at inference time, the authors use sampling-based strategies: the model generates multiple reward samples per query, assigns pointwise scores, and aggregates them (typically by summing) to obtain a robust final reward. This approach leverages diverse judgements to approximate a consensus, reducing the bias of any single sample. Finally, a **Meta Reward Model (MetaRM)** filters the sampled rewards, selecting only the highest-quality critiques for aggregation, which further improves reliability.
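
A sketch of MetaRM-guided voting under these assumptions: each of the k sampled critiques has already been scored by the meta RM, only the top-rated subset is kept, and the cut-off `k_keep` is a hypothetical parameter.

```python
def metarm_guided_vote(sampled_rewards: list[list[int]],
                       meta_scores: list[float],
                       k_keep: int) -> list[int]:
    """Keep the k_keep critiques rated highest by the meta RM, then sum their
    pointwise scores per response to obtain the final rewards."""
    ranked = sorted(range(len(sampled_rewards)),
                    key=lambda i: meta_scores[i], reverse=True)
    kept = [sampled_rewards[i] for i in ranked[:k_keep]]
    return [sum(scores) for scores in zip(*kept)]
```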

### Results - RM benchmarks

Table 2 shows that the post-trained DeepSeek-GRM-27B outperforms the baseline methods (reimplemented by the authors) and matches or exceeds the performance of leading models like GPT-4o and Nemotron-4-340B-Reward.

<img src="{{ page.image_dir | append: 'RMBench.png' | relative_url }}" alt="Overall results on RM Benchmarks.">

### Results - inference-time scalability

Table 3 and Figure 1 demonstrate that, with inference-time scaling (using 32-sample voting), the model achieves the best overall performance, which improves further when combined with MetaRM-guided voting.

<img class="constrained_img" src="{{ page.image_dir | append: 'inference-time-scalability.png' | relative_url }}" alt="Inference-time scalability results of different methods on RM benchmarks.">
<figcaption><strong>Table 3.</strong> Inference-time scalability results of different methods on RM benchmarks. Settings are the same as Table 2.</figcaption>


### Results - scaling inference vs training costs

Figure 4 compares the benefits of inference-time scaling versus model size scaling. Remarkably, the 27B-parameter DeepSeek-GRM, when paired with 32-sample voting, reaches performance comparable to or better than that of much larger models, even the 671B MoE model.

<img src="{{ page.image_dir | append: 'inference-vs-training.png' | relative_url }}" alt="Inference-time scaling vs training-time scaling performance on Reward Bench benchmark.">

### Takeaways

This paper marks an important step toward building a true Generalist Reward Model (GRM), introducing the SPCT learning method to generate high-quality, adaptive rewards across diverse tasks. While the results are promising, the authors acknowledge that challenges remain, particularly in tasks with highly subjective reward criteria or those requiring external knowledge.

The paper also demonstrates the strong potential of inference-time scaling, showing that smarter use of compute can deliver major performance gains: a promising direction for future research on efficient, scalable reward systems.