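LLMs / RL / SAR: Translation and Commentary on "Self-Aligned Reward: Towards Effective and Efficient Reasoners"

Overview
The Self-Aligned Reward (SAR) proposed in this paper compares the perplexity of an answer with and without the question as context, which yields a fine-grained, internal signal of answer quality and relevance. Used together with verifiable rewards in RL (PPO/GRPO), it substantially reduces verbose reasoning and inference cost while maintaining or improving accuracy, offering a practical engineering path toward LLM reasoners that are both efficient and reliable.

Background and pain points
● Verifiable reward signals are coarse: the "verifiable rewards" currently used to train reasoning ability (for example, binary feedback on whether the final answer is correct) cannot distinguish between answers that are equally correct but differ in quality (concise and correct vs. verbose and correct). This leads to overthinking and verbose explanations, and makes training inefficient.
● Efficiency and correctness are hard to achieve together: a simple length penalty shortens outputs but often suppresses necessary intermediate reasoning steps and so hurts accuracy, while external reward models are expensive and prone to reward hacking.
● No fine-grained self-supervised signal: what is needed is an internal, continuous signal that reflects both relevance to the question and necessity of the reasoning, so that training receives fine-grained guidance without relying on large amounts of external annotation or on external models.
● Memorized, no-reasoning short answers are hard to identify: a short, correct "memorized" answer contains no reasoning process and may be mistaken for the ideal output. A mechanism is needed to separate "concise answers backed by reasoning" from "concise outputs that merely recall the answer".

Proposed solution (method overview)
● Self-Aligned Reward (SAR): the paper compares the perplexity of the answer conditioned on the question (conditioned perplexity) with the perplexity of the answer alone, without the question (standalone perplexity). The relative difference gives a continuous intrinsic reward that expresses how strongly the answer depends on the question. A high value means the answer is tightly tied to the query, which usually corresponds to a concise and relevant response.
● Joint use with verifiable rewards: SAR supplements the verifiable (final-answer correctness) reward and is optimized jointly with it in common RL algorithms (PPO, GRPO, and their variants), improving accuracy and output efficiency at the same time.
● Adaptation to several RL algorithms and training details: the paper implements SA-GRPO and SA-PPO within Dr. GRPO (a refinement of GRPO) and PPO, and discusses the trade-offs among advantage estimation, the KL term, and length-related constraints needed to avoid unwanted length reduction during training.

Core idea and reusable steps (workflow)
● Compute two perplexities for each candidate answer: for every answer the model generates, compute the conditional perplexity, based on p(answer ∣ question), and the standalone perplexity, based on p(answer). These are the base quantities for the reward.
● Construct the self-aligned reward: use the relative difference between the two (the conditioned perplexity drop, or relative change) as SAR. The larger the value, the more the answer depends on the question and the more relevant it is.
● Fuse with the verifiable reward: add SAR to the standard verifiable reward with a weight (the paper gives the specific weighted form) and use the sum as the immediate RL reward for computing advantages and updating the policy.
● Manage training and constraints: monitor output length, the KL penalty coefficient, and accuracy during training, so that the model does not maximize SAR or length gains through short answers alone at the expense of necessary reasoning. The paper also uses GRPO's group-wise comparison and the Dr. GRPO variant to stabilize training.
● Evaluate and iterate: measure accuracy and efficiency on multiple benchmarks (math reasoning, logical reasoning, and others), using for example the drop in average response length as the efficiency metric, and run ablations to quantify the contribution of SAR.

Advantages (where SAR adds value over traditional alternatives)
● Balances accuracy and conciseness: SAR gives higher scores to "correct and concise" answers and lower scores to "correct but verbose" or "wrong or irrelevant" answers, so both accuracy and response length benefit.
● Fine-grained, partial credit: unlike a binary verifiable reward, SAR assigns intermediate values to partially correct answers, which helps the model learn reasoning patterns early in training.
● Suppresses memorized, no-reasoning answers: even when an answer is short and correct, SAR tends not to score it highly if it has no substantive reasoning connection to the question (that is, the model simply emits a memorized answer), which encourages concise answers backed by reasoning.
● Low compute overhead and easy integration: SAR only requires computing two perplexities, which is cheaper than training an external reward model or a large-scale evaluator, and it fits readily into existing RL pipelines.

Main experimental conclusions and practical recommendations (with an engineering focus)
● Quantified gains in performance and efficiency: across 4 models and 7 benchmarks, the paper reports that combining SAR with common RL algorithms (PPO/GRPO) improves accuracy by about 4% on average while reducing inference cost (approximated by response length) by about 30%. These are the authors' own quantitative claims.
● Combine with verifiable rewards rather than replacing them: the authors recommend SAR as a complement to verifiable rewards, not a substitute. The two are complementary and together reduce verbosity while maintaining or improving accuracy.
● Be cautious with pure length penalties or confidence rewards: the paper's comparisons show that a pure length reward blindly compresses necessary steps, and that confidence- or entropy-based rewards cannot capture relevance to the question the way SAR does. In practice, an intrinsic signal based on conditional perplexity, such as SAR, should be preferred.
● Training stability and hyperparameter sensitivity: the authors present ablations on the KL coefficient, the SAR weight, and adjustments to the GRPO variant (Dr. GRPO), which indicates that implementations need tuning to avoid the degenerate case where efficiency improves but accuracy drops.
● Attention to "de-memorization" and interpretability: because SAR penalizes no-reasoning short answers, the authors suggest keeping logs and interpretability checks in high-risk or audited settings to verify that the model relies on genuine reasoning.

Limitations, challenges, and future directions (points noted in the paper)
● SAR is not a panacea: although it reaches a Pareto-optimal accuracy-efficiency trade-off, in tasks that require fully developed proofs or explanations it may penalize necessary detail, and SAR itself depends on the quality of the perplexity estimates.
● Dependence on perplexity calibration: SAR is built on relative perplexity differences. If the model's perplexity estimates are unstable or uncalibrated, the reward signal may be affected, so implementations need consistent and numerically stable perplexity computation.
● Validation on broader tasks and modalities: the paper validates SAR on math and logical reasoning. Its generalization could next be tested on multimodal (vision-language) or more open-ended generation tasks, which the authors also discuss.

Translation and commentary: "Self-Aligned Reward: Towards Effective and Efficient Reasoners"

Paper: https://arxiv.org/abs/2509.05489
Date: September 5, 2025
Authors (affiliations): University of Illinois Urbana-Champaign; Amazon Web Services

Abstract
Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific.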
Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.

Figure 1: Training with self-aligned reward enhances both efficiency and accuracy. We present the relative gains in efficiency and accuracy compared to the respective base model in math reasoning benchmarks. Efficiency gain is measured as the drop in average response length.

1. Introduction
Recently, reinforcement learning (RL) with verifiable rewards has attracted broad attention in LLM training, demonstrating remarkable improvements in reasoning skills (Guo et al., 2025; Jaech et al., 2024).
However, such verifiable signals, especially in domains like math, are inherently coarse: they offer rule-based feedback on final answer correctness but fail to capture finer distinctions among responses. For instance, an unnecessarily long solution receives no penalty as long as the final answer is correct, while an almost correct response is treated the same as a completely wrong one. This limitation often induces “overthinking”, where models generate verbose explanations that increase latency and computational cost without improving reasoning quality (Sui et al., 2025). Moreover, relying on external signal sources such as reward models isn’t a favorable solution, as it’s computationally expensive and vulnerable to reward hacking. This underscores the necessity of developing internally grounded reward mechanisms that provide precise and nuanced guidance.

To this end, researchers have proposed heuristic constraints such as length penalties or brevity-oriented objectives (Luo et al., 2025; Aggarwal & Welleck, 2025). While effective in reducing output verbosity, these methods often penalize redundant and essential reasoning at the same time, thereby harming accuracy when necessary intermediate steps are suppressed.
Consequently, these approaches struggle to balance efficiency with correctness, highlighting the challenge of designing internal signals that can distinguish necessary reasoning from redundant elaboration.

Of the various internal signals, perplexity offers a promising option, given its role as a natural proxy for model confidence (Friedland et al., 2024; Agarwal et al., 2025). Building on this insight, we introduce Self-Aligned Reward (SAR), a self-guided proxy for answer quality (Equation 8). SAR evaluates the perplexity of an answer both in isolation and when conditioned on the query, and then measures the relative difference between the two. Consequently, the reward promotes answers that are highly confident under the query context but unlikely to arise independently without the query, which typically corresponds to responses that are concise and aligned with the query. Table 1 compares SAR with existing rewards, from which we can observe that SAR is the only fine-grained approach that promotes accuracy and efficiency at the same time.

We conduct a quantitative analysis of different types of answers, showing that SAR encourages concise and well-targeted answers. This indicates SAR provides an accurate fine-grained reward landscape over answers of different qualities (Section 3). We then train LLMs with reinforcement learning by combining SAR and verifiable reward in PPO and GRPO.
We found that self-aligned PPO and GRPO (SA-PPO and SA-GRPO) achieve notable gains over baselines across 4 models and 7 benchmarks, improving accuracy by 4% and efficiency by 30% (Section 4.2). Moreover, we show that SAR outperforms length-based rewards with a Pareto-optimal front in the accuracy-efficiency trade-off (Section 4.3). In addition, we demonstrate the advantages of SAR over confidence-based methods (Section 5.1) and provide an analysis of its reasoning behaviors (Section 5.2). Our findings suggest that combining verifiable rewards with intrinsic model self-judgment offers a new paradigm for RL training, enabling both fine-grained improvements in reasoning and better efficiency.

Table 1: Comparison of different reward designs.

6. Conclusion
In this work, we propose Self-Aligned Reward (SAR), an internal perplexity-based signal evaluating the answer’s relevancy with the query, enabling fine-grained supervision beyond binary correctness. Through comprehensive experiments across 4 base models and 7 benchmarks, we demonstrated that SAR enables reinforcement learning to achieve consistent gains of up to 4% in accuracy while reducing response length and computational cost by 30%.
Moreover, SAR exhibits a favorable accuracy–efficiency trade-off compared with length-based baselines, offering a fine-grained and content-aware reward signal that complements verifiable correctness. These findings highlight the significance of incorporating intrinsic model self-assessment into the reinforcement learning framework, establishing a new paradigm that advances both the effectiveness and efficiency of large language model training.
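
Illustrative sketch (not from the paper)
To make the perplexity comparison described in the abstract and introduction more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the per-token log-probabilities of the answer are already available from two forward passes of the policy model, one with the question as context and one without. The function names, the clipping to [-1, 1], the weight `alpha`, and the mean-centered group advantage are all assumptions made for illustration; the exact form is given by the paper's Equation 8 and its training setup, which this text does not reproduce.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a token sequence: exp of the negative mean log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(
    cond_logprobs: Sequence[float],
    alone_logprobs: Sequence[float],
) -> float:
    """Relative perplexity drop of the answer once the question is given.

    cond_logprobs:  log p(token | question, previous answer tokens)
    alone_logprobs: log p(token | previous answer tokens), no question
    Assumption: normalize by the standalone perplexity and clip to [-1, 1].
    """
    ppl_cond = perplexity(cond_logprobs)
    ppl_alone = perplexity(alone_logprobs)
    rel_drop = (ppl_alone - ppl_cond) / ppl_alone
    return max(-1.0, min(1.0, rel_drop))


def combined_reward(is_correct: bool, sar: float, alpha: float = 0.2) -> float:
    """Verifiable (binary) reward plus a weighted SAR term.

    alpha is a hypothetical weight, not a value taken from the paper.
    """
    verifiable = 1.0 if is_correct else 0.0
    return verifiable + alpha * sar


def group_advantages(
    rewards: Sequence[float], normalize_std: bool = False
) -> List[float]:
    """Group-relative advantages for answers sampled from the same question.

    With normalize_std=False the rewards are only mean-centered; with True
    they are also divided by the group standard deviation.
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / (std + 1e-8) for c in centered]


if __name__ == "__main__":
    # Two sampled answers to one question, with made-up log-probabilities.
    samples = [
        {"cond": [-0.1, -0.2, -0.3], "alone": [-1.0, -1.0, -1.0], "correct": True},
        {"cond": [-0.9, -1.1, -1.0], "alone": [-1.0, -1.0, -1.0], "correct": True},
    ]
    rewards = [
        combined_reward(s["correct"], self_aligned_reward(s["cond"], s["alone"]))
        for s in samples
    ]
    print(rewards, group_advantages(rewards))
```

Under these assumptions, an answer whose tokens become much more likely once the question is given scores close to 1, and an answer that is equally likely with or without the question scores 0. Both sampled answers in the usage example are correct, so only the SAR term separates them in the group comparison.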