# Free Process Rewards without Process Labels

Lifan Yuan*1, Wendi Li*2,3, Huayu Chen2, Ganqu Cui4, Ning Ding2, Kaiyan Zhang2, Bowen Zhou2, Zhiyuan Liu2, Hao Peng1

**Abstract**

Different from its counterpart, outcome reward models (ORMs), which evaluate entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an Implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratio of the policy and reference models, rϕ(y) = β log(πϕ(y)/πref(y)), which can be optimized regardless of the specific choice of loss objective. In experiments, we train our Implicit PRMs with various objectives and evaluate their performance on MATH. Implicit PRMs outperform a strong MCTS-based baseline à la Math-Shepherd (Wang et al., 2023) using less than 1/38 of the training data. We further find that scaling up instructions and responses benefits our Implicit PRMs, and the latter brings a larger gain. In particular, the Implicit PRM with the cross-entropy (CE) loss is more data-efficient, and yields meaningful improvements even when trained with only one response per instruction, a setup that suffers from extreme data scarcity and imbalance. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.

*Equal contribution. 1University of Illinois Urbana-Champaign, 2Tsinghua University, 3Huazhong University of Science and Technology, 4Shanghai AI Lab. Correspondence to: Lifan Yuan, Wendi Li, Ganqu Cui, Ning Ding.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1: The x-axis indicates the FLOPs required to collect the data and train the model, and the y-axis the best-of-64 accuracy. The accuracy is averaged over the best-of-64 accuracies of Mistral-7B-Instruct-v0.2 (Jiang et al., 2023), Llama-3.1-8B-Instruct, and Llama-3.1-70B-Instruct (Meta, 2024) on MATH (Hendrycks et al., 2021). Different dots on the same line indicate models trained with the same approach but on different scales of data. The top-left zone is desirable in this figure, as it suggests that a model can achieve higher performance with less development overhead. Our Implicit PRM is much cheaper to train while delivering the best performance under the same budget.

## 1. Introduction

Training on high-quality supervised data has driven the advances in LLM development (Meta, 2024; Ding et al., 2023; Luo et al., 2024b; Yue et al., 2024; Yuan et al., 2024; Zhang et al., 2024b). Building upon this progress, reward models push the boundaries even further, especially in tasks requiring complex reasoning (Lightman et al., 2023; Wang et al., 2023; Snell et al., 2024). Outcome Reward Models (ORMs), which evaluate full responses and can be used in both reinforcement learning (RL) and inference, have been the primary focus so far. However, due to the sparsity of outcome rewards, ORMs often yield suboptimal performance when reranking responses at inference (Lightman et al., 2023) and struggle with stability and efficiency during RL training (Cao et al., 2024; Chan et al., 2024). This highlights the growing demand for denser and more fine-grained rewards. Process Reward Models (PRMs), which evaluate intermediate steps to provide fine-grained guidance, naturally meet this need.
Existing work has shown consistent results that PRMs outperform ORMs in best-of-N sampling (Wang et al., 2023; Snell et al., 2024) and RL (Setlur et al., 2024), and argues that scoring every intermediate step provides better transparency and interpretability (Leike, 2024). Despite their promise, PRMs are much harder to train than ORMs, since collecting PRM training data requires annotating every intermediate step. To reduce human effort, automatic annotation approaches have been proposed, where an intermediate step is labeled based on its estimated probability of leading to a correct outcome. Typically, this is achieved either by sampling massive look-ahead trajectories to estimate this probability or by directly training a verifier to predict the Q value, both incurring extensive overhead (Wang et al., 2023; Lu et al., 2024). For example, collecting step-level data through sampling look-ahead trajectories as in Wang et al. (2023) requires 38.8× more FLOPs than training an ORM (§4). We argue, from both theoretical and empirical perspectives, that building PRMs can be substantially cheaper than previously realized: a strong PRM can be obtained at no additional cost from training an ORM on the cheaper response-level data, with a simple reward parameterization. Specifically, by parameterizing the reward as the log-likelihood ratio of the policy and the reference models, rϕ(y) = β log(πϕ(y)/πref(y)), a common practice in DPO (Rafailov et al., 2023) and many of its variants (Azar et al., 2024; Ethayarajh et al., 2024; Chen et al., 2024; Rosset et al., 2024; Wu et al., 2024), a PRM can be automatically learned during ORM training. The process reward is then the same log-likelihood ratio, but calculated over a partial response. We dub our approach an Implicit PRM since it only requires response-level data and ORM training.
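Concretely, under this parameterization the process reward of a step falls out of the accumulated log-likelihood ratio over the partial response: each step's reward is the change in that accumulated ratio when the step is appended. The sketch below illustrates this with plain per-token log-probabilities; the function name and inputs are hypothetical illustrations, not the paper's implementation.

```python
def implicit_process_rewards(policy_logprobs, ref_logprobs, step_ends, beta=1.0):
    """Sketch of implicit process rewards as beta-scaled log-likelihood ratios.

    policy_logprobs, ref_logprobs: per-token log-probabilities of the response
        under the policy and reference models (hypothetical inputs).
    step_ends: index of the last token of each reasoning step.
    """
    # Accumulated reward of each prefix: beta * log(pi_policy(y<=i) / pi_ref(y<=i))
    cum, prefix_scores = 0.0, []
    for lp_pi, lp_ref in zip(policy_logprobs, ref_logprobs):
        cum += beta * (lp_pi - lp_ref)
        prefix_scores.append(cum)
    # Reward of step t = score of the prefix ending at step t minus the
    # score of the prefix ending at the previous step.
    rewards, prev = [], 0.0
    for end in step_ends:
        rewards.append(prefix_scores[end] - prev)
        prev = prefix_scores[end]
    return rewards
```

Note that no extra forward pass beyond scoring the two models is needed: the same quantity used as the outcome reward, evaluated on prefixes, yields the step-level rewards.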
Moreover, our insights are agnostic to the specific choice of training objective: they apply to DPO and all of its variants that adopt the same form of implicit reward, and further extend to other objectives such as the cross-entropy (CE) loss. This theoretical insight generalizes the conclusion of Rafailov et al. (2024) that DPO training enables the model to learn the Q function; practically, our approach is particularly well-suited to scenarios where pairwise data is hard to obtain, since algorithms like the CE loss remain equally applicable, as shown in §5.1. In experiments, we train our Implicit PRMs on a dataset consisting of 33K math instructions with eight solutions each, and evaluate them through best-of-N sampling on MATH (Hendrycks et al., 2021). We explore variants of our Implicit PRMs trained with different objectives, including DPO, KTO (Ethayarajh et al., 2024), NCA (Chen et al., 2024), and CE (§4.2). All produce strong PRMs, outperforming competitive baselines including our reimplementations of Math-Shepherd (Wang et al., 2023) and AutoPSV (Lu et al., 2024) and six off-the-shelf open ORMs and PRMs, with substantially better trade-offs between accuracy and development overhead, as shown in Figure 1. In particular, when integrated into weighted best-of-N, CE stands out as the most effective (??). This makes the CE loss appealing in scenarios where pairwise data is hard to collect, since it can handle unpaired and imbalanced data, and it requires less data than DPO to reach an Implicit PRM with decent performance. Further, we find that our Implicit PRM benefits from increased training data, and the scale of responses is more impactful than that of instructions (§5.1). Surprisingly, training on step-level data brings no further improvements to our Implicit PRMs (Appendix C.2).
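As context for the evaluation protocol: best-of-N picks the single response with the highest reward, while weighted best-of-N sums reward scores over responses that share the same final answer and returns the highest-weighted answer. A minimal sketch, with hypothetical helper names that are not from the paper:

```python
def best_of_n(responses, scores):
    """Standard best-of-N: return the response with the highest reward score."""
    return max(zip(responses, scores), key=lambda pair: pair[1])[0]

def weighted_best_of_n(answers, scores):
    """Weighted best-of-N: reward-weighted voting over final answers.

    answers: the final answer extracted from each sampled response.
    scores: the reward assigned to each response by the reward model.
    """
    totals = {}
    for answer, score in zip(answers, scores):
        totals[answer] = totals.get(answer, 0.0) + score
    return max(totals, key=totals.get)
```

Weighted best-of-N can prefer an answer whose individual responses score moderately but agree with each other over a single high-scoring outlier, which is where a well-calibrated PRM pays off.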
Finally, we observe that, at least for the models and tasks we consider, the reference model can be omitted without hurting the model's quality (§5.3.2). This makes our Implicit PRMs even more appealing, offering improved training efficiency and performance without added inference overhead. By bypassing the need for step labels, Implicit PRMs substantially lower the data collection and training overhead of building PRMs while delivering stronger performance than existing methods. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.

## 2. ORMs vs. PRMs: Dilemma of Performance and Expense

**Background** ORMs assign a sparse reward rϕ(y) to the entire response, and no feedback is provided until the last token is generated. In contrast, a PRM assesses the quality of every intermediate step and can provide a reward after each step is completed (Lightman et al., 2023). Given an instruction and an n-step response y with yt being the t-th step and y