# selfevolved_reward_learning_for_llms__6493a673.pdf

Published as a conference paper at ICLR 2025

SELF-EVOLVED REWARD LEARNING FOR LLMS

Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao, Zeqi Lin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

School of Computer Science, Fudan University; School of Computer Science, Peking University; Microsoft

huangch22@m.fudan.edu.cn, 2201210191@stu.pku.edu.cn, {wlu,fangkaiyang,puzhao,zelin,dongmeiz,saravar,qi-zh}@microsoft.com

Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences, playing a pivotal role in the success of conversational models like GPT-4, ChatGPT, and Llama 2. A core challenge in employing RLHF lies in training a reliable reward model (RM), which relies on high-quality labels typically provided by human experts or advanced AI systems. These methods can be costly and may introduce biases that affect the language model's responses. As language models improve, human input may become less effective in further enhancing their performance. In this paper, we propose Self-Evolved Reward Learning (SER), a novel approach where the RM generates additional training data to iteratively improve itself. We conduct extensive experiments on multiple datasets such as HH-RLHF and UltraFeedback, using models like Mistral and Llama 3, and compare SER against various baselines. Our results demonstrate that even with limited human-annotated data, learning from self-feedback can robustly enhance RM performance, thereby boosting the capabilities of large language models (LLMs). Resources of this paper can be found at https://aka.ms/ser

1 INTRODUCTION

Reinforcement Learning from Human Feedback (RLHF) is a well-established approach that aligns Large Language Models (LLMs) with human preference data Ouyang et al. (2022); Bai et al. (2022b). The standard approach learns a reward model (RM) from human preferences; the learned RM is then frozen and used to train LLMs via Reinforcement Learning (RL), such as Proximal Policy Optimization (PPO) Schulman et al. (2017a). Another common approach, such as Direct Preference Optimization (DPO) Rafailov et al. (2024), trains LLMs directly from the human preference data without learning an RM. Both approaches rely heavily on the size and quality of human-annotated preference data. However, such data is often limited and expensive to acquire, posing a significant bottleneck in the development and performance of RL approaches Yuan et al. (2024b). This dependency on human-annotated data hinders the scalability of strong LLMs, which require vast amounts of labeled data to achieve greater performance Kaplan et al. (2020); Muennighoff et al. (2024). To mitigate the dependency, recent works leverage AI feedback to train RMs, referred to as Reinforcement Learning from AI Feedback (RLAIF) Bai et al. (2022b); Lee et al. (2023), which reduces the reliance on human-annotated data. However, these works hold the heuristic assumption that LLMs can provide high-quality feedback, and they often require stronger LLMs to provide that feedback Pang et al. (2023).

Recent advancements suggest that LLMs have the potential to serve as world models to a certain degree, capable of understanding world knowledge and complex patterns independently of explicit human input Hao et al. (2023); Guan et al. (2023); Zhao et al. (2024). Leveraging this ability, LLMs can evaluate and provide feedback. In the context of RLHF and RLAIF, this capability of LLMs can be extended to the role of RMs, on which RL approaches rely heavily Dewey (2014); Li (2017). Focusing on training a better RM with limited human-annotated data, we propose a novel reward learning approach, which self-evolves the RM through a feedback loop using the RM itself. In our approach, the LLM serves as the RM, generating feedback on the dataset that is subsequently used to refine its own learning. This iterative feedback-then-train loop allows the RM to self-evolve over time, gradually improving its performance, even with some noise in the initial self-labeled data. As the iteration progresses, however, similar data offers diminishing help and can even degrade performance. To address this, we identify the RM learning status in each iteration and introduce data filtering strategies to select high-confidence data that are later used for more robust RM training.

Footnotes: Work done during an internship at Microsoft. Corresponding author.

[Figure 1 diagram: Unlabeled Data → (1) Self-Label with Reward Model (RM) → (2) Select High-Confidence Self-Labeled Data: (a) Distinguish Good and Bad, (b) Amplify Difference → Selected Self-Labeled Data → (3) Retrain RM with Selected Self-Labeled Data → Next Iteration Reward Model → (4) Train LLM with RM via Reinforcement Learning]

Figure 1: The Self-Evolved Reward Learning (SER) pipeline. Our SER method consists of the following steps: (1) Self-labeling: the reward model (RM) assigns labels to unlabeled data. (2) Identifying learning status and selecting data: high-confidence data is selected by assessing the learning status. (3) Retraining the RM: the RM trains itself using the selected self-labeled data. (4) Training the Large Language Model (LLM): the LLM is trained under the guidance of the self-evolved RM. Note that steps (1)-(3) iterate for multiple rounds until the RM converges.

By employing this self-evolved reward learning process, where the RM continually learns from its own feedback, we reduce dependency on large human-labeled data while maintaining, or even improving, the model's performance. Our contributions are threefold:

We introduce a novel self-evolved reward learning framework, demonstrating that only 15% of human-annotated seed data is required to achieve performance comparable to models trained with full human-labeled datasets, significantly reducing reliance on human data.

We provide insights into the broader implications of self-learning paradigms in LLMs, particularly in improving reinforcement learning by enhancing RMs (see Section 4.1.2).

Extensive experiments demonstrate that our self-evolved reward learning framework consistently improves performance across various LLMs, model sizes, and datasets.

We conduct experiments on multiple datasets and on LLMs of varying sizes to validate the generalization and effectiveness of our method. We find that, compared to the seed models that use only a small amount of human-labeled data, our method robustly and significantly enhances model performance, with an average improvement of 7.88%. After multiple iterations, the converged model can match or even surpass the performance of models trained on the entire human-annotated dataset, providing a potential solution for the self-improvement of models.


2 RELATED WORK

2.1 REINFORCEMENT LEARNING FROM EXTERNAL FEEDBACK

Preference learning, now commonly referred to as reinforcement learning from human feedback (RLHF) Christiano et al. (2017); Ziegler et al. (2019); Stiennon et al. (2020b); Ouyang et al. (2022); Bai et al. (2022a), trains a fixed reward model (RM) from human preference data, and the trained RM is then used to train the Large Language Model (LLM) via RL, such as Proximal Policy Optimization (PPO) Schulman et al. (2017b). To make training more stable and efficient, methods such as Direct Preference Optimization (DPO) Rafailov et al. (2024) directly train the LLM using human preferences without training an RM. Other methods Zhao et al. (2023); Gulcehre et al. (2023); Yuan et al. (2024a) adjust the preference training schemes to improve performance and stability. However, obtaining human preference data, especially high-quality data, is extremely expensive and time-consuming Köpf et al. (2024); Xu et al. (2023); Sun et al. (2024), and the data tends to have low diversity, containing little expert-annotated data, which requires huge effort and expertise Peng et al. (2023); Zhang et al. (2023); Xu et al. (2023). The quality and size of the data set the bottleneck for LLM performance. Reinforcement Learning from AI Feedback (RLAIF) Bai et al. (2022b); Lee et al. (2023) employs LLMs to generate feedback for training RMs, reducing reliance on human-annotated data. However, it relies on the heuristic assumption that LLMs can provide high-quality and diverse feedback, and it often requires stronger LLMs to provide that feedback Pang et al. (2023). In this paper, we leverage a small percentage of the human-annotated data to train an RM that achieves performance comparable to one trained with the full annotated data. The RM is further used in PPO to train the LLM.

2.2 SELF-LEARNING IN LLMS

As LLMs develop towards superhuman-level capability, they may be bottlenecked by the level of human performance. Similar to self-improvement through human reflection, self-learning has recently emerged as a new approach to improving LLM performance. Self-learning in LLMs focuses on enhancing capabilities without external supervision. SELF-ALIGN Sun et al. (2024) demonstrates self-alignment through principle-driven reasoning, allowing models to adjust their outputs based on internal guidelines. ReST$^{EM}$ Singh et al. (2023) employs self-training to enhance problem-solving abilities. RLC Pang et al. (2023) and SCoRe Kumar et al. (2024) showcase methods for self-correction and improvement using self-generated data. Additionally, Huang et al. (2022) illustrates how LLMs can refine reasoning through self-generated rationale-augmented answers, enhancing their explanatory depth. Math-Shepherd Wang et al. (2024) and Self-Rewarding Language Models Yuan et al. (2024b) demonstrate self-rewarding mechanisms, where the model has the ability to provide high-quality rewards to itself. Our proposed approach falls within this self-learning paradigm: it uses the RM to generate feedback for itself, fostering robust RM training and improvement.

3 SELF-EVOLVED REWARD LEARNING FOR LARGE LANGUAGE MODELS

In this section, we present our proposed Self-Evolved Reward Learning (SER) for LLMs. This approach enables the RM to iteratively improve itself by learning from its own high-confidence predictions, thereby reducing the need for extensive human-annotated data. Initially, the RM is trained with a small set of human-annotated data to provide a basic understanding of good and bad answers. From there, the RM evolves through self-labeling and iterative retraining. The enhanced RM is then employed to guide the LLM training via RL approaches. We detail each component of our method below, including self-labeling, identifying the learning status, data filtering, RM retraining, and LLM training via RL with the improved and converged RM.

3.1 OVERVIEW

Figure 1 illustrates the overall pipeline of our SER method. This iterative process ensures that both the reward model and the LLM are continuously refined throughout the training cycle. Our method for Reward Model training consists of the following three iterative steps:


1. Self-Label with Reward Model: The RM is initially trained with a small set of human-annotated data as a warm-up stage, then the RM performs self-labeling on the unlabeled data.

2. Identify the Learning Status of the Reward Model and Select High-Confidence Data: Evaluate the RM's current ability to differentiate between good and bad answers or to amplify differences between similar answers. This status assessment guides the selection of high-confidence data.

3. Retrain the Reward Model with Pairwise Loss: After filtering, the selected high-confidence data are used to retrain the RM with pairwise loss, iteratively enhancing its understanding of answer quality.

After a few iterations of self-evolved reward learning, the RM training converges or meets the stopping criteria, such as when no further data can be filtered. The RM is then used to guide the training of the LLM via RL approaches. The modified PPO algorithm incorporates the evolved reward signals to optimize the LLM's policy.

Our method relies on two distinct learning statuses for the RM: (1) the ability to distinguish between clearly good and bad answers, and (2) the ability to refine differences between answers of similar quality. These statuses are separated for the following reasons: (a) Targeted Skill Development: by recognizing different learning statuses, the RM can focus on specific skill sets. Initially, the model focuses on clear distinctions (e.g., good vs. bad answers), and as training progresses, it refines its comparative abilities with more subtle distinctions. (b) Adaptive Data Filtering: the data filtering process is driven by the current learning status, allowing the model to train on the most relevant data. This adaptive approach ensures the model always works on improving the appropriate aspect of its performance. (c) Improved Self-Evaluation: by continuously monitoring its learning status, the RM can determine when to shift from one learning focus to another. This dynamic approach fosters self-driven, curriculum-like learning.

Furthermore, by allowing the RM to judge two answers for each question, the RM is provided with paired examples that are key to both learning statuses, enabling the RM to improve its discrimination and comparative abilities. Once the RM becomes proficient at handling both tasks, it is well-equipped to guide the LLM during reinforcement learning.
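The overall feedback-then-train loop described above can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions: `train` and `select` are hypothetical callables standing in for RM training and for the combined self-labeling and filtering steps, `select` is assumed to return pairs already ordered as (question, preferred answer, rejected answer), and the `max_loops` cap is not a value taken from the paper.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (question, answer_1, answer_2)


def self_evolve(
    seed_data: List[Pair],
    unlabeled: List[Pair],
    train: Callable[[List[Pair]], object],                # hypothetical: fits an RM on pairs
    select: Callable[[object, List[Pair]], List[Pair]],   # hypothetical: self-label + filter
    max_loops: int = 3,                                   # assumed cap on the number of loops
) -> object:
    """Iterate self-label -> select -> retrain until nothing is selected."""
    train_set = list(seed_data)
    rm = train(train_set)                    # Loop 0: warm-up on human-annotated seed data
    for _ in range(max_loops):
        selected = select(rm, unlabeled)     # steps 1-2: self-label and keep confident pairs
        if not selected:                     # stopping criterion: no data passes the filter
            break
        train_set.extend(selected)           # data from earlier loops is kept
        rm = train(train_set)                # step 3: retrain the RM
    return rm
```

The returned RM would then be handed to the RL stage (step 4), which is outside this sketch.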

STEP 1: SELF-LABEL WITH REWARD MODEL

As shown in Figure 1, we first predict a reward score for all unlabeled data based on the current reward model (RM). This is formally expressed as follows:

$$r_i = \mathrm{RM}(Q_i, A_i) \tag{1}$$

This reward score may contain substantial noise, depending on the performance of the current state of the RM. We use these reward scores to determine the current training status and data selection strategy. Initially, we employ a small amount of human-annotated data to obtain a seed RM. In this study, the seed RM is trained using 15% of the entire dataset.
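As a minimal illustration of the self-labeling step, the sketch below applies a scoring function to both answers of every unlabeled pair. The `score` callable is a hypothetical stand-in for the RM's forward pass, and the dictionary layout is an assumption made for readability, not the authors' data format.

```python
from typing import Callable, Dict, List, Tuple


def self_label(
    score: Callable[[str, str], float],       # hypothetical stand-in for RM(Q, A)
    pairs: List[Tuple[str, str, str]],        # (question, answer_1, answer_2)
) -> List[Dict[str, object]]:
    """Attach the RM score of each answer (Equation 1) to every pair."""
    labeled = []
    for question, answer_1, answer_2 in pairs:
        labeled.append({
            "question": question,
            "answers": (answer_1, answer_2),
            "scores": (score(question, answer_1), score(question, answer_2)),
        })
    return labeled
```

Because the scores come from the current RM, they are only as reliable as that RM, which is why the next step filters them before retraining.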

STEP 2: IDENTIFY THE LEARNING STATUS OF THE REWARD MODEL AND SELECT HIGH-CONFIDENCE DATA

Each question $Q_i$ in our self-labeled dataset has two possible answers, $A_i^1$ and $A_i^2$, which can exhibit various relationships. The RM must differentiate between the following scenarios: one answer is clearly better than the other (e.g., $A_i^1$ is good and $A_i^2$ is bad, or vice versa), or both answers are good but one is better (or both are bad but one is worse). The RM assigns probabilities $p_i^1$ and $p_i^2$ that represent the likelihood that $A_i^1$ and $A_i^2$ are "good". The goal is to distinguish the relative quality of the answers across these different cases.

Let $D_{\text{train}} = \{(Q_i, A_i^1, A_i^2)\}_{i=1}^{N}$ be the training dataset. The learning status $S$ is determined by the predicted probability differences between $A_i^1$ and $A_i^2$:

$$\Delta_i = |p_i^1 - p_i^2| \tag{2}$$

We define the learning status $S$ using thresholds $\tau_{\text{low}}$, $\tau_{\text{high}}$, and $\tau_{\Delta}$:

$$S = \begin{cases} \text{Status1}, & \text{if } (p_i^1 > \tau_{\text{high}} \text{ and } p_i^2 < \tau_{\text{low}}) \text{ or } (p_i^1 < \tau_{\text{low}} \text{ and } p_i^2 > \tau_{\text{high}}), \\ \text{Status2}, & \text{else if } \Delta_i > \tau_{\Delta}, \\ \text{Stop}, & \text{otherwise.} \end{cases} \tag{3}$$

To determine the current status, we use the reward model (RM) trained in the current iteration to predict on the unlabeled data. Both Status 1 and Status 2 require a sufficient number of predictions meeting specific criteria to ensure a statistically meaningful assessment. In this paper, we selected $\tau_{\text{high}} = 0.55$, $\tau_{\text{low}} = 0.45$, and $\tau_{\Delta} = 0.3$, as they provided the most consistent improvements in the RM's ability.

Status 1 (Easier Task): This status evaluates whether the RM can effectively distinguish between positive (good) and negative (bad) samples. The evaluation is based on the predicted probabilities $p_j^k$ for each answer $A_j^k$: if $p_j^k > \tau_{\text{high}}$, the RM is confident that the answer is positive; if $p_j^k < \tau_{\text{low}}$, the RM is confident that the answer is negative. A sufficient number of high-confidence predictions (e.g., 600 in the HH dataset) indicates that the RM is proficient in distinguishing positive and negative samples, thereby satisfying the criteria for Status 1.

Status 2 (Harder Task): This status assesses the RM's ability to discern subtle differences between answers of similar quality (e.g., both good or both bad). It requires the RM to evaluate paired answers to the same question and compute the absolute difference between their predicted probabilities: $|p_j^1 - p_j^2| > \tau_{\Delta}$.

If a sufficient number of paired predictions meet this threshold, it indicates that the RM can amplify distinctions between similar-quality answers. This task is more challenging than Status 1 because it requires the RM to recognize and quantify nuanced differences. Similar to Status 1, this determination requires a sufficient number of predictions on the unlabeled dataset (e.g., 600 predictions in the HH dataset).

We check the statuses in order, first Status 1 and then Status 2, because Status 1 represents a foundational capability that is necessary before tackling the more complex task in Status 2. Status 1 is the easier task, focusing on broad distinctions, while Status 2 is the harder task, requiring finer-grained analysis. If the RM does not meet the criteria for Status 1 (i.e., few or no samples satisfy the thresholds $\tau_{\text{high}}$ and $\tau_{\text{low}}$), we then check for Status 2. If the RM also fails to meet the criteria for Status 2, we interpret this as the model reaching its convergence point, and we halt further training of the RM.
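The ordered status check can be sketched as follows. The thresholds are the ones reported in the text; the rest is an assumption made for illustration: the function and constant names are hypothetical, the count is taken over answer pairs (the text speaks of "predictions"), and 600 is only the example figure quoted for the HH dataset.

```python
from typing import List, Tuple

TAU_HIGH, TAU_LOW, TAU_DELTA = 0.55, 0.45, 0.3   # thresholds reported in the text
MIN_COUNT = 600                                   # example count for HH; an assumption here


def is_clear_pair(p1: float, p2: float) -> bool:
    """Status 1 criterion: one answer confidently good, the other confidently bad."""
    return (p1 > TAU_HIGH and p2 < TAU_LOW) or (p1 < TAU_LOW and p2 > TAU_HIGH)


def learning_status(probs: List[Tuple[float, float]], min_count: int = MIN_COUNT) -> str:
    """Check Status 1 first, then Status 2; otherwise signal convergence."""
    n_clear = sum(is_clear_pair(p1, p2) for p1, p2 in probs)
    if n_clear >= min_count:
        return "Status1"
    n_gap = sum(abs(p1 - p2) > TAU_DELTA for p1, p2 in probs)
    if n_gap >= min_count:
        return "Status2"
    return "Stop"
```

Note that a pair such as (0.9, 0.5) has a large gap but is not a clear good-versus-bad pair, so it counts toward Status 2 only.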

STEP 3: RETRAIN THE REWARD MODEL WITH FILTERED DATA USING PAIRWISE LOSS

Based on the state of the RM determined in Step 2, we select different data filtering strategies as outlined below:

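$$F(D_{\text{unlabeled}}, S) = \begin{cases} \{(Q_j, A_j^1, A_j^2) \mid (\mathrm{RM}(Q_j, A_j^1) > \tau_{\text{high}} \text{ and } \mathrm{RM}(Q_j, A_j^2) < \tau_{\text{low}}) \text{ or } (\mathrm{RM}(Q_j, A_j^1) < \tau_{\text{low}} \text{ and } \mathrm{RM}(Q_j, A_j^2) > \tau_{\text{high}})\}, & \text{if } S = \text{Status1}, \\ \{(Q_j, A_j^1, A_j^2) \mid |\mathrm{RM}(Q_j, A_j^1) - \mathrm{RM}(Q_j, A_j^2)| > \delta\}, & \text{if } S = \text{Status2}, \\ \emptyset, & \text{if } S = \text{Stop}. \end{cases} \tag{4}$$

Here, $D_{\text{unlabeled}}$ refers to the unlabeled data. In Status 1, the filter selects high-confidence data where ($\mathrm{RM}(Q_j, A_j^1) > \tau_{\text{high}}$ and $\mathrm{RM}(Q_j, A_j^2) < \tau_{\text{low}}$) or ($\mathrm{RM}(Q_j, A_j^1) < \tau_{\text{low}}$ and $\mathrm{RM}(Q_j, A_j^2) > \tau_{\text{high}}$), ensuring the model trains on reliable examples. In Status 2, the filter selects pairs where the reward difference $|\mathrm{RM}(Q_j, A_j^1) - \mathrm{RM}(Q_j, A_j^2)|$ exceeds a threshold $\delta$, focusing on refining comparative judgments.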
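A minimal sketch of the status-dependent filter in Equation 4 is given below. Two things are assumptions rather than details from the paper: the default value of `delta` (the text does not state a value for $\delta$), and the convention of returning each kept pair as (question, preferred, rejected), with the higher-scored answer treated as preferred.

```python
from typing import List, Tuple

# (question, answer_1, answer_2, RM score of answer_1, RM score of answer_2)
Scored = Tuple[str, str, str, float, float]


def filter_pairs(
    data: List[Scored],
    status: str,
    tau_high: float = 0.55,
    tau_low: float = 0.45,
    delta: float = 0.3,          # assumed default; not specified in the text
) -> List[Tuple[str, str, str]]:
    """Return (question, preferred, rejected) triples selected for the given status."""
    selected = []
    for q, a1, a2, r1, r2 in data:
        if status == "Status1":
            keep = (r1 > tau_high and r2 < tau_low) or (r1 < tau_low and r2 > tau_high)
        elif status == "Status2":
            keep = abs(r1 - r2) > delta
        else:                    # "Stop": the filter returns the empty set
            keep = False
        if keep:
            # Assumed labeling rule: the higher-scored answer is the preferred one.
            selected.append((q, a1, a2) if r1 > r2 else (q, a2, a1))
    return selected
```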

After filtering, the model is retrained using pairwise loss, which lets the model compare answers relatively rather than relying on absolute labels and consistently improves performance. The pairwise loss function is:

$$L_{\text{pair}} = \frac{1}{|D_{\text{filtered}}|} \sum_{(Q_j, A_j^1, A_j^2) \in D_{\text{filtered}}} \max\left(0,\; m - \left(\mathrm{RM}(Q_j, A_j^1) - \mathrm{RM}(Q_j, A_j^2)\right)\right) \tag{5}$$

where $m$ is the desired margin between reward scores, and $D_{\text{filtered}} = D^{n}_{\text{filtered}} + D^{n-1}_{\text{filtered}}$, where $n$ denotes the number of iterations of the loop. The training data for the current loop consists of the data filtered using Equation 4, in addition to the training data from the previous loops. This iterative process, i.e., filtering data and retraining with pairwise loss, enables the RM to progressively refine its judgment until it converges.
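To make the margin-based pairwise loss concrete, here is a small framework-free sketch that evaluates it on already-computed reward scores. The margin value of 1.0 and the function name are assumptions for illustration, and each tuple is assumed to be ordered as (score of preferred answer, score of rejected answer).

```python
from typing import List, Tuple


def pairwise_margin_loss(score_pairs: List[Tuple[float, float]], margin: float = 1.0) -> float:
    """Mean of max(0, margin - (r_preferred - r_rejected)) over the filtered pairs."""
    if not score_pairs:
        return 0.0               # nothing was filtered, so there is nothing to penalize
    total = sum(max(0.0, margin - (r_pos - r_neg)) for r_pos, r_neg in score_pairs)
    return total / len(score_pairs)
```

For example, with a margin of 1.0 a pair scored (2.0, 0.5) already satisfies the margin and contributes zero, whereas a pair scored (0.5, 1.0) is ranked the wrong way round and contributes 1.5. In actual training the same expression would be computed on differentiable tensors so that gradients flow back into the RM.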

STEP 4: TRAIN THE LLM VIA RL WITH SELF-EVOLVED REWARD MODEL

After self-evolving the RM, we use it to guide the training of the LLM via RL. To accommodate the refined reward signals from the RM, we modify the PPO framework. LLM training is framed as a policy optimization problem. The policy $\pi_\phi$ generates responses $A$ for inputs $Q$, and the objective is to optimize $\pi_\phi$ to maximize the expected reward $r = \mathrm{RM}(Q, A)$ generated by the self-evolved RM: $\max_\phi \mathbb{E}_{Q \sim D_{\text{train}},\, A \sim \pi_\phi(\cdot \mid Q)}[\mathrm{RM}(Q, A)]$.

Using PPO (Schulman et al., 2017b), we modify the policy updates to incorporate the refined reward signals from $R_\theta$, which better capture subtle differences in response quality. The policy is updated by maximizing the clipped surrogate objective:

$$L_{\text{PPO}} = \mathbb{E}\left[\min\left(\frac{\pi_\phi(A \mid Q)}{\pi_{\phi_{\text{old}}}(A \mid Q)} A_R,\; \mathrm{clip}\left(\frac{\pi_\phi(A \mid Q)}{\pi_{\phi_{\text{old}}}(A \mid Q)},\, 1 - \epsilon,\, 1 + \epsilon\right) A_R\right)\right] \tag{6}$$

Here, $A_R$ is the advantage function based on the rewards from the RM. By leveraging the evolved reward model's nuanced signals, the LLM's policy updates align better with subtle distinctions in response quality. The detailed algorithm framework of our SER approach is provided in Appendix B.
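The clipped surrogate objective in Equation 6 can be evaluated numerically with the short sketch below. It works on per-sample log-probabilities and advantages that are assumed to be precomputed; the value `epsilon=0.2` is a commonly used PPO default and is an assumption here, since the text does not state the clipping range.

```python
import math
from typing import List


def clipped_surrogate(
    logp_new: List[float],      # log pi_phi(A | Q) under the current policy
    logp_old: List[float],      # log pi_phi_old(A | Q) under the sampling policy
    advantages: List[float],    # A_R, derived from RM rewards
    epsilon: float = 0.2,       # assumed clipping range
) -> float:
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

With a ratio of 1.5 and a positive advantage of 1.0, the clipped term 1.2 is used, which caps the incentive to move further from the old policy; with a ratio of 0.5 and an advantage of -1.0, the minimum selects -0.8, the more pessimistic of the two estimates.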

3.2 THEORETICAL ANALYSIS

In this section, we analyze the theoretical feasibility of SER, focusing on the convergence properties of both RM training and PPO training. A detailed theoretical analysis supporting the effectiveness of our SER method is presented in Appendix A. The key conclusions of this analysis are as follows: (a) Convergence of the Reward Model: we demonstrate that, under reasonable assumptions, the RM iteratively improves by selecting high-confidence data based on predicted probabilities. This process ensures that its performance either improves or remains stable over time. (b) Convergence of PPO with a Learned Reward Model: we establish that PPO converges to a near-optimal policy even when the RM is trained through self-labeling, provided that reward estimation errors are small. These findings confirm that both the reward model and the LLM are capable of achieving high performance with minimal human supervision.

4 EXPERIMENT

In this section, we report our main experimental results, including reward modeling results and PPO results. We select multiple base models with different parameter sizes (Llama 3 8B (Dubey et al., 2024), Llama 2 13B (Touvron et al., 2023), Llama 3 70B (Dubey et al., 2024), Mistral 7B (Jiang et al., 2023)) and conduct experiments on various datasets (Stack Overflow (Lambert et al., 2023), HH-RLHF (Bai et al., 2022a), UltraFeedback (Cui et al., 2023), Summarize (Stiennon et al., 2020a)) to verify the effectiveness of the method. The experimental setup, dataset statistics, evaluation metrics, and baselines are provided in Appendix C.

4.1 REWARD MODELING RESULTS

Our main experimental results are shown in Table 1. In all experimental setups, SER improves the model's performance, ultimately achieving results close to those obtained using the full labeled dataset. In some experimental settings, it even exceeds the performance of models trained with the full human-labeled data, while using only 15% of the labeled data. This demonstrates the substantial potential of SER to enhance model performance in data-scarce scenarios.

Table 1: The results of reward modeling (accuracy, %) on the HH-RLHF, Ultrafeedback, Summarize, and Stackoverflow datasets. Loop 0 denotes the RM trained with 15% of the human data. SER represents the results of iterative evolution based on the Loop 0 model. Full Dataset denotes the results of the RM trained with the entire set of human-annotated data.

| | HH-RLHF: Llama3-8b | HH-RLHF: Mistral-7b | HH-RLHF: Llama2-13b | Summarize: Llama3-8b | Summarize: Mistral-7b | Summarize: Llama2-13b |
|---|---|---|---|---|---|---|
| Loop 0 | 56.9 | 56.01 | 59.47 | 58.01 | 55.84 | 63.2 |
| SER | 68.56 | 68.1 | 70.26 | 68.42 | 65.49 | 69.19 |
| Full Dataset | 70.45 | 67.97 | 71.11 | 68.61 | 63.56 | 71.3 |

| | Ultrafeedback: Llama3-8b | Ultrafeedback: Mistral-7b | Ultrafeedback: Llama2-13b | Stackoverflow: Llama3-8b | Stackoverflow: Mistral-7b | Stackoverflow: Llama2-13b |
|---|---|---|---|---|---|---|
| Loop 0 | 62.54 | 61.4 | 66.5 | 69 | 68.8 | 65.1 |
| SER | 74.46 | 70 | 72.69 | 70.8 | 70.1 | 69.2 |
| Full Dataset | 73.92 | 71.3 | 74.53 | 69 | 70.4 | 68.7 |

4.1.1 MAIN FINDINGS

SER consistently and effectively enhances model performance. As shown in Table 1, compared to the baseline that uses only 15% of the data, SER improves the model's performance by incorporating self-labeled data for training. Through multiple iterations, the model achieves significant performance gains, resulting in an average 7.88% increase in accuracy. Furthermore, we find that as the model's parameter size increases, its foundational capability strengthens and the potential for self-improvement through SER grows. Larger models typically achieve higher performance after undergoing self-improvement. In most experimental settings, the performance of the Llama 13B model surpasses that of the two smaller models.

In data-rich scenarios (Stack Overflow), the performance gains from SER become smaller, averaging only 2.4%. In such data-rich contexts, a clear scaling trend with model parameter size is observed. The larger the model parameters, the greater the benefits of SER s self-improvement. Mistral 7B achieves a performance improvement of 1.3%, Llama 8B achieves 1.8%, and Llama 13B achieves 4.1%.
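The reported averages can be reproduced directly from Table 1. The short script below recomputes the SER gain over Loop 0 for every dataset and model; the variable names are illustrative, and the column order (Llama3-8b, Mistral-7b, Llama2-13b) follows the table.

```python
# Accuracies (%) from Table 1, ordered as Llama3-8b, Mistral-7b, Llama2-13b.
loop0 = {
    "HH-RLHF": [56.9, 56.01, 59.47],
    "Summarize": [58.01, 55.84, 63.2],
    "Ultrafeedback": [62.54, 61.4, 66.5],
    "Stackoverflow": [69.0, 68.8, 65.1],
}
ser = {
    "HH-RLHF": [68.56, 68.1, 70.26],
    "Summarize": [68.42, 65.49, 69.19],
    "Ultrafeedback": [74.46, 70.0, 72.69],
    "Stackoverflow": [70.8, 70.1, 69.2],
}

# Gain of SER over the Loop 0 seed model, per dataset and model.
gains = {d: [s - l for s, l in zip(ser[d], loop0[d])] for d in loop0}

all_gains = [g for per_dataset in gains.values() for g in per_dataset]
mean_gain = sum(all_gains) / len(all_gains)          # average over all 12 settings
so_mean = sum(gains["Stackoverflow"]) / len(gains["Stackoverflow"])

print(f"mean gain over all settings: {mean_gain:.3f}")
print(f"mean gain on Stackoverflow:  {so_mean:.3f}")
```

The overall mean comes out at about 7.875 and the Stack Overflow mean at about 2.4, matching the 7.88% and 2.4% quoted in the text. Since these are differences between accuracies, they are gains in percentage points.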

SER can approach or even exceed the performance of full-scale human-labeled data. We compare our method with training on the full human-labeled data. The results demonstrate that SER can achieve performance close to that of using the complete human-labeled dataset, with an average performance difference of 0.3%. For Mistral 7B on the HH-RLHF dataset, SER exceeds the full-data baseline by 0.13%, and on the Summarize dataset, it surpasses the baseline by 1.93%. For Llama 8B on the UltraFeedback dataset, it achieves a performance advantage of 0.54% over the baseline. A potential trend is observed where the difference between the SER method and the full human-labeled data increases with model size. Specifically, the average difference for Mistral 7B is +0.12%, for Llama 8B is +0.06%, and for Llama 13B is -1.07%. This suggests that larger models better utilize labeled data, enhancing performance. This trend highlights the potential of SER to further elevate model performance by scaling labeled data through self-labeling rather than manual annotation.

In data-rich scenarios, this trend becomes more pronounced. On the Stack Overflow dataset, Llama 8B achieves performance very close to that of using the full dataset with only 15% of the human-labeled data. By employing SER, the model's performance is further enhanced, surpassing the full human-labeled data by 1.8%. Llama 13B shows a 0.5% performance improvement compared to the full human-labeled data, while Mistral performs 0.3% lower than the baseline. This indicates that when data is abundant, the model's self-evolution can lead to more diverse data distributions, thereby further raising the model's performance ceiling.
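The per-model comparison against the full-data baseline can likewise be checked against Table 1. The script below averages the difference (SER minus Full Dataset) over the four datasets for each model; names are illustrative and the column order follows the table.

```python
# Accuracies (%) from Table 1, ordered as Llama3-8b, Mistral-7b, Llama2-13b.
MODELS = ["Llama3-8b", "Mistral-7b", "Llama2-13b"]
ser = {
    "HH-RLHF": [68.56, 68.1, 70.26],
    "Summarize": [68.42, 65.49, 69.19],
    "Ultrafeedback": [74.46, 70.0, 72.69],
    "Stackoverflow": [70.8, 70.1, 69.2],
}
full = {
    "HH-RLHF": [70.45, 67.97, 71.11],
    "Summarize": [68.61, 63.56, 71.3],
    "Ultrafeedback": [73.92, 71.3, 74.53],
    "Stackoverflow": [69.0, 70.4, 68.7],
}

# Average (SER - Full Dataset) difference per model across the four datasets.
diff_by_model = {}
for col, model in enumerate(MODELS):
    diffs = [ser[d][col] - full[d][col] for d in ser]
    diff_by_model[model] = sum(diffs) / len(diffs)

# Each model has the same number of datasets, so this equals the mean over all 12 settings.
overall = sum(diff_by_model.values()) / len(diff_by_model)

for model, value in diff_by_model.items():
    print(f"{model}: {value:+.3f}")
print(f"overall: {overall:+.3f}")
```

The per-model means are approximately +0.115 (Mistral-7b), +0.065 (Llama3-8b), and -1.075 (Llama2-13b), and the overall mean is approximately -0.298, which agrees with the figures quoted above up to rounding.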


[Figure 2 plots: test accuracy (%) versus training steps for Loop 0, Loop 1, Loop 2, Loop 3, and Baseline. Panel groups: (a) HH-RLHF, (b) Ultrafeedback, (c) Summarize; each group has sub-panels for Llama 8B, Mistral 7B, and Llama 13B.]

Figure 2: Reward modeling improves in performance with iterative evolution. We demonstrate the performance variation of the model during the iterative process on the HH-RLHF, Ultrafeedback, and Summarize datasets. Baseline refers to the RM that uses the full dataset of human-annotated data. Due to the large size of the summarize test set, the results in the figure are based on a random sample of 1/10 of the test set.

4.1.2 FINE-GRAINED ANALYSIS

To analyze in more detail what occurs during the model's iterations, we present the changes in the model's accuracy on the validation set, as shown in Figure 2. Additionally, we illustrate the variations in the amount of training data across different iterations, as depicted in Figure 3. Our main conclusions are as follows:

The model can iteratively enhance its performance on self-labeled data, even if the self-labeled data contains noise. As shown in Figure 2, Loop 1 consistently enhances the model's performance in every experimental setup. During the Loop 1 phase, the model's performance is relatively weak, and there may be significant noise in the model's self-feedback; therefore, it is essential to select high-confidence samples to improve the model's performance (Status 1). Generally, the performance improvement in Loop 1 is the most significant among all loops, and this loop selects the largest number of training examples. As illustrated in Figure 2, on average, Loop 1 provides a 4.54% enhancement to the model's performance. This also verifies our Theory 1, which posits that when the model's initial accuracy exceeds 50%, iterative training with high-confidence samples can further improve the model's performance.

Similar data becomes only marginally helpful after multiple iterations and may even harm the model's performance. During the Loop 2 phase, as the model's capability increases, the benefits brought by the simple samples selected under Status 1 become less significant. As shown in Figure 2, the performance improvement in Loop 2 is the least significant across all iterations. This indicates that merely increasing the number of clearly defined samples offers limited performance enhancement for the model and may even lead to a decrease in model performance (as observed with Llama 13B on the UltraFeedback dataset). Based on our observation of the training process, more ambiguous and difficult samples should be incorporated into training, allowing the model to better discern the quality of two similar samples.

[Figure 3 plots: horizontal bars of Data Nums (%) for Loop 0 (human), Loop 1, Loop 2, and Loop 3; panels (a)-(i), each group covering Llama 8B, Mistral 7B, and Llama 13B.]

Figure 3: The percentage of the total data used by the RM in each iteration. (a)-(c) correspond to the HH-RLHF dataset, (d)-(f) correspond to the Ultrafeedback dataset, and (g)-(i) correspond to the Summarize dataset.

[Figure 4 plots: stacked win-rate bars (Left wins / Tie / Right wins) on a 0-100 scale. (a) hh-rlhf: 28% / 55% / 17%; 22% / 54% / 24%; 18% / 71% / 11%; 19% / 65% / 16%. (b) stackoverflow: 24.5% / 55.0% / 20.5%; 29.0% / 46.5% / 24.5%; 33.0% / 35.5% / 31.5%; 30.5% / 42.0% / 27.5%.]

Figure 4: We use GPT-4 as a judge to evaluate the capabilities of the model trained with PPO. We employ the win rate as the evaluation metric. Left represents our SER method, SFT denotes the model fine-tuned with SFT, and Full refers to the PPO model guided by an RM trained on the full dataset.

By adjusting the error reduction strategy, more diverse self-labeled data can be obtained, further enhancing the effectiveness of self-learning. Based on the analysis of the training process, to avoid overfitting the model on simple data, we need to focus the training objective on similar samples that are difficult to distinguish, enabling the model to identify differences between the two samples. During Loop 3, we employ the data strategy of learning Status 2, which enhances model performance by having the model learn to differentiate between more ambiguous hard samples. As illustrated in Figure 6, by modifying the data filtering strategy and introducing more diverse samples, the model in Loop 3 increased the score differences between similar samples, thereby enhancing its discriminative ability. As shown in Figure 2, in the Loop 3 phase, training the model on hard samples further enhances its performance, approaching or even surpassing the results obtained using the full set of human-annotated data. Across multiple iterations, the total number of training samples and training steps is less than that required for the full dataset, typically representing only about 50% of the full data.

SER is more data and human-labor efficient than full fine-tuning. As shown in Figure 3, using the SER method, we utilize only 15% of the human-annotated data to train the initial model. Subsequently, we reselect the data based on the feedback from the initial model, achieving performance improvements. We conduct experiments on multiple datasets, demonstrating that SER effectively generalizes to various scenarios. Considering the high cost of human-annotated preference data, we provide a potentially effective solution to reduce this cost.

4.2 PPO RESULTS

To validate the effectiveness of SER, we use the previously mentioned RM to guide PPO training, thereby optimizing the LLM. We conduct experiments on the Anthropic HH-RLHF dataset and the Stack Overflow dataset, as shown in Figure 4. On the HH-RLHF dataset, all SER models exceed the SFT baseline in terms of win rate, indicating that the SER approach enhances the capabilities of the LLM. Compared to RMs trained with the full human-annotated data, the win rate in PPO experiments follows a trend consistent with the performance of the RMs. For Mistral 7B, the accuracy of the SER RM surpasses that of the RM trained with the full dataset, and in PPO experiments, the win rate also slightly exceeds that of the full model. Additionally, to verify the generalizability of our method, we conduct the same experiment on the Stack Overflow dataset. As shown in Figure 4(b), the SER models outperform the full models to a certain extent, demonstrating a trend consistent with the accuracy of the RMs. In summary, our main findings are as follows:

SER enhances the capabilities of LLMs, and the degree of enhancement is positively correlated with the performance of the RMs. Through the self-evolved approach, we improve the performance of RMs using a limited amount of human-annotated data. Leveraging RMs to guide the learning of LLMs results in stronger LLMs. Theoretically, this process can be iterative, whereby stronger LLMs generate higher-quality responses, further enhancing the performance of RMs. However, due to the computational cost of PPO, we do not conduct related experiments. Additionally, we find that the performance gains in the PPO process are positively correlated with the performance of the RMs; stronger RMs generally guide the training of stronger LLMs.

5 DISCUSSION

Our paper demonstrates empirical performance improvements through a self-evolved RM driven by intuitive motivations, though a rigorous theoretical analysis of its effectiveness is still needed. The data filtering strategies are empirical, yet it is interesting that different datasets exhibit similar learning statuses in each iteration loop. Future work includes developing a more robust and autonomous method to identify learning statuses and filter self-labeled data. On the other hand, our method provides a feasible pathway to enhance reward modeling capabilities. An avenue worth exploring is generating more diverse responses through LLMs. By applying our method, a robust and general reward model can be developed to assist all existing feedback-based training methods. Another direction for future work is integrating LLMs into the entire self-evolved reward learning loop, specifically by incorporating step 4 in each iteration and using LLMs to generate responses for the RM to self-label. Our work presents a potential solution to break through the performance ceiling of the strongest LLMs.

6 CONCLUSION

In this work, we introduce SER, a simple yet effective method of self-evolution that enhances model performance across various datasets and models. By allowing the model to generate its own labeled data and controlling the model's learning state to select appropriate data, we achieve iterative evolution that ultimately converges to, or even exceeds, the performance ceiling. Extensive experiments indicate that our key design (consideration of different learning states) is essential, and we analyze the effects throughout the iterative process of SER, providing valuable insights for the self-improvement of LLMs.

REFERENCES

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.

Daniel Dewey. Reinforcement learning and the reward engineering principle. In 2014 AAAI Spring Symposium Series, 2014.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, and Abhishek Kadian et al. The llama 3 herd of models. ArXiv, abs/2407.21783, 2024.

Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pretrained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems, 36:79081–79094, 2023.

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267–274, 2002.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations-democratizing large language model alignment. Advances in Neural Information Processing Systems, 36, 2024.

Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, et al. Training language models to self-correct via reinforcement learning. arXiv preprint arXiv:2409.12917, 2024.

Nathan Lambert, Lewis Tunstall, Nazneen Rajani, and Tristan Thrush. Huggingface h4 stack exchange preference dataset, 2023. URL https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences.

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024.

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.

Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

Jing-Cheng Pang, Pengyuan Wang, Kaiyuan Li, Xiong-Hui Chen, Jiacheng Xu, Zongzhang Zhang, and Yang Yu. Language model self-improvement by reinforcement learning contemplation. arXiv preprint arXiv:2305.14483, 2023.

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ArXiv, abs/1707.06347, 2017a. URL https://api.semanticscholar.org/CorpusID:28695052.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In NeurIPS, 2020a.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020b.

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36, 2024.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439, 2024.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.

Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback. Advances in Neural Information Processing Systems, 36, 2024a.

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024b.

Shujian Zhang, Chengyue Gong, Lemeng Wu, Xingchao Liu, and Mingyuan Zhou. Automl-gpt: Automatic machine learning with gpt. arXiv preprint arXiv:2305.02499, 2023.

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.

Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. Advances in Neural Information Processing Systems, 36, 2024.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

A THEORETICAL ANALYSIS

In this section, we provide a rigorous theoretical analysis to support the effectiveness and convergence of our SER method. We focus on two main aspects:

1. Convergence of the Reward Model during Self-Training: We provide formal conditions under which the reward model converges to a good solution through self-training.
2. Convergence Properties of PPO with a Learned Reward Model: We present theoretical foundations supporting the use of PPO for training the LLM with the improved reward model, including convergence proofs.

A.1 CONVERGENCE OF THE REWARD MODEL DURING SELF-TRAINING

Self-training involves using a model's own predictions to generate additional training data. While powerful, it can suffer from error amplification if not properly managed. We provide formal proofs for the convergence of the reward model under certain assumptions.

Definitions. Let $R_\theta^{(t)}$ denote the reward model at iteration $t$. Let $D_{\text{filtered}}^{(t)}$ be the filtered dataset used for retraining at iteration $t$.

Assumptions

Assumption 1 (Initial Model Accuracy). The initial reward model $R_\theta^{(0)}$ has an expected accuracy greater than random guessing:

$$P_{(Q,A,y)}\left[R_\theta^{(0)}(Q, A) = y\right] = \mathrm{Acc}^{(0)} > 0.5, \tag{7}$$

where $y \in \{0, 1\}$ denotes the true label (bad or good answer).

Assumption 2 (High-Confidence Prediction Reliability). For data points where the reward model's predicted probability is above $\tau_p$ or below $1 - \tau_p$, the prediction accuracy is at least $\alpha$, with $\alpha > 0.5$:

$$P\left[R_\theta^{(t)}(Q, A) = y \,\middle|\, \left|p^{(t)}(Q, A) - 0.5\right| \ge \delta_p\right] \ge \alpha, \tag{8}$$

where $p^{(t)}(Q, A)$ is the predicted probability, $\delta_p = \tau_p - 0.5$, $\tau_p$ equals $\tau_{\text{high}}$, and $1 - \tau_p$ equals $\tau_{\text{low}}$.

Theoretical Result

Theorem 1 (Convergence of Reward Model during Self-Training). Under Assumptions 1 and 2, and with an appropriate choice of threshold $\tau_p$, the sequence of reward models $\{R_\theta^{(t)}\}$ converges to a fixed point $R_\theta^{*}$ with improved accuracy, i.e.,

$$\lim_{t \to \infty} \mathrm{Acc}^{(t)} = \mathrm{Acc}^{*} \ge \mathrm{Acc}^{(t)} \ge \mathrm{Acc}^{(0)} > 0.5. \tag{9}$$

Proof We provide a proof by induction.

Base Case ($t = 0$): By Assumption 1, $\mathrm{Acc}^{(0)} > 0.5$.

Inductive Step: Assume that at iteration $t$, $\mathrm{Acc}^{(t)} > 0.5$. The filtered dataset $D_{\text{filtered}}^{(t)}$ consists of examples where the model's predicted probabilities are confident, i.e., $|p^{(t)}(Q, A) - 0.5| \ge \delta_p$. From Assumption 2, the accuracy on $D_{\text{filtered}}^{(t)}$ is at least $\alpha > 0.5$.

Retraining the model on $D_{\text{filtered}}^{(t)}$ leads to an updated model $R_\theta^{(t+1)}$ with improved accuracy due to the following reasons:

1. Risk Minimization: Training minimizes the empirical risk on $D_{\text{filtered}}^{(t)}$, leading to better performance on data similar to $D_{\text{filtered}}^{(t)}$.
2. Data Distribution Shift: Since $D_{\text{filtered}}^{(t)}$ is a subset where the model is confident and likely correct, retraining on this data reinforces correct predictions.

Therefore, the expected accuracy satisfies $\mathrm{Acc}^{(t+1)} \ge \mathrm{Acc}^{(t)}$.

Convergence: As $\mathrm{Acc}^{(t)}$ is bounded above by 1 and forms a non-decreasing sequence, it converges to $\mathrm{Acc}^{*} \le 1$. Thus,

$$\lim_{t \to \infty} \mathrm{Acc}^{(t)} = \mathrm{Acc}^{*} \ge \mathrm{Acc}^{(0)} > 0.5. \tag{10}$$

The key to convergence is the selection of high-confidence data that is more likely to be correctly labeled. By ensuring α > 0.5, we guarantee that each retraining step is more likely to improve the model than degrade it.

A.2 CONVERGENCE PROPERTIES OF PPO WITH A LEARNED REWARD MODEL

We now analyze the convergence properties of PPO when using the learned reward model $R_\theta^{*}$ obtained from self-training. PPO is a policy gradient method that seeks to maximize the expected cumulative reward. The policy $\pi_\phi$ is updated to maximize:

$$J(\phi) = \mathbb{E}_{Q,\, A \sim \pi_\phi}\left[R_\theta^{*}(Q, A)\right]. \tag{11}$$

Assumption 3 (Lipschitz Continuity of Reward Model). The learned reward model $R_\theta^{*}(Q, A)$ is Lipschitz continuous with respect to $A$, i.e., there exists $L_R > 0$ such that for all $A_1, A_2$,

$$\left|R_\theta^{*}(Q, A_1) - R_\theta^{*}(Q, A_2)\right| \le L_R \left\|A_1 - A_2\right\|_1. \tag{12}$$

Assumption 4 (Bounded Policy Updates). The policy updates satisfy $\|\phi^{(t+1)} - \phi^{(t)}\|_2 \le \delta_\phi$ for some $\delta_\phi > 0$.

Theorem 2 (Convergence of PPO with Learned Reward Model). Under Assumptions 3 and 4, and given that the true reward function $R^{*}(Q, A)$ is approximated by $R_\theta^{*}(Q, A)$ with bounded error $\epsilon_r$:

$$\left|R_\theta^{*}(Q, A) - R^{*}(Q, A)\right| \le \epsilon_r, \tag{13}$$

the PPO algorithm converges to a policy $\pi_\phi^{*}$ that is within $O(\epsilon_r)$ of the optimal policy $\pi^{*}$ with respect to $R^{*}(Q, A)$.

Proof The proof follows from the performance difference lemma and properties of PPO.

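Performance Difference Lemma (Kakade & Langford, 2002):

The difference in expected rewards between the learned policy $\pi_\phi$ and the optimal policy $\pi^{*}$ under the true reward function $R^{*}$ is:

$$J^{*}(\pi^{*}) - J^{*}(\pi_\phi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{Q \sim \mu}\, \mathbb{E}_{A \sim \pi^{*}}\left[A^{\pi_\phi}_{\pi^{*}}(Q, A)\right], \tag{14}$$

where $A^{\pi_\phi}_{\pi^{*}}(Q, A)$ is the advantage function.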

Since $R_\theta^{*}$ approximates $R^{*}$ with error $\epsilon_r$, the advantage estimates used in PPO are off by at most $\epsilon_r$.

Impact on Policy Gradient: The policy gradient used in PPO is:

$$\nabla_\phi J(\phi) = \mathbb{E}_{Q,\, A \sim \pi_\phi}\left[\nabla_\phi \log \pi_\phi(A \mid Q)\, A_R(Q, A)\right], \tag{15}$$

where $A_R(Q, A)$ is the advantage function computed using $R_\theta^{*}$.

Due to the bounded reward error $\epsilon_r$, the gradient estimation error is also bounded:

$$\left\|\nabla_\phi J^{*}(\phi) - \nabla_\phi J(\phi)\right\|_2 \le C \epsilon_r, \tag{16}$$

where $C$ is a constant depending on the policy and reward model.

Convergence Analysis:

Under Assumptions 3 and 4, standard results from stochastic gradient descent convergence apply (Bottou et al., 2018). The policy updates converge to a stationary point of J(ϕ), and the error in the reward model introduces an O(ϵr) bias.

Therefore, the final policy $\pi_\phi^{*}$ satisfies:

$$\left|J^{*}(\pi^{*}) - J^{*}(\pi_\phi^{*})\right| \le K \epsilon_r, \tag{17}$$

for some constant $K$.

The convergence to a near-optimal policy depends on the accuracy of the learned reward model. As $\epsilon_r \to 0$, the learned policy approaches the optimal policy under $R^{*}$.

Combining Theorems 1 and 2, we conclude that: (1) The reward model improves over iterations, reducing the reward estimation error ϵr. (2) The improved reward model leads to better policy updates in PPO, resulting in an LLM that performs well with respect to the true reward function. (3) Our SER-LLM method is theoretically grounded, with formal proofs supporting its convergence and effectiveness.

B ALGORITHM

Algorithm 1 summarizes the entire SER method, including iterative self-evolved RM training and reinforcement learning for LLM policy optimization.

Algorithm 1: Self-Evolved Reward Learning for LLMs (SER)
Input: Initial RM $R_\theta$, unlabeled data $D_{\text{unlabeled}}$, human-labeled data $D_{\text{labeled}}$, thresholds $\tau_{\text{low}}$, $\tau_{\text{high}}$, $\tau$, $\delta$, learning rate $\eta$
Output: Trained LLM policy $\pi_\phi$
/* Step 0: Pretrain the Reward Model */
Pretrain $R_\theta$ on the human-labeled data $D_{\text{labeled}}$ using pairwise loss $L_{\text{pair}}$;
while not converged do
    /* Step 1: Identify the Learning Status of the Reward Model */
    Evaluate $R_\theta$ on $D_{\text{unlabeled}}$ to determine the learning status $S$ using predicted probabilities and thresholds $\tau_{\text{low}}$, $\tau_{\text{high}}$, and $\tau$;
    if $S$ = Stop then
        break;
    /* Step 2: Filter Data Based on Learning Status */
    if $S$ = Status1 then
        Filter samples where predicted probabilities $p_j$ satisfy $p_j > \tau_{\text{high}}$ (confidently good) or $p_j < \tau_{\text{low}}$ (confidently bad) to construct $D_{\text{filtered}}$;
    else if $S$ = Status2 then
        Filter paired samples where the absolute difference in predicted probabilities satisfies $|p_j^1 - p_j^2| > \delta$ to construct $D_{\text{filtered}}$;
    /* Step 3: Update and Retrain the Reward Model */
    Update $D_{\text{filtered}} \leftarrow D^{n}_{\text{filtered}} + D^{n-1}_{\text{filtered}}$, where $D^{n}_{\text{filtered}}$ is the newly filtered data and $D^{n-1}_{\text{filtered}}$ is the data from the previous iteration;
    Retrain $R_\theta$ on the updated $D_{\text{filtered}}$ using pairwise loss $L_{\text{pair}}$ with learning rate $\eta$;
/* Step 4: Train the LLM via Reinforcement Learning */
Train LLM $\pi_\phi$ using $R_\theta$ as the reward function and update $\pi_\phi$ with modified PPO (Eq. 6);
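The two filtering rules in Step 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Pair` container, the function names, and the returned tuple layouts are hypothetical, the default thresholds are the values selected in Appendix C.3, and `delta` is left as a required argument because its value is not stated here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pair:
    """Hypothetical container: two answers to one question with RM probabilities."""
    question: str
    answer_1: str
    answer_2: str
    p1: float  # predicted probability that answer_1 is good
    p2: float  # predicted probability that answer_2 is good


def filter_status1(pairs: List[Pair], tau_low: float = 0.45,
                   tau_high: float = 0.55) -> List[Tuple[str, str, int]]:
    """Status 1: keep single answers the RM is confident about.

    Returns (question, answer, self_label) with label 1 = good, 0 = bad.
    """
    kept = []
    for pair in pairs:
        for answer, p in ((pair.answer_1, pair.p1), (pair.answer_2, pair.p2)):
            if p > tau_high:
                kept.append((pair.question, answer, 1))
            elif p < tau_low:
                kept.append((pair.question, answer, 0))
    return kept


def filter_status2(pairs: List[Pair], delta: float) -> List[Tuple[str, str, str]]:
    """Status 2: keep pairs whose probabilities differ by more than delta.

    Returns (question, chosen, rejected), ordered by the RM's own scores.
    """
    kept = []
    for pair in pairs:
        if abs(pair.p1 - pair.p2) > delta:
            if pair.p1 > pair.p2:
                kept.append((pair.question, pair.answer_1, pair.answer_2))
            else:
                kept.append((pair.question, pair.answer_2, pair.answer_1))
    return kept
```

In this sketch, Status 1 produces pointwise self-labels while Status 2 produces ordered pairs, mirroring the distinction between confident individual predictions and confident comparisons.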

C EXPERIMENTAL SETUP

SFT training: For each Base Model, we perform instruction fine-tuning using preference data. Similar to the setting by Rafailov et al. (2024), we sample higher quality responses from the preference data based on human annotations to use as training data (for instance, in the HH-RLHF dataset, we sample responses labeled as chosen for instruction fine-tuning). We conduct standard instruction fine-tuning training on the base model using the sampled data. In our experiments, we refer to this as our SFT baseline.

Reward Model training: We perform reward modeling on the SFT baseline using preference data. In our method, we train an initial model with a small amount of human-annotated preference data (in our experiments, this constitutes 15% of the overall dataset size; details on the dataset split for SFT, PPO, etc., can be found in Appendix D.1). The initial model then assigns reward scores to unannotated responses, and based on these reward scores, we filter and obtain new training data for the next iteration of training.

C.1 DATASET STATISTICS

Our experiments explore four different preference datasets, as shown in Table 2. Stack Overflow contains over 3,000K QA pairs collected from Stack Overflow. Each question receives a score based on the number of upvotes, resulting in a comparison pair. HH-RLHF: we use human preference data, which consists of 118K helpful and 42K harmless instances as the training set. Similar to previous work, we select the last round of dialogues to construct the data into a single-turn dialogue format. UltraFeedback is constructed by large language models (LLMs). It collects 64K instructions from various sources, generates 256K responses using LLMs such as LLaMA, and has these responses annotated and scored by GPT-4. From this process, we create a preference dataset containing 100K entries. TL;DR consists of 179K pairs of summarization and human preference annotations.

Table 2: The statistics of datasets, types of tasks, and types of feedback are presented. We provide a detailed introduction to the datasets in Appendix C.1.

| Dataset | Num | Task | Feedback type | Response type |
| --- | --- | --- | --- | --- |
| Stackoverflow | 31,284,837 | QA | human | human response |
| HH-RLHF | 169,352 | QA | human | LLM response |
| UltraFeedback | 63,967 | QA | GPT4 | LLM response |
| Summarize | 179,000 | summarize | human | LLM & human response |

C.2 EVALUATION METRICS AND BASELINE

Reward Modeling. The standard process of RLHF involves training an RM based on preference data to predict the preferences between human and model responses. Subsequently, reinforcement learning methods are used to optimize the language model based on the RM. Accuracy: We use accuracy to measure the performance of reward modeling. Specifically, for a given preference pair, the prediction is considered correct if the reward value assigned by the RM to the chosen response is higher than that assigned to the rejected response. Baseline: Because our method uses only a portion of the human-annotated data, we use an RM trained on the full dataset as the baseline.
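The accuracy metric described above reduces to a single comparison per preference pair. A minimal sketch follows; the function name and the input layout (a list of `(reward_chosen, reward_rejected)` tuples) are illustrative assumptions.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scores: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the RM scores the chosen response strictly higher.

    Each element is (reward_chosen, reward_rejected); equal scores count as incorrect.
    """
    if not scores:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for chosen, rejected in scores if chosen > rejected)
    return correct / len(scores)
```

Counting equal scores as incorrect follows the "higher than" wording of the metric; other tie conventions are possible.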

PPO. For the RM obtained in the previous step, we apply it to the standard PPO process to optimize the LLM. We use an LLM as a judge to evaluate the performance of the model after PPO optimization. Specifically, we use GPT-4 as the evaluator to compare different responses to the same prompt and assess their quality. We conduct the comparison in two different orders, and if the results from these two orders are inconsistent, we treat the result as a tie. Baseline: We compare SER with the SFT baseline to intuitively demonstrate the improvement of our method in aligning model preferences. Additionally, we compare our approach with an RM trained using the full preference dataset, despite the fact that our method uses significantly less data.
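The two-order comparison rule can be illustrated as follows. This sketch assumes each judge verdict has already been mapped back to the model it favors (`"left"`, `"right"`, or `"tie"`), independent of the order in which the responses were shown; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

VERDICTS = ("left", "tie", "right")


def resolve(first_order: str, second_order: str) -> str:
    """Combine the verdicts from the two presentation orders.

    Agreement keeps the verdict; any disagreement is recorded as a tie.
    """
    if first_order not in VERDICTS or second_order not in VERDICTS:
        raise ValueError("verdicts must be 'left', 'tie', or 'right'")
    return first_order if first_order == second_order else "tie"


def win_rates(verdict_pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the share of left wins, ties, and right wins over all prompts."""
    counts = Counter(resolve(a, b) for a, b in verdict_pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one comparison")
    return {verdict: counts[verdict] / total for verdict in VERDICTS}
```

Treating disagreement as a tie is what makes the reported tie share sensitive to position bias in the judge.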

C.3 TRAINING DETAILS

SFT training. We use the following hyperparameters for instruction fine-tuning training. We employ a learning rate of 2e-5 with cosine decay, 2 warmup steps, and a batch size of 16. We calculate the loss only for the target tokens rather than the full input sequence, and we train for 3 epochs on the training data. For smaller parameter models (e.g., llama 8B, Mistral 7B, llama 13B), we conduct the training on 8 NVIDIA A100 80G GPUs. For the llama 70B model, we perform the training on 16 NVIDIA A100 80G GPUs.

Reward training. To enable the model to learn the relative ranking among different responses, we use a pair-wise loss. We employ the sigmoid function to normalize the reward scores to a range of 0-1. We utilize the LoRA method to train the RM on the SFT baseline, with a rank of 8, a LoRA alpha of 32, and a LoRA dropout of 0.1. The task type is sequence classification. We use a learning rate of 2e-5 with linear decay and the AdamW optimizer for training over 2 epochs, with a batch size of 4 (batch size of 2 for the LLaMA 70B model). We conduct the training on 8 NVIDIA A100 80G GPUs (32 NVIDIA A100 GPUs for the LLaMA 70B model).
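As a rough illustration of the pair-wise objective, the sketch below uses the common Bradley-Terry form, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$. This specific form is an assumption made for illustration; the exact definition of the loss is the one given in the method section. Function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw reward score to the range (0, 1), avoiding overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)).

    Uses log1p so that large score differences do not overflow.
    """
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))
```

The loss is small when the chosen response is scored well above the rejected one and grows roughly linearly when the ranking is reversed.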

PPO training. For PPO training, we use a learning rate of 1.4e-5 and set the generated sample length to 256. We employ a batch size of 8 and a mini-batch size of 1, with 4 PPO epochs and 1 gradient accumulation step. The target KL divergence is set to 0.1 and the initial KL coefficient is set to 0.2. To ensure a more robust training process, we normalize the reward values to the range -1 to 1.

Thresholds. The thresholds $\tau_{\text{high}}$, $\tau_{\text{low}}$, and $\tau$ were determined through extensive hyper-parameter tuning to balance precision and recall in the self-training process. Specifically, we experimented with the following values:

$\tau_{\text{high}} \in \{0.55, 0.65, 0.75\}$

$\tau_{\text{low}} \in \{0.45, 0.35, 0.25\}$

$\tau \in \{0.3, 0.4, 0.5\}$

After evaluating the RM's performance with these parameters, we selected $\tau_{\text{high}} = 0.55$, $\tau_{\text{low}} = 0.45$, and $\tau = 0.3$ as they provided the most consistent improvements in the RM's ability to self-label effectively without introducing significant error amplification.

D IMPLEMENTATION DETAILS

D.1 THE SPLIT OF THE DATASET

For the preference dataset, we split the data according to the ratio SFT:RM:PPO = 0.3:0.65:0.05. In this paper, SFT utilizes the chosen responses from the preference data for instruction fine-tuning. For the training of Reward Modeling, our approach randomly samples 15% of the RM data for training, while comprehensive comparison experiments train on the entire RM dataset. The HH-RLHF dataset is divided into harmless and helpful subsets, and we only select the helpful subset.
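The split can be made concrete with a short sketch. The function name, the shuffling seed, and the dictionary keys are hypothetical, and treating the remaining 85% of the RM portion as the pool for self-labeling is an assumption made for illustration. Integer percentages are used to avoid floating-point rounding in the split sizes.

```python
import random
from typing import Dict, List, Sequence


def split_preference_data(examples: Sequence, seed: int = 0,
                          percents=(30, 65, 5),
                          seed_percent: int = 15) -> Dict[str, List]:
    """Split data into SFT / RM / PPO portions, then carve the RM seed set.

    percents follows SFT:RM:PPO = 0.3:0.65:0.05; seed_percent is the share of
    the RM portion whose human labels are used to train the initial RM.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n = len(items)
    n_sft = n * percents[0] // 100
    n_rm = n * percents[1] // 100
    sft = items[:n_sft]
    rm = items[n_sft:n_sft + n_rm]
    ppo = items[n_sft + n_rm:]  # remainder goes to PPO
    n_seed = len(rm) * seed_percent // 100
    return {
        "sft": sft,
        "rm_seed": rm[:n_seed],       # human labels kept
        "rm_unlabeled": rm[n_seed:],  # assumed pool for self-labeling
        "ppo": ppo,
    }
```

For 1,000 examples this yields 300 SFT, 650 RM (97 of them in the seed set), and 50 PPO examples.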

D.2 STATISTICAL DATA OF THE ITERATIVE PROCESS

We quantify the amount of data filtered out during the iterative process, as shown in Figure 3. Loop 0 represents a fixed value, accounting for 15% of the overall dataset. We use this portion of the data to train the seed model, upon which all subsequent iterations are based for further evolution.

D.3 PPO LEARNING CURVE

[Figure 5 plot: PPO Training Curve: HH-RLHF. Two panels (our method; Full Dataset), each with x-axis Steps (0 to 1200) and y-axis Scaled Reward; curves for LLaMA 8B and Mistral 7B.]

Figure 5: The learning curve of the model on the HH-RLHF dataset, with the y-axis representing the reward score after scaling. The model reaches convergence after 1200 steps. The shaded area indicates the standard deviation.

As shown in Figure 5, we present the reward curves of Mistral 7B and LLaMA 8B on the HH-RLHF dataset. Both models reach convergence at around 1200 steps. We scale the reward scores to the range of $-1$ to $1$ using the following formula:

$$\frac{\left(\left(1 + e^{-S_{\text{original}}}\right)^{-1} - t_{\text{clip}}\right)\left(\text{NewMax} - \text{NewMin}\right)}{1 - t_{\text{clip}}} + \text{NewMin} \tag{18}$$

In this equation, $S_{\text{original}}$ represents the original reward score, $t_{\text{clip}}$ denotes the clipping value, NewMin is the minimum value after scaling, which is $-1$, and NewMax is the maximum value after scaling, which is $1$.
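The scaling in Eq. 18 can be sketched in Python under one reading of the formula: the raw score is passed through a sigmoid, shifted by the clipping value, and rescaled linearly to the target range. The function name is hypothetical, and since the value of the clipping parameter is not given here, it is left as a required argument.

```python
import math


def scale_reward(s_original: float, t_clip: float,
                 new_min: float = -1.0, new_max: float = 1.0) -> float:
    """Rescale a raw reward following one reading of Eq. 18.

    sigmoid(s) is shifted by t_clip, divided by (1 - t_clip), and mapped
    linearly onto [new_min, new_max]. Requires t_clip < 1.
    """
    if t_clip >= 1.0:
        raise ValueError("t_clip must be smaller than 1")
    # Numerically stable sigmoid.
    if s_original >= 0:
        squashed = 1.0 / (1.0 + math.exp(-s_original))
    else:
        z = math.exp(s_original)
        squashed = z / (1.0 + z)
    return (squashed - t_clip) * (new_max - new_min) / (1.0 - t_clip) + new_min
```

With `t_clip = 0.0` the mapping sends a raw score of 0 to the midpoint of the target range, and large positive or negative scores approach `new_max` and `new_min` respectively.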

D.4 GPT4 EVALUATION PROMPT

A crucial element of our experimental framework is the evaluation of win rates using GPT-4. In this section, we provide the prompts utilized to generate win rates for both the summarization and dialogue experiments. All experiments were conducted using the gpt-4o-20240806 model. The sequence of responses was randomized for each evaluation to ensure unbiased results.

GPT-4 as judge system prompt:

Review the user's question and the corresponding response using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the response is relevant and provides some information related to the user's inquiry, even if it is incomplete or contains some irrelevant content.

- Add another point if the response addresses a substantial portion of the user's question, but does not completely resolve the query or provide a direct answer.

- Award a third point if the response answers the basic elements of the user's question in a useful way, regardless of whether it seems to have been written by an AI Assistant or if it has elements typically found in blogs or search results.

- Grant a fourth point if the response is clearly written from an AI Assistant's perspective, addressing the user's question directly and comprehensively, and is well-organized and helpful, even if there is slight room for improvement in clarity, conciseness or focus.

- Bestow a fifth point for a response that is impeccably tailored to the user's question by an AI Assistant, without extraneous information, reflecting expert knowledge, and demonstrating a high-quality, engaging, and insightful answer.

After examining the user's instruction and the response, please first give the total score. Then provide a brief explanation of your total score, up to 100 words. Output with the following format:
Score: <total score>
Evaluation evidence: <your brief explanation here>

Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necessary. To evaluate the response in alignment with this additive scoring model, we'll systematically attribute points based on the outlined criteria.

User input prompt:

<Question>: {question}

<response>: {answer}
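Because the system prompt fixes the reply format (`Score: <total score>` followed by `Evaluation evidence: ...`), the score can be extracted with a small parser. The sketch below is illustrative only: the function name is hypothetical, and returning `None` for missing or out-of-range scores is an assumed convention.

```python
import re
from typing import Optional

# Matches the "Score: <total score>" line requested in the system prompt.
_SCORE_RE = re.compile(r"Score:\s*(\d+)")


def parse_judge_score(reply: str) -> Optional[int]:
    """Extract the additive 0-5 score from a judge reply, or None if absent or invalid."""
    match = _SCORE_RE.search(reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 5 else None
```

Replies that omit the score line are better discarded or re-queried than guessed at, which is why the sketch returns `None` in that case.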

Table 3: A case study for GPT4 evaluation

Input: Human: I'm trying to teach myself to have nicer handwriting. Can you help? Assistant:

| Model | Response | GPT4-as-Judge |
| --- | --- | --- |
| Mistral 7B SFT | Sure, I can help! I'm a human handwriting expert. | Score: 1. Evaluation evidence: The response is relevant in acknowledging the user's request for help with handwriting, but it provides no substantial information or guidance related to improving handwriting skills. It also inaccurately claims to be from a human handwriting expert, which is misleading coming from an AI Assistant. The answer lacks depth and does not address the user's question in a meaningful way. |
| Mistral 7B Our method | Sure, I can help you to learn to have nice handwriting. To do this, you will need to practice writing and drawing a lot, so that your handwriting will improve. | Score: 2. Evaluation evidence: The response is relevant and provides basic advice related to improving handwriting, such as the importance of practice. However, it lacks depth and detail, offering only a very general suggestion without specific techniques or steps to actually enhance handwriting skills. Thus, it addresses part of the user's question but does not offer a comprehensive or directly useful guide. |
| Mistral 7B Full dataset | Sure, I think I can! First, I need you to write a few sentences in a beautiful, swirly style, and then I can take a look at your handwriting and make some suggestions. | Score: 2. Evaluation evidence: The response is relevant and provides some engagement with the user's request by suggesting an activity. However, it doesn't directly address the user's question or offer actionable advice on improving handwriting. It lacks comprehensive guidance or practical tips, failing to fully answer the question or provide useful information on handwriting improvement techniques. |

E COST AND PERFORMANCE OF SER VS. HUMAN LABELING

The SER method costs significantly less than human labeling while achieving comparable performance. By the estimates below, the cost of the SER method is more than 6× lower than that of human labeling.

E.1 COST ESTIMATION

Human Labeling Cost. Google Cloud's human annotation service 1 charges approximately $0.11 USD / 50 words for classification tasks at the time of writing. We assume that each classification task only consists of reading a document and two candidate summaries, which have a combined average word length of 304 words. We estimate the human labeling cost per example to be $0.668 USD (304 words × $0.11 / 50 words) (Lee et al., 2023). The detailed calculation can be found in Equation 19.

$$\text{Human labeling cost} = 304\ \text{words} \times \frac{0.11\ \text{USD}}{50\ \text{words}} \approx 0.668\ \text{USD/sample} \tag{19}$$

1 https://cloud.google.com/ai-platform/data-labeling/pricing

[Figure 6 plot: three panels, a) llama 8b, b) mistral 7b, c) llama 13b. x-axis: Score J - Score K, from -1.0 to 1.0. Legend: loop0, loop1, loop2, loop3, x=0.3, x=-0.3.]

Figure 6: The reward score distribution of the model on HH-RLHF, with the y-axis representing probability density and the x-axis representing pairwise score differences. Compared to other loops, loop 3 significantly increased the score differences between responses of similar quality by altering the error reduction strategy.

LLM Labeling Cost. We evaluated the use of GPT-4o in place of human labels for the SER method. The average input length for each annotation was 525 tokens, and the average output length was 104 tokens. For the GPT-4o model 2, the input cost is $0.0025 per 1,000 tokens, and the output cost is $0.01 per 1,000 tokens. Each response was scored three times by GPT-4o to provide a preference pair. Based on these parameters, the labeling cost per example was calculated as follows:

LLM labeling Cost = 6 525 0.0025

= 0.0135, USD/sample (20)

Inference Cost. We estimate inference cost from Amazon Cloud GPU pricing 3: a machine equipped with 8 A100 GPUs costs 32.77 USD per hour. In our testing, a single A100 GPU ran inference on 1,530 samples in 3 minutes, so the average inference cost amounts to 1.338e-4 USD/sample. The detailed calculation of inference cost can be found in Equation 21. Our estimated SER cost per sample is $0.10054 USD. This is an approximate estimate: per-sample cost = 0.67 × 0.15 + 3 × 1.338e-4 USD, because SER conducts 1 inference per iteration, resulting in a total of 3 additional inferences.

$$\text{Inference cost} = \frac{32.77/8\ \text{USD/hour}}{1{,}530\ \text{examples/3 minutes} \times 20\ \text{3-minute slots/hour}} \approx 1.338 \times 10^{-4}\ \text{USD/sample} \tag{21}$$
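The human labeling and inference estimates in this subsection are simple unit conversions, and they can be reproduced with a short script. The sketch below uses only the figures stated in the text. The function names and the blending formula `labeled_fraction * label_cost + passes * inference_cost` are illustrative assumptions that mirror the approximate per-sample formula given above, not the authors' code.

```python
def human_label_cost(words_per_example: float = 304.0,
                     usd_per_50_words: float = 0.11) -> float:
    """Per-example human labeling cost in USD (Equation 19)."""
    return words_per_example * usd_per_50_words / 50.0


def inference_cost_per_sample(node_usd_per_hour: float = 32.77,
                              gpus_per_node: int = 8,
                              samples_per_batch: int = 1530,
                              minutes_per_batch: float = 3.0) -> float:
    """Per-sample inference cost in USD on a single GPU (Equation 21)."""
    gpu_usd_per_hour = node_usd_per_hour / gpus_per_node
    samples_per_hour = samples_per_batch * (60.0 / minutes_per_batch)
    return gpu_usd_per_hour / samples_per_hour


def ser_cost_per_sample(label_cost: float,
                        labeled_fraction: float = 0.15,
                        inference_passes: int = 3) -> float:
    """Blended per-sample cost: a labeled fraction plus self-labeling passes.

    Assumption: one inference pass per SER iteration, three passes in total.
    """
    return (labeled_fraction * label_cost
            + inference_passes * inference_cost_per_sample())


if __name__ == "__main__":
    human = human_label_cost()
    print(f"human labeling: {human:.4f} USD/sample")
    print(f"inference:      {inference_cost_per_sample():.3e} USD/sample")
    print(f"SER (15%):      {ser_cost_per_sample(human):.4f} USD/sample")
```

Under these inputs the blended SER cost comes out at roughly 0.10 USD per sample, dominated by the 15% of examples that still need human labels; the inference term contributes well under one percent of the total.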

E.2 PERFORMANCE AND COST COMPARISON

SER utilized only 15% of the human-labeled data, resulting in a significant reduction in data dependency. In the training stage, our computational costs are comparable to those incurred when using the full dataset, with additional inference costs introduced solely during Step 1 (Self-label with Reward Model). We compared the performance and cost of SER and human labeling, as shown in Table 4. We observe that:

2 https://platform.openai.com/docs/pricing

3 https://aws.amazon.com/pricing

Table 4: Performance and cost comparison on the HH-RLHF dataset. In addition to comparing the 15% human-labeled data with SER, we further explored the effectiveness of replacing human-labeled data with LLM-labeled data. Furthermore, we investigated the impact of increasing the proportion of human-labeled data while incorporating the SER method. Our results demonstrate that, with 60% human-labeled data, SER outperforms training on the full human-labeled dataset in accuracy.

| Method | Cost per Sample (USD) | Accuracy (%) |
|---|---|---|
| Full Human | 0.668 | 70.45 |
| 15% Human-Labeled + SER | 0.100 | 68.56 |
| 15% LLM-Labeled + SER | 0.002 | 67.64 |
| 60% Human-Labeled + SER | 0.402 | 71.83 |

Cost-Effectiveness of SER. The SER method achieves accuracy close to that of fully human-labeled data (68.56% vs. 70.45% in Table 4) while significantly reducing costs. With only 15% human-labeled data and minimal compute costs, SER is a practical and scalable approach to reducing annotation costs.

Trade-Off Between Annotation and Compute Costs. By leveraging accelerators for self-labeling, SER minimizes annotation costs without incurring prohibitive compute costs. At K = 4 loops, SER achieves a strong balance between performance and cost, suggesting it is a viable alternative to full human labeling for various tasks.

Scalability. At lower cost, 60% Human-Labeled + SER outperforms the Full Human baseline in accuracy (71.83% vs. 70.45%). The SER method provides a cost-efficient framework for scaling reward models without significantly compromising performance, making it suitable for real-world applications where annotation budgets are constrained.
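The observations above can be checked directly against the numbers reported in Table 4. The following sketch computes, for each SER configuration, the cost reduction factor and the accuracy difference relative to the Full Human row. The dictionary `TABLE_4` copies the table values; the function name `compare_to_baseline` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, Tuple

# (cost per sample in USD, accuracy in %) as reported in Table 4.
TABLE_4: Dict[str, Tuple[float, float]] = {
    "Full Human": (0.668, 70.45),
    "15% Human-Labeled + SER": (0.100, 68.56),
    "15% LLM-Labeled + SER": (0.002, 67.64),
    "60% Human-Labeled + SER": (0.402, 71.83),
}


def compare_to_baseline(rows: Dict[str, Tuple[float, float]],
                        baseline: str = "Full Human"
                        ) -> Dict[str, Tuple[float, float]]:
    """Map each method to (cost reduction factor, accuracy delta in points)."""
    base_cost, base_acc = rows[baseline]
    return {
        name: (base_cost / cost, acc - base_acc)
        for name, (cost, acc) in rows.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (factor, delta) in compare_to_baseline(TABLE_4).items():
        print(f"{name}: {factor:.1f}x cheaper, {delta:+.2f} accuracy points")
```

With the reported values, the 15% Human-Labeled + SER row is between 6× and 7× cheaper than full human labeling at a cost of under two accuracy points, and the 60% row is both cheaper and more accurate than the baseline.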

F BENCHMARK EVALUATION

To investigate in greater detail the impact of SER on reward models and preference alignment, we evaluated both the reward models and the models after PPO across multiple benchmarks. Specifically, for the reward models, evaluations were conducted on RewardBench (Lambert et al., 2024). For the preference alignment models, assessments were carried out on MT-Bench (Zheng et al., 2023) and Arena-Hard (Li et al., 2024). Overall, we found that the evaluation results on the benchmarks were consistent with those obtained using LLM-as-a-judge, and that SER significantly enhanced model performance under data-limited conditions.

F.1 REWARD MODELING RESULTS IN REWARDBENCH

Table 5 presents the RewardBench evaluation of reward models trained on the UltraFeedback dataset. This dataset offers a comprehensive benchmark, allowing us to evaluate generalization across diverse tasks.

The results demonstrate that SER significantly enhances model performance compared to the baseline Loop 0, achieving results close to those obtained with the full human-annotated dataset. In tasks related to dialogue (Chat and Chat-Hard), which are central to the UltraFeedback dataset, SER achieves performance nearly identical to models trained on the full dataset.

F.2 PPO MODEL RESULTS IN MT-BENCH AND ARENA-HARD

To further evaluate the downstream performance of models trained with SER, we provide results from MT-Bench and Arena-Hard, comparing SER to both the Full Dataset and SFT baselines, as shown in Table 6 and Table 7.

The results highlight that SER achieves competitive performance relative to models trained on the full dataset and often surpasses SFT-trained models. Because Arena-Hard features more challenging test questions, it yields a relatively higher proportion of ties; however, improvements can still be observed. These improvements are particularly notable in dialogue-heavy tasks, further demonstrating the robustness of our approach across multiple tasks and domains.

Table 5: Performance of reward models trained on the UltraFeedback dataset, evaluated on RewardBench. Here, Avg. denotes the average score. As demonstrated, under the same amount of human-annotated data, SER significantly outperforms Loop 0 and achieves performance comparable to that obtained using the full human-annotated dataset.

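| Model | Method | Avg. | Chat | Chat-Hard | Safety | Reasoning |
|---|---|---|---|---|---|---|
| Llama 3 8B | Loop 0 | 59.1 | 70.7 | 44.1 | 52.2 | 69.7 |
| Llama 3 8B | SER | 72.3 | 97.2 | 58.2 | 67.8 | 75.0 |
| Llama 3 8B | Full Data | 75.9 | 95.5 | 58.5 | 73.9 | 65.4 |
| Mistral 7B | Loop 0 | 56.3 | 55.9 | 51.3 | 59.4 | 63.2 |
| Mistral 7B | SER | 72.0 | 85.5 | 57.1 | 64.5 | 62.0 |
| Mistral 7B | Full Data | 66.8 | 93.8 | 52.4 | 64.1 | 60.3 |
| Llama 2 13B | Loop 0 | 56.3 | 82.7 | 45.2 | 66.0 | 59.0 |
| Llama 2 13B | SER | 70.5 | 92.4 | 52.7 | 66.0 | 71.8 |
| Llama 2 13B | Full Data | 74.1 | 95.5 | 54.1 | 68.7 | 63.7 |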

Table 6: The performance of PPO models trained on the HH-RLHF dataset in MT-Bench.

| Model | Comparison | Win | Tie | Lose |
|---|---|---|---|---|
| Llama 3 8B | SER vs Full Dataset | 96 (30.0%) | 110 (34.3%) | 114 (35.6%) |
| Llama 3 8B | SER vs SFT | 116 (36.25%) | 112 (35.0%) | 92 (28.8%) |
| Mistral 7B | SER vs Full Dataset | 61 (19.0%) | 214 (66.9%) | 45 (14.1%) |
| Mistral 7B | SER vs SFT | 73 (22.8%) | 199 (62.2%) | 48 (15.0%) |

Table 7: The performance of PPO models trained on the HH-RLHF dataset in Arena-Hard.

| Model | Comparison | Win | Tie | Lose |
|---|---|---|---|---|
| Llama 3 8B | SER vs Full Dataset | 62 (12.4%) | 363 (72.6%) | 75 (15.0%) |
| Llama 3 8B | SER vs SFT | 70 (14.0%) | 379 (75.8%) | 51 (10.2%) |
| Mistral 7B | SER vs Full Dataset | 34 (6.8%) | 442 (88.4%) | 24 (4.8%) |
| Mistral 7B | SER vs SFT | 41 (8.2%) | 435 (87.0%) | 24 (4.8%) |
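The percentages in Table 6 and Table 7 follow from the raw win, tie, and lose counts, which sum to 320 comparisons per row on MT-Bench and 500 on Arena-Hard. The sketch below recomputes them from the counts. The helper `summarize_pairwise` and the "net win" margin (win rate minus lose rate) are illustrative additions that do not appear in the paper; the margin is included only as one convenient way to read a row with many ties.

```python
from typing import Dict


def summarize_pairwise(win: int, tie: int, lose: int) -> Dict[str, float]:
    """Convert win/tie/lose counts into percentages and a net win margin."""
    total = win + tie + lose
    if total == 0:
        raise ValueError("at least one comparison is required")
    return {
        "total": total,
        "win_pct": 100.0 * win / total,
        "tie_pct": 100.0 * tie / total,
        "lose_pct": 100.0 * lose / total,
        # Assumption: net margin = win rate minus lose rate, ignoring ties.
        "net_win_pct": 100.0 * (win - lose) / total,
    }


if __name__ == "__main__":
    # Counts copied from the Llama 3 8B rows of Table 6 (MT-Bench).
    for label, counts in {
        "SER vs Full Dataset": (96, 110, 114),
        "SER vs SFT": (116, 112, 92),
    }.items():
        s = summarize_pairwise(*counts)
        print(f"{label}: win {s['win_pct']:.1f}%, tie {s['tie_pct']:.1f}%, "
              f"lose {s['lose_pct']:.1f}%, net {s['net_win_pct']:+.1f}%")
```

Reading the tables through the net margin makes the high-tie Arena-Hard rows easier to compare: a row such as 70 wins, 379 ties, and 51 losses still shows a positive margin for SER over SFT even though ties dominate.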