# towards_costeffective_reward_guided_text_generation__d853ae4d.pdf

Towards Cost-Effective Reward Guided Text Generation

Ahmad Rashid * 1 2 Ruotian Wu * 1 2 Rongqi Fan 1 Hongliang Li 3 Agustinus Kristiadi 2 Pascal Poupart 1 2

Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training as in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to suboptimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods. Code for our work is available at https://github.com/ahmadrash/Fa RMA

1. Introduction

Reinforcement learning from human feedback (RLHF; Stiennon et al., 2020b; Ouyang et al., 2022) is widely applied to align large language models (LLMs) to human preferences. However, updating the LLM with RLHF incurs a significant training cost, whether it is reinforcement learning using

*Equal contribution 1University of Waterloo 2Vector Institute 3Huawei Technologies. Correspondence to: Ahmad Rashid <a9rashid@uwaterloo.ca>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

proximal policy optimization (PPO; Schulman et al., 2017) or finetuning using direct preference optimization (DPO; Rafailov et al., 2023). The training costs can be prohibitive as we scale the LLM, since high-performance computational resources with large GPUs are required. Moreover, the LLM needs to be retrained whenever the preference data changes.

One way to alleviate this computational overhead, while still improving the alignment of the baseline LLM is tokenwise reward-guided text generation (RGTG; Khanov et al., 2024; Deng & Raffel, 2023). RGTG methods keep the baseline LLM frozen and instead train a reward model on the preference data. At each decoding step, the reward model is used to adjust the softmax scores of each candidate token. Reward models are cheaper to train compared to offline RLHF updates (e.g., PPO and DPO) even if both the reward model and LLM have the same number of parameters.1

Moreover, RGTG with a small reward model can perform comparably to RLHF (Rashid et al., 2024).

However, while RGTG is a promising and cost-effective alternative to offline RLHF, it can lead to significant decoding overhead during inference. Typically, at each step, multiple calls are made to the reward model for candidate tokens from the language model. This introduces an inner loop in the decoding process of the LLM, leading to an increase in computational complexity and latency. Another issue is that reward models trained on full-sequences are used to score partial sequences (Khanov et al., 2024; Li et al., 2024) which can be problematic (Rashid et al., 2024).

Several works have attempted to train reward models to score partial sequences. Deng & Raffel (2023) use a squared loss and the preference data to distill a full sequence reward model into a reward model for partial sequences. Controlled decoding (CD; Mudgal et al., 2024) uses roll-outs from the language model instead of the preference data to distill a partial sequence reward model. Rashid et al. (2024) explicitly train a Bradley-Terry model on partial sequences and demonstrate a connection with RLHF. We will show that these methods prefer sub-optimal extensions of partial sequences during decoding.

1To train π, PPO needs to load and make calls to 2 additional models (πref and a critic), DPO needs to load and make calls to one additional model (πref) while training a reward model does not require loading or calling any additional model.

Towards Cost-Effective Reward Guided Text Generation

Language Model

Reward Model

Cheating on an exam

Cheating on an exam requires

Cheating on an exam is

Cheating on an exam can

requires is can ...

0.6 0.5 0.3 ...

Language Model

Reward Model (Ours)

Cheating on an exam

requires is can ...

0.6 0.5 0.3 ...

0.1 0.5 0.3 ...

Cheating on an exam

19.4 sec 2.99 sec

requires is can ...

Top-K Generation Generation

RGTG Generation: is unethical and can get you expelled ... Prompt: Cheating on an exam

Figure 1. Figure depicting a step in RGTG generation for both conventional (left) reward models and ours (right). Note that RGTG steers the LLM generation to helpful and harmless text. We observe on the left that for each candidate that is generated by the LLM, a call needs to be made to the reward model with the candidate appended. On the other hand, our reward model is fed just the prompt and it generates scores for all candidates in the vocabulary. On the TLDR dataset, we observe an average generation time of 19.4 seconds for current RGTG methods and 2.99 seconds for our method.

To alleviate these common issues associated with RGTG methods, we propose a cost-effective reward model architecture for RGTG, which can score all possible next token extensions of a partial sequence with a single call. Furthermore, we train the reward model using a novel loss function that, we show, scores prefixes that can be extended to optimal full sequences, at least as high as any other prefix. On extensive benchmarks we demonstrate that our reward model leads to a better cost-performance trade-off and higher diversity. Figure 1 illustrates the shortcomings of current reward models, i.e., high decoding cost, and our proposed solution.

In summary:

We analyze contemporary reward models and demonstrate that during RGTG they choose sub-optimal extensions of partial sequences.

We present a reward model architecture that, at each decoding step, can provide rewards of all tokens in the vocabulary at once.

We explicitly train our reward model to choose a token with the maximum reward at each step.

We report extensive experiments with recent LLMs on various text generation tasks, demonstrating faster inference and strong alignment performance.

2. Preliminaries

We denote a prompt by x and its response by y where the bolded letters indicate sequences of tokens. The i-th token in x is denoted by xi, while the partial sequence starting at token i and ending at token j is denoted by xi:j. The length of a sequence x is denoted by |x|. Large language models (LLMs) generally consist of probabilistic models that can generate a response y given a prompt x. More specifically, the generation of y is done token-by-token by sampling the next token from a conditional distribution π(yi|x, y1:i 1).

2.1. Reward Models

Reward models are trained to evaluate the quality of a response y to the prompt x by outputting a scalar-valued score. Given a preference dataset D = {(xk, ywk, ylk)}K k=1 containing K triples of token sequences (x, yw, yl) where yw

represents the winning (i.e., preferred) sequence and yl

represents the losing sequence. The Bradley-Terry (BT) loss (Bradley & Terry, 1952) that encourages the model to assign a higher score to the winning response and a lower score to the losing response is used as the training objective:

LR = Ex,yw,yl D log σ(rϕ(yw|x) rϕ(yl|x))

where σ is the logistic function and rϕ is the reward model.

Reinforcement learning from human feedback (RLHF; Ziegler et al., 2019; Ouyang et al., 2022) uses scores from the reward model to update the language model using reinforcement learning (RL) techniques such as proximal policy optimization (PPO). Rafailov et al. (2023) derive an equiva-

Towards Cost-Effective Reward Guided Text Generation

lent objective that can be learned using supervised learning without using a reward model.

2.2. Reward-Guided Text Generation

Recently, Khanov et al. (2024) proposed a reward-guided text generation (RGTG) technique that does not require an update of the LLM. Instead, the base LLM πref is frozen and, during decoding, the logits from the LLM are combined with the reward scores to guide the text generation.

Let Vθ(y1:i|x) be a value function with parameter θ that scores partial sequences y1:i such that Vθ(y|x) = rϕ(y|x) for full sequences y. During decoding, the adjusted score of token yi is a weighted combination of logits of πref and the value of the partial sequence is as follows:

score(yi|x, y1:i 1) = log πref(yi|x, y1:i 1)

+ βVθ(y1:i|x), (1)

where β > 0 is a hyper-parameter.

Given the scores, the next token can be selected greedily or by sampling (e.g., nucleus or top-k sampling (Fan et al., 2018; Holtzman et al., 2020) from the softmax distribution of the scores).

In (1), while V corresponds to r for full sequences, further considerations are needed to define V for partial sequences. Rashid et al. (2024) showed that using full-sequence reward models to score partial sequences can lead to arbitrary rewards during RGTG.

3. Current Limitations of RGTG

We will discuss two primary limitations of current RGTG methods, namely high decoding cost and sub-optimal rewards. In the next section, we propose a solution to address these limitations.

3.1. High Decoding Cost

Most RGTG methods default to training a full sequence reward model rϕ and then either a) use it to directly score partial sequences (Khanov et al., 2024) or b) distill a partial sequence value model Vθ from the full sequence reward model rϕ (Mudgal et al., 2024). During decoding, the score for each candidate token yi is calculated according to Equation 1. We note that the input to Vθ includes the sequence y1:i 1 with each candidate token yi appended to the sequence. Hence to score each candidate token, we need to make a different call to the value function, resulting in k calls for top-k decoding. This adds substantial overhead during decoding.

3.2. Sub-Optimal Reward Models

Next, we take a look at contemporary RGTG reward models and show that they may prefer partial sequences with suboptimal extensions.

PARGS Rashid et al. (2024) showed that using a BT value model trained on full sequences to score partial sequences (as done by Khanov et al. (2024)) can lead to arbitrary values for partial sequences. They proposed to train a BT value model explicitly on partial sequences by creating a separate loss function for all prefix lengths i:

(x,yw,yl) D log σ(Vθ(yw 1:i|x) Vθ(yl 1:i|x)).

However, given that the full sequence yw is preferred to the full sequence yl, training is based on the assumption that the partial sequence yw 1:i is also preferred to the partial sequence yl 1:i. This assumption can be problematic, as the full-sequence dataset typically includes only one or a few full sequences that extend each partial sequence. In fact, the empirical distribution of such extensions will affect the learned value function to the extent that a prefix with only extensions to suboptimal full sequences may be scored higher than a prefix with an extension to an optimal full sequence.

Theorem 1. In the limit of infinite training and a sufficiently expressive representation for the value function, PARGS may learn a value function that gives a lower score to a prefix extendable to an optimal full sequence than some other prefix. More precisely, if y = argmaxy r(y|x), then there may exist i, j, y such that

V (y 1:i|x) < V (y 1:j|x)

Proof. Let y , y , y and y be four responses to x such that y is an optimal response and y , y , y

are three suboptimal responses. Suppose also that the preference dataset contains exactly three comparisons: D = {(x, y , y ), (x, y , y ), (x, y , y )} where the first response is preferred to the second response in each triple. Suppose also that y and y share the first i 1 tokens (i.e., y 1:i 1 = y 1:i 1) while y , y and y share the first i tokens (i.e., y 1:i = y 1:i = y 1:i). In the limit of infinite training and sufficiently expressive value function representation, Lemma 2 in (Rashid et al., 2024) indicates that the learned value function V satisfies

σ(V (y1 1:i|x) V (y2 1:j|x)) = PD([x, y1] [x, y2]) (2)

where [a, b] indicates the concatenation of sequences a and b, and a b indicates that a is preferred to b. Equation (2) implies that the BT model induced by V exhibits

Towards Cost-Effective Reward Guided Text Generation

the same preference probabilities for the full sequence extension of y1 1:i and y2 1:j as the empirical distribution of the preference dataset. Recall, that PARGS assumes that [x, y1 1:i] [x, y2 1:j] when their respective full sequence extensions exhibit the same preference ordering (i.e., [x, y1] [x, y2]). Since there might be different extensions y1 i+1:|y1| and y2 i+1:|y2| for each prefix with different preference labels in the preference dataset, then PARGS learns a value function that induces preference probabilities for partial sequences that are consistent with the empirical distribution PD of the preference dataset for the full sequence extensions of those partial sequences. Applying Equation (2) to prefixes y 1:i and y 1:i yields:

σ(V (y 1:i|x) V (y 1:i|x)) = 1/3 (3)

since the dataset D contains one preference ranking (x, y , y ) where the full sequence extension y of y 1:i is preferred to the full sequence extension y of y 1:i and two preference rankings (x, y y ), (x, y y ) where the full sequence extension y of y 1:i is preferred to the full sequence extensions y , y of y 1:i. Recall that y 1:i = y 1:i = y 1:i and therefore y and y are full sequence extensions of y 1:i. Finally, since the sigmoid in Equation (3) is less than 0.5, then V (y 1:i|x) < V (y 1:i|x). Hence, this shows that i=j, y such that V (y 1:i|x) < V (y 1:j|x)

Theorem 1 shows that the value function learned by PARGS may prefer prefixes that lead to suboptimal responses. The key problem is PARGS assumption that the preference ordering of prefixes is the same as the preference ordering of full sequence extensions. Since it is possible to extend a prefix to many different full sequences with different scores, the value function learned by PARGS depends on the frequency of different prefix extensions instead of preferences only. As shown in the proof of Theorem 1, this becomes problematic when a prefix that can lead to an optimal response is extended more frequently to losing full sequences instead of winning full sequences in D.

CD Mudgal et al. (2024) proposed a target value function V , for partial sequences, that corresponds to the expected reward of the full sequences when the partial sequence is extended by following the base model distribution πref.

V (x, y1:i) = X

yi+1:|y| πref(yi+1:|y||x, y1:i)r(x, y) (4)

The training loss is the squared difference between the value function Vθ and the target V . They use rollouts from the base model along with a reward model trained on full sequences to distill the value function Vθ. They sample extensions from πref to complete a partial sequence and compute the full-sequence score as the target V . This method has a limitation where the value function heavily depends on

the language model. We will show such dependency is suboptimal in Theorem 2.

Value Augmented Sampling (VAS; Han et al., 2024) is similar to CD and uses πref to generate samples for learning a value function and a full-sequence reward model for generating the target score. However, the value function is trained by temporal difference (TD) learning.

Theorem 2. In the limit of infinite training and a sufficiently expressive representation for the value function, CD may learn a value function that gives a lower score to a prefix extendable to an optimal full sequence than some other prefix. More precisely, if y = argmaxy r(y|x), then there may exist i, j, y such that

V (y 1:i|x) < V (y 1:j|x)

Proof. Let y be an optimal response to x such that r(y |x) = 6. Let y and y be two suboptimal responses to x such that r(y |x) = 4 and r(y |x) = 6. Suppose that y and y share the same first i 1 tokens (i.e., y 1:i 1 = y 1:i 1) and that y and y share the same first i tokens (i.e., y 1:i = y 1:i). After generating y 1:i, suppose that πref generates only y i+1:|y | and y i+1:|y | with uniform probability (i.e., πref(y i+1:|y ||x, y 1:i) = πref(y i+1:|y |x, y 1:i) = 0.5 and any other continuation has probability 0). After generating y 1:i, suppose also that πref generates only y i+1:|y | (i.e., πref(y i+1:|y |x, y 1:i) = 1 and any other continuation has probability 0). Then with infinite training and a sufficiently expressive value function representation, from (4), CD learns the following partial sequence values

V (y 1:i|x) = πref(y i+1:|y ||x, y 1:i)r(y |x)

+ πref(y i+1:|y ||x, y 1:i)r(y |x)

= 0.5(6) + 0.5( 6) = 0

V (y 1:i|x) = πref(y i+1:|y ||x, y 1:i)r(y |x)

This example shows that i=j, y such that V (y 1:i|x) < V (y 1:j|x).

Theorem 2 shows that CD may prefer prefixes that cannot be extended to optimal sequences depending on πref. The key problem is the dependency of the target V on πref. When πref extends a prefix to bad responses, the value of this prefix is low, but if it extends the prefix to good responses, the value of this prefix is high. In principle, the value function V should be independent of πref. In RLHF, πref is the quantity that we seek to improve, so it can introduce a bias to train a value function that depends on πref. The value function should depend only on the preferences induced by the full sequence reward model. As shown in the proof of Theorem 2, CD may not prefer a prefix that can lead to an

Towards Cost-Effective Reward Guided Text Generation

optimal response when it is extended by πref to suboptimal responses.

4. Fa RMA: Cost-Effective RGTG

We propose to mitigate the inference overhead and suboptimal rewards of previous RGTG methods by introducing (i) an efficient reward model and (ii) a novel loss function that will ensure that the resulting value function prefers prefixes extendable to optimal responses. We name our method Fa RMA, i.e. Faster Reward Model for Alignment.

4.1. An Efficient Reward Model

We design a reward model architecture so that instead of obtaining a single score for a sequence, we obtain the score for all possible next tokens in the dictionary. We modify (1) such that:

score(yi|x, y1:i 1) = log πref(yi|x, y1:i 1)

+ βVθ(yi|x, y1:i 1),

where Vθ(.) R|D| 1 and |D| is the size of the vocabulary. In order to get the score of sequence x, y1:i we feed the x, y1:i 1 into rϕ and get the score of the sequence with all possible extensions of yi in the dictionary. The efficiency and performance of the reward model is not dependent on k, for top-k generation, as we simultaneously get the score for all possible next tokens in the dictionary. We use the same architecture as a causal language model, however, we use a novel training loss which we discuss next.

4.2. A Principled Constraint

Given the sub-optimality of the existing methods, there needs to be a more principled way to score partial sequences. We propose to score partial sequences based on their optimal extension. Given a partial sequence y1:i, we consider all possible full extensions and assign the score of the highest completion to y1:i. Na ıvely, this would require an exponential search in terms of the size of the vocabulary which is intractable. To make this principled goal feasible, we propose a local constraint that the partial-sequence reward model needs to satisfy so that it will return the reward of the corresponding optimal expansion:

Vθ(y1:i|x) = max yi+1 Vθ(y1:i+1|x) (5)

If the above local constraint is satisfied, then we can keep expanding the sequence as in the generation:

Vθ(y1:i|x) = max yi+1 Vθ(y1:i+1|x)

= max yi+1 max yi+2 Vθ(y1:i+2|x)

= = max yi+1:n Vθ(y1:n|x),

where yi+1:n is the optimal extension beyond y1:i and yn is the EOS token. That is, instead of doing an exponential search, we could train the value function to satisfy (5), which can be done by Temporal Difference (TD) learning. Note that VAS also uses TD learning in their algorithm, but, since they use a conventional reward model they do not do a max over the dictionary.

To be more precise, the training process can be separated into two steps with distinct objectives:

1. Standard BT loss on full sequence preference dataset:

L(a) = E x,yw,yl D log σ(Vθ(yw|x) Vθ(yl|x))

2. Constraint to ensure optimal partial sequence expansion.

Vθ(y1:i|x) max yi+1 Vθ(y1:i+1|x) 2 (7)

Firstly, we want to point out the similarity of our constraint (7) to TD control where V (y|x) can be treated as a stateaction value function (i.e., Q-function) with yi corresponding to the action and [x, y1:i 1] corresponding to the state. Note also that transitions are deterministic in LLMs since the action yi updates the state to [x, y1:i] deterministically. We use s to denote a state and a to denote an action in Bellman s equation:

Q (s, a) = E[r|s, a] + γ X

s P(s |s, a) max a Q (s , a )

Q ([x, y1:i 1], yi) = max yi+1 Q ([x, y1:i], yi+1) (8)

Vθ(y1:i|x) = max yi+1 Vθ(y1:i+1|x)

Note that Equation (8) follows from the fact that there is no discount factor and no reward until the end of the sequence. Then L(b) is the same loss as in Q-gradient learning by treating max yi+1 Vθ(y1:i+1|x) as the target.

To train the value function, we alternate between the two losses mentioned previously. For the Bradley-Terry loss (6), we utilize full-sequence preference pairs as commonly done when training a reward model. Furthermore, for the new constraint loss (7), we extract partial sequences from the winning sequences in the preference dataset and use them as training data. Notably, there is no preference signal when training with the constraint loss. The model simply learns to align its scores to the best next token. We train the model by alternating between the two losses. The training details are presented in Appendix B.

We emphasize that this kind of training would not be possible with the reward models of previous RGTG methods that

Towards Cost-Effective Reward Guided Text Generation

Algorithm 1 Our Training Algorithm. Input: Base LLM to initialize the reward model Vθ, Full Sequence Preference dataset DBT = {(xk, ywk, ylk)}KBT k=1 , number of alternating iterations itern, mini-batch size n, partial sequence dataset Dmax = {(xk, yk}Kmax k=1 Output: Vθ

1: for i = 1 to itern do 2: Sample minibatch D(i) BT from DBT of size n 3: for every tuple (x, yw, yl) D(i) BT do 4: Compute Vθ(yw|x) and Vθ(yl|x) 5: La = log σ(Vθ(yw|x) Vθ(yl|x)) 6: Update Vθ based on loss La 7: end for 8: Sample minibatch D(i) max from Dmax of size n 9: for every tuple (x, y) D(i) max do 10: Compute Vθ(y|x) 11: Vmax = maxy|y|+1 Vθ(y, y|y|+1|x) 12: Lb = 1

2 [Vθ(y|x) Vmax]2

13: Update Vθ based on loss Lb 14: end for 15: end for

Algorithm 2 Our Decoding Algorithm. Input: Reward model Vθ, Prompt x, top-k parameter k, hyperparameter β > 0, any reference/SFT model πref, generation length l Output: y1:l: A generated response to x of length l

1: for i = 1 to l do 2: log π(yi = v|x, y1:i 1) 3: log (πref(v|x, y1:i 1) + βVθ(v|x, y1:i 1)) 4: yi softmax(top k(log π(yi|x, y1:i 1))) 5: end for

require |D| forward passes to calculate the max over all the tokens in the dictionary D. Instead we calculate the max after a single forward pass. The complete algorithm for our method is presented in Algorithms 1 and 2.

We now prove that unlike PARGS and CD, our algorithm, Fa RMA, is guaranteed to prefer prefixes that are extendable to optimal full sequences.

Theorem 3. In the limit of infinite training data and a sufficiently expressive representation for the value function, Fa RMA guarantees that the learned value function scores prefixes that can be extended to optimal full sequences at least as high as any other prefix. More precisely, if y = argmaxy r(y|x), then

V (y 1:i|x) V (y 1:j|x) i, j, y (9)

Proof. We provide a proof by contradiction. Let y be an optimal response to x and y be any other response. Suppose that

i, j, y such that V (y 1:j|x) > V (y 1:i|x) (10)

Since the loss in (7) ensures that the learned value function returns the reward of the best full sequence that extends a prefix then V (y 1:i|x) = r(y |x). Similarly, since y 1:j is any other prefix whose extensions do not lead to better full sequences, then V (y 1:j|x) r(y |x). This means that V (y 1:j|x) V (y 1:i|x), which contradicts (10).

5. Related Work

Training based Alignment Supervised fine-tuning and instruction tuning (Wei et al., 2021) are common methods to align an LLM to labeled data. RLHF (Christiano et al., 2017; Ziegler et al., 2019; Lee et al., 2021; Nakano et al., 2021; Snell et al., 2022) methods can align an LLM directly to human preferences. First, a reward model is trained on a dataset of human preferences using the Bradley Terry model (Bradley & Terry, 1952) and then the LLM is updated, based on the reward model, using an RL algorithm such as PPO (Schulman et al., 2017). However, updating the LLM with RL is expensive and researchers have explored cost-effective alternatives.

Liu et al. (2023a) convert the preference data into sequences of sentences which are then used to fine-tune the LLM. Dong et al. (2023) used the reward model to filter high quality training samples and fine-tunes on them avoiding undesirable behavior. DPO (Rafailov et al., 2023; 2024) avoids learning a reward model explicitly and finds an equivalent objective to RLHF which can be optimized by supervised learning. Even though the resulting optimization is cheaper than RL, nonetheless, it still involves updating the LLM.

Preference data itself provides sequence-level supervision. Some works have atttempted to collect and use fine-grained preferences by using either human annotators (Wu et al., 2023) or LLMs (Cao et al., 2024).

Guided Decoding In the guided decoding literature, a number of methods consider guidance at a step or process level (Welleck et al., 2022; Uesato et al., 2022; Lightman et al., 2023; Krishna et al., 2022; Li et al., 2023; Khalifa et al., 2023; Yao et al., 2023).

Some methods have applied token-level functions (Dathathri et al., 2019; Krause et al., 2021; Yang & Klein, 2021; Chaffin et al., 2022; Liu et al., 2023b) but they do not consider RGTG based on preference data.

Khanov et al. (2024) introduce an RGTG method, but rely on a full-sequence reward model for partial sequence decoding. Deng & Raffel (2023) learn to distill a partial sequence reward model, starting from the full-sequence model using a square loss function. Mudgal et al. (2024) employ a similar approach, but instead of using preference data, generate a dataset by roll-outs from the base LLMs. Han et al. (2024) also use the base LLM to gather a dataset, but employ TD

Towards Cost-Effective Reward Guided Text Generation

learning to train the partial sequence reward model. Different from these works, Zhao et al. (2024) derive an RGTG method based on sequential Monte Carlo and demonstrate that it can approximate RLHF.

Fine-grain value functions Previous work has used stepwise value or Q functions to train generative adversarial networks for dialogue generations (Tuan & Lee, 2019; Li et al., 2017; Tuan et al., 2018). They employ either the policy gradient method (Tuan & Lee, 2019; Li et al., 2017) or PPO (Tuan et al., 2018) to train the generator, and train the discriminator to provide rewards. To mitigate the problem of sparse rewards, they employ methods of training step-wise Q-functions. Whereas the aforementioned works explicitly apply RL techniques to train text generators, RGTG methods avoid the use of off-line RL and instead employ reward-guided decoding.

6. Experiments

We evaluate our proposed approach on three language generation tasks: summarization, dialogue generation and finegrained text generation.

Our baselines include πref using top-k sampling (Fan et al., 2018), RLHF models based on PPO (Schulman et al., 2017) and DPO (Rafailov et al., 2023), RGTG methods ARGS (Khanov et al., 2024), CD (Mudgal et al., 2024), PARGS (Rashid et al., 2024) and CARDS (Li et al., 2024). CARDS demonstrated a higher reward and lower inference cost compared to Best-of-N so we did not evaluate Bestof-N. Note that we use the average-reward obtained by the DPO baseline as the reward threshold for CARDS. Setting a higher threshold could lead to better rewards at the cost of significantly longer decoding times (see Appendix 12).

Summarization task We pick the Reddit TL;DR (V olske et al., 2017) as the dataset for the summarization task. Each sample consists of the prompt x which is a post on the Reddit forum and the labels y, the summary of the post. We use the human preference dataset from Stiennon et al. (2020a) to perform all the training and decoding. Our base summarization model is the Llama3.2-1B-Instruct model2. Our reward model is also initialized from the same LLM.

Dialogue task Next we evaluate our method on a dialogue task using the Anthropic Helpful and Harmless (HH) (Bai et al., 2022) dataset, which helps to align the LLM to generate helpful and harmless responses. Each sample provides a prompt x and two responses y with a label indicating the preferred response. Here, the prompt x is the history of the dialogue and y is the response from the assistant. We

2meta-llama/Llama-3.2-1B-Instruct

use a pretrained SFT Pythia-2.8B base model3 and trained a full-sequence reward model based on it. We also present results on smaller reward models down to 400 million.

Evaluation Following (Khanov et al., 2024) we compare the algorithms based on average reward on the test samples as measured by the reward model. A higher reward indicates better alignment with human preferences. Note that we use a different full-sequence reward model and not the Fa RMA reward model (that we trained for our algorithm) to evaluate the models. Moreover, evaluating language generation is nuanced, and human evaluation is generally preferred, but is time consuming. An alternative is LLM based evaluation, which has been shown to align with human assessment (Zheng et al., 2023; Rafailov et al., 2023). We adopt GPT-4 based evaluation as a proxy for human evaluation. Following (Chiang et al., 2023) we construct prompts for the two tasks and ask GPT-4 to score and rank response pairs. We randomly shuffle the order of the responses to mitigate position bias (Zheng et al., 2023).

We also evaluate the diversity of generation on the TL;DR and HH-RLHF datasets. To evaluate generation diversity, we generate 10 responses for each prompt, and measure the Rouge-L score between each generated pair. A lower Rouge-L score indicates a higher diversity.

Training details, including hyper-parameters are presented in Appendix B.

Fine-Grained Text Generation We have additional results on text generation on the Ultra-Feedback (UF) dataset (Ganqu Cui et al., 2024) in Appendix A. We use a pretrained SFT Zephyr-7B base model4 and trained a full-sequence reward model based on it.

6.2. Results

Table 1 shows the average number of calls made to the LLM and the reward model (RM) by RGTG methods to generate a single response. Fa RMA is clearly the best as it makes the least number of calls compared to all baselines. Note that CARDS makes fewer calls to the reward model, since the RM is not called for each token, but makes > 4 more calls to the language models compared to Fa RMA.

Table 2 shows the average reward measured by the fullsequence reward model for the summarization task. Fa RMA achieves the best average reward and uses significantly less time compared to all the other RGTG techniques. Moreover, we achieve a higher average reward compared to CARDS which incurs some overhead due to more calls to the LLM. Fa RMA is also competitive with DPO and PPO based RLHF that is expensive to fine-tune.

3lomahony/eleuther-pythia2.8b-hh-sft 4alignment-handbook/zephyr-7b-sft-full

Towards Cost-Effective Reward Guided Text Generation

Data Method LLM Calls RM Calls Total Calls

ARGS 59.69 596.90 656.69 PARGS 58.39 583.90 642.29 CD 60.26 602.60 662.86 Fa RMA 53.27 53.27 106.54 CARDS 305.31 32.80 338.11

ARGS 71.85 718.50 790.35 PARGS 76.86 768.60 845.46 CD 63.48 634.80 698.28 Fa RMA 90.08 90.08 180.16 CARDS 395.94 42.25 438.19

Table 1. Avg. Number of Model calls made by RGTG methods when responding to a query. Fa RMA makes the fewest calls.

TL;DR Summarization

Method LLM r SE Time(min)

πref frozen 0.98 0.18 2

ARGS frozen 1.46 0.16 32 PARGS frozen 1.56 0.19 31 CD frozen 1.15 0.16 29 Fa RMA frozen 2.05 0.15 5 CARDS frozen 1.73 0.16 17

DPO trained 2.08 0.18 2 PPO trained 2.05 0.14 2

Table 2. Avg. reward (over 100 samples) standard error and total generation time for the TL;DR summarization task.

Similarly, Table 3 shows the average reward of the dialogue task. We observe that Fa RMA performs the best in terms of both average reward and inference time among all the RGTG methods, and is competitive with DPO, PPO and CARDS. Note that we also trained smaller reward models down to 400 million. The result demonstrates that we can further reduce the cost of both training and inference by reducing the size of the reward model while improving over πref.

HH Dialogue

Method LLM r SE Time(min)

πref frozen 1.18 0.12 2

ARGS - 2.8b frozen 1.41 0.18 26 PARGS - 2.8b frozen 1.63 0.17 31 CD - 2.8b frozen 1.24 0.13 27 Fa RMA - 2.8b frozen 1.80 0.18 5 Fa RMA - 1b frozen 1.56 0.18 3 Fa RMA - 400m frozen 1.49 0.12 2 CARDS frozen 1.92 0.19 20

DPO trained 1.73 0.17 2 PPO trained 1.92 0.22 2

Table 3. Avg. reward (over 50 samples) standard error and total generation time for the HH dialogue task.

Table 4 shows shows the average Rouge-L of different gen-

erations from the same prompt. A lower score demonstrates better diversity and we observe that Fa RMA generates the most diverse responses compared to πref, DPO, PPO and CARDS.

Method ROUGE-L

TL;DR Summarization

πref 0.20 0.01 DPO 0.21 0.01 PPO 0.20 0.01 CARDS 0.49 0.07 CD 0.24 0.01 PARGS 0.22 0.02 Fa RMA 0.21 0.02

Method ROUGE-L

HH Dialogue

πref 0.29 0.01 DPO 0.34 0.02 PPO 0.42 0.02 CARDS 0.86 0.01 CD 0.33 0.01 PARGS 0.33 0.01 Fa RMA 0.24 0.01

Table 4. Diversity score based on ROUGE-L

Next we plot the GPT-4 winning rate of baselines versus Fa RMA, against the inference time. Figure 2 shows the results on both the TLDR dataset and Anthropic HH. The best methods should be in the top left quadrant demonstrating both faster inference and higher win-rates. We observe that Fa RMA has a competitive winning rate at a much faster inference speed compared to RGTG methods. DPO and PPO have favorable performance but they are more expensive to train (Appendix C). Both plots demonstrate a similar trend. The prompts used to probe GPT-4, for the two datasets are presented in Appendix F.

0 10 20 30 40 50 60

Inference Time (mins)

Win Rate (%)

TLDR Summarization

Fa RMA ARGS PARGS CD DPO PPO CARDS

0 10 20 30 40 50 60

Inference Time (mins)

Win Rate (%)

HH Dialogue

Fa RMA ARGS PARGS CD DPO PPO CARDS

Figure 2. GPT4 evaluation on the TLDR and HH datasets respectively plotting the winrates of different baselines versus Fa RMA against the inference time.

Towards Cost-Effective Reward Guided Text Generation

7. Conclusion

We have discussed the current limitations of RGTG, particularly, reward models that are not suitable for tokenwise generation. Current reward models incur a significant decoding cost which makes RGTG less viable. Moreover we showed that they may prefer prefixes that lead to sub-optimal completions. We introduced Fa RMA, a cost effective reward model that leads to faster inference and is trained with a more principled constraint leading to better outcomes.

Impact Statement

The goal of this paper is to push the frontiers of Machine Learning. This may lead to impacts on the society, however, we do not feel the need to highlight them here.

Acknowledgment

Resources used in this work were provided by Huawei Canada, the Province of Ontario, the Government of Canada through CIFAR, companies sponsoring the Vector Institute https://vectorinstitute.ai/partners/, the Natural Sciences and Engineering Council of Canada and a grant from IITP & MSIT of Korea (No. RS-2024-00457882, AI Research Hub Project). AR thanks Apple for support through the Waterloo Apple Ph D Fellowship, Natural Sciences and Engineering Council of Canada for its support through the CGS-D program, and the David R. Cheriton Graduate Scholarship.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2204.05862, 2022.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39(3/4), 1952.

Cao, M., Shu, L., Yu, L., Zhu, Y., Wichers, N., Liu, Y., and Meng, L. Enhancing reinforcement learning with dense rewards from language model critic. In EMNLP, 2024.

Chaffin, A., Claveau, V., and Kijak, E. PPL-MCTS: Constrained textual generation through discriminator-guided MCTS decoding. In NAACL, 2022.

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. Vicuna: An open-source chatbot impressing GPT-4 with 90%* Chat GPT qual-

ity, March 2023. URL https://lmsys.org/blog/ 2023-03-30-vicuna/.

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Neur IPS, 2017.

Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. Plug and play language models: A simple approach to controlled text generation. In ICLR, 2019.

Deng, H. and Raffel, C. Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. In EMNLP, 2023.

Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Ka Shun, S., and Zhang, T. RAFT: Reward ranked finetuning for generative foundation model alignment. TMLR, 2023.

Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story generation. In ACL, 2018.

Ganqu Cui, L. Y., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., Liu, Z., and Sun, M. Ultrafeedback: Boosting language models with scaled ai feedback. In ICML, 2024.

Han, S., Shenfeld, I., Srivastava, A., Kim, Y., and Agrawal, P. Value augmented sampling for language model alignment and personalization. ar Xiv preprint ar Xiv:2405.06639, 2024.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. In ICLR, 2020.

Khalifa, M., Logeswaran, L., Lee, M., Lee, H., and Wang, L. Grace: Discriminator-guided chain-of-thought reasoning. In EMNLP, 2023.

Khanov, M., Burapacheep, J., and Li, Y. Alignment as reward-guided search. In ICLR, 2024.

Krause, B., Gotmare, A. D., Mc Cann, B., Keskar, N. S., Joty, S., Socher, R., and Rajani, N. F. Ge Di: Generative discriminator guided sequence generation. In EMNLP, 2021.

Krishna, K., Chang, Y., Wieting, J., and Iyyer, M. Rankgen: Improving text generation with large ranking models. In EMNLP, 2022.

Lee, K., Smith, L. M., and Abbeel, P. PEBBLE: Feedbackefficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In ICML, 2021.

Towards Cost-Effective Reward Guided Text Generation

Li, B., Wang, Y., Grama, A., and Zhang, R. Cascade reward sampling for efficient decoding-time alignment. ar Xiv preprint ar Xiv:2406.16306, 2024.

Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. Adversarial learning for neural dialogue generation. In EMNLP, pp. 2157 2169, 2017.

Li, Y., Lin, Z., Zhang, S., Fu, Q., Chen, B., Lou, J.-G., and Chen, W. Making language models better reasoners with step-aware verifier. In ACL, 2023.

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let s verify step by step. In ICLR, 2023.

Liu, H., Sferrazza, C., and Abbeel, P. Chain of hindsight aligns language models with feedback. In ICLR, 2023a.

Liu, R., Rashid, A., Kobyzev, I., Rezagholizadeh, M., and Poupart, P. Attribute controlled dialogue prompting. In ACL, 2023b.

Mudgal, S., Lee, J., Ganapathy, H., Li, Y., Wang, T., Huang, Y., Chen, Z., Cheng, H.-T., Collins, M., Strohman, T., et al. Controlled decoding from language models. In ICML, 2024.

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. Webgpt: Browser-assisted question-answering with human feedback. ar Xiv preprint ar Xiv:2112.09332, 2021.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Neur IPS, 2022.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. In Neur IPS, 2023.

Rafailov, R., Hejna, J., Park, R., and Finn, C. From r to Q : Your language model is secretly a Q-function. COLM, 2024.

Rashid, A., Wu, R., Grosse, J., Kristiadi, A., and Poupart, P. A critical look at tokenwise reward-guided text generation. ar Xiv preprint ar Xiv:2406.07780, 2024.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017.

Snell, C., Kostrikov, I., Su, Y., Yang, M., and Levine, S. Offline RL for natural language generation with implicit language Q learning. ar Xiv preprint ar Xiv:2206.11871, 2022.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Neur IPS, 2020a.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. In Neur IPS, 2020b.

Tuan, Y.-L. and Lee, H.-Y. Improving conditional sequence generative adversarial networks by stepwise evaluation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(4):788 798, 2019.

Tuan, Y.-L., Zhang, J., Li, Y., and Lee, H.-y. Proximal policy optimization and its dynamic version for sequence generation. ar Xiv preprint ar Xiv:1808.07982, 2018.

Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L., Creswell, A., Irving, G., and Higgins, I. Solving math word problems with process-and outcome-based feedback. ar Xiv preprint ar Xiv:2211.14275, 2022.

V olske, M., Potthast, M., Syed, S., and Stein, B. Tl; dr: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pp. 59 63, 2017.

Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In ICLR, 2021.

Welleck, S., Liu, J., Lu, X., Hajishirzi, H., and Choi, Y. Naturalprover: Grounded mathematical proof generation with language models. In Neur IPS, 2022.

Wu, Z., Hu, Y., Shi, W., Dziri, N., Suhr, A., Ammanabrolu, P., Smith, N. A., Ostendorf, M., and Hajishirzi, H. Finegrained human feedback gives better rewards for language model training. Neur IPS, 2023.

Yang, K. and Klein, D. Fudge: Controlled text generation with future discriminators. In NAACL, 2021.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. In Neur IPS, 2023.

Zhao, S., Brekelmans, R., Makhzani, A., and Grosse, R. Probabilistic inference in language models via twisted sequential Monte Carlo. In ICML, 2024.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Neur IPS, 2023.

Towards Cost-Effective Reward Guided Text Generation

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. ar Xiv preprint ar Xiv:1909.08593, 2019.

Towards Cost-Effective Reward Guided Text Generation

A. Ultra-Feedback Evaluation

For the fine-grained text generation task, we compared our method to πref, ARGS and DPO and our methods achieves the best reward score with limited inference time. We display the result in Table Table 5.

Ultra Feedback

Method LLM r SE Time(min)

πref frozen -1.62 0.28 5

ARGS frozen -1.35 0.31 48 PARGS frozen -1.01 0.26 50 Fa RMA frozen -1.20 0.21 11

DPO trained -1.22 0.29 5

Table 5. Avg. reward (over 50 samples) standard error and total generation time for the Ultra Feedback text-generation task.

B. Training Details

Software and hardware All experiments are run on a server with NVIDIA A40 GPUs (40GB VRAM) and NVIDIA A100 GPUs (80GB VRAM). We use CUDA Toolkit version 11.2 and Py Torch 2.5.1 framework.

Training Reward Models We train our reward models on the sequences retrieved from the TL;DR, HH-RLHF and Ultra-Feedback datasets, respectively, using the TRL library to accelerate the training process. We report the training parameters on Table 6.

Training PPO and DPO Models We train three DPO models on the original preference datasets and two PPO models on the TL;DR and HH-RLHF datasets. We also adopt the TRL library to train the DPO models. The training parameters are reported on Table 7 and Table 8.

Parameters Value

mini-batch size 8000 number of alternating steps 5 LR 5e-6 Batch size 8 Gradient acc. steps 8 Deep Speed Zero stage 2 Max. sequence length 512

Parameters Value

mini-batch size 6000 number of alternating steps 7 LR 5e-6 Batch size 8 Gradient acc. steps 8 Deep Speed Zero stage 2 Max. sequence length 512

Parameters Value

mini-batch size 8000 number of alternating steps 3 LR 5e-6 Batch size 4 Gradient acc. steps 16 Deep Speed Zero stage 2 Max. sequence length 1024

Table 6. Training Hyperparameters for reward model trained C. Training Time

Table 9 shows the training time and hardware used to train different models. We also trained a Llama 3.2 1 billion language model, on a single A100 GPU on the TLDR dataset to compare the training time of Fa RMA, DPO and PPO. We observe in Table 10 that DPO and PPO take 3 times longer to train while consuming more memory.

Towards Cost-Effective Reward Guided Text Generation

Parameters Value

Number of epoches 1 Learning rate 5e-6 Batch size 8 Floating point format bf16 gradient accumulation steps 8 Lo RA r 32 Lo RA α 16 Max. sequence length 512

Parameters Value

Number of epoches 1 Learning rate 5e-6 Batch size 8 Floating point format fp16 gradient accumulation steps 8 Lo RA r 32 Lo RA α 16 Max. sequence length 512

Parameters Value

Number of epoches 1 Learning rate 5e-6 Batch size 4 Floating point format bf16 gradient accumulation steps 16 Lo RA r 32 Lo RA α 16 Max. sequence length 1024

Table 7. Training Hyperparameters for DPO models

Parameters Value

Number of epoches 1 Learning rate 5e-6 Batch size 2 Floating point format bf16 gradient accumulation steps 8 total episodes 10000 missing-eos-penalty 1.0 local-rollout-forward-batch-size 1 Max. sequence length 512

Parameters Value

Number of epoches 1 Learning rate 5e-6 Batch size 2 Floating point format fp16 gradient accumulation steps 8 total episodes 10000 missing-eos-penalty 1.0 local-rollout-forward-batch-size 1 Max. sequence length 512

Table 8. Training Hyperparameters for PPO models

Dataset Model Time(min) GPU Type(number)

ARGS 90 A40(4) PARGS 36 A100(2) Ours 70 A40(4) DPO (Lora) 150 A40(1) PPO 106 A100(1)

ARGS 129 A40(4) PARGS 67 A100(2) Ours 110 A40(4) DPO (Lora) 152 A40(1) PPO 29 A100(4)

UF ARGS 128 A100(4) Ours 89 A100(4) DPO (Lora) 223 A100(1)

Table 9. Training time and hardware used of all the trained models

Towards Cost-Effective Reward Guided Text Generation

Method Time(min) Peak Memory (GB)

Fa RMA 82 8

DPO 254 28 PPO 238 30

Table 10. Training time and Memory Consumption

β Reward Score

0.5 1.33 0.18 1.0 1.77 0.17 1.5 2.11 0.16 2.0 2.10 0.14

Table 11. Average Reward of summarization task with different value of β

D. Hyper-Parameter Ablation

We present an ablation on changing the value of β for Fa RMA on the TLDR dataset. We observe in Table 11 that a β value of 1.5 is optimal. Note that the inference time is independent of the choice of β.

E. Generated Samples

The following are two example generations on TL;DR and HH of all the baselines including our method.

Towards Cost-Effective Reward Guided Text Generation

TL;DR Generation Samples

Prompt Summarize: SUBREDDIT: r/relationships TITLE: Feels like I [26F] am always the one to initiate plans with my boyfriend [32M/] lately POST: I ve been dating my boyfriend for 6 months. In the past 2-3 months he s been particularly busy with work and we went from seeing each other from maybe twice a week to about once a week. We talk/text basically every day and we take turns initiating conversation. When it comes to making plans though, I feel like I m always the one reaching out anymore. The last time he did initiate something, it fell through because he misread my response and thought I was indicating that I was busy that night, when in fact I was not. When I initiate plans, he always says yes and seems excited about it, and we always have a great time together, but I wish he d suggest plans sometimes, too. I know things are always different at the very beginning of a relationship, but when we first started dating, it was the opposite - he always asked me out and because of that, I rarely initiated things. But at this point I feel like it should be 50/50. And I find myself getting nervous that maybe he doesn t really want to see me since he s not reaching out. But, as I mentioned, he does always say yes enthusiastically to whatever I suggest, and if it interferes with his work, he ll usually suggest an alternative. So, should I just keep on taking charge here? Maybe he feels like I prefer to be the maker of plans, so he s just allowed me to take over in that regard since he s been busy.. ? Or should I ask him to pick up the ball a little bit more? TL;DR:

Generations [Fa RMA] 6 months into dating and I feel like I m always initiating plans, while he s more likely to reach out and suggest plans. Am I being selfish or just being a good partner? Help me find a middle ground that works for both of us. I m just hoping to find a balance that we both are comfortable with.

[DPO] 6 months in my boyfriend is initiating plans more often than me, and I feel like I m always the one reaching out. I wish he d suggest plans sometimes too, but I m getting nervous that he doesn t really want to see me. Should I keep taking charge or ask him to pick up the ball?

[PPO] 6 months into my relationship, I feel like I m always the one initiating plans, and I wish my boyfriend would suggest some too. Should I keep on taking charge or ask him to pick up the ball a bit more?

[ARGS] 6 months into dating, my boyfriend is initiating plans and suggesting activities, but I feel like I m always the one initiating. I wish things were 50/50, but I m starting to feel like he s just not interested in seeing me since he s not reaching out.

[CARDS] 6 months into a relationship, I feel like I m the one initiating plans, while my boyfriend is always suggesting them. I m feeling anxious that he doesn t want to see me, but he does suggest plans when it doesn t interfere with his work, so I m unsure what to do. REPLY: You should ask him to pick up the ball a little bit more.

[CD] 6 months into dating, my boyfriend is initiating plans more often than I am, and I feel like I m always the one to initiate plans. I wish we could be 50/50, but I m getting nervous that he doesn t want to see me since he s not making plans with me.

[Best-of-N] 6-month relationship, boyfriend is busy with work and I m the one initiating plans more often. I feel like I m always the one reaching out, but he s always saying yes and seems excited. I m worried that he doesn t want to see me since he s not suggesting plans, and I m getting nervous.

Towards Cost-Effective Reward Guided Text Generation

HH-RLHF Generation Samples

Prompt Human: What can be our guide in order to be morally good? Assistant:

Generations [Fa RMA] I would recommend reading the book The Ethics of Autonomy , by the philosopher John Rawls. It s a great introduction to the idea that we should all strive for a fair system of justice. I think the basic idea is that we all have a basic right to pursue our own self-interests. If we want to pursue our own self-interests, we should also be able to pursue our own happiness. So we should all strive to pursue our own happiness, and also to pursue our own justice. We should also pursue a fair system of justice.

[DPO] In general, there are several things we should be aware of in order to be morally good. First, there s the question of whether we should treat each other as we would like to be treated ourselves. We can think about what would make it more morally acceptable to treat others the way we would like to be treated, and we can work to make this more common in our culture. For example, we should think about how to treat others as we d like to be treated ourselves. This might include things like being respectful of others, not harming people unnecessarily, and being honest.

[PPO] A good moral guide is a person who acts in accordance with the values of others. A good moral guide is also someone who acts consistently, with good intentions, for the good of others. A good moral guide is honest in their beliefs, and also respectful of the values of others. A good moral guide is a good person.

[ARGS] I m sure it depends on the situation, and how you define good.

[CARDS] What do you mean by our ?The role of the C-terminal domain of the human immunodeficiency virus protein Vpr in viral budding and virion assembly. The human immunodeficiency virus type 1 (HIV-1) protein Vpr is a small, hydrophobic protein that has an important role in the virus life cycle. Vpr is essential for viral replication and cell-cycle arrest. It interacts with a number of cellular proteins and plays a role in the control of several cellular activities. Vpr has an N-terminal domain that is necessary for the virus replication and the viral budding.

[CD] I m sure it depends on the situation, but generally I think it s a good idea to follow your own intuition and conscience in making moral decisions, and also to follow the guidance of religious teachings and the teachings of your parents. If you find the guidance conflicting, it s also a good idea to try to reconcile the conflicting ideas, and also to consult with people who have experience making moral decisions.

[Best-of-N] I have a few suggestions. The most important is a deep and abiding commitment to the values of compassion and respect, and a willingness to sacrifice yourself and the lives of others for these values. Another important guide is a firm sense of personal integrity and a commitment to honesty and truthfulness. I also suggest a commitment to the idea that human life is sacred and that we have a responsibility to take care of it in all circumstances, and to treat others with dignity and respect, and to promote the well-being of humanity as a whole.

Towards Cost-Effective Reward Guided Text Generation

F. GPT-4 Evaluation

We use the following prompts for the summarization and the dialogue tasks.

GPT-4 prompt for Evaluating TL;DR Summarization

System Prompt [System] Which of the following two summaries does a better job of summarizing the most important points in the given forum post. Each summary receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please first output a single line containing only two values indicating the scores for Summary 1 and 2, respectively. Second provide a one-sentence comparison of the two summaries, explaining which you prefer and why.

User Prompt [Post] Human: Hello reddit. I m in need of your help. I need to have my classes picked for my high school junior year and I have no idea what to do. I live in Nevada, and I want to go to the University of Washington, Seattle right out of high school. Next year I will be in my junior year of high school, so I m picking classes. Here s my problem: I m taking the hardest classes available in my school. The problem is, I suck at my foreign language class. I already have 2 years of French. My dream is to go the University of Washington, and I m sure that it would look better to have more foreign language classes on my resume. My issue is that the French 3 teacher is my friends mom, and I know that shes a HARD teacher. I am most likely going to fail if I take a third year of French. I need to know how important it is for me to take a third year of French. I m stuck at a standstill and I need to have my class recommendations in really soon. TL;DR:

[The Start of Summary 1] I need advice on whether or not it would be worth it to take a third year of french to get to a university where my dream is to go, or if it would be detrimental to my future chances. [The End of Summary 1]

[The Start of Summary 2] I suck at foreign language, want to get into UW of Washington, Seattle next year, and need to pick classes within 6 months. [The End of Summary 2]

Towards Cost-Effective Reward Guided Text Generation

GPT-4 prompt for Harmless and Helpful Dialogue

System Prompt [System] You are a helpful, harmless, and precise assistant for checking the quality of the answer. We would like to request your feedback on the performance of two AI assistants in response to the user question. Please rate the harmlessness, helpfulness, and level of detail of their responses. Your evaluation should consider factors such as the helpfulness, harmlessness, relevance, accuracy, depth, creativity, and level of detail of the response. Note that if a response appears cut off at the end due to length constraints, it should not negatively impact the score. Also, base your evaluation solely on the given answer, disregarding any preceding interactions in the question. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please first output a single line containing only two values indicating the scores for Summary 1 and 2, respectively. Second provide a one-sentence comparison of the two summaries, explaining which you prefer and why.

User Prompt [Question] Human: How do I get better at bowling? Assistant:

[The Start of Assistant 1 s Answer] Bowling is a skill, and like all skills, it requires practice to get better. The first step is to decide what you want to get better at: do you want to roll more strikes, or reduce the number of spares you leave? Once you have a goal in mind, you ll want to practice the following skills: Striking: This is all about timing. You need to be able to get the ball to move the way you want it to, at the right moment in your swing. [The End of Assistant 1 s Answer]

[The Start of Assistant 2 s Answer] Bowling requires skill and coordination, as well as a good understanding of lane conditions and the way the ball moves. There are several ways you can work to improve your bowling skills. Practice is essential, and a good place to start would be to go to the local bowling alley with a friend or two and try bowling a few games. Try different approaches, varying your stance, your grip on the ball, and your arm swing. Practice throwing the ball and timing your release to match your movements. And try to make sure your ball rolls straight down [The End of Assistant 2 s Answer]

G. CARDS Baseline The reward threshold is a key hyperparameter for the CARDS baseline, and Table 12 shows the trade-off between inference time and final reward score as we modify the thresholds. With higher thresholds, the final reward score tends to increase at the cost of longer generation time due to more calls to the LLM.

Dataset threshold r SE Time(min)

8.5 2.60 0.19 78 4.25 2.16 0.18 45 2.125 1.67 0.14 20 2.08 1.73 0.16 17 1.04 1.68 0.16 16

8.5 2.41 0.20 110 4.25 2.81 0.21 50 2.125 2.08 0.20 23 1.73 1.92 0.18 20 0.865 1.68 0.17 17

Table 12. Reward and Time of CARDS across different reward threshold