# variational_bestofn_alignment__1d0c75a1.pdf

Published as a conference paper at ICLR 2025

VARIATIONAL BEST-OF-N ALIGNMENT

Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell
ETH Zürich
{afra.amini, ryan.cotterell}@inf.ethz.ch, tim.f.vieira@gmail.com, ashe@ethz.ch

Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures.

https://github.com/rycolab/vbon

1 INTRODUCTION

Language models are pre-trained on large corpora to model a distribution over natural language text.1 Beyond their initial pre-training, they are often additionally fine-tuned on domain-specific data through a process called supervised fine-tuning (SFT). The goal of SFT is to enable the model to better perform various downstream tasks of interest. While the fine-tuned model, called the reference model in our paper, is indeed typically much better at performing the downstream task of interest, e.g., dialogue generation or summarization, it may still generate undesirable content, e.g., harmful or offensive text. To mitigate this issue, aligning the reference model to human preferences has become a fundamental step in the development of modern large language models (Meta, 2023; OpenAI, 2023; Gemini, 2024).

The degree to which text is aligned with human preferences is typically operationalized using a real-valued reward function. Rather than constructing a reward function by hand, it is typically estimated from a dataset of human preferences.2 After estimation, we expect the reward function to return higher values for text that is more likely to be preferred by humans, and lower values for text that is more likely to be dispreferred. Then, given an estimated reward function, an alignment algorithm further alters the reference model so that it places the highest probability on text that has both high reward under the reward model and high probability under the reference model.

Alignment algorithms can be taxonomized into two groups: (i) alignment via fine-tuning, where we change the language model's parameters to achieve alignment (Christiano et al., 2017; Rafailov

1 Many language models are also used to model text in non-natural languages, e.g., programming languages.

2 In some cases, the reward model is not estimated from human preference data. It is either known, e.g., code-based execution scores, or given by a classifier, e.g., toxicity or sentiment classifiers.


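[Figure 1 diagram. Panel titles: "Best-of-(N=4)" and "LM with Variational Best-of-N Fine-tuning". Both panels use the prompt "Billy Wilder is". Sampled text fragments shown in the Best-of-N panel include "a delight to watch.", "credited for a few fine spots, including", "Jack, a troubled driver", "bad. he was not one", and "for this film"; the selected output in both panels is "a delight to watch."]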

Figure 1: Best-of-N (on the left) is an effective alignment-via-inference method: it draws N samples from the language model, ranks them according to a reward model, and outputs the best sample. Variational Best-of-N (on the right) approximates this process via fine-tuning. The goal is to ensure that sampling a single string from the fine-tuned model produces a result equivalent to applying Best-of-N. This approach allows us to achieve similar performance while increasing the throughput by a factor of N.

et al., 2023), and (ii) alignment via inference (Nakano et al., 2022; Mudgal et al., 2024). A common alignment-via-fine-tuning method is reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022). RLHF typically consists of further fine-tuning the language model under a KL-constrained RL objective, which is made up of two terms: a term that encourages the model to maximize the reward, and a term that discourages high KL divergence between the language model and the reference model. This objective is often maximized with an RL algorithm, e.g., proximal policy optimization (PPO; Schulman et al., 2017). A common alignment-via-inference method is the Best-of-N (BoN; Stiennon et al., 2020) algorithm. Because it operates purely at inference time, it does not require any fine-tuning of the language model. The algorithm is straightforward: one draws N samples from the reference model and returns the text that achieves the highest reward among those N samples. The BoN algorithm has also been effectively applied in controlled decoding (Yang & Klein, 2021; Mudgal et al., 2024) and to generate a dataset for supervised fine-tuning (Meta, 2023).

Despite its simplicity, BoN has proven incredibly practical in generating high-reward text that still has a high probability under the reference model. Theoretically, Yang et al. (2024) prove that under some simplifying assumptions, the BoN distribution is asymptotically equivalent to the optimal distribution under the KL-constrained RL objective. Empirically, it has been repeatedly shown (Gao et al., 2023; Rafailov et al., 2023; Mudgal et al., 2024) that BoN often appears on the frontier of reward and KL curves, surpassing the performance of models fine-tuned with RLHF. However, the main factor preventing BoN from replacing fine-tuning methods for alignment is its significant computational overhead during inference. Even when sampling is done in parallel, BoN decreases the text generation throughput by a factor of N. This drawback limits its practicality for generating text from large language models.

To speed up BoN, we devise a scheme to convert it into an alignment-via-fine-tuning algorithm rather than an alignment-via-inference algorithm. To this end, we first formally derive the probability distribution induced by the BoN algorithm. Then we approximate this distribution by minimizing the reverse KL divergence between the language model and the BoN distribution. This leads to an optimization objective that we refer to as the vBoN objective. By analyzing a lower bound of this objective, we find that it behaves similarly to the KL-regularization objective in the limit, i.e., N → 1 or N → ∞. Importantly, the vBoN objective has a unique and useful property: it is insensitive to applying any monotonically increasing function to the reward values. This distinctive feature, along with the empirical success of the BoN algorithm, suggests that the vBoN objective is a promising and interesting objective to explore. Finally, we fine-tune the language model using PPO to optimize the vBoN objective. Our scheme, depicted in Fig. 1, allows us to achieve performance close to that of the BoN algorithm while increasing the inference throughput by a factor of N.

We experiment with vBoN on controlled generation and summarization tasks, comparing its performance to models fine-tuned using the KL-constrained RL objective. For controlled generation, our results indicate that models fine-tuned with the vBoN objective are more likely to fall on the Pareto frontier of the reward vs. KL curve compared to other fine-tuning-based alignment methods. This suggests that vBoN achieves a better balance between maximizing reward and maintaining proximity to the reference model. On a summarization task, fine-tuning with vBoN yields higher reward values and greater win rates on average than models fine-tuned with the KL-constrained RL objective, further demonstrating its effectiveness.

2 BACKGROUND: REINFORCEMENT LEARNING FROM HUMAN FEEDBACK

Let Σ be an alphabet, a finite, non-empty set of symbols.3 The elements of Σ may be characters, tokens, or words; the choice lies with the modeler. A string is a finite sequence of symbols drawn from Σ. A language model is a distribution over strings y ∈ Σ*, where Σ* is the set of all strings over the alphabet Σ. In this paper, we consider language models, e.g., those based on neural networks, that are parameterized by a real vector θ ∈ Θ, denoted as πθ. Furthermore, we restrict ourselves to language models that are differentiable functions of θ. In conditional generation tasks, e.g., summarization or dialogue generation, it is desirable to prompt the language model with a string x ∈ Σ*. Consequently, we consider prompted language models, i.e., those that give a conditional distribution over response strings y, given a prompt string x, as πθ(y | x). However, for notational convenience, we will drop the explicit conditioning on the prompt x and simply write πθ(y).

Algorithms for RLHF fine-tune the language model to increase the expected reward of the strings sampled from it while not diverging too far from the reference model. RLHF consists of three steps. First, the language model is fine-tuned on a task-specific dataset using the maximum-likelihood objective. Recall that we term the language model after this step the reference model and denote it by πref. Next, a reward model r: Σ* → ℝ is trained to capture human preferences; the reward of a string is high if it is preferred by humans.4 Finally, the reference model is fine-tuned to maximize the KL-constrained RL objective,

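$$
J^{\text{RL}}(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[ r(y) \big] - \beta \, D_{\text{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big), \tag{1}
$$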

where D_KL(· ‖ ·) is the KL divergence between two distributions, whose weight is modulated by a hyperparameter β. This objective encourages the model to assign greater probability mass to high-reward outputs while simultaneously penalizing excessive divergence from the reference model. Levine (2018) shows that the policy that maximizes5 this objective (Eq. (1)) is

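$$
\pi^{*}_{\theta}(y) = \frac{1}{Z} \, \pi_{\text{ref}}(y) \exp\!\Big( \frac{1}{\beta} r(y) \Big), \qquad Z = \sum_{y \in \Sigma^{*}} \pi_{\text{ref}}(y) \exp\!\Big( \frac{1}{\beta} r(y) \Big). \tag{2}
$$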

In simple terms, π*θ is the reference model reweighted by the exponentiated reward values and normalized by the partition function Z. Direct sampling from π*θ is not feasible, as computing Z requires evaluating an infinite sum, making it intractable. A heuristic approach to sampling from π*θ, however, is to sample many strings from πref and only keep those that have high rewards. Indeed, this heuristic is the motivation behind the BoN algorithm.
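To make Eq. (1) and Eq. (2) concrete, the following minimal sketch evaluates both on a toy finite support, where the partition function Z is a short sum rather than an infinite one. The three strings, their probabilities, their rewards, and the names `tilted_policy`, `kl`, and `j_rl` are all hypothetical and chosen only for illustration.

```python
import math

def tilted_policy(pi_ref, reward, beta):
    """Return pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z and Z over a finite support."""
    weights = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    z = sum(weights.values())  # partition function Z; tractable only on this toy support
    return {y: w / z for y, w in weights.items()}, z

def kl(p, q):
    """KL(p || q) for two distributions given as dicts over the same support."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)

def j_rl(pi, pi_ref, reward, beta):
    """Eq. (1): expected reward minus beta times KL(pi || pi_ref)."""
    expected_reward = sum(pi[y] * reward[y] for y in pi)
    return expected_reward - beta * kl(pi, pi_ref)

# Hypothetical three-string "language" with made-up probabilities and rewards.
pi_ref = {"bad": 0.5, "okay": 0.3, "great": 0.2}
reward = {"bad": 0.1, "okay": 0.5, "great": 0.9}
beta = 0.5

pi_star, z = tilted_policy(pi_ref, reward, beta)
print(pi_star)
# At the optimum the objective equals beta * log Z.
print(j_rl(pi_star, pi_ref, reward, beta), beta * math.log(z))
```

On this toy example the reweighted policy shifts mass toward the highest-reward string and attains a larger value of Eq. (1) than the reference model does.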

3 DERIVING THE BEST-OF-N OBJECTIVE

Best-of-N is a simple alignment-via-inference algorithm. The algorithm works as follows. Let Y_N = {y^(n)}_{n=1}^{N} be the multi-set containing N i.i.d. samples from πref. Then, BoN returns y*, where6

$$
y^{*} = \operatorname*{argmax}_{y^{(n)} \in Y_N} \; r\big(y^{(n)}\big). \tag{3}
$$
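The procedure in Eq. (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_ref`, `PI_REF`, and `REWARD` are hypothetical stand-ins for sampling from πref and for the reward model, and ties are broken in favor of the earliest-drawn candidate.

```python
import random

def best_of_n(sample, reward, n, rng):
    """Draw n i.i.d. candidates with sample(rng) and return the highest-reward one.

    Ties go to the earliest-drawn candidate, which is the behavior of max().
    """
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=reward)

# Hypothetical stand-ins for the reference model and the reward model.
PI_REF = {
    "bad. he was not one": 0.5,
    "credited for a few fine spots": 0.3,
    "a delight to watch.": 0.2,
}
REWARD = {
    "bad. he was not one": 0.1,
    "credited for a few fine spots": 0.5,
    "a delight to watch.": 0.9,
}

def sample_ref(rng):
    """Sample one continuation from the toy reference distribution."""
    strings = list(PI_REF)
    return rng.choices(strings, weights=[PI_REF[s] for s in strings], k=1)[0]

rng = random.Random(0)
print(best_of_n(sample_ref, REWARD.get, n=4, rng=rng))
```

Each call costs n samples and n reward evaluations, which is the inference overhead that the fine-tuning approach aims to remove.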

We denote the probability distribution induced by BoN by πbon. Notably, πbon is not the optimal distribution under Eq. (1), the KL-constrained RL objective.7 Despite this, the BoN algorithm often

3 Please refer to Tab. 3 for a summary of notations used throughout the paper.

4 For example, in a summarization task, a preference dataset consists of a document, two candidate summaries for that document, and a label indicating which summary is preferred by humans. The reward model is trained on this dataset to maximize the likelihood of correctly predicting human preference.

5 This formulation implicitly assumes that there exists a θ ∈ Θ that achieves the unconstrained maximum.

6 We assume that the argmax is unique, or ties are broken in a well-defined manner.

7 Under simplifying assumptions, πbon is asymptotically (in string length) equivalent to π*θ (Yang et al., 2024).


performs well even in comparison to RLHF-based methods. This naturally raises the question: under what optimization objective is πbon the optimal distribution? To answer this question, we first compute the probability of strings under πbon.

Proposition 1. Suppose r: Σ* → ℝ is a one-to-one mapping. Then, the probability of a string y under πbon is given by

$$
\pi_{\text{bon}}(y) = \sum_{i=1}^{N} \binom{N}{i} F\big(r(y)\big)^{N-i} \, \pi_{\text{ref}}(y)^{i}, \qquad F\big(r(y)\big) \overset{\text{def}}{=} \mathbb{P}_{y' \sim \pi_{\text{ref}}}\big( r(y') < r(y) \big). \tag{4}
$$

Proof. See App. B.
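As a sanity check on Proposition 1, the sketch below compares the closed form, assuming Eq. (4) is the binomial sum over i = 1, …, N of C(N, i) F(r(y))^(N−i) πref(y)^i, against a brute-force enumeration of every N-tuple of samples on a toy support. The distribution, the rewards, and all function names are hypothetical; the rewards are distinct so that the one-to-one assumption holds.

```python
from itertools import product
from math import comb, prod

def strict_cdf(pi_ref, reward, y):
    """F(r(y)): probability that a reference sample has reward strictly below r(y)."""
    return sum(p for z, p in pi_ref.items() if reward[z] < reward[y])

def pi_bon(pi_ref, reward, y, n):
    """Closed form: sum over i of C(n, i) * F^(n - i) * pi_ref(y)^i."""
    f, p = strict_cdf(pi_ref, reward, y), pi_ref[y]
    return sum(comb(n, i) * f ** (n - i) * p ** i for i in range(1, n + 1))

def pi_bon_bruteforce(pi_ref, reward, y, n):
    """Add up the probability of every n-tuple of draws whose best element is y."""
    total = 0.0
    for draws in product(pi_ref, repeat=n):
        if max(draws, key=reward.get) == y:
            total += prod(pi_ref[d] for d in draws)
    return total

# Hypothetical toy reference distribution with distinct (one-to-one) rewards.
pi_ref = {"bad": 0.5, "okay": 0.3, "great": 0.2}
reward = {"bad": 0.1, "okay": 0.5, "great": 0.9}

for y in pi_ref:
    print(y, pi_bon(pi_ref, reward, y, 3), pi_bon_bruteforce(pi_ref, reward, y, 3))
```

The closed form is equivalent to (F + πref(y))^N − F^N, i.e., the probability that all N samples have reward at most r(y) minus the probability that all have reward strictly below it.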

F can be understood as the strict cumulative distribution function of reward values under πref. In other words, F(r(y)) represents the probability that a random sample drawn from πref has a reward value less than r(y). We now describe how to fine-tune the language model to approximate πbon. Similar to variational inference, we minimize the reverse KL divergence between πθ and πbon. Concretely, we maximize

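$$
\begin{aligned}
J^{\text{VBON}}(\theta) &= -D_{\text{KL}}\big( \pi_\theta \,\|\, \pi_{\text{bon}} \big) = \mathbb{E}_{y \sim \pi_\theta}\Big[ \log \pi_{\text{bon}}(y) - \log \pi_\theta(y) \Big] && \text{(5a)} \\
&= \mathbb{E}_{y \sim \pi_\theta}\Big[ \log \pi_{\text{bon}}(y) \Big] + H(\pi_\theta) && \text{(5b)} \\
&= \mathbb{E}_{y \sim \pi_\theta}\Big[ \log \sum_{i=1}^{N} \binom{N}{i} F\big(r(y)\big)^{N-i} \, \pi_{\text{ref}}(y)^{i} \Big] + H(\pi_\theta), && \text{(5c)}
\end{aligned}
$$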

where H(·) is the entropy of a distribution. Thus, Eq. (5) offers an answer to the question of what objective BoN optimizes. Inspecting the objective further, we see that Eq. (5) is an entropy-regularized objective, where we use the probability of the string under the BoN distribution as the reward and discourage the model from having low entropy.

Monotonically invariant. An important property of the variational BoN objective is that it is invariant to applying a strictly monotonically increasing function to rewards. This is because the vBoN objective relies on reward values solely through F, which, as defined in Eq. (4), only depends on the ranking between the reward values and not their exact magnitude. This suggests that the vBoN objective may be less sensitive to outliers and the scale of rewards. This property is important as RL algorithms are notoriously sensitive to the scale of reward values (Henderson et al., 2018; Schaul et al., 2021).
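The invariance can be checked directly: since F depends only on comparisons between rewards, replacing r with any strictly increasing transform of r leaves F unchanged. The sketch below illustrates this on a toy support; the distribution, the raw rewards, and the particular transform are hypothetical choices.

```python
import math

def strict_cdf(pi_ref, reward_fn, y):
    """F(r(y)): reference probability of strings with reward strictly below r(y)."""
    return sum(p for z, p in pi_ref.items() if reward_fn(z) < reward_fn(y))

# Hypothetical toy reference distribution and raw rewards.
pi_ref = {"bad": 0.5, "okay": 0.3, "great": 0.2}
raw = {"bad": 0.1, "okay": 0.5, "great": 0.9}

def reward(y):
    return raw[y]

def transformed_reward(y):
    """A strictly increasing transform of the raw reward (arbitrary scale and shift)."""
    return math.exp(10.0 * raw[y]) - 3.0

for y in pi_ref:
    print(y, strict_cdf(pi_ref, reward, y), strict_cdf(pi_ref, transformed_reward, y))
```

An objective that uses the reward values directly, such as Eq. (1), would in general change under the same transform.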

Approximating log F(·). Maximizing Eq. (5) requires us to compute log F(·) for any r(y). This, however, is computationally expensive, as we have to sum over the probabilities of all strings that have rewards less than r(y). Fortunately, we can instead maximize a lower bound of Eq. (5) using a Monte Carlo estimator of F(·). Concretely, we can write F(·) as an expectation,

$$
F\big(r(y)\big) = \mathbb{E}_{y' \sim \pi_{\text{ref}}}\Big[ \mathbb{1}\{ r(y') < r(y) \} \Big]. \tag{6}
$$

We approximate F(r(y)) using M i.i.d. samples from πref, termed y'^(1), …, y'^(M) ∼ πref, with which we compute

$$
\widehat{F}\big(r(y)\big) \overset{\text{def}}{=} \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\{ r(y'^{(m)}) < r(y) \}.
$$

We then take the log of this Monte Carlo estimator as a biased, but consistent, estimator of log F(·) in Eq. (5).8 In §5.1, we empirically assess the number of samples needed for log F̂ to accurately approximate log F.
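A minimal sketch of this estimator follows. The toy rewards are drawn uniformly from [0, 1], so the true strict CDF is F(r) = r, which makes the estimate easy to eyeball. The `floor` argument is an assumption added here to avoid taking the log of zero when no reference sample falls below r(y); the text does not say how that case is handled.

```python
import math
import random

def f_hat(ref_rewards, r):
    """Fraction of reference rewards strictly below r (the strict empirical CDF)."""
    return sum(1 for s in ref_rewards if s < r) / len(ref_rewards)

def log_f_hat(ref_rewards, r, floor=1e-8):
    """Log of the Monte Carlo estimate; `floor` is an assumed clamp that avoids log(0)."""
    return math.log(max(f_hat(ref_rewards, r), floor))

rng = random.Random(0)
for m in (10, 100, 1000):
    # Toy reference rewards: uniform on [0, 1], so F(0.7) = 0.7 exactly.
    ref_rewards = [rng.random() for _ in range(m)]
    print(m, f_hat(ref_rewards, 0.7), log_f_hat(ref_rewards, 0.7), math.log(0.7))
```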

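8 Using Jensen's inequality, we show biasedness. Concretely, note the following lower bound

$$
\log F\big(r(y)\big) = \log \mathbb{E}_{y'^{(1)}, \ldots, y'^{(M)}}\Big[ \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\{ r(y'^{(m)}) < r(y) \} \Big] \;\geq\; \mathbb{E}_{y'^{(1)}, \ldots, y'^{(M)}}\Big[ \log \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\{ r(y'^{(m)}) < r(y) \} \Big],
$$

where Jensen's inequality is applicable because log is concave. Consistency can be shown with an application of the delta method (§5.5.4; Casella & Berger, 2001).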


[Figure 2 plots. Legend: BoNBoN, DPO, PPO, vBoN, BoN.]

(a) 4% of points on the Pareto front belong to BoNBoN, 4% to PPO, 42% to DPO, and 50% to vBoN.

(b) 7% of points on the Pareto front belong to BoNBoN, 10% to DPO, 33% to PPO, and 50% to vBoN.

Figure 2: Steering generated movie reviews towards positive sentiment. Points that are not on the Pareto front of each method have lower opacity. BoN is the most effective approach in achieving high win rates and high rewards while not diverging too far from the reference model. Our variational approximation to BoN gets closest to the performance of BoN compared to other fine-tuning methods, as reflected in the percentage of times it appears on the Pareto front.

4 COMPARING THE BON AND RL OBJECTIVES

To explore the connection between the vBoN objective and the KL-regularized RL objective, we derive a lower bound for J^VBON. Through this lower bound, we hope to achieve a deeper insight into how the reward function is used in the variational BoN objective, and why this objective discourages high KL divergence from the reference model.

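To derive such a lower bound, we substitute the BoN distribution in Eq. (4) into the vBoN objective in Eq. (5). We then simplify this objective to arrive at the following theorem.

Theorem 2. We have J^VBON(θ) ≥ L(θ), where

$$
L(\theta) \overset{\text{def}}{=} (N - 1) \, \mathbb{E}_{y \sim \pi_\theta}\Big[ \log F\big(r(y)\big) \Big] - D_{\text{KL}}\big( \pi_\theta \,\|\, \pi_{\text{ref}} \big). \tag{8}
$$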

Proof. See App. D.
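The bound in Theorem 2 can be checked numerically on a toy support where both sides are computable exactly. The sketch below assumes that Eq. (4) is the binomial-sum form of πbon and that Eq. (8) reads L(θ) = (N − 1) E[log F(r(y))] − D_KL(πθ ‖ πref); the distributions, rewards, and function names are hypothetical, and the code assumes N ≥ 2. When πθ places mass on the lowest-reward string, F is zero there and the bound is −∞, which holds trivially.

```python
import math
from math import comb

def strict_cdf(pi_ref, reward, y):
    """F(r(y)): reference probability of strings with reward strictly below r(y)."""
    return sum(p for z, p in pi_ref.items() if reward[z] < reward[y])

def pi_bon(pi_ref, reward, y, n):
    """Best-of-n probability of y, using the binomial-sum closed form."""
    f, p = strict_cdf(pi_ref, reward, y), pi_ref[y]
    return sum(comb(n, i) * f ** (n - i) * p ** i for i in range(1, n + 1))

def log_or_neg_inf(x):
    """log(x), with log(0) mapped to -inf instead of raising."""
    return math.log(x) if x > 0 else float("-inf")

def j_vbon(pi_theta, pi_ref, reward, n):
    """-KL(pi_theta || pi_bon), computed exactly on a finite support."""
    return sum(q * (math.log(pi_bon(pi_ref, reward, y, n)) - math.log(q))
               for y, q in pi_theta.items() if q > 0)

def lower_bound(pi_theta, pi_ref, reward, n):
    """(n - 1) * E[log F(r(y))] - KL(pi_theta || pi_ref); assumes n >= 2."""
    return sum(q * ((n - 1) * log_or_neg_inf(strict_cdf(pi_ref, reward, y))
                    + math.log(pi_ref[y]) - math.log(q))
               for y, q in pi_theta.items() if q > 0)

# Hypothetical toy setup.
pi_ref = {"bad": 0.5, "okay": 0.3, "great": 0.2}
reward = {"bad": 0.1, "okay": 0.5, "great": 0.9}
n = 4
bon = {y: pi_bon(pi_ref, reward, y, n) for y in pi_ref}
candidates = {
    "reference": pi_ref,
    "shifted": {"bad": 0.0, "okay": 0.3, "great": 0.7},
    "best-of-n": bon,
}
for name, pi_theta in candidates.items():
    print(name, j_vbon(pi_theta, pi_ref, reward, n), lower_bound(pi_theta, pi_ref, reward, n))
```

The objective is zero exactly when πθ equals πbon and negative otherwise, while the lower bound sits below it for every candidate.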

Empirically, we observe that models that are fine-tuned to maximize L(θ) perform competitively to the ones that are fine-tuned to maximize the vBoN objective; see App. G for experimental results. Interestingly, if we compare Eq. (8) to the KL-constrained RL objective, Eq. (1), we see they have a very similar structure. We observe that N (in the vBoN objective) acts as a regularization parameter. As N → 1, the optimal distribution gets closer to πref, which has the same effect as β → ∞ in Eq. (1). Furthermore, as N → ∞, the optimal distribution only generates the string with the maximum reward, which is equivalent to β → 0 in Eq. (1). Importantly, in both limits, the optimal distribution under the KL-regularized RL objective and the vBoN objective are equivalent.

The main difference between the KL-constrained RL objective Eq. (1) and the derived vBoN lower bound Eq. (8) is in how the reward function is used. The KL-constrained RL objective aims to maximize the expected reward values, whereas vBoN maximizes the cumulative probability that strings sampled from the aligned model, πθ, achieve higher rewards compared to those sampled from πref.

5 SENTIMENT CONTROL

We now employ the variational BoN objective, Eq. (5), to fine-tune language models. We perform an open-ended text generation task where the goal is to generate movie reviews with positive sentiment.


The reference model, πref, is GPT-IMDB,9 a GPT-2 (Radford et al., 2019) model fine-tuned on the IMDB corpus (Maas et al., 2011). We use a binary sentiment classifier,10 denoted as p, with two classes {POS, NEG} as the reward model, and define r(y) ≝ p(POS | y). Following Rafailov et al. (2023), we sample 5000 movie reviews from the training set of the IMDB dataset and, for each sample, we randomly choose a prefix length from {2, …, 8} and take that prefix as the prompt. We further generate 512 prompts in the same way from the test set of IMDB, which we use to evaluate our models.

We fine-tune the reference model with PPO using the vBoN objective Eq. (5). Then, we compare the performance of the fine-tuned model (vBoN) to the exact BoN (BoN), i.e., applying BoN at inference time.

We implement and compare the following existing methods for language model alignment:

BoN-SFT: Perhaps the most straightforward way to approximate the BoN distribution is to fine-tune the model to maximize the likelihood of the samples taken with the BoN algorithm. Unfortunately, we find that SFT is incapable of achieving a good trade-off between achieving high rewards and low KL divergence; see App. H (Fig. 7) for the experimental results.

PPO: We use PPO to optimize the KL-constrained objective in Eq. (1). We use the default hyperparameters in the trlx library (Havrilla et al., 2023) for fine-tuning with PPO.

DPO: Direct preference optimization (DPO; Rafailov et al., 2023) is a popular alternative to RLHF that does not require training a reward model. Following DPO's experimental setup, we generate 6 reviews per prompt and use the resulting 12 pairwise comparisons per prompt to construct DPO's contrastive loss.11

BoNBoN: Concurrent work (Gui et al., 2024) explores another approach to approximate the BoN distribution. Assuming that the reference model distribution πref is continuous, Gui et al. (Theorem 3; 2024) prove that the expected difference between the relative likelihood, i.e., πbon(·) / πref(·), of the Best-of-N response and the Worst-of-N response is

$$
\frac{1}{2\beta} = \frac{1}{(N - 1) \sum_{k=1}^{N-1} 1/k}.
$$

They use this property to construct a loss function similar to that of IPO (Azar et al., 2023). Furthermore, they add another term to the loss function, which simply maximizes the likelihood of the Best-of-N response. The final loss function is a convex combination of the IPO-like loss and the negative log-likelihood loss, regulated by a hyperparameter α.12

We fine-tune models by varying the degree of regularization. For BoN approaches, that is achieved by varying N, and for DPO and PPO, we vary β.13 Conveniently, N in vBoN is a hyperparameter, meaning that we do not need to generate more samples from πref when we increase N. However, with BoN and BoNBoN methods, we need to increase the number of samples from the reference model as we increase N.

We generate movie reviews based on prompts from our test set using fine-tuned models and then measure three metrics: (i) KL divergence between the fine-tuned model and the reference model; (ii) win rate, defined as the percentage of times the fine-tuned model's generations receive higher rewards compared to the reference model's generations; and (iii) average rewards obtained by the fine-tuned model's sampled strings.

For the BoN method, we report the empirical upper bound of log N − (N − 1)/N for KL divergence (Beirami et al., 2024; Mroueh, 2024) in our plots. Furthermore, the win rate of BoN over the reference model can be computed analytically and is equal to N/(N + 1).
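The two closed-form quantities above are easy to tabulate. The sketch below computes both for a few values of N and also includes a small simulation of the win rate. The simulation assumes continuous, i.i.d. rewards (here uniform on [0, 1]) so that ties have probability zero; that is an illustrative assumption rather than a property of the classifier-based reward used in the experiments.

```python
import math
import random

def bon_kl_upper_bound(n):
    """Upper bound log N - (N - 1) / N on the KL divergence of BoN from pi_ref."""
    return math.log(n) - (n - 1) / n

def bon_win_rate(n):
    """Analytical win rate N / (N + 1) of BoN against a single reference sample."""
    return n / (n + 1)

def simulated_win_rate(n, trials, rng):
    """Monte Carlo win rate under toy continuous rewards, uniform on [0, 1]."""
    wins = 0
    for _ in range(trials):
        best = max(rng.random() for _ in range(n))  # reward of the best of n samples
        wins += best > rng.random()                 # compare with one fresh sample
    return wins / trials

rng = random.Random(0)
for n in (1, 2, 4, 16, 100):
    print(n, round(bon_kl_upper_bound(n), 4), round(bon_win_rate(n), 4),
          round(simulated_win_rate(n, 20000, rng), 4))
```

Both quantities grow slowly in N: the KL bound is logarithmic, and the win rate approaches 1 only as N becomes large.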

We visualize the win rate vs. KL curves in Fig. 2a, and the average rewards of generations under πθ vs. the KL divergence in Fig. 2b. As expected, BoN is the most effective approach; however, this comes at an extra inference cost that grows with N. We observe that among the fine-tuning methods, our variational approximation to BoN gets closest to the performance of BoN, as it appears more

9 Specifically, we use https://huggingface.co/lvwerra/gpt2-imdb.

10 Specifically, we use https://huggingface.co/lvwerra/distilbert-imdb.

11 One could argue that DPO has a slight advantage over other methods in this setup since it has seen 6 unique generations per prompt during training, while the others have only seen one (or 2 with BoNBoN). Nevertheless, we observe that vBoN is more effective than DPO.

12 Following the authors' recommendation, we set α so that both terms contribute equally to the final loss.

13 See App. F for more details regarding the regularization hyperparameters.


[Figure 3 plots. Three panels, each with r(y) on the x-axis. Panel prompts: "I thoroughly enjoyed this movie because there"; "Horrible. I see many user comments"; "Billy Wilder is".]

Figure 3: Estimates of log F(·) with an increasing number of Monte Carlo samples. We test an adversarial prompt (left plot), a neutral prompt (middle plot), and a prompt with a positive sentiment (right plot). Overall, we hardly see any difference between the estimates after taking 200 samples. For the adversarial prompt, the distribution of rewards is peaked, and we do not see any changes in our estimator after taking only 100 samples.

often on the Pareto front of the two curves compared to other methods. Notably, we observe that DPO performs better than PPO in terms of win rates but worse in terms of average rewards; this could be attributed to the contrastive nature of DPO's loss function.

5.1 ERROR IN ESTIMATING log F(·)

We empirically quantify the error when estimating log F(·) with a finite number of i.i.d. samples from πref. To get a better intuition on the error of our estimators, in Fig. 3, we visualize the estimators for 3 different prompts: one adversarial prompt (left plot), where the prompt itself has a negative sentiment, one neutral prompt (middle plot), and one prompt with a positive sentiment (right plot). We vary the number of Monte Carlo samples from 10 to 600. We observe that for all 3 prompts, the estimated CDF hardly changes after 200 samples. When using the adversarial prompt, the reward distribution is negatively peaked, and the estimated CDF does not change after taking only 100 samples.

We then quantify the change in the estimator by performing a two-sample Kolmogorov–Smirnov test (Hodges, 1958). This test measures the closeness of two empirical cumulative distribution functions. Concretely, the test statistic is

$$
\sup_{x} \Big| \widehat{F}_{M_1}(x) - \widehat{F}_{M_2}(x) \Big|,
$$

where F̂_{M1} and F̂_{M2} are estimated CDFs from M1 and M2 samples, respectively. The statistic shows the magnitude of the difference between the two empirical distributions of samples. The null hypothesis is that the two distributions are identical.
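For reference, the two-sample Kolmogorov–Smirnov statistic can be computed directly from the two samples, since the supremum of the gap between two step functions is attained at one of the sample points. The sketch below is a plain standard-library version that returns only the statistic; it uses the usual (non-strict) empirical CDFs and does not compute a p-value. The function name and the toy comparison are illustrative assumptions; a statistics library routine such as SciPy's `ks_2samp` would typically be used in practice to obtain the p-value as well.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs_a, xs_b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in xs_a + xs_b:
        # bisect_right counts the elements <= x, i.e. the empirical CDF times the size.
        cdf_a = bisect.bisect_right(xs_a, x) / len(xs_a)
        cdf_b = bisect.bisect_right(xs_b, x) / len(xs_b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Toy comparison in the spirit of Tab. 1: small samples against a 600-sample reference.
rng = random.Random(0)
reference = [rng.random() for _ in range(600)]
for m in (5, 20, 100, 200):
    subsample = [rng.random() for _ in range(m)]
    print(m, round(ks_statistic(subsample, reference), 3))
```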

Table 1: Measuring the estimation error with increasing sample size. After 250 samples, the estimated CDF is unchanged for all the prompts.

| M | Rejection rate | Test statistic | p-value |
|---|---|---|---|
| 5 | 6.14% | 0.63 | 0.02 |
| 20 | 4.02% | 0.33 | 0.03 |
| 100 | 1.14% | 0.17 | 0.02 |
| 200 | 0.06% | 0.12 | 0.02 |
| 250 | 0 | - | - |

In Tab. 1, for each sample size M, we compare the estimated CDF with M samples to the estimated CDF with 600 samples. If the two distributions are identical according to the test, we can reliably use the M samples to estimate the CDF. We report the rejection rate, i.e., the percentage of prompts (out of 5000 prompts) for which we reject the null hypothesis, meaning that the distributions are not identical. Furthermore, for those prompts, we report the average test statistics and p-values. In general, the null hypothesis is rejected for very few prompts. Moreover, with 250 samples, the estimated CDFs are identical to the estimated CDF with 600 samples for all prompts.


5.2 EFFICIENCY ANALYSIS

We break down the efficiency analysis into 3 main parts: (i) the inference cost, (ii) the preference optimization cost, and (iii) the preprocessing cost.

Inference cost. As discussed earlier, vBoN is an alignment-via-fine-tuning method, and along with other alignment-via-fine-tuning methods, it is N times more efficient at inference compared to BoN.

Optimization cost. We compare vBoN's preference optimization cost to its closest alignment-via-fine-tuning counterpart, PPO. In the optimization loop, the main difference between PPO and vBoN is that vBoN requires computing the strict CDF, F, using M samples. Crucially, N in vBoN serves as a regularization hyperparameter, and increasing N does not incur additional computation costs. To implement vBoN efficiently, we precompute the F function before starting the optimization loop. This means the computational overhead is incurred only once, regardless of the number of optimization runs.14 Since the F values are precomputed, we empirically observe that the time needed to run the vBoN optimization loop is the same as running the PPO optimization loop, and the cost of evaluating F is negligible. Therefore, the main computational overhead in vBoN comes from precomputing log F(·).
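One way to realize this precomputation, sketched below under stated assumptions, is to sort the M reference rewards for each prompt once and answer each later query with a binary search, so that evaluating log F̂ inside the optimization loop costs O(log M). The class name `LogFTable`, the helper `precompute`, the per-prompt dictionary layout, and the `floor` clamp are hypothetical choices made for illustration and are not taken from the released code.

```python
import bisect
import math

class LogFTable:
    """Sorted reference rewards for one prompt, queried by binary search."""

    def __init__(self, ref_rewards, floor=1e-8):
        self._sorted = sorted(ref_rewards)  # one-off O(M log M) preprocessing
        self._floor = floor                 # assumed clamp that avoids log(0)

    def __call__(self, r):
        # bisect_left returns the number of stored rewards strictly below r.
        below = bisect.bisect_left(self._sorted, r)
        return math.log(max(below / len(self._sorted), self._floor))

def precompute(ref_rewards_by_prompt):
    """Build one lookup table per prompt before the optimization loop starts."""
    return {prompt: LogFTable(rs) for prompt, rs in ref_rewards_by_prompt.items()}

# Hypothetical rewards of M = 4 reference samples for a single prompt.
tables = precompute({"Billy Wilder is": [0.4, 0.1, 0.3, 0.2]})
log_f = tables["Billy Wilder is"]
print(log_f(0.3), log_f(0.05), log_f(1.0))
```

Because N does not appear in the table, the same precomputed values can be reused across runs with different N.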

[Figure 4 plot. x-axis: Total Elapsed Time (Minutes); y-axis: Performance; series: Average Reward, Win Rate.]

Figure 4: The average reward and win rate of the aligned models improve as we increase the sample size M used for approximating the vBoN loss function.

Preprocessing cost. Estimating log F(·) requires only forward passes through the LLM and reward model without the need to compute and store gradients. This makes the process highly parallelizable. Our experiments utilize a memory-efficient library for LLM inference (vLLM; Kwon et al., 2023), which allows us to perform these approximations efficiently.

We examine the impact of increasing the computational cost of vBoN by varying M, which directly affects the total elapsed time and downstream performance. For this analysis, we fix N = 10 and fine-tune the model using three random seeds. We report the average and standard deviation of reward values and win rates in Fig. 4 on a single A100-40GB GPU. Our results show that increasing M generally improves the aligned model's rewards and win rates. Notably, even with M = 32 samples (taking only 10 minutes), the performance remains competitive with higher values of M. We hypothesize that the data efficiency of the simple Monte Carlo estimator can be improved by taking into account the similarity between different prompts to learn an approximation to the log F function, which we plan as future work.

6 SUMMARIZATION

We further employ variational BoN in a summarization task, where the goal is to generate summaries that align with human preferences. The reference model, πref, is a pythia-2.8B model fine-tuned on human-written summaries of Reddit posts (Stiennon et al., 2020).15 We use SFT to refer to this model in the plots. We use two separate reward models for training and evaluation: a pythia-2.8B16 reward model for fine-tuning and a larger pythia-6.9B17 model exclusively for evaluation.

Dataset. To evaluate the generalization ability of the aligned models on out-of-distribution data, we fine-tune the models using only posts from the relationship and relationship advice subreddits

14 This is particularly advantageous since practitioners often perform the optimization multiple times to test various hyperparameter settings.

15 We use https://huggingface.co/cleanrl/EleutherAI_pythia-2.8b-deduped__sft__tldr.

16 We use https://huggingface.co/cleanrl/EleutherAI_pythia-2.8b-deduped__reward__tldr.

17 We use https://huggingface.co/cleanrl/EleutherAI_pythia-6.9b-deduped__reward__tldr.


[Figure 5 plots. x-axis: Temperature (0.25, 0.5, 0.75, 1); y-axis of panel (b): Average Reward; legend: SFT, BoNBoN, DPO, PPO, vBoN, BoN.]

(a) Comparing the win rates of alignment methods against samples from πref. vBoN achieves closer results to BoN compared to other alignment-via-fine-tuning methods.

(b) Comparing the average rewards obtained from the evaluator reward model. BoN outperforms other alignment methods, and vBoN achieves closer results to BoN compared to other alignment-via-fine-tuning methods.

Figure 5: Performance of different alignment methods on the summarization task. Solid traces show the performance on in-distribution Reddit posts, while dashed traces show the out-of-distribution performance. Overall, BoN is the most effective approach in achieving high win rates and average rewards across all sampling temperatures. Our variational approximation to BoN (vBoN) gets closest to the performance of BoN while being significantly cheaper at inference time.

of the Reddit TL;DR (Stiennon et al., 2020) dataset. We then assess the models' performance on the two types of data by dividing the test set into two equally-sized groups: in-distribution Reddit posts from the relationship and relationship advice subreddits, and out-of-distribution posts from the rest of the subreddits. We visualize the performance of methods on in-distribution data with a solid trace and on out-of-distribution data with a dashed trace.

Experimental setup. We fine-tune the model with both the KL-constrained RL objective and the vBoN objective for 10000 episodes. Similar to the previous experiment, we use 200 samples to estimate log F(·) values. To create a smooth and continuous reward function, we further fit an exponential curve18 to the estimates. We set N = 100 for the BoN and vBoN methods and the equivalent value of β = 0.05 for the KL-constrained RL objective. We closely follow Huang et al. (2024) for setting the hyperparameters of the PPO algorithm; please refer to App. F for more experimental details. After fine-tuning, we sample from the aligned models with different sampling temperatures t ∈ {0.25, 0.5, 0.75, 1.0}, each with 3 different random seeds.

Win rates. In Fig. 5a, we visualize the average and standard deviation of win rates compared against the samples from the SFT model. Notably, BoN achieves the highest win rates, which is consistent with findings from previous studies (Rafailov et al., 2023). We do not observe any significant differences between BoN performance on in-distribution (solid trace) and out-of-distribution data,19 which is expected as BoN is an alignment-via-inference method. Similarly, we mostly do not observe significant differences between in- and out-of-distribution performance of all alignment-via-fine-tuning methods, indicating that these methods can generalize effectively in this experimental setup. DPO and BoNBoN only manage to perform competitively to other methods at lower temperatures (0.25, 0.5), and their performance drops significantly at higher temperatures (0.75, 1). Importantly, while PPO and vBoN perform comparably at higher temperatures, vBoN significantly outperforms PPO at lower temperatures (0.25 and 0.5).

18 We fit an exponential function of the form f(x) = a exp(−bx) to the data using non-linear least squares.

19 The difference between the two data distributions becomes more apparent at temperature 1, potentially due to increased sample diversity in this setting.


Table 2: An example of summaries sampled at temperature 0.5 and their corresponding reward obtained from the evaluator reward model.

Content Reward

SUBREDDIT: r/relationship_advice TITLE: Stuck in a rut and in need of advice/inspiration! POST: My boyfriend and I have been together for 3 years, and living together for 2. I'm quite the homebody, and when we first met, he was very outgoing and loved partying and socialising (although he was a student at the time). We're both working now, and most nights we find ourselves doing the same things: watching series (luckily we enjoy the same shows), playing Minecraft or playing various board games. We're tired after work, and can't bring ourselves to leave the house. The weekend is much the same - lots of sleep, or sitting around staring at one screen or another. We do party occasionally (we'll head to a pub once every few months) and there are a few mutual friends we enjoy spending time with, but I worry that we've become stuck in our boring ways. I really enjoy our lifestyle, and would be quite happy to never leave the house again, but I'm starting to feel guilty for turning him into a 50 year-old when he's only 24. Any ideas for shaking things up a little? Bear in mind that we live in a small town in South Africa, and neither of us has a car.

SFT: I'm stuck in a rut, and need to shake things up to see if it'll work out. Any advice? 3.08

PPO: In need of inspiration to break out of rut and live life fully! Any ideas welcome! 4.59

v Bo N: Been happily living together for 2yr+, feeling bored after work regularly, looking for ideas to spice things up!

Bo N: My boyfriend and I have been together for 3 years, and are both working full time. We spend most of our time in the house, and have become boring. What can we do to shake things up?

Average rewards. In Fig. 5b, we measure the average rewards across different temperatures. As the temperature increases, the average reward decreases consistently across all methods. This trend is also evident in the qualitative analysis in App. I, where we show sampled summaries at different temperatures. DPO and Bo NBo N suffer more from increasing the temperature, as the average rewards get close to (or even worse than) the SFT average rewards. Generally, the average reward results align with the win-rate trends, and we observe that v Bo N achieves significantly higher rewards compared to PPO at lower temperatures. In Tab. 2, we show an example of summaries generated from the fine-tuned models with their associated reward values.

7 CONCLUSION

Motivated by the effectiveness of the Bo N algorithm, we formally derive a variational approximation to the distribution induced by the Bo N algorithm and fit it by fine-tuning language models. Our analysis highlights the similarities and distinctions between the variational Bo N objective and the KL-constrained RL objective. Our empirical findings reveal that models fine-tuned using the variational approximation to Bo N not only attain high reward values but also maintain proximity to the reference models. Crucially, inference on the models fine-tuned with the v Bo N objective remains as cost-effective as inference on the original reference model.

ACKNOWLEDGEMENTS

We thank Ahmad Beirami for the fruitful discussion in the early stages of this project. We also thank Amrit Singh Bedi for identifying a typo in a previous version of the bound derivations. Finally, we thank the anonymous reviewers for their feedback. Afra Amini is supported by the ETH AI Center doctoral fellowship.

REFERENCES

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. Computing Research Repository, arXiv:2310.12036, 2023. URL https://arxiv.org/abs/2310.12036.

Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, and Ananda Theertha Suresh. Theoretical guarantees on the best-of-n alignment policy. Computing Research Repository, arXiv:2401.01879, 2024. URL https://arxiv.org/abs/2401.01879.

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. Computing Research Repository, arXiv:2407.21787, 2024. URL https://arxiv.org/abs/2407.21787.

George Casella and Roger L. Berger. Statistical Inference. Chapman and Hall/CRC, Pacific Grove, CA, 2nd edition, 2001. ISBN 9781032593036. URL https://www.routledge.com/Statistical-Inference/Casella-Berger/p/book/9781032593036.

Eugene Charniak and Mark Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2005. doi: 10.3115/1219840.1219862. URL https://aclanthology.org/P05-1022.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Ka Shun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Proceedings of the International Conference on Machine Learning, Proceedings of Machine Learning Research, 2023. URL https://proceedings.mlr.press/v202/gao23h.html.

Gemini. Gemini: A family of highly capable multimodal models. Technical report, Google, 2024. URL https://arxiv.org/pdf/2312.11805.

Lin Gui, Cristina Gârbacea, and Victor Veitch. BoNBoN alignment for large language models and the sweetness of best-of-n sampling. Computing Research Repository, arXiv:2406.00832, 2024. URL https://arxiv.org/pdf/2406.00832.

Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, Stella Biderman, Quentin Anthony, and Louis Castricato. trlX: A framework for large scale reinforcement learning from human feedback. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023. doi: 10.18653/v1/2023.emnlp-main.530. URL https://aclanthology.org/2023.emnlp-main.530.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the Conference on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conference and AAAI Symposium on Educational Advances in Artificial Intelligence, 2018. URL https://dl.acm.org/doi/pdf/10.5555/3504035.3504427.

Joseph L. Hodges. The significance probability of the Smirnov two-sample test. Arkiv för Matematik, 3, 1958. URL https://api.semanticscholar.org/CorpusID:121451525.


Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. The N+ implementation details of RLHF with PPO: A case study on TL;DR summarization. In Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kHO2ZTa8e3.

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles, 2023. URL https://arxiv.org/abs/2309.06180.

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. Computing Research Repository, arXiv:1805.00909, 2018. URL https://arxiv.org/pdf/1805.00909.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011. URL https://aclanthology.org/P11-1015.

Meta. Llama 2: Open foundation and fine-tuned chat models. Technical report, Meta, 2023. URL https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.

Youssef Mroueh. Information theoretic guarantees for policy alignment in large language models. Computing Research Repository, arXiv:2406.05883, 2024. URL https://arxiv.org/abs/2406.05883.

Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, and Ahmad Beirami. Controlled decoding from language models. In Proceedings of The International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2024. URL https://arxiv.org/pdf/2310.17022.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. Computing Research Repository, arXiv:2112.09332, 2022. URL https://arxiv.org/pdf/2112.09332.

OpenAI. GPT-4 technical report. Technical report, OpenAI, 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. West-of-n: Synthetic preference generation for improved reward modeling. Computing Research Repository, arXiv:2401.12086, 2024. URL https://arxiv.org/abs/2401.12086.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/pdf/2305.18290.pdf.


Tom Schaul, Georg Ostrovski, Iurii Kemaev, and Diana Borsa. Return-based scaling: Yet another normalisation trick for deep RL. Computing Research Repository, arXiv:2105.05347, 2021. URL https://arxiv.org/abs/2105.05347.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. Computing Research Repository, arXiv:1707.06347, 2017. URL https://arxiv.org/abs/1707.06347.

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, and Olivier Bachem. BOND: Aligning LLMs with best-of-N distillation. Computing Research Repository, arXiv:2407.14622, 2024. URL https://arxiv.org/abs/2407.14622.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. Computing Research Repository, arXiv:2408.03314, 2024. URL https://arxiv.org/abs/2408.03314.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf.

Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, and Ahmad Beirami. Asymptotics of language model alignment. Computing Research Repository, arXiv:2404.01730, 2024. URL https://arxiv.org/pdf/2404.01730.

Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021. URL https://aclanthology.org/2021.naacl-main.276.


| Symbol | Type | Explanation |
|---|---|---|
| Σ | alphabet | Σ is a set of symbols |
| y, y′ | Σ* | strings in Σ* |
| x | Σ* | prompt string in Σ* |
| θ | Θ | A real vector representing the parameters of a language model |
| πθ | language model | A language model parameterized by θ |
| πref | language model | A supervised-fine-tuned language model |
| r | Σ* → ℝ | A reward model |
| β | ℝ | Regularization parameter for the KL divergence term |
| F | ℝ → ℝ | A strict cumulative density function of reward values under πref |
| N | ℤ+ | Number of samples used in Bo N algorithm |
| M | ℤ+ | Number of samples used in the MC estimator |

Table 3: A summary of the notation used in the paper

A RELATED WORK

Best-of-N. Bo N is a straightforward alignment-via-inference algorithm to optimize the output of the language model using a trained reward model (Charniak & Johnson, 2005; Stiennon et al., 2020). Despite its simplicity, Bo N performs comparably to, or even better than, other alignment methods, such as RLHF and direct preference optimization (Nakano et al., 2022; Gao et al., 2023; Rafailov et al., 2023). However, as noted by Stiennon et al. (2020), Bo N is an inefficient algorithm due to the reduced throughput at inference time.

Applications. Bo N has been applied successfully at various stages of the development of language models. Meta (2023); Dong et al. (2023) employ iterative supervised fine-tuning on the outputs of the Bo N algorithm to clone its behavior in the model. Pace et al. (2024) leverage Bo N to enhance reward modeling by training the reward model on both the best and worst responses. Additionally, Brown et al. (2024); Snell et al. (2024) explore the scaling laws for alignment-via-inference methods and demonstrate how to use a limited inference budget to achieve alignment.

Best-of-N as an alignment-via-fine-tuning method. Two efforts concurrent to ours have also attempted to convert Bo N to an alignment-via-fine-tuning method. First, Gui et al. (2024) approximate the Bo N distribution by maximizing the likelihood of the Best-of-N response and adjusting the relative likelihood of the Best-of-N and the Worst-of-N response. Second, Sessa et al. (2024), similar to ours, use reinforcement learning to minimize the distance between the language model and the Bo N policy. Different from ours, and to reduce the fine-tuning time, the authors use a crude estimation of log F and approximate the distance to Best-of-N by iteratively distilling the Best-of-2 model as a moving anchor.

B PROOF OF PROP. 1

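Proposition 1. Suppose r: Σ* → ℝ is a one-to-one mapping. Then, the probability of a string y under πbon is given by

$$\pi_{\mathrm{bon}}(y) = \sum_{i=1}^{N} \binom{N}{i} F\big(r(y)\big)^{N-i}\, \pi_{\mathrm{ref}}(y)^{i}, \qquad F\big(r(y)\big) \overset{\mathrm{def}}{=} \mathbb{P}_{y' \sim \pi_{\mathrm{ref}}}\big(r(y') < r(y)\big). \tag{4}$$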

Proof. The proof follows Casella & Berger (2001, Theorem 5.4.3). To compute πbon(y), we first define two events: (i) the event that all N samples have rewards less than or equal to r(y), and (ii) the event that all N samples have rewards less than r(y). The probability of those events is as follows:20

$$p_1(y) \overset{\mathrm{def}}{=} \mathbb{P}\big(\text{all } N \text{ samples have rewards} \le r(y)\big) = \Big(F\big(r(y)\big) + \pi_{\mathrm{ref}}(y)\Big)^{N} \tag{10a}$$

$$p_2(y) \overset{\mathrm{def}}{=} \mathbb{P}\big(\text{all } N \text{ samples have rewards} < r(y)\big) = F\big(r(y)\big)^{N}. \tag{10b}$$

Note that for Eq. (10a) to hold, we need the assumption that the reward function is a one-to-one mapping.21 Furthermore, given this assumption, πbon(y) is the probability that at least one of the N sampled strings has reward exactly equal to r(y) and the rest of the samples have rewards less than or equal to r(y). Given how we defined p1 and p2, we have πbon(y) = p1(y) − p2(y), i.e.,

$$\pi_{\mathrm{bon}}(y) = \Big(F\big(r(y)\big) + \pi_{\mathrm{ref}}(y)\Big)^{N} - F\big(r(y)\big)^{N} = \sum_{i=1}^{N} \binom{N}{i} F\big(r(y)\big)^{N-i}\, \pi_{\mathrm{ref}}(y)^{i}. \tag{11}$$
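The closed form in Prop. 1 can be checked numerically on a toy distribution. The sketch below is illustrative only: the three-string vocabulary, the reward values, and the function names `bon_pmf` and `bon_pmf_bruteforce` are made up for this example. It compares the formula against a brute-force enumeration of all N-tuples of samples.

```python
from itertools import product
from math import comb, isclose

def bon_pmf(pi_ref, reward, n):
    """Closed form: sum_{i=1}^{n} C(n, i) * F(r(y))^(n-i) * pi_ref(y)^i.

    Assumes the reward is injective (no ties), as in Prop. 1.
    """
    pmf = {}
    for y, p in pi_ref.items():
        # Strict CDF: mass of strings with reward strictly below r(y).
        F = sum(q for z, q in pi_ref.items() if reward[z] < reward[y])
        pmf[y] = sum(comb(n, i) * F ** (n - i) * p ** i for i in range(1, n + 1))
    return pmf

def bon_pmf_bruteforce(pi_ref, reward, n):
    """Enumerate every n-tuple of draws and credit its highest-reward element."""
    pmf = {y: 0.0 for y in pi_ref}
    for draws in product(pi_ref, repeat=n):
        prob = 1.0
        for y in draws:
            prob *= pi_ref[y]
        best = max(draws, key=lambda y: reward[y])
        pmf[best] += prob
    return pmf

# Toy reference model over three "strings" with distinct rewards.
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.1, "b": 0.7, "c": 1.5}
closed = bon_pmf(pi_ref, reward, n=3)
brute = bon_pmf_bruteforce(pi_ref, reward, n=3)
```

For this toy case the highest-reward string "c" receives probability 1 − 0.8³ = 0.488 under Bo N with N = 3, compared with 0.2 under the reference model.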

C STRATEGIES FOR NON-INJECTIVE REWARD FUNCTIONS

If the reward function is not injective, we need a tie-breaking strategy for the Bo N algorithm. We formalize this as defining a total order ≺_r on Σ* as follows: for any two strings y1 and y2, if r(y1) < r(y2) then we have y1 ≺_r y2. If r(y1) = r(y2) then y1 ≺_r y2 only if y1 ≺ y2, where ≺ is some arbitrary but fixed total order, e.g., lexicographic order. Therefore, we define F(y) as

$$F(y) \overset{\mathrm{def}}{=} \mathbb{P}_{y' \sim \pi_{\mathrm{ref}}}\big(y' \prec_r y\big).$$

We then need to define the two events and their probabilities, p1 and p2, given this total order on strings, as follows:

$$p_1(y) \overset{\mathrm{def}}{=} \mathbb{P}\big(\text{all } N \text{ samples are} \preceq_r y\big) = \big(F(y) + \pi_{\mathrm{ref}}(y)\big)^{N}$$

$$p_2(y) \overset{\mathrm{def}}{=} \mathbb{P}\big(\text{all } N \text{ samples are} \prec_r y\big) = F(y)^{N}.$$

The rest of the proof is the same as with the one-to-one reward functions.
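One simple way to realize such a total order in code is to compare strings by the pair (reward, string), which breaks reward ties lexicographically. The sketch below is an illustration under that assumption; the helper names and the toy reward (string length) are hypothetical.

```python
def total_order_key(y, reward):
    """Sort key for the tie-broken order: reward first, then lexicographic order."""
    return (reward(y), y)

def best_of_n(samples, reward):
    """Return the maximal sample under the tie-broken total order."""
    return max(samples, key=lambda y: total_order_key(y, reward))

def strict_cdf(y, pi_ref, reward):
    """F(y): mass of strings that are strictly below y in the total order."""
    key_y = total_order_key(y, reward)
    return sum(p for z, p in pi_ref.items() if total_order_key(z, reward) < key_y)

# Toy example where "aa" and "bb" tie on reward (reward = string length).
pi_ref = {"aa": 0.4, "bb": 0.4, "c": 0.2}
winner = best_of_n(["bb", "aa", "c"], len)
n = 2
bon = {
    y: (strict_cdf(y, pi_ref, len) + p) ** n - strict_cdf(y, pi_ref, len) ** n
    for y, p in pi_ref.items()
}
```

With the tie broken this way, the Bo N probabilities p1(y) − p2(y) still sum to one, even though two strings share the same reward.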

D PROOF OF THM. 2

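Theorem 2. We have J^VBON(θ) ≥ L(θ), where

$$L(\theta) \overset{\mathrm{def}}{=} (N-1)\, \mathbb{E}_{y \sim \pi_\theta}\Big[\log F\big(r(y)\big)\Big] - D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big). \tag{8}$$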

20The PMF of Bo N is also derived by Beirami et al. (Lemma 1; 2024). In their notation, p1 = F and p2 = F 1. 21If the reward function is not a one-to-one mapping, we need to devise a tie-breaking strategy. See App. C for further discussion.


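Proof. First, we prove J^VBON(θ) ≥ L(θ).

$$
\begin{aligned}
J^{\mathrm{VBON}}(\theta) &= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log \pi_{\mathrm{bon}}(y)\Big] && \text{(14a)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log \sum_{i=1}^{N} \binom{N}{i} F\big(r(y)\big)^{N-i} \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(14b)}\\
&\ge -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log \sum_{i=1}^{1} \binom{N}{i} F\big(r(y)\big)^{N-i} \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(14c)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log N\, F\big(r(y)\big)^{N-1} \pi_{\mathrm{ref}}(y)^{1}\Big] && \text{(14d)}\\
&\ge -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log F\big(r(y)\big)^{N-1} \pi_{\mathrm{ref}}(y)\Big] && \text{(14e)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log \pi_{\mathrm{ref}}(y) - (N-1) \log F\big(r(y)\big)\Big] && \text{(14f)}\\
&= -D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) + (N-1)\, \mathbb{E}_{y \sim \pi_\theta}\Big[\log F\big(r(y)\big)\Big] \overset{\mathrm{def}}{=} L(\theta). && \text{(14g)}
\end{aligned}
$$

The inequality in Eq. (14c) stems from the fact that we drop positive terms in the summation and only keep the first term. Therefore, the lower bound for our objective is:

$$J^{\mathrm{VBON}}(\theta) = -D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{bon}}\big) \ge (N-1)\, \mathbb{E}_{y \sim \pi_\theta}\Big[\log F\big(r(y)\big)\Big] - D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big). \tag{15}$$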

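Another approach to deriving a lower bound is by using Jensen's inequality. By doing so, we arrive at the following theorem.

Theorem 3. Let α = (N+2)(N−1)/2, β = N(N+1)/2, and γ = N(N−1)/2. Then, we have J^VBON(θ) ≥ L1(θ), where we further define

$$L_1(\theta) \overset{\mathrm{def}}{=} \gamma\, \mathbb{E}_{y \sim \pi_\theta}\Big[\log F\big(r(y)\big)\Big] - \alpha\, \mathrm{H}(\pi_\theta) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big). \tag{16}$$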


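Proof.

$$
\begin{aligned}
J^{\mathrm{VBON}}(\theta) &= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log \pi_{\mathrm{bon}}(y)\Big] && \text{(17a)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log \sum_{i=1}^{N} \binom{N}{i} F\big(r(y)\big)^{N-i} \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(17b)}\\
&\ge -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \sum_{i=1}^{N} \log \binom{N}{i} F\big(r(y)\big)^{N-i} \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(17c)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \sum_{i=1}^{N} \log \binom{N}{i} - \sum_{i=1}^{N} \log F\big(r(y)\big)^{N-i} - \sum_{i=1}^{N} \log \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(17d)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \sum_{i=1}^{N} \log \binom{N}{i} - \sum_{i=1}^{N} (N-i) \log F\big(r(y)\big) - \sum_{i=1}^{N} i \log \pi_{\mathrm{ref}}(y)\Big] && \text{(17e)}\\
&\ge -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \tfrac{N(N-1)}{2} \log F\big(r(y)\big) - \tfrac{N(N+1)}{2} \log \pi_{\mathrm{ref}}(y)\Big] && \text{(17f)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \tfrac{N(N+1)}{2} \log \pi_{\mathrm{ref}}(y) - \tfrac{N(N-1)}{2} \log F\big(r(y)\big)\Big] && \text{(17g)}\\
&= -\tfrac{N(N+1)}{2}\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) + \mathbb{E}_{y \sim \pi_\theta}\Big[\tfrac{(N+2)(N-1)}{2} \log \pi_\theta(y) + \tfrac{N(N-1)}{2} \log F\big(r(y)\big)\Big] && \text{(17h)}\\
&= -\tfrac{N(N+1)}{2}\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) + \tfrac{(N+2)(N-1)}{2}\, \mathbb{E}_{y \sim \pi_\theta}\big[\log \pi_\theta(y)\big] + \tfrac{N(N-1)}{2}\, \mathbb{E}_{y \sim \pi_\theta}\Big[\log F\big(r(y)\big)\Big]. && \text{(17i)}
\end{aligned}
$$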

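In Eq. (17c), because −log(x) is convex for x > 0, we applied Jensen's inequality to obtain the bound. Abstracting away from the three multiplicative factors, naming them γ, α and β, and using E_{y∼πθ}[log πθ(y)] = −H(πθ), we end up with the following bound

$$J^{\mathrm{VBON}}(\theta) = -D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{bon}}\big) \ge \gamma\, \mathbb{E}_{y \sim \pi_\theta}\Big[\log F\big(r(y)\big)\Big] - \alpha\, \mathrm{H}(\pi_\theta) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \tag{18}$$

which is a bound for some settings of γ, α and β.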

Importantly, L1 is a looser bound compared to L. We formalize this in the following theorem.

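Theorem 4. For every θ ∈ Θ, we have L(θ) ≥ L1(θ).

Proof. We prove L1(θ) ≤ L(θ), meaning that L is a tighter lower bound. According to Eq. (17f), we have:

$$
\begin{aligned}
L_1(\theta) &= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \sum_{i=1}^{N} \log F\big(r(y)\big)^{N-i} \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(19a)}\\
&\le -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \sum_{i=1}^{1} \log F\big(r(y)\big)^{N-i} \pi_{\mathrm{ref}}(y)^{i}\Big] && \text{(19b)}\\
&= -\mathbb{E}_{y \sim \pi_\theta}\Big[\log \pi_\theta(y) - \log F\big(r(y)\big)^{N-1} \pi_{\mathrm{ref}}(y)\Big] = L(\theta). && \text{(19c)}
\end{aligned}
$$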
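The ordering J^VBON(θ) ≥ L(θ) ≥ L1(θ) can be sanity-checked on a small discrete example. The sketch below is not part of the paper's experiments: the toy distributions, rewards, and helper names are invented for illustration, and the signs used for L1 follow the derivation from E[log πθ] = −H(πθ).

```python
import math
from math import comb

def bon_pmf(pi_ref, reward, n):
    """Bo N distribution for an injective reward (closed form of Prop. 1)."""
    pmf = {}
    for y, p in pi_ref.items():
        F = sum(q for z, q in pi_ref.items() if reward[z] < reward[y])
        pmf[y] = sum(comb(n, i) * F ** (n - i) * p ** i for i in range(1, n + 1))
    return pmf

def expect(pi, f):
    """Expectation of f under pi, skipping zero-probability outcomes."""
    return sum(p * f(y) for y, p in pi.items() if p > 0)

def kl(p, q):
    return expect(p, lambda y: math.log(p[y] / q[y]))

def objective_and_bounds(pi_theta, pi_ref, reward, n):
    F = {y: sum(q for z, q in pi_ref.items() if reward[z] < reward[y]) for y in pi_ref}
    e_log_F = expect(pi_theta, lambda y: math.log(F[y]) if F[y] > 0 else -math.inf)
    entropy = -expect(pi_theta, lambda y: math.log(pi_theta[y]))
    kl_ref = kl(pi_theta, pi_ref)
    alpha, beta, gamma = (n + 2) * (n - 1) / 2, n * (n + 1) / 2, n * (n - 1) / 2
    j = -kl(pi_theta, bon_pmf(pi_ref, reward, n))            # J^VBON
    l = (n - 1) * e_log_F - kl_ref                           # bound L
    l1 = gamma * e_log_F - alpha * entropy - beta * kl_ref   # bound L1
    return j, l, l1

pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.1, "b": 0.7, "c": 1.5}
pi_theta = {"a": 0.0, "b": 0.4, "c": 0.6}  # puts no mass on the worst string
j, l, l1 = objective_and_bounds(pi_theta, pi_ref, reward, n=3)
```

In this toy setting L is far closer to J^VBON than L1 is, which matches the claim that L is the tighter of the two bounds.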


| Hyperparameter | Value |
|---|---|
| Episodes | 10000 |
| Optimizer | AdamW (ϵ = 1e−5, lr = 3e−6) |
| Scheduler | Linear |
| Batch Size | 32 |
| β (Both for v Bo N and KL-constrained RL objective) | 0.05 |
| γ (Discount Factor) | 1 |
| λ (for GAE) | 0.95 |
| Number of PPO Update Iteration Per Epoch | 4 |
| PPO's Policy Clipping Coefficient | 0.2 |
| Value Clipping Coefficient | 0.2 |
| Value Function Coefficient | 0.2 |
| Value Function Loss Clipping | True |
| Sampling Temperature | 0.7 |

E VBON PSEUDOCODE

Algorithm 1 The v Bo N algorithm

procedure VBON(πref, r, N, E, B)    ▷ D: the prompt dataset, E: number of epochs, B: batch size
    Initialize πθ with πref
    for E epochs :
        for each batch in D :
            y(1), ..., y(B) ∼ πθ(· | x)    ▷ Sample 1 response for each prompt in the batch
            Compute r(y(1)), ..., r(y(B))
            Compute F(r(y(1))), ..., F(r(y(B)))
            Optimize πθ with Eq. (5) (or Eq. (8)) using PPO
    return πθ
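The "Compute F" and "Optimize" steps of Algorithm 1 can be illustrated for the lower-bound objective of Eq. (8). The sketch below is a simplified, hypothetical rendering rather than the released implementation: it estimates F(r(y)) as the fraction of M reference-sample rewards strictly below r(y), and returns the per-sample quantity whose expectation under πθ is L(θ). The `eps` floor that avoids log(0) and all function names are assumptions.

```python
import math
from bisect import bisect_left

def estimate_log_F(r_y, ref_rewards_sorted, eps=1e-6):
    """Monte Carlo estimate of log F(r(y)) from sorted rewards of pi_ref samples.

    bisect_left counts the reference rewards strictly smaller than r_y.
    The eps floor is an assumption added to keep the logarithm finite.
    """
    frac = bisect_left(ref_rewards_sorted, r_y) / len(ref_rewards_sorted)
    return math.log(max(frac, eps))

def lower_bound_signal(r_y, logp_theta, logp_ref, ref_rewards_sorted, n):
    """Per-sample term of L(theta): (N-1) log F(r(y)) - (log pi_theta - log pi_ref)."""
    log_F = estimate_log_F(r_y, ref_rewards_sorted)
    return (n - 1) * log_F - (logp_theta - logp_ref)

# Toy usage with M = 4 reference rewards (values are illustrative).
ref_rewards = sorted([0.1, 0.4, 0.4, 0.9])
signal = lower_bound_signal(
    r_y=0.9, logp_theta=-2.0, logp_ref=-3.0, ref_rewards_sorted=ref_rewards, n=4
)
```

In an actual training loop this scalar would be fed to a policy-gradient method such as PPO, as Algorithm 1 states; that part is omitted here.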

F EXPERIMENTAL DETAILS

Hyperparameter sweep in the sentiment experiment. To visualize the trade-off between the expected rewards and KL divergence, we vary the degree of regularization using the following hyperparameters for each method:

- Bo N-SFT: N ∈ [10, 50, 90, 130, 170, 210, 250, 290, 330, 370, 410, 450, 490, 530, 570, 600] with 2 different seeds, resulting in 32 runs.
- PPO: β ∈ [0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1., 2., 3., 4., 5.] with 2 different seeds, resulting in 32 runs.
- DPO: β ∈ [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1., 2., 3., 4., 5.] with 3 different seeds, resulting in 33 runs.
- Bo NBo N and v Bo N: N ∈ [1, 2, 3, 4, 8, 16, 32, 64, 128, 256, 512] with 3 different seeds, resulting in 33 runs.
- v Bo N with L bound: β ∈ [0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1., 2., 3., 4., 5.] with 2 different seeds, resulting in 32 runs. Note that comparing Eq. (5) and Eq. (1), we have N = 1

PPO hyperparameters. In App. F, we include the hyperparameters used with the PPO algorithm for the summarization experiment.

G COMPARING THE VBON OBJECTIVE AND L LOWER BOUND

We compare the performance of models fine-tuned with the v Bo N objective and its lower bound (L) in Fig. 6. We observe that the performance of the models is very close to each other.


Figure 6: Comparing models trained with the v Bo N objective and its lower bound (L). We observe that the performance of the two methods is very close to each other.


(a) 4% of points on Pareto front belong to Bo NBo N, 4% to PPO, 42% to DPO, and 50% to v Bo N.

(b) 7% of points on Pareto front belong to Bo NBo N, 10% to DPO, 33% to PPO, and 50% to v Bo N.

Figure 7: Steering generated movie reviews towards positive sentiment. Points that are not on the Pareto front have lower opacity.
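For reference, membership on the Pareto front of reward versus KL divergence can be computed as in the sketch below. It assumes each run is summarized by a (KL, reward) pair, that lower KL and higher reward are both preferred, and that a point is on the front if no other point is at least as good on both axes and strictly better on one. The function name and the sample points are illustrative.

```python
def pareto_front(points):
    """Return the non-dominated (kl, reward) pairs, sorted by increasing KL.

    After sorting by KL (and by descending reward within equal KL), a point is
    on the front exactly when its reward beats every reward seen so far.
    """
    front = []
    best_reward = float("-inf")
    for kl, reward in sorted(points, key=lambda p: (p[0], -p[1])):
        if reward > best_reward:
            front.append((kl, reward))
            best_reward = reward
    return front

# Illustrative runs: (KL divergence to the reference model, average reward).
runs = [(1.0, 0.5), (2.0, 0.4), (2.0, 0.9), (3.0, 0.8), (0.5, 0.1)]
front = pareto_front(runs)
```

The percentages quoted in the sub-captions would then correspond to the share of front points contributed by each method.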

H ADDITIONAL EXPERIMENTS WITH BON-SFT

We further experiment with training with the maximum likelihood objective on Bo N generations when varying N. The results are depicted in Fig. 7. We observe that Bo N-SFT diverges too much from the reference model compared to other fine-tuning methods for alignment.

I QUALITATIVE RESULTS


Table 4: An example of summaries sampled at temperature 0.25 and their corresponding reward obtained from the evaluator reward model.

Content Reward

SUBREDDIT: r/relationship_advice TITLE: Stuck in a rut and in need of advice/inspiration! POST: My boyfriend and I have been together for 3 years, and living together for 2. I'm quite the homebody, and when we first met, he was very outgoing and loved partying and socialising (although he was a student at the time). We're both working now, and most nights we find ourselves doing the same things: watching series (luckily we enjoy the same shows), playing Minecraft or playing various board games. We're tired after work, and can't bring ourselves to leave the house. The weekend is much the same - lots of sleep, or sitting around staring at one screen or another. We do party occasionally (we'll head to a pub once every few months) and there are a few mutual friends we enjoy spending time with, but I worry that we've become stuck in our boring ways. I really enjoy our lifestyle, and would be quite happy to never leave the house again, but I'm starting to feel guilty for turning him into a 50 year-old when he's only 24. Any ideas for shaking things up a little? Bear in mind that we live in a small town in South Africa, and neither of us has a car.

SFT: I'm stuck in a rut with my boyfriend, and I'd like to shake things up a little. Any ideas?

PPO: In need of inspiration for moving past boring routines and becoming active and open minded individuals. Any ideas welcome!

v Bo N: Been dating for three years and living together for two. Feeling stuck in rut after working hard all week. Looking for ideas to shake things up!

Bo N: Been together for 3 years, and living together for 2. We're both working now, and find ourselves doing the same things: watching series, playing games, or having sex. We're tired after work, and can't bring ourselves to leave

Table 5: An example of summaries sampled at temperature 0.5 and their corresponding reward obtained from the evaluator reward model.

Content Reward

SUBREDDIT: r/relationship_advice TITLE: Stuck in a rut and in need of advice/inspiration! POST: My boyfriend and I have been together for 3 years, and living together for 2. I'm quite the homebody, and when we first met, he was very outgoing and loved partying and socialising (although he was a student at the time). We're both working now, and most nights we find ourselves doing the same things: watching series (luckily we enjoy the same shows), playing Minecraft or playing various board games. We're tired after work, and can't bring ourselves to leave the house. The weekend is much the same - lots of sleep, or sitting around staring at one screen or another. We do party occasionally (we'll head to a pub once every few months) and there are a few mutual friends we enjoy spending time with, but I worry that we've become stuck in our boring ways. I really enjoy our lifestyle, and would be quite happy to never leave the house again, but I'm starting to feel guilty for turning him into a 50 year-old when he's only 24. Any ideas for shaking things up a little? Bear in mind that we live in a small town in South Africa, and neither of us has a car.

SFT: I'm stuck in a rut, and need to shake things up to see if it'll work out. Any advice? 3.08

PPO: In need of inspiration to break out of rut and live life fully! Any ideas welcome! 4.59

v Bo N: Been happily living together for 2yr+, feeling bored after work regularly, looking for ideas to spice things up!

Bo N: My boyfriend and I have been together for 3 years, and are both working full time. We spend most of our time in the house, and have become boring. What can we do to shake things up?


Table 6: An example of summaries sampled at temperature 0.75 and their corresponding reward obtained from the evaluator reward model.

Content Reward

SUBREDDIT: r/relationship_advice TITLE: Stuck in a rut and in need of advice/inspiration! POST: My boyfriend and I have been together for 3 years, and living together for 2. I'm quite the homebody, and when we first met, he was very outgoing and loved partying and socialising (although he was a student at the time). We're both working now, and most nights we find ourselves doing the same things: watching series (luckily we enjoy the same shows), playing Minecraft or playing various board games. We're tired after work, and can't bring ourselves to leave the house. The weekend is much the same - lots of sleep, or sitting around staring at one screen or another. We do party occasionally (we'll head to a pub once every few months) and there are a few mutual friends we enjoy spending time with, but I worry that we've become stuck in our boring ways. I really enjoy our lifestyle, and would be quite happy to never leave the house again, but I'm starting to feel guilty for turning him into a 50 year-old when he's only 24. Any ideas for shaking things up a little? Bear in mind that we live in a small town in South Africa, and neither of us has a car.

SFT: Boyfriend and I are stuck in a rut. We don't do anything fun, and I'm starting to feel guilty for not even trying to get away from it all.

PPO: In rutting boredom with boring life; any ideas for shaking things up a little? Any ideas for improving relationships?

v Bo N: Been happily cohabiting for three years with boyfriend of two years. Feeling guilty for turning him into an old man! Any ideas for shaking things up?

Bo N: Boyfriend and I are a little boring, and want to shake things up so we can go out and have fun but we're too stuck on the idea of staying in all day.

Table 7: An example of summaries sampled at temperature 1. and their corresponding reward obtained from the evaluator reward model.

Content Reward

SUBREDDIT: r/relationship_advice TITLE: Stuck in a rut and in need of advice/inspiration! POST: My boyfriend and I have been together for 3 years, and living together for 2. I'm quite the homebody, and when we first met, he was very outgoing and loved partying and socialising (although he was a student at the time). We're both working now, and most nights we find ourselves doing the same things: watching series (luckily we enjoy the same shows), playing Minecraft or playing various board games. We're tired after work, and can't bring ourselves to leave the house. The weekend is much the same - lots of sleep, or sitting around staring at one screen or another. We do party occasionally (we'll head to a pub once every few months) and there are a few mutual friends we enjoy spending time with, but I worry that we've become stuck in our boring ways. I really enjoy our lifestyle, and would be quite happy to never leave the house again, but I'm starting to feel guilty for turning him into a 50 year-old when he's only 24. Any ideas for shaking things up a little? Bear in mind that we live in a small town in South Africa, and neither of us has a car.

SFT: My boyfriend and I used to party all the time and are ending up reverting to the boring old lifestyle of a 24-year-old. How can we both break out of a rut and get some fun back in our relationship?

PPO: Taken classic boring suburban lifestyle and turned into boring 24yo. Looking for ideas for breaking out of rut and refreshing lifestyle. Any ideas welcome!

v Bo N: boyfriend and I seem to have fallen into a rut of regularly doing mundane activities after working together non-stop for several months! ideas to spice things up?

Bo N: in a relationship that's getting a bit stale, looking for some inspiration to make changes on a whim in hopes of rejuvenating it!