
Design Considerations in Offline Preference-based RL

Alekh Agarwal 1 Christoph Dann 1 Teodor V. Marinov 1

Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.

1. Introduction

The now substantial literature on Reinforcement Learning with Human Preferences (RLHF) can be broadly categorized into two families of methods. Given a dataset of human preferences, the first class of online algorithms is based on learning a reward function that assigns numerical scores to a response y given some input x, such that the high-scoring responses are preferred over the low-scoring ones in our preference dataset. These methods subsequently maximize this reward function using an online RL algorithm like PPO (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022) or Reinforce (Ahmadian et al., 2024). A different approach forgoes the reward learning step and uses a reparameterization trick to directly learn a good policy from the preference dataset, with Direct Preference Optimization (DPO) (Rafailov et al., 2024b) being the pioneering approach in this line of work. These approaches are referred to as offline or direct alignment, since they do not draw any fresh samples from the learned policy during training, and only use the responses observed in the preference dataset. This paper focuses on this latter family of algorithms, and studies how the properties of the data and the learning objective affect the quality of the learned policy.

*Equal contribution 1Google Research. Correspondence to: Alekh Agarwal <alekhagarwal@google.com>, Christoph Dann <chrisdann@google.com>, Teodor V. Marinov <tvmarinov@google.com>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Offline methods for RLHF such as DPO (Rafailov et al., 2024b), IPO (Azar et al., 2024), SLiC (Zhao et al., 2022; 2023) and KTO (Ethayarajh et al., 2024), along with a growing number of variants, have received significant attention in the academic literature owing to a number of attractive properties. First, the removal of an explicit reward learning step reduces the number of modeling choices and steps required for the RLHF pipeline, along with reducing the demand on computational resources. Furthermore, the requirement to only evaluate a policy's likelihood on a fixed set of responses in the preference dataset, as opposed to generating responses from the learned policy as in online RL, adds further resource efficiency. However, the growth of this literature has also spurred an equally large body of work detailing the various deficiencies of these techniques, such as the tendency of the algorithms to shift the probability mass outside the support of the observed responses in the preference dataset, significant preference hacking behaviors and notorious collapses in the learning dynamics with continued training (Pal et al., 2024; Park et al., 2024; Rafailov et al., 2024a; Fisch et al., 2024).

While some of the issues raised above have received theoretical treatment for specific approaches, the literature still lacks a comprehensive theoretical foundation underlying these offline RLHF techniques. Most of these techniques have been motivated by some variant of the original reparameterization argument in the DPO paper, but the argument hinges on unformalized assumptions regarding the coverage of the data with respect to the learned policy, and does not capture all the algorithmic variants which have subsequently been developed in the literature.

In this paper, we instead adopt the perspective of offline RLHF as just solving a loss minimization problem on a preference dataset, and study when the optimal solution

Offline Learning Preference-based RL

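| Method | Base policy $\mu(y \mid x)$ | Loss $\ell$ | Constraint/Regularizer |
|---|---|---|---|
| DPO (Rafailov et al., 2024b) | $\pi_{\mathrm{ref}}(y \mid x)$ | $\ell(z) = -\log\frac{1}{1+\exp(-\beta z)}$ | - |
| IPO (Azar et al., 2024) | $\pi_{\mathrm{ref}}(y \mid x)$ | $\ell(z) = (z - \tau)^2$ | - |
| SLiC-HF (Zhao et al., 2023) | $1$ | $\ell(z) = \max\{\tau - z, 0\}$ | $\mathrm{CE}(\pi_{\mathrm{ref}}, \pi)$ |
| GPO (Tang et al., 2024) | $\pi_{\mathrm{ref}}(y \mid x)$ | $\ell(z) = f(\beta z)$ | - |
| CPO (Xu et al., 2024) | $1$ | $\ell(z) = -\log\frac{1}{1+\exp(-\beta z)}$ | |
| R-DPO (Park et al., 2024) | $\pi_{\mathrm{ref}}(y \mid x)\exp\left(\frac{\alpha}{\beta}\lvert y\rvert\right)$ | $\ell(z) = -\log\frac{1}{1+\exp(-\beta z)}$ | |
| ODPO¹ (Amini et al., 2024) | $\pi_{\mathrm{ref}}(y \mid x)$ | $\ell(z) = -\log\frac{1}{1+\exp(-\beta z+\tau)}$ | |
| SimPO² (Meng et al., 2024) | $1$ | $\ell(z) = -\log\frac{1}{1+\exp(-\beta z)} - \gamma$ | |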

Table 1. A collection of methods for offline RLHF from preference feedback, along with the instantiation of the different design choices that make them a special case of our general framework. ODPO and SimPO are not covered in full generality by our framework, as discussed in the footnotes.

of this loss minimization yields a desirable policy. A key challenge in undertaking such a study is the identification of a benchmark policy which is both desirable in terms of the responses it generates, and is attainable under reasonable assumptions using offline RLHF. With this background, our paper makes the following contributions.

1. We identify a benchmark policy to measure the performance of offline RLHF against. Prior work (Swamy et al., 2024) shows that the class of offline techniques considered here cannot attain the optimal policy for online RLHF in general, and we instead develop a weaker benchmark under reasonable assumptions on the data generating process and the learning setup. For DPO, this benchmark does correspond to the optimal softmax policy when the preference data follows the Bradley-Terry-Luce (Bradley & Terry, 1952; Luce, 2012) model.

2. We provide a bound on the sub-optimality of the learned policy to the yardstick identified above in a learning framework that encapsulates most existing offline RLHF variants. The bound depends on the curvature of the loss function, as well as the coverage of the offline dataset.

3. We corroborate some of our theoretical findings using an empirical study on a summarization task, and find that the squared loss of IPO outperforms the logistic loss of DPO due to its nicer curvature properties, while dropping the normalization with reference policy likelihood causes a small but consistent deterioration in the quality of the learned policy, as predicted by our theory.

2. Problem Setup

Preference-based learning. We consider the setting of learning from offline preference data. In this setting, we are typically given a dataset of samples where each sample consists of an input $x \in \mathcal{X}$ with $x \sim D_x$, two responses $(y, y') \in \mathcal{Y} \times \mathcal{Y}$ drawn i.i.d. from $D_y(\cdot|x)$, and a binary label $\omega \in \{-1, 1\}$ with $\omega \sim D_\omega(\cdot|x, y, y')$. We use the shorthand $(x, y, y', \omega) \sim D_{xy\omega}$ to succinctly denote samples from this generative process. The label $\omega$ captures our preference for the response $y$ over $y'$, for the input $x$. We study offline preference-based RL methods, which are parameterized by some loss function $\ell: [-R, R] \to [0, \infty)$. In particular, we study methods which take as input a policy class $\Pi \subseteq \{\mathcal{X} \to \Delta(\mathcal{Y})\}$, and find a policy which minimizes the following loss, given $n$ samples:

$$\hat\pi = \operatorname*{argmin}_{\pi \in \Pi} \sum_{i=1}^{n} \ell\Bigg(\omega_i \underbrace{\bigg(\log\frac{\pi(y_i|x_i)}{\mu(y_i|x_i)} - \log\frac{\pi(y'_i|x_i)}{\mu(y'_i|x_i)}\bigg)}_{:=\,\omega^{\pi,\mu}_i}\Bigg) \tag{1}$$

Here $\mu: \mathcal{X} \to \mathbb{R}^{\mathcal{Y}}_{+}$ refers to an arbitrary base policy, which does not need to be normalized. Typically $\ell$ is chosen so that it incentivizes $\omega_i \omega^{\hat\pi,\mu}_i$ to be positive. This means that $\hat\pi$ tends to increase the probability of producing $y_i$ compared with $y'_i$, relative to the base probabilities under $\mu$, when $\omega_i = 1$. In the particularly simple case when $\mu$ is just the uniform policy, we see that $\hat\pi$ results in a larger probability of producing $y_i$ over $y'_i$, when $\omega_i = 1$, which captures the basic intuition behind offline preference-based RL.

¹ODPO in the general case allows a margin of the form $f(\mathrm{score}(x, y) - \mathrm{score}(x, y'))$, which is not admissible in our setup as specified, though it can be incorporated in our analysis with some additional work. When $f$ is the identity function, we can simply define $\mu(y|x) \propto \pi_{\mathrm{ref}}(y|x)\,\mathrm{score}(x, y)$, as a special case.

²SimPO in the general case has an additional length normalization, leading to the loss $-\log\big(\omega\,\mathrm{sigmoid}(\beta \log \pi(y|x)/|y| - \beta \log \pi(y'|x)/|y'|) - \gamma\big)$. This length normalization is currently not included in our theoretical setup and analysis.

Choice of policy class. A common policy parameterization is to use softmax policies $\pi_\theta(y|x) \propto \exp(f_\theta(y|x))$ with some parameter set $\theta \in \Theta$, where $f$ is some fixed network architecture. While we could explicitly define the policy class $\Pi$ this way, we intentionally leave $\Pi$ general at this point, to allow our framework to further include:

Additional constraints (1): Several works (e.g. Zhao et al. (2023); Xu et al. (2024)) regularize or constrain the preference objective of Equation 1 with a cross-entropy term based on $\mathrm{CE}(\pi_0, \pi) := -\mathbb{E}_{x \sim D_x,\, y \sim \pi_0(\cdot|x)}[\ln \pi(y|x)]$, where $\pi_0$ is some policy of a reasonable quality. This is done by adding $\alpha\,\mathrm{CE}(\pi_0, \pi)$ to the objective or a constraint that $\mathrm{CE}(\pi_0, \pi) \le \lambda$. The addition keeps the loss optimization from degenerating when the distribution $D$ underlying our data has a limited support, and the optimal policy $\hat\pi$ might produce responses outside the support of $D_{xy}$, where we do not have preference feedback. We capture such additional constraints or regularizers as optimization with a reduced policy class $\Pi$.

Early stopping: A different way to constrain the extent of optimization is directly in the parameter space. Most often the preferred optimizer is a first-order method such as Gradient Descent, AdaGrad, Adam or AdaFactor. Early stopping optimization with first-order methods directly corresponds to a constraint in the parameter space $\Theta$ (Yao et al., 2007; Raskutti et al., 2014; Neu & Rosasco, 2018; Suggala et al., 2018; Vaskevicius et al., 2019; Sonthalia et al., 2024). For example, if GD is used as the optimizer, early stopping would correspond to an $\ell_2$ distance bound in parameter space between the parameters underlying $\pi$ and the initial policy $\pi_0$ at the start of optimization. Early stopping or limited training in typical fine-tuning setups is a common practice that we again capture through an appropriate choice for $\Pi$, e.g. with additional $\ell_2$ constraints on the policy parameters.
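The link between early stopping and an $\ell_2$ ball around the initialization can be illustrated with a minimal sketch. It assumes plain gradient descent on a toy linear preference model with the logistic loss and a uniform base policy; the synthetic data and the helper names `logistic_pref_grad` and `run_gd` are hypothetical and not taken from the paper. The sketch only checks the elementary triangle-inequality bound (distance from the initial parameters is at most the accumulated step lengths), not the sharper correspondences in the cited works.

```python
import numpy as np

def logistic_pref_grad(theta, feat_diffs, omega):
    """Gradient of the mean of log(1 + exp(-omega_i * <theta, d_i>)).

    feat_diffs[i] = phi(x_i, y_i) - phi(x_i, y'_i) for a hypothetical linear
    model with a uniform base policy (toy setup, an assumption of this sketch).
    """
    margins = omega * (feat_diffs @ theta)
    # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)); the chain rule adds omega_i * d_i
    weights = -omega / (1.0 + np.exp(margins))
    return (weights[:, None] * feat_diffs).mean(axis=0)

def run_gd(theta0, feat_diffs, omega, step_size, num_steps):
    """Plain gradient descent, tracking distance from theta0 and path length."""
    theta = theta0.copy()
    path_length = 0.0
    distances, path_lengths = [], []
    for _ in range(num_steps):
        grad = logistic_pref_grad(theta, feat_diffs, omega)
        theta = theta - step_size * grad
        path_length += step_size * np.linalg.norm(grad)
        distances.append(np.linalg.norm(theta - theta0))
        path_lengths.append(path_length)
    return np.array(distances), np.array(path_lengths)

rng = np.random.default_rng(0)
feat_diffs = rng.normal(size=(200, 5))
omega = np.where(feat_diffs @ rng.normal(size=5) > 0, 1.0, -1.0)
distances, path_lengths = run_gd(np.zeros(5), feat_diffs, omega,
                                 step_size=0.5, num_steps=100)
print(distances[[9, 49, 99]])     # distance from initialization after 10, 50, 100 steps
print(path_lengths[[9, 49, 99]])  # accumulated step lengths, an upper bound on the above
```

Stopping after fewer steps therefore keeps the parameters inside a smaller ball around the initialization, which is one way to read the restriction on $\Pi$ described above.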

We assume for simplicity that the policy class $\Pi$ and the base policy $\mu$ are chosen such that $\log\frac{\pi(y|x)}{\mu(y|x)} - \log\frac{\pi(y'|x)}{\mu(y'|x)} \in [-R, R]$ for $x, y, y'$ drawn according to the generative process described above, with probability 1. Consequently, the loss on each sample is also bounded by some $B$, and we expect that the empirical loss minimizer $\hat\pi$ from (1) also has small population loss:

$$L_\mu(\hat\pi; D_{xy\omega}) - \min_{\pi \in \Pi} L_\mu(\pi; D_{xy\omega}) \le \epsilon_n, \quad \text{with} \tag{2}$$

$$L_\mu(\pi; D_{xy\omega}) = \mathbb{E}_{(x,y,y',\omega) \sim D_{xy\omega}}[\ell_\mu(\pi; \omega, x, y, y')],$$

$$\ell_\mu(\pi; \omega, x, y, y') = \ell(\omega\,\omega^{\pi,\mu}(x, y, y')).$$

The error bound $\epsilon_n$ typically scales as $O(\sqrt{B d_\pi / n})$, with $d_\pi$ being the statistical complexity of the policy class, such as $\ln|\Pi|$ for finite classes, log-covering number for the infinite case, or other complexity measures like Rademacher or Gaussian complexities. We abstract such treatment into a general error term $\epsilon_n$, as this analysis of the empirical loss minimizer is standard and not a key focus of our study.

Existing offline RLHF methods. The formalization of preference-based learning above captures a wide range of existing offline RLHF methods through appropriate choices of the base policy $\mu$ and loss $\ell$. Table 1 presents a selection of methods that fit the setup and which our formulation captures. We note that the methods predominantly vary along three axes: the loss function $\ell$, the choice of the base policy $\mu$ and the choice of the constraint or regularizer to guide the policy optimization. In the next section, we present a theoretical framework and our main technical results on the quality of the learned policy $\hat\pi$, with an emphasis on understanding the effects of these choices. We note here that the GPO paper of Tang et al. (2024) considers an almost identical set of design choices (other than the flexibility in $\mu$ and the choice of $\Pi$) as our work, but their emphasis is on empirical evaluation while we seek to understand the design space in theory.

While our setup is general and captures many existing RLHF methods, there are some approaches which are closely related but do not quite fit our framework. S-DPO (Chen et al., 2024) samples multiple responses with a single response that is preferred to all others. Ramesh et al. (2024) study a version of the problem where multiple pairs of responses are sampled from multiple groups and the objective is to minimize the worst-case expected preference loss across groups. SimPO (Meng et al., 2024) nearly matches the DPO objective; however, the authors include additional length normalization, leading to the loss $-\log\big(\omega\,\mathrm{sigmoid}(\beta \log \pi(y|x)/|y| - \beta \log \pi(y'|x)/|y'|) - \gamma\big)$. There is no straightforward way to incorporate the length normalization in our analysis. One option is to incorporate the normalization into the reward class; however, this also affects the policy class definition. We note that the other main modification in SimPO, which is to remove the normalization by the log-likelihoods of the reference policy, is indeed covered by our analysis through the choice of $\mu$.

3. Analysis Framework and Main Results

In this section, we set up a framework for analyzing offline preference learning algorithms which optimize (1), and present our main results. We begin with a discussion to set up the performance criterion.

3.1. Analysis Framework

Performance criterion. How should we measure the efficacy of a preference-based learning technique? As described above, based on our choices of $\ell$, $\mu$ and $\Pi$, we get a guarantee on the expected loss of the resulting policy. But we want to measure how well the policy $\hat\pi$ does in terms of producing highly preferred outputs $y$, given inputs $x$. It is not clear that a policy which has a small loss also produces good outputs. For instance, suppose that the learned $\hat\pi$ is such that $\mathbb{E}[\omega|x, y, y']\,\omega^{\hat\pi,\mu}(x, y, y') > 0$ for any $x, y, y'$ in the support of our training distribution, and the base policy $\mu$ is uniform. In this case, we can conclude that $\hat\pi(y|x) > \hat\pi(y'|x)$ whenever $\mathbb{E}[\omega|x, y, y'] > 0$. But this does not preclude the two probabilities from being extremely close, though correctly ordered, and more generally $\hat\pi$ might still place a non-trivial probability on the least desirable outputs $y$ for many inputs $x$.

Ideally, we would like to say that $\hat\pi$ places most of the mass on the most desirable outputs $y$. Since the conditional probabilities $D_\omega(\cdot|x, y, y')$ can be interpreted as a preference function $P(y \succ y'|x)$, one notion of an optimal policy is provided by the Nash equilibrium policy for the two-player game encoded by this preference function, as considered in prior works (Wang et al., 2023; Munos et al., 2023; Swamy et al., 2024). However, as Swamy et al. (2024) show by providing a lower bound for the special case of DPO, this optimal solution cannot be attained by minimizers of the objective (1) in general. Consequently, we need a different yardstick to measure our performance for the setup of offline preference-based RL, which we develop next. We begin by introducing some useful notation and the formal assumptions needed to define the benchmark policy.

Given the form of the loss in Equation 1, we have to reason about the log probabilities as the main object of interest. It is therefore convenient to denote for each policy $\pi$ its log probabilities by $R_\pi$ with $R_\pi(x, y) = \log \pi(y|x)$, and conversely by $\pi_R$ the policy associated with $R$. We will also refer to such $R$ as a reward function since it measures the quality of the output $y$ for input $x$, and the policy $\pi_R$ ascribes higher probability to outputs with high rewards under $R$. For the policy class $\Pi$ we can then define the accompanying reward class as $\mathcal{R} = \{R_\pi : \pi \in \Pi\}$. Note that this is only nomenclature and is not a modeling assumption such as a reward-based Bradley-Terry-Luce (BTL) model of preferences in the data generating process.

Modeling assumptions. We start by assuming that the log-probabilities of all policies and the base policy $\mu$ are bounded, which ensures that the inputs of the loss function are in a bounded range. Additionally, we assume that the loss outputs are also bounded, which holds for all practical loss functions given the bounded domain $[-R, +R]$.

Assumption 3.1. For all $x, y, y'$ and all $\pi \in \Pi$, we have $|R_\pi(x, y)| \le \frac{R}{4}$, $|\log \mu(y|x)| \le \frac{R}{4}$ and $\ell_\mu(\pi; \omega, x, y, y') \le B$.

Given the policy class Π, we now make a realizability assumption on the data generating mechanism with respect to this class.

Assumption 3.2 (Realizability). There exists $\pi^\star \in \Pi$ such that for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $y' \in \mathcal{Y}$:

$$\omega^{\pi^\star,\mu}(x, y, y') = \operatorname*{argmin}_{v \in [-R, R]} \mathbb{E}_{D_\omega}[\ell(\omega v)\,|\,x, y, y']$$

We discuss in Appendix B.1 how this assumption can be relaxed to an approximate form of realizability, and still imply similar performance guarantees with a slightly different structural assumption on the loss function $\ell$. We focus on exact realizability in the main text for the cleanest analysis. A necessary condition for Assumption 3.2 to hold is that there is a fixed policy $\pi^\star$ which minimizes the loss $\ell_\mu$ in a pointwise manner for all $x, y, y'$, when we take conditional expectation only over the preference labels $\omega$. The assumption further requires that within the range of $R_\pi$ which parameterizes $\pi \in \Pi$ we have that $\pi^\star$ is the pointwise minimizer of $\ell(\omega v)$. This makes the optimal policy $\pi^\star$ independent of the distributions $D$ and $D_y$ over $x, y, y'$. To further understand why it is helpful to have such an optimal policy $\pi^\star$, we make a standard calibration assumption on the loss function $\ell$ in Equation 1.

Assumption 3.3 (Proper loss). We assume that the loss function $\ell$ is a proper loss for class probability estimation (Reid & Williamson, 2010). That is, there is a function $g_\ell$ which depends only on $\ell$, such that for all $\eta \in [0, 1]$: $\operatorname*{argmin}_v \; \eta\,\ell(v) + (1 - \eta)\,\ell(-v) = g_\ell(\eta)$.

That is, when we take conditional expectation over the binary label according to probability $\eta$, then the minimizer of the loss correctly recovers some loss-dependent function of $\eta$. This condition is satisfied by most commonly used differentiable losses for binary classification such as the logistic loss, squared loss, squared hinge loss etc. For instance, the function $g_\ell$ is given by $g_{\mathrm{sq}}(\eta) = 2\eta - 1$ for squared loss and $g_{\log}(\eta) = \ln(\eta/(1 - \eta))$ for the logistic loss.
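The two link functions quoted above can be checked numerically. The following is a minimal sketch, assuming the logistic loss $\ell(z) = \log(1 + e^{-z})$ and the squared loss $\ell(z) = (z - 1)^2$ (the $\beta = 1$ forms of the losses used later in the experiments); the grid search and the helper name `conditional_risk_minimizer` are illustrative choices, not part of the paper.

```python
import numpy as np

def logistic_loss(z):
    return np.log1p(np.exp(-z))

def squared_loss(z):
    return (z - 1.0) ** 2

def conditional_risk_minimizer(loss, eta, grid):
    """Grid search for argmin_v  eta * loss(v) + (1 - eta) * loss(-v)."""
    risk = eta * loss(grid) + (1.0 - eta) * loss(-grid)
    return grid[np.argmin(risk)]

grid = np.linspace(-5.0, 5.0, 100001)  # grid spacing 1e-4
eta = 0.8                              # probability that the label is +1

v_log = conditional_risk_minimizer(logistic_loss, eta, grid)
v_sq = conditional_risk_minimizer(squared_loss, eta, grid)

print(v_log, np.log(eta / (1.0 - eta)))  # grid minimizer vs. g_log(eta)
print(v_sq, 2.0 * eta - 1.0)             # grid minimizer vs. g_sq(eta)
```

Up to the grid resolution, the minimizers agree with $g_{\log}(\eta)$ and $g_{\mathrm{sq}}(\eta)$.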

Realizability, proper losses, and optimal policies. When we use proper losses, the realizability condition takes a particularly intuitive form when the data generating process and the $g_\ell$ function underlying the loss $\ell$ agree with each other. For instance, suppose we use the logistic loss and the preferences are generated according to a BTL model: $P(\omega = 1|x, y, y') = 1/(1 + \exp(R^\star(x, y') - R^\star(x, y)))$ for some reward function $R^\star$. Then under Assumptions 3.2 and 3.3, we have that $\pi^\star = \pi_R$ for $R$ such that for any $x, y, y'$

$$R(x, y) - R(x, y') = R^\star(x, y) - R^\star(x, y') + \ln \mu(y|x) - \ln \mu(y'|x). \tag{3}$$

That is, $R$ is given by $R^\star + \ln \mu$, up to an $x$-dependent offset, within the support of the data. A similar conclusion holds for the squared loss if $P(\omega = 1|x, y, y') = 0.5 + (R^\star(x, y) - R^\star(x, y'))/2$. A more detailed discussion of how modeling $D_\omega$ as part of the exponential family leads to proper losses $\ell$ can be found in Appendix A. We see that in these cases, the policy underlying realizability learns a ground-truth reward function which underlies our data generation process. In such a scenario, where there is a reward function $R^\star$ underlying the observed preferences, a natural benchmark is the KL-regularized reward maximizing policy, $\pi(y|x) \propto \pi_0(y|x)\exp(\beta R^\star(x, y))$, where $\pi_0$ is some base policy (such as the SFT policy), with respect to which the KL divergence is defined. When $\mu = \pi_0$, we see that the policy $\pi^\star$ exactly corresponds to this optimal policy for an appropriately chosen loss function.
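The step from the BTL model to Equation 3 can be spelled out in two lines. The derivation below is a sketch that uses only Assumptions 3.2 and 3.3 and the definition of the log-ratio difference from Equation 1. It writes $\eta = P(\omega = 1|x, y, y')$, denotes the ground-truth BTL reward by $R^\star$ and the log-probabilities of the realizable policy by $R$, and assumes the unconstrained minimizer $g_{\log}(\eta)$ already lies in $[-R, R]$ so that the range constraint in Assumption 3.2 is inactive.

```latex
\begin{align*}
\omega^{\pi^\star,\mu}(x,y,y')
  &= g_{\log}(\eta) = \ln\frac{\eta}{1-\eta}
   = R^\star(x,y) - R^\star(x,y')
  && \text{(Assumptions 3.2 and 3.3, BTL model)} \\
\omega^{\pi^\star,\mu}(x,y,y')
  &= \bigl(R(x,y) - \ln\mu(y|x)\bigr) - \bigl(R(x,y') - \ln\mu(y'|x)\bigr)
  && \text{(definition in Equation 1)}
\end{align*}
```

Equating the two right-hand sides and moving the $\ln \mu$ terms across gives Equation 3.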

Performance criterion under realizability. Based on these insights, we adopt the policy $\pi^\star$ as our performance yardstick, and seek a policy $\pi$ to minimize $\mathrm{KL}(\pi^\star\|\pi)$. Approximately minimizing $L_\mu$ may not be sufficient to derive meaningful bounds on $\mathrm{KL}(\pi^\star\|\pi)$, however, as there still needs to be alignment between the data-generating distribution $D_y$ and $\pi^\star$. Otherwise, if there is no good coverage of the support of $\pi^\star(\cdot|x)$ by $D_y(\cdot|x)$, there is no guarantee that $\pi$ will be able to distinguish good responses $y$ according to $\pi^\star$ from highly sub-optimal ones. This necessitates making the following coverage assumption.

Assumption 3.4 (Coverage for optimal policy). Let $\bar R(x, y) = R(x, y) - \mathbb{E}_{D_y}[R(x, y)|x]$ denote the $y$-centered reward and let $R^\star$ denote the parameterization of $\pi^\star$. Further, let $\Delta\bar R(x, y) = \bar R(x, y) - \bar R^\star(x, y)$. We assume that there exists a constant $C$ such that for all $R \in \mathcal{R}$ it holds that

$$\mathbb{E}_{x,\, y \sim \pi^\star(\cdot|x)}[\Delta\bar R(x, y)^2] \le C\,\mathbb{E}_{x, y \sim D}[\Delta\bar R(x, y)^2].$$

The coverage condition is akin to generalized coverage conditions used in the offline RL literature (Xie et al., 2021; Jiang & Xie, 2024). A sufficient condition to ensure this holds is to have $\sup_{x,y} \frac{\pi^\star(y|x)}{D_y(y|x)} \le C$, but the generalized notion also holds when the class $\mathcal{R} = \{w^\top \phi(x, y) : \|w\|_2 \le 1\}$ and we have that $\lambda_{\max}(\Sigma_y^{-1/2}\,\Sigma_{\pi^\star_\mu}\,\Sigma_y^{-1/2}) \le C$. Here we denote $\Sigma_\pi = \mathbb{E}_{x \sim D,\, y \sim \pi(\cdot|x)}[\phi(x, y)\phi(x, y)^\top]$, and abbreviate $\Sigma_y = \Sigma_{D_y}$. Clearly, this second condition can be much weaker than the density ratio assumption, and indeed has underpinned several methods that effectively handle coverage issues in offline RL with large function spaces and high-dimensional data, motivating our definition here.
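The gap between the two sufficient conditions can be seen on a small synthetic example. The sketch below is an illustration under simplifying assumptions that are not from the paper: a single fixed input $x$, a finite response set, random Gaussian features, Dirichlet-sampled distributions, and no centering of the rewards. It compares the density-ratio constant with the spectral constant $\lambda_{\max}(\Sigma_y^{-1/2}\Sigma_{\pi^\star}\Sigma_y^{-1/2})$.

```python
import numpy as np

rng = np.random.default_rng(0)
num_responses, dim = 50, 4

# Hypothetical toy setup: features phi(x, y) for one fixed x and finite Y.
phi = rng.normal(size=(num_responses, dim))
p_data = rng.dirichlet(np.ones(num_responses))        # stands in for D_y(.|x)
p_star = rng.dirichlet(0.3 * np.ones(num_responses))  # stands in for pi*(.|x)

def second_moment(probs, feats):
    """E_{y ~ probs}[phi(y) phi(y)^T] for a finite response set."""
    return feats.T @ (probs[:, None] * feats)

sigma_y = second_moment(p_data, phi)
sigma_star = second_moment(p_star, phi)

# Symmetric inverse square root of sigma_y via its eigendecomposition.
evals, evecs = np.linalg.eigh(sigma_y)
inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

whitened = inv_sqrt @ sigma_star @ inv_sqrt
whitened = (whitened + whitened.T) / 2.0  # remove round-off asymmetry
c_linear = np.linalg.eigvalsh(whitened).max()
c_ratio = (p_star / p_data).max()

print(f"spectral constant:      {c_linear:.3f}")
print(f"density-ratio constant: {c_ratio:.3f}")
```

Since $\Sigma_{\pi^\star} \preceq \big(\max_y \pi^\star(y|x)/D_y(y|x)\big)\,\Sigma_y$, the spectral constant never exceeds the density-ratio constant, and with low-dimensional features it is typically much smaller.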

Before stating our main result we need a somewhat standard curvature assumption on the loss.

Assumption 3.5 (Curvature around optimum). For any policy $\pi \in \Pi$ and functions $R, R^\star \in \mathcal{R}$ such that $\pi = \pi_R$, $\pi^\star = \pi_{R^\star}$, there is a constant $c_\mu > 0$ such that

$$L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star_\mu; D_{xy\omega}) \ge \mathbb{E}\big[\omega\,\ell'(\omega\,\omega^{\pi^\star,\mu}(x, y, y'))\,(\omega^{\pi,\mu}(x, y, y') - \omega^{\pi^\star,\mu}(x, y, y'))\big] + \frac{c_\mu}{2}\,\mathbb{E}\big[(\omega^{\pi^\star,\mu}(x, y, y') - \omega^{\pi,\mu}(x, y, y'))^2\big].$$

A sufficient condition for Assumption 3.5 is to instead have the stronger condition that for any $u, v \in [-R, R]$, we have

$$\ell(u) \ge \ell(v) + \ell'(v)(u - v) + \frac{c_\mu}{2}(u - v)^2.$$

This condition holds for $c_\mu = \frac{1}{2}$ for the squared loss: $\ell(u, v) = (u - v)^2$, and with an $R$-dependent constant for many other losses that are induced by log-likelihoods of exponential families, which includes the logistic loss and the probit loss. Assumption 3.5 weakens this condition by requiring curvature on the expected loss $L$ only around the optimal policy $\pi^\star$, rather than pointwise on $\ell$.

3.2. Main Results

With our main modeling assumptions set up, we now give the main theoretical result on the KL divergence between an approximate minimizer of the population loss $L_\mu(\pi; D_{xy\omega})$ and the benchmark $\pi^\star$.

Theorem 3.6. For any $\pi \in \Pi$ such that $L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star_\mu; D_{xy\omega}) \le \epsilon$, where the loss $\ell$ underlying $L_\mu$ is proper, and under Assumptions 3.1-3.5, it holds that

Ex [KL(π ( |x)||π( |x))] r ϵ

In Section B we show a version of Theorem 3.6 which only uses approximate realizability, that is, a relaxed version of Assumption 3.2.

Remark 3.7 (Choice of loss function). Our bound scales inversely with the curvature constant, meaning that losses with high curvature will lead to more favourable bounds in Theorem 3.6. The squared loss satisfies Assumption 3.5 with $c_\mu = 1$ for any range of $\omega^{\pi^\star,\mu}$ as it is strongly convex, while the squared hinge loss only satisfies Assumption 3.5 in $(-\infty, 1)$. Finally, the logistic loss satisfies Assumption 3.5 with a range-dependent $c_\mu$, as the loss becomes less curved as $\omega^{\pi^\star,\mu}$ approaches $\pm\infty$. Our theorem suggests that optimizing the squared loss is ideal in terms of the final bound as $c_\mu$ is constant and bounded away from 0 across the full range of the loss. Squared loss is also a proper loss, so that the main assumption which might fail is realizability. We do note that the probabilistic model corresponding to squared loss is naturally less realistic than, say, the BTL model corresponding to the logistic loss. Nevertheless, the benefits of squared loss are verified by our experiments in Section 5. We note that a related discussion of the curvature properties comparing the squared and logistic cases can also be found in Azar et al. (2024), who correctly identify similar issues with the curvature of the logistic loss, and use this to motivate the IPO algorithm. However, there is no explicit quantitative analysis of the role of the link function beyond a simple example, since the focus is more on obtaining a practical algorithm for the identity link function case. We instead derive precise convergence guarantees, and also highlight the role of offline data coverage which is not captured in the IPO paper.

Remark 3.8 (Choice of base policy). While the choice of base policy $\mu$ does not appear directly in the KL bound, it influences the realizability assumption. As already discussed, under a proper loss such as the logistic loss and a corresponding reward model such as BTL, realizability becomes equivalent to having the reward model plus a $\log \mu$ term be part of the reward space $\mathcal{R}$. Further, the choice of $\mu$ can change $\pi^\star$, and as we discuss in Section 3.1, the choice of $\mu = \pi_{\mathrm{ref}}$ naturally yields a desirable $\pi^\star$. In our experiments we use two commonly studied choices of $\mu$: the uniform policy which puts equal probability on all responses, and an SFT policy $\pi_{\mathrm{ref}}$.

Remark 3.9 (Effect of constraints). Recall that we constrain the optimization problem to a policy class $\Pi$ which captures any constraints that we incorporate, such as CE to an SFT policy $\pi_0$, or the implicit constraints induced by the choice of optimizer and early stopping. We note that the choice of $\Pi$ determines if the benchmark $\pi^\star$, which satisfies the realizability and curvature assumptions, has to be part of $\Pi$. In this context, a cross-entropy regularization to $\pi_{\mathrm{ref}}$ essentially makes an assumption that $\pi^\star$ lies in the vicinity of $\pi_{\mathrm{ref}}$. Biasing the reference policy in cross-entropy towards preferred responses, such as in CPO (Xu et al., 2024), can be further beneficial in ensuring the feasibility of $\pi^\star$.

Remark 3.10 (Connections with prior results). As mentioned earlier, there is now a substantial literature on the degeneracies of DPO in particular, due to its popularity, with primarily empirical (Park et al., 2024; Rafailov et al., 2024a; Fisch et al., 2024), but also some theoretical results in these works that demonstrate that DPO tends to shift mass away from the support of the preference data, with probabilities of both preferred and dispreferred responses in the data rapidly degrading to zero. The reader might wonder how to reconcile these negative observations with our positive result on the loss minimizers of DPO-style losses. However, there are a few caveats which apply to the specific case of DPO. First, as we remarked earlier, the curvature constant $c_\mu$ for DPO degrades exponentially fast with $R$. While this scaling is not ideal, we note that it is typical and appears in prior and concurrent works as well (Zhao et al., 2024; Zhu et al., 2023; Xiong et al., 2023). Further, since DPO does not control the log-likelihood ratios through regularization terms, the quantity $R$ can rapidly grow large empirically, as pointed out in multiple papers, and as we corroborate in the next section. In fact, some works on incorporating pessimistic reasoning in DPO (Fisch et al., 2024; Liu et al., 2024; Cen et al., 2024; Huang et al., 2024) result in adding regularization terms which partly mitigate some of these degeneracies.
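The range dependence of the logistic curvature discussed in Remarks 3.7 and 3.10 can be made concrete with a small numerical sketch. It assumes $\beta = 1$, the logistic loss $\log(1 + e^{-z})$ and the squared loss $(z - 1)^2$, and uses the smallest second derivative on $[-R, R]$ as a stand-in for the pointwise curvature constant; these choices and the helper names are illustrative assumptions.

```python
import numpy as np

def logistic_second_derivative(z):
    """Second derivative of log(1 + exp(-z)): sigmoid(z) * (1 - sigmoid(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def squared_second_derivative(z):
    """Second derivative of (z - 1)^2 is the constant 2."""
    return np.full_like(np.asarray(z, dtype=float), 2.0)

for radius in [1.0, 5.0, 10.0, 20.0]:
    grid = np.linspace(-radius, radius, 2001)
    c_log = logistic_second_derivative(grid).min()
    c_sq = squared_second_derivative(grid).min()
    print(f"R = {radius:5.1f}   logistic: {c_log:.3e}   squared: {c_sq:.1f}")
```

The logistic value shrinks roughly like $e^{-R}$ as the range grows, while the squared-loss value stays constant, matching the exponential degradation in $R$ described above.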

4. Analysis

The high-level reasoning to prove Theorem 3.6 is the following. We first use Assumption 3.5 to establish that any $\epsilon$-minimizer of $L(\pi_R; D_{xy\omega})$ admits a bound on the expected error in the centered rewards $\mathbb{E}_{(x,y) \sim D}[\Delta\bar R(x, y)^2]$. We then invoke the coverage condition of Assumption 3.4 to translate this error bound to be under the benchmark policy $\pi^\star$. Subsequently, we relate the KL divergence between $\pi^\star$ and $\pi$ to the expectation of $\Delta\bar R(x, y)^2$ under $\pi^\star$, using a careful analysis of the log-partition function.

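We start with the first step in the sketch above.

Lemma 4.1. Under Assumptions 3.2 and 3.5, any policy $\pi \in \Pi$ with $L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star; D_{xy\omega}) \le \epsilon$ satisfies

$$\mathbb{E}_{x, y \sim D}\big[\Delta\bar R(x, y)^2\big] = \frac{\mathbb{E}_{x, y, y' \sim D}\big[(\Delta\bar R(x, y) - \Delta\bar R(x, y'))^2\big]}{2} \le \frac{2\epsilon}{c_\mu}.$$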

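Proof. The first condition of the lemma is equivalent to

$$\mathbb{E}[\ell(\omega\,\omega^{\pi,\mu}(x, y, y'))] \ge \mathbb{E}[\ell(\omega\,\omega^{\pi^\star,\mu}(x, y, y'))] + \mathbb{E}\big[\omega\,\ell'(\omega\,\omega^{\pi^\star,\mu}(x, y, y'))\,(\omega^{\pi,\mu}(x, y, y') - \omega^{\pi^\star,\mu}(x, y, y'))\big] + \frac{c_\mu}{2}\,\mathbb{E}\big[(\omega(R(x, y) - R(x, y') - R^\star(x, y) + R^\star(x, y')))^2\big].$$

Now, Assumption 3.2 implies that for any $\pi \in \Pi$,

$$\mathbb{E}_{\omega|x,y,y'}\big[\ell'(\omega\,\omega^{\pi^\star,\mu}(x, y, y'))\,\omega\,(\omega^{\pi,\mu}(x, y, y') - \omega^{\pi^\star,\mu}(x, y, y'))\big] \ge 0,$$

where we have used the fact that $v^\star = \omega^{\pi^\star,\mu}(x, y, y')$ is the minimizer of $\mathbb{E}_{\omega|x,y,y'}[\ell(\omega v)]$ over $v$, together with first-order optimality, so that $\mathbb{E}_{\omega|x,y,y'}[\ell'(\omega v^\star)\,\omega\,(v - v^\star)] \ge 0$ for any $v$, and in particular for $v = \omega^{\pi,\mu}(x, y, y')$ for any $\pi$. This, together with the second assumption of the lemma, implies that

$$\epsilon \ge \mathbb{E}[\ell(\omega\,\omega^{\pi,\mu}(x, y, y'))] - \mathbb{E}[\ell(\omega\,\omega^{\pi^\star,\mu}(x, y, y'))] \ge \frac{c_\mu}{2}\,\mathbb{E}\big[(\omega(R(x, y) - R(x, y') - R^\star(x, y) + R^\star(x, y')))^2\big],$$

where we have used the parametrization of $\pi$ and $\pi^\star$.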
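Lemma 4.1 relates centered reward errors to pairwise differences, which relies on $y$ and $y'$ being i.i.d. draws from $D_y(\cdot|x)$: for any function $f$, $\mathbb{E}[(f(y) - f(y'))^2] = 2\,\mathrm{Var}(f(y))$. The sketch below checks this identity exactly by enumeration, assuming a single fixed $x$, a finite response set, and an arbitrary synthetic error function; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
num_responses = 6
probs = rng.dirichlet(np.ones(num_responses))  # stands in for D_y(.|x)
delta_r = rng.normal(size=num_responses)       # arbitrary reward error f(y)

# Second moment of the y-centered error.
centered = delta_r - probs @ delta_r
centered_second_moment = probs @ centered ** 2

# Second moment of pairwise differences under independent draws (y, y').
pair_weights = np.outer(probs, probs)
pair_diffs = delta_r[:, None] - delta_r[None, :]
pairwise_second_moment = (pair_weights * pair_diffs ** 2).sum()

print(pairwise_second_moment, 2.0 * centered_second_moment)
```

The two printed numbers agree up to floating-point error, so a bound on the pairwise quantity controlled by the loss transfers to the centered error with a factor of one half.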

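Next we show how to bound the KL by $\Delta\bar R$ and the ratio of log-partition functions $\bar Z_\pi$ and $\bar Z^\star$, where $\bar Z_\pi(x) = \sum_y \exp(\bar R_\pi(x, y))$.

Lemma 4.2. Under Assumption 3.4, for any $\pi \in \Pi$ such that $\pi \propto \exp(\bar R(x, y))$, the expected KL divergence $\mathbb{E}_x[\mathrm{KL}(\pi^\star(\cdot|x)\|\pi(\cdot|x))]$ is bounded by

$$\sqrt{2C\,\mathbb{E}_{x,y}\big[\Delta\bar R(x, y)^2\big]} + \mathbb{E}_x\left[\left|\log\frac{\bar Z_\pi(x)}{\bar Z^\star(x)}\right|\right].$$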


Ex[KL(π ( |x)||π( |x))] Ex

Ey π ( |x) log π (y|x)

2 Ey π ( |x)

log π (y|x)

π(y|x) log Z (x) Zπ(x)

2 Ey π ( |x) log Zπ(x) log Z (x) 2

2C Ex,y[ R(x, y)2] + Ex

log Zπ(x) Z (x)

where the second inequality follows from Jensen's inequality, the third inequality follows from a combination of $(x + y)^2 \le 2x^2 + 2y^2$ and $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ for $x, y \ge 0$, and the final inequality uses Jensen's inequality to push the expectation inside the square root, along with Assumption 3.4.

The next lemma bounds the ratio of log-partitions.

Lemma 4.3. For any $\pi$ such that $L(\pi; D_{xy\omega}) - L(\pi^\star; D_{xy\omega}) \le \epsilon$, it holds that

log Zπ(x) Z (x)

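Proof. Using the definition of $\bar Z_\pi$ we have

$$\log\frac{\bar Z_\pi(x)}{\bar Z^\star(x)} = \log\frac{\sum_y \exp(\bar R_\pi(x, y))}{\bar Z^\star(x)} = \log\sum_y \pi^\star(y|x)\,\frac{\exp(\bar R_\pi(x, y))}{\exp(\bar R^\star(x, y))}.$$

Here the second equation rearranges the definition $\pi^\star(y|x) = \exp(\bar R^\star(x, y))/\bar Z^\star(x)$. Proceeding further

$$\begin{aligned}
\mathbb{E}_x\left[\log\frac{\bar Z_\pi(x)}{\bar Z^\star(x)}\right]
&= \mathbb{E}_x \ln \mathbb{E}_{y \sim \pi^\star(\cdot|x)} \exp(\Delta\bar R(x, y)) \\
&\le \mathbb{E}_x \ln\left(1 + \mathbb{E}_{y \sim \pi^\star(\cdot|x)}\,\Delta\bar R(x, y) + \frac{e^{A}}{2}\,\mathbb{E}_{y \sim \pi^\star(\cdot|x)}\,\Delta\bar R(x, y)^2\right) \\
&\le \mathbb{E}_x\,\mathbb{E}_{y \sim \pi^\star(\cdot|x)}\,\Delta\bar R(x, y) + \frac{e^{A}}{2}\,\mathbb{E}_x\,\mathbb{E}_{y \sim \pi^\star(\cdot|x)}\,\Delta\bar R(x, y)^2.
\end{aligned}$$

Here the first inequality uses that $e^x \le 1 + x + e^{A}x^2/2$ for $x \le A$. The second inequality uses $\ln(1 + x) \le x$. Continuing with our simplification, we get the bound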

log Zπ(x) Z (x)

Ex Ey π ( |x) R(x, y)2

2 Ex Ey π ( |x) R(x, y)2

The first inequality above follows by applying the triangle inequality to the sum inside the absolute value, followed by Jensen's inequality on the first term. Finally, we invoke Assumption 3.4 and Lemma 4.1 to control the last term, which completes the proof of the lemma.
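The two elementary inequalities used in this proof, $e^x \le 1 + x + e^{A}x^2/2$ for $x \le A$ and $\ln(1 + x) \le x$, can be sanity-checked on a grid. The sketch below is a numerical check, not a proof; it assumes $A \ge 0$ (the Taylor-remainder argument behind the first inequality needs this), and the value $A = 2$ is an arbitrary illustrative choice.

```python
import numpy as np

upper = 2.0  # the constant A; assumed non-negative for this check

# Gap of the first inequality: 1 + x + e^A x^2 / 2 - e^x, for x <= A.
xs = np.linspace(-10.0, upper, 100001)
gap_exp = 1.0 + xs + np.exp(upper) * xs ** 2 / 2.0 - np.exp(xs)

# Gap of the second inequality: t - ln(1 + t), for t > -1.
ts = np.linspace(-0.999, 50.0, 100001)
gap_log = ts - np.log1p(ts)

print(gap_exp.min(), gap_log.min())  # both should be non-negative up to round-off
```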

Combining Lemma 4.2 and Lemma 4.3 finishes the proof of Theorem 3.6.

5. Experiments

We evaluate the impact of different design choices for offline RLHF methods on the standard TL;DR summarization task (Völske et al., 2017; Stiennon et al., 2020), where the task is to provide short summaries of articles. For our policy, we use a large T5 model (Raffel et al., 2020) with 770M parameters, which has been fine-tuned to maximize the log-likelihood of the human responses in the TL;DR dataset. This ensures that the policy is initialized with parameters that have reasonable likelihoods for the responses observed in our data. We experiment with two different losses, $\ell$:

$$\ell(x) = \log(1 + \exp(-\beta x)) \qquad \text{(Logistic loss)}$$

$$\ell(x) = (\beta x - 1)^2, \qquad \text{(Squared loss)}$$

where β > 0 is a hyper-parameter governing how strongly each preference in the dataset is reflected in the policy. We pair each of the losses with two possible choices for the base policy µ. The first is the uniform base policy µ = 1, which results in the following optimization

$$\min_{\pi \in \Pi} \sum_{i=1}^{n} \ell\Big(\omega_i \big[\log \pi(y_i|x_i) - \log \pi(y'_i|x_i)\big]\Big),$$

and the second choice uses µ = πref, the policy obtained after fine-tuning on TL;DR, with which our optimization is initialized. This corresponds to the objective:

$$\min_{\pi \in \Pi} \sum_{i=1}^{n} \ell\Big(\omega_i \Big[\log \frac{\pi(y_i|x_i)}{\pi_{\mathrm{ref}}(y_i|x_i)} - \log \frac{\pi(y'_i|x_i)}{\pi_{\mathrm{ref}}(y'_i|x_i)}\Big]\Big).$$

We recall that when µ = πref, using the logistic loss from (Logistic loss) corresponds to the DPO algorithm (Rafailov et al., 2024b) and the squared loss from (Squared loss) corresponds to IPO (Azar et al., 2024).
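The four loss and base-policy combinations above can be sketched in a few lines. This is a minimal illustration and not the authors' training code: the function names, the toy log-likelihood values, the averaging over pairs, and the convention that ω is +1 when y is preferred and -1 otherwise are all assumptions made for the example.

```python
import math


def logistic_loss(z, beta):
    """log(1 + exp(-beta * z)), evaluated in a numerically stable way."""
    u = -beta * z
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))


def squared_loss(z, beta):
    """(beta * z - 1)^2."""
    return (beta * z - 1.0) ** 2


def preference_objective(loss, beta, omega, logp_y, logp_y_alt,
                         ref_logp_y=None, ref_logp_y_alt=None):
    """Average loss over preference pairs (hypothetical helper).

    Passing reference log-probabilities gives the mu = pi_ref variant;
    omitting them gives the uniform base policy mu = 1.
    """
    total = 0.0
    for i in range(len(omega)):
        margin = logp_y[i] - logp_y_alt[i]
        if ref_logp_y is not None:
            margin -= ref_logp_y[i] - ref_logp_y_alt[i]
        total += loss(omega[i] * margin, beta)
    return total / len(omega)


# Toy log-likelihoods for two preference pairs (made-up numbers).
omega = [1.0, -1.0]
logp_y = [-10.0, -12.0]
logp_y_alt = [-11.0, -9.0]
ref_logp_y = [-10.5, -11.0]
ref_logp_y_alt = [-10.5, -10.0]

# Logistic loss with mu = pi_ref (the DPO-style objective).
dpo_like = preference_objective(logistic_loss, 0.1, omega, logp_y, logp_y_alt,
                                ref_logp_y, ref_logp_y_alt)
# Squared loss with mu = pi_ref (the IPO-style objective).
ipo_like = preference_objective(squared_loss, 0.5, omega, logp_y, logp_y_alt,
                                ref_logp_y, ref_logp_y_alt)
# Logistic loss with the uniform base policy mu = 1.
uniform_logistic = preference_objective(logistic_loss, 0.1, omega,
                                        logp_y, logp_y_alt)
print(dpo_like, ipo_like, uniform_logistic)
```

At initialization with µ = πref, the policy and reference log-likelihoods coincide, so every margin is zero and the two losses start at log 2 and 1 respectively, regardless of the data.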


Figure 1. Left panel shows the preference of the learned policy's summaries against those from the initial policy πref, as evaluated by a prompted Gemini 1.0 Ultra model. Shaded regions represent 95% error bands. Both the logistic loss variants quickly improve in terms of the preference scores initially, but then suffer a catastrophic collapse. Squared loss improves at a similar rate initially, and remains stable throughout the training regime. Right panel shows a direct comparison between the variants of logistic loss using µ = uniform and µ = πref (DPO) at regular intervals in the training process. Interestingly, the uniform variant is preferred in the early stages of training, but as the training collapses around training step 5K, the πref variant starts to improve. Nevertheless, the absolute performance of both variants reaches its peak earlier in the training and rapidly worsens after 5K steps, suggesting that the preference for πref over uniform in this region might not be particularly significant. See text for a more nuanced discussion.

For each variant, we tuned the β parameter in the loss over the set {0.1, 0.5, 1.0}. For logistic loss, the best results are obtained at β = 0.1, while we did not see a significant difference across these choices for the squared loss, and show the results at β = 0.5. See Appendix C for details.

We evaluate the different variants in two ways. First we compare the policies generated by each method at regular intervals in the training process, with the initial policy πref, by comparing the generated summaries in terms of their quality and conciseness by a prompted Gemini 1.0 Ultra model (Team et al., 2023). Figure 1 (left) shows that squared loss variants perform significantly better than those with logistic loss, which reach a peak preference at 2K steps and suffer from a dramatic collapse afterwards. In the case of µ = πref, this collapse is consistent with prior findings on the DPO algorithm in the literature (Fisch et al., 2024; Rafailov et al., 2024a). In comparison, both squared loss variants reach peak performance after a similar number of steps, but maintain that performance more stably afterwards.

Notably, using µ = πref is consistently better than µ = 1 when combined with the squared loss. In the case of the logistic loss, the comparison against πref in Figure 1 (left) suggests that the base policy choice does not affect performance. However, when we compare the policies produced by the two variants with µ = 1 and µ = πref directly in Figure 1 (right), we do observe a strong impact. Initially, the uniform variant is slightly preferred, followed by a strong preference in the other direction after roughly 5K steps. However, at this point the preference of each variant against πref has already collapsed, indicating a perhaps less meaningful comparison between two sets of bad responses in this region.

To further understand the training dynamics, we plot the log-probabilities of the preferred and dispreferred responses from the preference data for all the variants in Figure 2. For the logistic loss, these plots demonstrate that while both likelihoods rapidly decrease to zero, the rate is faster for the dispreferred responses. In terms of our analysis setup, this corresponds to ωπ,µ being large, which in turn leads to a small value of the curvature constant cµ and a large value of R. This makes the bound in Theorem 3.6 drastically worse, confirming our theoretical findings empirically. In the case of µ = πref, corresponding to DPO, a similar collapse of log-probabilities was also observed in prior works (Fisch et al., 2024; Rafailov et al., 2024b). While the log-likelihoods also decline for the squared loss, the decrease is milder, which in turn means that the magnitude of ωπ,µ and the necessary value of R remain adequately bounded. This, together with the better curvature constant of the squared loss, makes the findings consistent with the theory.

In Table 2 we report approximations to cµ and R for all the different loss variants, as measured empirically. We approximate R by the value of ωπ,µ averaged over the last mini-batch of training. Further, we approximate cµ by the curvature of ℓµ for the values of R and ωπ,µ obtained this way. We note that ωπ,µ decreases monotonically during training and so the approximation to R and cµ is tightest at the end of training. The reported values of cµ and R are consistent with our theory and the empirical observations in Figures 1 and 2.


Figure 2. Evolution of the log-likelihoods of the preferred response (left) and dispreferred response (right) from the preference dataset across the training process. Both variants of the squared loss decrease the log-likelihoods of both the responses during training, but the decrease is relatively mild. The logistic loss, on the other hand, sends these log-likelihoods crashing sharply, even though the dispreferred responses have significantly lower values, so the difference of log-likelihoods remains highly negative, driving the loss to zero. We suspect that this degeneration of log-likelihoods is responsible for the eventual collapse observed for the logistic loss in Figure 1.

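| Loss | Base policy | R | cµ |
|---|---|---|---|
| $\ell(z) = -\log\frac{1}{1+\exp(-\beta z)}$ | µ = πref | 73.203 | 0.00066 |
| $\ell(z) = -\log\frac{1}{1+\exp(-\beta z)}$ | µ = 1 | 72.949 | 0.00068 |
| $\ell(z) = (\beta z - 1)^2$ | µ = πref | 4.678 | 2.0 |
| $\ell(z) = (\beta z - 1)^2$ | µ = 1 | 4.924 | 2.0 |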

Table 2. We approximate R by ωπ,µ averaged over the last mini-batch and approximate cµ by computing the curvature of ℓµ at ωπ,µ.
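As a rough sanity check on Table 2, the sketch below evaluates the curvature of the logistic loss at the reported values of R. It rests on two assumptions that the text does not state: that the curvature is taken with respect to the scaled argument u = βz, and that β = 0.1 (the best value reported for the logistic loss). Under this reading the computed values land close to the reported 0.00066 and 0.00068, and the squared loss (u - 1)^2 has constant second derivative 2.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def logistic_curvature(u):
    """Second derivative of log(1 + exp(-u)) with respect to u."""
    s = sigmoid(u)
    return s * (1.0 - s)


BETA = 0.1                          # assumed: best beta reported for logistic loss
R_REF, R_UNIFORM = 73.203, 72.949   # R values taken from Table 2

c_ref = logistic_curvature(BETA * R_REF)
c_uniform = logistic_curvature(BETA * R_UNIFORM)
SQUARED_CURVATURE = 2.0             # second derivative of (u - 1)^2 in u

print(f"logistic, mu = pi_ref: {c_ref:.5f}")
print(f"logistic, mu = 1:      {c_uniform:.5f}")
print(f"squared loss:          {SQUARED_CURVATURE}")
```

The curvature of the logistic loss decays roughly like exp(-βR), which is why a large R translates into a very small cµ.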

6. Discussion

This work adopts a different line of reasoning than the typical reparameterization arguments to motivate the correspondence between offline and online RLHF techniques, with a goal of understanding the impact of design choices in the offline methods, many of which do not fit cleanly in the arguments for equivalence of online and offline methods. Our theory and experiments collectively indicate that the limited coverage of offline data and the propensity of the log-likelihoods of the preference data to drop precipitously in certain methods are perhaps the key obstacles to reliable learning in this setup. Our findings suggest that using losses which, unlike the logistic loss, do not decay to zero at a slow rate, and using experimental design techniques for data collection before offline RLHF, can be fruitful avenues for addressing the concerns uncovered here.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.

Amini, A., Vieira, T., and Cotterell, R. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571, 2024.

Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447-4455. PMLR, 2024.

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

Cen, S., Mei, J., Goshvadi, K., Dai, H., Yang, T., Yang, S., Schuurmans, D., Chi, Y., and Dai, B. Value-incentivized preference optimization: A unified approach to online and offline rlhf. arXiv preprint arXiv:2405.19320, 2024.

Chen, Y., Tan, J., Zhang, A., Yang, Z., Sheng, L., Zhang, E., Wang, X., and Chua, T.-S. On softmax direct preference optimization for recommendation. arXiv preprint arXiv:2406.09215, 2024.


Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.

Fisch, A., Eisenstein, J., Zayats, V., Agarwal, A., Beirami, A., Nagpal, C., Shaw, P., and Berant, J. Robust preference optimization through reward model distillation. arXiv preprint arXiv:2405.19316, 2024.

Huang, A., Zhan, W., Xie, T., Lee, J. D., Sun, W., Krishnamurthy, A., and Foster, D. J. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi-squared preference optimization. arXiv preprint arXiv:2407.13399, 2024.

Jiang, N. and Xie, T. Offline reinforcement learning in large state spaces: Algorithms and guarantees. Statistical Science, 2024.

Liu, Z., Lu, M., Zhang, S., Liu, B., Guo, H., Yang, Y., Blanchet, J., and Wang, Z. Provably mitigating overoptimization in rlhf: Your sft loss is implicitly an adversarial regularizer. arXiv preprint arXiv:2405.16436, 2024.

Luce, R. Individual Choice Behavior: A Theoretical Analysis. Dover Books on Mathematics. Dover Publications, 2012. ISBN 9780486153391. URL https://books.google.com/books?id=ERQsKkPiKkkC.

Meng, Y., Xia, M., and Chen, D. SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.

Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Michi, A., et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.

Neu, G. and Rosasco, L. Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pp. 3222-3242. PMLR, 2018.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730-27744, 2022.

Pal, A., Karkhanis, D., Dooley, S., Roberts, M., Naidu, S., and White, C. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024.

Park, R., Rafailov, R., Ermon, S., and Finn, C. Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159, 2024.

Rafailov, R., Chittepu, Y., Park, R., Sikchi, H., Hejna, J., Knox, B., Finn, C., and Niekum, S. Scaling laws for reward model overoptimization in direct alignment algorithms. arXiv preprint arXiv:2406.02900, 2024a.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024b.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1-67, 2020.

Ramesh, S. S., Hu, Y., Chaimalas, I., Mehta, V., Sessa, P. G., Bou Ammar, H., and Bogunovic, I. Group robust preference optimization in reward-free rlhf. Advances in Neural Information Processing Systems, 37:37100-37137, 2024.

Raskutti, G., Wainwright, M. J., and Yu, B. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335-366, 2014.

Reid, M. D. and Williamson, R. C. Composite binary losses. The Journal of Machine Learning Research, 11:2387-2422, 2010.

Sonthalia, R., Lok, J., and Rebrova, E. On regularization via early stopping for least squares regression. arXiv preprint arXiv:2406.04425, 2024.

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008-3021, 2020.

Suggala, A., Prasad, A., and Ravikumar, P. K. Connecting optimization and regularization paths. Advances in Neural Information Processing Systems, 31, 2018.

Swamy, G., Dann, C., Kidambi, R., Wu, Z. S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056, 2024.

Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P. H., Valko, M., Pires, B. A., and Piot, B. Generalized preference optimization: A unified approach to offline alignment. In Forty-first International Conference on Machine Learning, 2024.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Vaskevicius, T., Kanade, V., and Rebeschini, P. Implicit regularization for optimal sparse recovery. Advances in Neural Information Processing Systems, 32, 2019.

Völske, M., Potthast, M., Syed, S., and Stein, B. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pp. 59-63, 2017.

Wang, Y., Liu, Q., and Jin, C. Is rlhf more difficult than standard rl? arXiv preprint arXiv:2306.14111, 2023.

Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. Bellman-consistent pessimism for offline reinforcement learning. Advances in neural information processing systems, 34:6683-6694, 2021.

Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N., and Zhang, T. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. arXiv preprint arXiv:2312.11456, 2023.

Xu, H., Sharaf, A., Chen, Y., Tan, W., Shen, L., Van Durme, B., Murray, K., and Kim, Y. J. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024.

Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289-315, 2007.

Zhao, H., Ye, C., Gu, Q., and Zhang, T. Sharp analysis for kl-regularized contextual bandits and rlhf. arXiv preprint arXiv:2411.04625, 2024.

Zhao, Y., Khalman, M., Joshi, R., Narayan, S., Saleh, M., and Liu, P. J. Calibrating sequence likelihood improves conditional language generation. arXiv preprint arXiv:2210.00045, 2022.

Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and Liu, P. J. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.

Zhu, B., Jordan, M., and Jiao, J. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In International Conference on Machine Learning, pp. 43037-43067. PMLR, 2023.


A. From a preference model to a proper loss

A natural family for the reward generating distributions is the exponential family. In particular, we assume that $D_\omega$ is in the exponential family parametrized by some $v^\star := v^\star(x, y, y')$, that is, $P(\omega|x, y, y') = \exp(\omega v^\star - \phi(v^\star))$, where $\phi$ is some strictly convex function. Considering the exponential family naturally leads to an objective function for learning the unknown parameter $v^\star$: maximizing the log-likelihood recovers $v^\star$ in the following way. First we take the derivative of the log-likelihood with respect to $v$:

$$\nabla_v E_\omega[\log \exp(\omega v - \phi(v))] = E_\omega[\omega - \nabla\phi(v)] = \nabla\phi(v^\star) - \nabla\phi(v).$$

Setting the derivative to 0 and using the strict convexity to invert $\nabla\phi(v)$ shows that $v^\star = \nabla\phi^{-1}(E_\omega[\omega])$. To summarize the above, a preference model which follows an exponential family distribution gives rise to a natural loss function given by

$$\min_{v \in [-R, R]} E_\omega[\phi(v) - \omega v],$$

and has a closed form solution $v^\star = \nabla\phi^{-1}(E_\omega[\omega])$.

To make this discussion concrete we focus on a BTL model, where we set $v(x, y, y') = R(x, y) - R(x, y')$. Then $\phi(v(x, y, y')) = \log(\exp(R(x, y) - R(x, y')) + \exp(R(x, y') - R(x, y)))$ and the corresponding loss function is then

$$\begin{aligned}
\phi(v(x, y, y')) - \omega v(x, y, y') &= \log(\exp(R(x, y) - R(x, y')) + \exp(R(x, y') - R(x, y))) \\
&\quad - \log(\exp(\omega(R(x, y) - R(x, y')))) \\
&= \log(1 + \exp(-\omega(R(x, y) - R(x, y')))),
\end{aligned}$$

which is precisely the logistic loss. The above derivation already shows that any loss derived from the exponential family with link function $\nabla\phi$ is proper, as $\phi$ is strictly convex and hence $\nabla\phi$ is invertible, with $g_\ell$ in Assumption 3.3 satisfying $g_\ell \equiv \nabla\phi^{-1}$. We can further check that for a minimizer $v$ of the logistic loss we must have

$$\nabla_v\big(\eta \log(1 + \exp(-v)) + (1 - \eta)\log(1 + \exp(v))\big) = -\eta\,\frac{\exp(-v)}{1 + \exp(-v)} + (1 - \eta)\,\frac{\exp(v)}{1 + \exp(v)} = 0,$$

where $\eta = P(\omega = 1|x, y, y')$. The above implies $v = \log\frac{\eta}{1 - \eta}$. Equivalently, we could have computed the derivative of the convex conjugate of $\phi$, $\phi^\star$, which is precisely $\nabla\phi^{-1}$. Finally, to establish the claimed connection between the parametrization of $\Pi$ and the BTL parametrization with $v$, we have the following. The fact that the logistic loss is proper implies

R(x, y) R(x, y ) log µ(y|x)

µ(y |x) = ωπ,µ(x, y, y )

= log η 1 η = log exp(R (x, y) R (x, y )) Z(x) (R (x, y) R (x, y )),

which, upon further simplification, gives Equation 3.

A similar line of reasoning shows that when $\ell(v) = (1 - v)^2$, the link function satisfies $\eta = \frac{1 + v}{2}$ and so the resulting reward model is $P(\omega = 1|x, y, y') = \frac{1 + R^\star(x, y) - R^\star(x, y')}{2}$.
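The two link functions derived in this appendix can be checked numerically. The sketch below minimizes the expected logistic and squared losses for a fixed preference probability η and compares the minimizers with log(η/(1-η)) and 2η-1. The helper `argmin_convex`, the search interval, and the value η = 0.8 are illustrative choices, not part of the paper.

```python
import math


def expected_logistic(v, eta):
    """eta * log(1 + exp(-v)) + (1 - eta) * log(1 + exp(v))."""
    return eta * math.log1p(math.exp(-v)) + (1.0 - eta) * math.log1p(math.exp(v))


def expected_squared(v, eta):
    """eta * (1 - v)^2 + (1 - eta) * (1 + v)^2."""
    return eta * (1.0 - v) ** 2 + (1.0 - eta) * (1.0 + v) ** 2


def argmin_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


eta = 0.8  # illustrative value of P(omega = 1 | x, y, y')
v_logistic = argmin_convex(lambda v: expected_logistic(v, eta), -10.0, 10.0)
v_squared = argmin_convex(lambda v: expected_squared(v, eta), -10.0, 10.0)

print(v_logistic, math.log(eta / (1.0 - eta)))  # logistic link: log-odds
print(v_squared, 2.0 * eta - 1.0)               # squared link: eta = (1 + v) / 2
```

Both minimizers are invertible functions of η, which is what makes the corresponding losses proper.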

B. Approximate realizability

We now relax Assumption 3.2 to the following.

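Assumption B.1 (Approximate Realizability). There exists $\pi^\star \in \Pi$ such that for all $x \in X$, $y \in Y$, $y' \in Y$:

$$E_{D_\omega}[\ell(\omega\,\omega_{\pi^\star,\mu}(x, y, y'))|x, y, y'] - \min_{v \in [-R, R]} E_{D_\omega}[\ell(\omega v)|x, y, y'] \le \epsilon'.$$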

To recover a version of Theorem 3.6 we need to modify Lemma 4.1 and its proof, as that is the only place where realizability is used. We will also need to use a stronger version of Assumption 3.5.


Assumption B.2. For any $u, v \in [-R, R]$, we have

$$\ell(u) \ge \ell(v) + \ell'(v)(u - v) + \frac{c_\mu}{2}(u - v)^2.$$

Assumption B.2 posits that ℓ is cµ-strongly convex.
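As an illustration, and reading Assumption B.2 as the standard strong convexity inequality with quadratic term $\frac{c_\mu}{2}(u - v)^2$, the two losses from Appendix A give the following constants. For simplicity the sketch takes $\beta = 1$.

```latex
% Squared loss \ell(v) = (1 - v)^2: the inequality holds with equality for c_\mu = 2.
\ell(u) - \ell(v) - \ell'(v)(u - v)
  = (1 - u)^2 - (1 - v)^2 + 2(1 - v)(u - v)
  = (u - v)^2 .

% Logistic loss \ell(v) = \log(1 + e^{-v}): the curvature shrinks with R.
\ell''(v) = \frac{e^{-v}}{(1 + e^{-v})^2}
\quad\Longrightarrow\quad
c_\mu = \min_{v \in [-R, R]} \ell''(v) = \frac{e^{-R}}{(1 + e^{-R})^2} \le e^{-R} .
```

The squared loss therefore has a curvature constant independent of $R$, while for the logistic loss the constant decays exponentially as $R$ grows.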

Lemma B.3. Under Assumptions B.1 and B.2, any policy $\pi \in \Pi$ with $L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star; D_{xy\omega}) \le \epsilon$ satisfies

Ex,y D R(x, y)2

= Ex,y,y D[( R(x, y) R(x, y ))2] 4(2ϵ + ϵ)

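Proof. Using the strong convexity part of Assumption B.2 we have that for $\pi$ and $\pi^\star$

$$\begin{aligned}
E\Big[\frac{c_\mu}{2}\big(\omega(R(x, y) - R(x, y') - v^\star)\big)^2\Big] &\le E[\ell(\omega\,\omega_{\pi,\mu}(x, y, y'))] - E[\ell(\omega v^\star)] \\
&= L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star; D_{xy\omega}) + E[\ell(\omega\,\omega_{\pi^\star,\mu}(x, y, y'))] - E[\ell(\omega v^\star)], \\
E\Big[\frac{c_\mu}{2}\big(\omega(R^\star(x, y) - R^\star(x, y') - v^\star)\big)^2\Big] &\le E[\ell(\omega\,\omega_{\pi^\star,\mu}(x, y, y'))] - E[\ell(\omega v^\star)] \le \epsilon',
\end{aligned}$$

where $v^\star = \arg\min_{v \in [-R, R]} E_{D_\omega}[\ell(\omega v)|x, y, y']$ and we have used first order optimality, together with the strong convexity of ℓ. Using the above two inequalities we have

$$\begin{aligned}
&\frac{c_\mu}{4} E\big[\big(\omega(R(x, y) - R(x, y') - R^\star(x, y) + R^\star(x, y'))\big)^2\big] \\
&\quad= \frac{c_\mu}{4} E\big[\big(\omega(R(x, y) - R(x, y') - v^\star + v^\star - R^\star(x, y) + R^\star(x, y'))\big)^2\big] \\
&\quad\le E\Big[\frac{c_\mu}{2}\big(\omega(R(x, y) - R(x, y') - v^\star)\big)^2\Big] + E\Big[\frac{c_\mu}{2}\big(\omega(R^\star(x, y) - R^\star(x, y') - v^\star)\big)^2\Big] \le 2\epsilon' + \epsilon.
\end{aligned}$$

The remainder of the proof of Theorem 3.6 remains unchanged except for replacing $\epsilon$ by $2\epsilon' + \epsilon$. And so we have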

Theorem B.4. For any $\pi \in \Pi$ such that $L_\mu(\pi; D_{xy\omega}) - L_\mu(\pi^\star_\mu; D_{xy\omega}) \le \epsilon$, where the loss ℓ corresponding to $L_\mu$ is proper, and under Assumption 3.1, Assumption 3.4, Assumption B.1 and Assumption B.2, it holds that

Ex [KL(π ( |x)||π( |x))]

2 C(ϵ + 2ϵ )

C. Experiment details

We evaluate the different variants on the TL;DR dataset (Völske et al., 2017), where the task is to summarize posts on Reddit forums. The dataset consists of original Reddit posts, each along with a pair of responses which are rated by human judges to provide the ground-truth preference annotations (Stiennon et al., 2020). Our experiments use a T5 large model (Raffel et al., 2020) with 770M parameters, which is further fine-tuned to maximize the log-likelihood of the winning responses in the TL;DR dataset. We train for 20000 iterations, with a batch size of 32. A KL regularizer toward the reference πref checkpoint is used, with coefficient equal to 0.005. The optimizer used is Adafactor with a learning rate that is constant after a linear warm-up for 2000 steps, with a base rate of 1e-4.
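The learning rate schedule described above can be sketched as follows. The function name and the exact shape of the warm-up (starting from zero and increasing linearly in the step count) are assumptions; the text only specifies a linear warm-up over 2000 steps to a constant base rate of 1e-4.

```python
def learning_rate(step, base_rate=1e-4, warmup_steps=2000):
    """Linear warm-up from 0 to base_rate, then constant (hypothetical helper)."""
    return base_rate * min(1.0, step / warmup_steps)


# Training settings quoted from the text, collected for reference.
TRAINING_CONFIG = {
    "iterations": 20000,
    "batch_size": 32,
    "kl_coefficient": 0.005,
    "optimizer": "Adafactor",
    "base_learning_rate": 1e-4,
    "warmup_steps": 2000,
}

for step in (0, 1000, 2000, TRAINING_CONFIG["iterations"]):
    print(step, learning_rate(step))
```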

We used the following text to prompt the Gemini evaluator used for our experiments:

You are an expert summary rater who prefers very short and high quality summaries. Given a document and two candidate summaries, say 1 if SUMMARY1 is the better summary, or 2 if SUMMARY2 is the better summary. Give a short reasoning for your answer. ARTICLE: <article-here> SUMMARY1: <summary-by-π> SUMMARY2: <summary-by-πref>.
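A minimal sketch of how such an evaluation could be assembled is given below. It fills the prompt above for one comparison and turns a list of evaluator verdicts into a win rate with a 95% error band. The helper names are hypothetical, no evaluator API is called, and the normal-approximation interval is an assumption, since the text does not say how the error bands in Figure 1 were computed.

```python
import math

PROMPT_TEMPLATE = (
    "You are an expert summary rater who prefers very short and high quality "
    "summaries. Given a document and two candidate summaries, say 1 if SUMMARY1 "
    "is the better summary, or 2 if SUMMARY2 is the better summary. Give a short "
    "reasoning for your answer. ARTICLE: {article} SUMMARY1: {summary_policy} "
    "SUMMARY2: {summary_ref}."
)


def build_prompt(article, summary_policy, summary_ref):
    """Fill the evaluator prompt for one article and two candidate summaries."""
    return PROMPT_TEMPLATE.format(
        article=article, summary_policy=summary_policy, summary_ref=summary_ref
    )


def win_rate_with_ci(verdicts, z=1.96):
    """Win rate of SUMMARY1 and the half-width of a normal-approximation band.

    verdicts holds 1 when the learned policy's summary is preferred and 2 when
    the reference summary is preferred.
    """
    n = len(verdicts)
    wins = sum(1 for v in verdicts if v == 1)
    p = wins / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, half_width


prompt = build_prompt("some post", "short summary", "reference summary")
rate, half_width = win_rate_with_ci([1, 1, 1, 2])
print(prompt)
print(rate, half_width)
```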