Published as a conference paper at ICLR 2024

PROVABLE OFFLINE PREFERENCE-BASED REINFORCEMENT LEARNING

Wenhao Zhan (Princeton University, wenhao.zhan@princeton.edu), Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com), Nathan Kallus (Cornell University, kallus@cornell.edu), Jason D. Lee (Princeton University, jasonlee@princeton.edu), Wen Sun (Cornell University, ws455@cornell.edu)

*The first two authors contributed equally. This work was done at Cornell University.

ABSTRACT

In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback, where feedback is available in the form of preferences between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs.

1 INTRODUCTION

In the standard reinforcement learning (RL) setting, the agent learns to maximize an observed numerical reward signal. However, finding appropriate numerical rewards can often be challenging in practice, and getting the reward right significantly impacts the effectiveness of RL algorithms (Wirth et al., 2017). To address this challenge, preference-based RL (PbRL) with human feedback has emerged as a promising alternative (Christiano et al., 2017). In PbRL, the agent does not receive a numerical reward signal, but rather feedback from a human expert in the form of preferences between pairs of state-action trajectories. PbRL has gained considerable attention across multiple application domains, including games (MacGlashan et al., 2017; Christiano et al., 2017; Warnell et al., 2018), large language models (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021; Nakano et al., 2021; Ouyang et al., 2022; Glaese et al., 2022; Bai et al., 2022; Ramamurthy et al., 2022; Liu et al., 2023), and robot learning (Brown et al., 2019; Shin et al., 2023).

In this work, we focus on the problem of offline PbRL, where the learning process relies exclusively on pre-collected offline data without active interaction with the environment. Offline RL has gained significant attention in various applications where conducting real-time online experiments may be costly. In the context of PbRL, the offline setting is particularly relevant due to the high cost and latency associated with obtaining human feedback. One of the key challenges in offline RL is the limited coverage of the available offline data. Since coverage of the entire state-action space is rarely
feasible in practice (Chen and Jiang, 2019a), recent empirical and theoretical approaches to offline RL leverage pessimism so as to rely only on the coverage of one comparator policy (possibly the optimal one), i.e., the so-called partial coverage condition (Yu et al., 2020; Kidambi et al., 2020; Rashidinejad et al., 2021a; Li et al., 2022a; Shi et al., 2022; Yin and Wang, 2021; Xie et al., 2021; Uehara and Sun, 2021; Zhan et al., 2022a). In the context of PbRL, it is also crucial to develop algorithms that work under the partial coverage condition. Despite its significance, there are very few algorithms specifically designed for offline PbRL with strong statistical guarantees. In this work, we provide such algorithms and guarantees when preferences depend on unknown reward functions over trajectories. Notably, we consider general reward functions that can be defined over the whole trajectory rather than just state-action pairs. This is consistent with many practical settings in natural language processing. For instance, all benchmarks presented in RL4LM (Ramamurthy et al., 2022) use metrics defined over entire trajectories.

Our main contributions can be summarized as follows:

- We propose a simple algorithm with general function approximation that consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We prove that our algorithm can effectively compete with a target policy as long as the offline data cover the target policy. Our analysis leverages a newly defined concentrability coefficient which is tailored to PbRL.
- As the concentrability coefficient differs from that in the standard RL setting where state-action-wise rewards are directly observed, we establish lower bounds that highlight the necessity of our partial coverage condition. To the best of our knowledge, this is the first theoretical separation result between standard offline RL and offline PbRL.
- We extend the algorithm to the setting where the transition kernel is unknown, where we construct confidence sets not only for the reward function but also for the system dynamics. Notably, even though the reward can be trajectory-wise, we only need to estimate the per-step transition dynamics to ensure efficient learning.
- We further extend our results to the action-based comparison model, where preferences are defined over individual actions instead of entire trajectories, based on the advantage function of the optimal policy (Ramachandran and Amir, 2007; Zhu et al., 2023). In comparison to the trajectory-wise comparison model, we can establish a partial coverage guarantee using a concentrability coefficient defined over state-action pairs rather than over trajectories. In this scenario, our sample complexity only scales with a bound on the advantage function, which can be much smaller than a bound on per-trajectory rewards, as shown in Ross et al. (2011); Agarwal et al. (2019).

2 RELATED WORK

Preference-based Reinforcement Learning. The closest work to ours is Zhu et al. (2023), which also studies offline PbRL, but their algorithm and analysis are restricted to linear models. Our algorithm and analysis extend to general function approximation. Indeed, general classes such as neural networks are commonly employed in practice (Christiano et al., 2017; Abdelkareem et al., 2022).
In the special case of linear rewards and preferences over trajectories, while our algorithms differ, our guarantees recover theirs. So, our guarantees are more general; see Remark 2. Moreover, they only consider the setting where the transition kernel is known, while our work can also handle unknown transitions. Finally, in the case of action-based preferences, Zhu et al. (2023) cannot provide guarantees with partial coverage, even under their restriction to linear models. We demonstrate how to achieve meaningful guarantees under partial coverage and a soft margin (Assumption 6). Wirth et al. (2017) provide a survey of Pb RL. Pb RL has received considerable attention in theoretical RL (Yue et al., 2012; Novoseller et al., 2020; Xu et al., 2020; Pacchiano et al., 2021; Chen et al., 2022) but the focus is largely on online Pb RL. To the best of our knowledge, Zhu et al. (2023) is the only previous work to provide theoretical guarantees for offline Pb RL. Offline RL. In offline RL, one of the most critical challenges is addressing the issue of insufficient coverage in the offline data. It is well-known that naive methods are unable to learn the optimal policy in such scenarios (Rashidinejad et al., 2021b). To tackle this problem, numerous algorithms have been proposed with theoretical guarantees (Liu et al., 2020; Kumar et al., 2020; Jin et al., 2021; Published as a conference paper at ICLR 2024 Rashidinejad et al., 2021b; Uehara and Sun, 2021; Li et al., 2022b; Shi et al., 2022; Jin et al., 2020; Xie et al., 2021; Zhan et al., 2022a). The most relevant work is Uehara and Sun (2021), which focuses on offline model-based RL with general function approximation. However, their methods cannot be directly applied to Pb RL since per-step rewards are not observable in our setting. Furthermore, even in the standard RL setting, the construction of confidence intervals differs between our approach and theirs. Another related paper is Cheng et al. (2022), which considers the general offline pessimistic RL framework in the standard setting and also subtracts a reference term in their algorithm, similar to ours. However, our motivations for such reference terms are quite different from theirs. Additional detailed comparisons are given in Section 4.1 and Remark 3. 3 PRELIMINARIES We first introduce our offline Pb RL setting with general function approximation. Markov decision processes. We consider an episodic time-inhomogeneous Markov Decision Process (MDP) denoted by M, which consists of a state space S, an action space A, an initial state distribution P 0 S, and a horizon H N+. At each step h [H 1], we use P h : S A S to denote the ground truth transitions. The ground truth reward function for the entire trajectory is denoted by r : T [0, rmax], where T = (S A)H represents the set of all possible trajectories. Note that r is a trajectory-wise reward, which is more general than state-action-wise rewards commonly considered in standard RL, which is the special case where for some {r h}H h=1 we have r (τ) = PH h=1 r h(sh, ah) for a trajectory τ = (s1, a1, , s H, a H). A history-dependent policy π := {πh}H h=1 is characterized by πh : (S A)h 1 S A, specifying the probability of selecting actions for the agent at each step h [H] based on the entire history. We denote the set of all such history-dependent policies as Πhis. Given a policy π, we define its expected reward with respect to a general reward function r and initial and transition distributions P = {Ph}H 1 h=0 as J(π; r, P) := Eτ (π,P )[r(τ)]. 
Here, Eτ (π,P )[ ] represents the expectation over the trajectory distribution when executing the policy π under the transition P starting from P0. We use Eτ π[ ] or Eπ[ ] to denote the special case when P is the ground truth distribution P := {P h}H 1 h=0 . The optimal policy, denoted π , is the policy that maximizes the expected reward with respect to the true reward r and system dynamics P , i.e., π := arg maxπ Πhis J(π; r , P ). As the true reward function r is dependent on the entire trajectory, the optimal policy π is generally history-dependent. Thus, designing offline Pb RL algorithms that can learn history-dependent policies is crucial. For any policy π, we can define its state-action visitation measure as follows: dπ h(s, a) = Pπ,P (sh = s, ah = a), h [H], where Pπ,P ( ) denotes the distribution of the trajectory when executing policy π in P . We will also use dπ(τ) to denote Pπ,P (τ) for the whole trajectory τ. A policy is Markovian if at each step it depends solely on the current state. When the reward is state-action-wise and the policy is Markovian, we can define the associated Vand Q-functions as V π h (s) = Eπ[PH t=h r t (st, at)|sh = s], h [H], Qπ h(s, a) = Eπ[PH t=h r t (st, at)|sh = s, ah = a], h [H]. It is well-known that when the reward is state-action-wise, the optimal policy π is both Markovian and deterministic (Bertsekas, 2017). Furthermore, we have V π h (s) = supπ V π h (s) and Qπ h (s, a) = supπ Qπ h(s, a) for all h [H]. For brevity, we will use V and Q to represent the optimal state-value function and Q-function, respectively. The advantage function of the optimal policy, denoted by A , is defined to be A h(s, a) = Q h(s, a) V h (s) for all h [H], s S, A A. Offline Preference-based Reinforcement Learning. We focus on the problem of offline Pb RL in this work. Specifically, in the trajectory-based pairwise comparison setting, we are provided with an offline dataset D = {τ n,0, τ n,1, on}N n=1, where τ n,0 = {sn,0 h , an,0 h }H h=1 and τ n,1 = {sn,1 h , an,1 h }H h=1 are i.i.d. sampled from the distributions µ0 and µ1, respectively, and on {0, 1} indicates preference for τ n,1 over τ n,2. We assume it satisfies the following preference model: Assumption 1 (Preference-based model). Given a pair of trajectories (τ 0, τ 1), o {0, 1} satisfies P(o = 1 | τ0, τ1) = P(τ1 is preferred over τ0 | τ0, τ1) = Φ(r (τ1) r (τ0)). where Φ : R [0, 1] is a monotonically increasing link function. Published as a conference paper at ICLR 2024 A commonly used link function is the sigmoid function σ(x) = 1/{1 + exp( x)}, leading to the Bradley-Terry-Luce (BTL) model (Christiano et al., 2017). The objective of offline Pb RL is to learn a high-quality policy bπ Πhis, i.e., with J(πtar; r , P ) J(bπ; r , P ) ϵ where πtar is a target policy we want to compete with (potentially π ). General function approximation. In our paper, we estimate the reward r with general function approximation. We introduce a function class Gr, such as linear functions or neural networks, to approximate the true reward. For each r Gr and trajectory pair (τ 0, τ 1), we denote the induced preference model with respect to r as Pr(o|τ 0, τ 1), defined as Pr(o = 1 | τ 0, τ 1) := Φ(r(τ 1) r(τ 0)). (1) We use bracketing numbers to measure the complexity of {Pr : r Gr}. Definition 1 (ϵ-bracketing number of preferences). We say (g1, g2) where g1, g2 : T T R2 is an ϵ-bracket if g1( | τ 0, τ 1) g2( | τ 0, τ 1) and g1( | τ 0, τ 1) g2( | τ 0, τ 1) 1 ϵ for all trajectory-pairs (τ 0, τ 1). 
The ϵ-bracketing number of Gr, denoted by NGr(ϵ), is the minimal number of ϵ-brackets (gn,1, gn,2)N n=1 needed so that for any r Gr there is a bracket i [N] containing it, meaning gi,1( |τ 0, τ 1) Pr( |τ 0, τ 1) gi,2( |τ 0, τ 1) for all trajectory-pairs (τ 0, τ 1). The ϵ-bracket number is widely used in statistics (van de Geer, 2000) to study MLE and related M-estimates. Particularly, in our setting the bracket number of reward classes will be of the same order as the covering number, another common complexity measure in statistics (Wainwright, 2019), for Pr( |τ 0, τ 1) has only two dimensions. One example for which we can bound the ϵ-bracket number is linear rewards under the BTL model (Pacchiano et al., 2021; Zhu et al., 2023). Proposition 1. Suppose ϕ(τ) 2 R τ T , Gr {τ 7 ϕ(τ), θ : θ 2 B} for some featurization ϕ : T Rd and B > 0, and the link function is Φ( ) = σ( ). Then for any ϵ 1, log NGr(ϵ) O(d log BR The proof is deferred to Appendix A. To handle unknown transitions, we similarly use function classes {GPh}H 1 h=0 to approximate the transition probabilities {P h}H 1 0=1 . Similarly, we use NGPh(ϵ) to denote the ϵ-bracket number of GPh. The formal definition is deferred to Appendix E. 4 TRAJECTORY-BASED PAIRWISE-COMPARISON WITH KNOWN TRANSITION In this section, we present our algorithm and analyze the sample complexity for the trajectory-based pairwise-comparison setting when the ground truth transition P is known. In Sections 5 and 6, we will further explore the unknown transition setting and the action-based comparison setting. 4.1 ALGORITHM Our proposed algorithm, FREEHAND described in Algorithm 1, consists of the following two steps. Confidence set construction via MLE (Lines 2 3). We construct a confidence set for the ground truth reward from the implicit preference feedback. We achieve this by selecting reward models that nearly maximize the log-likelihood of observed data up to a slackness parameter ζ. We will show that the result, R(D), approximates the following confidence set: R (D) := {r Gr : Eτ0 µ0,τ1 µ1[|{r(τ1) r(τ0)} {r (τ1) r (τ0)}|2] ζ} for a certain ξ. Here the distance between r and r is measured using the total variation distance (i.e., ℓ1 norm) of r(τ1) r(τ0) and r (τ1) r (τ0) over the offline data. Distributionally robust policy optimization (Line 4). After constructing the confidence set, we search for the policy that maximizes the policy value under the least favorable reward model, the r R(D) minimizing the policy value J(π; r, P ) minus Eτ µref[r(τ)], where µref is an arbitrary known reference trajectory distribution. It is generally recommended to set µref to µ1, as we will explain later, possibly a sample-average approximation thereof based on {τ 1,1, . . . , τ N,1}. By selecting the least favorable reward model instead of the MLE solution br, we penalize policies that are not well-covered by the offline data. 
Algorithm 1 FREEHAND: oFfline ReinforcemEnt lEarning with HumAN feeDback

1: Input: offline dataset $\mathcal{D}$, slackness parameter $\zeta$, reference distribution $\mu_{\mathrm{ref}}$, true transition $P^\star$.
2: MLE: compute $\hat r = \arg\max_{r \in \mathcal{G}_r} \sum_{n=1}^N \log \mathbb{P}_r(o = o^n \mid \tau^{n,0}, \tau^{n,1})$.
3: Confidence set construction: construct
$$\mathcal{R}(\mathcal{D}) = \Big\{ r \in \mathcal{G}_r : \textstyle\sum_{n=1}^N \log \mathbb{P}_r(o = o^n \mid \tau^{n,0}, \tau^{n,1}) \ge \sum_{n=1}^N \log \mathbb{P}_{\hat r}(o = o^n \mid \tau^{n,0}, \tau^{n,1}) - \zeta \Big\}.$$
4: Distributionally robust planning: return $\hat\pi = \arg\max_{\pi \in \Pi_{\mathrm{his}}} \min_{r \in \mathcal{R}(\mathcal{D})} \big( J(\pi; r, P^\star) - \mathbb{E}_{\tau \sim \mu_{\mathrm{ref}}}[r(\tau)] \big)$.

The need for a reference policy arises because the approximated confidence set measures the uncertainty of the reward difference between two trajectories, $r(\tau^1) - r(\tau^0)$, but it cannot measure the uncertainty of the reward of a single trajectory.

In the following, we compare our algorithm to existing works. Zhu et al. (2023) consider a pessimistic offline RL algorithm for PbRL specialized to the linear reward class, while our FREEHAND can handle general function approximation. Specifically, they construct the confidence set using a feature-covariance-rotated $\ell_2$-ball around the MLE $\hat\theta$, where $\hat r(\tau) = \langle \phi(\tau), \hat\theta \rangle$. In contrast, our confidence set is obtained directly from the log-likelihood objective and is generic. Uehara and Sun (2021) propose a model-based pessimistic offline RL algorithm when we have access to rewards, and the confidence set construction correspondingly differs significantly. Cheng et al. (2022) consider a general offline pessimistic RL framework; in their policy optimization step, they also subtract the value of a reference policy. This similarity is superficial, however, as the motivations are different: we subtract the reference value because we can only measure the difference between the rewards of two trajectories, while their motivation is to obtain a certain robustness result (their Proposition 3).

Remark 1 (Computational Efficiency). Line 4 in FREEHAND is computationally hard in general. Nevertheless, by leveraging a Lagrangian formulation, we can use a Lagrangian multiplier to convert the constraint $r \in \mathcal{R}(\mathcal{D})$ into a regularization term of the objective function and obtain a feasible version of our algorithm in practice. See Appendix B for more details.

4.2 ANALYSIS

To analyze the sample complexity of FREEHAND, we first quantify the discrepancy between the offline data $\mathcal{D}$ and the distribution induced by the target policy $\pi_{\mathrm{tar}}$.

Definition 2 (Concentrability coefficient for preference-based feedback). The concentrability coefficient w.r.t. a reward class $\mathcal{G}_r$, a target policy $\pi_{\mathrm{tar}}$, and a reference distribution $\mu_{\mathrm{ref}}$ is defined as
$$C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}}) := \max\left\{ 0,\ \sup_{r \in \mathcal{G}_r} \frac{\mathbb{E}_{\tau^0 \sim \pi_{\mathrm{tar}},\, \tau^1 \sim \mu_{\mathrm{ref}}}\big[ r^\star(\tau^0) - r^\star(\tau^1) - r(\tau^0) + r(\tau^1) \big]}{\sqrt{\mathbb{E}_{\tau^0 \sim \mu_0,\, \tau^1 \sim \mu_1}\big[ \big| r^\star(\tau^0) - r^\star(\tau^1) - r(\tau^0) + r(\tau^1) \big|^2 \big]}} \right\}.$$

Note that when we choose $\mu_{\mathrm{ref}} = \mu_1$, by Jensen's inequality, the value of $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_1)$ can always be upper bounded by the per-trajectory concentrability coefficient: $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_1) \le C_{\mathrm{tr}}$ for any $\mathcal{G}_r$, where $C_{\mathrm{tr}} := \max_{\tau \in \mathcal{T}} d^{\pi_{\mathrm{tar}}}(\tau) / \mu_0(\tau)$. Moreover, while $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_1)$ becomes $C_{\mathrm{tr}}$ in the worst case (e.g., when $\mathcal{G}_r$ is the set of all functions mapping from $\mathcal{T}$ to $\mathbb{R}$), it can generally be much smaller. For example, when using linear models, it is a relative condition number, as explained in Appendix D.1. Finally, when $\mu_{\mathrm{ref}} = d^{\pi_{\mathrm{tar}}}$, our coefficient becomes 0. This implies that $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}})$ can be small when $d^{\pi_{\mathrm{tar}}}$ and $\mu_{\mathrm{ref}}$ are close.
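To make the estimation steps of Algorithm 1 (lines 2-3) concrete, the following is a minimal sketch under the BTL (sigmoid) link with a linear reward model $r_\theta(\tau) = \langle \phi(\tau), \theta \rangle$. The featurization, optimizer settings, slackness value, and the synthetic data at the end are illustrative assumptions rather than choices prescribed by the paper; the robust planning step (line 4) is treated separately via the Lagrangian relaxation of Appendix B.

```python
import numpy as np

# Minimal sketch of FREEHAND lines 2-3 under the BTL model with a linear
# reward r_theta(tau) = <phi(tau), theta>. The featurization phi, step size,
# and iteration count are illustrative assumptions, not prescribed values.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(theta, Phi0, Phi1, o):
    """Sum_n log P_theta(o_n | tau_n0, tau_n1), where
    P_theta(o = 1 | tau0, tau1) = sigmoid(r(tau1) - r(tau0))."""
    diff = (Phi1 - Phi0) @ theta          # r(tau1) - r(tau0) for each pair
    p1 = sigmoid(diff)
    return np.sum(o * np.log(p1 + 1e-12) + (1 - o) * np.log(1 - p1 + 1e-12))

def fit_mle(Phi0, Phi1, o, lr=0.1, iters=2000):
    """Line 2: maximize the preference log-likelihood by gradient ascent."""
    theta = np.zeros(Phi0.shape[1])
    X = Phi1 - Phi0
    for _ in range(iters):
        grad = X.T @ (o - sigmoid(X @ theta))   # gradient of the log-likelihood
        theta += lr * grad / len(o)
    return theta

def in_confidence_set(theta, theta_mle, Phi0, Phi1, o, zeta):
    """Line 3: r_theta is kept iff its log-likelihood is within zeta of the MLE's."""
    return (log_likelihood(theta, Phi0, Phi1, o)
            >= log_likelihood(theta_mle, Phi0, Phi1, o) - zeta)

# Toy usage with synthetic preference data generated according to Assumption 1.
rng = np.random.default_rng(0)
d, N = 5, 500
theta_star = rng.normal(size=d)
Phi0, Phi1 = rng.normal(size=(N, d)), rng.normal(size=(N, d))
o = rng.binomial(1, sigmoid((Phi1 - Phi0) @ theta_star))

theta_hat = fit_mle(Phi0, Phi1, o)
print(in_confidence_set(theta_star, theta_hat, Phi0, Phi1, o, zeta=10.0))
```

In practice the membership test would be applied to candidate rewards proposed during the planning step, or replaced altogether by the Lagrangian penalty discussed in Remark 1.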
While concentrability coefficients have been used in offline RL with explicit reward feedback (Chen and Jiang, 2019b; Song et al., 2022), the coefficient in Definition 2 is specific to the setting where feedback comes in the form of preferences. In the following PAC analysis, we further assume the reward class $\mathcal{G}_r$ is realizable and bounded.

Assumption 2 (Realizability). We have $r^\star \in \mathcal{G}_r$.

Assumption 3 (Boundedness). We have $0 \le r(\tau) \le r_{\max}$ for all $r \in \mathcal{G}_r$ and $\tau \in \mathcal{T}$.

Theorem 1. For any $\delta \in (0, 1]$, let $\zeta = c_{\mathrm{MLE}} \log(\mathcal{N}_{\mathcal{G}_r}(1/N)/\delta)$ where $c_{\mathrm{MLE}} > 0$ is a universal constant. Then under Assumptions 1, 2 and 3, with probability $1 - \delta$, we have
$$J(\pi_{\mathrm{tar}}; r^\star, P^\star) - J(\hat\pi; r^\star, P^\star) \le c \sqrt{\frac{C_r^2(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}})\, \kappa^2 \log(\mathcal{N}_{\mathcal{G}_r}(1/N)/\delta)}{N}}, \qquad (2)$$
where $c > 0$ is a universal constant and $\kappa = \big( \inf_{x \in [-r_{\max}, r_{\max}]} \Phi'(x) \big)^{-1}$.

Theorem 1 indicates that FREEHAND can learn an $\epsilon$-optimal policy relative to $\pi_{\mathrm{tar}}$ with a sample complexity of
$$N = \widetilde{O}\left( \frac{C_r^2(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}})\, \kappa^2 \log(\mathcal{N}_{\mathcal{G}_r}(1/N)/\delta)}{\epsilon^2} \right).$$

Next we provide a detailed explanation of this sample complexity. Firstly, $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}})$ represents the extent to which the dataset $\mathcal{D}$ covers the target policy $\pi_{\mathrm{tar}}$. To obtain a non-vacuous PAC guarantee, we only require the dataset $\mathcal{D}$ to cover the target policy $\pi_{\mathrm{tar}}$ (i.e., $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}}) < \infty$). The distributionally robust optimization step plays a crucial role in obtaining this guarantee under partial coverage. In particular, invoking the aforementioned third property of $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}})$, when the target policy is chosen so that $d^{\pi_{\mathrm{tar}}} = \mu_{\mathrm{ref}}$, (2) reduces to
$$J(\mu_{\mathrm{ref}}; r^\star, P^\star) - J(\hat\pi; r^\star, P^\star) \le 0. \qquad (3)$$
This encourages us to choose $\mu_{\mathrm{ref}} = \mu_1$ (or $\mu_0$), as it ensures that our performance is at least as large as the performance associated with the offline data. Secondly, $\log \mathcal{N}_{\mathcal{G}_r}(1/N)$ measures the complexity of the function class $\mathcal{G}_r$; for example, for linear models it is $O(d)$. We refer the reader to van de Geer (2000) for bracketing number computations for more general classes. Thirdly, $\kappa$ captures the non-linearity of the link function $\Phi$, which determines the difficulty of estimating the reward from human preferences. This dependence on $\kappa$ is present in the existing PbRL literature, both in online settings (Pacchiano et al., 2021; Chen et al., 2022) and offline settings (Zhu et al., 2023).

Remark 2 (Comparison to Zhu et al. (2023)). By specializing our result to the linear model, we recover the result of Zhu et al. (2023). Specifically, the bracketing number is calculated in Proposition 1, and $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_{\mathrm{ref}})$ reduces to a relative condition number. The details are deferred to Appendix D.1.

4.3 DISCUSSION OF THE CONCENTRABILITY COEFFICIENT

In the worst-case scenario (i.e., when $\mathcal{G}_r$ is the set of all functions mapping from $\mathcal{T}$ to $\mathbb{R}$), the value of $C_r(\mathcal{G}_r, \pi_{\mathrm{tar}}, \mu_1)$ reduces to the per-trajectory concentrability coefficient $C_{\mathrm{tr}}$. The per-trajectory concentrability coefficient is generally larger than the per-step concentrability coefficient $C_{\mathrm{st}}$ commonly used in the general offline RL literature (Xie et al., 2021; Uehara and Sun, 2021; Zhan et al., 2022a). Specifically, $C_{\mathrm{st}}$ is defined as $C_{\mathrm{st}} := \max_{s,a,h} d^{\pi_{\mathrm{tar}}}_h(s, a) / \mu_{0,h}(s, a)$, where $\mu_{0,h}(s, a)$ denotes the marginal distribution of $\mu_0$ at step $h$. In this section, we show that the dependence on the per-trajectory concentrability coefficient is necessary in our offline PbRL context. Intuitively, this is because our PbRL setting involves reward functions defined over entire trajectories, reflecting the fact that the human feedback is itself trajectory-based.
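As a quick numerical illustration of this point (a hypothetical toy construction for intuition only, not the one used in the formal proofs below), consider a single-state chain where the target policy always plays one action while the behavior policy plays it with probability $1/C$ at every step: the per-step density ratio stays at $C$, but the trajectory-level ratio compounds to $C^H$.

```python
from itertools import product

# Hypothetical toy example: a single state, two actions {0, 1}, horizon H.
# The target policy always plays action 0; the behavior policy mu_0 plays
# action 0 with probability 1/C at every step, independently across steps.
C, H = 2.0, 8
p_target = [1.0, 0.0]
p_behavior = [1.0 / C, 1.0 - 1.0 / C]

def traj_prob(policy, actions):
    prob = 1.0
    for a in actions:
        prob *= policy[a]
    return prob

# Per-step concentrability: C_st = max_{h,a} d^tar_h(a) / mu_{0,h}(a).
C_st = max(p_target[a] / p_behavior[a] for a in (0, 1) if p_target[a] > 0)

# Per-trajectory concentrability: C_tr = max_tau d^tar(tau) / mu_0(tau).
C_tr = max(traj_prob(p_target, tau) / traj_prob(p_behavior, tau)
           for tau in product((0, 1), repeat=H)
           if traj_prob(p_target, tau) > 0)

print(C_st)   # C   = 2.0
print(C_tr)   # C^H = 256.0
```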
In the next proposition, we first show that the per-trajectory concentrability coefficient Ctr can be exponentially larger than the per-step concentrability coefficient Cst. Proposition 2. For any S 1, A 2, H 1, C 1, there exists an MDP M with horizon H, a policy πtar and a data distribution µ0 such that |S| = S, |A| = A and Cst = C while Ctr = CH. Proposition 2 indicates that Ctr can be significantly larger than Cst. A natural question arises as to whether we can obtain suboptimality guarantees using Cst. Unfortunately, the following lower bounds reveal that per-step concentrability is not sufficient to guarantee efficient learning in trajectory-based Pb RL setting, even when the reward function is defined over state-action pairs: Theorem 2. Set πtar = π . Then, for any C > 1 and H 2, there exists a dataset distribution µ1 such that we have inf bπ sup (M,µ0) Θst(C) ED[J(π ; r , P ) J(bπ; r , P )] min C 1, 1 , where bπ is any mesurable function of the data D (and knows the information of µ1). Θst(C) is the set of all MDPs with per-step reward, offline distribution (M, µ0) such that Cst C. Note ED is taken with respect to the randomness in D. Published as a conference paper at ICLR 2024 Algorithm 2 FREEHAND-transition Input: offline dataset D, slackness parameter ζ, ζPh, reference distribution µref MLE for reward: compute br = argmaxr Gr PN n=1 log Pr(o = on|τ n,1, τ n,0) MLE for transition: compute b Ph = argmax Ph GPh PN n=1 P1 i=0 log Ph(sn,i h+1|sn,i h , an,i h ) Confidence set construction: for 0 h H 1, construct R(D) = r Gr : PN n=1 log Pr(o = on|τ n,0, τ n,1) PN n=1 log Pbr(o = on|τ n,0, τ n,1) ζ . Ph(D) = Ph GPh : i=0 log Ph(sn,i h+1|sn,i h , an,i h ) i=0 log b Ph(sn,i h+1|sn,i h , an,i h ) ζPh Distributionally robust planning: return bπ = argmaxπ Πhis minr R(D),Ph Ph(D) J π; r, {Ph}H 1 h=0 ) Eτ µref[r(τ)]. Theorem 2 indicates that learning in our setting is intrinsically hard due to trajectory-based feedback, even if we restrict the reward function class. In addition, we can show that scaling with Ctr is necessary in our setting: Theorem 3. Set πtar = π . Then for any C > 1 and H 1, there exists a dataset distribution µ1 such that we have inf bπ sup (M,µ0) Θtr(C) ED[J(π ; r , P ) J(bπ; r , P )] min C 1, where bπ is any mesurable function of the data D (and knows the information of µ1). Θtr is the set of all MDP, offline distribution (M, µ0) such that Ctr C. Note ED is taken with respect to the randomness in D. Note that when µ1 is known, we can set µref = µ1 in Algorithm 1 and then Cr(Gr, πtar, µ1) Ctr, which implies the sample complexity in Theorem 1 indeed nearly matches this lower bound with respect to Ctr and N when N is sufficiently large. In summary, Theorem 2 and Theorem 3 imply that the scaling with the per-trajectory concentrability coefficient is essential in the trajectory-based pairwise-comparison setting, and it cannot be relaxed to the per-step concentrability without additional assumptions, such as on the reward structure. To the best of our knowledge, this is the first theoretical result indicating that trajectory-wise feedback is intrinsically harder to learn than step-wise feedback in offline Pb RL. 5 TRAJECTORY-BASED COMPARISON WITH UNKNOWN TRANSITION We extend the setting presented in Section 4 to the scenario where the transition function P is unknown. The algorithm is described in Algorithm 2. Compared to Algorithm 1, we simply added a similar step to handle unknown transitions. Hereafter, we use the convention P0( | s, a) := P0( ). 
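To illustrate the additional transition-estimation step in Algorithm 2, here is a minimal tabular sketch: the per-step MLE reduces to empirical frequencies, and the likelihood-based confidence set keeps every candidate model whose log-likelihood is within the slack of the MLE's. The tabular restriction, the smoothing constant, and the synthetic data are illustrative assumptions; the paper itself allows general classes of transition models.

```python
import numpy as np

# Minimal tabular sketch of the transition-estimation step of Algorithm 2.
# Restricting to finite S, A is an assumption for illustration; the paper
# allows general function classes for the per-step transitions.

def fit_transition_mle(transitions, S, A, smoothing=1e-3):
    """MLE of P_h(s'|s,a) from (s, a, s') triples observed at step h;
    for tabular models the MLE is the empirical frequency."""
    counts = np.full((S, A, S), smoothing)   # small smoothing keeps log-likelihoods finite
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1.0
    return counts / counts.sum(axis=-1, keepdims=True)

def log_likelihood(P, transitions):
    """Sum of log P(s'|s,a) over the observed step-h triples."""
    return sum(np.log(P[s, a, s_next]) for s, a, s_next in transitions)

def in_confidence_set(P_candidate, P_mle, transitions, zeta_P):
    """A candidate model is kept iff its log-likelihood is within zeta_P of the MLE's."""
    return log_likelihood(P_candidate, transitions) >= log_likelihood(P_mle, transitions) - zeta_P

# Toy usage: step-h triples pooled from both trajectories tau^{n,0} and tau^{n,1}.
rng = np.random.default_rng(0)
S, A, N = 3, 2, 400
P_true = rng.dirichlet(np.ones(S), size=(S, A))
data = []
for _ in range(N):
    s, a = rng.integers(S), rng.integers(A)
    data.append((s, a, rng.choice(S, p=P_true[s, a])))

P_hat = fit_transition_mle(data, S, A)
print(in_confidence_set(P_true, P_hat, data, zeta_P=10.0))
```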
Our sample complexity will depend on the following additional concentration coefficient: Definition 3 (Concentrability coefficient for the transition). The concentrability coefficient w.r.t. transition classes {GPh} and a target policy πtar is defined as CP ({GPh}, πtar) := max h:0 h H 1 sup Ph GPh E(s,a) dπtar h [ Ph( | s, a) P h( | s, a) 1] q E(s,a) (µ0,h/2+µ1,h/2)[ Ph( | s, a) P h( | s, a) 2 1] . Note this is always upper-bounded by the density-ratio-based concentrability coefficient, CP ({GPh}, πtar) sup(s,a,h) S A [H] dπtar h (s,a) µ0,h(s,a)/2+µ1,h(s,a)/2. We also assume the transition classes {GPh}H 1 h=0 are realizable: Assumption 4 (Realizability). Suppose that we have P h GPh for all h where 0 h H 1. In addition, any choice Ph GPh for 0 h H 1 are valid transition distributions. Then with the above assumptions, we have the following theorem to characterize the sample complexity when the transition is unknown: Published as a conference paper at ICLR 2024 Algorithm 3 FREEHAND-action 1: Input: offline datset D. 2: MLE: compute b Ah = argmax Ah GAh PN n=1 log PAh(o = on h | sn h, an,0 h , an,1 h ) h [H] 3: Greedy policy: return bπh(s) = argmaxa A b Ah(s, a) Theorem 4. For any δ (0, 1], let ζ = c MLE log(NGr(1/N)/δ),ζPh = c P log(HNGPh(1/N)/δ) where c MLE, c P > 0 are universal constants, then under Assumption 1,2,3 and 4, we have J(πtar; r , P ) J(bπ; r , P ) c C2r(Gr, πtar, µref)κ2 log(NGr(1/N)/δ) c C2 P ({GPh}, πtar) log(HNP (1/N)/δ) where c > 0 and κ are the same as Theorem 1 and NP := max0 h H 1 NGPh. Compared to Theorem 1, we introduce an additional term in our guarantee to account for the unknown transitions. Once again, our result demonstrates that the learned policy can achieve performance comparable to any target policy πtar covered by the offline data, i.e., Cr(Gr, πtar, µref) < , CP ({GPh}, πtar) < . Remark 3 (Comparison to Uehara and Sun (2021) ). Like us, Uehara and Sun (2021) proposed a model-based RL algorithm that works under partial coverage, but in the standard RL setting and with a known state-action-wise reward function. See the detailed comparison in Appendix D.3 6 ACTION-BASED COMPARISON Next, we turn our attention to the action-based comparison setting (Ramachandran and Amir, 2007; Zhu et al., 2023), where human evaluators provide preferences between pairs of actions instead of pairs of trajectories. In this section, we assume that the reward function r is state-action-wise: r (τ) = PH h=1 r h(sh, ah) for τ = (s1, a1, , s H, a H). And, we consider a preference model based on Q . Setting. We have datasets D = {Dh}H h=1 with Dh = {(sn h, an,0 h , an,1 h , on h)}N n=1 for each h [H], where each sample is drawn i.i.d. from the distribution sn h µh, an,0 h µ0,h( | sn h), an,1 h µ1,h( | sn h) and on h {0, 1} indicates preference for a1,n h over a0,n h in the state sn h. We assume it satisfies the following preference model: Assumption 5 (Action-based comparison model). Given a pair of actions a0 h, a1 h and state sh, o {0, 1} satisfies P(oh = 1 | sh, a0 h, a1 h) = Φ(Q h(sh, a1 h) Q h(sh, a0 h)). Here, the aforementioned distribution can be equivalently expressed as P(on h = 1 | sn h, an,0 h , an,1 h ) = Φ(A h(sn h, an,1 h ) A h(sn h, an,0 h )), where A denotes the optimal advantage function. Consequently, we introduce general function classes GAh to estimate the optimal advantage function A h. 
In addition, for each Ah GAh and (s, a0, a1) S A A, we use PAh( | s, a0, a1) to represent the human preference model with respect to Ah, defined as PAh(o = 1 | s, a0, a1) := Φ(Ah(s, a1) Ah(s, a0)). We again use the ϵ-bracket number of such advantage function classes to quantify their complexity, denoted as NGAh. The full formal definition is provided in Appendix E. Algorithm. Our algorithm comprises two steps. In the first step (Line 2), our objective is to estimate the optimal advantage function using MLE. In the second step (Line 3), we determine the policy by selecting the action with the highest advantage value based on the learned advantage function. Analysis. Now we show that FREEHAND-action can learn a near-optimal policy as long as offline data covers the optimal policy. Our analysis depends on the assumption on the margin of Q : Assumption 6 (Soft margin). There exists α0 R+, β (0, ] such that for all a A, h [H], α > 0, we have Pπ ,P (0 < |Q h(sh, π (sh)) Q h(sh, a)| < α) (α/α0)β. The soft margin is widely used in the literature on classification, decision making, and RL (Audibert and Tsybakov, 2007; Perchet and Rigollet, 2013; Luedtke and Chambaz, 2020; Hu et al., 2021; Published as a conference paper at ICLR 2024 2022; Uehara et al., 2023). Note, when the optimal Q function satisfies a gap (as in Simchowitz and Jamieson, 2019; Wu et al., 2022), the soft margin assumption holds with β = . Next, we introduce the concentrability coefficient for the action-based comparison setting, which is defined as follows. Definition 4 (concentrability coefficient for action-based comparison). Cact := sup h [H],Ah GAh E(s,a0) dπ h ,a1 Unif( |s)[l(Ah, s, a0, a1)] Es µh,a0 µ0,h( |s),a1 µ1,h( |s)[l(Ah, s, a0, a1)], where l(Ah, s, a0, a1) := |A h(s, a0) A h(s, a1) Ah(s, a0) + Ah(s, a1)|2 and Unif( | s) is the uniform policy over A. We observe that Cact sup h [H],s S dπ h (s) µh(s) sup h [H],s S,a0 A π h(a0 | s) µ0,h(a0 | s) |A| sup h [H],s S,a1 A 1 µ1,h(a1 | s) Based on this bound, we can consider simple sufficient conditions for Cact to be finite. Firstly, regarding the first term, it is sufficient for the dataset distribution µh to cover the states visited by the optimal policy π , denoted as dπ h . Regarding the second term, we require µ0,h to cover π h. Additionally, the third term can be upper bounded when µ1,h can cover the whole action space. This is mild because s S; µh(s) > 0 is not controllable to the learner; but (s, a) S A; µ1,h(a | s) > 0 is controllable to the learner in the data-collection process. To summarize, Cact < primarily requires partial coverage over the state space with respect to the optimal policy, which is preferable in practical applications where S can be very large. Additionally, we introduce several assumptions on the function classes similar to those in Section 4. Assumption 7. For all h [H], we have A h GAh. Assumption 8. For all h [H] and Ah GAh, we have |Ah(s, a)| bmax for all (s, a) S A. With the aforementioned assumptions, we can establish the sample complexity. Theorem 5. Under Assumption 5,6,7 and 8, we have with probability at least 1 δ that J(π ; r , P ) J(bπ; r , P ) c H|A| 2 β+2 κ2 ACact log(HNGA(1/N)/δ) where NGA := maxh [H] NGAh and κA = 1 infx [ bmax,bmax] Φ (x). Theorem 5 suggests that FREEHAND-action can learn a near-optimal policy as long as Cact takes a finite value under a soft margin. 
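To make FREEHAND-action (Algorithm 3) concrete, the following is a minimal tabular sketch for a single step h under the sigmoid link of Assumption 5: the advantage table is fit by maximizing the preference log-likelihood, and the returned policy is greedy with respect to the fitted advantage. The tabular parameterization, the gradient-ascent settings, and the synthetic data are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np

# Minimal tabular sketch of FREEHAND-action (Algorithm 3) with the sigmoid link.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_advantage_mle(data_h, S, A, lr=0.5, iters=2000):
    """Line 2: fit A_h(s, a) by maximizing
    sum_n log P(o_n | s_n, a_n0, a_n1), with P(o=1|s,a0,a1) = sigmoid(A(s,a1) - A(s,a0))."""
    Adv = np.zeros((S, A))
    s, a0, a1, o = (np.array(x) for x in zip(*data_h))
    for _ in range(iters):
        resid = o - sigmoid(Adv[s, a1] - Adv[s, a0])   # derivative of the log-likelihood
        grad = np.zeros_like(Adv)
        np.add.at(grad, (s, a1), resid)
        np.add.at(grad, (s, a0), -resid)
        Adv += lr * grad / len(o)
    return Adv

def greedy_policy(Adv):
    """Line 3: pi_h(s) = argmax_a A_h(s, a)."""
    return np.argmax(Adv, axis=1)

# Toy usage for a single step h with synthetic action-pair preferences.
rng = np.random.default_rng(0)
S, A, N = 4, 3, 2000
A_star = rng.normal(size=(S, A))
data_h = []
for _ in range(N):
    s, a0, a1 = rng.integers(S), rng.integers(A), rng.integers(A)
    o = rng.binomial(1, sigmoid(A_star[s, a1] - A_star[s, a0]))
    data_h.append((s, a0, a1, o))

Adv_hat = fit_advantage_mle(data_h, S, A)
print(greedy_policy(Adv_hat), np.argmax(A_star, axis=1))
```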
Specifically, when a hard margin is imposed (i.e., β = ), FREEHAND-action can learn an ϵ-optimal policy with a sample complexity of N = e O(1/ϵ), which is faster than a typical rate e O(1/ϵ2). As mentioned earlier, the quantity Cact represents the extent to which the distribution induced by the optimal policy is covered by the offline data. Therefore, there is no need for a potentially stringent condition that requires the offline data to cover the entire state space like Zhu et al. (2023). Furthermore, our guarantee is designed to overcome the limitations of existing approaches. In Theorem 1, our upper-bound is influenced by the parameter κ. When using a common sigmoid link function, this parameter scales with Θ(exp(rmax)). As a result, in dense reward settings where rmax scales with H, this scaling factor may lead to an explicit dependence of Θ(exp(H)). Similar observations have been made in previous works (Zhu et al., 2023; Pacchiano et al., 2021; Chen et al., 2022). However, even if rmax scales with H, it is known that the ℓ -norm of the advantage function, denoted as bmax, can take much smaller values (Ross et al., 2011; Agarwal et al., 2019) Hence, we can avoid the explicit dependence on Θ(exp(H)). 7 CONCLUSIONS We propose the first algorithm for trajectory-wise Pb RL with general function approximation and under partial coverage. We establish lower bounds that explain the differences between our Pb RL model and standard RL with direct reward feedback. Moreover, we extend our algorithm to unknown transitions and to preference feedback over actions, all while maintaining strong guarantees under partial coverage. Published as a conference paper at ICLR 2024 Abdelkareem, Y., Shehata, S., and Karray, F. (2022). Advances in preference-based reinforcement learning: A review. In 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2527 2532. IEEE. Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. Technical report. Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. The Annals of statistics, 35(2):608 633. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. ar Xiv preprint ar Xiv:2204.05862. Bertsekas, D. P. (2017). Dynamic programming and optimal control (4th edition). Athena Scientific. Brown, D., Goo, W., Nagarajan, P., and Niekum, S. (2019). Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pages 783 792. PMLR. Chen, J. and Jiang, N. (2019a). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042 1051. PMLR. Chen, J. and Jiang, N. (2019b). Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 1042 1051. Chen, X., Zhong, H., Yang, Z., Wang, Z., and Wang, L. (2022). Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, pages 3773 3793. PMLR. Cheng, C.-A., Xie, T., Jiang, N., and Agarwal, A. (2022). Adversarially trained actor critic for offline reinforcement learning. 
In International Conference on Machine Learning, pages 3852 3878. PMLR. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30. Glaese, A., Mc Aleese, N., Tr ebacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. ar Xiv preprint ar Xiv:2209.14375. Hu, Y., Kallus, N., and Mao, X. (2022). Fast rates for contextual linear optimization. Management Science, 68(6):4236 4245. Hu, Y., Kallus, N., and Uehara, M. (2021). Fast rates for the regret of offline reinforcement learning. In Conference on Learning Theory. PMLR. Jin, Y., Yang, Z., and Wang, Z. (2020). Is pessimism provably efficient for offline rl? ar Xiv preprint ar Xiv:2012.15085. Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084 5096. PMLR. Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. (2020). Morel: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 21810 21823. Curran Associates, Inc. Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. In Conference on Neural Information Processing Systems (Neur IPS). Published as a conference paper at ICLR 2024 Li, G., Ma, C., and Srebro, N. (2022a). Pessimism for offline linear contextual bandits using lp confidence sets. ar Xiv preprint ar Xiv:2205.10671. Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022b). Settling the sample complexity of model-based offline reinforcement learning. ar Xiv preprint ar Xiv:2204.05275. Liu, H., Sferrazza, C., and Abbeel, P. (2023). Languages are rewards: Hindsight finetuning using human feedback. ar Xiv preprint ar Xiv:2302.02676. Liu, Q., Chung, A., Szepesvári, C., and Jin, C. (2022). When is partially observable reinforcement learning not scary? ar Xiv preprint ar Xiv:2204.08967. Liu, Y., Swaminathan, A., Agarwal, A., and Brunskill, E. (2020). Provably good batch reinforcement learning without great exploration. ar Xiv preprint ar Xiv:2007.08202. Luedtke, A. and Chambaz, A. (2020). Performance guarantees for policy learning. In Annales de l IHP Probabilites et statistiques, volume 56, page 2162. NIH Public Access. Mac Glashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., and Littman, M. L. (2017). Interactive learning from policy-dependent human feedback. In International Conference on Machine Learning, pages 2285 2294. PMLR. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. ar Xiv preprint ar Xiv:2112.09332. Novoseller, E., Wei, Y., Sui, Y., Yue, Y., and Burdick, J. (2020). Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 1029 1038. PMLR. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744. Pacchiano, A., Saha, A., and Lee, J. (2021). 
Dueling rl: reinforcement learning with trajectory preferences. ar Xiv preprint ar Xiv:2111.04850. Perchet, V. and Rigollet, P. (2013). The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693 721. Ramachandran, D. and Amir, E. (2007). Bayesian inverse reinforcement learning. Urbana, 51:61801. Ramamurthy, R., Ammanabrolu, P., Brantley, K., Hessel, J., Sifa, R., Bauckhage, C., Hajishirzi, H., and Choi, Y. (2022). Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. ar Xiv preprint ar Xiv:2210.01241. Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021a). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. ar Xiv preprint ar Xiv:2103.12021. Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021b). Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702 11716. Rigter, M., Lacerda, B., and Hawes, N. (2022). Rambo-rl: Robust adversarial model-based offline reinforcement learning. ar Xiv preprint ar Xiv:2204.12581. Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627 635. Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. (2022). Pessimistic q-learning for offline reinforcement learning: Towards optimal sample complexity. ar Xiv preprint ar Xiv:2202.13890. Published as a conference paper at ICLR 2024 Shin, D., Dragan, A. D., and Brown, D. S. (2023). Benchmarks and algorithms for offline preferencebased reward learning. ar Xiv preprint ar Xiv:2301.01392. Simchowitz, M. and Jamieson, K. G. (2019). Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32. Song, Y., Zhou, Y., Sekhari, A., Bagnell, J. A., Krishnamurthy, A., and Sun, W. (2022). Hybrid rl: Using both offline and online data can make rl efficient. ar Xiv preprint ar Xiv:2210.06718. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008 3021. Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2023). Refined value-based offline rl under realizability and partial coverage. ar Xiv preprint ar Xiv:2302.02392. Uehara, M. and Sun, W. (2021). Pessimistic model-based offline rl: Pac bounds and posterior sampling under partial coverage. In ar Xiv preprint ar Xiv:2107.06226. van de Geer, S. (2000). Empirical Processes in M-estimation, volume 6. Cambridge university press. Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press. Warnell, G., Waytowich, N., Lawhern, V., and Stone, P. (2018). Deep tamer: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI conference on artificial intelligence, volume 32. Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al. (2017). A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1 46. Wu, J., Braverman, V., and Yang, L. (2022). Gap-dependent unsupervised exploration for reinforcement learning. 
In International Conference on Artificial Intelligence and Statistics, pages 4109 4131. PMLR. Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. (2021). Recursively summarizing books with human feedback. ar Xiv preprint ar Xiv:2109.10862. Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for offline reinforcement learning. ar Xiv preprint ar Xiv:2106.06926. Xu, Y., Wang, R., Yang, L., Singh, A., and Dubrawski, A. (2020). Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33:18784 18794. Yin, M. and Wang, Y.-X. (2021). Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34:4065 4078. Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. (2020). Mopo: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, volume 33, pages 14129 14142. Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. (2012). The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 1556. Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. (2022a). Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730 2775. PMLR. Zhan, W., Uehara, M., Sun, W., and Lee, J. D. (2022b). Pac reinforcement learning for predictive state representations. ar Xiv preprint ar Xiv:2207.05738. Zhu, B., Jiao, J., and Jordan, M. I. (2023). Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. ar Xiv preprint ar Xiv:2301.11270. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. ar Xiv preprint ar Xiv:1909.08593. Published as a conference paper at ICLR 2024 A PROOF OF PROPOSITION 1 Let F denote the function class {fr : fr(τ 0, τ 1) = Pr(o = 1|τ 0, τ 1), r Gr}. Let IF(ϵ) denote the ϵ-bracket number with respect to ℓ -norm, i.e., the minimum integer M such that there exist M functions {f i}M i=1 such that for each fr F, we have supτ 0,τ 1 |fr(τ 0, τ 1) f i(τ 0, τ 1)| ϵ for some i [M]. Then we know there exists a set of function F with |F| = IF(ϵ/4) such that for each fr F, there exists f F satisfying sup τ 0,τ 1 |fr(τ 0, τ 1) f(τ 0, τ 1)| ϵ/4. Now we construct a bracket (g1 f) defined as follows: f(o = 1|τ 0, τ 1) = f(τ 0, τ 1) ϵ/4, g1 f(o = 0|τ 0, τ 1) = 1 f(τ 0, τ 1) ϵ/4, f(o = 1|τ 0, τ 1) = f(τ 0, τ 1) + ϵ/4, g2 f(o = 0|τ 0, τ 1) = 1 f(τ 0, τ 1) + ϵ/4. Then clearly we have g1 f( |τ 0, τ 1) Pr( |τ 0, τ 1) g2 f( |τ 0, τ 1) and g1 f( |τ 0, τ 1) f( |τ 0, τ 1) 1 ϵ. This implies that NGr(ϵ) IF(ϵ/4). Now we only need to bound IF(ϵ/4). Consider θ and θ with θ θ 2 ϵ1 and let r (r ) denote the reward ϕ, θ ( ϕ, θ ). Then we know for all τ, |r(τ) r (τ)| Rϵ1. Fix the trajectory pair (τ 0, τ 1). Without loss of generality, we assume exp(r(τ 0)) + exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)). Then we have exp(r(τ 0)) + exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)) exp(Rϵ1) exp(r(τ 0)) + exp(r(τ 1)) . On the other hand, we have |fr(τ 0, τ 1) fr (τ 0, τ 1)| exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)) exp(r (τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)) exp(r(τ 0)) + exp(r(τ 1)) . 
Therefore, if exp(r(τ 1)) exp(r (τ 0))+exp(r (τ 1)) exp(r (τ 1)) exp(r(τ 0))+exp(r(τ 1)) 0, then we have exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)) exp(r (τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp(Rϵ1) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp( Rϵ1) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) =(exp(Rϵ1) exp( Rϵ1)) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) . Otherwise, we have exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)) exp(r (τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp(Rϵ1) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) =(exp(Rϵ1) 1) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) . Therefore we have |fr(τ 0, τ 1) fr (τ 0, τ 1)| Published as a conference paper at ICLR 2024 (exp(Rϵ1) exp( Rϵ1)) exp(r(τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp(r (τ 0)) + exp(r (τ 1)) exp(r(τ 0)) + exp(r(τ 1)) exp(2Rϵ1) 1. This implies that for any ϵ 1, log IF(ϵ/4) log Id,B 2 ln 2 R ϵ O d log BR where Id,B( ) is the covering number of a d-dimensional ball centered at the origin with radius B with respect to ℓ2-norm and the last step is from Wainwright (2019). This concludes our proof. B FEASIBLE IMPLEMENTATION OF FREEHAND In this section we show how to implement the robust optimization step (Line 4) of FREEHAND in practice. Our idea is inspired by standard offline RL (Rigter et al., 2022) where the authors rely on Lagrangian formulation to make the theoretical algorithm CPPO (Uehara and Sun, 2021) practical enough to achieve good performance on the D4RL datasets. We believe the empirical insights provided in (Rigter et al., 2022) can be applied here as well. First for the Lagrangian relaxation, the original inner minimization problem in Line 4 of FREEHAND is min r R(D) J(π; r, P ) Eτ µref[r(τ)]. Note that the only constraint is r R(D). Then by introducing a Lagrangian multiplier β, we can convert such constrained minimization problem into an unconstrained regularized minimization problem: min r J(π; r, P ) Eτ µref[r(τ)] β n=1 log Pr(o = on|τ n,0, τ n,1). Consequently, Line 4 in FREEHAND can be converted to the following unconstrained regularized max-min problem: max π min r L(π, r) := J(π; r, P ) Eτ µref[r(τ)] β n=1 log Pr(o = on|τ n,0, τ n,1). Since now we are facing an unregularized problem, the most common way to solve L(π, r) in practice is gradient ascent-descent. Suppose π and r are parametrized by θ and λ (usaully neural networks). Then gradient ascent-descent requires us to compute an unbiased stochastic gradient with respect to θ and λ respectively. Fortunately, this can be easy to achieve in practice. On the one hand, for the gradient of θ, we only need to compute θJ(πθ; r, P ). This task has been thoroughly discussed in the literature of policy gradient and one example is REINFORCE, which samples a trajectory τ by executing πθ in P and then the estimated graidient can be expressed as h=1 θπθ,h(ah|sh), where (sh, ah) is the h-step of τ. On the other hand, for the gradient of λ, we only need to sample independent trajecotories τ by executing πθ in P and τ from µref and an index i [N]. Then the unbiased estimated gradient can be directly written as λrλ(τ ) λrλ(τ ) β λ log Prλ(o = oi|τ i,0, τ i,1). Therefore, with the above estimated gradients, we can then run graident ascent-descent happily to solve maxπ minr L(π, r) in practice. Published as a conference paper at ICLR 2024 Furthermore, to compute Eτ µref[r(τ)] in Line 3 in practice, we can sample N0 trajectories {τ i}N0 i=1 from µref and then use 1 N0 PN0 i=1 r(τ i) as a surrogate of Eτ µref[r(τ)]. 
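Putting these pieces together, the following is a minimal sketch of the resulting gradient ascent-descent loop for a tabular softmax policy and a linear trajectory reward $r_\lambda(\tau) = \langle \phi(\tau), \lambda \rangle$, with $\phi(\tau)$ the sum of one-hot state-action indicators. The featurization, the step sizes, the value of $\beta$, and the synthetic stand-in for the offline preference data are illustrative assumptions, not choices prescribed by the paper; the reference expectation is approximated by sampled reference trajectories as described above, and $\mu_{\mathrm{ref}} = \mu_1$ is taken as recommended in Section 4, with the single-sample likelihood gradient absorbing the factor $N$ into the choice of $\beta$.

```python
import numpy as np

# Minimal sketch of the Lagrangian-relaxed max-min objective above, for a
# tabular MDP with known transitions, a per-step softmax policy pi_theta, and
# a linear trajectory reward r_lambda(tau) = <phi(tau), lambda>.

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # known transitions P*_h(s'|s,a)
p0 = np.ones(S) / S                              # known initial distribution

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phi(traj):
    """Trajectory feature: sum of one-hot state-action indicators."""
    f = np.zeros(S * A)
    for s, a in traj:
        f[s * A + a] += 1.0
    return f

def rollout(theta):
    """Sample a trajectory from pi_theta in the known model; also return
    the score function sum_h grad_theta log pi_theta(a_h | s_h)."""
    traj, score = [], np.zeros_like(theta)
    s = rng.choice(S, p=p0)
    for h in range(H):
        probs = np.exp(theta[h, s]); probs /= probs.sum()
        a = rng.choice(A, p=probs)
        score[h, s] += np.eye(A)[a] - probs      # grad of log softmax
        traj.append((s, a))
        s = rng.choice(S, p=P[h, s, a])
    return traj, score

# Synthetic stand-ins for the offline preference data and reference trajectories.
lam_star = rng.normal(size=S * A)
pref_data, ref_feats = [], []
for _ in range(200):
    t0, _ = rollout(rng.normal(size=(H, S, A)))  # behavior trajectory (stand-in)
    t1, _ = rollout(rng.normal(size=(H, S, A)))
    o = rng.binomial(1, sigmoid(phi(t1) @ lam_star - phi(t0) @ lam_star))
    pref_data.append((phi(t0), phi(t1), o))
    ref_feats.append(phi(t1))                    # mu_ref = mu_1 as recommended

theta, lam = np.zeros((H, S, A)), np.zeros(S * A)
beta, lr_theta, lr_lam = 1.0, 0.05, 0.05
for _ in range(3000):
    # Ascent step on theta: REINFORCE estimate of grad_theta J(pi_theta; r_lambda, P*).
    traj, score = rollout(theta)
    theta += lr_theta * (phi(traj) @ lam) * score

    # Descent step on lambda: stochastic gradient of
    # J(pi; r, P*) - E_ref[r] - beta * log-likelihood term.
    f0, f1, o = pref_data[rng.integers(len(pref_data))]
    grad_ll = (o - sigmoid((f1 - f0) @ lam)) * (f1 - f0)   # grad of log P_lambda(o | pair)
    grad_lam = phi(traj) - ref_feats[rng.integers(len(ref_feats))] - beta * grad_ll
    lam -= lr_lam * grad_lam
```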
From Azuma-Hoeffding inequality, this will only incur an error scaling with 1 N0 . Therefore as long as we choose N0 = e O(1/ϵ2), this error can be neglected while line 4 becomes more computationally efficient. C PROOF OF THEOREM 1 The proof of Theorem 1 consists of two steps, deriving the guarantee of MLE and analyzing the performance of pessimistic offline RL. Step 1: MLE guarantee. We first need to show that the confidence set R(D) contains the true reward r with high probability. This can be proved via the following lemma which characterizes the guarantee of MLE: Lemma 1 (Performance of MLE). Fix any δ (0, 1]. Then with probability at least 1 δ/2 we have that for all reward function r Gr, n=1 log Pr(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) c MLE log(NGr(1/N)/δ), where c MLE > 0 is a universal constant. We defer the proof to Appendix C.1. Denote the event in Lemma 1 by E1, then we know P(E1) 1 δ/2. Under the event E1, we have n=1 log Pr (on|τ n,0, τ n,1) n=1 log Pbr(on|τ n,0, τ n,1) c MLE log(NGr(1/N)/δ), which implies that r R(D) since we know r Gr from Assumption 2. Nevertheless, the confidence set R(D) is constructed via loglikelihood and we indeed prefer a bound on the total variation (TV) distance between Pr and Pr where r R(D) to facilitate our subsequent analysis. We can obtain such a bound as shown in the following lemma from the literature (Liu et al. (2022)[Proposition 14],Zhan et al. (2022b)[Lemma 9]): Lemma 2. With probability at least 1 δ/2, we have for all reward function r Gr that Eτ 0 µ0,τ 1 µ1 Pr( |τ 0, τ 1) Pr ( |τ 0, τ 1) 2 n=1 log Pr (on|τ n,0, τ n,1) Pr(on|τ n,0, τ n,1) + log(NGr(1/N)/δ) , where c TV > 0 is a universal constant. Denote the event in Lemma 2 by E2 and then we know P(E2) 1 δ/2. Then from Lemma 1 and Lemma 2 we know that under event E1 E2, we have for all r R(D): Eτ 0 µ0,τ 1 µ1 Pr( |τ 0, τ 1) Pr ( |τ 0, τ 1) 2 c log(NGr(1/N)/δ) where c > 0 is a universal constant. Then under Assumption 3, we can apply the mean value theorem between r (τ1) r (τ0) and r(τ1) r(τ0) to (4) and ensure for all r R(D) that Eτ 0 µ0,τ 1 µ1[|(r (τ1) r (τ0)) (r(τ1) r(τ0))|2] cκ2 log(NGr(1/N)/δ) where κ := 1 infx [ rmax,rmax] Φ (x) measures the non-linearity of the link function Φ. Published as a conference paper at ICLR 2024 Step 2: Pessimistic offline RL. Let rinf π denote argminr R(D) J(π; r, P ) Eτ µref[r(τ)]. Then we can bound the suboptimality of bπ as follows: J(πtar; r , P ) J(bπ; r , P ) = J(πtar; r , P ) Eτ µref[r (τ)] J(bπ; r , P ) Eτ µref[r (τ)] J(πtar; r , P ) Eτ µref[r (τ)] J(πtar; rinf πtar, P ) Eτ µref[rinf πtar(τ)] J(bπ; r , P ) Eτ µref[r (τ)] J(bπ; rinf bπ , P ) Eτ µref[rinf bπ (τ)] J(πtar; r , P ) Eτ µref[r (τ)] J(πtar; rinf πtar, P ) Eτ µref[rinf πtar(τ)] =Eτ 0 πtar,τ 1 µref[(r (τ 0) r (τ 1)) (rinf πtar(τ 0) rinf πtar(τ 1))] Cr(Gr, πtar, µref) q Eτ0 µ0,τ1 µ1[|r (τ 0) r (τ 1) rinf πtar(τ 0) + rinf πtar(τ 1)|2] c C2r(Gr, πtar, µref)κ2 log(NGr(1/N)/δ) where the second step is due to bπ = argmaxπ Πhis minr R(D) J(π; r, P ) Eτ µref[r(τ)], the third step is due to rinf bπ = argminr R(D) J(bπ; r, P ) Eτ µref[r(τ)], the fifth step comes from the definition of Cr(Gr, πtar, µref) (Definition 2) and the last step leverages (5). This concludes our proof. C.1 PROOF OF LEMMA 1 The proof largely follows Zhan et al. (2022b). Suppose F is a 1/N-bracket of Gr with |F| = NGr(1/N) and we denote the set of all right brackets in F by e F, i.e., e F := {f : f , such that [f , f] F}. 
Then fix any f e F, we have: n=1 log f(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) n=1 E exp log f(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) n=1 E f(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) o f(o|τ n,0, τ n,1) 1 + 1 where the first step is due to each sample in D is i.i.d., the third step uses Tower property and the fourth step is from the fact that F is a minimum 1/N-bracket. Then by Markov s inequality we have for any δ (0, 1], n=1 log f(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) n=1 log f(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) exp[ log(1/δ)] eδ. By union bound, we have for all f e F, n=1 log f(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) > c MLE log(NGr(1/N)/δ) δ/2, where c MLE > 0 is a universal constant. Therefore from the definition of 1/N-bracket net, we know for all r Gr, there exists f e F such that Pr( |τ 0, τ 1) f( |τ 0, τ 1) for any trajectories (τ 0, τ 1). This implies that for all r Gr, n=1 log Pr(on|τ n,0, τ n,1) Pr (on|τ n,0, τ n,1) > c MLE log(NGr(1/N)/δ) δ/2, This concludes our proof. Published as a conference paper at ICLR 2024 D COMPARISON WITH RELATED WORKS D.1 COMPARISON WITH ZHU ET AL. (2023) Zhu et al. (2023) considers the linear reward setting under BTL model and they can achieve the following sample complexity: N = O C2 lin exp(4BR)d log(1/δ) where R and B are the norm bounds on the feature vectors ϕ and parameter θ (defined in Proposition 1).The concentrability coefficient Clin is defined as Clin := Eτ 0 πtar,τ 1 µref[ϕ(τ 0) ϕ(τ 1)] Σ 1 D , and ΣD is the empirical covariance matrix of the dataset 1 N PN n=1(ϕ(τ n,0) ϕ(τ n,1))(ϕ(τ n,0) ϕ(τ n,1)) . Note that all the analysis and proofs in this paper still hold when we define the concentrability coefficient as C r(Gr, πtar, µref) := max 0, sup r Gr Eτ 0 πtar,τ 1 µref[r (τ 0) r (τ 1) r(τ 0) + r(τ 1)] q 1 N PN n=1 |r (τ n,0) r (τ n,1) r(τ n,0) + r(τ n,1)|2 Then when specializing the result in Theorem 1 to the linear reward setting under BTL model with this version of concentrability coefficient, the sample complexity is N = e O (C r(Gr, πtar, µref))2 exp(2rmax)d log(BR/δ) We know that BR rmax. In addition, note that in this case, we have Clin 0 and for any r Gr, Eτ 0 πtar,τ 1 µref[r (τ 0) r (τ 1) r(τ 0) + r(τ 1)] = Eτ 0 πtar,τ 1 µref[ϕ(τ 0) ϕ(τ 1)], θ θ Eτ 0 πtar,τ 1 µref[ϕ(τ 0) ϕ(τ 1)] Σ 1 D θ θ ΣD = Eτ 0 πtar,τ 1 µref[ϕ(τ 0) ϕ(τ 1)] Σ 1 D n=1 |r (τ n,0) r (τ n,1) r(τ n,0) + r(τ n,1)|2, where we suppose r (τ) = ϕ(τ), θ and r(τ) = ϕ(τ), θ . Therefore we have C r(Gr, πtar, µref) Clin. This implies that Theorem 1 can recover the sample complexity for linear reward setting under BTL model in Zhu et al. (2023) with only some additional log factors. D.2 COMPARISON AGAINST PRACTICAL WORKS Our algorithm is different from the practical algorithms presented in Wirth et al. (2017). Although they also use MLE to estimate the reward model, they typically run RL algorithms with respect to the learned MLE rewards without any pessimism. The lower bound in Zhu et al. (2023, Theorem 3.9) has shown that this kind of algorithm can fail in the worst cases. In contrast, we use pessimism to circumvent such failure. More specifically, we construct a confidence set of the reward model and then evaluate each policy with the most pessimistic reward model inside the confidence set. This is the key difference between our algorithm and existing commonly used algorithms in Wirth et al. (2017), and this modification is crucial to overcome their limit. 
D.2 COMPARISON AGAINST PRACTICAL WORKS

Our algorithm differs from the practical algorithms presented in Wirth et al. (2017). Although they also use MLE to estimate the reward model, they typically run RL algorithms with respect to the learned MLE reward without any pessimism. The lower bound in Zhu et al. (2023, Theorem 3.9) shows that this kind of algorithm can fail in the worst case. In contrast, we use pessimism to circumvent such failures. More specifically, we construct a confidence set of reward models and then evaluate each policy with the most pessimistic reward model inside the confidence set. This is the key difference between our algorithm and the commonly used algorithms in Wirth et al. (2017), and this modification is crucial to overcome their limitation.

D.3 COMPARISON WITH UEHARA AND SUN (2021)

Like us, Uehara and Sun (2021) propose a model-based RL algorithm that works under partial coverage, but in the standard RL setting and with a known state-action-wise reward function. Beyond this difference in settings, which is the primary one, our approach also differs in how the confidence set is constructed: while they define a confidence ball around the MLE solution based on the total variation distance, we use the Kullback-Leibler (KL) distance. This may be preferable because the KL distance arises directly from the MLE objective and is generally easier to compute than the total variation distance, as is done in practice in Rigter et al. (2022).

E OMITTED DETAILS

In this section we supplement the definitions of the bracket number for the transition class and the advantage function class.

Definition 5 ($\epsilon$-bracket number of transition probability classes). Suppose $f^1, f^2$ are functions with $f^1(\cdot\mid s,a), f^2(\cdot\mid s,a) \in \mathbb{R}^{|\mathcal{S}|}$ for all $(s,a) \in \mathcal{S}\times\mathcal{A}$. Then we say $(f^1, f^2)$ is an $\epsilon$-bracket if $f^1(\cdot\mid s,a) \le f^2(\cdot\mid s,a)$ and $\|f^1(\cdot\mid s,a) - f^2(\cdot\mid s,a)\|_1 \le \epsilon$ for all $(s,a)$. The $\epsilon$-bracket number of a transition probability class $\mathcal{G}_{P_h}$, where $h \in [H-1]$, is the minimum integer $N$ such that there exist $N$ $\epsilon$-brackets $\{(f^{n,1}, f^{n,2})\}_{n=1}^{N}$ with the property that for any function $P_h \in \mathcal{G}_{P_h}$ there is a bracket $(f^{i,1}, f^{i,2})$, $i\in[N]$, containing it, i.e., $f^{i,1}(\cdot\mid s,a) \le P_h(\cdot\mid s,a) \le f^{i,2}(\cdot\mid s,a)$ for all $(s,a)$.

Definition 6 ($\epsilon$-bracket number of initial state distribution classes). Suppose $f^1, f^2 \in \mathbb{R}^{|\mathcal{S}|}$. Then we say $(f^1, f^2)$ is an $\epsilon$-bracket if $f^1 \le f^2$ and $\|f^1 - f^2\|_1 \le \epsilon$. The $\epsilon$-bracket number of an initial state distribution class $\mathcal{G}_{P_0}$ is the minimum integer $N$ such that there exist $N$ $\epsilon$-brackets $\{(f^{n,1}, f^{n,2})\}_{n=1}^{N}$ with the property that for any $P_0 \in \mathcal{G}_{P_0}$ there is a bracket $(f^{i,1}, f^{i,2})$, $i\in[N]$, containing it, i.e., $f^{i,1} \le P_0 \le f^{i,2}$.

Definition 7 ($\epsilon$-bracket number of advantage function classes). Suppose $g^1, g^2$ are functions with $g^1(\cdot\mid s,a^0,a^1), g^2(\cdot\mid s,a^0,a^1) \in \mathbb{R}^2$ for all $(s,a^0,a^1) \in \mathcal{S}\times\mathcal{A}\times\mathcal{A}$. Then we say $(g^1, g^2)$ is an $\epsilon$-bracket if $g^1(\cdot\mid s,a^0,a^1) \le g^2(\cdot\mid s,a^0,a^1)$ and $\|g^1(\cdot\mid s,a^0,a^1) - g^2(\cdot\mid s,a^0,a^1)\|_1 \le \epsilon$ for all $(s,a^0,a^1) \in \mathcal{S}\times\mathcal{A}\times\mathcal{A}$. The $\epsilon$-bracket number of an advantage function class $\mathcal{G}_{A_h}$, where $h \in [H]$, is the minimum integer $N$ such that there exist $N$ $\epsilon$-brackets $\{(g^{n,1}, g^{n,2})\}_{n=1}^{N}$ with the property that for any function $A_h \in \mathcal{G}_{A_h}$ there is a bracket $(g^{i,1}, g^{i,2})$, $i\in[N]$, containing it, i.e., $g^{i,1}(\cdot\mid s,a^0,a^1) \le \mathbb{P}_{A_h}(\cdot\mid s,a^0,a^1) \le g^{i,2}(\cdot\mid s,a^0,a^1)$ for all $(s,a^0,a^1) \in \mathcal{S}\times\mathcal{A}\times\mathcal{A}$.

We use $\mathcal{N}_{\mathcal{G}_{P_h}}(\epsilon)$ and $\mathcal{N}_{\mathcal{G}_{A_h}}(\epsilon)$ to denote the $\epsilon$-bracket numbers of $\mathcal{G}_{P_h}$ and $\mathcal{G}_{A_h}$, respectively. Similarly, when the transition probability or the advantage function possesses a low-dimensional embedding, we can also bound the $\epsilon$-bracket number efficiently.
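As a concrete illustration of Definition 6 (this worked example is ours, not from the paper), consider the class of initial state distributions over two states, $\mathcal{G}_{P_0} = \{(1-p,\,p) : p \in [0,1]\}$. Take the grid $p_i = i\epsilon/2$ for $0 \le i \le \lceil 2/\epsilon\rceil$ and the brackets
$$f^{i,1} = (1-p_{i+1},\ p_i), \qquad f^{i,2} = (1-p_i,\ p_{i+1}).$$
For every $p \in [p_i, p_{i+1}]$ we have $f^{i,1} \le (1-p,\,p) \le f^{i,2}$ componentwise and $\|f^{i,2} - f^{i,1}\|_1 = 2(p_{i+1}-p_i) \le \epsilon$, so $\mathcal{N}_{\mathcal{G}_{P_0}}(\epsilon) \le \lceil 2/\epsilon\rceil$, i.e., the bracket number grows only polynomially in $1/\epsilon$ for this one-parameter class.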
F PROOFS OF LOWER BOUNDS

F.1 PROOF OF PROPOSITION 2

Given any $S, A, H$, consider an MDP with horizon $H$, state space $\mathcal{S} = \{s_1, s_2, \dots, s_S\}$ and action space $\mathcal{A} = \{a_1, a_2, \dots, a_A\}$. In the following discussion we consider the cases $C \ge 2$ and $1 < C < 2$ separately.

Case 1: $C \ge 2$. Consider the case where the state is fixed throughout an episode. We suppose the initial state distribution $P_0^\star$ is $P_0^\star(s_1) = \frac{1}{2}$ and $P_0^\star(s_i) = \frac{1}{2(S-1)}$ for all $2 \le i \le S$. Let $\pi_{\mathrm{tar},h}(a_1\mid s) = 1$ for all $h\in[H]$ and $s\in\mathcal{S}$. Then we can set the dataset distribution $\mu^0$ as
$$\mu^0(\tau) = \begin{cases}
\frac{1}{2C}, & \text{if the state is } s_1 \text{ and all actions in } \tau \text{ are } a_1 \text{ except } a_{H-1} = a_2,\\
\frac{1}{2} - \frac{1}{2C}, & \text{if the state is } s_1 \text{ and all actions in } \tau \text{ are } a_1 \text{ except } a_H = a_2,\\
\frac{1}{2(S-1)}, & \text{if the state is not } s_1 \text{ and all actions in } \tau \text{ are } a_1,\\
0, & \text{otherwise},
\end{cases}$$
where $a_h$ is the action at step $h$ in $\tau$. Then we know
$$\mu^0_h(s,a_1) = \frac{1}{2(S-1)},\ \ h\in[H],\ s\in\mathcal{S}\setminus\{s_1\},\qquad \mu^0_h(s_1,a_1) = \frac{1}{2},\ \ h\in[H-2],$$
$$\mu^0_{H-1}(s_1,a_1) = \frac{1}{2} - \frac{1}{2C},\qquad \mu^0_H(s_1,a_1) = \frac{1}{2C}.$$
It is obvious that we have $C_{\mathrm{st}} \le C$ in this setting. On the other hand, since the trajectory whose state is $s_1$ and whose actions are all $a_1$ is covered by $\pi_{\mathrm{tar}}$ but not by $\mu^0$, we have $C_{\mathrm{tr}} = \infty$.

Case 2: $1 < C < 2$. Consider the case where the state is fixed throughout an episode. We suppose the initial state distribution $P_0^\star$ is $P_0^\star(s_1) = \frac{C-1}{2}$, $P_0^\star(s_2) = \frac{2-C}{2}$ and $P_0^\star(s_i) = \frac{1}{2(S-2)}$ for all $3 \le i \le S$. Note that here we require $S \ge 3$. When $S = 2$, we can let $P_0^\star(s_1) = C-1$ and $P_0^\star(s_2) = 2-C$ and the following analysis still holds, so we assume $S \ge 3$ without loss of generality. Let $\pi_{\mathrm{tar},h}(a_1\mid s) = 1$ for all $h\in[H]$ and $s\in\mathcal{S}$. Then we can set the dataset distribution $\mu^0$ as
$$\mu^0(\tau) = \begin{cases}
\frac{C-1}{2C}, & \text{if the state of } \tau \text{ is } s_1 \text{ and all actions in } \tau \text{ are } a_1 \text{ except } a_{H-1} = a_2,\\
\frac{C-1}{2C}, & \text{if the state of } \tau \text{ is } s_1 \text{ and all actions in } \tau \text{ are } a_1 \text{ except } a_H = a_2,\\
\frac{2-C}{2C}, & \text{if the state of } \tau \text{ is } s_2 \text{ and the actions are all } a_1,\\
\frac{1}{2(S-2)}, & \text{if the state of } \tau \text{ is not } s_1 \text{ or } s_2 \text{ and the actions are all } a_1,\\
0, & \text{otherwise}.
\end{cases}$$
Then we know
$$\mu^0_h(s,a_1) = \frac{1}{2(S-2)},\ \ h\in[H],\ s\in\mathcal{S}\setminus\{s_1,s_2\},\qquad \mu^0_h(s_2,a_1) = \frac{2-C}{2C},\ \ h\in[H],$$
$$\mu^0_h(s_1,a_1) = \frac{2C-2}{2C},\ \ h\in[H-2],\qquad \mu^0_{H-1}(s_1,a_1) = \mu^0_H(s_1,a_1) = \frac{C-1}{2C}.$$
It is obvious that we have $C_{\mathrm{st}} \le C$ in this setting. On the other hand, since the trajectory whose state is $s_1$ and whose actions are all $a_1$ is covered by $\pi_{\mathrm{tar}}$ but not by $\mu^0$, we have $C_{\mathrm{tr}} = \infty$. This concludes our proof.

F.2 PROOF OF THEOREM 2

We consider the cases $C \ge 2$ and $1 < C < 2$ separately.

Case 1: $C \ge 2$. Consider the case where there is only one state $s$ and two actions $a_1, a_2$. Set the dataset distribution $\mu^0 = \mu^1$ where
$$\mu^0(\tau) = \begin{cases}
\frac{1}{C}, & \text{if all actions in } \tau \text{ are } a_1 \text{ except } a_{H-1} = a_2,\\
1 - \frac{1}{C}, & \text{if all actions in } \tau \text{ are } a_1 \text{ except } a_H = a_2,\\
0, & \text{otherwise},
\end{cases}$$
where $a_h$ is the action at step $h$ in $\tau$. In the following discussion we use $\tau^1$ to denote the trajectory where all actions are $a_1$ except $a_{H-1} = a_2$, and $\tau^2$ to denote the trajectory where all actions are $a_1$ except $a_H = a_2$. Then we know
$$\mu^0_h(s,a_1) = 1,\ \ h\in[H-2],\qquad \mu^0_{H-1}(s,a_1) = 1-\frac{1}{C},\quad \mu^0_{H-1}(s,a_2) = \frac{1}{C},\qquad \mu^0_H(s,a_1) = \frac{1}{C},\quad \mu^0_H(s,a_2) = 1-\frac{1}{C}.$$
We consider two different reward functions $r^1$ and $r^2$:
$$r^1_h(s,a_1) = r^1_h(s,a_2) = r^2_h(s,a_1) = r^2_h(s,a_2) = 0,\ \ h\in[H-2],$$
$$r^1_{H-1}(s,a_1) = r^2_{H-1}(s,a_2) = 1,\quad r^1_{H-1}(s,a_2) = r^2_{H-1}(s,a_1) = 0,$$
$$r^1_H(s,a_1) = r^2_H(s,a_2) = 1,\quad r^1_H(s,a_2) = r^2_H(s,a_1) = 0.$$
Then we have two MDPs, $\mathcal{M}_1$ and $\mathcal{M}_2$, whose reward functions are $r^1$ and $r^2$ respectively. It can be easily verified that $(\mathcal{M}_1,\mu^0)\in\Theta_{\mathrm{st}}(C)$ and $(\mathcal{M}_2,\mu^0)\in\Theta_{\mathrm{st}}(C)$. Further, let $L(\pi;\mathcal{M})$ denote the suboptimality of policy $\pi$ in $\mathcal{M}$; then we have for all policies $\pi$,
$$L(\pi;\mathcal{M}_1) + L(\pi;\mathcal{M}_2) \ge 2.$$
Now we can apply Le Cam's method, which leads to the following inequality:
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{1}{2}\exp\Big(-N\,\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big)\Big).$$
It can be observed that $\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big) = 0$ since $r^1(\tau^1) = r^1(\tau^2) = r^2(\tau^1) = r^2(\tau^2) = 1$. Therefore we have
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{1}{2}.$$

Case 2: $1 < C < 2$. Consider the case where there are two states $s_1, s_2$ and two actions $a_1, a_2$. We suppose the initial state distribution $P_0^\star$ is fixed as $P_0^\star(s_1) = C-1$ and $P_0^\star(s_2) = 2-C$. In addition, the state stays the same throughout the whole episode.
Then we can set the dataset distribution $\mu^0 = \mu^1$ where
$$\mu^0(\tau) = \begin{cases}
\frac{C-1}{C}, & \text{if the state of } \tau \text{ is } s_1 \text{ and all actions in } \tau \text{ are } a_1 \text{ except } a_{H-1} = a_2,\\
\frac{C-1}{C}, & \text{if the state of } \tau \text{ is } s_1 \text{ and all actions in } \tau \text{ are } a_1 \text{ except } a_H = a_2,\\
\frac{2-C}{C}, & \text{if the state of } \tau \text{ is } s_2 \text{ and the actions are all } a_1,\\
0, & \text{otherwise}.
\end{cases}$$
In the following discussion we use $\tau^3$ to denote the trajectory where the state is $s_1$ and all actions are $a_1$ except $a_{H-1} = a_2$; $\tau^4$ to denote the trajectory where the state is $s_1$ and all actions are $a_1$ except $a_H = a_2$; and $\tau^5$ to denote the trajectory where the state is $s_2$ and all actions are $a_1$. Then we know
$$\mu^0_h(s_2,a_1) = \frac{2-C}{C},\ \ h\in[H],\qquad \mu^0_h(s_1,a_1) = \frac{2C-2}{C},\ \ h\in[H-2],$$
$$\mu^0_{H-1}(s_1,a_1) = \mu^0_{H-1}(s_1,a_2) = \frac{C-1}{C},\qquad \mu^0_H(s_1,a_1) = \mu^0_H(s_1,a_2) = \frac{C-1}{C}.$$
We consider two different reward functions $r^1$ and $r^2$:
$$r^1_h(s_1,a_1) = r^1_h(s_1,a_2) = r^2_h(s_1,a_1) = r^2_h(s_1,a_2) = 0,\ \ h\in[H-2],$$
$$r^1_{H-1}(s_1,a_1) = r^2_{H-1}(s_1,a_2) = 1,\quad r^1_{H-1}(s_1,a_2) = r^2_{H-1}(s_1,a_1) = 0,$$
$$r^1_H(s_1,a_1) = r^2_H(s_1,a_2) = 1,\quad r^1_H(s_1,a_2) = r^2_H(s_1,a_1) = 0,$$
$$r^1_h(s_2,a_1) = r^1_h(s_2,a_2) = r^2_h(s_2,a_1) = r^2_h(s_2,a_2) = 0,\ \ h\in[H-1],$$
$$r^1_H(s_2,a_1) = r^2_H(s_2,a_1) = 1,\quad r^1_H(s_2,a_2) = r^2_H(s_2,a_2) = 0.$$
Then we have two MDPs, $\mathcal{M}_1$ and $\mathcal{M}_2$, whose reward functions are $r^1$ and $r^2$ respectively. It can be easily verified that $(\mathcal{M}_1,\mu^0)\in\Theta_{\mathrm{st}}(C)$ and $(\mathcal{M}_2,\mu^0)\in\Theta_{\mathrm{st}}(C)$. In addition, we have for all policies $\pi$,
$$L(\pi;\mathcal{M}_1) + L(\pi;\mathcal{M}_2) \ge 2(C-1).$$
Therefore by Le Cam's method, we have
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{C-1}{2}\exp\Big(-N\,\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big)\Big),$$
where the KL divergence is $0$ since $r(\tau) = 1$ for all $r\in\{r^1,r^2\}$ and $\tau\in\{\tau^3,\tau^4,\tau^5\}$. Therefore we have
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{C-1}{2}.$$
In conclusion, we have for any $C > 1$ and $H \ge 2$,
$$\inf_{\widehat\pi}\sup_{(\mathcal{M},\mu^0)\in\Theta_{\mathrm{st}}(C)} \mathbb{E}_{\mathcal{D}}\big[J(\pi^\star; r^\star, P^\star) - J(\widehat\pi; r^\star, P^\star)\big] \ge \min\Big\{\frac{C-1}{2},\ \frac{1}{2}\Big\}.$$

F.3 PROOF OF THEOREM 3

The proof is inspired by the hard instances in Rashidinejad et al. (2021b). We consider the cases $C \ge 2$ and $1 < C < 2$ separately.

Case 1: $C \ge 2$. Consider the case where there is only one state $s$ and two actions $a_1, a_2$. Set the dataset distribution $\mu^0 = \mu^1$ where
$$\mu^0(\tau') = \frac{1}{C},\qquad \mu^0(\tau'') = 1 - \frac{1}{C},$$
where $\tau'$ is the trajectory whose actions are all $a_1$ and $\tau''$ is the trajectory whose actions are all $a_2$. We consider two different reward functions $r^1$ and $r^2$:
$$r^1(\tau) = \begin{cases}\frac{1}{2}+x, & \text{if all the actions in } \tau \text{ are } a_1,\\ \frac{1}{2}, & \text{otherwise},\end{cases}\qquad
r^2(\tau) = \begin{cases}\frac{1}{2}-x, & \text{if all the actions in } \tau \text{ are } a_1,\\ \frac{1}{2}, & \text{otherwise}.\end{cases}$$
Here $0 < x < \frac{1}{2}$ is a quantity we will specify later. Then we have two MDPs, $\mathcal{M}_1$ and $\mathcal{M}_2$, whose reward functions are $r^1$ and $r^2$ respectively. It can be easily verified that $(\mathcal{M}_1,\mu^0)\in\Theta_{\mathrm{tr}}(C)$ and $(\mathcal{M}_2,\mu^0)\in\Theta_{\mathrm{tr}}(C)$. Further, let $L(\pi;\mathcal{M})$ denote the suboptimality of policy $\pi$ in $\mathcal{M}$; then we have for all policies $\pi$,
$$L(\pi;\mathcal{M}_1) + L(\pi;\mathcal{M}_2) \ge x.$$
Now we can apply Le Cam's method, which leads to the following inequality:
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{x}{4}\exp\Big(-N\,\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big)\Big).$$
Now we only need to bound $\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big)$, which can be computed as follows:
$$\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big) = \sum_{\tau^0\ne\tau^1}\mu^0(\tau^0)\mu^1(\tau^1)\,\mathrm{KL}\big(\mathrm{Bern}(\sigma(x))\,\big\|\,\mathrm{Bern}(\sigma(-x))\big) \le \frac{2\exp(1/2)\,x^2}{C}.$$
Then by letting $x = \min\Big\{\frac{1}{2},\ \sqrt{\frac{C}{2\exp(1/2)N}}\Big\}$, we obtain
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{\exp(-1)}{4}\,x = \frac{\exp(-1)}{4}\min\Big\{\frac{1}{2},\ \sqrt{\frac{C}{2\exp(1/2)N}}\Big\}.$$
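The lower-bound calculations above rely on the fact that $\mathrm{KL}\big(\mathrm{Bern}(\sigma(x))\,\|\,\mathrm{Bern}(\sigma(-x))\big)$ scales as $O(x^2)$ for small $x$, which is what makes the choice $x \asymp \sqrt{C/N}$ work. The following quick numerical check (ours, not part of the proof) confirms the quadratic scaling.

```python
import numpy as np

# KL divergence between two Bernoulli distributions.
def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for x in [0.5, 0.1, 0.01]:
    kl = kl_bern(sigmoid(x), sigmoid(-x))
    # The ratio KL / x^2 stays bounded (around 1/2), consistent with the O(x^2) bound used above.
    print(f"x={x:5.2f}  KL={kl:.6f}  KL/x^2={kl / x**2:.3f}")
```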
Case 2: $1 < C < 2$. Consider the case where there are two states $s_1, s_2$ and two actions $a_1, a_2$. We suppose the initial state distribution $P_0^\star$ is fixed as $P_0^\star(s_1) = C-1$ and $P_0^\star(s_2) = 2-C$. In addition, the state stays the same throughout the whole episode. Then we can set the dataset distribution $\mu^0 = \mu^1$ where
$$\mu^0(\tau) = \begin{cases}
\frac{C-1}{C}, & \text{if the state of } \tau \text{ is } s_1 \text{ and the actions are all } a_1 \text{ or all } a_2,\\
\frac{2-C}{C}, & \text{if the state of } \tau \text{ is } s_2 \text{ and the actions are all } a_1,\\
0, & \text{if the state of } \tau \text{ is } s_2 \text{ and the actions contain } a_2.
\end{cases}$$
Let $\bar\tau$ be the trajectory where the state is $s_1$ and the actions are all $a_1$. We further consider two different reward functions $r^1$ and $r^2$:
$$r^1(\tau) = \begin{cases}\frac{1}{2}+x, & \text{if the state is } s_1 \text{ and all the actions in } \tau \text{ are } a_1,\\ \frac{1}{2}, & \text{otherwise},\end{cases}\qquad
r^2(\tau) = \begin{cases}\frac{1}{2}-x, & \text{if the state is } s_1 \text{ and all the actions in } \tau \text{ are } a_1,\\ \frac{1}{2}, & \text{otherwise}.\end{cases}$$
Here $0 < x < \frac{1}{2}$ is a quantity we will specify later. Then we have two MDPs, $\mathcal{M}_1$ and $\mathcal{M}_2$, whose reward functions are $r^1$ and $r^2$ respectively. It can be easily verified that $(\mathcal{M}_1,\mu^0)\in\Theta_{\mathrm{tr}}(C)$ and $(\mathcal{M}_2,\mu^0)\in\Theta_{\mathrm{tr}}(C)$. In addition, we have for all policies $\pi$,
$$L(\pi;\mathcal{M}_1) + L(\pi;\mathcal{M}_2) \ge (C-1)x.$$
Therefore by Le Cam's method, we have
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{(C-1)x}{4}\exp\Big(-N\,\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big)\Big),$$
where the KL divergence can be bounded as follows:
$$\mathrm{KL}\big(\mu^0\times\mu^1\times\mathbb{P}_{r^1}\,\big\|\,\mu^0\times\mu^1\times\mathbb{P}_{r^2}\big) = \sum_{\substack{\tau^0,\tau^1:\ \mathbb{1}[\tau^0=\bar\tau]\,\ne\,\mathbb{1}[\tau^1=\bar\tau]}}\mu^0(\tau^0)\mu^1(\tau^1)\,\mathrm{KL}\big(\mathrm{Bern}(\sigma(x))\,\big\|\,\mathrm{Bern}(\sigma(-x))\big) \le \frac{2(C-1)\exp(1/2)\,x^2}{C}.$$
Then by letting $x = \min\Big\{\frac{1}{2},\ \sqrt{\frac{C}{2\exp(1/2)(C-1)N}}\Big\}$, we obtain
$$\inf_{\widehat\pi}\sup_{\mathcal{M}\in\{\mathcal{M}_1,\mathcal{M}_2\}} \mathbb{E}_{\mathcal{D}}[L(\widehat\pi,\mathcal{M})] \ge \frac{(C-1)\exp(-1)}{4}\,x \ge \frac{\exp(-1)}{4}\min\Big\{\frac{C-1}{2},\ \sqrt{\frac{C-1}{2\exp(1/2)N}}\Big\}.$$
In conclusion, combining the two cases, we have for any $C > 1$ and $H \ge 1$,
$$\inf_{\widehat\pi}\sup_{(\mathcal{M},\mu^0)\in\Theta_{\mathrm{tr}}(C)} \mathbb{E}_{\mathcal{D}}\big[J(\pi^\star; r^\star, P^\star) - J(\widehat\pi; r^\star, P^\star)\big] \ge c\,\min\Big\{C-1,\ 1,\ \sqrt{\frac{C-1}{N}}\Big\}$$
for a universal constant $c > 0$.

G PROOF OF THEOREM 4

The proof again consists of two steps: deriving the guarantee of MLE and analyzing the performance of pessimistic offline RL.

Step 1: MLE guarantee. Note that Lemma 1 and Lemma 2 still apply here. Let $\mathcal{E}_1$ and $\mathcal{E}_2$ denote the events in Lemma 1 and Lemma 2, respectively. Following almost the same arguments, we have the following guarantee for the estimation of the system dynamics:

Lemma 3. Under Assumption 4, with probability at least $1-\delta/2$, the following events hold:
(1) $P^\star_h \in \mathcal{P}_h(\mathcal{D})$ for all $h\in[H-1]$, and $P_0^\star\in\mathcal{P}_{\mathrm{ini}}(\mathcal{D})$;
(2) for all $h\in[H-1]$ and all $P_h \in \mathcal{P}_h(\mathcal{D})$,
$$\mathbb{E}_{(s_h,a_h)\sim\mu^0_h}\Big[\big\|P_h(\cdot\mid s_h,a_h) - P^\star_h(\cdot\mid s_h,a_h)\big\|_1^2\Big] + \mathbb{E}_{(s_h,a_h)\sim\mu^1_h}\Big[\big\|P_h(\cdot\mid s_h,a_h) - P^\star_h(\cdot\mid s_h,a_h)\big\|_1^2\Big] \le \frac{c\log\big(H\mathcal{N}_{\mathcal{G}_{P_h}}(1/N)/\delta\big)}{N};$$
(3) for all $P_0 \in \mathcal{P}_{\mathrm{ini}}(\mathcal{D})$,
$$\big\|P_0 - P_0^\star\big\|_1^2 \le \frac{c\log\big(H\mathcal{N}_{\mathcal{G}_{P_0}}(1/N)/\delta\big)}{N}.$$
The proof is omitted here. Let $\mathcal{E}_3$ denote the event in Lemma 3.

Step 2: Pessimistic offline RL. We first introduce the following lemma, which states that under the event $\mathcal{E}_3$, we can accurately evaluate the expected cumulative reward of $\pi_{\mathrm{tar}}$ with respect to any reward function $r\in\mathcal{G}_r$ via any system dynamics from the confidence sets:

Lemma 4. Suppose Assumption 3 holds. Then under $\mathcal{E}_3$, we have for all reward functions $r\in\mathcal{G}_r$ and all $P = (P_0,\{P_h\}_{h=1}^{H-1})$ with $P_0\in\mathcal{P}_{\mathrm{ini}}(\mathcal{D})$ and $P_h\in\mathcal{P}_h(\mathcal{D})$ that
$$\big|J(\pi_{\mathrm{tar}}; r, P^\star) - J(\pi_{\mathrm{tar}}; r, P)\big| \le H r_{\max}\sqrt{\frac{c\,C_P^2(\{\mathcal{G}_{P_h}\},\pi_{\mathrm{tar}})\log\big(H\mathcal{N}_P(1/N)/\delta\big)}{N}},$$
where $\mathcal{N}_P := \max_{0\le h\le H-1}\mathcal{N}_{\mathcal{G}_{P_h}}(1/N)$.

The proof is deferred to Appendix G.1. Let $(r^{\inf}_\pi, P^{\inf}_\pi)$ denote $\arg\min_{r\in\mathcal{R}(\mathcal{D}),\,P\in\mathcal{P}_{\mathrm{ini}}(\mathcal{D})\times\prod_{h=1}^{H-1}\mathcal{P}_h(\mathcal{D})}\ J(\pi; r, P) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r(\tau)]$.
Then under the event $\mathcal{E}_3$, we can bound the suboptimality of $\widehat\pi$ as follows:
$$\begin{aligned}
&J(\pi_{\mathrm{tar}};r^\star,P^\star) - J(\widehat\pi;r^\star,P^\star)\\
&= \big(J(\pi_{\mathrm{tar}};r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big) - \big(J(\widehat\pi;r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big)\\
&= \big(J(\pi_{\mathrm{tar}};r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big) - \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big)\\
&\quad + \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big) - \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^{\inf}_{\pi_{\mathrm{tar}}}) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big)\\
&\quad + \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^{\inf}_{\pi_{\mathrm{tar}}}) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big) - \big(J(\widehat\pi;r^{\inf}_{\widehat\pi},P^{\inf}_{\widehat\pi}) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\widehat\pi}(\tau)]\big)\\
&\quad + \big(J(\widehat\pi;r^{\inf}_{\widehat\pi},P^{\inf}_{\widehat\pi}) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\widehat\pi}(\tau)]\big) - \big(J(\widehat\pi;r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big)\\
&\le \big(J(\pi_{\mathrm{tar}};r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big) - \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big)\\
&\quad + \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big) - \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^{\inf}_{\pi_{\mathrm{tar}}}) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big)\\
&\quad + \big(J(\widehat\pi;r^{\inf}_{\widehat\pi},P^{\inf}_{\widehat\pi}) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\widehat\pi}(\tau)]\big) - \big(J(\widehat\pi;r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big)\\
&\le \big(J(\pi_{\mathrm{tar}};r^\star,P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^\star(\tau)]\big) - \big(J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^\star) - \mathbb{E}_{\tau\sim\mu_{\mathrm{ref}}}[r^{\inf}_{\pi_{\mathrm{tar}}}(\tau)]\big)\\
&\quad + J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^\star) - J(\pi_{\mathrm{tar}};r^{\inf}_{\pi_{\mathrm{tar}}},P^{\inf}_{\pi_{\mathrm{tar}}})\\
&\le \sqrt{\frac{c\,C_r^2(\mathcal{G}_r,\pi_{\mathrm{tar}},\mu_{\mathrm{ref}})\,\kappa^2\log\big(\mathcal{N}_{\mathcal{G}_r}(1/N)/\delta\big)}{N}} + H r_{\max}\sqrt{\frac{c\,C_P^2(\{\mathcal{G}_{P_h}\},\pi_{\mathrm{tar}})\log\big(H\mathcal{N}_P(1/N)/\delta\big)}{N}},
\end{aligned}$$
where the third and fourth steps are due to the definition of $\widehat\pi$, the definition of $(r^{\inf}_{\widehat\pi},P^{\inf}_{\widehat\pi})$, and (1) in Lemma 3, and the last step comes from Lemma 4 and the proof of Theorem 1. This concludes our proof.

G.1 PROOF OF LEMMA 4

Let $P^{\star,h}$ denote the system dynamics $(P^\star_0,\{P^\star_t\}_{t=1}^{h},\{P_t\}_{t=h+1}^{H-1})$ for all $0\le h\le H-1$. Then we have
$$J(\pi_{\mathrm{tar}};r,P^\star) - J(\pi_{\mathrm{tar}};r,P) = \sum_{h=1}^{H-1}\big(J(\pi_{\mathrm{tar}};r,P^{\star,h}) - J(\pi_{\mathrm{tar}};r,P^{\star,h-1})\big) + \big(J(\pi_{\mathrm{tar}};r,P^{\star,0}) - J(\pi_{\mathrm{tar}};r,P)\big).$$
For any $h\in[H-1]$, we have
$$\begin{aligned}
&J(\pi_{\mathrm{tar}};r,P^{\star,h}) - J(\pi_{\mathrm{tar}};r,P^{\star,h-1})\\
&= \mathbb{E}_{(s_1,a_1,\dots,s_h,a_h)\sim(\pi_{\mathrm{tar}},P^\star)}\Big[\sum_{s_{h+1}}P^\star_h(s_{h+1}\mid s_h,a_h)\,\mathbb{E}_{(\pi_{\mathrm{tar}},P)}\big[r(\tau)\mid s_1,a_1,\dots,s_{h+1}\big] - \sum_{s_{h+1}}P_h(s_{h+1}\mid s_h,a_h)\,\mathbb{E}_{(\pi_{\mathrm{tar}},P)}\big[r(\tau)\mid s_1,a_1,\dots,s_{h+1}\big]\Big]\\
&= \mathbb{E}_{(s_1,a_1,\dots,s_h,a_h)\sim(\pi_{\mathrm{tar}},P^\star)}\Big[\sum_{s_{h+1}}\big(P^\star_h(s_{h+1}\mid s_h,a_h) - P_h(s_{h+1}\mid s_h,a_h)\big)\,\mathbb{E}_{(\pi_{\mathrm{tar}},P)}\big[r(\tau)\mid s_1,a_1,\dots,s_{h+1}\big]\Big]\\
&\le r_{\max}\,\mathbb{E}_{(s_h,a_h)\sim(\pi_{\mathrm{tar}},P^\star)}\Big[\big\|P^\star_h(\cdot\mid s_h,a_h) - P_h(\cdot\mid s_h,a_h)\big\|_1\Big]\\
&\le r_{\max}\sqrt{\frac{c\,C_P^2(\pi_{\mathrm{tar}})\log\big(H\mathcal{N}_{\mathcal{G}_{P_h}}(1/N)/\delta\big)}{N}},
\end{aligned}$$
where $\mathbb{E}_{(\pi_{\mathrm{tar}},P)}[\,\cdot\mid s_1,a_1,\dots,s_{h+1}]$ denotes the expectation over the trajectory $\tau$ generated by executing policy $\pi_{\mathrm{tar}}$ under the transition probabilities $\{P_t\}_{t=h+1}^{H-1}$ while fixing the history to be $s_1,a_1,\dots,s_{h+1}$. Here the first step utilizes the tower property, the third and fourth steps use the Cauchy-Schwarz inequality together with the definition of $C_P(\pi_{\mathrm{tar}})$, and the last step comes from Lemma 3. For $J(\pi_{\mathrm{tar}};r,P^{\star,0}) - J(\pi_{\mathrm{tar}};r,P)$, we similarly have
$$J(\pi_{\mathrm{tar}};r,P^{\star,0}) - J(\pi_{\mathrm{tar}};r,P) \le r_{\max}\sqrt{\frac{c\,C_P^2(\pi_{\mathrm{tar}})\log\big(H\mathcal{N}_{\mathcal{G}_{P_0}}(1/N)/\delta\big)}{N}}.$$
Therefore we conclude that
$$J(\pi_{\mathrm{tar}};r,P^\star) - J(\pi_{\mathrm{tar}};r,P) \le H r_{\max}\sqrt{\frac{c\,C_P^2(\pi_{\mathrm{tar}})\log\big(H\mathcal{N}_P(1/N)/\delta\big)}{N}}.$$
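The telescoping decomposition in the proof of Lemma 4 can be checked numerically on a toy tabular MDP. The sketch below is ours and uses simplifications not in the paper (state-action rewards, a fixed known initial distribution, and randomly generated dynamics in place of trajectory-wise rewards and an estimated initial distribution); it verifies that summing the one-step swaps $J(\pi; r, P^{\star,h}) - J(\pi; r, P^{\star,h-1})$ recovers $J(\pi; r, P^\star) - J(\pi; r, P)$ exactly.

```python
import numpy as np

S, A, H = 3, 2, 4
rng = np.random.default_rng(0)

def random_dynamics():
    P = rng.random((H, S, A, S))
    return P / P.sum(axis=-1, keepdims=True)   # normalize to valid transition kernels

P_star, P_hat = random_dynamics(), random_dynamics()
r = rng.random((H, S, A))                       # per-step state-action rewards (simplification)
pi = rng.integers(0, A, size=(H, S))            # a deterministic Markovian policy
rho = np.ones(S) / S                            # fixed initial state distribution (simplification)

def value(P):
    """Expected return of pi under dynamics P, by forward propagation of the state distribution."""
    d, J = rho.copy(), 0.0
    for h in range(H):
        J += d @ r[h][np.arange(S), pi[h]]
        d = d @ P[h][np.arange(S), pi[h]]       # (S, S) row-stochastic matrix under pi
    return J

def mixed(h):
    """Intermediate dynamics P^{*,h}: true transitions for the first h steps, estimated afterwards."""
    return np.concatenate([P_star[:h], P_hat[h:]], axis=0)

telescope = sum(value(mixed(h + 1)) - value(mixed(h)) for h in range(H))
assert np.isclose(telescope, value(P_star) - value(P_hat))
print("telescoping sum equals total gap:", telescope)
```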
H PROOF OF THEOREM 5

We first derive the guarantee of MLE for estimating $A^\star$. Similar to Lemma 1 and Lemma 2, we have the following lemma in the action-based comparison setting:

Lemma 5. Under Assumption 7, with probability at least $1-\delta$, the following holds for all $h\in[H]$:
$$\mathbb{E}_{s\sim\mu_h,\,a^0\sim\mu^0_h(\cdot\mid s),\,a^1\sim\mu^1_h(\cdot\mid s)}\Big[\big\|\mathbb{P}_{\widehat A_h}(\cdot\mid s,a^0,a^1) - \mathbb{P}_{A^\star_h}(\cdot\mid s,a^0,a^1)\big\|_1^2\Big] \le \frac{c\log\big(H\mathcal{N}_{\mathcal{G}_{A_h}}(1/N)/\delta\big)}{N}.$$
The proof is omitted here. Let $\mathcal{E}_4$ denote the event in Lemma 5. Then under Assumption 8, we can apply the mean value theorem and obtain that under $\mathcal{E}_4$, we have for all $h\in[H]$ that
$$\mathbb{E}_{s\sim\mu_h,\,a^0\sim\mu^0_h(\cdot\mid s),\,a^1\sim\mu^1_h(\cdot\mid s)}\Big[\big|A^\star_h(s,a^0) - A^\star_h(s,a^1) - \widehat A_h(s,a^0) + \widehat A_h(s,a^1)\big|^2\Big] \le \frac{c\kappa^2\log\big(H\mathcal{N}_{\mathcal{G}_{A_h}}(1/N)/\delta\big)}{N}. \qquad (6)$$
Recall that $\kappa = 1/\inf_{x\in[-r_{\max},r_{\max}]}\Phi'(x)$. On the other hand, we have the following performance difference lemma:

Lemma 6. For any deterministic Markovian policies $\pi$ and $\pi'$, we have
$$J(\pi; r^\star, P^\star) - J(\pi'; r^\star, P^\star) = \sum_{h=1}^{H}\mathbb{E}_{s\sim d^\pi_h}\Big[Q^{\pi'}_h(s,\pi(s)) - Q^{\pi'}_h(s,\pi'(s))\Big].$$
The proof is deferred to Appendix H.1. The rest of the proof largely follows Uehara et al. (2023).

Under the event $\mathcal{E}_4$, we can bound the suboptimality of $\widehat\pi$ as follows:
$$\begin{aligned}
J(\pi^\star; r^\star, P^\star) - J(\widehat\pi; r^\star, P^\star)
&\le r_{\max}\sum_{h=1}^{H}\mathbb{E}_{s\sim d^{\pi^\star}_h}\Big[\mathbb{1}\big(\pi^\star_h(s)\ne\widehat\pi_h(s)\big)\,\mathbb{1}\big(Q^{\star}_h(s,\widehat\pi_h(s)) < Q^{\star}_h(s,\pi^\star_h(s))\big)\Big]\\
&\le r_{\max}\sum_{h=1}^{H}\mathbb{E}_{s\sim d^{\pi^\star}_h}\Big[\sum_{a\in\mathcal{A}}\mathbb{1}\big(\widehat A_h(s,a)\ge\widehat A_h(s,\pi^\star_h(s))\big)\,\mathbb{1}\big(Q^{\star}_h(s,a) < Q^{\star}_h(s,\pi^\star_h(s))\big)\Big],
\end{aligned}$$
where the first step comes from Lemma 6 and the second step is due to the definition of $\widehat\pi$. Then for any $\alpha > 0$, we have
$$\begin{aligned}
&\sum_{a\in\mathcal{A}}\mathbb{1}\big(\widehat A_h(s,a)\ge\widehat A_h(s,\pi^\star_h(s))\big)\,\mathbb{1}\big(Q^{\star}_h(s,a) < Q^{\star}_h(s,\pi^\star_h(s))\big)\\
&\le \sum_{a\in\mathcal{A}}\mathbb{1}\big(Q^{\star}_h(s,\pi^\star_h(s)) > Q^{\star}_h(s,a) \ge Q^{\star}_h(s,\pi^\star_h(s)) - \alpha\big) + \sum_{a\in\mathcal{A}}\mathbb{1}\big(Q^{\star}_h(s,\pi^\star_h(s)) - Q^{\star}_h(s,a) \ge \widehat A_h(s,\pi^\star_h(s)) - \widehat A_h(s,a) + \alpha\big).
\end{aligned}$$
By Assumption 6, we have
$$\mathbb{E}_{s\sim d^{\pi^\star}_h}\Big[\sum_{a\in\mathcal{A}}\mathbb{1}\big(Q^{\star}_h(s,\pi^\star_h(s)) > Q^{\star}_h(s,a) \ge Q^{\star}_h(s,\pi^\star_h(s)) - \alpha\big)\Big] \le |\mathcal{A}|(\alpha/\alpha_0)^{\beta}.$$
For the second term, we have
$$\begin{aligned}
&\mathbb{E}_{s\sim d^{\pi^\star}_h}\Big[\sum_{a\in\mathcal{A}}\mathbb{1}\big(Q^{\star}_h(s,\pi^\star_h(s)) - Q^{\star}_h(s,a) \ge \widehat A_h(s,\pi^\star_h(s)) - \widehat A_h(s,a) + \alpha\big)\Big]\\
&\le \frac{1}{\alpha^2}\,\mathbb{E}_{s\sim d^{\pi^\star}_h}\Big[\sum_{a\in\mathcal{A}}\big|A^\star_h(s,\pi^\star_h(s)) - A^\star_h(s,a) - \widehat A_h(s,\pi^\star_h(s)) + \widehat A_h(s,a)\big|^2\Big]\\
&\le \frac{c\,|\mathcal{A}|\,C_{\mathrm{act}}\,\kappa^2\log\big(H\mathcal{N}_{\mathcal{G}_{A_h}}(1/N)/\delta\big)}{\alpha^2 N},
\end{aligned}$$
where the last step comes from the definition of $C_{\mathrm{act}}$ and (6). Therefore, by picking an appropriate $\alpha$ to balance the two terms, we have with probability at least $1-\delta$ that
$$J(\pi^\star; r^\star, P^\star) - J(\widehat\pi; r^\star, P^\star) \le c\,H|\mathcal{A}|\left(\frac{\kappa^2 C_{\mathrm{act}}\log\big(H\mathcal{N}_{\mathcal{G}_A}(1/N)/\delta\big)}{N}\right)^{\frac{\beta}{\beta+2}},$$
where $c > 0$ is a constant depending only on $r_{\max}$ and $\alpha_0$.

H.1 PROOF OF LEMMA 6

For any two policies $\pi$ and $\pi'$, we have
$$\begin{aligned}
J(\pi; r^\star, P^\star) - J(\pi'; r^\star, P^\star)
&= \mathbb{E}_{\pi}\big[r^\star_1(s_1,a_1) + V^{\pi}_2(s_2)\big] - \mathbb{E}\big[V^{\pi'}_1(s_1)\big]\\
&= \mathbb{E}_{\pi}\big[V^{\pi}_2(s_2) - V^{\pi'}_2(s_2)\big] + \mathbb{E}_{\pi}\big[r^\star_1(s_1,a_1) + V^{\pi'}_2(s_2)\big] - \mathbb{E}\big[V^{\pi'}_1(s_1)\big]\\
&= \mathbb{E}_{\pi}\big[V^{\pi}_2(s_2) - V^{\pi'}_2(s_2)\big] + \mathbb{E}_{\pi}\big[Q^{\pi'}_1(s_1,a_1) - V^{\pi'}_1(s_1)\big]\\
&= \mathbb{E}_{\pi}\big[V^{\pi}_2(s_2) - V^{\pi'}_2(s_2)\big] + \mathbb{E}_{\pi}\Big[\big\langle Q^{\pi'}_1(s_1,\cdot),\ \pi_1(\cdot\mid s_1) - \pi'_1(\cdot\mid s_1)\big\rangle\Big]\\
&= \cdots = \sum_{h=1}^{H}\mathbb{E}_{s_h\sim d^\pi_h}\Big[\big\langle Q^{\pi'}_h(s_h,\cdot),\ \pi_h(\cdot\mid s_h) - \pi'_h(\cdot\mid s_h)\big\rangle\Big],
\end{aligned}$$
where the last line follows by applying the same argument recursively to $\mathbb{E}_{\pi}[V^{\pi}_{h}(s_{h}) - V^{\pi'}_{h}(s_{h})]$ for $h = 2,\dots,H$. Specializing to deterministic policies yields the statement of Lemma 6. This concludes our proof.
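The identity in Lemma 6 can likewise be verified numerically. The toy check below is ours (tabular, per-step state-action rewards, deterministic policies, randomly generated dynamics); it confirms that $\sum_h \mathbb{E}_{s\sim d^\pi_h}[Q^{\pi'}_h(s,\pi(s)) - Q^{\pi'}_h(s,\pi'(s))]$ equals $J(\pi) - J(\pi')$ exactly.

```python
import numpy as np

S, A, H = 3, 2, 4
rng = np.random.default_rng(1)
P = rng.random((H, S, A, S))
P /= P.sum(axis=-1, keepdims=True)              # valid transition kernels
r = rng.random((H, S, A))                       # per-step state-action rewards
rho = np.ones(S) / S                            # initial state distribution
pi, pi_p = rng.integers(0, A, (H, S)), rng.integers(0, A, (H, S))   # two deterministic policies

def q_values(policy):
    """Backward recursion for Q^{policy}_h(s, a)."""
    Q = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        V_next = Q[h + 1][np.arange(S), policy[h + 1]] if h + 1 < H else np.zeros(S)
        Q[h] = r[h] + P[h] @ V_next
    return Q

def value(policy):
    d, J = rho.copy(), 0.0
    for h in range(H):
        J += d @ r[h][np.arange(S), policy[h]]
        d = d @ P[h][np.arange(S), policy[h]]
    return J

Q_pp = q_values(pi_p)                           # Q^{pi'}
d, gap = rho.copy(), 0.0
for h in range(H):
    # accumulate E_{s ~ d^pi_h}[ Q^{pi'}_h(s, pi(s)) - Q^{pi'}_h(s, pi'(s)) ]
    gap += d @ (Q_pp[h][np.arange(S), pi[h]] - Q_pp[h][np.arange(S), pi_p[h]])
    d = d @ P[h][np.arange(S), pi[h]]           # propagate the state distribution of pi
assert np.isclose(gap, value(pi) - value(pi_p))
print("performance difference lemma verified:", gap)
```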