PROVABLE REWARD-AGNOSTIC PREFERENCE-BASED REINFORCEMENT LEARNING

Wenhao Zhan, Princeton University, wenhao.zhan@princeton.edu
Masatoshi Uehara, Genentech, uehara.masatoshi@gene.com
Wen Sun, Cornell University, ws455@cornell.edu
Jason D. Lee, Princeton University, jasonlee@princeton.edu
(This work was done at Cornell University.)

ABSTRACT

Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In this study, we fill this gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework in which exploratory trajectories that enable accurate learning of hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. Specifically, our framework can incorporate linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate reward-agnostic RL with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario.

1 INTRODUCTION

Reinforcement learning algorithms train agents to optimize rewards of interest. However, setting an appropriate numerical reward can be challenging in practical applications (e.g., designing a reward function for a robot arm learning to play table tennis), and optimizing hand-crafted reward functions can lead to undesirable behavior when the reward function does not align with human intentions. To overcome this challenge, there has been a recent surge of interest in Preference-based Reinforcement Learning (PbRL) with human feedback. In PbRL, the agent does not receive a numerical reward signal, but rather receives feedback from a human expert in the form of preferences, indicating which state-action trajectory is preferred in a given pair of trajectories. PbRL has gained considerable attention in various domains, including NLP (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021; Nakano et al., 2021; Ouyang et al., 2022; Glaese et al., 2022; Ramamurthy et al., 2022; Liu et al., 2023), robot learning (Christiano et al., 2017; Brown et al., 2019; Shin et al., 2023), and recommender systems (Xue et al., 2022).

Despite the promising applications of PbRL in various areas, there are only a few provably efficient algorithms (also known as PAC RL) for this purpose (Pacchiano et al., 2021; Chen et al., 2022b). These algorithms focus on regret minimization and iterate through the following processes: collecting new trajectories from the environment, obtaining human feedback on the trajectories, and learning the hidden reward function as well as the dynamics model from the human feedback. However, this approach can be slow and expensive in practice, as it requires humans in every iteration of the dynamics-model selection process, which is not as easy as it may sound.
For example, interactive decision-making algorithms that put a human in the loop of the model learning process, such as DAgger (Ross et al., 2011), can become impractical when applied to some real-world robotics applications, as has been observed in prior works (Ross et al., 2013; Laskey et al., 2016). In contrast, in practical PbRL applications like InstructGPT (Ouyang et al., 2022) and PEBBLE (Lee et al., 2021), the majority of preference data is collected by crowdsourcing prompts from around the world and using supervised or heuristic policies, so most of the human labeling process does not depend on the subsequent training steps. Another line of work (Zhu et al., 2023) focuses on purely offline RL algorithms that learn a near-optimal policy from offline trajectories with good coverage (e.g., offline data that covers the traces of some high-quality policies). Nevertheless, it is unclear how to obtain such high-quality offline data a priori (Chen and Jiang, 2019).

To fill this gap between theoretical work and practical applications in PbRL, we propose a new theoretical method that lies between purely online and purely offline methods for PbRL, resembling the framework of InstructGPT and PEBBLE. Our algorithm first collects state-action trajectories from the environment without human feedback. In this step, we design a novel, fully reward-agnostic sampling procedure to acquire exploratory trajectories that facilitate the subsequent learning of reward functions. In the second step, we collect preference feedback on the collected trajectories from human experts. In the third step, we learn the underlying hidden reward function using the trajectories collected in the first step and the preference feedback from the second step. In the fourth step, we learn the optimal policy by solving the offline RL problem under the learned reward function.

Our approach can be understood as performing experimental design for PbRL, which allows us to separate the data-collection process from the process of querying human feedback, eliminating the need to constantly keep humans in the training loop. For instance, we only need human experts in step 2 above, while we can freely perform hyperparameter tuning and model selection for the remaining steps without requiring human experts to sit next to the computers. This process can significantly reduce the burden on human experts. Our contributions can be summarized as follows:

We propose an efficient experimental design algorithm for PbRL. Our algorithm is specifically designed for linear reward parametrization, which is commonly used in models such as the Bradley-Terry-Luce model, and can handle unknown transitions. This flexibility allows us to handle non-tabular transition models such as low-rank MDPs (Agarwal et al., 2020a) and linear MDPs (Jin et al., 2019). To the best of our knowledge, existing works with statistical guarantees cannot incorporate these models efficiently. Notably, our experimental design algorithm does not depend on any information about the reward and is reward-agnostic. Therefore, the collected trajectories can be reused for learning many reward functions at the same time.

Our key idea is to decouple the interaction with the environment from the collection of human feedback.
This decoupling not only simplifies the process of obtaining human feedback in practice but also results in a significant reduction in the sample complexity associated with human feedback compared to existing works (Pacchiano et al., 2021; Chen et al., 2022b). This improvement is particularly valuable as collecting human feedback is often a resource-intensive process.

To circumvent the scaling with the maximum per-trajectory reward in the trajectory-based comparison setting, we further investigate preference-based RL with action-based comparison and propose a provably efficient algorithm for this setting. We show that in this case the sample complexity only scales with the bound on the advantage functions of the optimal policy, which can be much smaller than the maximum per-trajectory reward (Ross et al., 2011; Agarwal et al., 2019).

1.1 RELATED WORKS

We refer the readers to Wirth et al. (2017) for an overview of Preference-based RL (PbRL). PbRL has been well explored in the bandit setting under the notion of dueling bandits (Yue et al., 2012; Zoghi et al., 2014; Dudík et al., 2015), where the goal is to find the optimal arm in the bandit given human preferences over action pairs. For MDPs, in addition to Pacchiano et al. (2021); Chen et al. (2022b), which we compare in the introduction, Novoseller et al. (2020); Xu et al. (2020) have also developed algorithms with sample complexity guarantees. Novoseller et al. (2020) proposes a double posterior sampling algorithm with an asymptotic regret sublinear in the horizon $H$. Xu et al. (2020) proposes a PAC RL algorithm but relies on potentially strong assumptions such as Strong Stochastic Transitivity. Note that both Novoseller et al. (2020) and Xu et al. (2020) are limited to the tabular setting.

Our algorithm shares a similar concept with reward-free RL, which focuses on exploration in the state-action space without using explicit rewards. Reward-free RL has been studied in many MDPs such as tabular MDPs (Jin et al., 2020a), linear MDPs (Wang et al., 2020), low-rank MDPs (Agarwal et al., 2020a), and several other models (Chen et al., 2022a; Zanette et al., 2020; Qiu et al., 2021). The goal of reward-free RL is to gather exploratory state-action data to address the challenge of unknown transitions before observing rewards. In contrast, our approach aims to design a single exploration distribution from which we can draw trajectory pairs to solicit human feedback for learning reward functions. Our setting can be considered as experimental design for PbRL.

2 PRELIMINARIES

We introduce our formulation of Markov decision processes (MDPs) and PbRL.

2.1 MDPS WITH LINEAR REWARD PARAMETRIZATION

We consider a finite-horizon MDP $M = (\mathcal{S}, \mathcal{A}, P^\star, r^\star, H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P^\star = \{P^\star_h\}_{h=1}^H$ is the ground-truth transition dynamics, $r^\star = \{r^\star_h\}_{h=1}^H$ is the ground-truth reward function, and $H$ is the horizon. Specifically, for each $h \in [H]$ (where $[H] := \{1, \dots, H\}$), $P^\star_h : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and $r^\star_h : \mathcal{S} \times \mathcal{A} \to [0, 1]$ denote the transition and reward function at step $h$, respectively. Moreover, we use $P^\star_1(\cdot)$ to denote the initial state distribution. Both $r^\star$ and $P^\star$ are unknown to the learner. In this work, we assume that the cumulative reward of any trajectory $\tau = (s_h, a_h)_{h=1}^H$ does not exceed $r_{\max}$, i.e., $\sum_{h=1}^H r^\star_h(s_h, a_h) \le r_{\max}$.

Policies and value functions. A policy $\pi = \{\pi_h\}_{h=1}^H$, where $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$ for each $h \in [H]$, characterizes the action selection probability for every state at each step.
In this paper, we assume the policy belongs to a policy class $\Pi$, which can be infinite. Given a reward function $r$ and a policy $\pi$, the associated value function and Q-function at time step $h$ are defined as follows:
$$V^{r,\pi}_h(s) = \mathbb{E}_{\pi, P^\star}\Big[\textstyle\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big], \qquad Q^{r,\pi}_h(s, a) = \mathbb{E}_{\pi, P^\star}\Big[\textstyle\sum_{h'=h}^H r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big].$$
Here, $\mathbb{E}_{\pi, P^\star}[\cdot]$ denotes the expectation over the distribution of the trajectory induced by policy $\pi$ and the transition $P^\star$. We use $V^{r,\pi}$ to denote the expected cumulative reward of policy $\pi$ with respect to reward function $r$ under $P^\star$, i.e., $V^{r,\pi} := \mathbb{E}_{s \sim P^\star_1}[V^{r,\pi}_1(s)]$, and use $V^{r,\star}$ to denote the maximal expected cumulative reward with respect to reward function $r$ under $P^\star$, i.e., $V^{r,\star} := \max_{\pi \in \Pi} V^{r,\pi}$. In particular, let $\pi^\star$ denote the best policy in $\Pi$ with respect to $r^\star$, i.e., $\pi^\star := \arg\max_{\pi \in \Pi} V^{r^\star,\pi}$. In contrast, we denote the globally optimal policy by $\pi^g := \arg\max_{\pi \in \Pi_{\mathrm{Mar}}} V^{r^\star,\pi}$, where $\Pi_{\mathrm{Mar}}$ is the set of all Markovian policies. Note that when $\Pi \neq \Pi_{\mathrm{Mar}}$, $\pi^\star$ might not be optimal compared to $\pi^g$.

Linear reward parametrization. To learn the unknown reward function, it is necessary to make structural assumptions about the reward. We consider a setting where the true reward function possesses a linear structure:

Assumption 1 (Linear Reward Parametrization). We assume the MDP has a linear reward parametrization with respect to (w.r.t.) known feature vectors $\phi_h(s, a) \in \mathbb{R}^d$. Specifically, for each $h \in [H]$, there exists an unknown vector $\theta^\star_h \in \mathbb{R}^d$ such that $r^\star_h(s, a) = \phi_h(s, a)^\top \theta^\star_h$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. For technical purposes, we suppose that for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$, we have $\|\phi_h(s, a)\| \le R$ and $\|\theta^\star_h\| \le B$.

Note that when $d = |\mathcal{S}||\mathcal{A}|$ and $\phi_h(s, a)$ is a one-hot encoding vector, we recover the tabular setting. Linear reward parametrization is commonly used in the literature on preference-based RL with statistical guarantees (Pacchiano et al., 2021; Zhu et al., 2023). See Appendix A for more details.

Notation. We use $r^\star(\tau) := \sum_{h=1}^H r^\star_h(s_h, a_h)$ to denote the ground-truth cumulative reward of trajectory $\tau$. In particular, $r^\star(\tau) = \langle \phi(\tau), \theta^\star \rangle$, where $\phi(\tau) := [\phi_1(s_1, a_1)^\top, \dots, \phi_H(s_H, a_H)^\top]^\top$ and $\theta^\star := [\theta_1^{\star\top}, \dots, \theta_H^{\star\top}]^\top$. We use $\phi(\pi)$ to denote $\mathbb{E}_{\tau \sim (\pi, P^\star)}[\phi(\tau)]$ for simplicity. We also use $\Theta(B)$ to denote the set $\{\theta \in \mathbb{R}^d : \|\theta\| \le B\}$ and $\Theta(B, H)$ to denote the set $\{\theta \in \mathbb{R}^{Hd} : \theta = [\theta_1^\top, \dots, \theta_H^\top]^\top,\ \theta_h \in \Theta(B)\ \forall h \in [H]\} \cap \{\theta \in \mathbb{R}^{Hd} : \langle \phi(\tau), \theta \rangle \le r_{\max}\ \forall \tau\}$. We write $f = O(g)$ when there exists a universal constant $C > 0$ such that $f \le Cg$, and $\tilde{O}(g) := O(g \log g)$.
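To make the notation concrete, here is a minimal illustrative sketch (not from the paper; the toy dimensions, random features, and trajectory are assumptions) of how the stacked trajectory feature $\phi(\tau)$ and the cumulative reward $r^\star(\tau) = \langle \phi(\tau), \theta^\star \rangle$ are computed under Assumption 1.

```python
import numpy as np

# Toy dimensions (assumptions for illustration only).
H, d = 4, 3                                   # horizon and per-step feature dimension
rng = np.random.default_rng(0)

# Unknown ground-truth reward parameters theta*_h, one vector per step.
theta_star = [rng.normal(size=d) for _ in range(H)]

def phi(h, s, a):
    """Known per-step feature map phi_h(s, a) in R^d (random toy features here)."""
    rng_sa = np.random.default_rng(hash((h, s, a)) % (2**32))
    return rng_sa.normal(size=d)

def trajectory_feature(tau):
    """phi(tau) = [phi_1(s_1,a_1); ...; phi_H(s_H,a_H)] in R^{Hd}."""
    return np.concatenate([phi(h, s, a) for h, (s, a) in enumerate(tau, start=1)])

def trajectory_reward(tau):
    """r*(tau) = <phi(tau), theta*> = sum_h phi_h(s_h,a_h)^T theta*_h."""
    theta = np.concatenate(theta_star)
    return trajectory_feature(tau) @ theta

tau = [(0, 1), (2, 0), (1, 1), (0, 0)]        # a toy trajectory of (s_h, a_h) pairs
print(trajectory_reward(tau))
```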
2.2 PREFERENCE-BASED REINFORCEMENT LEARNING

In this paper, we consider a framework for PbRL that mainly consists of the following four steps:

Step 1: Collect a dataset of trajectory pairs $\mathcal{D}_{\mathrm{reward}} = \{(\tau^{n,0}, \tau^{n,1})\}_{n=1}^N$ in a reward-agnostic fashion, where $\tau^{n,i} = \{s^{n,i}_h, a^{n,i}_h, s^{n,i}_{h+1}\}_{h=1}^H$ for $n \in [N]$ and $i \in \{0, 1\}$.

Step 2: Obtain preference feedback from human experts for each pair of trajectories in $\mathcal{D}_{\mathrm{reward}}$. Namely, if trajectory $\tau^{n,1}$ is preferred over $\tau^{n,0}$, then assign $o_n = 1$; otherwise assign $o_n = 0$.

Step 3: Estimate the ground-truth reward using the dataset $\mathcal{D}_{\mathrm{reward}}$ and the preference labels $\{o_n\}_{n=1}^N$.

Step 4: Run RL algorithms (either online or offline) using the learned reward and obtain a policy $\hat\pi$ that maximizes the cumulative learned reward.

The above framework has been applied in practical applications such as PEBBLE (Lee et al., 2021). However, these algorithms lack provable sample-efficiency guarantees. In particular, it remains unclear in the literature how to collect the trajectories in Step 1 so as to enable accurate estimation of the ground-truth reward. In our work, we strive to develop a concrete algorithm that adheres to the above framework while ensuring theoretical sample efficiency. We also emphasize that Step 1 is reward-agnostic, and the collected dataset can be reused for learning many different rewards as long as they are linear in the feature $\phi$.

Algorithm 1 REGIME: Experimental Design for Querying Human Preference
1: Input: regularization parameter $\lambda$, model estimation accuracy $\epsilon'$, parameters $\epsilon, \delta$.
2: Initialize $\hat\Sigma_1 = \lambda I$.
3: Estimate the model $\hat{P} \leftarrow \mathcal{P}(\Pi, \epsilon', \delta/4)$ (this possibly requires interaction with the environment).
4: for $n = 1, \dots, N$ do
5:   Compute $(\pi^{n,0}, \pi^{n,1}) \in \arg\max_{\pi^0, \pi^1 \in \Pi} \|\hat\phi(\pi^0) - \hat\phi(\pi^1)\|_{\hat\Sigma_n^{-1}}$.
6:   Update $\hat\Sigma_{n+1} = \hat\Sigma_n + (\hat\phi(\pi^{n,0}) - \hat\phi(\pi^{n,1}))(\hat\phi(\pi^{n,0}) - \hat\phi(\pi^{n,1}))^\top$.
7: end for
8: for $n = 1, \dots, N$ do
9:   Collect a pair of trajectories $\tau^{n,0}, \tau^{n,1}$ from the environment by executing $\pi^{n,0}, \pi^{n,1}$, respectively.
10:  Add it to $\mathcal{D}_{\mathrm{reward}}$.
11: end for
12: Obtain the preference labels $\{o_n\}_{n=1}^N$ for $\mathcal{D}_{\mathrm{reward}}$ from human experts.
13: Run MLE $\hat\theta \in \arg\max_{\theta \in \Theta(B,H)} L(\theta, \mathcal{D}_{\mathrm{reward}}, \{o_n\}_{n=1}^N)$, where $L(\theta, \mathcal{D}_{\mathrm{reward}}, \{o_n\}_{n=1}^N)$ is defined in (1).
14: Return $\hat\pi = \arg\max_{\pi \in \Pi} \langle \hat\phi(\pi), \hat\theta \rangle$.

Preference model. In this work, we assume that the preference label in Step 2 follows the Bradley-Terry-Luce (BTL) model (Bradley and Terry, 1952); i.e., we make the following assumption:

Assumption 2. For any pair of trajectories $(\tau^0, \tau^1)$, we have
$$\mathbb{P}(o = 1) = \mathbb{P}(\tau^1 \succ \tau^0) = \sigma\big(r^\star(\tau^1) - r^\star(\tau^0)\big) = \frac{\exp(r^\star(\tau^1))}{\exp(r^\star(\tau^0)) + \exp(r^\star(\tau^1))},$$
where $o$ is the human preference over $(\tau^0, \tau^1)$ and $\sigma(\cdot)$ is the sigmoid function.

Our analysis will leverage the quantity $\kappa := \sup_{|x| \le r_{\max}} |1/\sigma'(x)| = 2 + \exp(2 r_{\max}) + \exp(-2 r_{\max})$ to measure the difficulty of estimating the true reward under the BTL preference model.
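As a concrete illustration of Assumption 2, the following sketch simulates the preference labels of Step 2 by sampling $o \sim \mathrm{Bernoulli}\big(\sigma(r^\star(\tau^1) - r^\star(\tau^0))\big)$ and evaluates $\kappa$; the simulated labeler is an assumption used only for illustration and is not part of the algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def btl_label(r_tau0, r_tau1, rng):
    """Sample o in {0, 1} with P(o = 1) = sigma(r*(tau^1) - r*(tau^0)) (Assumption 2)."""
    p_prefer_1 = sigmoid(r_tau1 - r_tau0)
    return int(rng.random() < p_prefer_1)

def kappa(r_max):
    """kappa = sup_{|x| <= r_max} |1 / sigma'(x)| = 2 + exp(2 r_max) + exp(-2 r_max),
    quantifying how flat the sigmoid can be over the relevant reward range."""
    return 2.0 + np.exp(2.0 * r_max) + np.exp(-2.0 * r_max)

rng = np.random.default_rng(0)
labels = [btl_label(r_tau0=1.0, r_tau1=1.5, rng=rng) for _ in range(5)]
print(labels, kappa(r_max=2.0))
```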
3 ALGORITHM: REGIME

We propose an algorithm specifically designed for the PbRL setting when the transitions are unknown. In order to handle unknown transitions, we use the following mild oracle:

Definition 1 (Reward-free RL oracle). A reward-free learning oracle $\mathcal{P}(\Pi, \epsilon, \delta)$ returns an estimated model $\hat{P}$ such that, with probability at least $1 - \delta$, for all policies $\pi \in \Pi$ and all $h \in [H]$,
$$\big\|\hat{P}_1(\cdot) - P^\star_1(\cdot)\big\|_1 \le \epsilon, \qquad \mathbb{E}_{\pi, P^\star}\big[\big\|\hat{P}_h(\cdot \mid s_h, a_h) - P^\star_h(\cdot \mid s_h, a_h)\big\|_1\big] \le \epsilon,$$
where $\|\cdot\|_1$ denotes the total variation distance (i.e., the $\ell_1$-norm).

This oracle necessitates accurate model learning through interactions with the environment. The required guarantee is relatively mild since we do not require a point-wise error guarantee, but rather an expectation-based guarantee under the ground-truth transition. This oracle is available not only in tabular MDPs (Jin et al., 2020a), but also in low-rank MDPs (Agarwal et al., 2020a; 2022), where the only assumption is the low-rank property of the transition dynamics and the features may be unknown to the learner. Low-rank MDPs find wide application in practical scenarios, including blocked MDPs (Du et al., 2019; Zhang et al., 2020a;b; Sodhani et al., 2021; 2022).

3.1 ALGORITHM

The algorithm is described in Algorithm 1. Given a learned model $\hat{P}$, we use $\hat\phi(\pi) = \mathbb{E}_{\tau \sim (\pi, \hat{P})}[\phi(\tau)]$ to estimate $\phi(\pi) := \mathbb{E}_{\tau \sim (\pi, P^\star)}[\phi(\tau)]$. The algorithm mainly consists of the following four steps.

Step 1: Collection of state-action trajectories by interacting with the environment (Lines 4-11). To learn the ground-truth reward function, we collect exploratory state-action trajectories that cover the space spanned by $\phi(\cdot)$ before collecting any human feedback. To achieve this, at each iteration we identify an explorative policy pair that is not yet covered by the existing data. We measure the extent to which the trajectories generated by $(\pi^0, \pi^1)$ are covered by computing the norm of $\hat\phi(\pi^0) - \hat\phi(\pi^1)$ under the metric induced by the inverse covariance matrix $\hat\Sigma_n^{-1}$ at iteration $n$. After iterating this procedure $N$ times and obtaining the policy pairs $\{(\pi^{n,0}, \pi^{n,1})\}_{n=1}^N$, we sample $N$ exploratory trajectory pairs by executing the policy pairs $(\pi^{n,0}, \pi^{n,1})$ for $n \in [N]$. Notably, this trajectory-collection process is reward-agnostic, and thus the collected samples can be used to learn multiple rewards in multi-task RL.

Step 2: Collection of preference feedback by interacting with human experts (Line 12). If trajectory $\tau^{n,1}$ is preferred over $\tau^{n,0}$, then assign $o_n = 1$; otherwise assign $o_n = 0$.

Step 3: Reward learning via MLE (Line 13). We adopt the widely used maximum likelihood estimation (MLE) approach to learn the reward function, which has also been employed in other works (Ouyang et al., 2022; Christiano et al., 2017; Brown et al., 2019; Shin et al., 2023; Zhu et al., 2023). Specifically, we learn the reward model by maximizing the log-likelihood
$$L(\theta, \mathcal{D}_{\mathrm{reward}}, \{o_n\}_{n=1}^N) = \sum_{n=1}^N \log\Big( o_n\, \sigma\big(\langle \theta, \phi(\tau^{n,1}) - \phi(\tau^{n,0}) \rangle\big) + (1 - o_n)\, \sigma\big(\langle \theta, \phi(\tau^{n,0}) - \phi(\tau^{n,1}) \rangle\big) \Big). \qquad (1)$$

Step 4: RL with respect to the learned reward (Line 14). We obtain a near-optimal policy that maximizes the cumulative learned reward.

Our algorithm differs significantly from the algorithms proposed in Pacchiano et al. (2021); Chen et al. (2022b). Their algorithms repeat the following steps: (a) collect new trajectories from the environment using policies based on the currently learned reward and transition models, (b) collect human feedback for the obtained trajectories, and (c) update the reward and transition models. A potential issue with this approach is that every time human feedback is collected, agents need to interact with the environment, causing a wait time for humans. In contrast, our algorithm first collects exploratory trajectories without collecting any human feedback in Step 1. Then, we query human feedback and learn the reward model in Steps 2-3. As a result, we decouple the step of collecting exploratory data from that of collecting human feedback. Hence, in our algorithm, we can efficiently query human feedback in parallel, mirroring the common practice in InstructGPT. Moreover, our algorithm's design leads to lower sample complexity for both trajectory pairs and human feedback than Pacchiano et al. (2021); Chen et al. (2022b). See Appendix A for our technical novelty.

Remark 1. Our collection method in Step 1 shares a similar idea with active learning. See Appendix A.

Remark 2. The majority of the computational cost lies in line 5 of Algorithm 1. To implement the algorithm, gradient ascent can be applied to solve this optimization problem. See Appendix A.

Remark 3. In Step 4 (Line 14), it is not necessary to use the same $\hat{P}$ as in Line 3. Instead, any sample-efficient RL algorithm can be employed with respect to the learned reward, such as Lee et al. (2021).

3.2 ANALYSIS

We now provide the sample complexity of Algorithm 1 in the following theorem.

Theorem 1. Let $\lambda \ge 4HR^2$, $N \ge \tilde{O}\big(\lambda \kappa^2 B^2 R^2 H^4 d^2 \log(1/\delta) / \epsilon^2\big)$, and $\epsilon' \le \epsilon/(6BRH^2)$. Then under Assumptions 1 and 2, with probability at least $1 - \delta$, we have $V^{r^\star, \hat\pi} \ge V^{r^\star, \star} - \epsilon$.

Note that the sample complexity in Theorem 1 does not depend on the complexity of $\Pi$, so we can handle arbitrary policy classes. When $\Pi = \Pi_{\mathrm{Mar}}$, we have $\pi^\star = \pi^g$ and thus we can compete against the globally optimal policy.
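Before turning to the sample-complexity discussion, the following is a minimal sketch of Steps 1 and 3 of Algorithm 1 under simplifying assumptions: the policy class is a small finite set so the arg max in Line 5 is solved by enumeration, $\hat\phi(\pi)$ is supplied by the caller (e.g., estimated by rollouts in $\hat{P}$), and the MLE in (1) is solved by projected gradient ascent onto the norm ball rather than the exact constraint set $\Theta(B, H)$. All helper names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regime_design(phi_hat, policies, N, lam):
    """Step 1 (sketch): greedily pick policy pairs whose estimated feature
    difference is largest under the inverse-covariance norm, then update Sigma."""
    Hd = phi_hat(policies[0]).shape[0]
    Sigma = lam * np.eye(Hd)
    pairs = []
    for _ in range(N):
        Sigma_inv = np.linalg.inv(Sigma)
        best, best_val = None, -np.inf
        for p0 in policies:
            for p1 in policies:
                diff = phi_hat(p0) - phi_hat(p1)
                val = diff @ Sigma_inv @ diff        # squared Sigma^{-1}-norm; same maximizer
                if val > best_val:
                    best, best_val = (p0, p1), val
        d0, d1 = best
        diff = phi_hat(d0) - phi_hat(d1)
        Sigma += np.outer(diff, diff)                # rank-one covariance update (Line 6)
        pairs.append(best)
    return pairs

def mle_reward(feat_diffs, labels, B, lr=0.1, iters=2000):
    """Step 3 (sketch): maximize the log-likelihood (1) by projected gradient ascent.
    feat_diffs[n] = phi(tau^{n,1}) - phi(tau^{n,0}); labels[n] = o_n."""
    X = np.asarray(feat_diffs)
    o = np.asarray(labels, dtype=float)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)                       # P(o_n = 1) under the current theta
        grad = X.T @ (o - p)                         # gradient of the log-likelihood
        theta += lr * grad / len(o)
        norm = np.linalg.norm(theta)
        if norm > B:                                 # crude projection onto a norm ball
            theta *= B / norm
    return theta
```

In practice, Remark 2 suggests replacing the enumeration in `regime_design` with gradient ascent over policy parameters; Appendix A sketches how the required gradients can be estimated.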
Since the sample complexity of human feedback, denoted by $N_{\mathrm{hum}}$, is equal to $N$, Theorem 1 shows that the amount of human feedback required to learn an $\epsilon$-optimal policy scales as $O(1/\epsilon^2)$ and is polynomial in the norm bounds $B$, $R$, the horizon $H$, and the feature dimension $d$. Notably, the sample complexity of human feedback $N_{\mathrm{hum}}$ depends only on the structural complexity of the reward function, regardless of the underlying transition model. This is because, while our theorem requires the learned transition model to be accurate enough ($\epsilon' \le \epsilon/(6BRH^2)$), we do not need human feedback to learn the transition model. This property of our algorithm is particularly desirable when collecting human feedback is much more expensive than collecting trajectories from the environment. Existing works with sample-efficient guarantees, such as Pacchiano et al. (2021); Chen et al. (2022b), do not have this property. This favorable property can be attributed to the careful design of the algorithm, in which the step of collecting trajectories and learning transitions is reward-agnostic and thus separated from the step of collecting human feedback and learning rewards. Furthermore, note that our algorithm indeed works beyond low-rank MDPs, as long as there exists a suitable reward-free model-learning oracle. See Appendix A for more details.

As the most relevant work, we compare our results with Pacchiano et al. (2021), which considers online learning in PbRL with unknown tabular transition models and linear reward parameterization. Let $N_{\mathrm{tra}}$ and $N_{\mathrm{hum}}$ denote the number of required trajectory pairs and the amount of human feedback, respectively. Then, to obtain an $\epsilon$-optimal policy, the algorithm in Pacchiano et al. (2021, Theorem 2) requires
$$N_{\mathrm{tra}} = N_{\mathrm{hum}} = \tilde{O}\Big(\frac{|\mathcal{S}|^2 |\mathcal{A}| d + \kappa^2 d^2}{\epsilon^2}\Big).$$
Here we omit the dependence on $B$, $R$, and $H$ to facilitate the comparison. In contrast, in the setting considered in Pacchiano et al. (2021), by leveraging the reward-free learning oracle from Jin et al. (2020a), our algorithm achieves the following sample complexity:
$$N_{\mathrm{tra}} \le \tilde{O}\Big(\frac{|\mathcal{S}|^2 |\mathcal{A}| d + \kappa^2 d^2}{\epsilon^2}\Big), \qquad N_{\mathrm{hum}} = \tilde{O}\Big(\frac{\kappa^2 d^2}{\epsilon^2}\Big),$$
where the number of required trajectory pairs follows from Jin et al. (2020a, Lemma 3.6). We observe that our algorithm achieves a better sample complexity for human feedback than the previous work while retaining the total trajectory complexity. In particular, our algorithm has the advantage that $N_{\mathrm{hum}}$ depends only on the feature dimension $d$ and not on $|\mathcal{S}|$ or $|\mathcal{A}|$. This improvement is significant since obtaining human feedback is often costly. Lastly, we note that a similar comparison can be made to the work of Chen et al. (2022b), which considers reward and transition models with bounded Eluder dimension.
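For intuition on how the design objective of Step 1 combines with the MLE guarantee, the following condensed derivation (a sketch of the argument given in Appendix C for the known-transition case, with $\hat{r}$ denoting the learned reward) bounds the suboptimality by a feature-coverage term times a reward-estimation term:

```latex
\begin{align*}
V^{r^\star,\star} - V^{r^\star,\hat\pi}
&= \big(V^{r^\star,\pi^\star} - V^{\hat r,\pi^\star}\big)
 + \big(V^{\hat r,\pi^\star} - V^{\hat r,\hat\pi}\big)
 + \big(V^{\hat r,\hat\pi} - V^{r^\star,\hat\pi}\big) \\
&\le \big(V^{r^\star,\pi^\star} - V^{\hat r,\pi^\star}\big)
 + \big(V^{\hat r,\hat\pi} - V^{r^\star,\hat\pi}\big)
 && \text{($\hat\pi$ maximizes $V^{\hat r,\pi}$ over $\Pi$)} \\
&= \big\langle \phi(\pi^\star) - \phi(\hat\pi),\ \theta^\star - \hat\theta \big\rangle \\
&\le \underbrace{\big\|\phi(\pi^\star) - \phi(\hat\pi)\big\|_{\Sigma_{N+1}^{-1}}}_{\text{controlled by the design of Step 1}}
 \cdot \underbrace{\big\|\theta^\star - \hat\theta\big\|_{\Sigma_{N+1}}}_{\text{controlled by the MLE of Step 3}}
 && \text{(Cauchy-Schwarz)}.
\end{align*}
```

The first factor is bounded via the elliptical potential lemma because the selected policy pairs maximize exactly this norm, and the second factor is bounded by the MLE guarantee together with a concentration argument relating $\Sigma_{N+1}$ to its empirical counterpart.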
4 REGIME IN LINEAR MDPS

So far, we have considered PbRL given a reward-free RL oracle satisfying Definition 1. Existing works have shown the existence of such a model-based reward-free RL oracle in low-rank MDPs (Agarwal et al., 2020a; 2022). However, these results have not been extended to linear MDPs (Jin et al., 2020b), where model-free techniques are necessary. Linear MDPs are relevant to our setting because linear reward parametrization naturally holds in linear MDPs. Unfortunately, a direct reduction from linear MDPs to low-rank MDPs may introduce a dependence on the cardinality of $\mathcal{S}$ unless a strong inductive bias is assumed on the function class. In this section, we propose a model-free algorithm that overcomes this dependence by making slight modifications to Algorithm 1. We begin with the definition of linear MDPs.

Assumption 3 (Linear MDPs (Jin et al., 2020b)). We suppose the MDP is linear with respect to some known feature vectors $\phi_h(s, a) \in \mathbb{R}^d$ ($h \in [H]$, $s \in \mathcal{S}$, $a \in \mathcal{A}$); that is, for each $h \in [H]$, there exist $d$ unknown signed measures $\mu^\star_h = (\psi^{(1)}_h, \dots, \psi^{(d)}_h)$ over $\mathcal{S}$ and an unknown vector $\theta^\star_h \in \mathbb{R}^d$ such that $P^\star_h(\cdot \mid s, a) = \langle \phi_h(s, a), \mu^\star_h(\cdot) \rangle$ and $r^\star_h(s, a) = \phi_h(s, a)^\top \theta^\star_h$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. For technical purposes, we assume the norm bound $\|\mu^\star_h(s)\|_2 \le \sqrt{d}$ for any $s \in \mathcal{S}$.

In addition, we use $N_\Pi(\epsilon)$ to denote the covering number of $\Pi$, defined as follows:

Definition 2 ($\epsilon$-covering number). The $\epsilon$-covering number of the policy class $\Pi$, denoted by $N_\Pi(\epsilon)$, is the minimum integer $n$ such that there exists a subset $\Pi' \subset \Pi$ with $|\Pi'| = n$ and, for any $\pi \in \Pi$, there exists $\pi' \in \Pi'$ such that $\max_{s \in \mathcal{S}, h \in [H]} \|\pi_h(\cdot \mid s) - \pi'_h(\cdot \mid s)\|_1 \le \epsilon$.

Algorithm 2 REGIME-lin
Input: regularization parameter $\lambda$, feature-estimation sample size $K$.
Call Algorithm 4 to generate $K$ trajectories by interacting with the environment.
Call Algorithm 5 with reward functions $(r^{h,j}_{h'})_{h' \in [H]}$ to estimate $(\hat\phi(\pi))_{h,j}$ for all $\pi \in \Pi$, $h \in [H]$, $j \in [d]$ using the $K$ trajectories. Let $\hat\phi(\pi) = [\hat\phi_1(\pi)^\top, \dots, \hat\phi_H(\pi)^\top]^\top$, where the $j$-th entry of $\hat\phi_h(\pi)$ is $(\hat\phi(\pi))_{h,j}$.
Initialize $\hat\Sigma_1 = \lambda I$.
for $n = 1, \dots, N$ do
  Compute $(\pi^{n,0}, \pi^{n,1}) \in \arg\max_{\pi^0, \pi^1 \in \Pi} \|\hat\phi(\pi^0) - \hat\phi(\pi^1)\|_{\hat\Sigma_n^{-1}}$.
  Update $\hat\Sigma_{n+1} = \hat\Sigma_n + (\hat\phi(\pi^{n,0}) - \hat\phi(\pi^{n,1}))(\hat\phi(\pi^{n,0}) - \hat\phi(\pi^{n,1}))^\top$.
end for
for $n = 1, \dots, N$ do
  Collect a pair of trajectories $\tau^{n,0}, \tau^{n,1}$ from the environment by executing $\pi^{n,0}, \pi^{n,1}$, respectively.
  Add $(\tau^{n,0}, \tau^{n,1})$ to $\mathcal{D}_{\mathrm{reward}}$.
end for
Obtain the preference labels $\{o^{(n)}\}_{n=1}^N$ from human experts.
Run MLE $\hat\theta \in \arg\min_{\theta \in \Theta(B,H)} L_\lambda(\theta, \mathcal{D}_{\mathrm{reward}}, \{o^{(n)}\}_{n=1}^N)$.
Return $\hat\pi = \arg\max_{\pi \in \Pi} \hat{V}^\pi(\hat{r})$, where $\hat{V}^\pi(\hat{r})$ is obtained by calling Algorithm 5 with the reward function $\hat{r} = \{\hat{r}_h\}_{h=1}^H$, $\hat{r}_h(s, a) = \langle \phi_h(s, a), \hat\theta_h \rangle$, for each $\pi$.

4.1 ALGORITHM

A reward-free RL oracle satisfying Definition 1, i.e., learning accurate transitions, may be excessively strong for linear MDPs. Upon closer examination of Algorithm 1, it becomes apparent that the learned transition model is used solely for estimating $\phi(\pi)$. Therefore, our approach focuses on obtaining a precise estimate of $\phi(\pi)$. Our main algorithm is described in Algorithm 2, with subroutines for estimating $\hat\phi(\pi)$. The overall structure of the main algorithm resembles that of Algorithm 1; the key distinction lies in the part that accurately estimates $\hat\phi(\pi)$ within the subroutines, without relying on the abstract reward-free RL oracle (Definition 1). In the following, we provide a brief explanation of these subroutines; their detailed descriptions are deferred to Algorithms 4 and 5 in Appendix B.

Collecting exploratory data to learn transitions. Inspired by the approach of Jin et al. (2020b); Wang et al. (2020), we construct an exploratory dataset by running LSVI-UCB (Jin et al., 2020b) with the reward set equal to the bonus. Specifically, in the $k$-th iteration, we recursively apply least-squares value iteration with a bonus term $\{b^k_h(s, a)\}_{h=1}^H$, which is introduced to induce exploration. This process yields an exploratory policy $\pi^k$ based on the exploratory rewards $\{r^k_h\}_{h=1}^H$, where $r^k_h = b^k_h / H$. We then collect a trajectory by executing policy $\pi^k$. Repeating this procedure for $K$ iterations, we accumulate an exploratory dataset. The detailed algorithm is provided in Appendix B (Algorithm 4). It is important to note that this step involves generating $K$ trajectories through interactions with the environment.

Estimating $\phi(\pi)$ using the exploratory data. Let $(\phi(\pi))_{h,j}$ denote the $j$-th entry of $\phi_h(\pi) := \mathbb{E}_\pi[\phi_h(s_h, a_h)]$. To estimate $\phi(\pi)$, we only need to estimate $(\phi(\pi))_{h,j}$ for all $h \in [H]$, $j \in [d]$. Note that for all $\pi \in \Pi$ we have $\phi(\pi) = \big[\mathbb{E}_{\pi, P^\star}[\phi_1(s_1, a_1)]^\top, \dots, \mathbb{E}_{\pi, P^\star}[\phi_H(s_H, a_H)]^\top\big]^\top$. The key observation is that $(\phi(\pi))_{h,j}$ is exactly the expected cumulative reward (up to a factor of $R$) with respect to the reward function $r^{h,j}_{h'}(s, a) = \phi_{h'}(s, a)^\top \theta^{h,j}_{h'}$ for all $h' \in [H]$, where $\theta^{h,j}_{h'} = \frac{1}{R} e_j$ if $h' = h$ and $\theta^{h,j}_{h'} = 0$ otherwise. Here $e_j$ is the one-hot vector whose $j$-th entry is 1. Therefore, with the collected dataset, we can run least-squares policy evaluation to estimate $(\phi(\pi))_{h,j}$; the details are given in Algorithm 5 in Appendix B.
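The following is a minimal sketch of this reduction (in the spirit of Algorithm 5, but omitting the bonus and clipping terms used in the analysis): a simplified least-squares policy evaluation routine estimates $V^\pi(r)$ from the exploratory dataset and is then applied to the indicator reward $r^{h,j}$ to recover $(\hat\phi(\pi))_{h,j}$. The dataset format, the policy interface, and 0-based step indexing are assumptions for illustration.

```python
import numpy as np

def lspe_value(data, init_states, pi, phi, reward, H, d, lam=1.0):
    """Least-squares policy evaluation (sketch): estimate V^pi(r) from off-policy
    data[h] = [(s, a, s'), ...] by backward regression on Bellman targets."""
    w = [np.zeros(d) for _ in range(H)]
    for h in reversed(range(H)):
        Lam = lam * np.eye(d)
        target = np.zeros(d)
        for (s, a, sp) in data[h]:
            f = phi(h, s, a)
            Lam += np.outer(f, f)
            # V_{h+1}(s') = E_{a' ~ pi_{h+1}(.|s')}[ phi^T w_{h+1} + r_{h+1}(s', a') ]
            v_next = 0.0
            if h + 1 < H:
                for ap, p in pi(h + 1, sp):          # pi yields (action, probability) pairs
                    v_next += p * (phi(h + 1, sp, ap) @ w[h + 1] + reward(h + 1, sp, ap))
            target += f * v_next
        w[h] = np.linalg.solve(Lam, target)
    # Average V^pi_1 over the sampled initial states.
    vals = []
    for s1 in init_states:
        v1 = 0.0
        for a, p in pi(0, s1):
            v1 += p * (phi(0, s1, a) @ w[0] + reward(0, s1, a))
        vals.append(v1)
    return float(np.mean(vals))

def estimate_phi_entry(data, init_states, pi, phi, H, d, h_target, j, R):
    """(phi_hat(pi))_{h,j} = R * V^pi(r^{h,j}) with r^{h,j}_h(s,a) = phi_h(s,a)[j] / R."""
    reward = lambda h, s, a: (phi(h, s, a)[j] / R) if h == h_target else 0.0
    return R * lspe_value(data, init_states, pi, phi, reward, H, d)
```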
4.2 ANALYSIS

We now present the sample complexity of Algorithm 2. The formal statement and proof are deferred to Appendices B and E.1.

Theorem 2 (Informal). By choosing the parameters appropriately and setting
$$K \ge \tilde{O}\Big(\frac{H^9 B^2 R^4 d^5 \log(N_\Pi(\epsilon')/\delta)}{\epsilon^2}\Big), \qquad N \ge \tilde{O}\Big(\frac{\lambda \kappa^2 B^2 R^2 H^4 d^2 \log(1/\delta)}{\epsilon^2}\Big), \qquad \epsilon' = \frac{\epsilon}{72 B R^2 H},$$
under Assumptions 1, 2, and 3, with probability at least $1 - \delta$, we have $V^{r^\star, \hat\pi} \ge V^{r^\star, \star} - \epsilon$. Furthermore, by selecting the policy class $\Pi$ properly, we have $V^{r^\star, \hat\pi} \ge V^{r^\star, \pi^g} - 2\epsilon$ after replacing $\log(N_\Pi(\epsilon')/\delta)$ with $Hd \log\frac{12WR}{\epsilon'}$, where $W = \frac{(B + (H + \epsilon)\sqrt{d})\, H \log|\mathcal{A}|}{\epsilon}$.

The first statement says that Algorithm 2 can learn an $\epsilon$-optimal policy with the following numbers of trajectory pairs and human feedback (omitting the dependence on $B$, $R$, and $H$):
$$N_{\mathrm{tra}} = K + N = \tilde{O}\Big(\frac{d^5 \log N_\Pi(\epsilon') + \kappa^2 d^2}{\epsilon^2}\Big), \qquad N_{\mathrm{hum}} = \tilde{O}\Big(\frac{\kappa^2 d^2}{\epsilon^2}\Big).$$
Since the sample complexity depends on the covering number of $\Pi$, we need to choose the policy class carefully. When we choose $\Pi$ to be the log-linear policy class
$$\Pi = \Big\{ \pi = \{\pi^\zeta_h\}_{h=1}^H : \pi^\zeta_h(a \mid s) = \frac{\exp(\zeta_h^\top \phi_h(s, a))}{\sum_{a' \in \mathcal{A}} \exp(\zeta_h^\top \phi_h(s, a'))},\ \zeta_h \in \mathcal{B}(d, W),\ \forall s \in \mathcal{S}, a \in \mathcal{A}, h \in [H] \Big\},$$
where $\mathcal{B}(d, W)$ is the $d$-dimensional Euclidean ball of radius $W$, then although $\pi^\star \neq \pi^g$ in general, we can show that the value of $\pi^\star$ is within $\epsilon$ of the value of $\pi^g$ by setting $W$ sufficiently large. This immediately leads to the second statement in Theorem 2. Consequently, to learn an $\epsilon$-globally-optimal policy, the number of required trajectory pairs and the amount of human feedback for Algorithm 2 do not depend on $|\mathcal{S}|$ at all.
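As a small illustration of the log-linear class above (illustrative code; the interface is an assumption), the policy is just a per-step softmax over linear scores; increasing the radius $W$ of the parameter ball sharpens it toward the greedy policy, which is why a sufficiently large $W$ makes the class nearly globally optimal.

```python
import numpy as np

def log_linear_policy(zeta, phi, h, s, actions):
    """pi^zeta_h(a|s) proportional to exp(zeta_h^T phi_h(s, a)) over a in A."""
    logits = np.array([zeta[h] @ phi(h, s, a) for a in actions])
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()             # scaling zeta (larger W) sharpens toward greedy
```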
Finally, we compare our work to Chen et al. (2022b), as it is the only existing work that addresses provable PbRL with non-tabular transition models. Their algorithm exhibits sample complexities that depend on the Eluder dimension associated with the transition models. However, in linear MDPs, it remains unclear whether one can upper-bound the Eluder dimension without introducing a dependence on $|\mathcal{S}|$. Consequently, our Algorithm 2 is the first provable PbRL algorithm capable of achieving polynomial sample complexity that is independent of $|\mathcal{S}|$ in linear MDPs.

5 REGIME WITH ACTION-BASED COMPARISON

The drawback of the results so far is that the sample complexity depends on $\kappa$, which can exhibit exponential growth in $r_{\max}$ under the BTL model. This is due to the fact that $\sup_{|x| \le r_{\max}} |1/\sigma'(x)| = O(\exp(r_{\max}))$. Such dependence on $r_{\max}$ is undesirable, especially when rewards are dense and $r_{\max}$ scales linearly with $H$. Similar limitations are present in existing works, such as Pacchiano et al. (2021); Chen et al. (2022b). To address this challenge, we consider the action-based comparison model (Zhu et al., 2023) in this section. Here, we assume that humans compare two actions based on their optimal Q-values. Given a tuple $(s, a^0, a^1, h)$, the human provides feedback $o$ following
$$\mathbb{P}(o = 1 \mid s, a^0, a^1, h) = \mathbb{P}(a^1 \succ a^0 \mid s, h) = \sigma\big(A^\star_h(s, a^1) - A^\star_h(s, a^0)\big), \qquad (2)$$
where $A^\star_h$ is the advantage function of the optimal policy. Similar to trajectory-based comparisons with linear reward parametrization, we assume linearly parameterized advantage functions:

Assumption 4 (Linear Advantage Parametrization). An MDP has linear advantage functions with respect to some known feature vectors $\phi_h(s, a) \in \mathbb{R}^d$ ($h \in [H]$, $s \in \mathcal{S}$, $a \in \mathcal{A}$) if for each $h \in [H]$ there exists an unknown vector $\xi^\star_h \in \mathbb{R}^d$ such that $A^\star_h(s, a) = \phi_h(s, a)^\top \xi^\star_h$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. We assume that for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$, we have $\|\phi_h(s, a)\| \le R$ and $\|\xi^\star_h\| \le B$.

Generally, the value of $|A^\star_h(s, a)|$ tends to be much smaller than $H$, since a large value of $|A^\star_h(s, a)|$ implies that it may be difficult to recover from a previous incorrect action even under the best policy $\pi^\star$ (Ross et al., 2011; Agarwal et al., 2019). Therefore, defining $B_{\mathrm{adv}} = \sup_{(s, a, h)} |A^\star_h(s, a)|$, we expect that $B_{\mathrm{adv}}$ will be much smaller than $H$, even in scenarios with dense rewards. In the following discussion, we use $\mathcal{Z}(B, h)$ to denote the convex set $\{\zeta \in \mathbb{R}^d : \|\zeta\| \le B,\ |\langle \phi_h(s, a), \zeta \rangle| \le B_{\mathrm{adv}}\ \forall s \in \mathcal{S}, a \in \mathcal{A}\}$.

We consider the setting where $\Pi = \Pi_{\mathrm{Mar}}$ and assume the transition model is known, for brevity. In the case of unknown transition models, we can employ the same approach as described in Section 3 with reward-free RL oracles. We present our algorithm for the action-based comparison model in Algorithm 3.

Algorithm 3 REGIME-action
1: Input: regularization parameter $\lambda$.
2: for $h = 1, \dots, H$ do
3:   Initialize $\Sigma_{h,1} = \lambda I$.
4:   for $n = 1, \dots, N$ do
5:     Compute $(\pi^{h,n,0}, \pi^{h,n,1}) \in \arg\max_{\pi^0, \pi^1 \in \Pi} \big\|\mathbb{E}_{s_h \sim \pi^0}[\phi_h(s_h, \pi^0) - \phi_h(s_h, \pi^1)]\big\|_{\Sigma_{h,n}^{-1}}$,
6:        where $\phi_h(s, \pi) = \mathbb{E}_{a \sim \pi_h(\cdot \mid s)}[\phi_h(s, a)]$.
7:     Update $\Sigma_{h,n+1} = \Sigma_{h,n} + \big(\mathbb{E}_{s_h \sim \pi^{h,n,0}}[\phi_h(s_h, \pi^{h,n,0}) - \phi_h(s_h, \pi^{h,n,1})]\big)\big(\mathbb{E}_{s_h \sim \pi^{h,n,0}}[\phi_h(s_h, \pi^{h,n,0}) - \phi_h(s_h, \pi^{h,n,1})]\big)^\top$.
8:   end for
9: end for
10: for $h = 1, \dots, H$ do
11:   for $n = 1, \dots, N$ do
12:     Sample $s_{h,n}$ at time step $h$ by executing the policy $\pi^{h,n,0} = \{\pi^{h,n,0}_k\}_{k=1}^H$.
13:     Sample actions $a_{h,n,0} \sim \pi^{h,n,0}_h(\cdot \mid s_{h,n})$ and $a_{h,n,1} \sim \pi^{h,n,1}_h(\cdot \mid s_{h,n})$.
14:     Add $(s_{h,n}, a_{h,n,0}, a_{h,n,1})$ to $\mathcal{D}^h_{\mathrm{adv}}$.
15:     (These steps involve interaction with the environment.)
16:   end for
17: end for
18: Obtain the preference labels $\{o_{h,n}\}_{n=1}^N$ for each $\mathcal{D}^h_{\mathrm{adv}}$ from human experts.
19: Run MLE $\hat\xi_h \in \arg\min_{\xi \in \mathcal{Z}(B,h)} L(\xi, \mathcal{D}^h_{\mathrm{adv}}, \{o_{h,n}\}_{n=1}^N)$.
20: Compute, for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $h \in [H]$:
21:   $\hat{A}_h(s, a) \leftarrow \langle \phi_h(s, a), \hat\xi_h \rangle$, $\hat\pi_h(s) \leftarrow \arg\max_{a \in \mathcal{A}} \hat{A}_h(s, a)$.
22: Return $\hat\pi = \{\hat\pi_h\}_{h=1}^H$.

In Line 19 we denote
$$L(\xi, \mathcal{D}^h_{\mathrm{adv}}, \{o_{h,n}\}_{n=1}^N) := \sum_{n=1}^N \log\Big( o_{h,n}\, \sigma\big(\langle \xi, \phi_h(s_{h,n}, a_{h,n,1}) - \phi_h(s_{h,n}, a_{h,n,0}) \rangle\big) + (1 - o_{h,n})\, \sigma\big(\langle \xi, \phi_h(s_{h,n}, a_{h,n,0}) - \phi_h(s_{h,n}, a_{h,n,1}) \rangle\big) \Big),$$
where $\mathcal{D}^h_{\mathrm{adv}} = \{(s_{h,n}, a_{h,n,0}, a_{h,n,1})\}_{n=1}^N$.

5.1 ANALYSIS

Theorem 3. Let $\lambda \ge 4R^2$ and $N \ge \tilde{O}\big(\lambda \kappa_{\mathrm{adv}}^2 B^2 R^2 H^2 d^2 \log(1/\delta)/\epsilon^2\big)$, where $\kappa_{\mathrm{adv}} = \sup_{|x| \le B_{\mathrm{adv}}} |1/\sigma'(x)|$, in REGIME-action. Then under Assumption 4, with probability at least $1 - \delta$, we have $V^{r^\star, \hat\pi} \ge V^{r^\star, \star} - \epsilon$.

Theorem 3 demonstrates that for the action-based comparison model, the amount of required human feedback scales with $\kappa_{\mathrm{adv}}$ instead of $\kappa$. This implies that when $\sigma$ is the commonly used sigmoid function, the sample complexity is exponential in $B_{\mathrm{adv}}$ rather than in $r_{\max}$. Crucially, $B_{\mathrm{adv}}$ is always less than or equal to $r_{\max}$, and as mentioned earlier, $B_{\mathrm{adv}}$ can be $o(H)$ even in dense-reward settings where $r_{\max} = \Theta(H)$. Consequently, we achieve superior sample complexity compared to the trajectory-based comparison setting.

6 CONCLUSION

We consider the problem of how to query human feedback efficiently in PbRL, i.e., the experimental design problem in PbRL.
In particular, we design a reward-agnostic trajectory collection algorithm for human feedback querying when the transition dynamics is unknown. Our algorithm provably requires less human feedback to learn the true reward and optimal policy than existing literature. Our results also go beyond the tabular cases and cover common MDPs models including linear MDPs and low-rank MDPs. Further, we consider the action-based comparison setting and propose corresponding algorithms to circumvent the exponential scaling with rmax of trajectory-based comparison setting. Published as a conference paper at ICLR 2024 Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. (2019). Reinforcement learning: Theory and algorithms. Technical report. Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. (2020a). Flambe: Structural complexity and representation learning of low rank mdps. ar Xiv preprint ar Xiv:2006.10814. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2020b). Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64 66. PMLR. Agarwal, A., Song, Y., Sun, W., Wang, K., Wang, M., and Zhang, X. (2022). Provable benefits of representational transfer in reinforcement learning. ar Xiv preprint ar Xiv:2205.14571. Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345. Brown, D., Goo, W., Nagarajan, P., and Niekum, S. (2019). Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pages 783 792. PMLR. Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. (2022). Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563 2578. Chen, J. and Jiang, N. (2019). Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042 1051. PMLR. Chen, J., Modi, A., Krishnamurthy, A., Jiang, N., and Agarwal, A. (2022a). On the statistical efficiency of reward-free exploration in non-linear rl. ar Xiv preprint ar Xiv:2206.10770. Chen, X., Zhong, H., Yang, Z., Wang, Z., and Wang, L. (2022b). Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, pages 3773 3793. PMLR. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30. Du, S. S., Luo, Y., Wang, R., and Zhang, H. (2019). Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8058 8068. PMLR. Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. (2015). Contextual dueling bandits. In Conference on Learning Theory, pages 563 587. PMLR. Glaese, A., Mc Aleese, N., Tr ebacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. ar Xiv preprint ar Xiv:2209.14375. Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. (2020a). Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870 4879. PMLR. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2019). 
Provably efficient reinforcement learning with linear function approximation. ar Xiv preprint ar Xiv:1907.05388. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020b). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137 2143. PMLR. Laskey, M., Staszak, S., Hsieh, W. Y.-S., Mahler, J., Pokorny, F. T., Dragan, A. D., and Goldberg, K. (2016). Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 462 469. Published as a conference paper at ICLR 2024 Lee, K., Smith, L., and Abbeel, P. (2021). Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. ar Xiv preprint ar Xiv:2106.05091. Liu, H., Sferrazza, C., and Abbeel, P. (2023). Languages are rewards: Hindsight finetuning using human feedback. ar Xiv preprint ar Xiv:2302.02676. Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775 2785. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. ar Xiv preprint ar Xiv:2112.09332. Novoseller, E., Wei, Y., Sui, Y., Yue, Y., and Burdick, J. (2020). Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 1029 1038. PMLR. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744. Pacchiano, A., Parker-Holder, J., Tang, Y., Choromanski, K., Choromanska, A., and Jordan, M. (2020). Learning to score behaviors for guided policy optimization. In International Conference on Machine Learning, pages 7445 7454. PMLR. Pacchiano, A., Saha, A., and Lee, J. (2021). Dueling rl: reinforcement learning with trajectory preferences. ar Xiv preprint ar Xiv:2111.04850. Parker-Holder, J., Pacchiano, A., Choromanski, K. M., and Roberts, S. J. (2020). Effective diversity in population based reinforcement learning. Advances in Neural Information Processing Systems, 33:18050 18062. Qiu, S., Ye, J., Wang, Z., and Yang, Z. (2021). On reward-free rl with kernel and neural function approximations: Single-agent mdp and markov game. In International Conference on Machine Learning, pages 8737 8747. PMLR. Ramamurthy, R., Ammanabrolu, P., Brantley, K., Hessel, J., Sifa, R., Bauckhage, C., Hajishirzi, H., and Choi, Y. (2022). Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. ar Xiv preprint ar Xiv:2210.01241. Ross, S., Gordon, G., and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627 635. Ross, S., Melik-Barkhudarov, N., Shankar, K. S., Wendel, A., Dey, D., Bagnell, J. A., and Hebert, M. (2013). Learning monocular reactive uav control in cluttered natural environments. 
In 2013 IEEE international conference on robotics and automation, pages 1765 1772. IEEE. Shin, D., Dragan, A. D., and Brown, D. S. (2023). Benchmarks and algorithms for offline preferencebased reward learning. ar Xiv preprint ar Xiv:2301.01392. Sodhani, S., Meier, F., Pineau, J., and Zhang, A. (2022). Block contextual mdps for continual learning. In Learning for Dynamics and Control Conference, pages 608 623. PMLR. Sodhani, S., Zhang, A., and Pineau, J. (2021). Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pages 9767 9779. PMLR. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008 3021. Published as a conference paper at ICLR 2024 Wang, R., Du, S. S., Yang, L., and Salakhutdinov, R. R. (2020). On reward-free reinforcement learning with linear function approximation. Advances in neural information processing systems, 33:17816 17826. Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al. (2017). A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1 46. Wu, J., Ouyang, L., Ziegler, D. M., Stiennon, N., Lowe, R., Leike, J., and Christiano, P. (2021). Recursively summarizing books with human feedback. ar Xiv preprint ar Xiv:2109.10862. Xu, Y., Parker-Holder, J., Pacchiano, A., Ball, P., Rybkin, O., Roberts, S., Rocktäschel, T., and Grefenstette, E. (2022). Learning general world models in a handful of reward-free deployments. Advances in Neural Information Processing Systems, 35:26820 26838. Xu, Y., Wang, R., Yang, L., Singh, A., and Dubrawski, A. (2020). Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33:18784 18794. Xue, W., Cai, Q., Xue, Z., Sun, S., Liu, S., Zheng, D., Jiang, P., and An, B. (2022). Prefrec: Preference-based recommender systems for reinforcing long-term user engagement. ar Xiv preprint ar Xiv:2212.02779. Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. (2012). The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538 1556. Zanette, A., Lazaric, A., Kochenderfer, M. J., and Brunskill, E. (2020). Provably efficient rewardagnostic navigation with linear value iteration. Advances in Neural Information Processing Systems, 33:11756 11766. Zhang, A., Lyle, C., Sodhani, S., Filos, A., Kwiatkowska, M., Pineau, J., Gal, Y., and Precup, D. (2020a). Invariant causal prediction for block mdps. In International Conference on Machine Learning, pages 11214 11224. PMLR. Zhang, A., Mc Allister, R., Calandra, R., Gal, Y., and Levine, S. (2020b). Learning invariant representations for reinforcement learning without reconstruction. ar Xiv preprint ar Xiv:2006.10742. Zhu, B., Jiao, J., and Jordan, M. I. (2023). Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. ar Xiv preprint ar Xiv:2301.11270. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. ar Xiv preprint ar Xiv:1909.08593. Zoghi, M., Whiteson, S. A., De Rijke, M., and Munos, R. (2014). Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 73 82. 
A DISCUSSION

Validity of Linear Parametrization. In this work we consider linear reward parametrization or, more generally, linear trajectory embeddings. Such assumptions are commonly used in theoretical works on PbRL (Pacchiano et al., 2021; Zhu et al., 2023), and relevant examples can be borrowed from practical works (Pacchiano et al., 2020; Parker-Holder et al., 2020) such as the behavior-guided class of policy optimization algorithms. We admit that our analysis cannot cover all kinds of reward parametrization, but we think linear function approximation is a reasonable starting point for theoretical analysis.

Applicability of REGIME. Our results are not restricted to linear MDPs and low-rank MDPs. Our main result (Theorem 1) only requires a suitable reward-free oracle for dynamics learning, and in practice there exist oracles ready for use even when the MDPs are very complicated (Xu et al., 2022). This implies that by plugging in these general reward-free oracles, we can deal with general MDPs as well.

Relationship to Active Learning. The trajectory-collection process in Step 1 of REGIME uses an idea similar to active learning. In active learning, one chooses the trajectory pair to query for human feedback that maximizes the information gain in each iteration. Similarly, in our algorithm, the estimated covariance matrix $\hat\Sigma_n$ can be regarded as the information accumulated up to the $n$-th iteration. We then choose a pair of policies $(\pi^{n,0}, \pi^{n,1})$ that approximately maximizes the information gain (line 5; it is approximate because we use the approximate dynamics $\hat{P}$ for evaluation).

Implementation of Algorithm 1. The majority of the computational cost of Algorithm 1 lies in line 5. To implement the algorithm, gradient ascent can be applied to solve this optimization problem. More specifically, assume that $\pi^0$ and $\pi^1$ are parameterized by $\nu^0$ and $\nu^1$ (for example, they might be parametrized by a neural network). Then the gradient of the objective $f(\nu^0, \nu^1) := \|\hat\phi(\pi^0) - \hat\phi(\pi^1)\|^2_{\hat\Sigma_n^{-1}}$ (the squared objective, which has the same maximizers) with respect to $\nu^0$ (the gradient with respect to $\nu^1$ can be computed similarly) is
$$\frac{\partial f}{\partial \nu^0} = 2 \Big(\frac{\partial \hat\phi(\pi^0)}{\partial \nu^0}\Big)^\top \hat\Sigma_n^{-1}\big(\hat\phi(\pi^0) - \hat\phi(\pi^1)\big).$$
Here $\hat\phi(\pi^0)$ and $\hat\phi(\pi^1)$ can be estimated efficiently by simulating $\pi^0$ and $\pi^1$ in $\hat{P}$. For $\frac{\partial \hat\phi(\pi^0)}{\partial \nu^0}$, we show in Section 4 that each coordinate of $\hat\phi(\pi^0)$ is in fact a value function with respect to a certain reward function under $\hat{P}$. Therefore we can apply techniques from the policy gradient literature, such as REINFORCE, to estimate $\frac{\partial \hat\phi(\pi^0)}{\partial \nu^0}$ efficiently.

Technical Novelty. The novelty of our algorithm mainly lies in the reward-agnostic trajectory-collection procedure (lines 4-11 in Algorithm 1). For the other steps in the framework, we follow the common practice in empirical works, so they are the same as in the practical work. Our reward-agnostic trajectory-collection procedure takes inspiration from existing works on reward-free exploration and online PbRL. However, it is not a simple combination of existing algorithms. More specifically, our analysis faces the following challenges that are not solved in the literature:

First, almost all of the reward-free exploration literature focuses on learning the transition dynamics of the MDP. Our problem is completely different because our main task is to learn the reward function from human preferences. Consequently, our iterative collection procedure (lines 4-11 in Algorithm 1) also differs from the reward-free exploration literature. For example, Wang et al.
(2020) is the closest work to ours and studies the reward-free exploration in linear MDPs. They utilize a variant of UCB-LSVI, which is quite different from our collection procedure. Correspondingly, we can not directly use the existing analysis either and need to come up with new proofs. Second, our framework is also different from the existing online Pb RL (Pacchiano et al., 2021; Chen et al., 2022b). In their algorithms, the reward-learning process and exploration phase are not decoupled and the analysis heavily relies on the utilization of optimism. In contrast, our algorithm does not use optimism and thus their analysis techniques also do not apply here. Published as a conference paper at ICLR 2024 Algorithm 4 REGIME-exploration Input: The number of total episodes K, bonus parameter βex and regularization parameter λex. for k = 1, , K do Initialize: Qk H+1( , ) 0, V k H+1( ) 0. for h = H, , 1 do Compute the covariance matrix: Λk h Pk 1 i=1 φh(si h, ai h)φh(si h, ai h) + λex I. Compute the bonus and reward: bk h( , ) min βex φh( , ) (Λk h) 1, H h + 1 and rk h = bk h/H. Compute Q function: Qk h( , ) Clip[0,H h+1] Clip[0,H h+1]((wk h) φh( , ) + rk h( , )) + bk h( , ) , where wk h = (Λk h) 1 Pk 1 i=1 φh(si h, ai h) V k h+1(si h+1). Compute value function and policy: V k h ( ) max a A Qk h( , a), πk h( ) arg max a A Qk h( , a). end for Collect a trajectory τ k = (sk h, ak h, sk h+1)H h=1 by running πk = {πk h}H h=1 and add τ k into Dex. end for Sample K states from the initial states {si,in 1 }K i=1 and add them to Din. Return Dex, Din. Third, our analysis of REGIME-lin differs from the one in the literature on linear MDPs. This is because while the style of our algorithm is akin to multi-policy evaluation in linear MDPs, the current literature of linear MDPs mainly focuses on Q-learning-style algorithms for policy optimization (Jin et al., 2019; Wang et al., 2020). More specifically, in REGIME-lin, we want to evaluate all the policies in a policy class under a certain reward function. This brings a new challenge that does not occur in Jin et al. (2019); Wang et al. (2020): our learning difficulty will scale with the size of the policy class . This forces us to find and analyze an appropriate policy class that is able to approximate the optimal policy while retaining a relatively small size. In addition, a key component of our analysis is the new concentration inequality in Lemma 13, which also differs from the existing concentration lemma in Jin et al. (2019); Wang et al. (2020) and may be of independent interest. B OMIT DETAILS IN SECTION 4 In this section we present the details of Algorithm 4 and Algorithm 5. Here Clip[a,b](x) means min{max{a, x}, b}. In particular, when estimating (φ(π))h,j, we use the reward function rh,j h (s, a) = φh (s, a) θh,j h for all h [H] (up to an R factor) where R ej, if h = h, 0, otherwise. Here ej is the one-hot vector whose j-th entry is 1. For simplicity, we denote b V rh,j,π, b Qrh,j,π, wrh,j,π by V h,j,π, Qh,j,π, wh,j,π and let the estimation (bφ(π))h,j be Rb V π(rh,j). Then we have the following formal theorem characterizing the sample complexity of Algorithm 2: Theorem 4. 
Let λex = λpl = R2, βex = Cβd HR p log(d KHR/δ), βpl = Cβd HR p log(d KHRNΠ(ϵ )/δ) λ 4HR2, K e O H8B2R4d4 log(NΠ(ϵ )/δ) , N e O λκ2B2R2H4d2 log(1/δ) Published as a conference paper at ICLR 2024 Algorithm 5 REGIME-planning Input: Dataset Dex = {(si h, ai h, si h+1)}K,H i=1,h=1, Din = {si,in 1 }K i=1, bonus parameter βpl and regularization parameter λpl, policy π, reward function (rh)H h=1. for h = H, , 1 do Compute the covariance matrix: Λh PK i=1 φh (si h , ai h )φh (si h , ai h ) + λpl I. Compute the bonus: bh ( , ) min βpl φh ( , ) (Λh ) 1, 2(H h + 1) . end for Initialize: b Qr,π H+1( , ) 0, b V r,π H+1( ) 0. for h = H, , 1 do Compute Q function: b Qr,π h ( , ) Clip[ (H h+1),H h+1] Clip[ (H h+1),H h+1]((wr,π h ) φh ( , )+rh ( , ))+bh ( , ) , where wr,π h = (Λh ) 1 PK i=1 φh (si h , ai h ) b V r,π h +1(si h +1). Compute value function: b V r,π h ( ) Ea πh b Qr,π h ( , a). end for Compute b V π(r) 1 K PK i=1 b V r,π 1 (si,in 1 ). Return b V π(r). where ϵ = ϵ 72BR2H d HKH 1 , Cβ > 0 is a universal constant and κ = 2 + exp(2rmax) + exp( 2rmax). Then under Assumption 1 and 3, with probability at least 1 δ, we have V r ,ˆπ V r , ϵ. The proof is deferred to Appendix E.1. B.1 LOG-LINEAR POLICY CLASS The sample complexity in Theorem 2 depends on the covering number of the policy class Π. Therefore we want to find a policy class for linear MDPs that is rich enough (i.e., contains near-global-optimal policies) while retains a small covering number at the same time. Indeed, the log-linear policy class (Agarwal et al., 2020b) satisfies this requirement, which is defined as follows: Π = n π : πζ h(a|s) = exp(ζ h φh(s, a)) P a A exp(ζ h φh(s, a )), ζh B(d, W), s S, a A, h [H] o Here B(d, W) is the d-dimensional ball centered at the origin with radius W. The following proposition characterizes the covering number of such log-linear policy class: Proposition 1. Let Π be the log-linear policy class. Then under Assumption 1, for any ϵ 1, we have log NΠ(ϵ) Hd log 12W R Meanwhile, we can quantify the bias of such log-linear policy class as follows: Proposition 2. Let W = B+(H+ϵ) d H log |A| ϵ , then under Assumption 1 and 3, we have V r ,πg max π Π V r ,π ϵ, where πg is the global optimal policy. Combining Theorem 4, Proposition 1 and Proposition 2, we know that the returned policy bπ by Algorithm 2 with log-linear policy classes can indeed compete against the global optimal policy with the following sample complexities: Ntra = K + N = e O d5 + κ2d2 , Nhum = e O κ2d2 Published as a conference paper at ICLR 2024 C PROOF OF THEOREM 1 WITH KNOWN TRANSITIONS In this section, we consider the proof of Theorem 1 when transitions are known, i.e., ϵ = 0 and b P = P . In this case we have bφ(π) = φ(π). We will deal with the unknown transition in Appendix D.1. First, note that from the definition of bπ, we have V br,bπ V br,π , where π is the optimal policy with respect to the ground-truth reward r , i.e., π = arg maxπ Π V r ,π. Therefore we can expand the suboptimality as follows: V r , V r ,bπ = (V r , V br,π ) + (V br,π V br,bπ) + (V br,bπ V r ,bπ) (V r , V br,π ) + (V br,bπ V r ,bπ) = Eτ (π ,P )[ φ(τ), θ bθ ] Eτ (bπ,P )[ φ(τ), θ bθ ] = φ(π ) φ(bπ), θ bθ φ(π ) φ(bπ) Σ 1 N+1 θ bθ ΣN+1, (3) where Σn := λI + Pn 1 i=1 (φ(πi,0) φ(πi,1))(φ(πi,0) φ(πi,1)) for all n [N + 1]. Here the third step is due to the definition of value function and the last step comes from Cauchy-Schwartz inequality. Next we will bound φ(π ) φ(bπ) Σ 1 N+1 and θ bθ ΣN+1 respectively. 
First for φ(π ) φ(bπ) Σ 1 N+1, notice that ΣN+1 Σn for all n [N + 1], which implies φ(π ) φ(bπ) Σ 1 N+1 1 n=1 φ(π ) φ(bπ) Σ 1 n 1 n=1 φ(πn,0) φ(πn,1) Σ 1 n n=1 φ(πn,0) φ(πn,1) 2 Σ 1 n , (4) where the second step comes from the definition of πn,0 and πn,1 and the last step is due to Cauchy Schwartz inequality. To bound the right hand side of (4), we utilize the following Elliptical Potential Lemma: Lemma 1 (Elliptical Potential Lemma). For any λ R2 x and d 1, consider a sequence of vectors {xn Rd}N n=1 where xn Rx for all n [N]. Let Σn = λI + Pn 1 i=1 xn(xn) , then we have n=1 xn 2 Σ 1 n 2d log 1 + N The proof is deferred to Appendix C.1. Since we have λ 4HR2, by Lemma 1 we know v u u t n=1 φ(πn,0) φ(πn,1) 2 Σ 1 n p 2Hd N log(1 + N/(Hd)). Combining the above inequality with (4), we have φ(π ) φ(bπ) Σ 1 N+1 2Hd log(1 + N/(Hd)) For θ bθ ΣN+1, first note that bθ is the MLE estimator. Let eΣn denote the empirical cumulative covariance matrix λI + Pn 1 i=1 (φ(τ i,0) φ(τ i,1))(φ(τ i,0) φ(τ i,1)) , then from the literature (Zhu et al., 2023), we know that MLE has the following guarantee: Published as a conference paper at ICLR 2024 Lemma 2 (MLE guarantee). For any λ > 0 and δ (0, 1), with probability at least 1 δ, we have ˆθ θ eΣN+1 CMLE p κ2(Hd + log(1/δ)) + λHB2, (6) where κ = 2 + exp(2rmax) + exp( 2rmax) and CMLE > 0 is a universal constant. The proof is deferred to Appendix C.2. With Lemma 2, to θ bθ ΣN+1 we only need to show eΣN+1 is close to ΣN+1. This can be achieved by the following concentration result from the literature: Lemma 3 (Pacchiano et al. (2021)[Lemma 7]). For any λ > 0 and δ (0, 1), with probability at least 1 δ, we have θ ˆθ 2 ΣN+1 2 θ ˆθ 2 eΣN+1 + CCONH3d R2B2 log(N/δ), (7) where CCON > 0 is a universal constant. Therefore, combining (7) and (6), by union bound with probability at least 1 δ, we have that θ bθ ΣN+1 C1 κBR p λH3d log(N/δ), (8) where C1 is a universal constant. Thus substituting (5) and (8) into (3), we have V (r ) V (r , ˆπ) ϵ with probability at least 1 δ as long as N e O λκ2B2R2H4d2 log(1/δ) C.1 PROOF OF LEMMA 1 Note that when λ R2 x, we have xn Σ 1 n 1 for all n [N], which implies that for all n [N], we have xn 2 Σ 1 n log 1 + xn 2 Σ 1 n On the other hand, let wn denote xn Σ 1 n , then we know for any n [N 1] log det Σn+1 = det(Σn + xn(xn) ) = log det(Σ1/2 n (I + Σ 1/2 n xn(xn) Σ 1/2 n )Σ1/2 n ) = log det(Σn) + log det(I + (Σ 1/2 n xn)(Σ 1/2 n xn) ) = log det(Σn) + log det(I + (Σ 1/2 n xn) (Σ 1/2 n xn)) = log det(Σn) + log 1 + xn 2 Σ 1 n where the fourth step is due to the property of determinants. Therefore we have n=1 log 1 + xn 2 Σ 1 n = log det ΣN+1 log det Σ1 = log(det ΣN+1/ det Σ1) = log det I + 1 n=1 xn(xn) . Now let {λi}d i=1 denote the eigenvalues of PN n=1 xn(xn) , then we know log det I + 1 n=1 xn(xn) = log d Y i=1 (1 + λi/λ) i=1 (1 + λi/λ) d log 1 + NR2 x dλ d log 1 + N where the third step comes from Pd i=1 λi = Tr PN n=1 xn(xn) = PN n=1 xn 2 NR2 x and the last step is due to the fact that λ R2 x. This concludes our proof. Published as a conference paper at ICLR 2024 C.2 PROOF OF LEMMA 2 First note that we have the following lemma from literature: Lemma 4 ( (Zhu et al., 2023)[Lemma 3.1]). For any λ > 0, with probability at least 1 δ, we have bθ θ D+λ I O r κ2(Hd + log(1/δ)) N + λ HB2 , where D = 1 N PN i=1(φ(τ i,0) φ(τ i,1))(φ(τ i,0) φ(τ i,1)) . Therefore let λ = λ N and from the above lemma we can obtain bθ θ e ΣN+1 κ2(Hd + log(1/δ)) which is equivalent to bθ θ eΣN+1 O p κ2(Hd + log(1/δ)) + λHB2 . This concludes our proof. 
D PROOFS IN SECTION 3

D.1 PROOF OF THEOREM 1

Note that from the proof of Theorem 1 with known transition dynamics, we have
$$V^{r^*,*} - V^{r^*,\hat\pi} \leq \langle\phi(\pi^*) - \phi(\hat\pi), \theta^* - \hat\theta\rangle + (V^{\hat r,\pi^*} - V^{\hat r,\hat\pi}). \quad (9)$$
We can then further decompose
$$V^{r^*,*} - V^{r^*,\hat\pi} \leq \langle\phi(\pi^*) - \widehat\phi(\pi^*), \theta^* - \hat\theta\rangle + \langle\widehat\phi(\hat\pi) - \phi(\hat\pi), \theta^* - \hat\theta\rangle + \langle\widehat\phi(\pi^*) - \widehat\phi(\hat\pi), \theta^* - \hat\theta\rangle + (V^{\hat r,\pi^*} - V^{\hat r,\hat\pi}). \quad (10)$$
Now we only need to bound the terms on the right-hand side of (10). For the first and second terms, we utilize the following lemma:

Lemma 5. Let $d^\pi_h(s,a)$ and $\widehat d^\pi_h(s,a)$ denote the visitation measures of policy $\pi$ under $P^*$ and $\widehat P$ respectively. Then with probability at least $1-\delta/4$, we have for all $h\in[H]$ and $\pi\in\Pi$,
$$\|d^\pi_h - \widehat d^\pi_h\|_1 \leq h\epsilon'. \quad (11)$$
Let $\mathcal{E}_1$ denote the event that (11) holds. Under $\mathcal{E}_1$, we further have the following lemma:

Lemma 6. Under event $\mathcal{E}_1$, for all policies $\pi\in\Pi$ and all vectors $v = [v_1,\dots,v_H]$ with $v_h\in\mathbb{R}^d$ and $\|v_h\| \leq 2B$ for all $h\in[H]$, we have
$$|\langle\phi(\pi) - \widehat\phi(\pi), v\rangle| \leq BRH^2\epsilon'.$$
Substituting Lemma 6 into (10), we have
$$V^{r^*,*} - V^{r^*,\hat\pi} \leq \langle\widehat\phi(\pi^*) - \widehat\phi(\hat\pi), \theta^* - \hat\theta\rangle + 2BRH^2\epsilon' + (V^{\hat r,\pi^*} - V^{\hat r,\hat\pi}).$$
Then by the Cauchy-Schwarz inequality, under event $\mathcal{E}_1$ we have
$$V^{r^*,*} - V^{r^*,\hat\pi} \leq \|\widehat\phi(\pi^*) - \widehat\phi(\hat\pi)\|_{\widehat\Sigma_{N+1}^{-1}}\,\|\theta^* - \hat\theta\|_{\widehat\Sigma_{N+1}} + 2BRH^2\epsilon' + (V^{\hat r,\pi^*} - V^{\hat r,\hat\pi}). \quad (12)$$
Following the same analysis as in the proof of Theorem 1 with known transitions, we know
$$\|\widehat\phi(\pi^*) - \widehat\phi(\hat\pi)\|_{\widehat\Sigma_{N+1}^{-1}} \leq \sqrt{\frac{2Hd\log(1 + N/(Hd))}{N}}. \quad (13)$$
Now we only need to bound $\|\theta^* - \hat\theta\|_{\widehat\Sigma_{N+1}}$. As in the proof of Theorem 1 with known transitions, we use $\Sigma_n$ and $\widetilde\Sigma_n$ to denote $\lambda I + \sum_{i=1}^{n-1}(\phi(\pi^{i,0}) - \phi(\pi^{i,1}))(\phi(\pi^{i,0}) - \phi(\pi^{i,1}))^\top$ and $\lambda I + \sum_{i=1}^{n-1}(\phi(\tau^{i,0}) - \phi(\tau^{i,1}))(\phi(\tau^{i,0}) - \phi(\tau^{i,1}))^\top$ respectively. Then under event $\mathcal{E}_1$, we have the following connection between $\widehat\Sigma_{N+1}$ and $\Sigma_{N+1}$:

Lemma 7. Under event $\mathcal{E}_1$, we have
$$\|\theta^* - \hat\theta\|_{\widehat\Sigma_{N+1}} \leq \sqrt{2}\,\|\theta^* - \hat\theta\|_{\Sigma_{N+1}} + 2\sqrt{2N}\,BRH^2\epsilon'.$$
Combining Lemma 7 with Lemma 2 and Lemma 3, under event $\mathcal{E}_1\cap\mathcal{E}_2$ we have
$$\|\theta^* - \hat\theta\|_{\widehat\Sigma_{N+1}} \leq \sqrt{2}\,\|\theta^* - \hat\theta\|_{\Sigma_{N+1}} + 2\sqrt{2N}\,BRH^2\epsilon' \leq C_2\kappa BR\sqrt{\lambda H^3d\log(N/\delta)} + 2\sqrt{2N}\,BRH^2\epsilon', \quad (14)$$
where $\Pr(\mathcal{E}_2) \geq 1-\delta/2$ and $C_2 > 0$ is a universal constant. Now we only need to bound $(V^{\hat r,\pi^*} - V^{\hat r,\hat\pi})$, which can be achieved with Lemma 6:
$$V^{\hat r,\pi^*} - V^{\hat r,\hat\pi} = \langle\phi(\pi^*), \hat\theta\rangle - \langle\phi(\hat\pi), \hat\theta\rangle = \langle\phi(\pi^*) - \widehat\phi(\pi^*), \hat\theta\rangle + \langle\widehat\phi(\pi^*) - \widehat\phi(\hat\pi), \hat\theta\rangle + \langle\widehat\phi(\hat\pi) - \phi(\hat\pi), \hat\theta\rangle \leq 2BRH^2\epsilon', \quad (15)$$
where the last step comes from Lemma 6 and the definition of $\hat\pi$, which implies $\langle\widehat\phi(\pi^*) - \widehat\phi(\hat\pi), \hat\theta\rangle \leq 0$. Combining (12), (13), (14) and (15), we have $V^{r^*,*} - V^{r^*,\hat\pi} \leq \epsilon$ with probability at least $1-\delta$ as long as
$$\epsilon' \leq \frac{\epsilon}{6BRH^2}, \qquad N \geq \widetilde{O}\Big(\frac{\lambda\kappa^2B^2R^2H^4d^2\log(1/\delta)}{\epsilon^2}\Big).$$

D.2 PROOF OF LEMMA 5

First notice that $d^\pi_h(s,a) = d^\pi_h(s)\pi_h(a|s)$ and $\widehat d^\pi_h(s,a) = \widehat d^\pi_h(s)\pi_h(a|s)$, which implies that for all $h\in[H]$,
$$\|d^\pi_h - \widehat d^\pi_h\|_1 = \sum_{s,a}\big|d^\pi_h(s,a) - \widehat d^\pi_h(s,a)\big| = \sum_{s,a}\big|d^\pi_h(s) - \widehat d^\pi_h(s)\big|\pi_h(a|s) = \sum_s\big|d^\pi_h(s) - \widehat d^\pi_h(s)\big|\sum_a\pi_h(a|s) = \sum_s\big|d^\pi_h(s) - \widehat d^\pi_h(s)\big|.$$
Therefore we only need to prove $\sum_s|d^\pi_h(s) - \widehat d^\pi_h(s)| \leq h\epsilon'$ for all $h\in[H]$, which we do by induction. For the base case, we have $\sum_s|d^\pi_1(s) - \widehat d^\pi_1(s)| = \sum_s|P^*_1(s) - \widehat P_1(s)| \leq \epsilon'$ according to the guarantee of the reward-free learning oracle $\mathcal{P}$. Now assume that $\sum_s|d^\pi_{h'}(s) - \widehat d^\pi_{h'}(s)| \leq h'\epsilon'$ for all $h'\leq h$ where $h\in[H-1]$. Then we have
$$\sum_s\big|d^\pi_{h+1}(s) - \widehat d^\pi_{h+1}(s)\big| = \sum_s\Big|\sum_{s',a'}\Big(\widehat d^\pi_h(s')\pi_h(a'|s')\widehat P_h(s|s',a') - d^\pi_h(s')\pi_h(a'|s')P^*_h(s|s',a')\Big)\Big|$$
$$\leq \sum_{s,s',a'}\big|\widehat d^\pi_h(s') - d^\pi_h(s')\big|\pi_h(a'|s')\widehat P_h(s|s',a') + \sum_{s,s',a'}d^\pi_h(s')\pi_h(a'|s')\big|\widehat P_h(s|s',a') - P^*_h(s|s',a')\big|$$
$$\leq \sum_{s'}\big|\widehat d^\pi_h(s') - d^\pi_h(s')\big| + \mathbb{E}_{\pi,P^*}\big[\|\widehat P_h(\cdot|s_h,a_h) - P^*_h(\cdot|s_h,a_h)\|_1\big] \leq (h+1)\epsilon',$$
where the second step comes from the triangle inequality and the last step is due to the induction hypothesis and the guarantee of $\mathcal{P}$. Therefore $\sum_s|d^\pi_{h+1}(s) - \widehat d^\pi_{h+1}(s)| \leq (h+1)\epsilon'$, and by induction $\sum_s|d^\pi_h(s) - \widehat d^\pi_h(s)| \leq h\epsilon'$ for all $h\in[H]$, which concludes the proof.
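The induction in the proof of Lemma 5 can also be illustrated numerically. The sketch below builds a small tabular MDP, perturbs each transition row by at most $\epsilon'$ in $\ell_1$ distance (a pointwise stand-in for the oracle guarantee), and checks that the visitation measures drift apart by at most $h\epsilon'$ per step. All sizes and names are hypothetical.

```python
import numpy as np

# Tabular illustration of Lemma 5's induction: if every transition row satisfies
# || hatP_h(.|s,a) - P_h(.|s,a) ||_1 <= eps' and the initial distributions agree,
# then || d^pi_h - hat d^pi_h ||_1 <= h * eps'. Illustrative sizes only.

rng = np.random.default_rng(1)
S, A, H, eps = 6, 3, 8, 0.05

P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
Q = rng.random((S, A, S)); Q /= Q.sum(-1, keepdims=True)
P_hat = (1 - eps / 2) * P + (eps / 2) * Q        # each row moves by at most eps in L1
eps_row = np.abs(P_hat - P).sum(-1).max()        # realized per-row L1 error (<= eps)

pi = rng.random((H, S, A)); pi /= pi.sum(-1, keepdims=True)
d = d_hat = np.full(S, 1.0 / S)                  # d_1 = hat d_1 (shared initial dist.)

for h in range(1, H):
    d = np.einsum("sa,sat->t", d[:, None] * pi[h - 1], P)
    d_hat = np.einsum("sa,sat->t", d_hat[:, None] * pi[h - 1], P_hat)
    gap = np.abs(d - d_hat).sum()
    print(f"h={h + 1}: ||d - d_hat||_1 = {gap:.4f} <= {h * eps_row:.4f}")
    assert gap <= h * eps_row + 1e-9
```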
D.3 PROOF OF LEMMA 6

Note that from the definition of $\phi(\pi)$ we have
$$\langle\phi(\pi), v\rangle = \mathbb{E}_{\tau\sim(\pi,P^*)}\Big[\sum_{h=1}^H\phi_h(s_h,a_h)^\top v_h\Big] = \sum_{h=1}^H\sum_{s_h,a_h}d^\pi_h(s_h,a_h)\,\phi_h(s_h,a_h)^\top v_h.$$
Similarly, we have
$$\langle\widehat\phi(\pi), v\rangle = \sum_{h=1}^H\sum_{s_h,a_h}\widehat d^\pi_h(s_h,a_h)\,\phi_h(s_h,a_h)^\top v_h.$$
Therefore,
$$|\langle\phi(\pi) - \widehat\phi(\pi), v\rangle| \leq \sum_{h=1}^H\sum_{s_h,a_h}\big|\widehat d^\pi_h(s_h,a_h) - d^\pi_h(s_h,a_h)\big|\,\big|\phi_h(s_h,a_h)^\top v_h\big| \leq BR\sum_{h=1}^H\big\|\widehat d^\pi_h - d^\pi_h\big\|_1 \leq BR\sum_{h=1}^H h\epsilon' \leq BRH^2\epsilon',$$
where the first step is due to the triangle inequality and the third step comes from Lemma 5. This concludes the proof.

D.4 PROOF OF LEMMA 7

We write $\Delta\theta := \theta^* - \hat\theta$ in this proof. From Lemma 6, we know that for any policy $\pi$,
$$|\langle\phi(\pi) - \widehat\phi(\pi), \Delta\theta\rangle| \leq BRH^2\epsilon'.$$
By the triangle inequality, this implies that for any policies $\pi^0, \pi^1$,
$$|\langle\widehat\phi(\pi^0) - \widehat\phi(\pi^1), \Delta\theta\rangle| \leq |\langle\phi(\pi^0) - \phi(\pi^1), \Delta\theta\rangle| + 2BRH^2\epsilon'.$$
Therefore, for any policies $\pi^0, \pi^1$,
$$|\langle\widehat\phi(\pi^0) - \widehat\phi(\pi^1), \Delta\theta\rangle|^2 \leq 2|\langle\phi(\pi^0) - \phi(\pi^1), \Delta\theta\rangle|^2 + 8(BRH^2\epsilon')^2. \quad (16)$$
Note that from the definitions of $\widehat\Sigma_{N+1}$ and $\Sigma_{N+1}$, we have
$$\|\Delta\theta\|^2_{\widehat\Sigma_{N+1}} = \Delta\theta^\top\Big(\lambda I + \sum_{n=1}^N(\widehat\phi(\pi^{n,0}) - \widehat\phi(\pi^{n,1}))(\widehat\phi(\pi^{n,0}) - \widehat\phi(\pi^{n,1}))^\top\Big)\Delta\theta = \lambda\|\Delta\theta\|^2 + \sum_{n=1}^N|\langle\widehat\phi(\pi^{n,0}) - \widehat\phi(\pi^{n,1}), \Delta\theta\rangle|^2$$
$$\leq \lambda\|\Delta\theta\|^2 + 2\sum_{n=1}^N|\langle\phi(\pi^{n,0}) - \phi(\pi^{n,1}), \Delta\theta\rangle|^2 + 8N(BRH^2\epsilon')^2 \leq 2\|\Delta\theta\|^2_{\Sigma_{N+1}} + 8N(BRH^2\epsilon')^2,$$
where the third step comes from (16). This implies
$$\|\Delta\theta\|_{\widehat\Sigma_{N+1}} \leq \sqrt{2}\,\|\Delta\theta\|_{\Sigma_{N+1}} + 2\sqrt{2N}\,BRH^2\epsilon',$$
which concludes the proof.

E PROOFS IN SECTION 4 AND APPENDIX B

E.1 PROOF OF THEOREM 4

First note that Algorithm 4 provides us with the following guarantee:

Lemma 8. With probability at least $1-\delta/6$,
$$\mathbb{E}_{s_1\sim P^*_1(\cdot)}\big[V^{b/H,*}_1(s_1)\big] \leq C_{\mathrm{lin}}\sqrt{d^3H^4R^2\log(N_\Pi(\epsilon')dKHR/\delta)/K},$$
where $b_h$ is defined in Algorithm 5 and $C_{\mathrm{lin}} > 0$ is a universal constant. Here $V^{r,*}_1(s_1) := \max_{\pi\in\Pi}V^{r,\pi}_1(s_1)$.

Lemma 8 is adapted from Wang et al. (2020), Lemma 3.2, and we highlight the differences in its proof in Appendix E.2. We then consider an $\epsilon'$-cover of $\Pi$, denoted by $\mathcal{C}(\Pi,\epsilon')$. Following an analysis similar to Wang et al. (2020), Lemma 3.3, we have the following lemma:

Lemma 9. With probability at least $1-\delta/6$, for all $h\in[H]$, all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$ and all linear reward functions $r$ with $r_h\in[-1,1]$, we have
$$Q^{r,\pi}_h(\cdot,\cdot) \leq \widehat Q^{r,\pi}_h(\cdot,\cdot) \leq r_h(\cdot,\cdot) + \sum_{s'}P^*_h(s'|\cdot,\cdot)\widehat V^{r,\pi}_{h+1}(s') + 2b_h(\cdot,\cdot).$$
The proof of Lemma 9 is deferred to Appendix E.3. Denote the events in Lemma 8 and Lemma 9 by $\mathcal{E}_4$ and $\mathcal{E}_5$ respectively. Then under $\mathcal{E}_4\cap\mathcal{E}_5$, for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$ and all linear reward functions $r$ with $r_h\in[-1,1]$,
$$0 \leq \mathbb{E}_{s_1\sim P^*_1(\cdot)}\big[\widehat V^{r,\pi}_1(s_1) - V^{r,\pi}_1(s_1)\big] \leq 2\,\mathbb{E}_{s_1\sim P^*_1(\cdot)}\big[V^{b,\pi}_1(s_1)\big] \leq 2H\,\mathbb{E}_{s_1\sim P^*_1(\cdot)}\big[V^{b/H,*}_1(s_1)\big] \leq 2C_{\mathrm{lin}}\sqrt{\frac{d^3H^6R^2\log(dKHRN_\Pi(\epsilon')/\delta)}{K}} \leq \epsilon_0, \quad (17)$$
where $\epsilon_0 = \frac{\epsilon}{72BR\sqrt{Hd}}$. Here the first inequality comes from the left part of Lemma 9 and the second inequality is due to the right part of Lemma 9. Note that in the proof of Lemma 13 we compute the covering number of the function class $\{\widehat V^{r,\pi}_1 : r \text{ is linear and } r_h\in[-1,1]\}$ for any fixed $\pi$ in (24). Then by the Azuma-Hoeffding inequality and (24), with probability at least $1-\delta/6$ we have, for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$ and all linear reward functions $r$ with $r_h\in[-1,1]$,
$$\Big|\mathbb{E}_{s_1\sim P^*_1(\cdot)}\big[\widehat V^{r,\pi}_1(s_1)\big] - \frac{1}{K}\sum_{i=1}^K\widehat V^{r,\pi}_1(s^{i,\mathrm{in}}_1)\Big| \leq C_3H\sqrt{\frac{\log(N_\Pi(\epsilon')HKdR/\delta)}{K}} \leq \epsilon_0, \quad (18)$$
where $C_3 > 0$ is a universal constant. Combining (17) and (18), with probability at least $1-\delta/2$ we have, for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$ and all linear reward functions $r$ with $r_h\in[-1,1]$,
$$|\widehat V^\pi(r) - V^{r,\pi}| \leq 2\epsilon_0. \quad (19)$$
This implies that we can estimate the value of every $\pi\in\mathcal{C}(\Pi,\epsilon')$ under every linear reward function $r$ with $r_h\in[-1,1]$ up to estimation error $2\epsilon_0$. Now consider any policy $\pi\in\Pi$, and suppose that $\pi'\in\mathcal{C}(\Pi,\epsilon')$ satisfies
$$\max_{s\in\mathcal{S},h\in[H]}\|\pi_h(\cdot|s) - \pi'_h(\cdot|s)\|_1 \leq \epsilon'. \quad (20)$$
We then bound $|\widehat V^\pi(r) - \widehat V^{\pi'}(r)|$ and $|V^{r,\pi} - V^{r,\pi'}|$ for all linear reward functions $r$ with $r_h\in[-1,1]$ respectively.
For $|V^{r,\pi} - V^{r,\pi'}|$, note that we have the following performance difference lemma:

Lemma 10. For any policies $\pi', \pi$ and any reward function $r$, we have
$$V^{r,\pi'} - V^{r,\pi} = \sum_{h=1}^H\mathbb{E}_{\pi',P^*}\big[\langle Q^{r,\pi}_h(s_h,\cdot),\, \pi'_h(\cdot|s_h) - \pi_h(\cdot|s_h)\rangle\big].$$
The proof is deferred to Appendix E.4. Therefore, from Lemma 10 and (20) we have
$$|V^{r,\pi} - V^{r,\pi'}| \leq \sum_{h'=1}^H\mathbb{E}_{\pi,P^*}\big[\big|\langle Q^{r,\pi'}_{h'}(s_{h'},\cdot),\, \pi_{h'}(\cdot|s_{h'}) - \pi'_{h'}(\cdot|s_{h'})\rangle\big|\big] \leq \sum_{h'=1}^H\mathbb{E}_{\pi,P^*}\big[\|\pi_{h'}(\cdot|s_{h'}) - \pi'_{h'}(\cdot|s_{h'})\|_1\big] \leq H\epsilon'. \quad (21)$$
On the other hand, we have the following lemma to bound $|\widehat V^\pi(r) - \widehat V^{\pi'}(r)|$:

Lemma 11. Suppose (20) holds and $\widehat V^\pi(r), \widehat V^{\pi'}(r)$ are computed as in Algorithm 5. Then for all linear reward functions $r$ with $0 \leq r(\tau) \leq r_{\max}$, we have
$$|\widehat V^\pi(r) - \widehat V^{\pi'}(r)| \leq \epsilon_{\mathrm{cover}} := H\epsilon'\cdot\frac{1 - (\sqrt{dK})^H}{1 - \sqrt{dK}}. \quad (22)$$
The proof is deferred to Appendix E.5. Combining (19), (21) and (22), for all policies $\pi\in\Pi$ and all linear reward functions $r$ with $0 \leq r(\tau) \leq r_{\max}$ we have
$$|\widehat V^\pi(r) - V^{r,\pi}| \leq 2\epsilon_0 + H\epsilon' + \epsilon_{\mathrm{cover}}. \quad (23)$$
In particular, since $(\phi(\pi))_{h,j} = R\,V^{r_{h,j},\pi}$, we have for all $\pi\in\Pi$, $h\in[H]$ and $j\in[d]$,
$$|(\phi(\pi))_{h,j} - (\widehat\phi(\pi))_{h,j}| \leq 2R\epsilon_0 + HR\epsilon' + R\epsilon_{\mathrm{cover}}.$$
This implies that for all $\pi\in\Pi$ and any $v$ as in Lemma 6,
$$|\langle\phi(\pi) - \widehat\phi(\pi), v\rangle| \leq 2BH\sqrt{d}\,\big(2R\epsilon_0 + HR\epsilon' + R\epsilon_{\mathrm{cover}}\big).$$
The rest of the proof is the same as that of Theorem 1 and is thus omitted here. The only difference is that we need to show $\hat\pi$ is a near-optimal policy with respect to $\hat r$, which can be proved as follows: letting $\pi^*(\hat r) = \arg\max_{\pi\in\Pi}V^{\hat r,\pi}$,
$$V^{\hat r,*} - V^{\hat r,\hat\pi} = \big(V^{\hat r,\pi^*(\hat r)} - \widehat V^{\pi^*(\hat r)}(\hat r)\big) + \big(\widehat V^{\pi^*(\hat r)}(\hat r) - \widehat V^{\hat\pi}(\hat r)\big) + \big(\widehat V^{\hat\pi}(\hat r) - V^{\hat r,\hat\pi}\big) \leq 4\epsilon_0 + 2H\epsilon' + 2\epsilon_{\mathrm{cover}},$$
where the last step comes from (23) and the definition of $\hat\pi$, which makes the middle term non-positive.

E.2 PROOF OF LEMMA 8

Here we outline how the proof differs from Wang et al. (2020), Lemma 3.2. First, we also have the following concentration guarantee:

Lemma 12. Fix a policy $\pi$. Then with probability at least $1-\delta$, we have for all $h\in[H]$ and $k\in[K]$,
$$\Big\|\sum_{i=1}^{k-1}\phi^i_h\Big(V^k_{h+1}(s^i_{h+1}) - \sum_{s'\in\mathcal{S}}P^*_h(s'|s^i_h,a^i_h)V^k_{h+1}(s')\Big)\Big\|_{(\Lambda^k_h)^{-1}} \leq O\big(dHR\sqrt{\log(dKHR/\delta)}\big).$$
The proof is almost the same as that of Lemma 13 and is thus omitted here. Then, following the same arguments as in Wang et al. (2020), under the event of Lemma 12 we have
$$\Big|\phi_h(s,a)^\top w^k_h - \sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^k_{h+1}(s')\Big| \leq \beta_{\mathrm{ex}}\|\phi_h(s,a)\|_{(\Lambda^k_h)^{-1}}.$$
Note that $V^k_{h+1}(s)\in[0, H-h]$ for all $s\in\mathcal{S}$, which implies
$$\sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^k_{h+1}(s') + r^k_h(s,a) \leq H-h+1.$$
Since $\mathrm{Clip}$ is a contraction operator, this implies
$$\Big|\mathrm{Clip}_{[0,H-h+1]}\big((w^k_h)^\top\phi_h(s,a) + r^k_h(s,a)\big) - \Big(\sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^k_{h+1}(s') + r^k_h(s,a)\Big)\Big| \leq \Big|(w^k_h)^\top\phi_h(s,a) - \sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^k_{h+1}(s')\Big| \leq \beta_{\mathrm{ex}}\|\phi_h(s,a)\|_{(\Lambda^k_h)^{-1}}.$$
On the other hand,
$$\Big|\mathrm{Clip}_{[0,H-h+1]}\big((w^k_h)^\top\phi_h(s,a) + r^k_h(s,a)\big) - \Big(\sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^k_{h+1}(s') + r^k_h(s,a)\Big)\Big| \leq H-h+1.$$
This implies
$$\Big|\mathrm{Clip}_{[0,H-h+1]}\big((w^k_h)^\top\phi_h(s,a) + r^k_h(s,a)\big) - \Big(\sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^k_{h+1}(s') + r^k_h(s,a)\Big)\Big| \leq b^k_h(s,a).$$
The rest of the proof is the same as in Wang et al. (2020) and is thus omitted.

E.3 PROOF OF LEMMA 9

In the following discussion we use $\phi^i_h$ to denote $\phi_h(s^i_h, a^i_h)$. First we need the following concentration lemma, which is similar to Jin et al. (2020b), Lemma B.3:

Lemma 13. Fix a policy $\pi$. Then with probability at least $1-\delta$, we have for all $h\in[H]$ and all linear reward functions $r$ with $r_h\in[-1,1]$,
$$\Big\|\sum_{i=1}^K\phi^i_h\Big(\widehat V^{r,\pi}_{h+1}(s^i_{h+1}) - \sum_{s'\in\mathcal{S}}P^*_h(s'|s^i_h,a^i_h)\widehat V^{r,\pi}_{h+1}(s')\Big)\Big\|_{\Lambda_h^{-1}} \leq O\big(dHR\sqrt{\log(dKHR/\delta)}\big).$$
The proof is deferred to Appendix E.6. Then by a union bound, with probability at least $1-\delta/6$ we have, for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$, all $h\in[H]$ and all linear reward functions $r$ with $r_h\in[-1,1]$,
$$\Big\|\sum_{i=1}^K\phi^i_h\Big(\widehat V^{r,\pi}_{h+1}(s^i_{h+1}) - \sum_{s'\in\mathcal{S}}P^*_h(s'|s^i_h,a^i_h)\widehat V^{r,\pi}_{h+1}(s')\Big)\Big\|_{\Lambda_h^{-1}} \leq O\big(dHR\sqrt{\log(dKHRN_\Pi(\epsilon')/\delta)}\big).$$
Let $\mathcal{E}_6$ denote the event that the above inequality holds.
Then under $\mathcal{E}_6$, following the same analysis as in Wang et al. (2020), Lemma 3.1, we have for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$, all $(s,a)\in\mathcal{S}\times\mathcal{A}$, all $h\in[H]$ and all linear reward functions $r$ with $r_h\in[-1,1]$ that
$$\Big|\phi_h(s,a)^\top w^{r,\pi}_h - \sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)\widehat V^{r,\pi}_{h+1}(s')\Big| \leq \beta_{\mathrm{pl}}\|\phi_h(s,a)\|_{\Lambda_h^{-1}}.$$
From the contraction property of $\mathrm{Clip}$ and the fact that $\sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)\widehat V^{r,\pi}_{h+1}(s') + r_h(s,a)\in[-(H-h+1), H-h+1]$, we know
$$\Big|\mathrm{Clip}_{[-(H-h+1),H-h+1]}\big((w^{r,\pi}_h)^\top\phi_h(s,a) + r_h(s,a)\big) - \sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)\widehat V^{r,\pi}_{h+1}(s') - r_h(s,a)\Big| \leq b_h(s,a).$$
Therefore, under $\mathcal{E}_6$ we have
$$\widehat Q^{r,\pi}_h(s,a) \leq r_h(s,a) + \sum_{s'}P^*_h(s'|s,a)\widehat V^{r,\pi}_{h+1}(s') + 2b_h(s,a).$$
Now we only need to prove that under $\mathcal{E}_6$, for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$, all $(s,a)\in\mathcal{S}\times\mathcal{A}$, all $h\in[H]$ and all linear reward functions $r$ with $r_h\in[-1,1]$, we have $Q^{r,\pi}_h(s,a) \leq \widehat Q^{r,\pi}_h(s,a)$. We prove this by induction. The claim holds trivially for $h = H+1$. Now suppose that for some $h\in[H]$ we have $Q^{r,\pi}_{h+1}(s,a) \leq \widehat Q^{r,\pi}_{h+1}(s,a)$ for all policies $\pi\in\mathcal{C}(\Pi,\epsilon')$, all $(s,a)\in\mathcal{S}\times\mathcal{A}$ and all linear reward functions $r$ with $r_h\in[-1,1]$. Then we have
$$V^{r,\pi}_{h+1}(s) = \mathbb{E}_{a\sim\pi_{h+1}(\cdot|s)}\big[Q^{r,\pi}_{h+1}(s,a)\big] \leq \mathbb{E}_{a\sim\pi_{h+1}(\cdot|s)}\big[\widehat Q^{r,\pi}_{h+1}(s,a)\big] = \widehat V^{r,\pi}_{h+1}(s).$$
This implies
$$\mathrm{Clip}_{[-(H-h+1),H-h+1]}\big((w^{r,\pi}_h)^\top\phi_h(s,a) + r_h(s,a)\big) + b_h(s,a) \geq \sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)\widehat V^{r,\pi}_{h+1}(s') + r_h(s,a) \geq \sum_{s'\in\mathcal{S}}P^*_h(s'|s,a)V^{r,\pi}_{h+1}(s') + r_h(s,a) = Q^{r,\pi}_h(s,a).$$
On the other hand, $Q^{r,\pi}_h(s,a) \leq H-h+1$. Therefore $Q^{r,\pi}_h(s,a) \leq \widehat Q^{r,\pi}_h(s,a)$, and the lemma follows by induction.

E.4 PROOF OF LEMMA 10

For any two policies $\pi'$ and $\pi$, it follows from the definitions of $V^{r,\pi'}$ and $V^{r,\pi}$ that
$$V^{r,\pi'} - V^{r,\pi} = \mathbb{E}_{\pi',P^*}\big[r_1(s_1,a_1) + V^{r,\pi'}_2(s_2)\big] - \mathbb{E}_{\pi',P^*}\big[V^{r,\pi}_1(s_1)\big] = \mathbb{E}_{\pi',P^*}\big[V^{r,\pi'}_2(s_2) - \big(V^{r,\pi}_1(s_1) - r_1(s_1,a_1)\big)\big]$$
$$= \mathbb{E}_{\pi',P^*}\big[V^{r,\pi'}_2(s_2) - V^{r,\pi}_2(s_2)\big] + \mathbb{E}_{\pi',P^*}\big[Q^{r,\pi}_1(s_1,a_1) - V^{r,\pi}_1(s_1)\big] = \mathbb{E}_{\pi',P^*}\big[V^{r,\pi'}_2(s_2) - V^{r,\pi}_2(s_2)\big] + \mathbb{E}_{\pi',P^*}\big[\langle Q^{r,\pi}_1(s_1,\cdot),\, \pi'_1(\cdot|s_1) - \pi_1(\cdot|s_1)\rangle\big].$$
Repeating the same argument for steps $2,\dots,H$ yields
$$V^{r,\pi'} - V^{r,\pi} = \sum_{h=1}^H\mathbb{E}_{\pi',P^*}\big[\langle Q^{r,\pi}_h(s_h,\cdot),\, \pi'_h(\cdot|s_h) - \pi_h(\cdot|s_h)\rangle\big].$$
This concludes the proof.

E.5 PROOF OF LEMMA 11

For any $h'\in[H]$, suppose $\max_{s\in\mathcal{S}}|\widehat V^{r,\pi}_{h'+1}(s) - \widehat V^{r,\pi'}_{h'+1}(s)| \leq \epsilon_{h'+1}$. Then for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$, by the contraction property of $\mathrm{Clip}$ we have
$$|\widehat Q^{r,\pi}_{h'}(s,a) - \widehat Q^{r,\pi'}_{h'}(s,a)| \leq \big|(w^{r,\pi}_{h'} - w^{r,\pi'}_{h'})^\top\phi_{h'}(s,a)\big| \leq \sum_{i=1}^K\big|\phi_{h'}(s,a)^\top(\Lambda_{h'})^{-1}\phi_{h'}(s^i_{h'},a^i_{h'})\big|\,\epsilon_{h'+1}$$
$$\leq \sqrt{\Big(\sum_{i=1}^K\|\phi_{h'}(s,a)\|^2_{(\Lambda_{h'})^{-1}}\Big)\Big(\sum_{i=1}^K\|\phi_{h'}(s^i_{h'},a^i_{h'})\|^2_{(\Lambda_{h'})^{-1}}\Big)}\,\epsilon_{h'+1} \leq \sqrt{dK}\,\epsilon_{h'+1}.$$
Here the final step comes from the auxiliary Lemma 14 and the fact that $\Lambda_{h'}\succeq R^2I$, and thus $\sum_{i=1}^K\|\phi_{h'}(s,a)\|^2_{(\Lambda_{h'})^{-1}} \leq \sum_{i=1}^K 1 = K$. Therefore,
$$\epsilon_{h'} := \max_{s\in\mathcal{S}}\big|\widehat V^{r,\pi}_{h'}(s) - \widehat V^{r,\pi'}_{h'}(s)\big| \leq H\epsilon' + \sqrt{dK}\,\epsilon_{h'+1}.$$
Since $\epsilon_{H+1} = 0$, unrolling the recursion gives
$$\epsilon_1 \leq H\epsilon'\cdot\frac{1 - (\sqrt{dK})^H}{1 - \sqrt{dK}} = \epsilon_{\mathrm{cover}},$$
and therefore $|\widehat V^\pi(r) - \widehat V^{\pi'}(r)| \leq \frac{1}{K}\sum_{i=1}^K|\widehat V^{r,\pi}_1(s^{i,\mathrm{in}}_1) - \widehat V^{r,\pi'}_1(s^{i,\mathrm{in}}_1)| \leq \epsilon_{\mathrm{cover}}$. This concludes the proof.

E.6 PROOF OF LEMMA 13

The proof is almost the same as that of Jin et al. (2020b), Lemma B.3, except that the function class of $\widehat V^{r,\pi}_{h'}$ is different. Therefore we only need to bound the covering number $N_\mathcal{V}(\epsilon)$ of the class containing $\widehat V^{r,\pi}_{h'}$, where the distance is defined as $\mathrm{dist}(V, V') = \sup_s|V(s) - V'(s)|$. Note that $\widehat V^{r,\pi}_{h'}$ belongs to the following function class:
$$\mathcal{V} = \Big\{V_{w,A}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\Big[\mathrm{Clip}_{[-(H-h'+1),H-h'+1]}\Big(\mathrm{Clip}_{[-(H-h'+1),H-h'+1]}\big(w^\top\phi_{h'}(s,a)\big) + \mathrm{Clip}_{[0,2(H-h'+1)]}\big(\|\phi(s,a)\|_A\big)\Big)\Big],\ s\in\mathcal{S}\Big\},$$
where the parameters $(w, A)$ satisfy $\|w\| \leq 2H\sqrt{dK/\lambda_{\mathrm{pl}}}$ and $\|A\| \leq \beta^2_{\mathrm{pl}}\lambda^{-1}_{\mathrm{pl}}$. Note that for any $V_{w_1,A_1}, V_{w_2,A_2}\in\mathcal{V}$, we have
$$\mathrm{dist}(V_{w_1,A_1}, V_{w_2,A_2}) \leq \sup_{s,a}\Big|\Big[\mathrm{Clip}_{[-(H-h'+1),H-h'+1]}\big(w_1^\top\phi_{h'}(s,a)\big) + \mathrm{Clip}_{[0,2(H-h'+1)]}\big(\|\phi(s,a)\|_{A_1}\big)\Big] - \Big[\mathrm{Clip}_{[-(H-h'+1),H-h'+1]}\big(w_2^\top\phi_{h'}(s,a)\big) + \mathrm{Clip}_{[0,2(H-h'+1)]}\big(\|\phi(s,a)\|_{A_2}\big)\Big]\Big|$$
$$\leq \sup_{\|\phi\|\leq R}\big|(w_1 - w_2)^\top\phi\big| + \sup_{\|\phi\|\leq R}\sqrt{\big|\phi^\top(A_1 - A_2)\phi\big|} \leq R\big(\|w_1 - w_2\| + \sqrt{\|A_1 - A_2\|_F}\big),$$
where the first and third steps utilize the contraction property of $\mathrm{Clip}$.
Let $\mathcal{C}_w$ be an $\epsilon/(2R)$-cover of $\{w\in\mathbb{R}^d : \|w\| \leq 2r_{\max}\sqrt{dK/\lambda_{\mathrm{pl}}}\}$ with respect to the $\ell_2$-norm, and let $\mathcal{C}_A$ be an $(\epsilon/(2R))^2$-cover of $\{A\in\mathbb{R}^{d\times d} : \|A\| \leq \beta^2_{\mathrm{pl}}\lambda^{-1}_{\mathrm{pl}}\}$ with respect to the Frobenius norm. Then from the literature (Jin et al. (2020b), Lemma D.5), we have
$$\log N_\mathcal{V}(\epsilon) \leq \log|\mathcal{C}_w| + \log|\mathcal{C}_A| \leq d\log\Big(1 + 8\sqrt{dKr_{\max}^2R^2/(\lambda_{\mathrm{pl}}\epsilon^2)}\Big) + d^2\log\Big(1 + 8d^{1/2}\beta^2_{\mathrm{pl}}R^2/(\lambda_{\mathrm{pl}}\epsilon^2)\Big). \quad (24)$$
The rest of the proof follows Jin et al. (2020b), Lemma B.3, directly, so we omit it here.

E.7 PROOF OF PROPOSITION 1

First consider $\zeta$ and $\zeta'$ satisfying
$$\|\zeta_h - \zeta'_h\| \leq \epsilon_z, \quad \forall h\in[H].$$
Then we know that for any $h\in[H]$, $s\in\mathcal{S}$, $a\in\mathcal{A}$,
$$|\zeta_h^\top\phi_h(s,a) - (\zeta'_h)^\top\phi_h(s,a)| \leq \epsilon_zR. \quad (25)$$
Now fix any $h\in[H]$ and $s\in\mathcal{S}$. To simplify notation, we use $x(a)$ and $x'(a)$ to denote $\zeta_h^\top\phi_h(s,a)$ and $(\zeta'_h)^\top\phi_h(s,a)$ respectively. Without loss of generality, assume $\sum_a\exp(x(a)) \leq \sum_a\exp(x'(a))$. Then from (25) we have
$$\sum_a\exp(x(a)) \leq \sum_a\exp(x'(a)) \leq \exp(\epsilon_zR)\sum_a\exp(x(a)).$$
Note that
$$\|\pi^\zeta_h(\cdot|s) - \pi^{\zeta'}_h(\cdot|s)\|_1 = \sum_a\Big|\frac{\exp(x(a))}{\sum_{a'}\exp(x(a'))} - \frac{\exp(x'(a))}{\sum_{a'}\exp(x'(a'))}\Big| = \sum_a\frac{\big|\exp(x(a))\sum_{a'}\exp(x'(a')) - \exp(x'(a))\sum_{a'}\exp(x(a'))\big|}{\sum_{a'}\exp(x(a'))\sum_{a'}\exp(x'(a'))}.$$
For any $a\in\mathcal{A}$, if $\exp(x(a))\sum_{a'}\exp(x'(a')) - \exp(x'(a))\sum_{a'}\exp(x(a')) \geq 0$, then
$$\exp(x(a))\sum_{a'}\exp(x'(a')) - \exp(x'(a))\sum_{a'}\exp(x(a')) \leq \exp(\epsilon_zR)\exp(x(a))\sum_{a'}\exp(x(a')) - \exp(-\epsilon_zR)\exp(x(a))\sum_{a'}\exp(x(a'))$$
$$= \big(\exp(\epsilon_zR) - \exp(-\epsilon_zR)\big)\exp(x(a))\sum_{a'}\exp(x(a')).$$
Otherwise, we have
$$\exp(x'(a))\sum_{a'}\exp(x(a')) - \exp(x(a))\sum_{a'}\exp(x'(a')) \leq \exp(\epsilon_zR)\exp(x(a))\sum_{a'}\exp(x(a')) - \exp(x(a))\sum_{a'}\exp(x(a'))$$
$$= \big(\exp(\epsilon_zR) - 1\big)\exp(x(a))\sum_{a'}\exp(x(a')).$$
Therefore we have
$$\|\pi^\zeta_h(\cdot|s) - \pi^{\zeta'}_h(\cdot|s)\|_1 \leq \big(\exp(\epsilon_zR) - \exp(-\epsilon_zR)\big)\frac{\sum_a\exp(x(a))\sum_{a'}\exp(x(a'))}{\sum_{a'}\exp(x(a'))\sum_{a'}\exp(x'(a'))} \leq \exp(2\epsilon_zR) - 1.$$
This implies that for any $\epsilon \leq 1$,
$$N_\Pi(\epsilon) \leq N_{\mathcal{B}(d,W)}\Big(\frac{\ln 2\cdot\epsilon}{2R}\Big)^H \leq \Big(\frac{12WR}{\epsilon}\Big)^{Hd},$$
where the first step uses $\exp(x) - 1 \leq x/\ln 2$ when $x \leq \ln 2$. This concludes the proof.

E.8 PROOF OF PROPOSITION 2

First we consider the following entropy-regularized RL problem, where for some $\alpha > 0$ we maximize the objective
$$\max_\pi V_\alpha(r^*, \pi) := \mathbb{E}_{\pi,P^*}\Big[\sum_{h=1}^H r^*_h(s_h,a_h) - \alpha\log\pi_h(a_h|s_h)\Big].$$
From the literature (Nachum et al., 2017; Cen et al., 2022), we know that the corresponding optimal regularized value and Q functions can be defined as
$$Q^*_{\alpha,h}(s,a) = r^*_h(s,a) + \mathbb{E}_{s_{h+1}\sim P^*_h(\cdot|s,a)}\big[V^*_{\alpha,h+1}(s_{h+1})\big], \qquad V^*_{\alpha,h}(s) = \max_{\pi_h}\mathbb{E}_{a_h\sim\pi_h(\cdot|s)}\big[Q^*_{\alpha,h}(s,a_h) - \alpha\log\pi_h(a_h|s)\big],$$
where $V^*_{\alpha,H+1}(s) = 0$ for all $s\in\mathcal{S}$. Note that $V^*_{\alpha,h}(s) \leq H(1 + \alpha\log|\mathcal{A}|)$ for all $s\in\mathcal{S}$ and $h\in[H]$. The globally optimal regularized policy is therefore
$$\pi^*_{\alpha,h}(a|s) = \frac{\exp(Q^*_{\alpha,h}(s,a)/\alpha)}{\sum_{a'}\exp(Q^*_{\alpha,h}(s,a')/\alpha)}.$$
In particular, in linear MDPs we have
$$Q^*_{\alpha,h}(s,a) = \phi_h(s,a)^\top\Big(\theta^*_h + \int_{s'\in\mathcal{S}}\mu^*_h(s')V^*_{\alpha,h+1}(s')\,ds'\Big).$$
Therefore $Q^*_{\alpha,h}(s,a) = \phi_h(s,a)^\top w^*_{\alpha,h}$ with $\|w^*_{\alpha,h}\| \leq B + H(1 + \alpha\log|\mathcal{A}|)\sqrt{d}$. This implies that $\pi^*_\alpha$ belongs to the log-linear policy class $\Pi$ with $W = \frac{B + H(1 + \alpha\log|\mathcal{A}|)\sqrt{d}}{\alpha}$. On the other hand, let $\pi_g$ denote the globally optimal unregularized policy. Then
$$V(r^*, \pi_g) - \max_{\pi\in\Pi}V(r^*, \pi) \leq V(r^*, \pi_g) - V(r^*, \pi^*_\alpha) = \big(V(r^*, \pi_g) - V_\alpha(r^*, \pi_g)\big) + \big(V_\alpha(r^*, \pi_g) - V_\alpha(r^*, \pi^*_\alpha)\big) + \big(V_\alpha(r^*, \pi^*_\alpha) - V(r^*, \pi^*_\alpha)\big)$$
$$\leq V_\alpha(r^*, \pi^*_\alpha) - V(r^*, \pi^*_\alpha) \leq \alpha H\log|\mathcal{A}|,$$
where the second inequality uses $V(r^*,\pi_g) \leq V_\alpha(r^*,\pi_g)$ and the optimality of $\pi^*_\alpha$ for the regularized objective, and the last step uses the fact that the entropy term is at most $\alpha H\log|\mathcal{A}|$. Therefore we only need to let $\alpha = \frac{\epsilon}{H\log|\mathcal{A}|}$ to ensure $V(r^*,\pi_g) - \max_{\pi\in\Pi}V(r^*,\pi) \leq \epsilon$.
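The key perturbation bound in the proof of Proposition 1 above, namely $\|\pi^\zeta_h(\cdot|s) - \pi^{\zeta'}_h(\cdot|s)\|_1 \leq \exp(2\epsilon_zR) - 1$ whenever $\|\zeta_h - \zeta'_h\| \leq \epsilon_z$ and $\|\phi_h(s,a)\| \leq R$, can be checked numerically on random log-linear policies. The feature matrices and parameters below are synthetic placeholders, not objects from the paper.

```python
import numpy as np

# Numerical check of the softmax perturbation bound used in Proposition 1:
# || pi^zeta(.|s) - pi^{zeta'}(.|s) ||_1 <= exp(2 * eps_z * R) - 1.
# All dimensions and random draws are illustrative choices.

rng = np.random.default_rng(3)
d, A, R, eps_z = 8, 5, 1.0, 0.1

def softmax_policy(zeta, Phi):
    logits = Phi @ zeta
    p = np.exp(logits - logits.max())            # stabilized softmax over actions
    return p / p.sum()

worst = 0.0
for _ in range(1000):
    Phi = rng.normal(size=(A, d))
    Phi *= R / np.linalg.norm(Phi, axis=1, keepdims=True)   # ||phi(s,a)|| = R
    zeta = rng.normal(size=d)
    delta = rng.normal(size=d)
    delta *= eps_z / np.linalg.norm(delta)                  # ||zeta - zeta'|| = eps_z
    gap = np.abs(softmax_policy(zeta, Phi) - softmax_policy(zeta + delta, Phi)).sum()
    worst = max(worst, gap)

bound = np.exp(2 * eps_z * R) - 1
print(f"max ||pi - pi'||_1 = {worst:.4f} <= exp(2*eps_z*R) - 1 = {bound:.4f}")
assert worst <= bound
```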
F PROOF OF THEOREM 3

First, from the performance difference lemma (Lemma 10), we have
$$V^{r^*,\hat\pi} - V^{r^*,*} = \sum_{h=1}^H\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[Q^*_h(s_h,\hat\pi) - Q^*_h(s_h,\pi^*)\big]$$
$$= \sum_{h=1}^H\Big(\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[Q^*_h(s_h,\hat\pi) - \widehat A_h(s_h,\hat\pi)\big] + \mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\widehat A_h(s_h,\hat\pi) - \widehat A_h(s_h,\pi^*)\big] + \mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\widehat A_h(s_h,\pi^*) - Q^*_h(s_h,\pi^*)\big]\Big)$$
$$\geq \sum_{h=1}^H\Big(\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[Q^*_h(s_h,\hat\pi) - \widehat A_h(s_h,\hat\pi)\big] + \mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\widehat A_h(s_h,\pi^*) - Q^*_h(s_h,\pi^*)\big]\Big)$$
$$= \sum_{h=1}^H\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[A^*_h(s_h,\hat\pi) - \widehat A_h(s_h,\hat\pi) + \widehat A_h(s_h,\pi^*) - A^*_h(s_h,\pi^*)\big] = \sum_{h=1}^H\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\langle\phi_h(s_h,\hat\pi), \xi^*_h - \hat\xi_h\rangle - \langle\phi_h(s_h,\pi^*), \xi^*_h - \hat\xi_h\rangle\big]$$
$$= \sum_{h=1}^H\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\langle\phi_h(s_h,\hat\pi) - \phi_h(s_h,\pi^*),\, \xi^*_h - \hat\xi_h\rangle\big] \geq -\sum_{h=1}^H\big\|\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\phi_h(s_h,\hat\pi) - \phi_h(s_h,\pi^*)\big]\big\|_{\Sigma^{-1}_{h,N+1}}\,\big\|\xi^*_h - \hat\xi_h\big\|_{\Sigma_{h,N+1}}, \quad (26)$$
where the first inequality drops the middle term, which is non-negative by the definition of $\hat\pi$, and the last step follows from the Cauchy-Schwarz inequality. Next we bound $\|\mathbb{E}_{s_h\sim d^{\hat\pi}_h}[\phi_h(s_h,\hat\pi) - \phi_h(s_h,\pi^*)]\|_{\Sigma^{-1}_{h,N+1}}$ and $\|\xi^*_h - \hat\xi_h\|_{\Sigma_{h,N+1}}$ respectively.

First, for $\|\mathbb{E}_{s_h\sim d^{\hat\pi}_h}[\phi_h(s_h,\hat\pi) - \phi_h(s_h,\pi^*)]\|_{\Sigma^{-1}_{h,N+1}}$, notice that $\Sigma_{h,N+1} \succeq \Sigma_{h,n}$ for all $n\in[N+1]$, which implies
$$\big\|\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\phi_h(s_h,\hat\pi) - \phi_h(s_h,\pi^*)\big]\big\|_{\Sigma^{-1}_{h,N+1}} \leq \frac{1}{N}\sum_{n=1}^N\big\|\mathbb{E}_{s_h\sim d^{\hat\pi}_h}\big[\phi_h(s_h,\hat\pi) - \phi_h(s_h,\pi^*)\big]\big\|_{\Sigma^{-1}_{h,n}} \leq \frac{1}{N}\sum_{n=1}^N\big\|\mathbb{E}_{s_h\sim\pi^{h,n,0}}\big[\phi_h(s_h,\pi^{h,n,0}) - \phi_h(s_h,\pi^{h,n,1})\big]\big\|_{\Sigma^{-1}_{h,n}}$$
$$\leq \frac{1}{\sqrt{N}}\sqrt{\sum_{n=1}^N\big\|\mathbb{E}_{s_h\sim\pi^{h,n,0}}\big[\phi_h(s_h,\pi^{h,n,0}) - \phi_h(s_h,\pi^{h,n,1})\big]\big\|^2_{\Sigma^{-1}_{h,n}}} \leq \sqrt{\frac{2d\log(1 + N/d)}{N}}, \quad (27)$$
where the second step comes from the definition of $\pi^{h,n,0}$ and $\pi^{h,n,1}$ and the last step comes from the elliptical potential lemma (Lemma 1) and the fact that $\lambda \geq 4R^2$.

For $\|\xi^*_h - \hat\xi_h\|_{\Sigma_{h,N+1}}$, let $\widetilde\Sigma_{h,n}$ denote $\lambda I + \sum_{i=1}^{n-1}(\phi_h(s_{h,i}, a_{h,i,0}) - \phi_h(s_{h,i}, a_{h,i,1}))(\phi_h(s_{h,i}, a_{h,i,0}) - \phi_h(s_{h,i}, a_{h,i,1}))^\top$. Then, similarly to Lemma 3, with probability at least $1-\delta/2$ we have
$$\|\xi^*_h - \hat\xi_h\|^2_{\Sigma_{h,N+1}} \leq 2\|\xi^*_h - \hat\xi_h\|^2_{\widetilde\Sigma_{h,N+1}} + 2C_{\mathrm{CON}}dR^2B^2\log(N/\delta). \quad (28)$$
On the other hand, similarly to Lemma 2, the MLE guarantee gives us that with probability at least $1-\delta/2$,
$$\|\hat\xi_h - \xi^*_h\|_{\widetilde\Sigma_{h,N+1}} \leq 2C_{\mathrm{MLE}}\sqrt{\kappa^2_{\mathrm{adv}}(d + \log(1/\delta)) + \lambda B^2}, \quad (29)$$
where $\kappa_{\mathrm{adv}} = 2 + \exp(2B_{\mathrm{adv}}) + \exp(-2B_{\mathrm{adv}})$. Therefore, combining (28) and (29), with probability at least $1-\delta$ we have
$$\|\xi^*_h - \hat\xi_h\|_{\Sigma_{h,N+1}} \leq O\big(\kappa_{\mathrm{adv}}BR\sqrt{\lambda d\log(N/\delta)}\big). \quad (30)$$
Thus, combining (26), (27) and (30) via a union bound, we have $V^{r^*,*} - V^{r^*,\hat\pi} \leq \epsilon$ with probability at least $1-\delta$ as long as
$$N \geq \widetilde{O}\Big(\frac{\lambda\kappa^2_{\mathrm{adv}}B^2R^2H^2d^2\log(1/\delta)}{\epsilon^2}\Big).$$

G AUXILIARY LEMMAS

Lemma 14 (Jin et al. (2020b), Lemma D.1). Let $\Lambda = \lambda I + \sum_{i=1}^K\phi_i\phi_i^\top$ where $\phi_i\in\mathbb{R}^d$ and $\lambda > 0$. Then we have
$$\sum_{i=1}^K\phi_i^\top\Lambda^{-1}\phi_i \leq d.$$
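Lemma 14 admits a one-line numerical check, since $\sum_i\phi_i^\top\Lambda^{-1}\phi_i = \mathrm{Tr}\big(\Lambda^{-1}\sum_i\phi_i\phi_i^\top\big) \leq d$. The sketch below verifies this on a few random instances; all dimensions and the sampling scheme are arbitrary illustrative choices.

```python
import numpy as np

# Quick numerical check of Lemma 14: with Lambda = lambda*I + sum_i phi_i phi_i^T,
# the quantity sum_i phi_i^T Lambda^{-1} phi_i is at most d.

rng = np.random.default_rng(4)
for d, K, lam in [(3, 10, 0.1), (8, 200, 1.0), (16, 5, 2.0)]:
    Phi = rng.normal(size=(K, d))
    Lambda = lam * np.eye(d) + Phi.T @ Phi
    # sum_i phi_i^T Lambda^{-1} phi_i = trace(Lambda^{-1} Phi^T Phi)
    total = np.trace(np.linalg.solve(Lambda, Phi.T @ Phi))
    print(f"d={d:2d}, K={K:3d}: sum_i phi_i^T Lambda^{{-1}} phi_i = {total:.3f} <= {d}")
    assert total <= d + 1e-9
```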