# demonstrationregularized_rl__6fa5ec54.pdf Published as a conference paper at ICLR 2024 DEMONSTRATION-REGULARIZED RL Daniil Tiapkin1,2 Denis Belomestny3,2 Daniele Calandriello4 Eric Moulines1,5 Remi Munos4 Alexey Naumov2 Pierre Perrault6 Michal Valko4 Pierre M enard7 1CMAP, Ecole Polytechnique 2HSE University 3Duisburg-Essen University 4Google Deep Mind 5Mohamed Bin Zayed University of AI, UAE 6IDEMIA 7ENS Lyon {daniil.tiapkin,eric.moulines}@polytechnique.edu denis.belomestny@uni-due.de {dcalandriello,munos,valkom}@google.com anaumov@hse.ru pierre.perrault@outlook.com pierre.menard@ens-lyon.fr Incorporating expert demonstrations has empirically helped to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically to what extent this extra information reduces RL s sample complexity. In particular, we study the demonstration-regularized reinforcement learning that leverages the expert demonstrations by KL-regularization for a policy learned by behavior cloning. Our findings reveal that using N E expert demonstrations enables the identification of an optimal policy at a sample complexity of order e O(Poly(S, A, H)/(ε2N E)) in finite and e O(Poly(d, H)/(ε2N E)) in linear Markov decision processes, where ε is the target precision, H the horizon, A the number of action, S the number of states in the finite case and d the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence showing the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, thus setting our approach apart from the prior works. 1 INTRODUCTION In reinforcement learning (RL, Sutton & Barto 1998), agents interact with an environment to maximize the cumulative reward they collect. While RL has shown remarkable success in mastering complex games (Mnih et al., 2013; Silver et al., 2018; Berner et al., 2019), controlling physical systems (Degrave et al., 2022), and enhancing computer science algorithms (Mankowitz et al., 2023), it does face several challenges. In particular, RL algorithms suffer from a large sample complexity, which is a hindrance in scenarios where simulations are impractical and struggle in environments with sparse rewards (Goecks et al., 2020). A remedy found to handle these limitations is to incorporate information from a pre-collected offline dataset in the learning process. Specifically, leveraging demonstrations from experts trajectories without rewards has proven highly effective in reducing sample complexity, especially in fields like robotics (Zhu et al., 2018; Nair et al., 2020) and guiding exploration (Nair et al., 2018; Aytar et al., 2018; Goecks et al., 2020). However, from a theoretical perspective, little is known about the impact of this approach. Previous research has often focused on either offline RL (Rashidinejad et al., 2021; Xie et al., 2021; Yin et al., 2021; Shi et al., 2022) or online RL (Jaksch et al., 2010; Azar et al., 2017; Fruit et al., 2018; Dann et al., 2017; Zanette & Brunskill, 2019b; Jin et al., 2018). 
In this study, we aim to quantify how prior demonstrations from experts influence the sample complexity of various RL tasks, specifically two scenarios: best policy identification (BPI, Domingues et al., 2021a; Al Marjani et al., 2021) and reinforcement learning from human feedback (RLHF), within the context of finite or linear Markov decision processes (Jin et al., 2020). Imitation learning The case where the agent only observes expert demonstrations without further interaction with the environment corresponds to the well-known imitation learning problem. There Published as a conference paper at ICLR 2024 are two primary approaches in this setting: inverse reinforcement learning (Ng & Russell, 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016) where the agent first infers a reward from demonstrations then finds an optimal policy for this reward; and behavior cloning (Pomerleau, 1988; Ross & Bagnell, 2010; Ross et al., 2011; Rajeswaran et al., 2018), a simpler method that employs supervised learning to imitate the expert. However collecting demonstration could be expansive and, furthermore, imitation learning suffers from the compounding errors effect, where the agent can diverge from the expert s policy in unvisited states (Ross & Bagnell, 2010; Rajaraman et al., 2020). Hence, imitation learning is often combined with an online learning phase where the agent directly interacts with the environment. BPI with demonstrations In BPI with demonstrations, the agent observes expert demonstrations like in imitation learning but also has the opportunity to collect new trajectories, including reward information, by directly interacting with the environment. There are three main method categories1 for BPI with demonstration: one employs an off-policy algorithm augmented with a supervised learning loss and a replay buffer pre-filled the demonstrations (Hosu & Rebedea, 2016; Lakshminarayanan et al., 2016; Vecer ık et al., 2017; Hester et al., 2018); while a second uses reinforcement learning with a modified reward supplemented by auxiliary rewards obtained by inverse reinforcement learning (Zhu et al., 2018; Kang et al., 2018). The third class, demonstration-regularized RL, which is the one we study in this paper, leverages behavior cloning to learn a policy that imitates the expert and then applies reinforcement learning with regularization toward this behavior cloning policy (Rajeswaran et al., 2018; Nair et al., 2018; Goecks et al., 2020; Pertsch et al., 2021). Demonstration-regularized RL We introduce a particular demonstration-regularized RL method that consists of several steps. We start by learning with maximum likelihood estimation from the demonstrations of a behavior policy. This transfers the prior information from the demonstrations to a more practical representation: the behavior cloning policy. During the online phase, we aim to solve a trajectory Kullback-Leibler divergence regularized MDP (Neu et al., 2017; Vieillard et al., 2020; Tiapkin et al., 2023), penalizing the policy for deviating too far from the behavior cloning policy. We use the solution of this regularized MDP as an estimate for the optimal policy in the unregularized MDP, effectively reducing BPI with demonstrations to regularized BPI. Consequently, we propose two new algorithms for BPI in regularized MDPs: The UCBVI-Ent+ algorithm, a variant of the UCBVI-Ent algorithm by Tiapkin et al. (2023) with improved sample complexity, and the LSVI-UCB-Ent algorithm, its adaptation to the linear setting. 
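The two-phase structure just described can be summarized schematically. In the sketch below, the callables `behavior_cloning` and `regularized_bpi` are hypothetical stand-ins for the procedures analyzed in Sections 3 and 4, and `lam` plays the role of the regularization parameter λ; this is a sketch of the pipeline, not the paper's implementation.

```python
def demonstration_regularized_rl(expert_trajectories, behavior_cloning, regularized_bpi,
                                 lam, eps_rl, delta_rl):
    """Schematic two-phase pipeline (a sketch under the assumptions stated above).

    Phase 1 compresses the expert demonstrations into a behavior cloning policy
    by maximum likelihood; Phase 2 solves the MDP whose value is penalized by
    lam * KL_traj(pi || pi_bc), i.e. regularized best policy identification
    with reference policy pi_bc.
    """
    pi_bc = behavior_cloning(expert_trajectories)
    pi_rl = regularized_bpi(reference_policy=pi_bc, lam=lam, eps=eps_rl, delta=delta_rl)
    return pi_rl
```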
When incorporated into the demonstration-regularized RL method, these algorithms yield sample complexity rates for BPI with N E demonstrations of order2 e O(Poly(S, A, H)/(ε2N E)) in finite and e O(Poly(d, H)/(ε2N E)) in linear MDPs, where ε is the target precision, H the horizon, A the number of action, S the number of states in the finite case and d the dimension of the feature space in the linear case. Notably, these rates show that leveraging demonstrations can significantly improve upon the rates of BPI without demonstrations, which are of order e O(Poly(S, A, H)/ε2) in finite MDPs (Kaufmann et al., 2021; M enard et al., 2021) and e O(Poly(d, H)/ε2) in linear MDPs (Taupin et al., 2023). This work, up to our knowledge, represents the first instance of sample complexity rates for BPI with demonstrations, establishing the provable efficiency of demonstration-regularized RL. Preference-based BPI with demonstration In RL with demonstrations, the assumption typically entails the observation of rewards in the online learning phase. However, in reinforcement learning from human feedback, such that recommendation system (Chaves et al., 2022), robotics (Jain et al., 2013; Christiano et al., 2017), clinical trials (Zhao et al., 2011) or large language models fine-tuning (Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022), the reward is implicitly defined by human values. Our focus centers on preference-based RL (Pb RL, Busa-Fekete et al. 2014; Wirth et al. 2017; Novoseller et al. 2020; Saha et al. 2023) where the observed preferences between two trajectories are essentially noisy reflections of the value of a link function evaluated at the difference between cumulative rewards for these trajectories. Existing literature on Pb RL focuses either on the offline setting where the agent observes a precollected dataset of trajectories and preferences (Zhu et al., 2023; Zhan et al., 2023a) or on the 1The boundary between the above families of methods is not strict, since for example, one can see the regularization in the third family as a particular choice of auxiliary reward learned by inverse reinforcement learning that appears in the second class of methods. 2In the e O( ) notation we ignore terms poly-log in H, S, A, d, 1/δ, 1/ε and the notation Poly indicates polynomial dependencies. Published as a conference paper at ICLR 2024 online setting where the agent sequentially samples a pair of trajectories and observes the associated preference (Saha et al., 2023; Xu et al., 2020; Wang et al., 2023). In this work, we explore a hybrid setting that aligns more closely with what is done in practice (Ouyang et al., 2022). In this framework, which we call preference-based BPI with demonstration, the agent selects a sampling policy based on expert-provided demonstrations used to generate trajectories and associated preferences. The offline collection of preference holds particular appeal in RLHF due to the substantial cost and latency associated with obtaining preference feedback. Finally, in our setting, the agent engages with the environment by sequentially collecting reward-free trajectories and returns an estimate for the optimal policy. Demonstration-regularized RLHF To address this novel setting, we follow a similar approach that was used in RL with demonstrations. We employ the dataset of preferences sampled using the behavior cloning policy to estimate rewards. Then, we solve the MDP regularized towards the behavior cloning policy, equipped with the estimated reward. 
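Concretely, the reward estimate is obtained by maximum likelihood on the pairwise preferences. Written out (this is the estimator stated formally in Section 5, with G the reward function class, σ the link function, and o_k the observed preference for the k-th pair of trajectories), it takes the form

```latex
\hat{r} \in \arg\max_{r \in \mathcal{G}}
\sum_{k=1}^{N^{\mathrm{RM}}}
\Big[\, o_k \log \sigma\big(r(\tau_1^k)-r(\tau_0^k)\big)
      + (1-o_k) \log\Big(1-\sigma\big(r(\tau_1^k)-r(\tau_0^k)\big)\Big) \Big].
```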
Using the same regularized BPI solvers, we establish a sample complexity for the demonstration-regularized RLHF method of order Õ(Poly(S, A, H)/(ε²N^E)) in finite MDPs and Õ(Poly(d, H)/(ε²N^E)) in linear MDPs. Intriguingly, these rates mirror those of RL with demonstrations, illustrating that RLHF with demonstrations does not pose a greater challenge than RL with demonstrations. Notably, these findings expand upon a similar observation made by Wang et al. (2023) in the absence of prior information. We highlight our main contributions:
- We establish that demonstration-regularized RL is an efficient solution method for RL with N^E demonstrations, exhibiting a sample complexity of order Õ(Poly(S, A, H)/(ε²N^E)) in finite MDPs and Õ(Poly(d, H)/(ε²N^E)) in linear MDPs.
- We provide evidence that demonstration-regularized methods can effectively address reinforcement learning from human feedback (RLHF) by collecting preferences offline and eliminating the necessity for pessimism (Zhan et al., 2023a). Interestingly, they achieve sample complexities similar to those in RL with demonstrations.
- We prove performance guarantees for the behavior cloning procedure in terms of the Kullback-Leibler divergence from the expert policy. They are of order Õ(Poly(S, A, H)/N^E) for finite MDPs and Õ(Poly(d, H)/N^E) for linear MDPs.
- We provide novel algorithms for regularized BPI in finite and linear MDPs with sample complexities Õ(H⁵S²A/(λε)) and Õ(H⁵d²/(λε)), respectively, where λ is the regularization parameter.

MDPs We consider an episodic MDP M = (S, s_1, A, H, {p_h}_{h∈[H]}, {r_h}_{h∈[H]}), where S is the set of states with s_1 the fixed initial state, A is the finite set of actions of size A, H is the number of steps in one episode, p_h(s'|s, a) is the probability of transitioning from state s to state s' by performing action a at step h, and r_h(s, a) ∈ [0, 1] is the reward obtained by taking action a in state s at step h. We will consider two particular classes of MDPs.

Definition 1 (Finite MDP). An MDP M is finite if the state space S is finite, with size denoted by S.

Definition 2 (Linear MDP). An MDP M = (S, s_1, A, H, {p_h}_{h∈[H]}, {r_h}_{h∈[H]}) is linear if the state space S is measurable with respect to a certain σ-algebra F_S, and there exist a known feature map ψ: S × A → R^d, unknown parameters θ_h ∈ R^d, and an unknown family of signed measures µ_{h,i}, h ∈ [H], i ∈ [d], with vector form µ_h : F_S → R^d, such that for all (h, s, a) ∈ [H] × S × A and any measurable set B ∈ F_S it holds that

r_h(s, a) = ψ(s, a)^⊤ θ_h,   and   p_h(B|s, a) = Σ_{i=1}^{d} ψ(s, a)_i µ_{h,i}(B) = ψ(s, a)^⊤ µ_h(B).

Without loss of generality, we assume ‖ψ(s, a)‖_2 ≤ 1 for all (s, a) ∈ S × A and max{‖µ_h(S)‖_2, ‖θ_h‖_2} ≤ √d for all h ∈ [H].

Policy & value functions A policy π is a collection of functions π_h : S → P(A) for all h ∈ [H], where every π_h maps each state to a probability distribution over the action set. We denote by Π the set of policies. The value function of policy π at step h and state s is denoted by V^π_h, and the optimal value functions, denoted by V*_h, are given by the Bellman, respectively optimal Bellman, equations

Q^π_h(s, a) = r_h(s, a) + p_h V^π_{h+1}(s, a),   V^π_h(s) = π_h Q^π_h(s),
Q*_h(s, a) = r_h(s, a) + p_h V*_{h+1}(s, a),   V*_h(s) = max_{a∈A} Q*_h(s, a),

where by definition V^π_{H+1} ≡ V*_{H+1} ≡ 0. Furthermore, p_h f(s, a) ≜ E_{s'∼p_h(·|s,a)}[f(s')] denotes the expectation operator with respect to the transition probabilities p_h, and π_h g(s) ≜ E_{a∼π_h(·|s)}[g(s, a)] denotes the composition with the policy π at step h.
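For a finite MDP, these recursions can be evaluated directly by backward induction over h = H, …, 1. The numpy sketch below does so on a randomly generated toy instance; the shapes, transition kernel, and rewards are illustrative assumptions for the example, not objects from the paper.

```python
import numpy as np

def optimal_values(p, r):
    """Backward induction for the optimal Bellman equations:
    Q*_h(s, a) = r_h(s, a) + sum_{s'} p_h(s'|s, a) V*_{h+1}(s'),
    V*_h(s)    = max_a Q*_h(s, a),  with V*_{H+1} = 0.
    Shapes (0-indexed in code): p is (H, S, A, S), r is (H, S, A)."""
    H, S, A, _ = p.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + p[h] @ V[h + 1]   # (S, A): expectation of V*_{h+1} under p_h
        V[h] = Q[h].max(axis=1)         # greedy maximization over actions
    return Q, V

# Toy example with H=3, S=4, A=2 and random transitions/rewards.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
p = rng.random((H, S, A, S))
p /= p.sum(axis=-1, keepdims=True)      # normalize to valid transition kernels
r = rng.random((H, S, A))
Q_star, V_star = optimal_values(p, r)
print(V_star[0])                        # optimal values at the first step
```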
Trajectory Kullback-Leibler divergence We define the trajectory Kullback-Leibler divergence between a policy π and a policy π' as the expected sum, over a trajectory sampled with π, of the step-wise Kullback-Leibler divergences between the two policies,

KL_traj(π ‖ π') = E_π[ Σ_{h=1}^{H} KL(π_h(·|s_h) ‖ π'_h(·|s_h)) ].

3 BEHAVIOR CLONING

In this section, we analyze the complexity of behavior cloning for imitation learning in finite and linear MDPs.

Imitation learning In imitation learning we are provided a dataset D^E = {τ_i = (s^i_1, a^i_1, . . . , s^i_H, a^i_H), i ∈ [N^E]} of N^E independent reward-free trajectories sampled from a fixed unknown expert policy π^E. The objective is to learn from these demonstrations a policy close to optimal. In order to get useful demonstrations, we assume that the expert policy is close to optimal, that is, V*_1(s_1) − V^{π^E}_1(s_1) ≤ ε^E for some small ε^E > 0.

Behavior cloning The simplest method for imitation learning is to directly learn to replicate the expert policy in a supervised fashion. Precisely, the behavior cloning policy π^BC is obtained by minimizing the regularized negative log-likelihood over a class of policies F = {π ∈ Π : π_h ∈ F_h}, where F_h is a class of conditional distributions S → P(A) and R_h is some regularizer,

π^BC ∈ argmin_{π ∈ F} Σ_{h=1}^{H} [ Σ_{i=1}^{N^E} log(1/π_h(a^i_h|s^i_h)) + R_h(π_h) ].    (1)

In order to provide convergence guarantees for behavior cloning, we make the following assumptions. First, we assume some regularity conditions on the class of policies, defined in terms of covering numbers of the class; see Appendix A for a definition.

Assumption 1. For all h ∈ [H], there are two positive constants d_F, R_F > 0 such that for all ε ∈ (0, 1), log N(ε, F_h, ‖·‖_∞) ≤ d_F log(R_F/ε). Moreover, there is a constant γ > 0 such that for any h ∈ [H] and π_h ∈ F_h it holds that π_h(a|s) ≥ γ for any (s, a) ∈ S × A.

Assumption 1 is a typical parametric assumption in density estimation, see e.g. Zhang (2002), with d_F being a covering dimension of the underlying parameter space. The part of the assumption on a minimal probability is needed to control KL-divergences (Zhang, 2006). Next, we assume that a smoothed version of the expert policy belongs to the class of hypotheses.

Assumption 2. There is a constant κ ∈ (0, 1/2) such that the κ-greedy version of the expert policy defined by π^{E,κ}_h(a|s) = (1 − κ)π^E_h(a|s) + κ/A belongs to the hypothesis class of policies: π^{E,κ} ∈ F.

Note that a deterministic expert policy verifies Assumption 2 provided that γ is small enough and the policy class is rich enough. For κ = 0, in contrast, the assumption is never satisfied for any γ > 0, since a deterministic policy assigns zero probability to some actions and thus violates the minimal-probability requirement. In the sequel, we provide examples of the policy class F and regularizers (R_h)_{h∈[H]} for finite or linear MDPs such that the above assumptions are satisfied. We are now ready to state general performance guarantees for behavior cloning with KL regularization.

Theorem 1. Let Assumptions 1-2 be satisfied and let 0 ≤ R_h(π_h) ≤ M for all h ∈ [H] and any policy π_h ∈ F_h. Then with probability at least 1 − δ, the behavior cloning policy π^BC satisfies

KL_traj(π^E ‖ π^BC) ≤ 6 d_F H (log(Ae³/(Aγ ∧ κ)) log(2HN^E R_F/(γδ))) / N^E.

This result shows that if the number of demonstrations N^E is large enough and γ = 1/N^E, κ = A/N^E, then the behavior cloning policy π^BC converges to the expert policy π^E at a fast rate of order Õ((d_F H + A)/N^E), where we measure the distance between two policies by the trajectory Kullback-Leibler divergence. The proof of this theorem is postponed to Appendix B; it heavily relies on verifying the so-called Bernstein condition (Bartlett & Mendelson, 2006).
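For finite MDPs, the minimization in (1) with the logarithmic regularizer of Section 3.1 admits the count-based closed form given in Remark 1 below. The short numpy sketch that follows implements that closed form; the trajectory format and array shapes are illustrative assumptions for the example.

```python
import numpy as np

def behavior_cloning_counts(trajectories, S, A, H):
    """Count-based behavior cloning for finite MDPs (cf. the closed form in Section 3.1):
    pi_bc[h, s, a] = (N_h(s, a) + 1) / (N_h(s) + A).

    `trajectories` is a list of expert trajectories, each a list of H (state, action)
    pairs with states in [0, S) and actions in [0, A)."""
    counts = np.zeros((H, S, A))
    for traj in trajectories:
        for h, (s, a) in enumerate(traj):
            counts[h, s, a] += 1
    # Laplace-style smoothing keeps every probability >= 1/(N^E + A),
    # matching the constraint defining the policy class F in Section 3.1.
    pi_bc = (counts + 1) / (counts.sum(axis=2, keepdims=True) + A)
    return pi_bc

# Tiny usage example with S=3 states, A=2 actions, H=2 steps.
demo = [[(0, 1), (2, 0)], [(0, 1), (1, 0)]]
print(behavior_cloning_counts(demo, S=3, A=2, H=2)[0, 0])  # action distribution at (h=0, s=0)
```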
3.1 FINITE MDPS For finite MDPs, we chose a logarithmic regularizer Rh(πh) = P s,a log(1/πh(a|s)) and the class of policies F = {π Π : πh(a|s) 1/(N E + A)}. One can check that Assumptions 1-2 hold and 0 Rh(πh) SA log(N E + A). We can apply Theorem 1 to obtain the following bound for finite MDPs (see Appendix B.2 for additional details). Corollary 1. For all N E A, for function class F and regularizer (Rh)h [H] defined above, it holds with probability at least 1 δ, KLtraj(πE πBC) 6SAH log(2e4N E) log(12H(N E)2/δ) Published as a conference paper at ICLR 2024 Note that, imitation learning with a logarithmic regularizer is closely related to the statistical problem of conditional density estimation with Kullback-Leibler divergence loss, see for example Section 4.3 by van der Hoeven et al. (2023) and references therein. Additionally, we would like to emphasize that the presented upper bound is optimal up to poly-logarithmic terms, see Appendix B.5 for a corresponding lower bound. Remark 1. In fact, the constraint added by the class of policies F is redundant with the effect of regularization and one can directly optimize over the whole set of policies in (1). It is then easy to obtain a closed formula for the behavior cloning policy πBC h (a|s) = (N E h (s, a) + 1)/(N E h (s) + A), where we define the counts by N E h (s) = P a A N E h (s, a) and N E h (s, a) = PNE i=1 1{(si h, ai h) = (s, a)}. Remark 2. Contrary to Ross & Bagnell (2010) and Rajaraman et al. (2020), our bound does not feature the optimality gap of the behavior policy but measures how close it is to the expert policy which is crucial to obtain the results of the next sections. Nevertheless, we can recover from our bound some of the results of the aforementioned references, see Appendix B.6 for details. 3.2 LINEAR MDPS For the linear setting, we need the expert policy to belong to some well-behaved class of parametric policies. A first possibility would be to consider a greedy policy with respect to Q-value, linear in the feature space πh(s) arg maxπ A(πψ)(s)Twh for some parameters wh as it is done in the existing imitation learning literature (Rajaraman et al., 2021). However, under such a parametrization it would be almost impossible to learn an expert policy with a high quality since a small perturbation in the parameters wh could lead to a completely different policy. We emphasize that if we assume that the expert policy is an optimal one, then Rajaraman et al. (2021) proposes a way to achieve a ε-optimal policy but not how to reconstruct the expert policy itself. That is why we consider another natural parametrization where the log probability of the expert policy is linear in the feature space. Assumption 3. For all h [H], there exists an unknown parameter w E h Rd with w E h 2 R for some known R 0 such that πE h (a|s) = exp(ψ(s, a)Tw E h)/(P a A exp(ψ(s, a )Tw E h)). For instance, this assumption is satisfied for optimal policy in entropy-regularized linear MDPs, see Lemma 1 in Appendix B.3. Under Assumption 3, a suitable choice of policy class is given by F = {π Π : πh Fh} where Fh = πh(a|s) = κ A + (1 κ) exp(ψ(s, a)Twh) P a A exp(ψ(s, a )Twh) : wh Rd, wh 2 R (2) and κ = A/(N E + A). Furthermore, for the linear setting, we do not need regularization, that is, Rh(π) = 0. Equipped with this class of policies we can prove a similar result as in the finite setting with the number of states replaced by the dimension d of the feature space. Corollary 2. 
Under Assumption 3, function class F defined above and regularizer Rh = 0 for all h [H], it holds for all N E A with probability at least 1 δ, KLtraj(πE πBC) 8d H (log(2e3AN E) (log(48(N E)2R) + log(H/δ))) N E . Taking into account the fact that finite MDPs are a specific case within the broader category of linear MDPs, the lower bound presented in Appendix B.5 also shows the optimality of this result. 4 DEMONSTRATION-REGULARIZED RL In this section, we study reinforcement learning when demonstrations from an expert are also available. First, we describe the regularized best policy identification framework that will be useful later. Regularized best policy identification (BPI) Given some reference policy eπ and some regularization parameter λ > 0, we consider the trajectory Kullback-Leibler divergence regularized value function V π eπ,λ,1(s1) V π 1 (s1) λKLtraj(π, eπ). In this value function, the policy π is penalized for moving too far from the reference policy eπ. Interestingly, we can compute the value of policy π with the regularized Bellman equations, (Neu et al., 2017; Vieillard et al., 2020) Qπ eπ,λ,h(s, a) = rh(s, a) + ph V π eπ,λ,h+1(s, a) , V π eπ,λ,h(s) = πh Qπ eπ,λ,h(s) λ KL(πh(s) eπh(s)) , where V π eπ,λ,H+1 = 0. We are interested in the best policy identification for this regularized value. Precisely, in regularized BPI, the agent interacts with MDP as follows: at the beginning of episode t, the agent picks up a policy πt based only on the transitions collected up to episode t 1. Then a new trajectory (with rewards) is sampled following the policy πt and is observed by the agent. At the end of each episode, the agent can decide to stop collecting new data, according to a random stopping time ι (ι = t if the agent stops after the t-th episode), and output a policy bπ based on the observed transitions. An agent for regularized BPI is therefore made of a triplet ((πt)t N, ι, bπ). Published as a conference paper at ICLR 2024 Definition 3. (PAC algorithm for regularized BPI) An algorithm ((πt)t N, ι, bπ) is (ε, δ)-PAC for BPI regularized with policy eπ and parameter λ with sample complexity C(ε, λ, δ) if P V eπ,λ,1(s1) V bπ eπ,λ,1(s1) ε , ι C(ε, λ, δ) 1 δ . We can now describe the setting studied in this section. BPI with demonstration We assume, as in Section 3, that first the agent observes N E independent trajectories DE sampled from an expert policy πE. Then the setting is the same as in BPI. Precisely, the agent interacts with the MDP as follows: at episode t, the agent selects a policy πt based on the collected transitions and the demonstrations. Then a new trajectory (with rewards) is sampled following the policy πt and observed by the agent. At the end of each episode, the agent stops according to a stopping rule ι (ι = t if the agent stops after the t-th episode), and outputs a policy πRL. Definition 4. (PAC algorithm for BPI with demonstration) An algorithm ((πt)t N, ι, πRL) is (ε, δ)- PAC for BPI with demonstration with sample complexity C(ε, N E, δ) if P V 1 (s1) V πRL 1 (s1) ε, ι C(ε, N E, δ) 1 δ. To tackle BPI with demonstration we focus on the following natural and simple approach. Demonstration-regularized RL The main idea behind this method is to reduce BPI with demonstration to regularized BPI. Indeed, in demonstration-regularized RL, the agent starts by learning through behavior cloning from the demonstration of a policy πBC that imitates the expert policy, refer to Section 3 for details. 
Then the agent computes a policy πRL by performing regularized BPI with policy πBC and some well-chosen parameter λ. The policy πRL is then returned as the guess for an optimal policy. The whole procedure is described in Algorithm 1. Intuitively the prior information contained in the demonstration is compressed into a handful representation namely the policy πBC. Then this information is injected into the BPI procedure by encouraging the agent to output a policy close to the behavior policy. Algorithm 1 Demonstration-regularized RL 1: Input: Precision parameter εRL, probability parameter δRL, demonstrations DE, regularization parameter λ. 2: Compute behavior cloning policy πBC = Behavior Cloning(DE). 3: Perform regularized BPI πRL = Reg BPI(πBC, λ, εRL, δRL) 4: Output: policy πRL. Using the previous results for regularized BPI, we next derive guarantees for demonstrationregularized RL. We start from a general black-box result that shows how the final policy error depends on the behavior cloning error, parameter λ, and the quality of regularized BPI. Theorem 2. Assume that there are an expert policy πE such that V 1 (s1) V πE 1 (s1) εE and a behavior cloning policy πBC satisfying p KLtraj(πE πBC) εKL. Let πRL be εRL-optimal policy in λ-regularized MDP with respect to πBC, that is, V πBC,λ,1(s1) V πRL πBC,λ,1 εRL. Then πRL fulfills V 1 (s1) V πRL 1 (s1) εE + εRL + λε2 KL. In particular, under the choice λ = εRL/ε2 KL, the policy πRL is (2εRL + εE)-optimal in the original (non-regularized) MDP. Remark 3. We define an error in trajectory KL-divergence under the square root because the KLdivergence behaves quadratically in terms of the total variation distance by Pinsker s inequality. Remark 4 (BPI with prior policy). We would like to underline that we exploit all the prior information only through one fixed behavior cloning policy. However, as it is observable from the bounds of Theorem 2, our guarantees are not restricted to such type of policies and potentially could work with any prior policy close enough to a near-optimal one in trajectory Kullback-Leibler divergence. The proof of the theorem above is postponed to Appendix C. To apply this result and derive sample complexity for demonstration-regularized BPI, we present the UCBVI-Ent+ algorithm, a modification of the algorithm UCBVI-Ent proposed by Tiapkin et al. (2023), that achieves better rates for regularized BPI in the finite setting. In Appendix E, we also present the LSVI-UCB-Ent algorithm, a direct adaptation of the UCBVI-Ent+ to the linear setting. Published as a conference paper at ICLR 2024 Notably, the use of UCBVI-Ent by Tiapkin et al. (2023) within the framework of demonstrationregularized methods, fails to yield acceleration through expert data incorporation due to its e O(1/ε2) sample complexity. In contrast, the enhanced variant, UCBVI-Ent+, exhibits a more favorable complexity of e O(1/(ελ)), where λ = ε/ε2 KL ε under the conditions stipulated in Theorem 2, for a εKL sufficiently small. It is noteworthy that an alternative approach employing the RL-Explore-Ent algorithm, introduced by Tiapkin et al. (2023), can also achieve such acceleration. However, RL-Explore-Ent is associated with inferior rates in terms of S and H and is challenging to extend beyond finite settings. The UCBVI-Ent+ algorithm works by sampling trajectories according to an exploratory version of an optimistic solution for the regularized MDP which is characterized by the following rules. 
UCBVI-Ent+ sampling rule To obtain the sampling rule at episode t, we first compute a policy πt by optimistic planning in the regularized MDP, Q t h(s, a) = clip rh(s, a) + bpt h V t h+1(s, a) + bp,t h (s, a), 0, H , πt+1 h (s) = arg max π A n πQ t h(s) λ KL(π eπh(s)) o , V t h(s) = πt+1 h Q t h(s) λ KL( πt+1 h (s) eπh(s)) with V t H+1 = 0 by convention, where eπ is a reference policy, bpt is an estimate of the transition probabilities, and bt some bonus term taking into account estimation error for transition probabilities. Then we define a family of policies that aim to explore actions for which Q-value is not well estimated at a particular step. That is, for h [0, H], the policy πt,(h ) first follows the optimistic policy πt until step h where it selects an action leading to the largest width of a confidence interval for the optimal Q-value, πt,(h ) h (a|s) = ( πt h(a|s) if h = h , 1 n a = arg maxa A(Q t h(s, a ) Qt h(s, a )) o if h = h , where Qt is a lower bound on the optimal regularized Q-value function, see Appendix D.4. In particular, for h = 0 we have πt,(0) = πt. The sampling rule is obtained by picking uniformly at random one policy among the family πt = πt,(h ), h [0, H] in each episode. Note that it is equivalent to sampling from a uniform mixture policy πmix,t over all h [0, H], see Appendix D for more details. This algorithmic choice allows us to exploit strong convexity of the KL-divergence and control the properties of a stopping rule, defined in Appendix D.4, that depends on the gap (Q t h(s, a) Qt h(s, a))2. The complete procedure is described in Algorithm 3 in Appendix D. We prove that for the wellcalibrated bonus functions bp,t and a stopping rule defined in Appendix D.4, the UCBVI-Ent+ algorithm is (ε, δ)-PAC for regularized BPI and provide a high-probability upper bound on its sample complexity. Additionally, a similar result holds for LSVI-UCB-Ent algorithm. The next result is proved in Appendix D.5 and Appendix E.5. Theorem 3. For all ε > 0, δ (0, 1), the UCBVI-Ent+ / LSVI-UCB-Ent algorithms defined in Appendix D.4 /Appendix E.4 are (ε, δ)-PAC for the regularized BPI with sample complexity C(ε, δ) = e O H5S2A (finite) C(ε, δ) = e O H5d2 Additionally, assume that the expert policy is εE = ε/2-optimal and satisfies Assumption 3 in the linear case. Let πBC be the behavior cloning policy obtained using corresponding function sets described in Section 3. Then demonstration-regularized RL based on UCBVI-Ent+ / LSVI-UCB-Ent with parameters εRL = ε/4, δRL = δ/2 and λ = e O N Eε/(SAH) / e O N Eε/(d H) is (ε, δ)-PAC for BPI with demonstration in finite / linear MDPs and has sample complexity of order C(ε, N E, δ) = e O H6S3A2 (finite) C(ε, N E, δ) = e O H6d3 In the finite setting, UCBVI-Ent+ improves the previous fast-rate sample complexity result of order e O(H8S4A/(λε)) by Tiapkin et al. (2023). For the linear setting, we would like to acknowledge that LSVI-UCB-Ent is the first algorithm that achieves fast rates for exploration in regularized linear MDPs. 5 DEMONSTRATION-REGULARIZED RLHF In this section, we consider the problem of reinforcement learning with human feedback. We assume that the MDP is finite, i.e., |S| < + to simplify the manipulations with the trajectory space. Published as a conference paper at ICLR 2024 However, the state space could be arbitrarily large. In this setting, we do not observe the true reward function r but have access to an oracle that provides a preference feedback between two trajectories. 
We assume that the preference is a random variable with parameters that depend on the cumulative rewards of the trajectories as detailed in Assumption 4. Given a reward function r = {rh}H h=1, we define the reward of a trajectory τ (S A)H as the sum of rewards collected over this trajectory r(τ) PH h=1 rh(sh, ah). Assumption 4 (Preference-based model). Let τ0, τ1 be two trajectories. The preference for τ1 over τ0 is a Bernoulli random variable o with a parameter q (τ0, τ1) = σ(r (τ1) r (τ0)), where σ: R [0, 1] is a monotone increasing link function that satisfies infx [ H,H] σ (x) = 1/ζ for ζ > 0. The main example of the link function is a sigmoid function σ(x) = 1/(1 + exp( x)) that leads to the Bradley-Terry-Luce (BTL) model (Bradley & Terry, 1952) widely used in the literature (Wirth et al., 2017; Saha et al., 2023). We now introduce the learning framework. Preference-based BPI with demonstration We assume, as in Section 3, that the agent observes N E independent trajectories DE sampled from an expert policy πE. Then the learning is divided in two phases: 1) Preference collection. Based on the observed expert trajectories DE, the agent selects a sampling policy πS to generate a data set of preferences DRM = {(τ k 0 , τ k 1 , ok)}NRM k=1 consisting of pairs of trajectories and the sampled preferences. Specifically, both trajectories of the pair (τ k 0 , τ k 1 ) are sampled with the policy πS and the associated preference ok is obtained according to the preference-based model described in Assumption 4. 2) Reward-free interaction. Next, the agent interacts with the reward-free MDP as follows: at episode t, the agent selects a policy πt based on the collected transitions up to time t, demonstrations and preferences. Then a new trajectory (reward-free) is sampled following the policy πt and is observed by the agent. At the end of each episode, the agent can decide to stop according to a stopping rule ι and outputs a policy πRLHF. Definition 5 (PAC algorithm for preference-based BPI with demonstration). An algorithm ((πt)t N, πS, ι, πRLHF) is (ε, δ)-PAC for preference-based BPI with demonstrations and sample complexity C(ε, N E, δ) if P V 1 (s1) V πRLHF 1 (s1) ε, ι C(ε, N E, δ) 1 δ, where the unknown true reward function r is used in the value-function V . For the above setting, we provide a natural approach that combines demonstration-regularized RL with the maximum likelihood estimation of the reward given preferences dataset. Demonstration-regularized RLHF During the preference collection phase, the agent generates a dataset comprising trajectories and observed preferences, denoted as DRM = {(τ k 0 , τ k 1 , ok)}NRM k=1 by executing the previously computed policy πBC. Using this dataset, the agent can infer the reward via maximum likelihood estimation (MLE). Algorithm 2 Demonstration-regularized RLHF 1: Input: Precision parameter εRLHF, probability parameter δRLHF, demonstrations DE, preferences budget N RM, regularization parameter λ. 2: Compute behavior cloning policy πBC = Behavior Cloning(DE); 3: Select sampling policy πS = πBC and collect preference dataset DRM; 4: Compute reward estimate ˆr = Reward MLE(Gr, DRM); 5: Perform regularized BPI using ˆr as reward: πRLHF = Reg BPI(πBC, λ, εRLHF, δRLHF; ˆr) 6: Output: policy πRLHF. The core idea behind this approach is to simplify the problem by transforming it into a regularized BPI problem. The agent starts with behavior cloning applied to the expert dataset, resulting in the policy πBC. 
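As a concrete illustration of the preference feedback formalized in Assumption 4, and of the comparisons gathered during the preference collection phase described next, the snippet below simulates a single label under the Bradley-Terry-Luce link; the two trajectory returns are illustrative inputs, not quantities produced by the paper's algorithm.

```python
import numpy as np

def sample_preference(return_tau0, return_tau1, rng):
    """Sample o ~ Bernoulli(sigma(r(tau_1) - r(tau_0))) under the BTL link
    sigma(x) = 1 / (1 + exp(-x)) of Assumption 4; o = 1 means tau_1 is preferred."""
    p = 1.0 / (1.0 + np.exp(-(return_tau1 - return_tau0)))
    return int(rng.random() < p)

rng = np.random.default_rng(0)
# Two hypothetical cumulative rewards r(tau_0) = 2.0 and r(tau_1) = 3.5.
print(sample_preference(2.0, 3.5, rng))
```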
During the preference collection phase, the agent generates a dataset comprising trajectories and observed preferences, denoted as DRM = {(τ k 0 , τ k 1 , ok)}NRM k=1 by executing the previously computed policy πBC. Using this dataset, the agent can infer the reward via MLE: ˆr arg max r G k=1 ok log σ r(τ k 1 ) r(τ k 0 ) + (1 ok) log 1 σ r(τ k 1 ) r(τ k 0 ) , where G is a function class for trajectory reward functions3. Finally, the agent computes πRL by performing regularized BPI with policy πBC, a properly chosen regularization parameter λ and the estimated reward ˆr. The complete procedure is outlined in Algorithm 2. 3For the theoretical guarantees on MLE estimate of rewards ˆr we refer to Appendix F.1. Published as a conference paper at ICLR 2024 For this algorithm, we use the behavior cloning policy πBC for two purposes. First, it allows efficient offline collection of the preference dataset DRM, from which a high-quality estimate of the reward can be derived. Second, a regularization towards the behavior cloning policy πBC enables the injection of information obtained from the demonstrations, while also avoiding the direct introduction of pessimism in the estimated reward as in the previous works that handle offline datasets (Zhu et al., 2023; Zhan et al., 2023a). Remark 5. Zhan et al. (2023b) propose a similar two-stage setting of preference collection and reward-free interaction without prior demonstrations and propose an algorithm for this setup. However, as compared to their result, our pipeline is adapted to any parametric function approximation of rewards and does not require solving any (non-convex) optimization problem during the preference collection phase. Remark 6. Our approach to solve BPI with demonstration within the preference-based model framework draws inspiration from well-established methods for large language model RL finetuning (Stiennon et al., 2020; Ouyang et al., 2022; Lee et al., 2023). Specifically, our algorithm s policy learning phase is similar to solving an RL problem with policy-dependent rewards r RLHF h (s, a) = ˆrh(s, a) λ log πRLHF h (a|s)/πBC h (a|s) . This formulation, coupled with our prior stages of behavior cloning, akin to supervised fine-tuning (SFT), and reward estimation through MLE based on trajectories generated by the SFT policy, mirrors a simplified version of the three-phase RLHF pipeline. The following sample complexity bounds for tabular and linear MDPs is a simple corollary of Theorem 8 and Theorem 3 and its proof is postponed to Appendix F.3. Corollary 3 (Demonstration-regularized RLHF). Let Assumption 4 hold. For ε > 0 and δ (0, 1), assume that an expert policy εE is ε/8-optimal and satisfies Assumption 3 in the linear case. Let πBC be the behavioral cloning policy obtained using function sets described in Section 3 and let the set G be defined in Lemma 19 for finite and in Lemma 20 for linear setting, respectively. If the following two conditions hold (1) N E N RM eΩ ζ2H2 e D2/ε2 ; (2) N E eΩ H2 e D/ε or N RM eΩ Crζ2H e D/ε2 for e D = SA / d in finite /linear MDPs and Cr = Cr(G, πE, πBC) is a single-policy concentrability coefficient defined in (20), Appendix F.3 (see also Zhan et al. 
2023a), then demonstration-regularized RLHF based on UCBVI-Ent+ / LSVI-UCB-Ent with parameters εRL = ε/16, δRL = δ/3 and λ = λ e O N Eε/(SAH) / e O N Eε/(d H) is (ε, δ)-PAC for BPI with demonstration in finite / linear MDPs with sample complexity C(ε, N E, δ) = e O H6S3A2 (finite) C(ε, N E, δ) = e O H6d3 The conditions (1) and (2) control two different terms in the reward estimation error presented in Theorem 8. Condition (1) shows that the small size of the expert dataset should be compensated by a larger dataset used for reward estimation and vice versa. At the same time, condition (2) requires that at least one of these datasets is large enough to overcome the sub-exponential behavior of the error in the reward estimation problem. We remark that the second part of the condition (2) N RM Cr/ε2 is unavoidable in the general case of offline learning even if the transitions are known due to a lower bound in Theorem 3 by Zhan et al. (2023a). However, as soon as the reward estimation error is small enough, we obtain the same sample complexity guarantees as in the demonstration-regularized RL (see Section 4). 6 CONCLUSION In this study, we introduced the BPI with demonstration framework and showed that demonstrationregularized RL, a widely employed technique, is not just practical but also theoretically efficient for this problem. Additionally, we proposed a novel preference-based BPI with demonstration approach, where the agent gathers demonstrations offline. Notably, we proved that a demonstrationregularized RL method can also solve this problem efficiently without explicit pessimism injection. A compelling direction for future research could involve expanding the feedback mechanism in the preference-based setting, transitioning from pairwise comparison to preference ranking (Zhu et al., 2023). Additionally, it would be interesting to explore scenarios where the assumption of a whitebox preference-based model, as proposed by Wang et al. (2023), is relaxed. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS D. Belomestny acknowledges the financial support from Deutsche Forschungsgemeinschaft (DFG), Grant Nr.497300407. The work of D. Belomestny and A. Naumov was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139. The work of D. Tiapkin has been supported by the Paris ˆIle-de-France R egion in the framework of DIM AI4IDF. Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML 04: Proceedings of the twenty-first international conference on Machine learning, pp. 1, New York, NY, USA, 2004. ACM. ISBN 1-58113-828-5. Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps. Advances in neural information processing systems, 33:20095 20107, 2020. Aymen Al Marjani, Aur elien Garivier, and Alexandre Proutiere. Navigating to the best policy in markov decision processes. Advances in Neural Information Processing Systems, 34:25852 25864, 2021. Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, and Nando de Freitas. 
Playing hard exploration games by watching youtube. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_ files/paper/2018/file/35309226eb45ec366ca86a4329a2b7c3-Paper.pdf. Mohammad Gheshlaghi Azar, Ian Osband, and R emi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017. URL https: //arxiv.org/pdf/1703.05449.pdf. Peter L Bartlett and Shahar Mendelson. Empirical minimization. Probability theory and related fields, 135(3):311 334, 2006. Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal J ozefowicz, Scott Gray, Catherine Olsson, Jakub W. Pachocki, Michael Petrov, Henrique Pond e de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning. Ar Xiv, abs/1912.06680, 2019. Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324 345, 1952. R obert Busa-Fekete, Bal azs Sz or enyi, Paul Weng, Weiwei Cheng, and Eyke H ullermeier. Preference-based reinforcement learning: Evolutionary direct policy search using a preferencebased racing algorithm. Mach. Learn., 97(3):327 351, dec 2014. ISSN 0885-6125. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8. Pedro Dalla Vecchia Chaves, Bruno L. Pereira, and Rodrygo L. T. Santos. Efficient online learning to rank for sequential music recommendation. In Proceedings of the ACM Web Conference 2022, WWW 22, pp. 2442 2450, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450390965. doi: 10.1145/3485447.3512116. URL https://doi.org/10.1145/ 3485447.3512116. Published as a conference paper at ICLR 2024 Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips. cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf. Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Neural Information Processing Systems, 2017. URL https://arxiv.org/pdf/1703.07710.pdf. Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de las Casas, Craig Donner, Leslie Fritz, Cristian Galperti, Andrea Huber, James Keeling, Maria Tsimpoukelli, Jackie Kay, Antoine Merle, Jean-Marc Moret, Seb Noury, Federico Pesamosca, David Pfau, Olivier Sauter, Cristian Sommariva, Stefano Coda, Basil Duval, Ambrogio Fasoli, Pushmeet Kohli, Koray Kavukcuoglu, Demis Hassabis, and Martin Riedmiller. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414 419, February 2022. ISSN 1476-4687. doi: 10.1038/s41586-021-04301-9. URL https://doi.org/10.1038/ s41586-021-04301-9. Omar Darwiche Domingues, Pierre M enard, Emilie Kaufmann, and Michal Valko. 
Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp. 578 598. PMLR, 2021a. Omar Darwiche Domingues, Pierre Menard, Matteo Pirotta, Emilie Kaufmann, and Michal Valko. Kernel-based reinforcement learning: A finite-time analysis. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2783 2792. PMLR, 18 24 Jul 2021b. URL https://proceedings.mlr.press/v139/domingues21a.html. Richard M Dudley. Uniform central limit theorems, volume 142. Cambridge university press, 2014. Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pp. 1578 1586. PMLR, 2018. Vinicius G. Goecks, Gregory M. Gremillion, Vernon J. Lawhern, John Valasek, and Nicholas R. Waytowich. Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments. In Proceedings of the 19th International Conference on Autonomous Agents and Multi Agent Systems, AAMAS 20, pp. 465 473, Richland, SC, 2020. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450375184. Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Menard, Remi Munos, and Michal Valko. Planning in entropy-regularized markov decision processes and games. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch e-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ 50982fb2f2cfa186d335310461dfa2be-Paper.pdf. Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep q-learning from demonstrations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI 18/IAAI 18/EAAI 18. AAAI Press, 2018. ISBN 978-1-57735-8008. Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_ files/paper/2016/file/cc7e2b878868cbae992d1fb743995d8f-Paper.pdf. Published as a conference paper at ICLR 2024 I.-A. Hosu and T. Rebedea. Playing atari games with deep reinforcement learning and human checkpoint replay. In ECAI Workshop on Evaluating General Purpose AI, 2016. Ashesh Jain, Brian Wojcik, Thorsten Joachims, and Ashutosh Saxena. Learning trajectory preferences for manipulators via iterative improvement. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/ paper_files/paper/2013/file/c058f544c737782deacefa532d9add4c-Paper.pdf. Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 99:1563 1600, 2010. URL http://www. jmlr.org/papers/volume11/jaksch10a/jaksch10a.pdf. 
Chi Jin, Zeyuan Allen-Zhu, S ebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Neural Information Processing Systems, 2018. URL https://arxiv.org/pdf/ 1807.03765.pdf. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020. Anders Jonsson, Emilie Kaufmann, Pierre M enard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253 1263, 2020. Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267 274, 2002. Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2469 2478. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr.press/v80/kang18a.html. Emilie Kaufmann, Pierre M enard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato (eds.), Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pp. 865 891. PMLR, 16 19 Mar 2021. URL https://proceedings.mlr.press/v132/kaufmann21a.html. A. S. Lakshminarayanan, S. Ozair, and Y. Bengio. Reinforcement learning with few expert demonstrations. In NIPS Workshop on Deep Learning for Action and Interaction, 2016. Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. ar Xiv preprint ar Xiv:2309.00267, 2023. Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas K oppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane, Thomas Hubert, Julian Schrittwieser, Demis Hassabis, Pushmeet Kohli, Martin Riedmiller, Oriol Vinyals, and David Silver. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257 263, Jun 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06004-9. URL https://doi.org/10.1038/s41586-023-06004-9. Pierre M enard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7599 7608. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/v139/menard21a.html. Published as a conference paper at ICLR 2024 Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop. 2013. Ashvin Nair, Bob Mc Grew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. 
In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE Press, 2018. doi: 10.1109/ICRA.2018.8463162. URL https://doi.org/10.1109/ICRA.2018.8463162.

Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. CoRR, abs/2006.09359, 2020. URL https://arxiv.org/abs/2006.09359.

Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. CoRR, abs/1705.07798, 2017. URL http://arxiv.org/abs/1705.07798.

Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Pat Langley (ed.), ICML, pp. 663–670. Morgan Kaufmann, 2000. ISBN 1-55860-707-2. URL http://dblp.uni-trier.de/db/conf/icml/icml2000.html#NgR00.

Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick. Dueling posterior sampling for preference-based reinforcement learning. In Jonas Peters and David Sontag (eds.), Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 of Proceedings of Machine Learning Research, pp. 1029–1038. PMLR, 03–06 Aug 2020. URL https://proceedings.mlr.press/v124/novoseller20a.html.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Karl Pertsch, Youngwoon Lee, Yue Wu, and Joseph J. Lim. Demonstration-guided reinforcement learning with learned skills. In 5th Conference on Robot Learning, 2021.

Dean Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In David S. Touretzky (ed.), Advances in Neural Information Processing Systems 1 (NIPS 1988, Denver, Colorado, USA), pp. 305–313. Morgan Kaufmann, 1988. URL http://papers.nips.cc/paper/95-alvinn-an-autonomous-land-vehicle-in-a-neural-network.

Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 2914–2924. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1e7875cf32d306989d80c14308f3a099-Paper.pdf.

Nived Rajaraman, Yanjun Han, Lin Yang, Jingbo Liu, Jiantao Jiao, and Kannan Ramchandran. On the value of interaction and function approximation in imitation learning. Advances in Neural Information Processing Systems, 34:1325–1336, 2021.

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.

Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 11702–11716. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/60ce36723c17bbac504f2ef4c8a46995-Paper.pdf.

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661–668. JMLR Workshop and Conference Proceedings, 2010.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp. 627–635, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. URL https://proceedings.mlr.press/v15/ross11a.html.

Aadirupa Saha, Aldo Pacchiano, and Jonathan Lee. Dueling RL: reinforcement learning with trajectory preferences. In Francisco J. R. Ruiz, Jennifer G. Dy, and Jan-Willem van de Meent (eds.), International Conference on Artificial Intelligence and Statistics, 25–27 April 2023, Palau de Congressos, Valencia, Spain, volume 206 of Proceedings of Machine Learning Research, pp. 6263–6289. PMLR, 2023. URL https://proceedings.mlr.press/v206/saha23a.html.

Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016. doi: 10.1109/TIT.2016.2603151.

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, and Yuejie Chi. Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 19967–20025. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/shi22c.html.

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362, 2018.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3008–3021. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf.

R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, pp. 770–805, 2018.

Jérôme Taupin, Yassir Jedra, and Alexandre Proutiere. Best policy identification in discounted linear MDPs. In Sixteenth European Workshop on Reinforcement Learning, 2023. URL https://openreview.net/forum?id=SOCOgATRQiY.

Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Rémi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, and Pierre Ménard. Fast rates for maximum entropy exploration. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 34161–34221. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/tiapkin23a.html.

A.B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008. ISBN 9780387790527. URL https://doi.org/10.1007/b13794.

Sara A. van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.

Dirk van der Hoeven, Nikita Zhivotovskiy, and Nicolò Cesa-Bianchi. High-probability risk bounds via sequential predictors, 2023.

Ramon van Handel. Probability in high dimensions. 2016. URL https://web.math.princeton.edu/~rvan/APC550.pdf.

Matej Večerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin A. Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. CoRR, abs/1707.08817, 2017. URL http://arxiv.org/abs/1707.08817.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: An analysis of KL regularization in reinforcement learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

Yuanhao Wang, Qinghua Liu, and Chi Jin. Is RLHF more difficult than standard RL? CoRR, abs/2306.14111, 2023. doi: 10.48550/arXiv.2306.14111. URL https://doi.org/10.48550/arXiv.2306.14111.

Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. J. Mach. Learn. Res., 18(1):4945–4990, Jan 2017. ISSN 1532-4435.

Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 27395–27407. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/e61eaa38aed621dd776d0e67cfeee366-Paper.pdf.

Yichong Xu, Ruosong Wang, Lin F. Yang, Aarti Singh, and Artur Dubrawski. Preference-based reinforcement learning with finite-time guarantees. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

Yuhong Yang and A.R. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 44(1):95–116, 1998. doi: 10.1109/18.650993.

Ming Yin, Yu Bai, and Yu-Xiang Wang. Near-optimal offline reinforcement learning via double variance reduction. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 7677–7688. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/3f24bb08a5741e4197af64e1f93a5029-Paper.pdf.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019a.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019b. URL https://arxiv.org/pdf/1901.00210.pdf.

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Provable offline reinforcement learning with human feedback. arXiv preprint arXiv:2305.14816, 2023a.

Wenhao Zhan, Masatoshi Uehara, Wen Sun, and Jason D. Lee. How to query human feedback efficiently in RL? CoRR, abs/2305.18505, 2023b. doi: 10.48550/arXiv.2305.18505. URL https://doi.org/10.48550/arXiv.2305.18505.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.

Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006. doi: 10.1214/009053606000000704. URL https://doi.org/10.1214/009053606000000704.

Y. Zhao, D. Zeng, M. A. Socinski, and M. R. Kosorok. Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics, 67(4):1422–1433, 2011.

Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 43037–43067. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/zhu23f.html.

Yuke Zhu, Ziyu Wang, Josh Merel, Andrei A. Rusu, Tom Erez, Serkan Cabi, Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas, and Nicolas Heess. Reinforcement and imitation learning for diverse visuomotor skills. CoRR, abs/1802.09564, 2018. URL http://arxiv.org/abs/1802.09564.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. CoRR, abs/1909.08593, 2019. URL http://arxiv.org/abs/1909.08593.

Table of Contents
A Notation
B Behavior cloning
  B.1 Proof for General setting
  B.2 Proofs for Finite setting
  B.3 Proofs for Linear setting
  B.4 Concentration Results
  B.5 Proof of Lower Bounds
  B.6 Imitation Learning Guarantees
C Proof for Demonstration-regularized RL
D Best Policy Identification in Regularized Finite MDPs
  D.1 Preliminaries
  D.2 Algorithm Description
  D.3 Concentration Events
  D.4 Confidence Intervals
  D.5 Sample Complexity Bounds
E Best Policy Identification in Regularized Linear MDPs
  E.1 General Properties of Linear MDPs
  E.2 Algorithm Description
  E.3 Concentration Events
  E.4 Confidence Intervals
  E.5 Sample Complexity Bounds
F Demonstration-Regularized Preference-Based Learning
  F.1 Maximum Likelihood Estimation for Reward Model
  F.2 Properties of Bracketing Numbers
  F.3 Proof for Demonstration-regularized RLHF
G Deviation Inequalities
  G.1 Deviation inequality for categorical distributions
  G.2 Deviation inequality for sequences of Bernoulli random variables
  G.3 Deviation inequality for bounded distributions
  G.4 Deviation inequality for vector-valued self-normalized processes
  G.5 Deviation inequality for sample covariance matrices
H Technical Lemmas
  H.1 Counts to pseudo-counts
  H.2 Counts to pseudo-counts in linear MDPs
  H.3 On the Bernstein inequality
  H.4 Change of policy

A NOTATION

Table 1: Table of notation used throughout the paper.

S: state space of size S
A: action space of size A
d: dimension of the linear MDP
H: length of one episode
s_1: initial state
ι: stopping time
T: trajectory space, T = (S × A)^H
ε: desired accuracy of solving the problem
δ: desired upper bound on the failure probability
p_h(s'|s, a): probability transition
r_h(s, a): reward function
V^π_h, V*_h: value of policy π and optimal value
Q^π_h, Q*_h: Q-value of policy π and optimal Q-value
V^π_{π̃,λ,h}, V*_{π̃,λ,h}: regularized value of policy π and optimal regularized value
Q^π_{π̃,λ,h}, Q*_{π̃,λ,h}: regularized Q-value of policy π and optimal regularized Q-value
Π, Π_h: space of all policies and space of policies at step h
Π_γ, Π_{h,γ}: space of all policies and policies at step h with minimal probability γ
D^E: expert dataset of size N^E: D^E = {τ_i = (s^i_1, a^i_1, ..., s^i_H, a^i_H), i ∈ [N^E]}
π^E: expert policy
π^{E,κ}: κ-greedy version of the expert policy
ε^E: sub-optimality gap of the expert policy: V*_1(s_1) − V^{π^E}_1(s_1) ≤ ε^E
π^{BC}: behavior cloning policy
R_h: regularizer for behavior cloning
F: class of policies for behavior cloning
d_F: covering dimension of the one-step policy class for behavior cloning
s^t_h: state visited at step h during episode t
a^t_h: action picked at step h during episode t
r*: true reward function in a preference-based model
σ: link function, see Assumption 4
ζ: linearity measure of the link function, see Assumption 4
π^S: sampling policy for generating the preference dataset
D^{RM}: preference dataset of size N^{RM}: D^{RM} = {(τ^k_0, τ^k_1, o^k)}
C_r(G, π^E, π^{BC}): single-policy concentrability coefficient, see (20)
G: class of trajectory rewards for reward modeling
d_G: bracketing dimension of the induced preference models
π^{RL}: policy for BPI with demonstrations
π^{RLHF}: policy for preference-based BPI with demonstrations
C(ε, N^E, δ): sample complexity for BPI with demonstrations
C*(ε, λ, δ): sample complexity for regularized BPI
R_+: non-negative real numbers
N_+: positive natural numbers
[n]: the set {1, 2, ..., n}
e: Euler's number
Δ_d: (d−1)-dimensional probability simplex: Δ_d = {x ∈ R^d_+ : Σ_{j=1}^d x_j = 1}
Δ_X: set of distributions over a finite set X: Δ_X ≅ Δ_{|X|}
clip(x, m, M): clipping procedure, clip(x, m, M) = max(min(x, M), m)

Let (X, 𝒳) be a measurable space and P(X) be the set of all probability measures on this space. For p ∈ P(X), we denote by E_p the expectation w.r.t. p. For a random mapping ξ: X → R, the notation ξ ∼ p means Law(ξ) = p. For any measures p, q ∈ P(X), we denote their product measure by p ⊗ q. We also write E_{ξ∼p} instead of E_p. For any p, q ∈ P(X), the Kullback-Leibler divergence between p and q is given by

KL(p, q) = E_p[log(dp/dq)] if p ≪ q, and +∞ otherwise.

For any p ∈ P(X) and f: X → R, we denote pf = E_p[f]. In particular, for any p ∈ Δ_d and f: {0, ..., d} → R, we use pf = Σ_{ℓ=0}^d f(ℓ)p(ℓ). Define Var_p(f) = E_{s'∼p}[(f(s') − pf)^2] = p[f^2] − (pf)^2. For any (s, a) ∈ S × A, transition kernel p(s, a) ∈ P(S) and f: S → R, define pf(s, a) = E_{p(s,a)}[f] and Var_p[f](s, a) = Var_{p(s,a)}[f]. For any s ∈ S, policy π(s) ∈ P(A) and f: S × A → R, set πf(s) = E_{a∼π(s)}[f(s, a)] and Var_π f(s) = Var_{a∼π(s)}[f(s, a)]. For an MDP M, a policy π and a sequence of functions (f_h, h ∈ [H]), define E_π[Σ_{h'=h}^H f(s_{h'}, a_{h'}) | s_h] as the conditional expectation of Σ_{h'=h}^H f(s_{h'}, a_{h'}) with respect to the sigma-algebra F_h = σ{(s_{h'}, a_{h'}) | h' ≤ h}, where for any h ∈ [H] we have a_h ∼ π(s_h) and s_{h+1} ∼ p_h(s_h, a_h). We define the trajectory KL-divergence between two policies π = {π_h}_{h∈[H]} and π' = {π'_h}_{h∈[H]} as

KL_traj(π, π') = E_π[ Σ_{h=1}^H KL(π_h(s_h), π'_h(s_h)) ].

We write f(S, A, H, ε, δ) = O(g(S, A, H, ε, δ)) if there exist S_0, A_0, H_0, ε_0, δ_0 and a constant C_{f,g} such that for any S ≥ S_0, A ≥ A_0, H ≥ H_0, ε < ε_0, δ < δ_0, it holds f(S, A, H, ε, δ) ≤ C_{f,g} · g(S, A, H, ε, δ). We write f(S, A, H, ε, δ) = Õ(g(S, A, H, ε, δ)) if C_{f,g} in the previous definition is poly-logarithmic in S, A, H, 1/ε, 1/δ.

For any symmetric positive definite matrix A, we define the corresponding A-scalar product and A-norm as

⟨x, y⟩_A = ⟨x, Ay⟩, ‖x‖_A = √⟨x, x⟩_A.

Notice that if ‖A‖_2 ≤ c, then ‖x‖_A ≤ √c ‖x‖_2.

Coverings, packings, and bracketings. A pair (X, ρ) is called a pseudometric space with a metric ρ: X × X → R_+ if ρ satisfies ρ(x, x) = 0 for all x ∈ X, ρ is symmetric, that is, ρ(x, y) = ρ(y, x) for all x, y ∈ X, and ρ satisfies the triangle inequality ρ(x, y) + ρ(y, z) ≥ ρ(x, z) for all x, y, z.

Definition 6 (ε-covering and packing). Let (X, ρ) be a (pseudo)metric space with a metric ρ: X × X → R_+.
The ε-covering number N(ε, X, ρ) is the size of the minimal ε-cover of (X, ρ), that is, N(ε, X, ρ) = min X X{|X| : y X x X : ρ(y, x) ε}. The ε-packing number P(ε, X, ρ) is the size of the maximal ε-separated set of (X, ρ), P(ε, X, ρ) = max X X{|X| : x = y X : ρ(x, y) > ε}. Definition 7 (ε-bracketing). Let F : X R be a function class endowed with a norm . Given two functions ℓ, u: X R, a bracket [ℓ, u] is a set of all functions f F such that ℓ(x) f(x) u(x) for all x X. A ε-bracket is a bracket [ℓ, u] such that ℓ u ε. The ε-bracketing number N[](ε, F, ) is the cardinality of the minimal set of ε-brackets needed to cover F, N[](F, ) = min N {|N| | f F [ℓ, u] N : ℓ(x) f(x) u(x) x X, ℓ u ε}. Published as a conference paper at ICLR 2024 B BEHAVIOR CLONING In this appendix, we gather the proofs of the results for behavior cloning presented in Section 3. B.1 PROOF FOR GENERAL SETTING In this appendix, we provide the proof of Theorem 1. Theorem (Restatement of Theorem 1). Assume Assumptions 1-2 and that 0 Rh(πh) M for all h [H], for any policy π Fh. Let πBC be a solution to (1). Then with probability at least 1 δ the behavior policy πBC satisfies KLtraj(πE πBC) 6d FH (log(Ae3/(Aγ κ)) log(2HN ERF/(γδ)) Proof. We commence by defining the one-step trajectory KL-divergence as follows: KLtraj(πE h πBC h ) = EπE log πE h (ah|sh) πBC h (ah|sh) In particular, by the linearity of expectation, the following holds KLtraj(πE πBC) = h=1 KLtraj(πE h πBC h ). Recall the definition of the κ-greedy version of the expert policy πE,κ h (a|s) = (1 κ)πE h (a|s) + κ A = (1 κ) πE h (a|s) + κ (1 κ)A Next, we can decompose the one-step trajectory KL-divergence as follows KLtraj(πE h πBC h ) = EπE πE,κ h (ah|sh) πBC h (ah|sh) πE h (ah|sh) πE,κ h (ah|sh) For the second term, we have πE h (ah|sh) πE,κ h (ah|sh) log πE h (ah|sh) πE h (ah|sh) + κ/(A(1 κ)) log(1 κ) κ 1 κ, where the last inequality follows from the fact that (1 x) log(1 x) x for any x < 1, by convexity of the function x 7 x log x. Next, we decompose the smoothed version of the one-step trajectory KL to the sum of stochastic and empirical terms, πE,κ h (ah|sh) πBC h (ah|sh) πE,κ h (ah|sh) πBC h (ah|sh) πE,κ h (at h|st h) πBC h (at h|st h) πE,κ h (at h|st h) πBC h (at h|st h) To upper bound the first term we apply Lemma 3 and obtain with probability at least 1 δ πE,κ h (ah|sh) πBC h (ah|sh) πE,κ h (at h|st h) πBC h (at h|st h) 2 log(e2/γ) KLtraj(πE h πBC h ) d(log(2N ERF/γ) + log(1/δ)) + 5(log(Ae3/(Aγ κ)) d F(log(2N ERF/γ) + log(1/δ))) 3N E + 8κ 1 κ. Published as a conference paper at ICLR 2024 To control the second term, we first notice that since F has a product structure, then by a simple observation {πh}H h=1 = arg min π1 F1,...,πH FH h=1 Lh(πh) h [H] : πh = arg min πh Fh Lh(πh) for any functions {Lh}H h=1, the MLE estimation (1) implies πBC h arg min πh Fh i=1 log 1 πh(ai h|si h) + Rh(πh), therefore the following holds πE,κ h (at h|st h) πBC h (at h|st h) t=1 log 1 πBC h (at h|st h) + Rh(πBC h ) πE,κ h (at h|st h) + Rh(πE,κ h ) Thus, we have KLtraj(πE h πBC h ) 2 log(e2/γ) KLtraj(πE h πBC h ) d F(log(2N ERF/γ) + log(1/δ)) + 5(log(Ae3/(Aγ κ)) d F(log(2N ERF/γ) + log(1/δ))) N E + 9κ 1 κ. This means that q KLtraj(πE h πBC h ) satisfies a quadratic inequality of the form x2 ax + b. Since ax (a2 + x2)/2, we further have x2 a2 + 2b. As a result KLtraj(πE h πBC h ) 6d F log(Ae3/(Aγ κ)) log(2N ERF/(γδ)) To conclude the statement, we apply a union bound over h [H] and sum over the final upper bound. 
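As an illustration of the penalized maximum likelihood estimator (1) analyzed above, the sketch below computes π^BC in the tabular instantiation used in Appendix B.2, i.e. with the logarithmic regularizer R_h(π_h) = Σ_{s,a} log(1/π_h(a|s)) and the policy floor 1/(N^E + A). For each (h, s), the penalized objective becomes −Σ_a (n_h(s, a) + 1) log π_h(a|s), whose unconstrained minimizer is add-one (Laplace) smoothing of the empirical expert policy; since n_h(s) ≤ N^E, this minimizer already respects the floor, so the constraint is inactive. The code is a minimal illustrative sketch under these assumptions (array layout and function name are ours, not part of the paper).

```python
import numpy as np

def tabular_bc_policy(expert_states, expert_actions, S, A, H):
    """Behavior cloning for finite MDPs: penalized MLE (1) with the logarithmic
    regularizer, which reduces per state to add-one (Laplace) smoothing.

    expert_states, expert_actions: integer arrays of shape (N_E, H) from D^E.
    Returns pi_bc of shape (H, S, A) with pi_bc[h, s, a] >= 1 / (N_E + A).
    """
    N_E = expert_states.shape[0]
    counts = np.zeros((H, S, A))
    for i in range(N_E):
        for h in range(H):
            counts[h, expert_states[i, h], expert_actions[i, h]] += 1.0
    # Minimizing sum_t -log pi(a_t|s_t) + sum_{s,a} log(1/pi(a|s)) state by state
    # gives pi(a|s) proportional to counts(s, a) + 1, i.e. Laplace smoothing;
    # unvisited states receive the uniform policy.
    pi_bc = (counts + 1.0) / (counts.sum(axis=2, keepdims=True) + A)
    return pi_bc
```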
B.2 PROOFS FOR FINITE SETTING We recall that for finite MDPs we chose a logarithmic regularizer Rh(πh) = P s,a log(1/πh(a|s)) and the policy class F = {π Π : πh(a|s) 1/(N E + A)}. One can check that Assumptions 1-2 holds for these choices and that 0 Rh(πh) SA log(A). Then we can apply Theorem 1 to obtain the following bound for finite MDPs. Corollary (Restatement of Corollary 1). For all N E A, for function class F and regularizer (Rh)h [H] defined above, with probability at least 1 δ, KLtraj(πE πBC) 6SAH log(2e4N E) log(12H(N E)2/δ) Proof. Let us start from a simple observation that Fh S A is a subset of a unit ball in ℓ -norm. Therefore by a standard result in the covering numbers for finite-dimensional Banach spaces (see i.e. Problem 5.5 by van Handel (2016)) log N(ε, Fh, ) SA log(3/ε). Thus, the parametric classes {Fh}h [H] satisfies Assumption 1 with constants d F = SA, RF = 3, γ = 1/(N E + A). Next we notice that for any expert policy, Assumption 2 is satisfied with κ = A/(N E + A) for this parametric family. Thus, we can apply Theorem 1 and get KLtraj(πE πBC) 6SAH log((N E + A)e3) log(6HN E(N E + A)/δ) + 2SAH log(A) By upper bounding the first and the second terms under the assumption N E A we conclude the statement. Published as a conference paper at ICLR 2024 B.3 PROOFS FOR LINEAR SETTING We start from a natural example when Assumption 3 is fulfilled and the sub-optimality error εE is small. Lemma 1. Assume that the MDP M is linear (see Definition 2) and consider the regularized MDP with uniform policy eπ(a|s) = Unif[A] and with a coefficient λ (see Appendix E for more exposition). Then the optimal regularied policy π eπ,λ,h satisfies Assumption 3 with a constant R = H d/λ. Moreover, this policy is λH log(A)-optimal. Proof. At first, by Proposition 2, it holds that for an optimal policy π eπ,λ,h there are some weights w h such that Q h(s, a) = ψ(s, a), w h and, moreover, w h H Then we notice that from the regularized Bellman equations it holds π eπ,λ,h(a|s) = arg max π πQ eπ,λ,h(s) λ KL(π Unif[A]) = arg max π (s) KL(π Unif[A]) = exp ψ(s, a), 1 Therefore, Assumption 3 is satisfied with R = H d/λ for πE = π λ. To verify the suboptimality of this policy, we notice that π eπ,λ satisfies π eπ,λ = arg max π Π {V π 1 (s1) λ KLtraj(π eπ)}, V 1 (s1) λ KLtraj(π eπ)} V π eπ,λ 1 (s1) λ KLtraj(π eπ,λ Unif[A]) V 1 (s1) V π eπ,λ 1 (s1) λH log(A). Next, we provide the result for linear MDPs under Assumption 3, using the parametric assumption given in (2). Corollary (Restatement of Corollary 2). Under Assumption 3, for all N E A, for the function class F defined in (2) and regularizer Rh = 0, for all h [H], with probability at least 1 δ, KLtraj(πE πBC) 6d H log(2e3N E) log(48H(N E)2R/δ) Proof. We start by checking that Assumption 1 holds. By construction of Fh in (2), we have inf πh Fh inf (s,a) S A πh(a|s) 1 N E + A. Next, we have to consider the covering dimension of the hypothesis set. First, we notice that for any two policies πh, µh Fh we have |πh(a|s) µh(a|s)| = |(1 κ)π h(a|s) (1 κ)µ h(a|s)| = (1 κ)|π h(a|s) µ h(a|s)|, where π h, µ h F h for F h defined as follows F h = πh(a|s) = exp(ψ(s, a)Twh) P a A exp(ψ(s, a )Twh) : wh 2 R . Thus, it is sufficient to compute the covering number for F h. Let us define Φ(a|s, wh) = exp{ ψ(s, a), wh }, Z(s, wh) = X a A Φ(a|s, wh). Published as a conference paper at ICLR 2024 Then let wh, w h be weight vectors that correspond to π h and µ h respectively. 
Then |π h(a|s) µ h(a|s)| = Φ(a|s, wh) Z(s, wh) Φ(a|s, w h) Z(s, w h) = Φ(a|s, wh) Φ(a|s, w h) Z(s, wh) Φ(a|s, w h) 1 Z(s, w h) 1 Z(s, wh) 1 Φ(a|s, w h) Φ(a|s, wh) + Φ(a|s, w h) Z(s, w h) 1 Z(s, w h) Z(s, wh) Next, we analyze both terms separately. For the first term, we have 1 Φ(a|s, w h) Φ(a|s, wh) = |1 exp{ ψ(s, a), w h wh }|. We notice that the absolute value of the expression under exponent is upper-bounded by wh w h 2. Let us assume that wh w h 2 1, then by the inequality |1 ex| 2|x| for any |x| 1, we have 1 Φ(a|s, w h) Φ(a|s, wh) 2 wh w h 2. (3) For the second term, we have by the definition of the normalization constant 1 Z(s, w h) Z(s, wh) a Φ(a |s, wh)[1 Φ(a |s, w h)/Φ(a |s, wh)] Z(s, wh) a Φ(a |s, wh) |1 Φ(a |s, w h)/Φ(a |s, wh)| Z(s, wh) 2 wh w h 2, where in the end we applied (3). Finally, we have, for any policies π h and µ h such that the corresponding weights wh and w h satisfies wh w h 2 1, that |πh(a|s) µh(a|s)| |π h(a|s) µ h(a|s)| 4 wh w h 2. (4) Now we construct an ε-net for ε (0, 1). Let Nε/4(W, 2) be a ε/4-net in the space of weights W = {wh Rd : wh 2 R}. It satisfies (see i.e. van Handel (2016)) log N(ε/4, W, 2)| d log(12R/ε). Next, we show that policies with weights that correspond to a covering of size N(ε/4, Wh, 2) forms an ε-net in Fh. Let πh Fh be an arbitrary policy with parameter wh. Let w h be in the covering of size N(ε/4, W, 2) be a parameter that satisfies wh w h 2 ε/4 1. Let us fix µh as a policy that corresponds to w h. Since wh w h 2 1, (4) is applicable. Thus πh µh = sup (s,a) S A |πh(a|s) µh(a|s)| 4 wh w h 2 ε. Therefore, policies that correspond to an ε/4-net in wh form an ε-net in F and we have an upper bound on the size of the ε-net. As a result, Fh satisfies Assumption 1 with a dimension d F = d, a scaling factor RF = 12R and γ = 1/(N E + A). Additionally, by construction of Fh and Assumption 3, the last Assumption 2 holds with κ = A/(N E + A). Therefore, we can apply Theorem 1 and obtain with probability at least 1 δ KLtraj(πE πBC) 6d H (log(e3(N E + A)) (log(24HN E(N E + A)R/δ))) Using of A N E concludes the statement. B.4 CONCENTRATION RESULTS In this section, we state important results on the concentration of the stochastic error for the risk estimates. Recall the definition of the κ-greedy version of the expert policy as follows πE,κ h (a|s) = (1 κ)πE h (a|s) + κ A = (1 κ) πE h (a|s) + κ (1 κ)A Published as a conference paper at ICLR 2024 Lemma 2. Let πE be a fixed expert policy. Let (st h, at h)N t=1 be an i.i.d. sequence of state-action pairs generated by following the policy πE at step h. For γ (0, 1/A) let π a policy such that for all (s, a) S A it holds πh(a|s) γ. Then for any δ (0, 1) and any κ < 1/2 with probability at least 1 δ 1 N πE h,σ(at h|st h) πh(at h|st h) πE,κ h (ah|sh) πh(ah|sh) 2 log(e2/γ) KLtraj(πE h |πh) log(2/δ) N + 2 log(Ae3/(Aγ κ) log(2/δ) 3N + 5κ 1 κ. Proof. As a first step, we can apply Bernstein inequality πE,κ h (at h|st h) πh(at h|st h) πE,κ h (ah|sh) πh(ah|sh) v u u u t2VarπE log πE,κ h (ah|sh) πh(ah|sh) + 2(log(1/γ) log(A/κ) log(2/δ) Next, we want to upper bound a variance in terms of KLtraj(πE h πh). We start from a bound of square root variance in terms of the second moment and Minkowski inequality v u u t VarπE πE,κ h (ah|sh) πh(ah|sh) v u u t EπE πE,κ h (ah|sh) πh(ah|sh) v u u t EπE log πE h (a|s) πh(a|s) πE,κ h (a|s) πE h (a|s) v u u t EπE " log πE h (ah|sh) πh(ah|sh) v u u t EπE " log 1 + κ (1 κ)AπE h (a|s) + log(1 κ) 2# Term (A). The result below directly follows from Lemma 4 of Yang & Barron (1998). 
However, for completeness, we prove it here. First, we notice that a A πE h (a|sh) log2 πE h (a|sh) πh(a|sh) To analyze this term, we define an f-divergence (Sason & Verd u, 2016) for a function f as follows Df(πE h (s) πh(s)) = X a A f πE h (a|s) πh(a|s) In particular, KL(πE h (s), πh(s)) = Dg(πE h (s) πh(s)) for g(t) = t log t + (1 t) and, moreover for f(t) = t log2(t) Df(πE h (s) πh(s)) = X a A πE h (a|s) log2 πE h (a|s) πh(a|s) Then we notice that g and f are non-negative function and, moreover, its argument t takes values in (0, 1) (1, 1/γ] since for t = 1 both functions are zero. First, we analyze the ratio for f and g for any t (0, 1) r(t) = f(t) g(t) = t log2(t) t log t + (1 t). Published as a conference paper at ICLR 2024 To bound this function, let us prove that is monotone for all t (0, 1) 0 z }| { log(t) 0 z }| { ((t + 1) log(t) + 2(1 t)) (t log t + (1 t))2 0. Thus for any t (0, 1) r(t) lim t 1 t log2(t) t log t + (1 t) = 2. Next, we analyze the segment t (1, 1/γ]. r(t) = t log2(t) t log t + (1 t) log(t) + 2 log(1/γ) + 2 = log(e2/γ). since t log2(t) t log2(t)+(1 t) log t+2t log(t)+2(1 t) (t+1) log(t)+2(1 t) 0 t > 1. Therefore we have for any t (0, 1) (1, 1/γ] and as a simple corollary f(t) log(e2/γ) g(t) Df(πE h (s) πh(s)) log(e2/γ) KL(πE h (s) πh(s)). Finally, we have (A) log(e2/γ)EπE KL(πE h (sh), πh(sh)) = log(e2/γ) KLtraj(πE h πh). Term (B). We can rewrite this term as follows using inequality (a + b)2 2a2 + 2b2 a:πE h (a|sh)>0 πE h (a|sh) log2 1 + κ (1 κ)AπE h (ah|sh) Next we analyze the function g(x) = x log2(1 + ε/x). Its derivative is equal to g (x) = log 1 + ε Since ε > 0. we can define x as a root of equation g (x) = 0 for x > 0. Notice that it will be maximum of g(x), thus for ε > 0 g(x) g(x ) = x log 1 + ε (x + ε)2 4ε2 (B) 8κ 1 κ + 2 κ 1 κ Final bound on variance Combining these two bounds, we have v u u t VarπE πE,κ h (ah|sh) πh(ah|sh) log(e2/γ) KLtraj(πE h πh) + p 8κ/(1 κ) + 2κ2/(1 κ)2. Using this bound on variance, we can bound the main stochastic term v u u t2VarπE h log πE,κ h (ah|sh) πh(ah|sh) i log(2/δ) 2 log(e2/γ) KLtraj(πE h πh) log(2/δ) N (4κ/(1 κ) + κ2/(1 κ)2) log(2/δ) Next, we use an inequality 2ab a2 + b2 to obtain (4κ/(1 κ) + κ2/(1 κ)2) log(2/δ) N (4κ/(1 κ) + κ2/(1 κ)2) + 2 log(2/δ) Finally, since k/(1 κ) 1 we conclude the statement. Published as a conference paper at ICLR 2024 Next we define Πγ,h a set of all one-step policies πh(a|s) such that inf (s,a) S A πh(a|s) γ. This set forms a metric space with a metric induced by ℓ -norm πh π h = sup (s,a) S A |πh(a|s) π h(a|s)|. Lemma 3. Let F be sub-space of Πγ,h with an induced metric, such that it satisfies for all ε (0, 1) log |N(ε, F, )| d log(R/ε). for some positive constants R, d > 0. Then with probability at least 1 δ the following holds for all πh F simultaneously 1 N πE,κ h (at h|st h) πh(at h|st h) πE,κ h (ah|sh) πh(ah|sh) 2 log(e2/γ) KLtraj(πE h πh) d(log(2NR/γ) + log(1/δ)) + 5(log(Ae3/(γ σ)) d(log(2NR/γ) + log(1/δ))) 3N + 8κ 1 κ. Proof. Let Nε be a minimal ε-net of F for ε that will be specified later. Combining Lemma 2 with a union bound over Nε we have for any π h Nε with probability at least 1 δ 1 N πE,κ h (at h|st h) π h(at h|st h) πE,κ h (ah|sh) π h(ah|sh) 2 log(e2/γ) KLtraj(πE h π h) d log(2R/(εδ)) N + 2(log(Ae3/(Aγ κ)) d log(2R/(εδ)) 3N + 5κ 1 κ. Next, we select an arbitrary policy πh F and let π h Nε be ε-close policy to πh. 
Then 1 N πE,κ h (at h|st h) πh(at h|st h) πE,κ h (ah|sh) πh(ah|sh) πE,κ h (at h|st h) π h(at h|st h) πE,κ h (ah|sh) π h(ah|sh) log π h(at h|st h) πh(at h|st h) EπE log π h(ah|sh) πh(ah|sh) We start from bounding the second term, which could be done as follows 1 N log π h(at h|st h) πh(at h|st h) EπE log π h(ah|sh) πh(ah|sh) log π h(a|s) πh(a|s) Next, we use simple inequalities log π h(a|s) πh(a|s) = log 1 + π h(a|s) πh(a|s) |π h(a|s) πh(a|s)| and, in the opposite direction, we can use the same reasoning log π h(a|s) πh(a|s) = log πh(a|s) Published as a conference paper at ICLR 2024 Thus, applying (5) for the first term we obtain πE,κ h (at h|st h) πh(at h|st h) πE,κ h (ah|sh) πh(ah|sh) 2 log(e2/γ) KLtraj(πE h π h) d log(2R/(εδ)) N + 2(log(Ae3/(Aγ κ)) d log(2R/(εδ)) 3N + 5κ 1 κ + 2ε/γ. Next, we use a similar inequality to obtain KLtraj(πE h π h) = EπE log πE h (ah|sh) π h(ah|sh) = KLtraj(πE h πh) + EπE log πh(ah|sh) KLtraj(πE h πh) + ε Finally, applying inequalities b and 2ab a2 + b2 πE,κ h (at h|st h) πh(at h|st h) πE,κ h (ah|sh) πh(ah|sh) 2d log(e2/γ) KLtraj(πE h πh) log(2R/(εδ)) N + d log(e2/γ) log(2R/(εδ)) + 2(log(Ae3/(Aγ κ)) d log(2R/(εδ)) 3N + 5κ 1 κ + 2ε/γ. By rearranging the terms and taking ε = γ κ/(1 κ) we conclude the statement. B.5 PROOF OF LOWER BOUNDS B.5.1 GENERAL SETUP In this section we provide a lower bound on estimation in KL-divergence using a framework of Chapter 2 by Tsybakov (2008). Our goal is to obtain a lower bound on minimax risk that is defined as follows inf bπ sup π F Eτ1,...,τN π[KLtraj(π bπ)], where infimum is taken over all estimators that map the sampled trajectories (τ1, . . . , τN) to a policy from the hypothesis class F F1 . . . FH. Let us consider a specific type of MDPs where the transition kernel ph(s, a) does not depend on a state-action pair (s, a): (s, a, h) S A [H], A FS : ph(A|s, a) = µh(A) for fixed measures µh. In particular, for h = 1 we always have µh = δs1 is a Dirac measure at initial state s1. Then we define over the space of all policies the following specific distance defined through the Hellinger distance ρh(πh, π h) = q Es µh[d2 H(πh, π h)], d2 H(πh, π h) = X Published as a conference paper at ICLR 2024 and we define the following distance for the space of full policies (the triangle inequality follows from Minkowski inequality) h=1 ρ2 h(πh, π h). (6) Next we impose the following metric-specific assumption for our hypothesis classes Assumption 5. For all h [H] a of the function class Fh with respect to the metric ρh satisfies ε (0, 1) : log P(ε, Fh, ρh) dh log(R/ε) for constants dh 0 and R > 0. In particular, Lemma 6 implies that log P(ε, F, ρ) h=1 dh log(R/ε). Theorem 4. Let Assumption 5 holds and let us define D = PH h=1 dh. Also we assume that for any π F it holds πh(s, a) γ for γ (0, 1/A). Let us assume D 5 and n e2D/R2. Then the following minimax lower bound holds min bπ max π F E[KLtraj(π bπ)] D 16N log(e2/γ). Proof. First, we notice by the first part of Lemma 7 min bπ max π F E log(e2/γ)N D KLtraj(π bπ) min bπ max π F E log(e2/γ)N D ρ2(π, bπ) , where the expectation is taken with respect to a sample τ1, . . . , τN. Next, we can follow the general reduction scheme, see Chapter 2.2 by Tsybakov (2008). By Markov inequality min bπ max π F E n log(e2/γ) D ρ2(π, bπ) 1 4 min bπ max π F P D 4N log(e2/γ) Next, we use the reduction to a finite hypothesis class. Define ζ = p D/(4 log(e2/γ)N) and take P is a maximal ζ-separated set of size M + 1 = P(ζ, F, ρ) and enumerate all the policies in it as π0, . . . , πM. 
Therefore we obtain min bπ max π F Pτ1,...,τN π D 4N log(e2/γ) min bπ max j {0,...,M} Pτ1,...,τN πj D 4N log(e2/γ) Let us define ψ = arg minj=0,...,M ρ(πj, ˆπ). Then we have that if ψ = j, then 2ρ(πj, bπ) ρ(πj, bπ) + ρ(πψ , bπ) ρ(πj, πψ ). Since j = π , then by definition of ζ-separable set we have ρ(πj, bπ) ζ/2. As a result min bπ max j {0,...,M} Pτ1,...,τN πj D 4N log(e2/γ) min bπ max j {0,...,M} Pτ1,...,τN πj[ψ = j]. Published as a conference paper at ICLR 2024 Finally, taking infimum over all hypothesis tests we obtain min bπ max π F P D 4N log(e2/γ) inf ψ max j=0,...,M Pj[ψ = j] pe,M. To lower bound the right-hand side, we apply Proposition 2.3 by Tsybakov (2008). Notice that the maximal ε-packing is ε-net (see Lemma 4.2.6 by Vershynin (2018)). Therefore, by Lemma 7 we have i=1 KL(Pτ1,...,τN πi Pτ1,...,τN π0) = N i=1 KLtraj(πi π0) n log(e2/γ) 1 i=1 ρ2(πi, π0) D α . Thus by Proposition 2.3 by Tsybakov (2008) pe,M sup 0<τ<1 Next, we select τ in a way such that α /2 log(τ ) = 1 2 log(τ ) = 1 Therefore we have 2 exp(log(M) 1/2(α + p 1 + exp(log(M) 1/2(α + p Notice that the function f(x) = exp(x)/(1 + exp(x)) monotonically increasing, therefore it is enough to bound the expression under the exponent from below. Let us assume that α 5 D 5. Then we have log(M) 1/2(α + p α /2) log(M + 1) 2 + 2 4 α log(2) 4 log(e2/γ)N log 4 log(e2/γ)R2 N To show that the expression above is non-negative, it is enough to guarantee log(N) + log(R2/D) 2 N e2D/R2. Under this condition we have pe,M 1/4 concluding the statement. B.5.2 FINITE MDPS For the case of finite MDPs, we additionally specialize the distributions µh as a uniform over S for all h > 1 and µ1 = δs1. Lemma 4. Let A,γ = {x RA : Pn i=1 xi = 1, xi γ} with γ < 1/A. Then we have for any ε (0, 1) log P(ε, A,γ, d H)| (A 1) log((1 Aγ)/(2ε)). Proof. Let S = {x RA : Pn i=1 x2 i = 1, xi 0}. Then there is a mapping φ: (S, 2) ( A, d H) that defines an isometry between these two metric spaces: x, y S : x y 2 = d H(φ(x), φ(y)), φ(x) = x, Published as a conference paper at ICLR 2024 where the square root is applied component-wise. Therefore, it is enough to estimate the packing number of the preimage of A,γ that is defined as follows Sγ(1) = {x RA : PA i=1 x2 i = 1, xi γ} with the same Euclidean metric. The next step is to proceed with a shift x 7 x + γ that will be isometry between Sγ(1) and S0(1 Aγ). Next, we can lower bound the ℓ2-distance over the sphere by the ℓ2 distance over the first A 1 coordinates and therefore it is enough to consider the packing number of S 0(1 Aγ) = B(0, 1) {y RA 1 : PA 1 i=1 y2 i 1 Aγ}. Finally, we apply the volume argument. In particular, it is enough to compute the volume of S 0(1 Aγ). To do it, we notice that we can represent the ball of radius 1 Aγ by 2A 1 copy of S 0(1 Aγ). Thus vol(S 0(1 Aγ)) = (1 Aγ)A 1 2A 1 vol(BA 1 2 ). Finally we have by Proposition 4.2.12 by Vershynin (2018) P(ε, A,γ, d H)| 1 Aγ Lemma 5. Let (X, ρ)K i=1 be a metric space such that log |Pε(X, ρ)| d log(R/ε) for d 1. and define on the space X K the following metric ρ(x, y) = 1 K PK i=1 ρ(xi, yi). Then log P(ε, X K, ρ) d K/2 log(R/(8ε)). Proof. Consider the maximal ε-separable set P of the space K. This set could be considered as a finite alphabet of size q (R/ε)d. Let us consider the set P K as the set of words in alphabet P of size q of length K with a Hoeffding distance. Then we notice that if there is two words (x, y) P K that have Hoeffding distance at least αK for some constant α (0, 1), then ρ(x, y) = 1 i ρ(xi, yi) αε. 
Therefore, if we consider a αK separable set in P K in terms of Hoeffding distance, it will be automatically a αε-separable set in the original space X K. To find such a set we use the Gilbert Varshamov bound from coding theory. As a result P(αε, P K, ρ) 1 P αK j=1 K j (1 1/q)j (1/q)K j . The denominator could be interpreted as follows: Let X1, . . . , XK be Ber(1/q) random variables. (1 1/q)j (1/q)K j = P i=1 Xi (1 α)K To upper bound the last probability we can apply the Chernoff Hoeffding theorem i=1 Xi 1/q + (1 α 1/q) exp( kl(1 α 1/q) K). Take α = 1/2, then we have kl(1/2 1/q) = 1 2 log q q 1 Thus, we have since d 1 P(ε/2, P K, ρ) exp(K/2 log(q/4)) exp(d K/2 log(R/(4ε))). By rescaling ε we conclude the statement. Published as a conference paper at ICLR 2024 Corollary 4. Assume that γ 1/(2A). Let us define Fh = S A,γ and F = F1 . . . FH. Then Assumption 5 holds with constants dh = (A 1)S/2 for all h > 1, d1 = (A 1) and R = 1/32. As a result, as soon as H 2, A 2, HSA 40 and n 512e2HSA the following minimax lower bound holds min bπ max π F Eτ1,...,τN π[KLtraj(π bπ)] HSA 128N log(e2/γ). B.5.3 TECHNICAL LEMMAS Lemma 6. Let {(Xi, ρi)}i=1,...,K be a collection of relaxed pseudometric spaces that satisfy ε (0, 1) : log P(ε, Xi, ρi) di log(R/ε) for some constants 0 d1 d2, Then the product space X = X1 . . . XK with a pseudometric ρ((x1, . . . , x K), (y1, . . . , y K)) = q PK i=1 ρ2 i (xi, yi) satisfies ε (0, 1) : log P(ε, X, ρ) i=1 di log(R/ε). Proof. Let P1, . . . , PK be a maximal ε-separated set in the corresponding spaces M1, . . . , MK. Then we want to show that the set P1 . . . PK is also ε-separated set in the product space. Let x = x be two point in P. Then i=1 ρ2 i (xi, x i) ε since x and x are different in at least one coordinate. As a result, we have log P(ε, Mi, ρi)| i=1 log P(ε, Mi, ρi) i=1 di log(R/ε). Lemma 7. Let π, π Πγ. Let ρ be an averaged Hellinger distance distance defined in (6). Then the following inequality holds ρ2(π, π ) KLtraj(π π ) log(e2/γ)ρ2(π, π ). Proof. It is enough to show that for two measures p, q n such that mini pi/qi γ the following holds d2 H(p, q) KL(p q) log(e2/γ)d2 H(p, q). The lower bound holds by Lemma 2.4 of Tsybakov (2008), and the upper bound holds by Lemma 4 of Yang & Barron (1998). B.6 IMITATION LEARNING GUARANTEES In this appendix, we present guarantees that give behavior cloning procedure in the setting of imitation learning for finite MDPs and compare obtained results to (Ross & Bagnell, 2010; Rajaraman et al., 2020). Published as a conference paper at ICLR 2024 General expert Using Pinsker inequality in the space of trajectories (see Lemma 9) and the fact the expert policy is εE-optimal we deduce the following bound on the optimality gap of the behavior cloning policy with probability at least 1 δ V 1 (s1) V πBC 1 (s1) εE + e O We remark that we obtain a similar rate as in BPI, see for example M enard et al. (2021), where instead of observing N E demonstrations we collect the same number of trajectories (observing also the rewards) by interacting sequentially with the MDPs. This seems a bit counter-intuitive since we expect to learn faster by directly observing the expert. However, for the deterministic expert, we get an improved rate using the following variance-aware Pinkser inequality. Lemma 8. Let π and π be arbitrary policies. Then the following upper bound holds V π 1 (s1) V π 1 (s1) h=1 VarπQπ h (sh) KLtraj(π π ) + 4H KLtraj(π π ). Proof. Let us start from Lemma 11 and Lemma 32. 
V π 1 (s1) V π 1 (s1) = Eπ h=1 [πh π h]Qπ h (sh) 2Varπ Qπ h (sh) KL(πh(sh) π h(sh)) + 2H 3 KL(πh(sh) π h(sh)) h=1 Varπ Qπ h (sh) KLtraj(π π ) + 2H 3 KLtraj(π π ) , where in the last line we have applied Cauchy-Schwartz inequality. Next we apply Lemma 33 and obtain h=1 Varπ Qπ h (sh) h=1 VarπQπ h (sh) + 4H2 KLtraj(π π ). By inequality b for any a, b 0 we have V π 1 (s1) V π 1 (s1) h=1 VarπQπ h (sh) KLtraj(π π ) + 4H KLtraj(π π ). Deterministic expert If we assume that the expert policy is deterministic, for example, a deterministic optimal policy, then we can improve the bound on the optimality gap since the variance term in Lemma 8 is zero, V 1 (s1) V πBC 1 (s1) εE + e O SAH2 Ross & Bagnell (2010) also consider behavior cloning with a deterministic expert and provide a bound in terms of the classification-type error of the behavior cloning policy to imitate the expert V 1 (s1) V πBC 1 (s1) εE + e E(πBC) where e E(πBC) = 1 H EπBC " H X h=1 1{πE(ah|sh) = 1} We can easily recover our bound on the optimality gap of the behavior cloning policy from their bound by noting that e E(πBC) KLtraj(πE πBC)/H, see Lemma 10 for a proof. In particular, we Published as a conference paper at ICLR 2024 also remark that the bound scales quadratically with the horizon H but also linearly with the number of actions and states SA. By comparing this bound to the lower bound in Theorem 1.1 of Rajaraman et al. (2020) we see that it is optimal in its dependence on S, H and N. Additional dependence on a number of actions comes from the fact that our behavior cloning algorithm always outputs stochastic policy and obtains additional dependence on a number of actions. We would like to underline that the bound in Lemma 8 directly does not give any insights on the performance of the algorithm in the case of non-deterministic optimal expert, whereas Rajaraman et al. (2020) provides e O(SH2/N E) guarantees using a non-regularized behavior cloning algorithm. It is connected to the fact that for the optimal policy, it is enough to determine the subset of the support of π (s) to achieve the policy with the same value. Finally, we would like to emphasize that our approach is directly generalized to arbitrary parametric function approximation setting whereas the approach of Rajaraman et al. (2020) could be applied only in the setting of finite MDPs. B.6.1 TECHNICAL LEMMAS FOR IMITATION LEARNING Lemma 9. Let M = (S, A, {ph}H h=1, {rh}H h=1, s1) be a finite MDP and let π and π be two any policies in M. Then V π 1 (s1) V π 1 (s1) H q KLtraj(π π )/2. Proof. Let us define the trajectory distribution qπ(τ) for τ = (s1, a1, . . . , s H, a H). Then by the chain rule for KL-divergence we have KL(qπ qπ ) = KLtraj(π π ). Since rh [0, 1], we may apply a variational formula for total variation distance and Pinkser s inequality V π 1 (s1) V π 1 (s1) h=1 Eπ[rh(sh, ah)] Eπ [rh(sh, ah)] H TV(qπ, qπ ) H q KLtraj(π π )/2. Let us assume that a policy π is deterministic, then a A π (a|sh)1{a = π(sh)} Notice that X a A π (a|sh)1{a = π(sh)} = 1 a A |π (a|sh) π(a|sh)| = TV(π(sh), π (sh)). Therefore this quantity could be decomposed as follows h=1 Eπ[TV(π(sh), π (sh))]. Lemma 10. Let π be a deterministic policy and π be any policy. Then eπ(π ) KLtraj(π π ) Published as a conference paper at ICLR 2024 Proof. If the policy π is deterministic, then KL(πh(sh) π h(sh)) = log 1 π h(ah|sh) = log 1 π h(πh(sh)|sh) = log 1 1 P a A π h(a|sh)1{a = πh(sh)} = log(1 TV(πh(sh), π h(sh))). 
By an inequality log(1 x) x for any x > 0 we have KL(πh(sh) π h(sh)) TV(πh(sh), π h(sh)). The following lemma is known as performance-difference lemma (see e.g. Kakade & Langford (2002) for a statement in the discounted setting). We provide proof for completeness. Lemma 11. Let π and π be arbitrary policies. Then the following decomposition holds V π 1 (s1) V π 1 (s1) = Eπ h=1 [πh π h]Qπ h (sh) Proof. Let us proceed by backward induction over h [H]. We want to show the following bound V π h (s) V π h (s) = Eπ h =1 [πh π h ]Qπ h (sh )|sh = s For h = H + 1 both sides of the equation above are equal to zero. Let us assume that the statement holds for any h > h. Then we have V π h (s) V π h (s) = πQπ h(s) π Qπ h (s) = π[Qπ h Qπ h ](s) + [π π ]Qπ (s) = Eπ h V π h+1(sh+1) V π h+1(sh+1) + [π π ]Qπ (sh)|sh = s i . By the induction hypothesis and the tower property of mathematical expectation, we conclude the statement. Published as a conference paper at ICLR 2024 C PROOF FOR DEMONSTRATION-REGULARIZED RL Theorem (Restatement of Theorem 2). Assume that there are an expert policy πE such that V 1 (s1) V πE 1 (s1) εE and a behavior cloning policy πBC satisfying p KLtraj(πE πBC) εKL. Let πRL be εRL-optimal policy in λ-regularized MDP with respect to πBC, that is, V πBC,λ,1(s1) V πRL πBC,λ,1 εRL. Then πRL fulfills V 1 (s1) V πRL 1 (s1) εE + εRL + λε2 KL. In particular, under the choice λ = εRL/ε2 KL, the policy πBC is (2εRL + εE)-optimal in the original (non-regularized) MDP. Proof. We start from the following observation that comes from the assumption on expert policy and a definition of regularized value V 1 (s1) V πRL 1 (s1) εE + V πE 1 (s1) V πRL 1 (s1) εE + V πE πBC,λ,1(s1) + λ KLtraj(πE πBC) V πRL πBC,λ,1(s1) λ KLtraj(πRL πBC) εE + εRL + λε2 KL, where in the last inequality we apply assumptions on πBC and πRL. Here for completeness we present the proof of Theorem 3. Theorem (Restatement of Theorem 3). For all ε > 0, δ (0, 1), the UCBVI-Ent+ /LSVI-UCB-Ent algorithms defined in Appendix D.4 /Appendix E.4 are (ε, δ)-PAC for the regularized BPI with sample complexity C(ε, δ) = e O H5S2A (finite) C(ε, δ) = e O H5d2 Additionally, assume that the expert policy is εE = ε/2-optimal and satisfies Assumption 3 in the linear case. Let πBC be the behavior cloning policy obtained using corresponding function sets described in Section 3. Then demonstration-regularized RL based on UCBVI-Ent+ / LSVI-UCB-Ent with parameters εRL = ε/4, δRL = δ/2 and λ = e O N Eε/(SAH) / e O N Eε/(d H) is (ε, δ)-PAC for BPI with demonstration in finite / linear MDPs and has sample complexity of order C(ε, N E, δ) = e O H6S3A2 (finite) C(ε, N E, δ) = e O H6d3 Proof. The first part of the statement is a combination of Theorem 5 and Theorem 6. The second part of the statement follows from an upper bound on ε2 KL by a behavior cloning (see Appendix B.2 and Appendix B.3) and Theorem 2. Published as a conference paper at ICLR 2024 D BEST POLICY IDENTIFICATION IN REGULARIZED FINITE MDPS In this appendix, we present and analyze the UCBVI-Ent+ algorithm for regularized BPI. D.1 PRELIMINARIES We first detail the general setting of KL-regularized MDPs. Given some reference policy eπ and some regularization parameter λ > 0, instead of looking at the usual value function of a policy π we consider the trajectory Kullback-Leibler divergence regularized value function V π eπ,λ,1(s1) V π 1 (s1) λKLtraj(π, eπ). In this case, the value function of the policy π is penalized for moving too far from the reference policy eπ. 
Interestingly, we can compute the value of policy π and the optimal value thanks to the regularized Bellman equations Qπ eπ,λ,h(s, a) = rh(s, a) + ph V π eπ,λ,h+1(s, a) , V π eπ,λ,h(s) = πh Qπ eπ,λ,h(s) λ KL(πh(s) eπh(s)) , Q eπ,λ,h(s, a) = rh(s, a) + ph V eπ,λ,h+1(s, a) , (7) V eπ,λ,h(s) = max π A πQ eπ,λ,h(s) λ KL(πh(s) eπh(s)) , π eπ,λ,h(s) = arg max π A πQ eπ,λ,h(s) λ KL(πh(s) eπh(s)) , where V π eπ,λ,H+1 = V eπ,λ,H+1 = 0. Note that for π the uniform policy we recover the entropyregularized Bellman equations. Next we define a convex conjugate to λ KL( eπh(s)) as Feπh(s),λ,h : RA R Feπh(s),λ,h(x) = max π A{ π, x λ KL(π eπh(s))} = λ log a A eπh(a|s) exp{xa/λ} and, with a sight abuse of notation extend the action of this function to the Q-function as follows V eπ,λ,h(s) = Feπh(s),λ,h(Q eπ,λ,h(s, )) = max π A πQ eπ,λ,h(s) λ KL(π eπh(s)) . Thanks to the fact that the norm of gradients of KL(π|eπh(s)) tends to infinity as π tends to a border of simplex, we have an exact formula for the optimal policy by Fenchel-Legendre transform π eπ,λ,h(s) = arg max π A πQ eπ,λ,h(s) λ KL(π eπh(s)) = Feπh(s),λ,h(Q eπ,λ,h(s, )). Notice that we have Feπh(s),λ,h(Q eπ,λ,h(s, )) A since the gradient of Φ diverges on the boundary of A. Finally, it is known that the smoothness property of Feπh(s),λ,h plays a key role in reduced sample complexity for planning in regularized MDPs (Grill et al., 2019). For our general setting we have that since λ KL( eπh(s)) is λ-strongly convex with respect to 1, then Feπh(s),λ,h is 1/λ-strongly smooth with respect to the dual norm Feπh(s),λ,h(x) Feπh(s),λ,h(x ) + Feπh(s),λ,h(x ), x x + 1 We notice that KL-divergence is always non-negative and, moreover, V eπ,λ,h(s) 0 for any s since the value of the reference policy eπ is non-negative. D.2 ALGORITHM DESCRIPTION In this appendix we present the UCBVI-Ent+ algorithm, a modification of the algorithm UCBVI-Ent proposed by Tiapkin et al. (2023), that achieves better rates in the tabular setting. The UCBVI-Ent+ algorithm works by sampling trajectory according to an exploratory version of an optimistic solution of the regularized MDP and is characterized by the following rules. Published as a conference paper at ICLR 2024 Sampling rule To obtain the sampling rule at episode t, we first compute a policy πt by optimistic planning in a regularized MDP, Q t h(s, a) = clip rh(s, a) + bpt h V t h+1(s, a) + bp,t h (s, a), 0, H , V t h(s) = max π A n πQ t h(s) λ KL(π eπh(s)) o , (8) πt+1 h (s) = arg max π A n πQ t h(s) λ KL(π eπh(s)) o , with V t H+1 = 0 by convention, where bpt is an estimate of the transition probabilities defined in Appendix D.3 and bp,t some bonus term, defined in (10), Appendix D.4. It takes into account an estimation error for transition probabilities. Then we define a family of policies aimed to explore actions for which Q-value is not well estimated at a particular step. That is, for h [0, H], the policy πt,(h ) first follows the optimistic policy πt until step h where it selects an action leading to the largest confidence interval for the optimal Q-value, πt,(h ) h (a|s) = ( πt,(h ) h (a|s) = πt h(a|s) if h = h , πt,(h ) h (a|s) = 1 n a arg maxa A(Q t h(s, a ) Qt h(s, a )) o if h = h , where Qt is a lower bound on the optimal regularized Q-value function, see Appendix D.4. In particular, for h = 0 we have πt,(0) = πt. The sampling rule is obtained by picking up uniformly at random one policy among the family πt = πt,(h ), h Unif[0, H] . 
Note that it is equivalent to sampling from a uniform mixture policy πmix,t over all h [0, H]. Stopping rule and decision rule To define the stopping rule, we first recursively build an upperbound on the difference between the value of the optimal policy and the value of the current optimistic policy πt, W t h(s, a) = 1 + 1 bpt h Gt h+1(s) + bgap,t h (s, a), Gt h(s) = clip πt+1 h W t h(s) + 1 Q t h(s, a) Qt h(s, a) 2 , 0, H , (9) where bgap,t h is a bonus defined in (12), Appendix D.4, V t is a lower-bound on the optimal value function defined in Appendix D.4 and Gt H+1 = 0 by convention. Then the stopping time ι = inf{t N : Gt 1(s1) ε} corresponds to the first episode when this upper-bound is smaller than ε. At this episode, we return the policy bπ = πι. The complete procedure is described in Algorithm 3. Algorithm 3 UCBVI-Ent+ 1: Input: Target precision ε, target probability δ, bonus functions bt, bt,KL. 2: while true do 3: Compute πt by optimistic planning with (8). 4: Compute bound on the gap Gt 1(s, a) with (9). 5: if Gt 1(s1) ε then break 6: Sample h Unif[H] and set πt = πt,(h ). 7: for h [H] do 8: Play at h πt h(st h) 9: Observe st h+1 ph(st h, at h) 10: end for 11: Update transition estimates bpt. 12: end while 13: Output policy bπ = πt. Published as a conference paper at ICLR 2024 D.3 CONCENTRATION EVENTS We first define an estimate of the transition kernel. The number of times the state action-pair (s, a) was visited in step h in the first t episodes are nt h(s, a) Pt i=1 1 (si h, ai h) = (s, a) . Let nt h(s |s, a) Pt i=1 1 (si h, ai h, si h+1) = (s, a, s ) be the number of transitions from s to s at step h. The empirical distribution is defined as bp t h(s |s, a) = n t h(s |s, a)/n t h(s, a) if nt h(s, a) > 0 and bp t h(s |s, a) 1/A for all s S else. Following the ideas of M enard et al. (2021), we define the following concentration events. First, we define pseudo-counts as a sum of conditional expectations of the random variables that correspond to counts nt h(s, a) = i=1 deπt h (s, a), where πmix,t is a mixture policy played on t-th step. In particular, we have dπmix,t h (s, a) = 1 H + 1 h =0 dπt,(h ) h (s, a), where πt,(h ) h (s, a) is a greedy-modified policy πt in h -step: πt,(h ) h (a|s) = ( πt,(h ) h (a|s) = πt h(a|s) if h = h πt,(h ) h (a|s) = 1 n a = arg maxa A(Q t h(s, a ) Qt h(s, a )) o if h = h , Let βKL, βcnt : (0, 1) N R+ be some functions defined later on in Lemma 12. We define the following favorable events t N, h [H], (s, a) S A : KL(bp t h(s, a) ph(s, a)) βKL(δ, n t h(s, a)) n t h(s, a) t N, h [H], (s, a) S A : nt h(s, a) 1 2nt h(s, a) βcnt(δ) We also introduce an intersection of these events of interest, G(δ) EKL(δ) Ecnt(δ). We prove that for the right choice of the functions βKL, βcnt the above events hold with high probability. Lemma 12. For any δ (0, 1) and for the following choices of functions β, βKL(δ, n) log(2SAH/δ) + S log(e(1 + n)), βcnt(δ) log(2SAH/δ), it holds that P[EKL(δ)] 1 δ/2, P[Ecnt(δ)] 1 δ/2, In particular, P[G(δ)] 1 δ. Proof. Applying Theorem 9 and the union bound over h [H], (s, a) S A we get P[EKL(δ)] 1 δ/2. By Theorem 10 and union bound, P[Ecnt(δ)] 1 δ/2. The union bound over three prescribed events concludes P[GH(δ)] 1 δ and P[GB(δ)] 1 δ. Lemma 13. Assume conditions of Lemma 12. Then on event EKL(δ), for any f : S [0, H], t N, h [H], (s, a) S A [ph bpt h]f(s, a) 2H2βKL(δ, nt h(s, a)) nt h(s, a) . Published as a conference paper at ICLR 2024 Proof. 
By a H older and Pinsker inequalities [ph bpt h]f(s, a) H ph(s, a) bpt h(s, a) 1 H q 2 KL(bpt h(s, a) ph(s, a)). Applying the definition of the event EKL we conclude the statement. Lemma 14. Assume conditions of Lemma 12. Then on event EKL(δ), for any f : S [0, H], t N, h [H], (s, a) S A, [ph bpt h]f(s, a) 1 H bpt hf(s, a) + 2H 2HβKL(δ, n t h(s, a)) n t h(s, a) 1 , [bpt h ph]f(s, a) 1 H phf(s, a) + 2H 2HβKL(δ, n t h(s, a)) n t h(s, a) 1 . Proof. Let us start from the first statement. We apply the second inequality of Lemma 32 and Lemma 33 to obtain [ph bpt h]f(s, a) q 2Varph[f](s, a) KL(bpt h ph) Varbpt h[f](s, a) KL(bpt h ph) + 3H KL(bpt h ph). Since 0 f(s) H we get Varbpt h[f](s, a) bpt h[f 2](s, a) H bpt hf(s, a). Finally, applying 2 ab a + b, a, b 0, we obtain the following inequality (bpt h ph)f(s, a) 1 H bpt hf(s, a) + 4H2 KL(bpt h ph). The definition of EKL(δ) implies the part of the statement. At the same time we have a trivial bound since f(s) [0, H] [ph bpt h]f(s, a) 2H 1 H bpt hf(s, a) + 2H. To prove the second statement, apply the first inequality of Lemma 32 and proceed similarly. D.4 CONFIDENCE INTERVALS Similar to Azar et al. (2017); Zanette & Brunskill (2019a); M enard et al. (2021), we define the upper confidence bound for the optimal regularized Q-function with Hoeffding bonuses. Then we have the following sequences defined as follows Q t h(s, a) = clip rh(s, a) + bpt h V t h+1(s, a) + bp,t h (s, a), 0, H , πt+1 h (s) = max π A{πQ t h(s) λ KL(π eπh(s))} , V t h(s) = πt+1 h Q t h(s) λ KL( πt+1 h (s) eπh(s)) , V t H+1(s) = 0 , and the lower confidence bound as follows Qt h(s, a) = clip rh(s, a) + bpt h V t h(s, a) bp,t h (s, a), 0, H V t h(s) = max π A{πQt h(s) λ KL(π eπh(s))} , V t H+1(s) = 0 , where we have two types of transition bonuses that will be specified before use. The Hoeffding bonuses are defined as follows bp,t h (s, a) 2H2βKL(δ, nt h(s, a))] nt h(s, a) . (10) Published as a conference paper at ICLR 2024 Proposition 1. Let δ (0, 1). Assume Hoeffding bonuses (10). Then on event G(δ) for any t N, (h, s, a) [H] S A it holds Qt h(s, a) Q eπ,λ,h(s, a) Q t h(s, a), V t λ,h(s) V eπ,λ,h(s) V t h(s). Proof. Proceed by induction over h. For h = H + 1 the statement is trivial. Now we assume that inequality holds for any h > h for a fixed h [H]. Fix a timestamp t N and a stateaction pair (s, a) and assume that Q t h(s, a) < H, i.e. no clipping occurs. Otherwise the inequality Q eπ,λ,h(s, a) Q t h(s, a) is trivial. In particular, it implies nt h(s, a) > 0. In this case by Bellman equations (7) we have [Q t h Q eπ,λ,h](s, a) = bpt h V t h+1(s, a) ph V eπ,λ,h+1(s, a) + bp,t h (s, a) . To show that the right-hand side is non-negative, we start from the induction hypothesis [Q t h Q eπ,λ,h](s, a) [bpt h ph]V eπ,λ,h+1(s, a) + bp,t h (s, a). The non-negativity of the expression above automatically holds from Lemma 13. To prove the second inequality on Q-value, we proceed exactly the same. Finally, we have to show the inequality for V -values. To do it, we use the fact that V -value are computed by Feπh(s),λ,h applied to Q-value V t λ,h(s) = Feπh(s),λ,h(Qt h)(s), V eπ,λ,h(s) = Feπh(s),λ,h(Q eπ,λ,h)(s), V t λ,h(s) = Feπh(s),λ,h(Q t h)(s). Notice that Feπh(s),λ,h takes values in a probability simplex, thus, all partial derivatives of Feπh(s),λ,h are non-negative and therefore Feπh(s),λ,h is monotone in each coordinate. 
Thus, since Qt h(s, a) Q eπ,λ,h(s, a) Q t h(s, a), we have the same inequality V t h(s) V eπ,λ,h(s) D.5 SAMPLE COMPLEXITY BOUNDS In this section, we provide guarantees for the regularization-aware gap that highly depends on the parameter λ. Let us recall the regularization-aware gap that is defined recursively, starting from Gt H+1 0 and W t h(s, a) = 1 + 1 bpt h Gt h+1(s) + bgap,t h (s, a), Gt h(s) = clip πt+1 h W t h(s) + 1 Q t h(s, a) Qt h(s, a) 2 , 0, H , (11) where the additional bonus is defined as bgap,t h (s, a) 4H2βKL(δ, nt h(s, a)) nt h(s, a) , (12) and the corresponding stopping time for the algorithm ι = inf{t N : Gt 1(s1) ε}. (13) The next lemma justifies this choice of the stopping rule. Lemma 15. Assume the choice of Hoeffding bonuses (10) and let the event G(δ) defined in Lemma 12 holds. Then for any t N, s S, h [H] V eπ,λ,h(s) V πt+1 eπ,λ,h(s) Gt h(s). Proof. Let us proceed by induction. For h = H + 1 the statement is trivial. Assume that for any h > h the statement holds. Also assume that Gt h(s) < H, otherwise the inequality on the policy error holds trivially. In particular, it holds that nt h(s, a) > 0 for all a A. Published as a conference paper at ICLR 2024 We can start analysis from understanding the policy error by applying the smoothness of Feπh(s),λ,h. V eπ,λ,h(s) V πt+1 eπ,λ,h(s) = Feπh(s),λ,h(Q eπ,λ,h(s, )) πt+1 h Q πt+1 eπ,λ,h(s) λ KL( πt+1 h (s) eπh(s)) Feπh(s),λ,h(Q t h(s, )) + Feπh(s),λ,h(Q t h(s, )), Q eπ,λ,h(s, ) Q t h(s, ) 2λ Q t h Q eπ,λ,h 2 (s) πt+1 h Q πt+1 eπ,λ,h(s) λ KL( πt+1 h (s) eπh(s)) . Next we recall that πt+1 h (s) = Feπh(s),λ,h(Q t h(s, )), Feπh(s),λ,h(Q t h)(s) = πt+1 h Q t h(s) λ KL( πt+1 h (s) eπh(s)), thus we have Feπh(s),λ,h(Q t h)(s) πt+1 h Q πt+1 eπ,λ,h(s, ) λ KL( πt+1 h (s) eπh(s)) = πt+1 h [Q t h Q πt+1 eπ,λ,h](s) and, by Bellman equations V eπ,λ,h(s) V πt+1 eπ,λ,h(s) πt+1 h h Q eπ,λ,h Q πt+1 eπ,λ,h i (s) + 1 2λ Q t h Q eπ,λ,h 2 (s) πt+1 h ph h V eπ,λ,h+1 V πt+1 eπ,λ,h+1 i (s) + 1 2λ Q t h Q eπ,λ,h 2 (s). By induction hypothesis we have V eπ,λ,h(s) V πt+1 λ,h (s) πt+1 h ph Gt λ,h+1(s) + 1 2λ Q t h Q eπ,λ,h 2 (s). Next, we apply Lemma 14 ph Gλ,h+1(s, a) = bpt h Gλ,h+1(s, a) + [ph bpt h]Gt λ,h+1(s, a) bpt h Gt λ,h+1(s, a) + 4H2βKL(δ, nt h(s, a)) nt h(s, a) W t h(s, a), V eπ,λ,h(s) V πt+1 λ,h (s) πt+1 h W t h(s) + 1 2λ Q t h Q eπ,λ,h 2 (s). Finally, by the definition of and Proposition 1 V eπ,λ,h(s) V πt+1 λ,h (s) πt+1 h W t h(s) + 1 Q t h(s, a) Qt h(s, a) 2 Gt h(s). Theorem 5. Let ε > 0, δ (0, 1), S 2 and λ H. Then UCBVI-Ent+ algorithm with Hoeffding bonuses and a stopping rule ι (13) is (ε, δ)-PAC for the best policy identification in regularized MDPs. Moreover, the stopping time ι is bounded as follows ι = O H5SA (log(SAH/δ) + SL) L where L = O(log(SAH log(1/δ)/(ελ))). Proof. To show that UCBVI-Ent+ is (ε, δ)-PAC we notice that on event G(δ) for bπ = πι by Lemma 15 V eπ,λ,1(s1) V ˆπ eπ,λ,1(s1) Gι 1(s1) ε, and the event G(δ) holds with probability at least 1 δ. Next we show that the sample complexity is bounded by the quantity mentioned above. Published as a conference paper at ICLR 2024 Step 1. Bound for Gt 1(s1) First, we start from bounding W t h(s, a) and Gt h(s). By Lemma 14 we can define the following upper bound for W t h(s, a) W t h(s, a) 1 + 2 ph Gt h+1(s, a) + 8H2βKL(δ, nt h(s, a)) nt h(s, a) . 
Therefore we obtain Gt h(s) E πt+1 1 + 2 Gt h+1(sh+1) + 8H2βKL(δ, nt h(sh, ah)) nt h(sh, ah) Q t h(sh, a) Qt h(sh, a) 2 sh = s , By rolling out this expression Gt 1(s1) E πt+1 H X h 8H2βKL(δ, nt h(sh, ah)) nt h(sh, ah) Q t h(sh, a) Qt h(sh, a) 2 . Using the fact that (1 + 2/H)h e2, we have Gt 1(s1) 8e2H2E πt+1 βKL(δ, nt h(sh, ah)) nt h(sh, ah) h=1 max a A Q t h(sh, a) Qt h(sh, a) 2 # Term (A). The analysis of the term (A) follows M enard et al. (2021): we switch counts to pseudocounts by Lemma 28 and obtain (A) 32e2H2 H X (s,a) S A d πt+1 h (s, a)βKL(δ, nt h(s, a)) nt h(s, a) 1 . Term (B). For this term we analyze each summand over h separately. By Lemma 35 E πt+1 max a A Q t h(sh, a) Qt h(sh, a) 2 = Eπt+1,(h) Q t h(sh, ah) Qt h(sh, ah) 2 . Next, we analyze the expression under the square. First, we have Q t h(sh, ah) Qt h(sh, ah) 2bp,t h (sh, ah) + bpt h[V t h+1 V t h+1](sh, ah). By Lemma 13 Q t h(sh, ah) Qt h(sh, ah) 4bp,t h (sh, ah) + ph[V t h+1 V t h+1](sh, ah). using V t λ,h+1(s) V t λ,h+1(s) πt h+1 h Q t h+1 Qt h+1 i (s) and the definition of Hoeffding bonuses (10), thus, rolling out this recursion Q t h(sh, ah) Qt h(sh, ah) 4H E πt+1 2βKL(δ, nt h (sh , ah )) nt h (sh , ah ) By Lemma 28, Jensen inequality, and a change of policy πt+1 to πt+1,(h) by Lemma 35 we have Q t h(sh, ah) Qt h(sh, ah) 4H3/2 v u u t Eπt+1,(h) 2βKL(δ, nt h (sh , ah )) nt h (sh , ah ) 1 |sh Published as a conference paper at ICLR 2024 By taking the square we get E πt+1 max a A Q t h(sh, a) Qt h(sh, a) 2 16H3Eπt+1,(h) 2βKL(δ, nt h (sh , ah )) nt h (sh , ah ) 1 The telescoping property of conditional expectation yields the final bound E πt+1 max a A Q t h(sh, a) Qt h(sh, a) 2 h =1 32H3Eπt+1,(h) 2βKL(δ, nt h (sh , ah )) nt h (sh , ah ) 1 Finally, collecting bounds over all h [H] we have h =1 dπt,(h) h (s, a)βKL(δ, nt h (s, a)) nt h (s, a) 1 . The final bound for an initial gap follows Gt 1(s1) 32e2H2 H X (s,a) S A d πt+1 h (s, a)βKL(δ, nt h (s, a)) nt h (s, a) 1 h =1 dπt,(h) h (s, a)βKL(δ, nt h (s, a)) nt h (s, a) 1 . Since λ H, we have that H2 H3/λ. Using a convention d πt+1 h (s, a) = dπt+1,(0) h (s, a) we have Gt 1(s1) 48e2H3 h =1 dπt,(h) h (s, a)βKL(δ, nt h (s, a)) nt h (s, a) 1 . By changing the summation order and noticing that dπmix,t h (s, a) = 1 H + 1 h=0 dπt,(h) h (s, a) for H + 1 2H we get Gt 1(s1) 96e2H4 h=1 dπmix,t h (s, a)βKL(δ, nt h(s, a)) nt h(s, a) 1 . Step 2. Sum over t < ι. Assume ι > 0. In the case ι = 0 the bound is trivially true. Notice that for any t < ι we have Gt λ,1(s1) > ε, thus, summing upper bounds on Gt λ,1(s1) over all t < ι we have t=1 Gt λ,1(s1) 96e2H4 t=1 dπmix,t h (s, a)βKL(δ, nt h(s, a)) nt h(s, a) 1 . Notice that βKL(δ, ) is monotone and maximizes at ι 1, and dπmix,t+1 h (s, a) = nt+1 h (s, a) nt h(s, a). Thus, applying Lemma 29, we have ε(ι 1) < 384e2H5SA λ βKL(δ, ι 1) log(ι). Then by definition of βKL ε(ι 1) 384e4H5SA λ (log(2SAH/δ) + S log(eι)) log(ι). Published as a conference paper at ICLR 2024 Step 3. Solving the recurrence. Define A = 384e2H5SA/(λε) and B = log(2SAH/δ) + S. Our goal is to upper bound solutions to the following inequality ι 1 + A(S log(ι) + B) log(ι). First, we obtain a loose solution by using inequality log(ι) ιβ/β that holds for any ι 1. Taking β = 1/3 we have ι 1 + 3A(3S ι1/3 + B) ι1/3. Also we may assume that ι 2, thus 1 ι/2 and we achieve ι2/3 6A(3Sι1/3 + B). Solving this quadratic inequality in ι1/3, we have (18AS)2 + 24AB Define L = 3 log 54AS + 18AB . Then we can easily upper bound the initial inequality as follows ι 1 + A(B + SL)L. 
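The implicit bound obtained in Step 3 can also be resolved numerically: the short sketch below iterates the map ι ↦ 1 + A(S log ι + B) log ι to its fixed point, with A and B as defined in Step 3 and purely illustrative values of H, S, A, ε, λ, δ.

```python
import math

def bound_iota(A, B, S, iters=100):
    """Numerically resolve the implicit bound iota <= 1 + A*(S*log(iota) + B)*log(iota)
    of Step 3 by fixed-point iteration (monotone and convergent from below)."""
    iota = 2.0
    for _ in range(iters):
        iota = 1 + A * (S * math.log(iota) + B) * math.log(iota)
    return iota

# Illustrative numbers only: H=10, S=20, A=5 actions, eps=0.1, lam=H, delta=0.1.
H, S_, A_act, eps, lam, delta = 10, 20, 5, 0.1, 10.0, 0.1
A_const = 384 * math.exp(2) * H**5 * S_ * A_act / (lam * eps)
B_const = math.log(2 * S_ * A_act * H / delta) + S_
print(f"iota ~ {bound_iota(A_const, B_const, S_):.3e}")
```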
Published as a conference paper at ICLR 2024 E BEST POLICY IDENTIFICATION IN REGULARIZED LINEAR MDPS In this appendix, we first state some useful properties of regularized linear MDPs and describe the LSVI-UCB-Ent algorithm. E.1 GENERAL PROPERTIES OF LINEAR MDPS Let us start with a description of how to generalize the techniques of regularized MDPs to the setup of linear function approximation (Jin et al., 2020). Again, we consider the case of KL-regularized MDPs with respect to a reference policy eπ. In this setting the Qand V -values could be defined through regularized Bellman equations Qπ eπ,λ,h(s, a) = rh(s, a) + ph V π eπ,λ,h+1(s, a), V π eπ,λ,h(s) = πh Qπ eπ,λ,h(s) λ KL(πh(s) eπh(s)). Moreover, for optimal Qand V -functions we have Q eπ,λ,h(s, a) = rh(s, a) + ph V eπ,λ,h+1(s, a), V eπ,λ,h(s) = max π A πQ eπ,λ,h(s) λ KL(π eπh(s)) . Note that the value of a policy could be arbitrarily negative, however, we know a priori that the optimal policy has non-negative value V eπ,λ,h [0, H], since the policy eπ itself has non-negative value. In particular, under this assumption, we have the following simple proposition Proposition 2. For a linear MDP, for any policy π such that V π eπ,λ,h(s, a) 0 for any (s, a, h) S A [H] there exists weights {wπ h}h [H] such that for any (s, a, h) S A [H] we have Qπ eπ,λ,h(s, a) = ψ(s, a)Twπ h. Moreover, for any h [H] it holds wπ h 2 2H Proof. By Bellman equations Qπ eπ,λ,h(s, a) = rh(s, a) + ph V π eπ,λ,h(s, a) = ψ(s, a) Tθh + Z S V π eπ,λ,h(s ) i=1 ψ(s, a)iµh,i(ds ) = ψ(s, a), θh + Z S V π eπ,λ,h(s )µh(ds ) . To show the second part, we use Definition 2. First, we note that θh d and, at the same time Z S V π eπ,λ,h(s )µh(ds ) H since the value is bounded by H. E.2 ALGORITHM DESCRIPTION In this appendix, we describe the LSVI-UCB-Ent algorithm for regularized BPI in linear MDPs. LSVI-UCB-Ent is characterized by the following rules. Sampling rule As for the UCBVI-Ent+ algorithm we start with regularized optimistic planning under the linear function approximation Q t h(s, a) = ψh(s, a) Twt h + bt h(s, a) , V t h(s) = clip max π A πQ t h(s) λ KL(π, eπh(s)) , 0, H , πt+1 h (s) = arg max π A πQ t h(s) λ KL(π, eπh(s)) , Published as a conference paper at ICLR 2024 where bt is some bonus defined as follows [ψ(s, a)]T[Λt h] 1ψ(s, a) for B > 0 a bonus scaling factor, and the parameter wt h is obtained by least-square value iteration with Tikhonov regularization parameter α (Jin et al., 2020), wt h = arg min w Rd h rh(sk h, ak h) + V t h+1(sk h+1) ψ(sk h, ak h) Tw i2 + α w 2 2 . We notice that there is a closed-form solution to this problem given by wt h = Λt h 1 " t X τ=1 ψτ h h rτ h + V t h+1(sτ h+1) i# where ψτ h = ψ(sτ h, aτ h) and Λt h = Pt τ=1 ψτ h[ψτ h]T + αI. Then we also define a family of exploratory policies by, for all h [H], πt,(h )(a|s) = ( πt,(h )(a|s) = πt h(a|s) if h = h πt,(h )(a|s) = 1 n a arg maxa A(Q t h(s, a ) Qt h(s, a )) o if h = h , (15) where Qt is some lower bound on the optimal regularized Q-value defined as follows wt h = Λt h 1 " t X τ=1 ψτ h rτ h + V t h+1(sτ h+1) # Qt h(s, a) = [ψ(s, a)] Twt h B q [ψ(s, a)]T[Λt h] 1ψ(s, a), V t h(s) = clip max π A n πQt h(s) λ KL(π eπh(s)) o , 0, H , The sampling rule is then obtained by picking uniformly at random a policy among the exploratory policies, πt = πt,(h ) for h Unif[H]. Notice that it is equivalent to using a non-Markovian mixture policy πmix,t over all h [H]. Additionally, it would be valuable to mention that computation of this policy could be done on-flight since we can compute Q and Q. 
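For concreteness, the regression step of the sampling rule, its closed-form solution w^t_h = (Λ^t_h)^{-1} Σ_τ ψ^τ_h [r^τ_h + V̄^t_{h+1}(s^τ_{h+1})], and the elliptical bonus B √(ψ(s,a)^T (Λ^t_h)^{-1} ψ(s,a)) can be written in a few lines of NumPy. The sketch below uses synthetic features and targets and is illustrative only; B_scale plays the role of the bonus scaling factor B.

```python
import numpy as np

def lsvi_weights_and_bonus(Psi, targets, alpha, B_scale):
    """Least-squares value iteration step with Tikhonov regularization:
        w = Lambda^{-1} sum_tau psi_tau * (r_tau + V_{h+1}(s'_tau)),
        Lambda = sum_tau psi_tau psi_tau^T + alpha * I,
    together with the elliptical bonus b(s,a) = B_scale * sqrt(psi^T Lambda^{-1} psi).
    Psi: (t, d) array of features psi(s^tau_h, a^tau_h); targets: (t,) array."""
    t, d = Psi.shape
    Lam = Psi.T @ Psi + alpha * np.eye(d)
    w = np.linalg.solve(Lam, Psi.T @ targets)
    Lam_inv = np.linalg.inv(Lam)

    def bonus(psi):
        return B_scale * np.sqrt(psi @ Lam_inv @ psi)

    return w, bonus

# Toy usage with random unit-norm features (illustrative only).
rng = np.random.default_rng(1)
Psi = rng.normal(size=(50, 4))
Psi /= np.linalg.norm(Psi, axis=1, keepdims=True)
targets = rng.uniform(0, 1, size=50)
w, bonus = lsvi_weights_and_bonus(Psi, targets, alpha=1.0, B_scale=1.0)
psi_query = Psi[0]
q_upper = psi_query @ w + bonus(psi_query)   # optimistic Q-value before the regularized planning step
print(w.round(3), round(q_upper, 3))
```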
Stopping and decision rule In the linear setting we use a simple deterministic stopping rule τ = T for a fixed parameter T. In the finite setting, we were able to define an adaptive stopping rule by leveraging a certain Bernstein-like inequality on the gaps (see Lemma 14). However, it is not clear how to adapt such inequality to the linear setting. As decision rule LSVI-UCB-Ent returns the non-Markovian policy bπ, the uniform mixture over the optimistic policies { πt}t [T ] The complete procedure is described in Algorithm 4. Algorithm 4 LSVI-UCB-Ent 1: Input: Number of episodes T, bonus function bt, Tikhonov regularization parameter α. 2: for t [T] do 3: Compute πt by regularized optimistic planning with (14). 4: Sample h Unif{1, . . . , H} and set πt = πt,(h ) 5: for h [H] do 6: Play at h πt h(st h) 7: Observe st h+1 ph(st h, at h) 8: end for 9: end for 10: Output bπ the uniform mixture over {πt}t [T ]. Published as a conference paper at ICLR 2024 E.3 CONCENTRATION EVENTS In this section, the required concentration events for a proof of sample complexity for LSVI-UCB-Ent will be described. First, we define several important objects. Let ψτ h = ψ(sτ h, aτ h) for any τ N, h [H] and define Λt h = αId + τ=1 ψτ h[ψτ h] T, Λ t h = αId + τ=1 Eπmix,τ [ψ(sh, ah)[ψ(sh, ah)] T|s1], where eπt is a uniform mixture policy of πt,(h ) defined in (15) over all h {0, . . . , H}. Let βconc : (0, 1) N R+ R+ R+ and βcnt : (0, 1) N R+ be some functions defined later on in Lemma 16. We define the following favorable events for any fixed values of bonus scaling B > 0 and Ridge coefficient α 1 that will be specified later. Econc(δ, B) t N, h [H] : V t h+1(sτ h+1) ph V t h+1(sτ h, aτ h) o [Λt h] 1 2d H p βconc(δ, t, B) τ=1 ψτ h V t h+1(sτ h+1) ph V t h+1(sτ h, aτ h) [Λt h] 1 2d H p βconc(δ, t, B) t N, h [H] : Λt h 1 2Λ t h βcnt(δ, t)Id We also introduce an intersection of these events of interest, G(δ, B) Econc(δ, B) Ecnt(δ). We prove that for the right choice of the functions βconc, βcnt the above events hold with high probability. Lemma 16. Let B, α 1 be fixed. For any δ (0, 1) and for the following choices of functions β, βconc(δ, t, B) 2 log H(1 + t2) 1 + 8d1/2t2 B βcnt(δ, t) 4 log(8e H(2t + 1)/δ) + 4d log(3t) + 3, for any fixed α 1 it holds that P[Econc(δ, B)] 1 δ/2, P[Ecnt(δ)] 1 δ/2, In particular, P[G(δ, B)] 1 δ. Proof. Let us fix h [H]. Then for all t N by Lemma 17 we have wt h 2 2H p dt/α and by a construction of Λt h we have λmin(Λt h) α. Therefore, combination of Lemmas 25 and 26 for any fixed ε > 0 we have with probability at least 1 δ/H V t h+1(sτ h+1) ph V t h+1(sτ h, aτ h) o [Λt h] 1 4H2 d 2 log H(t + α) + d log 1 + 8Hd1/2t1/2 + d2 log 1 + 8d1/2B2 Next we take ε = Hd/t and obtain by using α 1 and d 1 V t h+1(sτ h+1) ph V t h+1(sτ h, aτ h) o [Λt h] 1 4H2d2 2 log H(1 + t2) 1 + 8d1/2t2 B Published as a conference paper at ICLR 2024 Taking the square root we conclude the first half of the statement, the second half is exactly the same, since Lemma 25 gives a bound uniformly over all value functions. By Theorem 27 and union bound over h [H], P[Ecnt(δ)] 1 δ/2. The union bound over two prescribed events concludes P[G(δ, B)] 1 δ. The proof of the following lemma remains exactly the same as in Jin et al. (2020). Lemma 17. [Lemma B.2 by Jin et al. (2020)] For any (t, h) N [H] the weights wt h generated by LSVI-UCB-Ent satisfies wt h 2 2H p E.4 CONFIDENCE INTERVALS In this section, we provide the confidence intervals on the optimal Q-function that is required for the proof of sample complexity of LSVI-UCB-Ent. 
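Schematically, the interaction loop of Algorithm 4 (regularized optimistic planning, a uniformly drawn switching step h', an exploratory rollout, and a final uniform mixture) can be written as below. The planner and the gap-greedy action selector are placeholders in this sketch, and the toy environment and feature map are ours; only the control flow mirrors the algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- toy stand-ins (assumptions of this sketch, not the paper's objects) ---
H, A, d, T = 4, 3, 5, 20
def env_step(s, a, h):                 # synthetic deterministic dynamics, next-state id
    return (s * 3 + a + h) % 7
def features(s, a):                    # psi(s, a) in R^d with ||psi||_2 <= 1
    v = np.cos(np.arange(1, d + 1) * (s + 1) * (a + 2))
    return v / np.linalg.norm(v)
def plan_optimistic_policy(data, h):   # placeholder for regularized optimistic planning (14)
    return lambda s: rng.integers(A)
def greedy_gap_action(data, s, h):     # placeholder for argmax_a (Qbar - Qlow)(s, a)
    return rng.integers(A)

# --- main loop of Algorithm 4, schematically ---
data, policies = [], []
for t in range(T):
    pi_t = [plan_optimistic_policy(data, h) for h in range(H)]
    policies.append(pi_t)
    h_star = rng.integers(H)           # h' ~ Unif[H]
    s = 0                              # fixed initial state s_1
    for h in range(H):
        a = pi_t[h](s) if h != h_star else greedy_gap_action(data, s, h)
        s_next = env_step(s, a, h)
        data.append((h, features(s, a), s_next))
        s = s_next

# Output: the uniform mixture over the T optimistic policies.
def pi_hat(s, h):
    return policies[rng.integers(T)][h](s)
```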
We start from the specification of the required values of α and B. α 2(βcnt(δ, T) + 1), B = 32d H log 24ed HT where βcnt is defined in Lemma 16. Proposition 3. Let α and B satisfy (17). Then on the event G(δ) defined in Lemma 16 we have ψ(s, a), wt h Q eπ,λ,h(s, a) = ph[V t h+1 V eπ,λ,h+1](s, a) + t h, ψ(s, a), wt h Q eπ,λ,h(s, a) = ph[V t h+1 V eπ,λ,h+1](s, a) + t h, where t h(s, a) and t h(s, a) satisfies max{| t h(s, a)|, | t h(s, a)|} B q ψ(s, a), [Λt h] 1ψ(s, a) . Proof. We provide the proof only for the first equation since proof of one statement completely reassembles the other. By Proposition 2 and Bellman equations we have Q eπ,λ,h(s, a) = ψ(s, a), w h = rh(s, a) + ph V eπ,λ,h+1(s, a), therefore wt h w h = Λt h 1 " t X τ=1 ψτ h[rτ h + V t h+1(sτ h+1)] τ=1 ψτ h ψ(sτ h, aτ h), w h αw h = α Λt h 1w h | {z } ξ1 + Λt h 1 t X V t h+1(sτ h+1) ph V t h+1(sτ h, aτ h) i + Λt h 1 t X τ=1 ψτ hph h V t h+1 V eπ,λ,h+1 i (sτ h, aτ h) Next, we analyze the last term ξ3. By Definition 2 we have V t h+1 V eπ,λ,h+1 i (sτ h, aτ h) = ψτ h, Z S [V t h+1 V eπ,λ,h+1](s )µh(ds ) , ξ3 = Λt h 1 t X τ=1 ψτ h[ψτ h] T + αId αId S [V t h+1 V eπ,λ,h+1](s )µh(ds ) S [V t h+1 V eπ,λ,h+1](s )µh(ds ) α Λt h 1 Z S [V t h+1 V eπ,λ,h+1](s )µh(ds ) | {z } ξ4 Published as a conference paper at ICLR 2024 As a result, moving to Q-values directly we have ψ(s, a), wt h Qπ h(s, a) = ψ(s, a), wt h w h = ph[V t h+1 V eπ,λ,h+1](s, a) + ψ(s, a), ξ1 + ξ2 + ξ4 | {z } Next, we compute an upper bound for t h(s, a). We start from the first term, where we apply Cauchy-Schwartz inequality and the second statement of Proposition 2 | ψ(s, a), ξ1 | = | ψ(s, a), αw h [Λt h] 1| α ψ(s, a) [Λt h] 1 w h [Λt h] 1 dα ψ(s, a) [Λt h] 1, where we have used that [Λt h] 1 2 1/α. For the third term, we apply exactly the same construction and obtain the same upper bound. For the second term, we also apply Cauchy-Schwartz inequality and Lemma 16 | ψ(s, a), ξ1 | ψ(s, a) [Λt h] 1 V t h+1(sτ h+1) ph V t h+1(sτ h, aτ h) i [Λt h] 1 βconc(δ, t, B) ψ(s, a) [Λt h] 1. Thus, we have | t h(s, a)| h 2d H p βconc(δ, T) + 4H p 2d(βcnt(δ, T) + 1) iq [ψ(s, a)]T[Λt h] 1ψ(s, a). The only part is to show that for our particular choice of B it holds βconc(δ, T, B) + 4H p 2d(βcnt(δ, T) + 1) B. First, we notice that 2d(βcnt(δ, T) + 1) 16Hd Thus, it is enough to show that βconc(δ, T, B) = 2 log H(1 + T 2) 1 + 8d1/2T 2 B First, we notice that since T 1 and δ (0, 1) then 2 log H(1 + T 2) + 5 4 log 2THe2 and also, using the inequality log(1 + x) x for any x 0 1 + 8d1/2T 2 B + log 32 8 d1/2T 2 2 + 2 log(16 d T). Thus, it is enough for B to satisfy the following inequalities 2 , 4 log 2THe2 2 , 2 log(16d T) B It is clear that the choice B defined in (17) satisfies all required inequalities. Corollary 5 (Confidence intervals validity). Let constant α and B defined in (17). Then on the event G(δ, B) we have Q t h(s, a) Q eπ,λ,h(s, a) Qt h(s, a) and V t h(s) V eπ,λ,h(s) V t h(s) for any t [T], h [H], (s, a) S A. Published as a conference paper at ICLR 2024 Proof. Let us prove using backward induction over h [H]. For h = H + 1 this statement is trivially true. Let us assume that the statement holds for any h > h. Thus by Proposition 3 Q t h(s, a) Q eπ,λ,h(s, a) = t h(s, a) + B q [ψ(s, a)]T[Λt h] 1ψ(s, a) + ph h V t h+1 V h+1 i (s, a). Notice that t h(s, a) + B p [ψ(s, a)]T[Λt h] 1ψ(s, a) 0 and by induction hypothesis V t h+1 V h+1 0 for any s . Thus, we have proven the required statement for Q-values. 
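The bounds on ⟨ψ(s,a), ξ1⟩ and ⟨ψ(s,a), ξ4⟩ above rest on the Cauchy–Schwarz inequality in the (Λ^t_h)^{-1}-weighted inner product, combined with ‖(Λ^t_h)^{-1}‖_2 ≤ 1/α. A quick numerical sanity check of this step (the notation is ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 6, 1.0
Psi = rng.normal(size=(40, d))
Psi /= np.linalg.norm(Psi, axis=1, keepdims=True)
Lam = Psi.T @ Psi + alpha * np.eye(d)
Lam_inv = np.linalg.inv(Lam)

def norm(x, M):                        # weighted norm ||x||_M = sqrt(x^T M x)
    return np.sqrt(x @ M @ x)

psi, w_star = rng.normal(size=d), rng.normal(size=d)
lhs = abs(psi @ (alpha * Lam_inv @ w_star))                  # |<psi, alpha * Lam^{-1} w*>|
rhs = alpha * norm(psi, Lam_inv) * norm(w_star, Lam_inv)     # Cauchy-Schwarz in <.,.>_{Lam^{-1}}
print(lhs <= rhs + 1e-12)                                    # True: the step bounding <psi, xi_1>
# Since Lam >= alpha * I, we also have ||w*||_{Lam^{-1}} <= ||w*||_2 / sqrt(alpha),
# so the term is absorbed into the bonus B * sqrt(psi^T Lam^{-1} psi).
```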
To show it for V -values, we notice that if upper clipping in the definition V in (14) occurs, then the statement trivially holds. Otherwise, we have V t h(s) V eπ,λ,h(s) Feπh(s),λ,h(Q t h(s)) Feπh(s),λ,h(Q eπ,λ,h(s)). However, the function Feπh(s),λ,h is monotone since its gradients lies in a probability simplex. Thus, V t h(s) V eπ,λ,h(s) 0. Using exactly the same reasoning we may show the lower confidence bound. E.5 SAMPLE COMPLEXITY BOUNDS In this section, we provide the sample complexity result of the LSVI-UCB-Ent algorithm. We start from a general result that does not depend on the properties of linear MDPs, however, depends on the properties of the algorithms. Lemma 18. Let constants α, B be defined by (17). Then on event G(δ, B) defined in Lemma 16 for any t N, s S, h [H] V eπ,λ,h(s) V πt+1 eπ,λ,h(s) 1 h=1 max a A Proof. Let us proceed by induction. For h = H + 1 the statement is trivial. Assume that for any h > h the statement holds. Also assume that Gt h(s) < H, otherwise the inequality on the policy error holds trivially. In particular, it holds that nt h(s, a) > 0 for all a A. We can start analysis from understanding the policy error by applying the smoothness of Feπh(s),λ,h. V eπ,λ,h(s) V πt+1 eπ,λ,h(s) = Feπh(s),λ,h(Q eπ,λ,h(s, )) πt+1 h Q πt+1 eπ,λ,h(s, ) λ KL( πt+1 h (s) eπh(s)) Feπh(s),λ,h(Q t h(s, )) + Feπh(s),λ,h(Q t h(s, )), Q eπ,λ,h(s, ) Q t h(s, ) 2λ Q t h Q eπ,λ,h 2 (s) πt+1 h Q πt+1 eπ,λ,h(s, ) λ KL( πt+1 h (s) eπh(s)) . Next we recall that πt+1 h (s) = F(Q t h(s, )), F(Q t h)(s) = πt+1 h Q t h(s) λ KL( πt+1 h (s) eπh(s)), thus we have F(Q t h)(s) πt+1 h Q πt+1 eπ,λ,h(s, ) λ KL( πt+1 h (s) eπh(s)) = πt+1 h [Q t h Q πt+1 eπ,λ,h](s) and, by Bellman equations V eπ,λ,h(s) V πt+1 eπ,λ,h(s) πt+1 h h Q eπ,λ,h Q πt+1 eπ,λ,h i (s) + 1 2λ Q t h Q eπ,λ,h 2 (s) πt+1 h ph h V eπ,λ,h+1 V πt+1 eπ,λ,h+1 i (s) + 1 2λ Q t h Q eπ,λ,h 2 (s). Next, we start by changing the norm and using the properties of Q and Q (see Corollary 5) Q t h Q eπ,λ,h 2 (s) = max a A Q t h(s, a) Q eπ,λ,h(s, a) 2 max a A Q t h(s, a) Qt h(s, a) 2 . Published as a conference paper at ICLR 2024 Theorem 6. Let ε > 0,δ (0, 1). Then LSVI-UCB-Ent algorithm with a choice of parameters described in (17) is (ε, δ)-PAC for the best policy identification in regularized MDPs after T = e O H5d2 Proof. First, we notice that the definition of the output policy bπ as a mixture policy over all πt allows us to define the following V eπ,λ,1(s1) V bπ eπ,λ,1(s1) = 1 i=1 {V eπ,λ,1(s1) V πt eπ,λ,1(s1)}, therefore it is enough to compute only the average regret of the presented procedure. In the sequel we assume the event G(δ, B) for B defined in (17). Step 1. Study of sub-optimality gap Let us fix t [T], then by Lemma 18 we have V eπ,λ,1(s1) V πt+1 1 (s1) 1 h=1 E πt+1 max a A 2 (sh, a) . Next, we analyze each term separately, starting from the difference between Q-values inside. Let us fix h [H], then by Proposition 3 Q t h(s, a) Qt h(s, a) = [Q t h Q eπ,λ,h](s, a) [Qt h Q eπ,λ,h](s, a) V t h+1 V t h+1 i (s, a) + 4B ψ(s, a) [Λt h] 1. (18) Next, we have for any h [H] and any s S h V t h V t h i (s ) πt+1 h h Q t h Qt h i (s ), therefore we can roll-out the equation (18) and obtain Q t h(s, a) Qt h(s, a) 4BE πt+1 h =h ψ(sh , ah ) [Λt h ] 1 (sh, ah) = (s, a) Next, we apply Lemma 35 for any fixed h [H] E πt+1 max a A 2 (sh, a)|s1 = Eπt+1,(h) 2 (sh, ah)|s1 and for any h h E πt+1 ψ(sh , ah ) [Λt h ] 1 (sh, ah) = (s, a) = Eπt+1,(h) ψ(sh , ah ) [Λt h ] 1 (sh, ah) = (s, a) . 
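The only analytic property of F_{π̃_h(s),λ,h} used in these arguments is that it is convex, has gradient in the probability simplex, and is 1/λ-smooth with respect to the ℓ∞-norm over actions, which produces the quadratic error term ‖Q̄ − Q*‖²(s)/(2λ). The sketch below checks this smoothness inequality numerically on random instances (our notation; this is the inequality invoked in the proofs of Lemmas 15 and 18):

```python
import numpy as np

rng = np.random.default_rng(4)

def F(q, ref, lam):
    """F_{ref,lam}(q) = max_pi <pi, q> - lam*KL(pi||ref) = lam * log sum_a ref_a exp(q_a/lam)."""
    m = q.max()
    return lam * np.log(np.sum(ref * np.exp((q - m) / lam))) + m

def grad_F(q, ref, lam):
    """The gradient of F is the maximizing policy, hence it lies in the simplex."""
    w = ref * np.exp((q - q.max()) / lam)
    return w / w.sum()

# Numerical check of the smoothness bound:
#   F(q1) <= F(q2) + <grad F(q2), q1 - q2> + max_a |q1 - q2|^2 / (2*lam).
lam, A = 2.0, 5
ref = rng.dirichlet(np.ones(A))
for _ in range(1000):
    q1, q2 = rng.uniform(0, 10, A), rng.uniform(0, 10, A)
    lhs = F(q1, ref, lam)
    rhs = F(q2, ref, lam) + grad_F(q2, ref, lam) @ (q1 - q2) \
          + np.max(np.abs(q1 - q2)) ** 2 / (2 * lam)
    assert lhs <= rhs + 1e-9
print("smoothness inequality verified on random instances")
```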
Therefore, applying Jensen s inequality to conditional measure and the tower property of conditional expectation E πt+1 max a A 2 (sh, a)|s1 16B2Eπt+1,(h) h =h ψ(sh , ah ) [Λt h ] 1 h =h Eπt+1,(h) h ψ(sh , ah ) 2 [Λt h ] 1|s1 i h =1 Eπt+1,(h) h ψ(sh , ah ) 2 [Λt h ] 1|s1 i . Published as a conference paper at ICLR 2024 Summing over all h and recalling eπt+1 as a mixture policy of all πt+1,(h) E πt+1 max a A 2 (sh, a)|s1 h=1 Eπt+1,(h) h ψ(sh , ah ) 2 [Λt h ] 1|s1 i = 16B2H2 H X h =1 Eπmix,t+1 h ψ(sh , ah ) 2 [Λt h ] 1|s1 i . Step 2. Summing sub-optimality gaps Next, we sum all sub-optimality gaps and obtain t=1 {V eπ,λ,1(s1) V πt eπ,λ,1(s1)} H + t=1 {V eπ,λ,1(s1) V πt+1 1 (s1)} t=1 Eπmix,t+1 h ψ(sh , ah ) 2 [Λt h ] 1|s1 i . Next, we apply the definition of event Ecnt(δ) 2Λ t h βcnt(δ, T)Id [Λt h] 1 2 h Λ t h 2βcnt(δ, T)Id i 1 . Notice that by the choice of α = 2(βcnt(δ, t) + 1) we have eΛt h Λ t h 2βcnt(δ, T)Id = 2Id + τ=1 Eπmix,τ [ψ(sh, ah)[ψ(sh, ah)] T]. Thus, we may apply Lemma 31 t=1 Eeπt+1 h ψ(sh , ah ) 2 [Λt h ] 1|s1 i 2 t=1 Eπmix,t+1 ψ(sh , ah ) 2 [eΛt h ] 1|s1 4 log det eΛT h . To upper bound the determinant, we upper bound the operator norm using the triangle inequality τ=1 Eeπτ [ψ(sh, ah)[ψ(sh, ah)] T] 2 2 + (T 1), therefore, combining with a definition of B given in (17) we have t=1 {V eπ,λ,1(s1) V πt eπ,λ,1(s1)} H + 64 322d2H5 log(T + 1) λ log 24ed HT V eπ,λ,1(s1) V bπ eπ,λ,1(s1) H T + 64 322d2H5 log(T + 1) λT log 24ed HT In particular, this implies that T = e O H5d2 is enough to obtain ε-accurate policy. Published as a conference paper at ICLR 2024 F DEMONSTRATION-REGULARIZED PREFERENCE-BASED LEARNING F.1 MAXIMUM LIKELIHOOD ESTIMATION FOR REWARD MODEL In this section, we discuss the maximum likelihood estimation problem for the reward estimation, following Zhan et al. (2023a). Let G be a function class of reward functions that satisfies the following assumption Assumption 6. For a function class G we assume that the true reward belongs to it: r G. Additionally, for the following function family Q = {qr(τ0, τ1) = σ(r(τ1) r(τ0)) : r G} equipped with an ℓ -norm has a finite bracketing dimension, that means there is a dr > 0 and Rr > 0 such that ε (0, 1) : log N[](ε, Q, ) d G log(RG/ε). The bracketing numbers are commonly used in statistics for MLE, M-estimation, and, more generally, in the empirical processes theory, see van de Geer (2000). Related to our setting, this assumption is satisfied in the setting of tabular MDPs with a dimension d G = SAH and linear MDPs with dimensions d G = d H, see Lemmas 19-20. For each r G and a pair of trajectories (τ0, τ1) define the induced preference model as follows qr(τ0, τ1) σ(r(τ1) r(τ0)). To measure the complexity of the reward class G we will use the bracketing numbers of the function class Q = {qr : T T [0, 1] : r Fr}, where T = (S A)H is a space of all trajectories. See Definition 7 for the definition of bracketing numbers. Given the dataset of preferences DRM = {(τ k 0 , τ k 1 , ok)}NRM k=1 , we define the maximum likelihood estimate of the reward model as follows ˆr = arg max r G k=1 ok log qr(τ k 0 , τ k 1 ) + (1 ok) log(1 qr(τ k 0 , τ k 1 )) The following result is a standard result on MLE estimation and, generally, M-estimation, see van de Geer (2000). The proof heavily uses PAC-Bayes techniques by Zhang (2006), see also Agarwal et al. (2020) for non-i.i.d. extension. We notice that a similar result could be extracted from Lemma 2 by Zhan et al. 
(2023a), however, we did not find the proof of exactly this statement in their paper or references within. Proposition 4. Let Assumptions 4-6 hold. Let DRM be a preference dataset and assume that trajectories τ k 0 and τ k 1 were generated i.i.d. by following the policy π. Then for any δ (0, 1) with probability at least 1 δ the following bound for the MLE reward estimate ˆr given by solution to 19 holds Eτ0,τ1 qπ h q (τ0, τ1) qˆr(τ0, τ1) 2i 2 + 2d G log(RG/N RM) + log(1/δ) where qπ is a distribution over trajectories induced by policy π Proof. In the sequel, we drop the superscript from DRM and N RM to simplify the notation. Let us consider a maximal set of ε-brackets B of size N[](ε, Q, ) and apply Lemma 2.1 by Zhang (2006) (or Lemma 21 by Agarwal et al. (2020)), where consider brackets [ℓ, u] from B as parameters θ, prior π is a uniform over B, and the density w D([ℓ, u]) is equal to a (properly weighted) Dirac measure on a bracket [ℓ , u ] that contains the MLE estimate (ties are resolved arbitrary). It implies that for any function L: B D R it holds ED h exp n L(D, [ℓ , u ]) log ED h e L(D ,[ℓ ,u ])i log N[](ε, Q, ) oi 1 , where D is an independent copy of the dataset D and the KL-divergence between a Dirac measure on an MLE bracket and the uniform distribution is computed exactly. Since we can control the exponential moment, by a simple Chernoff argument we have with probability at last 1 δ log ED h e L(D ,[l ,u ])i L(D, [l , u ]) + log N[](ε, Q, ) + log(1/δ). Published as a conference paper at ICLR 2024 Next we choose L(D, [ℓ, u]) as log-likelihood ratio L(D, [ℓ, u]) = 1 ok log q (τ k 0 , τ k 1 ) + (1 ok) log(1 q (τ k 0 , τ k 1 )) ok log u(τ k 0 , τ k 1 ) + (1 ok) log(1 ℓ(τ k 0 , τ k 1 )) By the choice of a brackets [ℓ, u] it holds ℓ (τ0, τ1) qˆr(τ0, τ1) u (τ0, τ1). Using the fact that ˆr is a solution to (19) L(D, [ℓ , u ]) = 1 ok log q (τ k 0 , τ k 1 ) + (1 ok) log(1 q (τ k 0 , τ k 1 )) ok log u(τ k 0 , τ k 1 ) + (1 ok) log(1 ℓ(τ k 0 , τ k 1 )) ok log q (τ k 0 , τ k 1 ) + (1 ok) log(1 q (τ k 0 , τ k 1 )) ok log qˆr(τ k 0 , τ k 1 ) + (1 ok) log(1 qˆr(τ k 0 , τ k 1 )) At the same time log ED h e L(D ,[l ,u ])i = N log E exp 1 2 o log q (τ0, τ1) + (1 o) log(1 q (τ0, τ1)) o log u (τ0, τ1) + (1 o) log(1 ℓ (τ0, τ1)) where in the last expectation τ0, τ1 qπ, o Ber(q (τ0, τ1)). By Fubini s theorem, we have N log ED h e L(D ,[l ,u ])i = log Eτ0,τ1 q (τ0, τ1)u(τ0, τ1) (1 q (τ0, τ1))(1 ℓ(τ0, τ1)) . Next, we study the expression under the square root. By the definition of the ε-bracket we have ℓ (τ0, τ1) qˆr(τ0, τ1) u (τ0, τ1) and u (τ0, τ1) ℓ (τ0, τ1) ε, therefore p q (τ0, τ1)u(τ0, τ1) p q (τ0, τ1)(qˆr(τ0, τ1) + ε) p q (τ0, τ1)qˆr(τ0, τ1) + ε. and the similar bound for the second term. Applying inequality log(x) 1 x we have N log ED h e L(D ,[l ,u ])i 1 Eτ0,τ1 q (τ0, τ1)qˆr(τ0, τ1) (1 q (τ0, τ1))(1 qˆr(τ0, τ1)) 2 ε. By the properties of the Hellinger distance d H (see Section 2.4 and Lemma 2.3 by Tsybakov 2008) we have q (τ0, τ1)qˆr(τ0, τ1) + p (1 q (τ0, τ1))(1 qˆr(τ0, τ1)) 2Eτ0,τ1 d2 H(Ber(q (τ0, τ1)), Ber(qˆr(τ0, τ1)) Eτ0,τ1 h (q (τ0, τ1) qˆr(τ0, τ1))2i . Overall, we obtain Eτ0,τ1 h (q (τ0, τ1) qˆr(τ0, τ1))2i 2 ε + log N[](ε, Q, ) + log(1/δ) Taking ε = 1/N 2 and applying the upper bound on the bracketing number by Assumption 6 we conclude the statement. And also we have a simple corollary of this result that shows convergence of the reward models. Theorem 7. Let Assumptions 4-6 hold. Let DRM be a preference dataset and assume that trajectories τ k 0 and τ k 1 were generated i.i.d. 
by following the policy π. Then for any δ (0, 1) with probability at least 1 δ the following bound holds Eτ0,τ1 qπ h r (τ1) r (τ0) ˆr(τ1) + ˆr(τ1) 2i 2ζ2d G log(RG/N RM) + ζ2 log(e2/δ) where ζ = 1/(infx [ H,H] σ (x)) the non-linearity measure of the link function σ defined in Assumption 4. Published as a conference paper at ICLR 2024 Remark 7. We additionally notice that since two trajectories are i.i.d., we have Eτ0,τ1 qπ h [r (τ1) r (τ0)] [ˆr(τ1) ˆr(τ0)] 2i = 2Varqπ[r (τ1) ˆr(τ1)]. In particular, it means that if the reward function is estimated up to a constant shift, then the MLE estimation error is zero. Remark 8. In the setting of a sigmoid link function σ(x) = 1/(1 + exp( x)) we have ζ = exp{Θ(H)}, yields the exponential dependence on the reward scaling. This could be avoided by using a more problem-dependent way to obtain a bound on rewards given bound on the induced preference model. Proof. Follows from the application of Proposition 4 and the following application of mean-value theorem for any two fixed τ0, τ1 [r (τ1) r (τ0)] [ˆr(τ1) ˆr(τ0)] = σ 1(q (τ0, τ1)) σ 1(qˆr(τ0, τ1)) = (σ 1) (ξ)[q (τ0, τ1) qˆr(τ0, τ1)], where ξ is a point between q (τ0, τ1) and qˆr(τ0, τ1). The observation (σ 1) (ξ) = 1/σ (σ 1(ξ)) yields the statement. Next, we compute the required quantities d G for the case of finite and linear MDPs for a choice of σ = 1/(1 + exp( x) as a sigmoid function. Lemma 19. Let a reward function {rh(s, a)}h [H] be an arbitrary function rh : S A [0, 1]. Let us define G = {r(τ) = PH h=1 rh(sh, ah)}. Then Assumption 6 holds with constants d G = HSA and RG = 3H/2. Proof. Let us define a functional class of interest Q = {qr(τ1, τ2) = σ(r(τ1) r(τ2)) | r G} Since σ is a monotonically increasing function that satisfies σ (x) 1/4. Thus, by combination of Lemma 21 and Lemma 22 we have N[](ε, Q, ) N[](2ε, G, ). Next we define a function classes Fh = {rh : S A [0, 1]} of one-step rewards. By Lemma 23 it holds N[](2ε, G, ) h=1 N[](2ε/H, Fh, ). Then we can associate a function space Fh with a parameters Θh = [0, 1]SA and by Lemma 24 and a standard results in bounding of covering numbers of balls in normed spaces, see van Handel (2016), N[](ε, Fh, ) N(ε/2, [0, 1]SA, ) (3/ε)SA. As a result, we have log N[](ε, Q, ) HSA log(3H/(2ε)). Lemma 20. Let a reward function {rh(s, a)}h [H] be parametrized as rh(s, a) = ψ(s, a)Tθh for ψ: S A Rd that satisfies ψ(s, a) 2 1, and θh Θh for Θh = {θh Rd | θh d}. Let us define G = {r(τ) = PH h=1 rh(sh, ah)}. Then Assumption 6 holds with constants d G = d H and RG = 3H Proof. Let us define one-step rewards as follows Fh = {rh(s, a) = ψ(s, a)Tθh | θh Rd, θh d}. Then following exactly the same reasoning as in Lemma 19 N[](ε, Q, ) h=1 N[](2ε/H, Fh, ). Published as a conference paper at ICLR 2024 Next we notice that Fh is Lipschtiz in ℓ2-norm with respect to parameters θh with a constant 1: |rh(s, a) r h(s, a)| = |ψ(s, a) T(θh θ h)| θh θ h 2. Therefore, since Θh is a ball of radius d we obtain N[](ε, Fh, ) N(ε/2, Θh, 2) (3 As a result, we have log N[](ε, Q, ) d H log(3H F.2 PROPERTIES OF BRACKETING NUMBERS In this section, we provide a list of elementary properties of bracketing numbers for completeness. See Section 7 of Dudley (2014) for additional information. Lemma 21. Let (G, ) be a normed space of functions over a set X that takes values in the interval I and let σ: I [0, 1] be a monotonically increasing link function that satisfies supx I σ (x) C for C > 0. Then for F = σ G = {f = σ g | g G} it holds for any ε > 0 N[](ε, F, ) N[](ε/C, G, ). Proof. 
Let G be a minimal set of ε brackets that covers G. Then let us take a set of brackets F = {[σ ℓ, σ u] : [l, u] G}. The set of F consists of brackets since σ is monotone, and it covers all the space F. Additionally, we have |σ u(x) σ ℓ(x)| = σ (ξ)|(u(x) ℓ(x))| Cε, therefore the set F consists of Cε-brackets. By rescaling we conclude the statement. Lemma 22. Let (G, ) be a normed space of functions over a set X and let us define a set of functions over X X as F = {f(x, x ) = g(x) g(x ) | g G}. Then it holds N[](ε, F, ) N[](ε/2, G, ). Proof. Let G be a minimal set of ε brackets that covers G. Let us define the following set of brackets F = {[fℓ(x, x ) = ℓ(x) u(x ), fu(x, x ) = u(x) ℓ(x )] | [ℓ, u] G}. We may check that elements cover all the set F. Let us take f F and the corresponding g G. Then let us take a bracket [ℓ, u] that covers g. In this case, we have fℓ(x, x ) = ℓ(x) u(x ) g(x) g(x ) u(x) ℓ(x ) = fu(x, x ). At the same time, we have fℓ fu 2 u ℓ 2ε. Thus, F is a set of 2ε-brackets that covers F. Lemma 23. Let {Gk}K k=1 be a sequence of spaces of functions over a set X equipped with a norm and let us define a set F of functions over X K as follows f(x1, . . . , xk) = k=1 gk(xk) | gk G Then the following bound holds N[](ε, F, ) k=1 N[](ε/K, G, ). Published as a conference paper at ICLR 2024 Proof. For any k [K] let Gk be a minimal set of ε brackets that covers Gk. Then we construct the set F as follows k=1 ℓk(xk), fu(x) = | k [K] : [ℓk, uk] Gk It holds that |F| QK k=1 |Gk| and also it is clear that F consists of brackets and covers all the set F. Additionally, we notice that k=1 ℓk uk Kε. By rescaling we conclude the statement. Lemma 24. Let Θ be a set of parameters and let F = {fθ(x) | θ Θ} be a set of functions over X. Assume that fθ(x) is L-Lipschitz in θ with respect to an arbitrary norm : x X : |fθ(x) fθ (x)| L θ θ . Then we have N[](ε, F, ) N(ε/(2L), Θ, ). Proof. Let X be a ε-covering of Θ and let us define a set F = {[ℓ= fθ εL, u = fθ+εL] | θ X}. Let us show that F consists of brackets. Let fθ F, then there is ˆθ X. By Lipchitzness x X : |fθ(x) fˆθ(x)| L θ ˆθ = Lε, therefore a bracket that corresponding ℓand u indeed satisfy ℓ(x) fθ(x) u(x) for any x X. Also, we notice that ℓ u = 2εL, therefore by rescaling we conclude the statement. F.3 PROOF FOR DEMONSTRATION-REGULARIZED RLHF During this section we assume that the MDP is finite, i.e. |S| < + to simplify the manipulations with the trajectory space. However, the state space could be arbitrarily large. In this section, we provide the proof for the demonstration-regularized RLHF pipeline defined in Algorithm 2. We start from the general oracle version of this inequality. For a reward function r let the value with respect to this value be defined as follows V π h (s; r) = Eπ h =h rh (sh , ah ) | sh = s , V π eπ,λ,h(s; r) = V π h (s; r) λ KLtraj(eπ π). A similar definition holds for Q-values. Additionally, let us define the following coefficient defined by Zhan et al. (2023a) that additionally shows the closeness of πE and πBC in terms of the family of reward functions. Cr(G, πE, πBC) max Eτ0 πE,τ1 πBC[r (τ0) r (τ1) (r(τ0) r(τ1))] r Eτ0,τ1 πBC h (r (τ0) r (τ1) (r(τ0) r(τ1)))2i We notice that the denominator could be written as Eτ 0,τ 1 πBC h (r (τ0) r (τ1) (r(τ0) r(τ1)))2i = 2VarqπBC [r ˆr]. Theorem 8. Let us assume that there is an underlying reward function r such that 1. There is an expert policy πE such that V 1 (s1; r ) V πE 1 (s1; r ) εE; Published as a conference paper at ICLR 2024 2. 
There is a behavior cloning policy πBC that satisfies q KLtraj(πE πBC) εKL; 3. There is an estimate of reward function ˆr that satisfies q VarqπBC [r (τ) ˆr(τ)] εRM; Let πRL be a εRL-optimal policy in the λ-regularized MDP for πBC and using ˆr as rewards: V πBC,λ,1(s1; ˆr) V πRL πBC,λ,1(s1; ˆr) εRL. Then we have the following optimality guarantees for πRL in the unregularized MDP equipped with the true reward function V 1 (s1; r ) V πRL 1 (s1; r ) ε2 RM λ + 2εE + 2εRL + 2λε2 KL + 3εRMεKL + 4H Moreover, if we assume that Cr(G, πE, πBC) is finite, we have V 1 (s1; r ) V πRL 1 (s1; r ) ε2 RM λ + 2εE + 2εRL + 2λε2 KL + 3Cr(G, πE, πBC) εRM. Additionally, under the choice λ as a positive solution to the equation 2ε2 KL(λ )2 = 2λεRL + ε2 RM we have the following bound V 1 (s1; r ) V πRL 1 (s1; r ) 2εE + 4εRL + 6εKLεRM + min 2Hε2 KL, Cr εRM , where Cr = Cr(G, πE, πBC) is a concentrability coefficient. Proof. We start from the following decomposition, using the assumption on the expert policy V 1 (s1; r ) V πRL 1 (s1; r ) V πE 1 (s1; r ) V πRL 1 (s1; r ) + εE. Next, we change the reward function as follows V πE 1 (s1; r ) V πRL 1 (s1; r ) = V πE 1 (s1; ˆr) V πRL 1 (s1; ˆr) | {z } (A) + V πE 1 (s1; r ˆr) V πRL 1 (s1; r ˆr) | {z } (B) We start from the analysis of the term (A). By properties of the behavior cloning policy and the BPI policy πRL (A) = V πE πBC,λ,1(s1; ˆr) + λ KLtraj(πE πBC) V πRL πBC (s1; ˆr) λ KLtraj(πRL πBC) V πBC,λ,1(s1; ˆr) V πRL πBC (s1; ˆr) + λε2 KL εRL + λε2 KL. Next, we have to analyze the second term (B). We decompose it as follows (B) = V πE 1 (s1; r ˆr) V πBC 1 (s1; r ˆr) | {z } (C) + V πBC 1 (s1; r ˆr) V πRL 1 (s1; r ˆr) | {z } (D) We start from the analysis of (D) because the analysis of term (C) depends on the concentrability assumption. For the term (D) we can apply the second part of Lemma 32 since the space of the trajectories is finite (D) = EqπBC [r (τ) ˆr(τ)] EqπRL[r (τ) ˆr(τ)] 2VarqπBC [r (τ) ˆr(τ)] KLtraj(πRL πBC) q ε2 RM KLtraj(πRL πBC), Published as a conference paper at ICLR 2024 where we applied an assumption on the reward estimate. Next, we have to estimate the trajectory KL-divergence between πRL and πBC. Let us apply the ε-optimality of the policy πRL with respect to reward ˆr in the regularized MDP V πE 1 (s1; ˆr) λ KLtraj(πE πBC) V πRL 1 (s1; ˆr) + εRL λ KLtraj(πRL πBC). By rerranging the terms we have KLtraj(πRL πBC) 1 V πRL 1 (s1; ˆr) V πE 1 (s1; ˆr) + εRL + ε2 KL V πRL 1 (s1; ˆr r ) V πE 1 (s1; ˆr r ) + ε2 KL V πE 1 (s1; r ˆr) V πRL 1 (s1; r ˆr) + ε2 KL + εE + εRL We notice that (B) V πE 1 (s1; r ˆr) V πRL 1 (s1; r ˆr), then we have the following recursion λ ((B) + (εE + εRL + λε2 KL)). Let us denote t2 = (B)+εE +εRL +λε2 KL , a = p ε2 RM/λ, b = (C)+εE +εRL +λε2 KL. Then we have the standard quadratic inequality in t, the maximal solution of which could be upper bounded as t2 a2 + 2b which implies λ + εE + εRL + λε2 KL + 2(C). Next, we analyze the term (C). Without concentrability assumption By the first part of Lemma 32 we have (C) = EqπE[r (τ) ˆr(τ)] EqπBC[r (τ) ˆr(τ)] 2VarqπBC KLtraj(πE πBC) + 2H 3 KLtraj πE πBC 2 εRMεKL + 2H Overall, we have the final rates V 1 (s1; r ) V πRL 1 (s1; r ) ε2 RM λ + 2εE + 2εRL + 2λε2 KL + 3εRMεKL + 4H With concentrability assumption Now we assume that Cr(G, πE, πBC) is finite. In this situation, we can upper bound (C) as follows (C) = Eτ0 qπE,τ1 qπBC [r (τ0) r (τ1) + ˆr(τ1) ˆr(τ0)] 2C2r(G, πE, πBC) VarqπBC [r ˆr] 2 Cr(G, πE, πBC) εRM. 
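For reference, the following small helper transcribes the statement of Theorem 8: it computes λ* as the positive root of 2ε²_KL λ² = 2λ ε_RL + ε²_RM (whose closed form is derived at the end of the proof) and the resulting sub-optimality bound. The numerical inputs are illustrative only.

```python
import math

def demo_reg_rlhf_bound(eps_E, eps_RL, eps_KL, eps_RM, H, C_r=math.inf):
    """Regularization level lambda* and the sub-optimality bound of Theorem 8
    (a direct transcription of the statement; C_r = +inf means no concentrability assumption)."""
    # Positive root of 2*eps_KL^2*lam^2 = 2*lam*eps_RL + eps_RM^2.
    lam_star = (eps_RL + math.sqrt(eps_RL**2 + 2 * eps_KL**2 * eps_RM**2)) / (2 * eps_KL**2)
    bound = 2 * eps_E + 4 * eps_RL + 6 * eps_KL * eps_RM + min(2 * H * eps_KL**2, C_r * eps_RM)
    return lam_star, bound

# Illustrative error levels, e.g. decaying with the number of expert demonstrations.
print(demo_reg_rlhf_bound(eps_E=0.01, eps_RL=0.02, eps_KL=0.05, eps_RM=0.05, H=10))
```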
Overall, we have the final rates V 1 (s1; r ) V πRL 1 (s1; r ) ε2 RM λ + 2εE + 2εRL + 2λε2 KL + 3Cr(G, πE, πBC) εRM. Under the choice of λ To show the last part of the statement, we choose λ as a solution to the following quadratic equation 2(λ )2ε2 KL = 2λ εRL + ε2 RM. In particular, its positive solution is equal to λ = εRL + p ε2 RL + 2ε2 KLε2 RM 2ε2 KL . Applying firstly the quadratic formula and then the exact formula above we have ε2 RM λ + 2εRL + 2λ ε2 KL = 4λ ε2 KL = 2εRL + 2 q ε2 RL + 2ε2 KLε2 RM 4εRL + 3εKLεRM. Combining this bound with the bound on performance we have V 1 (s1; r ) V πRL 1 (s1; r ) 2εE + 4εRL + 6εKLεRM + min 2Hε2 KL, Cr εRM , where Cr = Cr(G, πE, πBC) is a concentrability coefficient. Published as a conference paper at ICLR 2024 G DEVIATION INEQUALITIES G.1 DEVIATION INEQUALITY FOR CATEGORICAL DISTRIBUTIONS Next, we state the deviation inequality for categorical distributions by Jonsson et al. (2020, Proposition 1). Let (Xt)t N be i.i.d. samples from a distribution supported on {1, . . . , m}, of probabilities given by p m 1, where m 1 is the probability simplex of dimension m 1. We denote by bpn the empirical vector of probabilities, i.e., for all k {1, . . . , m}, ℓ=1 1{Xℓ= k}. Note that an element p m 1 can be seen as an element of Rm 1 since pm = 1 Pm 1 k=1 pk. This will be clear from the context. Theorem 9. For all p m 1 and for all δ [0, 1], P( n N , n KL(bpn, p) > log(1/δ) + (m 1) log(e(1 + n/(m 1)))) δ. G.2 DEVIATION INEQUALITY FOR SEQUENCE OF BERNOULLI RANDOM VARIABLES Below, we state the deviation inequality for Bernoulli distributions by Dann et al. (2017, Lemma F.4). Let Ft for t N be a filtration and (Xt)t N be a sequence of Bernoulli random variables with P(Xt = 1|Ft 1) = Pt with Pt being Ft 1-measurable and Xt being Ft-measurable. Theorem 10. For all δ > 0, t=1 Pt/2 log 1 G.3 DEVIATION INEQUALITY FOR BOUNDED DISTRIBUTIONS Below, we state the self-normalized Bernstein-type inequality by Domingues et al. (2021b). Let (Yt)t N , (wt)t N be two sequences of random variables adapted to a filtration (Ft)t N. We assume that the weights are in the unit interval wt [0, 1] and predictable, i.e. Ft 1 measurable. We also assume that the random variables Yt are bounded |Yt| b and centered E[Yt|Ft 1 ] = 0. Consider the following quantities s=1 ws Ys, Vt s=1 w2 s E Y 2 s |Fs 1 , and Wt and let h(x) (x + 1) log(x + 1) x be the Cram er transform of a Poisson distribution of parameter 1. Theorem 11 (Bernstein-type concentration inequality). For all δ > 0, P t 1, (Vt/b2 + 1)h b|St| log(1/δ) + log(4e(2t + 1)) δ. The previous inequality can be weakened to obtain a more explicit bound: if b 1 with probability at least 1 δ, for all t 1, 2Vt log(4e(2t + 1)/δ) + 3b log(4e(2t + 1)/δ) . G.4 DEVIATION INEQUALITY FOR VECTOR-VALUED SELF-NORMALIZED PROCESSES Next, we state Lemma D.4 by Jin et al. (2020). For any symmetric positive definite matrix A we define x A = x TAx. Lemma 25 (Jin et al. (2020)). Let {sτ} τ=1 be a stochastic process on the state space S adapted to a filtration {Fτ} τ=0. Let {Xτ} τ=0 be an Rd-valued stochastic process where ψτ is F-predictable (Xτ is Fτ 1 measurable) and Xτ 1. Let Λt = αId + Pt τ=1 Xt X T t and let V be a family of Published as a conference paper at ICLR 2024 function over the state-space S such that V V, s S : 0 V (s) H. Then for any δ (0, 1) with probability at least 1 δ τ=1 Xτ{V (sτ) Eτ 1[V (sτ)]} 2 log t + α where Nε is a minimal ε-cover of V with respect to the distance ρ(V, V ) = sups S |V (s) V (s)|. 
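As a quick empirical illustration of Theorem 9, the sketch below estimates, for a fixed n, the probability that n·KL(p̂_n, p) exceeds the threshold log(1/δ) + (m−1) log(e(1 + n/(m−1))); since the theorem is uniform over n, the observed frequency at any single n should be (much) smaller than δ. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def kl_categorical(p_hat, p):
    mask = p_hat > 0
    return np.sum(p_hat[mask] * np.log(p_hat[mask] / p[mask]))

# Monte Carlo check of the categorical deviation bound of Theorem 9 at a fixed n:
#   P( n * KL(p_hat_n, p) > log(1/delta) + (m-1)*log(e*(1 + n/(m-1))) ) <= delta.
m, n, delta = 4, 200, 0.1
p = rng.dirichlet(np.ones(m))
threshold = np.log(1 / delta) + (m - 1) * np.log(np.e * (1 + n / (m - 1)))
violations = 0
for _ in range(5000):
    counts = rng.multinomial(n, p)
    if n * kl_categorical(counts / n, p) > threshold:
        violations += 1
print(violations / 5000, "should be well below", delta)
```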
Also, we state the result on the covering dimension of the class of bonus function, see Lemma D.6 by Jin et al. (2020) for a similar result. Lemma 26. Let V be a class of functions over S such that s S : V (s) = clip max π A n π h w Tψ(s, ) + β p ψ(s, )TΛ 1ψ(s, ) i λΦh,s(π) o , 0, H is parameterized by a tuple (w, β, Λ) such that w L, β [0, B], λmin(Λ) α. Assume ψ(s, a) 1 for any (s, a) S A. Then the covering number |Nε| of the function space V with respect to the distance ρ(V, V ) = sups S |V (s) V (s)| satisfies log N(ε, V, ρ) d log 1 + 4L + d2 log 1 + 8d1/2B2 Proof. Following the approach of Jin et al. (2020), we reparametrize the following set by setting A = β2Λ 1 and obtain V (s) = clip max π A n π h w Tψ(s, ) + p ψ(s, )TAψ(s, ) i λΦh,s(π) o , 0, H , for w 2 L, A 2 B2α 1. Next, let V1, V2 V be two functions that corresponds to parameters (w1, A1) and (w2, A2). Then, using non-expanding property of clip( , 0, H) and maxπ A{ } we have ρ(V1, V2) sup s S max π A π (w T 1ψ(s, ) + p ψ(s, )TA1ψ(s, )) (w T 2ψ(s, ) + p ψ(s, )TA2ψ(s, )) sup s,a S A [w1 w2] Tψ(s, a) + p ψ(s, a)TA1ψ(s, a)) p ψ(s, a)TA2ψ(s, a) sup ψ: ψ 1 | w1 w2, ψ | + sup ψ: ψ 1 |ψT(A1 A2)ψ| w1 w2 2 + p The rest of the proof follows Lemma D.6 by Jin et al. (2020) and uses the result on covering numbers of Euclidean balls in Rd. G.5 DEVIATION INEQUALITY FOR SAMPLE COVARIANCE MATRICES The following result generalizes Theorem 10 in the case of linear MDPs and generalized counters. Let {Xt} t=1 be a sequence of random vectors of dimension d adapted to a filtration {Ft} t=1 such that Xt 2 1 a.s.. Define a sequence of positive semi-definite matrices At = E[Xt X T t |Ft 1]. Notice that At 2 = σmax(At) 1. Also define Λt = λId + Pt j=1 Xj X T j and Λt = λId + Pt j=1 Aj. Lemma 27. Let δ (0, 1). Then the following event Ecnt (δ) = { t 1 : Λt 1 2Λt β(δ, t)Id} under the choice β(δ, t) = 4 log(4e(2t+1)/δ)+4d log(3t)+3 holds with probability at least 1 δ. Published as a conference paper at ICLR 2024 Proof. Let us fix a vector v from a unit sphere v 2 = 1. Next, note that v TΛtv v TΛtv = j=1 v, Xj 2 E v, Xj 2|Fj 1 Notice that Mj is a martingale-difference sequence that satisfied | Mj| 2 and j=1 E M 2 j |Fj 1 = j=1 E M 2 j |Fj 1 j=1 E v, Xj 4|Fj 1 v TΛtv. Therefore, applying self-normalized Bernstein inequality (Theorem 11) we have that with probability at least 1 δ for all t 1 |v TΛtv v TΛtv| q 2v TΛtv log(4e(2t + 1)/δ) + 3 log(4e(2t + 1)/δ). Next, inequality 2ab a2 + b2 implies that |v TΛtv v TΛtv| 1 2v TΛtv + 4 log(4e(2t + 1)/δ). Let us denote by Nε a ε-net over a unit sphere of dimension d. By union bound over this net we have with probability at least 1 δ ˆv Nε : |ˆv TΛtˆv v TΛtˆv| 1 2 ˆv TΛtˆv + 4 log(4e(2t + 1)/δ) + 4 log(|Nε|). Let v be an arbitrary vector on a unit sphere and let ˆv Nε be the closest vector to v in the ε-net. Then for any matrix A Rd d we have |v TAv ˆv TAˆv| |v TA(v ˆv)| + |(v ˆv) TAˆv| 2ε A 2. Next we notice that Λt Λt 2 2t and Λt t. Then we have v Sd : |v TΛtv vΛtv| 1 2v TΛtv + 4 log(4e(2t + 1)/δ) + 4 log(|Nε|) + 3tε. Finally, we have the upper bound on covering number |Nε| (3/ε)d and, taking ε = 1/t for all fixed t 1 we have v Sd : |v TΛtv vΛtv| 1 2v TΛtv + 4 log(4e(2t + 1)/δ) + 4d log(3t) + 3. Thus, with probability at least 1 δ we have for all t 1 3 2Λt + β(δ, t)Id Λt 1 2Λt β(δ, t)Id. where β(δ, t) = 4 log(4e(2t + 1)/δ) + 4d log(3t) + 3. Published as a conference paper at ICLR 2024 H TECHNICAL LEMMAS H.1 COUNTS TO PSEUDO-COUNTS Here we state Lemma 8 and Lemma 9 by M enard et al. (2021). Lemma 28. 
On event Ecnt, for any β(δ, ) such that x 7 β(δ, x)/x is non-increasing for x 1, x 7 β(δ, x) is non-decreasing h [H], (s, a) S A, t N , β(δ, nt h(s, a)) nt h(s, a) 1 4β(δ, nt h(s, a)) nt h(s, a) 1 Lemma 29. For T N and (ut)t N , for a sequence where ut [0, 1] and Ut Pt l=1 uℓ, we get ut+1 Ut 1 4 log(UT +1 + 1). H.2 COUNTS TO PSEUDO-COUNTS IN LINEAR MDPS Let {Xt} t=1 be a sequence of random vectors of dimension d adapted to a filtration {Ft} t=1 such that Xt 2 1 a.s.. Define a sequence of positive semi-definite matrices At = E[Xt X T t |Ft 1]. Notice that At 2 = σmax(At) 1. Also define Λt = λId + Pt j=1 Xj X T j and Λt = λId + Pt j=1 Aj. Lemma 30. Let A 0 be a positive semi-definite matrix such that A 2 1. Then log det(I + A) Tr(A) 2 log det(I + A). Proof. Follows from eigendecomposition for A and numeric inequality log(1+x) x 2 log(1+ x) for all x [0, 1]. The next result generalized Lemma D.2 by Jin et al. (2020); see also (Abbasi-Yadkori et al., 2011). Also, it could be treated as a generalization of Lemma 29 for linear MDPs. Lemma 31. Let At 2 1 for all t 1. Then for any T 1 log det(ΛT ) t=1 E X T t [Λt 1] 1Xt|Ft 1 2 log det(ΛT ) Proof. First, we notice that Λt 1 is Ft 1-measurable, thus E X T t [Λt 1] 1Xt|Ft 1 = E Tr [Λt 1] 1Xt X T t |Ft 1 = Tr [Λt 1] 1E[Xt X T t |Ft 1] = Tr [Λt 1] 1At = Tr [Λt 1] 1/2At[Λt 1] 1/2 . Notice that Σt = [Λt 1] 1/2At[Λt 1] 1/2 is positive semi-definite matrix. Then by Lemma 30 log det(Id + Σt) E X T t [Λt 1] 1Xt|Ft 1 2 log det(Id + Σt). At the same time, we have det(Λt) = det(Λt 1 + At) = det(Λt 1) det Id + [Λt 1] 1/2At[Λt 1] 1/2 . Thus by telescoping property log det(ΛT ) t=1 E X T t [Λt 1] 1Xt|Ft 1 2 log det(ΛT ) Published as a conference paper at ICLR 2024 H.3 ON THE BERNSTEIN INEQUALITY We restate here a Bernstein-type inequality by Talebi & Maillard (2018). Lemma 32 (Corollary 11 by Talebi & Maillard, 2018). Let p, q S, where S denotes the probability simplex of dimension S. For all functions f : S 7 [0, b] defined on S, 2Varq(f) KL(p, q) + 2 3b KL(p, q) , 2Varq(f) KL(p, q) . where use the expectation operator defined as pf Es pf(s) and the variance operator defined as Varp(f) Es p f(s) Es pf(s ) 2 = p(f pf)2. Lemma 33. Let p, q S and a function f : S 7 [0, b], then Varq(f) 2Varp(f) + 4b2 KL(p, q) , Varp(f) 2Varq(f) + 4b2 KL(p, q). Lemma 34. For p, q S, for f, g : S 7 [0, b] two functions defined on S, we have that Varp(f) 2Varp(g) + 2bp|f g| and Varq(f) Varp(f) + 3b2 p q 1, where we denote the absolute operator by |f|(s) = |f(s)| for all s S. H.4 CHANGE OF POLICY Let π and π be two Markovian policies and let π(h ) for h [H] be a family of policies defined as follows π(h ) h (s) = πh(s) h = h , π h(s) h = h . Lemma 35. For any measurable function f : S R and any h h it holds Eπ[f(sh)|s1] = Eπ(h )[f(sh)|s1]. Moreover, for any measurable g: S A R and any h h Eπ[g(sh, ah)|(sh , ah ) = (s, a)] = Eπ(h )[g(sh, ah)|(sh , ah ) = (s, a)]. Proof. At first, we recall that by definition of a kernel ph we have for any measurable f : Eπ[f(sh+1)|sh, ah] = phf(sh, ah), Eπ[g(sh, ah)|sh] = πhg(sh). (21) We show the first statement by induction over h h . For h = 1 the statement is trivial. Next, we assume that it holds for h and we have to show for h + 1 h . By the tower property of conditional expectation and (21) Eπ[f(sh+1)|s1] = Eπ[Eπ[f(sh+1)|sh, ah]|s1] = Eπ[phf(sh, ah)|s1] = Eπ[Eπ[phf(sh, ah)|sh]|s1] = Eπ[πh[phf](sh)|s1]. We notice that since h < h then πh = π(h ) h . 
Then we can apply the induction hypothesis to the function $s \mapsto \pi^{(h')}_h[p_h f](s)$:
$$\mathbb{E}_\pi[f(s_{h+1})\,|\,s_1] = \mathbb{E}_\pi\big[\pi_h[p_h f](s_h)\,\big|\,s_1\big] = \mathbb{E}_\pi\big[\pi^{(h')}_h[p_h f](s_h)\,\big|\,s_1\big] = \mathbb{E}_{\pi^{(h')}}\big[\pi^{(h')}_h[p_h f](s_h)\,\big|\,s_1\big] = \mathbb{E}_{\pi^{(h')}}[f(s_{h+1})\,|\,s_1],$$
where the third equality is the induction hypothesis and the last one follows from the tower property under $\pi^{(h')}$ and (21). To show the second statement, we use induction over all $h \geq h'$. For $h = h'$ the statement is trivial. Next, we assume that it holds for $h \geq h'$ and we have to show it for $h+1$. Again, using the tower property,
$$\mathbb{E}_\pi[g(s_{h+1}, a_{h+1})\,|\,s_{h'}, a_{h'}] = \mathbb{E}_\pi\big[\mathbb{E}_\pi[g(s_{h+1}, a_{h+1})\,|\,s_{h+1}]\,\big|\,s_{h'}, a_{h'}\big] = \mathbb{E}_\pi[\pi_{h+1} g(s_{h+1})\,|\,s_{h'}, a_{h'}].$$
We notice that $\pi_{h+1} = \pi^{(h')}_{h+1}$ since $h \geq h'$. Thus, again applying the tower property, (21), and the induction hypothesis,
$$\mathbb{E}_\pi[g(s_{h+1}, a_{h+1})\,|\,s_{h'}, a_{h'}] = \mathbb{E}_\pi\big[\mathbb{E}_\pi[\pi^{(h')}_{h+1} g(s_{h+1})\,|\,s_h, a_h]\,\big|\,s_{h'}, a_{h'}\big] = \mathbb{E}_\pi\big[p_h[\pi^{(h')}_{h+1} g](s_h, a_h)\,\big|\,s_{h'}, a_{h'}\big] = \mathbb{E}_{\pi^{(h')}}\big[p_h[\pi^{(h')}_{h+1} g](s_h, a_h)\,\big|\,s_{h'}, a_{h'}\big] = \mathbb{E}_{\pi^{(h')}}[g(s_{h+1}, a_{h+1})\,|\,s_{h'}, a_{h'}].$$
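Lemma 35 can also be checked empirically: in a toy finite MDP, replacing the policy only at step h' does not change the law of the states reached up to step h'. The sketch below verifies the first statement by Monte Carlo in a synthetic MDP of our own construction (steps are 0-indexed, so states[k] stands for s_{k+1} and the swapped step h_prime = 3 corresponds to h' = 4).

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy finite MDP (synthetic, for illustration): S states, A actions, horizon H.
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(H, S, A))          # p_h(s' | s, a)
pi  = rng.dirichlet(np.ones(A), size=(H, S))            # Markov policy pi
pi2 = rng.dirichlet(np.ones(A), size=(H, S))            # another Markov policy pi'

def rollout(policy_at, s1=0):
    s, states = s1, [s1]
    for h in range(H):
        a = rng.choice(A, p=policy_at(h, s))
        s = rng.choice(S, p=P[h, s, a])
        states.append(s)
    return states

h_prime = 3                                              # the single step where the policies differ
swap = lambda h, s: pi2[h, s] if h == h_prime else pi[h, s]   # pi^{(h')}
orig = lambda h, s: pi[h, s]

# Check of Lemma 35: for h <= h', the law of s_h is the same under pi and pi^{(h')}.
h_check, f = 3, np.arange(S, dtype=float)                # test function f(s) = s, state s_{h_check+1}
est = [np.mean([f[rollout(p)[h_check]] for _ in range(20_000)]) for p in (orig, swap)]
print(est)                                               # the two estimates agree up to Monte Carlo error
```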