# scalable_online_exploration_via_coverability__27e72fd6.pdf

Scalable Online Exploration via Coverability

Philip Amortila * 1 Dylan J. Foster * 2 Akshay Krishnamurthy * 2

Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives policy optimization objectives that enable downstream maximization of any reward function as a conceptual framework to systematize the study of exploration. We introduce a new objective, L1-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata:

1. Intrinsic complexity control. L1-Coverage is associated with a structural parameter, L1-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing L1-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. L1-Coverage enables the first computationally efficient model-based and model-free algorithms for online (rewardfree or reward-driven) reinforcement learning in MDPs with low coverability.

Empirically, we find that L1-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.

1. Introduction

Many applications of reinforcement learning and control demand agents to maneuver through complex environments with high-dimensional state spaces that necessitate function approximation and sophisticated exploration. Toward addressing the high sample complexity of existing empirical

1University of Illinois, Urbana-Champaign. 2Microsoft Research. Full version appears at [ar Xiv:2403.06571].

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

paradigms (Mnih et al., 2015; Silver et al., 2016; Kober et al., 2013; Lillicrap et al., 2015; Li et al., 2016), a recent body of theoretical research provides structural conditions that facilitate sample-efficient exploration, as well as an understanding of fundamental limits (Russo & Van Roy, 2013; Jiang et al., 2017; Wang et al., 2020b; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021; 2023). Yet, computational efficiency remains a barrier: outside of simple settings (Azar et al., 2017; Jin et al., 2020b), our understanding of algorithmic primitives for efficient exploration is limited.

In this paper, we propose exploration objectives as a conceptual framework to develop efficient algorithms for exploration. Informally, an exploration objective is an optimization objective that incentivizes a policy (or policy ensemble) to explore the state space and gather data that can be used for downstream tasks (e.g., policy optimization or evaluation). To enable practical and efficient exploration, such an objective should satisfy three desiderata:

1. Intrinsic complexity control. Any policy (ensemble) that optimizes the objective should cover the state space to the best extent possible, enabling sample complexity guarantees for downstream learning that reflect the intrinsic statistical difficulty of the underlying MDP.

2. Efficient planning. When the MDP of interest is known, it should be possible to optimize the objective in a computationally efficient manner, ideally in a way that flexibly integrates with existing pipelines.

3. Efficient exploration. When the MDP is unknown, it should be possible to optimize the objective in a computationally and statistically efficient manner; the first two desiderata are necessary, but not sufficient here.

Our development of exploration objectives, particularly our emphasis on integrating with existing pipelines, is motivated by the large body of empirical research equipping policy gradient methods and value-based methods with exploration bonuses (Bellemare et al., 2016; Tang et al., 2017; Pathak et al., 2017; Martin et al., 2017; Burda et al., 2018; Ash et al., 2022). Although a number of prior theoretical works either implicitly or explicitly develop exploration objectives, they are either too general to admit efficient planning and exploration (Jiang et al., 2017; Dann et al., 2018; Jin et al.,

Scalable Online Exploration via Coverability

2021a; Foster et al., 2021; Chen et al., 2022b; Xie et al., 2023; Liu et al., 2023b), or too narrow to apply to practical problems of interest without losing fidelity to the theoretical foundations (e.g., Azar et al., 2017; Jin et al., 2018; Dani et al., 2008; Li et al., 2010; Wagenmaker & Jamieson, 2022). The difficulty is that designing exploration objectives is intimately tied to understanding what makes an MDP easy or hard to explore, a deep statistical problem. A useful objective must succinctly distill such understanding into a usable, operational form, but finding the right sweet spot between generality and tractability is challenging. We believe our approach and our results strike this balance.

Contributions. We introduce a new exploration objective, L1-Coverage, which flexibly supports computationally and statistically efficient exploration, satisfying desiderata (1), (2), and (3). L1-Coverage is associated with an intrinsic structural parameter, L1-Coverability (Xie et al., 2023; Amortila et al., 2024), which controls the sample complexity of reinforcement learning in nonlinear function approximation settings, subsuming Block and Low-Rank MDPs. We prove that data gathered with a policy ensemble optimizing L1-Coverage supports downstream policy optimization with general function approximation (Appendix D).

For planning in a known MDP, L1-Coverage can be optimized efficiently through reduction to (reward-driven) policy optimization, allowing for integration with offthe-shelf methods such as policy gradient (e.g., PPO) or Q-learning (e.g., DQN). For online reinforcement learning, L1-Coverage yields the first computationally and statistically efficient model-based and model-free algorithms for (reward-free/driven) exploration in L1-coverable MDPs. Technically, these results can be viewed as a successful algorithmic application of the Decision-Estimation Coefficient (DEC) framework of Foster et al. (2021; 2023).

We complement these theoretical results with an empirical validation, where we find that the L1-Coverage objective effectively integrates with off-the-shelf policy optimization algorithms, augmenting them with the ability to explore the state space widely.

Paper organization. In what follows, we formalize exploration objectives (Section 2), then introduce the L1Coverage objective (Section 3) and give algorithms for efficient planning (Section 4) and model-based exploration (Section 5), along with experiments (Section 6).

2. Online Reinforcement Learning and Exploration Objectives

This paper focuses on reward-free reinforcement learning (Jin et al., 2020a; Wang et al., 2020a; Chen et al., 2022b), in which the aim is to compute an exploratory ensemble of

policies that enables optimization of any downstream reward function; we consider planning (computing exploratory policies in a known MDP) and online exploration (discovering exploratory policies in an unknown MDP).

We work in an episodic finite-horizon setting. With H denoting the horizon, a (reward-free) Markov decision process M = X, A, {P M h }H h=0 , consists of a state space X, an action space A, a transition distribution P M h : X A (X), with the convention that P M 0 ( | ) is the initial state distribution. An episode in the MDP M proceeds according to the following protocol. At the beginning of the episode, the learner selects a randomized, non-stationary policy π = (π1, . . . , πH), where πh : X (A); we let Πrns denote the set of all such policies, and Πns denote the subset of deterministic policies. The episode evolves through the following process, beginning from x1 P M 0 ( | ). For h = 1, . . . , H: ah πh(xh) and xh+1 P M h ( | xh, ah). We let PM,π denote the law under this process, and let E M,π

denote the corresponding expectation.

For planning, where the underlying MDP is known, we denote it by M. For online exploration, where the MDP is unknown, we denote it by M ; in this framework, the learner must explore by interacting with M in a sequence of episodes in which they execute a policy π and observe the trajectory (x1, a1), . . . , (x H, a H) that results.

Additional notation. To simplify presentation, we assume that X and A are countable; our results extend to handle continuous variables with an appropriate measure-theoretic treatment. For an MDP M and policy π, we define the induced occupancy measure for layer h via d M,π h (x, a) = PM,π[xh = x, ah = a] and d M,π h (x) = PM,π[xh = x]. We use πunif to denote the uniform policy. We define π h π

as the policy that follows π for layers h < h and follows π for h h. For a set Z, we let (Z) denote the set of all probability distributions over Z. We write f = e O(g) to denote that f = O(g max{1, polylog(g)}).

2.1. Exploration Objectives

We introduce exploration objectives as a conceptual framework to study exploration in reinforcement learning. Exploration objectives are policy optimization objectives; they are defined over ensembles of policies (policy covers), represented as distributions p (Π) for a class Π Πrns of interest. The defining property of an exploration objective is to incentivize policy ensembles p (Π) to explore the state space and gather data that can be used for downstream policy optimization or evaluation (i.e., offline RL).

Definition 2.1 (Exploration objective). For a reward-free MDP M and policy class Π, a function ΦM : (Π) R+ is an exploration objective if for any policy ensemble p (Π), one can optimize any downstream reward

Scalable Online Exploration via Coverability

function R to precision ϵ in M (i.e., produce bπ such that maxπ J M R (π) J M R (bπ) ϵ)1 using poly(ΦM(p), ϵ 1) trajectories drawn from π p (under standard function approximation assumptions).

Note that we allow the exploration objective to depend on the underlying MDP M; if M is unknown, then evaluating ΦM(p) may be impossible without first exploring. We also consider per-layer objectives denoted as ΦM h , optimized by a collection {ph}H h=1 of policy ensembles. We deliberately leave the details for reward optimization vague in Definition 2.1, as this will depend on the policy optimization and function approximation techniques under consideration; see Appendix D for examples.

With Definition 2.1 in mind, the desiderata in Section 1 can be restated formally as follows.

1. Intrinsic complexity control. The optimal objective value Opt M := infp (Π) ΦM(p) is bounded for a large class of MDPs M of interest, ideally in a way that depends on intrinsic structural properties of the MDP.

2. Efficient planning. When the MDP M is known, we can solve arg minp (Π) ΦM(p) approximately (up to multiplicative or additive approximation factors) in a computationally efficient fashion, ideally in a way that reduces to standard computational primitives such as reward-driven policy optimization.

3. Efficient exploration. When the MDP M is unknown, we can approximately solve arg minp (Π) ΦM(p) in a sample-efficient fashion, with sample complexity polynomial in the optimal objective value Opt M, and with computational efficiency comparable to planning.

As a basic example, for tabular MDPs where |X| is small, the work of Jin et al. (2020a) can be viewed as optimizing the per-layer objective

Φ M h (p) = max x X 1 Eπ p[dπ h(x)] (1)

for each layer h [H]; the optimal value for this objective satisfies Opt M h = O(|X|/η), where η := minx X maxπ Πrns dπ h(x) is a reachability parameter. This objective supports efficient planning and exploration, satisfying desiderata (2) and (3), but is restrictive in terms of intrinsic complexity (desideratum (1)) because the optimal value, which scales with |X|, does not sharply control the sample complexity of reinforcement learning in the function approximation regime. Other (implicit or explicit) objectives studied in prior work similarly do not satisfy desideratum (1) (Hazan et al., 2019; Jin et al., 2020b; Agarwal et al.,

1J M R (π) denotes the expected reward of π in M under R.

2020; Modi et al., 2024; Uehara et al., 2022; Zhang et al., 2022; Mhammedi et al., 2023b), or are too general to admit computationally efficient planning and exploration, failing to satisfy desiderata (2) and (3) (Jiang et al., 2017; Dann et al., 2018; Jin et al., 2021a; Foster et al., 2021; Chen et al., 2022b; Xie et al., 2023; Liu et al., 2023b). See Appendix A for further discussion.

Remark 2.1. Many existing algorithms for example, those that use count-based bonuses (Azar et al., 2017; Jin et al., 2018) or elliptic bonuses (Auer, 2002; Dani et al., 2008; Li et al., 2010; Jin et al., 2020b) implicitly allude to specific exploration objectives, but do not appear to explicitly optimize them. We hope that by separating algorithms from objectives, our definition can bring clarity and better guide algorithm design going forward.

3. The L1-Coverage Objective

This section introduces our main exploration objective, L1Coverage. Throughout the section, we work with a fixed (known) MDP M, and an arbitrary set of policies Π Πrns.

L1-Coverage objective. For a policy ensemble p (Πrns) and parameter ε [0, 1] we define the L1Coverage objective by

Ψ M h,ε(p) = sup π Π E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε d M,π h (xh, ah)

where we slightly overload notation for occupancy measures and write d M,p h (x, a) := Eπ p d M,π h (x, a) .2 This objective encourages the ensemble p (Π) to cover the state space at least as well as any individual policy π Π, but in an average-case sense (with respect to the state distribution induced by π itself) that discounts hard-to-reach states. Importantly, L1-Coverage only considers the relative probability of visiting states (that is, the ratio of occupancies), which is fundamentally different from tabular objectives such as Eq. (1) from prior work (Jin et al., 2020a; Hazan et al., 2019) and is essential to drive exploration in large state spaces. The approximation parameter ε > 0 allows one to discard regions of the state space that have low relative probability for all policies, removing the reachability assumption required by Eq. (1).

L1-Coverage is closely related to previous optimal designbased objectives in online RL, and to several standard notions of coverage in offline RL. Indeed, L1-Coverage can be viewed as a generalization of previously proposed optimal design-based objectives (Wagenmaker et al., 2022; Wagenmaker & Jamieson, 2022; Li et al., 2023; Mhammedi

2Likewise, EM,p and PM,p denote the expectation and law for the process where we sample policy π p and execute it in M.

Scalable Online Exploration via Coverability

et al., 2023a); see Appendices A and F for details. Regarding the latter, when ε = 0, ΨM h,ε(p) coincides with L1concentrability (Farahmand et al., 2010; Xie & Jiang, 2020; Zanette et al., 2021), and is equivalent to the χ2-divergence between dπ and dp up to a constant shift.

Before turning to algorithmic development, we first show that L1-Coverage is indeed a valid exploration objective in the sense of Definition 2.1, then show that it satisfies desideratum (1), providing meaningful control over the intrinsic complexity of exploration.

L1-Coverage enables downstream policy optimization. We prove that L1-Coverage enables downstream policy optimization through a change-of-measure lemma, which shows that it is possible to transfer the expected value for any function g of interest (e.g., Bellman error) under any policy π to the expected value under p. Proposition 3.1 (Change of measure for L1-Coverage). For any distribution p (Πrns), we have that for all functions g : X A [0, B], all π Π, and all ε > 0,3

E M,π[g(xh, ah)] (3)

ΨM h,ε(p) E M,p[g2(xh, ah)] + Ψ M h,ε(p) (εB).

Using this result, one can prove that, using data gathered from p, standard offline reinforcement learning algorithms such as Fitted Q-Iteration (FQI) succeed with sample complexity scaling with maxh ΨM h,ε(p). One can similarly analyze hybrid offline/online methods (online methods that require access to exploratory data) such as PSDP (Bagnell et al., 2003) and NPG (Agarwal et al., 2021); see Appendix D for details. Such methods will prove to be critical for the reward-free reinforcement learning guarantees we present in the sequel.

In light of Proposition 3.1, which shows that L1-Coverage satisfies Definition 2.1, we refer to any p (Πrns) that (approximately) optimizes the L1-Coverage objective as a policy cover going forward.4

L1-Coverability provides intrinsic complexity control. Of course, the guarantee in Proposition 3.1 is only useful if desideratum (1) is satisfied, i.e. there exist distributions p (Π) for which the L1-Coverage objective is bounded. To this end, we define the optimal value for the L1-Coverage objective, which we refer to as L1-Coverability, as

Cov M h,ε = inf p (Π) Ψ M h,ε(p), (4)

3This result is meaningful in the parameter regime where ΨM h,ε(p) < 1/ε. We refer to this regime as non-trivial, as ΨM h,ε(p) 1/ε holds vacuously for all p. 4This definition generalizes most notions of policy cover found in prior work (Du et al., 2019; Misra et al., 2020; Mhammedi et al., 2023b;a; Huang et al., 2023).

and define Cov M ε = maxh [H] Cov M h,ε.

We show that the L1-Coverability value Cov M h,ε can be interpreted as an intrinsic structural parameter for the MDP M, and is bounded for standard MDP classes of interest. To do so, we draw a connection a structural parameter introduced by Xie et al. (2023) as a means to bridge online and offline RL, which we refer to as L -Coverability:

C M ;h = inf µ (X A) sup π Π sup (x,a) X A

d M,π h (x, a) µ(x, a)

with CM = maxh [H] CM ;h.

L -Coverability measures the best possible (worst-case) density ratio that can be achieved if one optimally designs the data distribution µ with knowledge of the underlying MDP. The main differences between the value CM ;h and the L1-Coverability value in Eq. (4) are that (i) Eq. (5) considers worst-case (L -type) rather than average-case coverage, and (ii) Eq. (5) allows the distribution µ (X A) to be arbitrary, while Eq. (4) requires the distribution to be realized as a mixture of occupancies (in other words, Eq. (5) allows for non-admissible mixtures).5 Due to the latter difference, it is unclear at first glance whether one can relate the two objectives, since allowing for non-admissible mixtures could potentially make the objective (5) much smaller. Nonetheless, the following result, which uses a non-trivial application of the minimax theorem, shows that L1-Coverability is indeed bounded by CM .

Proposition 3.2. For all ε > 0, we have Cov M h,ε CM ;h.

Examples for which CM and consequently Cov M ε is bounded by small problem-dependent constants include Block MDPs (Xie et al., 2023) (CM |S||A|, where S is the number of latent states), linear/low-rank MDPs (Huang et al., 2023) (CM d|A|, where d is the feature dimension), and analytically sparse low-rank MDPs (Golowich et al., 2023) (CM k|A|, where k is the sparsity level). Importantly, these examples (particularly Block MDPs and Low-Rank MDPs) require nonlinear function approximation. See Appendix H.2 for details; see also Xie et al. (2023); Amortila et al. (2024).

We note that L1and L -Coverability are less general than structural parameters defined in terms of Bellman errors, such as Bellman rank (Jiang et al., 2017), Bellman-Eluder dimension (Jin et al., 2021a), and Bilinear rank (Du et al., 2021), which are known to not admit computationally efficient learning algorithms (Dann et al., 2018). In this sense, L1-Coverage strikes a balance between generality and tractability. We discuss this in more detail and discuss connections to other structural parameters in Appendix F.

5We adopt the notation Cov M for admissible variants of coverability and CM for non-admissible variants of coverability.

Scalable Online Exploration via Coverability

Algorithm 1 Approximate Policy Cover Computation via L -Coverability Relaxation

1: input: Layer h [H], precision ε [0, 1], distribution

µ (X A) w/ C CM ;h(µ), opt. tol. εopt > 0.

2: Set T = 1

3: for t = 1, 2, , T do 4: Compute πt Π such that

EM,πt " µ(xh, ah) P

i<t d M,πi h (xh, ah) + C µ(xh, ah)

sup π Π EM,π " µ(xh, ah) P

i<t d M,πi h (xh, ah) + C µ(xh, ah)

5: Return p = Unif(π1, . . . , πT).

4. Optimizing L1-Coverage: Efficient Planning

Directly optimizing the L1-Coverage objective (2) presents challenges because the objective is quadratic in the occupancy d M,π. To address this issue, this section provides two relaxations that is, relaxed objectives that upper bound L1-Coverage that are directly amenable to optimization (via reduction to standard reward-driven policy optimization), yet are still bounded for MDPs of interest. Section 4.1 presents a relaxation based on a connection to L - Coverability (Eq. (5)), and Section 4.2 presents a relaxation based on a connection to Pushforward Coverability. The first relaxation is tighter, but requires stronger knowledge of the underlying MDP. To motivate our results, recall that given a C-approximate minimizer p (Π) for which ΨM h,ε(p) C Cov M h,ε, the sample complexity in Proposition 3.1 degrades only by an O(C) factor.

4.1. The L -Coverability Relaxation

Our first relaxation of the L1-Coverage objective assumes access to a distribution µ (X A) for which the L -concentrability coefficient CM ;h(µ) :=

supπ Π sup(x,a) X A n d M,π h (x,a) µ(x,a) o (Chen & Jiang, 2019) is bounded. For such a distribution µ, abbreviating C CM ;h(µ), we define

Ψ M µ;h,ε(p) := sup π Π E M,π µ(xh, ah) d M,p h (xh, ah) + ε C µ(xh, ah)

and Cov M µ;h,ε := infp (Π) ΨM µ;h,ε(p). This objective upper bounds L1-Coverage, and any p (Πrns) that optimizes it has ΨM h,ε(p) 2CM ;h(µ).

Proposition 4.1. For a distribution µ with C CM ;h(µ),

it holds that for all p (Πrns),

Ψ M h,ε(p) 2C Ψ M µ;h,ε(p). (7)

Furthermore, Cov M µ;h,ε 1 for all ε > 0.

Notably, given access to a distribution µ (X A) that achieves the L -Coverability value CM ;h in Eq. (5), any distribution p (Πrns) that optimizes the relaxation in Eq. (6) achieves L1-Coverage value ΨM h,ε(p) 2CM ;h. However, the relaxation supports arbitrary distributions µ (X A), allowing one to trade off approximation value and computation. Indeed, in some cases, it may be simpler to compute a distribution µ (X A) that has suboptimal, yet reasonable concentrability. Because µ is not required to be admissible, such a distribution can be computed easily or in closed form for many MDP families of interest. For example, in tabular MDPs we can simply take µ = Unif(X A). See Appendix H.2 for more examples.

The algorithm. Algorithm 1 provides an iterative algorithm to compute a distribution p (Πrns) that optimizes the L -relaxation in Eq. (6). The algorithm proceeds in T steps. At each step t [T], given a sequence of policies π1, . . . , πt 1 computed so far, the algorithm computes a new policy πt by solving the policy optimization problem

π t = arg max π Π E M,π " µ(xh, ah) P

i<t d M,πi h (xh, ah) + C µ(xh, ah)

in Line 4 (up to tolerance εopt > 0). After all T rounds conclude, the algorithm returns the uniform mixture p = Unif(π1, . . . , πT) as a policy cover.

The optimization problem (8) aims to find a policy πt that explores regions of the state space not already covered by π1, . . . , πt 1. Critically, it is a standard reward-driven policy optimization problem with reward function

r t h(x, a) := µ(x, a) P

i<t d M,πi h (x, a) + C µ(x, a) .

In practice, one can solve Eq. (8) using standard policy optimization algorithms (we take this approach in Section 6). However, since the MDP M is known, we can also take advantage of the vast literature on algorithms for planning with a known model, as well as algorithms like Policy Search by Dynamic Programming (Bagnell et al. (2003)) or Natural Policy Gradient (Agarwal et al. (2021)) that can use µ itself as a high-quality reset distribution.

The following theorem shows that Algorithm 1 converges to a policy cover p (Πrns) that optimizes the relaxation in Eq. (6) (up to a small log(ε 1) multiplicative approximation factor) in a small number of iterations.

Theorem 4.1. For any ε [0, 1] and h [H], given a distribution µ with C CM ;h(µ), whenever εopt

Scalable Online Exploration via Coverability

ε log(2ε 1), Algorithm 1 with T = ε 1 produces a distribution p (Π) with |supp(p)| ε 1 such that

Ψ M µ;h,ε(p) 3 log(2ε 1), (9)

and consequently ΨM h,ε(p) 6C log(2ε 1).

See Appendix H.2 for examples. Note that the algorithmic template in Algorithm 1 can be applied in unknown MDPs as long as Line 4 can be implemented sample-efficiently; this observation is the basis for the model-free online exploration algorithm given in Appendix E.

4.2. The Pushforward Coverability Relaxation

The main drawback behind the L -Coverability relaxation in the prequel is the assumption of access to a covering distribution µ (X A).6 The next relaxation, which is inspired by the notion of pushforward concentrability in offline reinforcement learning (Xie & Jiang, 2021; Foster et al., 2022), removes this assumption at the cost of giving a looser upper bound. This objective takes the form

Ψ M push;h,ε(p) = sup π Π E M,π " P M h 1(xh | xh 1, ah 1) d M,p h (xh) + ε P M h 1(xh | xh 1ah 1)

with Cov M push;h,ε := infp (Π) ΨM push;h,ε(p). This objective replaces the covering distribution µ with the transition distribution P M h 1 for M, making it more practical to compute. Unlike the L -Coverability relaxation (6), the optimal value Cov M push;h,ε may not be bounded by the parameter CM ;h. However, we show that the value can be controlled by a related pushforward coverability parameter given by

CM push;h = infµ (X) sup(x,a,x ) X A X n P M h 1(x |x,a)

with CM push = maxh [H] CM push;h.7

Proposition 4.2. Fix h [H]. For any p (Πrns), if we define p (Πrns) as the distribution induced by sampling π p and executing π h πunif, we have that for all ε > 0,

Ψ M h,ε(p ) |A| Ψ M push;h,ε(p). (11)

Furthermore, Cov M push;h,ε CM push;h for all ε > 0.

In particular, any distribution p (Πrns) that optimizes the relaxation in Eq. (10) achieves L1-Coverage value ΨM h,ε(p ) |A| CM push;h.8 Notable special cases where pushforward coverability is bounded include: (i) Tabular MDPs have CM push |X|, (ii) Block MDPs (Du et al., 2019; Misra et al., 2020; Zhang et al., 2022; Mhammedi et al., 2023b) with latent state space S have CM push |S|, (iii) Low-Rank MDPs (30) have CM push d (Xie & Jiang, 2021).

6Note that since we consider planning, this is a computational assumption, not a statistical assumption. 7This is inspired by pushforward concentrability (Xie & Jiang, 2021); CM ;h |A| CM push;h but the converse is not true. 8The dependence on |A| in this result is natural, as pushforward coverability only considers state occupancies.

The algorithm. An iterative algorithm to optimize the pushforward coverability relaxation is given in Algorithm 5 (deferred to Appendix H.1 for space). The algorithm follows the same template as Algorithm 1, but at each step t, the policy πt is computed by solving an objective based on Eq. (10). The main guarantee is as follows.

Theorem 4.2. For any ε [0, 1] and h [H], whenever εopt CM push;h ε log(2ε 1), Algorithm 5 produces a distribution p (Π) with |supp(p)| ε 1 such that

Ψ M push;h,ε(p) 5C M push;h log(2ε 1). (12)

Consequently, if we define p (Πrns) as the distribution induced by sampling π p and executing π h πunif, we have that ΨM h,ε(p ) 5|A|CM push;h log(2ε 1).

5. Efficient Online Exploration via L1-Coverage

In this section, we turn our attention to sample-efficient online exploration for the setting in which the underlying MDP M is unknown. Throughout the section, we work with an arbitrary user-specified subset of policies Π Πrns.

Model-based reinforcement learning setup. We focus on model-based reinforcement learning, and assume access to a model class M that contains the true MDP M .

Assumption 5.1 (Realizability). The learner has access to a class M containing the true model M .

The class M is user-specified, and can be parameterized by deep neural networks or any other flexible function class, with the best choice depending on the problem domain.

For M M, we use M(π) as shorthand for the law of the trajectory (x1, a1), . . . , (x H, a H) for policy π in M. We define the squared Hellinger distance for probability measures P and Q by D2 H(P, Q) = R

Estimation oracles. Our algorithms and regret bounds use the primitive of an estimation oracle, denoted by Alg Est, a user-specified algorithm for estimation that is used to estimate the underlying model M from data (Foster & Rakhlin, 2020; 2023) sequentially. At each episode t, given the data Ht 1 = (π1, o1), . . . , (πt 1, ot 1) collected so far, where ot := (xt 1, at 1), . . . , (xt H, at H), the estimation oracle constructs an estimate c M t = Alg Est {(πi, oi)}t 1 i=1 for the true MDP M . We assume that Alg Est is an offline estimation oracle, in the sense that each estimator c M t has good out-of-sample performance on the historical dataset Ht 1.

Assumption 5.2 (Offline estimation oracle). At each time t [T], an offline estimation oracle Alg Est for M, given Ht 1 = (π1, o1), . . . , (πt 1, ot 1) with oi M (πi) and πi pi, returns an estimator c M t M such that

Scalable Online Exploration via Coverability

Algorithm 2 Coverage-Driven Exploration (CODEX)

1: input: Estimation oracle Alg Est, number of episodes T N, approximation parameters C 1, ε [0, 1]. 2: for t = 1, 2, , T do 3: Estimate model: c M t = Alg t Est {(πi, oi)}t 1 i=1 .

// Plug-in approximation to L1-Coverage objective. 4: For each h [H], compute (C, ε)-approx. policy cover pt h for c M t: Ψ c Mt h,ε(pt h) =

sup π Π E c Mt,π " d c Mt,π h (xh, ah)

d c Mt,pt h h (xh, ah) + ε d c Mt,π h (xh, ah)

5: Sample πt qt = Unif(pt 1, . . . , pt H) and observe trajectory ot = (xt 1, at 1), . . . , (xt H, at H).

6: return (p1, . . . , p H), where ph := Unif(p1 h, . . . , p T h).

Estoff H (t) := X

i<t Eπi pi D2 H c M t(π i), M (π i) Estoff H (M, T, δ),

for all t [T] with probability at least 1 δ whenever M M, where Estoff H (M, T, δ) is a known upper bound.

As an example, the standard maximum likelihood estimator (MLE) satisfies Assumption 5.2 with Estoff H (M, T, δ) O(log(|M|T/δ)) (e.g., Foster & Rakhlin (2023)).

Algorithm. Our main algorithm for reward-free reinforcement learning, Coverage-Driven Exploration (CODEX; Algorithm 2), is based on a simple plug-in estimationoptimization paradigm: Repeatedly compute an estimate c M t for the true model M , then compute a policy cover p (Π) that optimizes the L1-Coverage objective for c M t (a plug-in approximation to the true L1-Coverage objective) and use this to collect data; proceed until this process arrives at a high-quality cover for M .

In more detail, Algorithm 2 proceeds in T episodes. At each episode t, the algorithm invokes the user-specified estimation oracle Alg Est to produce an estimate c M t for the model M based on the data collected so far. Given this estimate, for each layer h [H], the algorithm computes a (C, ε)- approximate policy cover pt h (Π) for c M t: Ψ c Mt h,ε(pt h) =

sup π Π E c Mt,π " d c Mt,π h (xh, ah)

d c Mt,pt h h (xh, ah) + ε d c Mt,π h (xh, ah)

where C > 0 is made sufficiently large to ensure Eq. (14) is feasible. This is a plug-in approximation to the true L1-Coverage objective ΨM h,ε(p). Given the approximate covers pt 1, . . . , pt H, the algorithm collects a new trajectory ot by sampling πt qt := Unif(pt 1, . . . , pt H). This

trajectory is used to update the estimator c M t, and the algorithm proceeds to the next episode. Once all episodes conclude, the algorithm returns ph := Unif(p1 h, . . . , p T h) as the final cover for each layer h.

The plug-in L1-Coverage objective in Eq. (14) can be solved efficiently using the relaxation-based methods in Section 4, since the model c M t M is known, making this a pure (nonstatistical) planning problem. We leave the approximation parameter C 1 as a free parameter to accommodate the approximation factors these relaxations incur (the sample complexity degrades linearly with C).

Main result. In its most general form, Algorithm 2 leads to sample complexity guarantees based on L1-Coverability. Due to space constraints, the main result we present here is a simpler guarantee based on L -Coverability. To state the guarantee in the most compact form possible, we make the following assumption on the estimation error rate. Assumption 5.3 (Parametric estimation rate). The offline estimation oracle satisfies Estoff H (M, T, δ) O(dest log(Best T/δ)) for parameters dest, Best N.

Theorem 5.1 (Main guarantee for CODEX). Let ε > 0 be given. Let C CM , and suppose that (i) we restrict M such that all M M have CM C , and (ii) we solve Eq. (13) with C = C for all t.9 Then, given an offline estimation oracle satisfying Assumptions 5.2 and 5.3, using

T = e O H8(CM )3dest log(Best/δ)

ε4 episodes, Algorithm 2

produces policy covers p1, . . . , p H (Π) such that

h [H] : Ψ M h,ε(ph) 12H C M (15)

w.p. at least 1 δ. For a finite class M, if we use MLE as

the estimator, we can take T = e O H8(CM )3 log(|M|/δ)

Theorem 5.1 shows for the first time that it is possible to perform sample-efficient and computationally efficient rewardfree exploration under coverability. We refer to Appendix I.4 for the proof, which is based on a connection to the Decision Estimation Coefficient of Foster et al. (2021; 2023). Some key features of the result are as follows.

Computational efficiency. As maximum likelihood estimation (MLE) is a valid estimation oracle, Algorithm 2 is computationally efficient whenever 1) MLE can be performed efficiently, and 2) the plug-in L1-Coverage objective in Eq. (13) can be approximately optimized efficiently. As the objective involves only the estimated model c M t (and hence is a computational problem), we can use the relaxations in Section 4 to do this efficiently (via off-the-shelf methods).

Regarding the latter point, note that while Theorem 5.1 assumes for simplicity that Eq. (13) is solved with C = CM ,

9We can take CM C w.l.o.g. when C is known. In this case, solving Eq. (13) with C = C is feasible by Proposition 3.2.

Scalable Online Exploration via Coverability

if we solve the objective for C > CM the result continues to hold with CM replaced by C in the sample complexity bound and approximation guarantee. Consequently, if we solve Eq. (13) using Algorithm 1, the guarantees in Theorem 5.1 continue to hold up to an e O(1) approximation factor.

Statistical efficiency. CODEX achieves sample complexity guarantees based on L -Coverability in a computationally eficient fashion for the first time. Compared to previous inefficient algorithms based on coverability (Xie et al., 2023; Liu et al., 2023a; Amortila et al., 2024) our theorem has somewhat looser sample complexity, and requires modelbased function approximation. See Appendix E for a modelfree counterpart based on weight function learning.

CODEX is not limited to L -Coverability. Appendix I.1 gives more general results (Theorem I.1, Corollary I.1) which achieve sample complexity poly(Cov M ε , H, log|M|, ε 1), showing that L1-Coverability is itself a sufficiently powerful structural parameter to enable sample-efficient learning with nonlinear function approximation.

Application to downstream policy optimization. By Proposition 3.1, the policy covers p1:H returned by Algorithm 2 can be used to optimize any downstream reward function using offline RL, giving end-to-end guarantees for reward-driven online RL. We sketch an example using MLE for policy optimization in Appendix I.2 (cf. Appendix D).

6. Experimental Evaluation

We present proof-of-concept experiments to validate our theoretical results.10 We focus on the planning problem (Section 4), and consider the classical Mountain Car environment (Brockman et al., 2016). We optimize the L1Coverage objective via Algorithm 1 and compare this to the Max Ent algorithm (Hazan et al., 2019) and uniform exploration. We find that L1-Coverage explores the state space more quickly and effectively than the baselines.

Experimental setup. Mountain Car is a continuous domain with two-dimensional states and actions. We consider a deterministic starting state near the bottom of the valley in the environment, so that deliberate exploration is required.

We optimize L1-Coverage using Algorithm 1, approximating the occupancies dπi via count-based estimates on a discretization of the state-action space. We set µ to the uniform distribution and C to the number of state-action pairs (in the discretization). For Max Ent, we use the implementation from Hazan et al. (2019). Algorithm 1 and Max Ent both require access to a subroutine for reward-driven planning (to solve the problem in Eq. (8) or an analogous subproblem). For both, we use a policy gradient method (REINFORCE; Sutton et al. (1999)) as the reward-driven planner, following

10Code available at github.com/philip-amortila/l1-coverability.

Hazan et al. (2019); our policy class is parameterized as a two layer neural network. For details on the environment, architectures, and hyperparameters, see Appendix B.

Results. Results are visualized in Figure 1. We measure the number of unique discretized states discovered by each algorithm s policy covers, and find that L1-Coverage (Algorithm 1) outperforms the Max Ent and uniform exploration baselines by a factor of two. We also perform a qualitative comparison by visualizing the occupancies induced by the learned policy covers, and find that the cover obtained with L1-Coverage explores a much larger portion of the state space than the baselines; notably, L1-Coverage explores both hills in the Mountain Car environment, while the baselines fail to do so. We find these results promising, and plan to perform a large-scale evaluation in future work; see Appendix B for further experimental results.

0 5 10 15 Epochs

Number of states

Unique States Visited

Occupancy Heatmap

L1-Cov Max Ent Uniform

Figure 1. Unique discrete states visited (mean/standard error over 10 runs) and occupancy heatmaps for each policy cover obtained by L1-Coverage (Algorithm 1), Max Ent, and uniform actions. Each epoch comprises one policy cover update. Heatmap legend: velocity (x-axis), position (y-axis), start state ( ), goal state ( ).

7. Discussion and Open Problems

Our results show that the L1-Coverage objective serves as a scalable primitive for exploration in reinforcement learning, providing sample efficiency, computational efficiency, and flexibility. On the theoretical side, our results raise a number of interesting questions for future work: (i) What are the weakest representation conditions under which coverability leads to computationally efficient online reinforcement learning guarantees? (ii) Can we generalize the L1-Coverage objective further while still allowing for practical/computationally efficient optimization?

Additional results. Several additional results are deferred to the appendix due to space constraints. Appendix E builds on Sections 4 and 5 to provide efficient algorithms for reward-free exploration under a weaker form of model-free function approximation based on density ratio realizability, and Appendix F provides connections between L1-Coverability and other intrinsic structural properties.

Scalable Online Exploration via Coverability

Impact Statement

This paper presents work whose goal is to advance the theoretical foundations of reinforcement learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.

Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. FLAMBE: Structural complexity and representation learning of low rank MDPs. Advances in Neural Information Processing Systems, 2020.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 2021.

Amortila, P., Foster, D. J., Jiang, N., Sekhari, A., and Xie, T. Harnessing density ratios for online reinforcement learning. International Conference on Learning Representations, 2024.

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.

Ash, J. T., Zhang, C., Goel, S., Krishnamurthy, A., and Kakade, S. Anti-concentrated confidence bonuses for scalable exploration. In International Conference on Learning Representations, 2022.

Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 2002.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, 2017.

Bagnell, J., Kakade, S. M., Schneider, J., and Ng, A. Policy search by dynamic programming. Advances in Neural Information Processing Systems, 2003.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, 2016.

Block, A. and Polyanskiy, Y. The sample complexity of approximate rejection sampling with applications to

smoothed online learning. In Conference on Learning Theory, 2023.

Block, A., Dagan, Y., Golowich, N., and Rakhlin, A. Smoothed online learning is as easy as statistical learning. In Conference on Learning Theory, 2022.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Open AI gym. ar Xiv:1606.01540, 2016.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. In International Conference on Learning Representations, 2018.

Chen, F., Mei, S., and Bai, Y. Unified algorithms for RL with decision-estimation coefficients: No-regret, PAC, and reward-free learning. ar Xiv:2209.11745, 2022a.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 2019.

Chen, J. and Jiang, N. Offline reinforcement learning under value and density-ratio realizability: The power of gaps. In Uncertainty in Artificial Intelligence, 2022.

Chen, J., Modi, A., Krishnamurthy, A., Jiang, N., and Agarwal, A. On the statistical efficiency of reward-free exploration in non-linear rl. Advances in Neural Information Processing Systems, 2022b.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, 2008.

Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, 2018.

Daskalakis, C., Golowich, N., Haghtalab, N., and Shetty, A. Smooth nash equilibria: Algorithms and complexity. In Innovations in Theoretical Computer Science, 2023.

Du, S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudik, M., and Langford, J. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, 2019.

Du, S. S., Kakade, S. M., Lee, J. D., Lovett, S., Mahajan, G., Sun, W., and Wang, R. Bilinear classes: A structural framework for provable generalization in RL. International Conference on Machine Learning, 2021.

Farahmand, A.-m., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. Advances in Neural Information Processing Systems, 2010.

Scalable Online Exploration via Coverability

Foster, D. J. and Krishnamurthy, A. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Neural Information Processing Systems, 2021.

Foster, D. J. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. International Conference on Machine Learning, 2020.

Foster, D. J. and Rakhlin, A. Foundations of reinforcement learning and interactive decision making. ar Xiv:2312.16730, 2023.

Foster, D. J., Kakade, S. M., Qian, J., and Rakhlin, A. The statistical complexity of interactive decision making. ar Xiv:2112.13487, 2021.

Foster, D. J., Krishnamurthy, A., Simchi-Levi, D., and Xu, Y. Offline reinforcement learning: Fundamental barriers for value function approximation. In Conference on Learning Theory, 2022.

Foster, D. J., Golowich, N., and Han, Y. Tight guarantees for interactive decision making with the decision-estimation coefficient. Conference on Learning Theory, 2023.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.

Golowich, N., Rohatgi, D., and Moitra, A. Exploring and learning in sparse linear mdps without computationally intractable oracles. ar Xiv:2309.09457, 2023.

Haghtalab, N., Roughgarden, T., and Shetty, A. Smoothed analysis of online and differentially private learning. Advances in Neural Information Processing Systems, 2020.

Haghtalab, N., Han, Y., Shetty, A., and Yang, K. Oracleefficient online learning for smoothed adversaries. Advances in Neural Information Processing Systems, 2022a.

Haghtalab, N., Roughgarden, T., and Shetty, A. Smoothed analysis with adaptive adversaries. In Symposium on Foundations of Computer Science (FOCS), 2022b.

Hazan, E., Kakade, S., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, 2019.

Huang, A., Chen, J., and Jiang, N. Reinforcement learning in low-rank MDPs with density features. International Conference on Machine Learning, 2023.

Jiang, N. and Huang, J. Minimax value interval for offpolicy evaluation and policy optimization. Neural Information Processing Systems, 2020.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 2018.

Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, 2020a.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, 2020b.

Jin, C., Liu, Q., and Miryoosefi, S. Bellman eluder dimension: New rich classes of RL problems, and sampleefficient algorithms. Neural Information Processing Systems, 2021a.

Jin, Y., Yang, Z., and Wang, Z. Is pessimism provably efficient for offline RL? In International Conference on Machine Learning, 2021b.

Kakade, S. M. On the sample complexity of reinforcement learning. University College London, 2003.

Katdare, P., Jiang, N., and Driggs-Campbell, K. Marginalized importance sampling for off-environment policy evaluation. In Conference on Robot Learning, 2023.

Kiefer, J. and Wolfowitz, J. The equivalence of two extremum problems. Canadian Journal of Mathematics, 1960.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.

Li, G., Zhan, W., Lee, J. D., Chi, Y., and Chen, Y. Rewardagnostic fine-tuning: Provable statistical benefits of hybrid reinforcement learning. In Advances in Neural Information Processing Systems, 2023.

Li, J., Monroe, W., Ritter, A., Jurafsky, D., Galley, M., and Gao, J. Deep reinforcement learning for dialogue generation. In Conference on Empirical Methods in Natural Language Processing, 2016.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, 2010.

Scalable Online Exploration via Coverability

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2015.

Liu, F., Viano, L., and Cevher, V. Provable benefits of general coverage conditions in efficient online RL with function approximation. International Conference on Machine Learning, 2023a.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 2018.

Liu, Z., Lu, M., Xiong, W., Zhong, H., Hu, H., Zhang, S., Zheng, S., Yang, Z., and Wang, Z. Maximize to explore: One objective function fusing estimation, planning, and exploration. In Advances in Neural Information Processing Systems, 2023b.

Martin, J., Sasikumar, S. N., Everitt, T., and Hutter, M. Count-based exploration in feature space for reinforcement learning. In International Joint Conference on Artificial Intelligence, 2017.

Mhammedi, Z., Block, A., Foster, D. J., and Rakhlin, A. Efficient model-free exploration in low-rank MDPs. Advances in Neural Information Processing Systems, 2023a.

Mhammedi, Z., Foster, D. J., and Rakhlin, A. Representation learning with multi-step inverse kinematics: An efficient and optimal approach to rich-observation RL. International Conference on Machine Learning, 2023b.

Misra, D., Henaff, M., Krishnamurthy, A., and Langford, J. Kinematic state abstraction and provably efficient richobservation reinforcement learning. In International conference on machine learning, 2020.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 2015.

Modi, A., Chen, J., Krishnamurthy, A., Jiang, N., and Agarwal, A. Model-free representation learning and exploration in low-rank mdps. Journal of Machine Learning Research, 2024.

Munos, R. Error bounds for approximate policy iteration. In International Conference on Machine Learning, 2003.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.

Ozdaglar, A. E., Pattathil, S., Zhang, J., and Zhang, K. Revisiting the linear-programming framework for offline RL with general function approximation. In International Conference on Machine Learning, 2023.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.

Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 2021.

Rashidinejad, P., Zhu, H., Yang, K., Russell, S., and Jiao, J. Optimal conservative offline RL with general function approximation via augmented lagrangian. In International Conference on Learning Representations, 2023.

Rendle, S., Freudenthaler, C., and Schmidt-Thieme, L. Factorizing personalized Markov chains for next-basket recommendation. In International Conference on World Wide Web, 2010.

Russo, D. and Van Roy, B. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, 2013.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Sion, M. On general minimax theorems. Pacific Journal of Mathematics, 1958.

Song, Y., Zhou, Y., Sekhari, A., Bagnell, D., Krishnamurthy, A., and Sun, W. Hybrid RL: Using both offline and online data can make RL efficient. In International Conference on Learning Representations, 2022.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, 2019.

Scalable Online Exploration via Coverability

Sutton, R. S., Mc Allester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 1999.

Tang, H., Houthooft, R., Foote, D., Stooke, A., Xi Chen, O., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. # exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, 2017.

Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, 2020.

Uehara, M., Imaizumi, M., Jiang, N., Kallus, N., Sun, W., and Xie, T. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and firstorder efficiency. ar Xiv:2102.02981, 2021.

Uehara, M., Zhang, X., and Sun, W. Representation learning for online and offline RL in low-rank mdps. In International Conference on Learning Representations, 2022.

Wagenmaker, A. and Jamieson, K. G. Instance-dependent near-optimal policy identification in linear mdps via online experiment design. Advances in Neural Information Processing Systems, 2022.

Wagenmaker, A. and Pacchiano, A. Leveraging offline data in online reinforcement learning. In International Conference on Machine Learning, 2023.

Wagenmaker, A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent PAC reinforcement learning. In Conference on Learning Theory, 2022.

Wang, R., Du, S. S., Yang, L., and Salakhutdinov, R. R. On reward-free reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 2020a.

Wang, R., Salakhutdinov, R. R., and Yang, L. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 2020b.

Xie, T. and Jiang, N. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, 2020.

Xie, T. and Jiang, N. Batch value-function approximation with only realizability. In International Conference on Machine Learning, 2021.

Xie, T., Foster, D. J., Bai, Y., Jiang, N., and Kakade, S. M. The role of coverage in online reinforcement learning. International Conference on Learning Representations, 2023.

Yang, M., Nachum, O., Dai, B., Li, L., and Schuurmans, D. Off-policy evaluation via the regularized lagrangian. Advances in Neural Information Processing Systems, 2020.

Yao, H., Szepesvári, C., Pires, B. Á., and Zhang, X. Pseudo MDPs and factored linear action models. In Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2014.

Zanette, A., Wainwright, M. J., and Brunskill, E. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in Neural Information Processing Systems, 2021.

Zhan, W., Huang, B., Huang, A., Jiang, N., and Lee, J. Offline reinforcement learning with realizability and singlepolicy concentrability. In Conference on Learning Theory, 2022.

Zhang, X., Song, Y., Uehara, M., Wang, M., Agarwal, A., and Sun, W. Efficient reinforcement learning in block MDPs: A model-free representation learning approach. In International Conference on Machine Learning, 2022.

Contents of Appendix

A Additional Related Work 14

B Experimental Evaluation: Details and Additional Results 15 B.1 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 B.3 Additional Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

C Technical Tools 20 C.1 Minimax Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 C.2 Concentration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 C.3 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 C.4 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 C.5 Helper Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

I Additional Results and Discussion 22

D L1-Coverage: Application to Downstream Policy Optimization 22 D.1 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

E Efficient Model-Free Exploration via L1-Coverage 25 E.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 E.2 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

F L1-Coverage: Structural Properties 29 F.1 Connection to Non-Admissible L1-Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 F.2 Connection to Feature Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 F.3 Insufficiency of Alternative Notions of Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

II Proofs 31

G Proofs from Section 3 31

H Proofs and Additional Details from Section 4 32 H.1 Omitted Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 H.2 Examples for Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 H.3 Proofs from Section 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 H.4 Proofs from Section 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

I Proofs and Additional Details from Section 5 36 I.1 General Guarantees for Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 I.2 Applying Algorithm 2 to Downstream Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 38 I.3 Technical Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 I.4 Proofs from Section 5 and Appendix I.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

J Proofs from Appendix F 47

K Proofs from Appendix E 49 K.1 General Guarantees for Algorithm 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 K.2 Technical Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 K.3 Weight Function Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 K.4 Policy Optimization Subroutines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 K.5 Proof of Theorem K.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Scalable Online Exploration via Coverability

A. Additional Related Work

This section discusses additional related work not already covered in detail.

Theoretical objectives for exploration. On the computational side, our work can be viewed as building on a recent line of research that tries to make exploration computationally efficient by inductively building policy covers and using them to guide exploration. Up to this point, Block MDPs (Du et al., 2019; Misra et al., 2020; Zhang et al., 2022; Mhammedi et al., 2023b) and Low-Rank MDPs (Agarwal et al., 2020; Modi et al., 2024; Uehara et al., 2022; Zhang et al., 2022; Mhammedi et al., 2023b) are the most general classes that have been addressed by this line of research. Our work expands the scope of problems for which efficient exploration is possible beyond these classes to include the more general coverability framework. This is meaningful generalization, as it allows for nonlinear transition dynamics for the first time.

The L1-Coverage objective can be viewed as a generalization of optimal design, an exploration objective which has previously been studied in the context of tabular and Low-Rank MDPs (Wagenmaker et al., 2022; Wagenmaker & Jamieson, 2022; Li et al., 2023; Mhammedi et al., 2023a); Appendix F.2 makes this connection explicit. In particular, Li et al. (2023) consider an optimal design objective for tabular MDPs which is equivalent to Eq. (6) with µ = Unif(X). Outside of this paper, the only work we are aware of that considers optimal design-like objectives for nonlinear settings beyond Low-Rank MDPs is Foster et al. (2021), but their algorithms are not efficient in terms of offline estimation oracles.

Lastly, a theoretical exploration objective worth mentioning is maximum entropy exploration (Hazan et al., 2019; Jin et al., 2020a). Existing theoretical guarantees for this objective are limited to tabular MDPs, and we suspect the objective is not sufficiently powerful to allow for downstream policy optimization in more general classes of MDPs.

Coverability. Compared to previous guarantees based on coverability (Xie et al., 2023; Liu et al., 2023a; Amortila et al., 2024) our guarantees have somewhat looser sample complexity and require stronger function approximation. However, these works only consider reward-driven exploration and are computationally inefficient. A second contribution of our work is to show that L1-Coverability, which can be significantly weaker than L -Coverability, is sufficient for sample-efficient online reinforcement learning on its own.11

Instance-dependent algorithms and complexity. Wagenmaker et al. (2022); Wagenmaker & Jamieson (2022) provide instance-dependent regret bounds for tabular MDPs and linear MDPs that scale with problem-dependent quantities closely related to L1-Coverability. These results are tailored to the linear setting, and their bounds contain lower-order terms that scale explicitly with the dimension and/or number of states.

General-purpose complexity measures. A long line of research provides structural complexity measures that enable sample-efficient exploration in reinforcement learning (Russo & Van Roy, 2013; Jiang et al., 2017; Sun et al., 2019; Wang et al., 2020b; Du et al., 2021; Jin et al., 2021a; Foster et al., 2021; Xie et al., 2023; Foster et al., 2023). Coverability is incomparable to most prior complexity measures (Xie et al., 2023), but is subsumed by the Decision-Estimation Coefficient (Foster et al., 2021), as well as the (less general) Sequential Extrapolation Coefficient and related complexity measures (Liu et al., 2023b). The main contrast between these works and our own is that they do not provide computationally efficient algorithms in general.

A handful of works extend the approaches above to accommodate reward-free reinforcement learning, but are still computationally inefficient, and do not explicitly suggest exploration objectives (Chen et al., 2022b; Xie et al., 2023; Chen et al., 2022a).

Further related work. Coverability is closely related to a notion of smoothness used in a line of recent work on smoothed online learning and related problems (Haghtalab et al., 2020; 2022a;b; Block et al., 2022; Block & Polyanskiy, 2023; Daskalakis et al., 2023). We are not aware of explicit technical connections between our techniques and these works, but it would be interesting to explore this in more detail.

11Liu et al. (2023a) also provide guarantees based on L1-Coverability, but their regret bounds contain a lower-order term that can be as large as the number of states in general.

Scalable Online Exploration via Coverability

B. Experimental Evaluation: Details and Additional Results

B.1. Experimental Details

Max Ent baseline. We implement our algorithm by building on top of the codebase for Max Ent from Hazan et al. (2019). The Max Ent algorithm follows a similar template to ours, in that it iteratively defines a reward function based on past occupancies and optimizes it to find a new exploratory policy. In Max Ent, the reward function is defined as

rt = d DKL(Unif X)

where bdpt is the estimated occupancy for the policy cover pt at time t and Unif denotes the uniform distribution over X A. We use the same neural network architecture (described in detail below) to represent policies, the same algorithm for approximate planning, the same discretization scheme to approximate occupancies, and the exact same hyperparameters for discretization resolution, number of training steps, length of rollouts, learning rates, and so on. We found that our method worked out-of-the-box without any modifications to their architecture or hyperparameters.

Starting from their codebase, we obtain an implementation of Algorithm 1 simply by modifying the reward function given to the planner from the Max Ent reward function to the L1-Coverage reward function (Line 4 in Algorithm 1).

Environment. We evaluate on the Mountain Car Continuous-v0 environment (henceforth simply Mountain Car) from the Open AI Gym (Brockman et al., 2016). The state space of the environment is two-dimensional, with a position value (denoted by ξ) that is in the interval [ 1.2, 0.6] and a velocity value (denoted by ρ) that is in the interval [ 0.07, 0.07]. The dynamics are deterministic and defined by the physics of a sinusoidal valley, we refer the reader to the documentation for the precise equations.12 The goal state (the flag ) is at the top of the right hill, with a position ξ = +0.45. The bottom of the valley corresponds to the coordinate ξ = π/6 0.52. We modify the environment to have a deterministic starting state with a position of ξ = 0.5 and a velocity of ρ = 0, so that more deliberate exploration is required to find a high-quality cover. This means that occupancies and covers are evaluated when rolling out from this deterministic starting state. The action space is continuous in the interval [ 1, 1], with negative values corresponding to forces applied to the left and positive values corresponding to forces applied to the right. To simplify, we will consider that the environment only has 3 actions, namely we only allow actions in { 1, 0, 1}. Finally, we take a horizon of H = 200, meaning that we terminate rollouts after 200 steps (if the goal state has not been reached yet).

Implementation of Algorithm 1. We make a few changes to Algorithm 1. Firstly, while Mountain Car has a time horizon and is thus non-stationary, we approximate the dynamics as stationary. Thus, we replace d M,π h in Line 4 of Algorithm 1 by a stationary analogue d M,π stat, and only invoke Algorithm 1 once rather than at each layer h. Namely, we define

d M,π stat := 1

h d M,π h ,

where H = 200 is the horizon which we define for Mountain Car. We will henceforth simply write dπ := d M,π stat for compactness. Similarly, we choose a stationary distribution µ (to be described shortly) to pass as input to the algorithm. Secondly, rather than solving the policy optimization problem for the reward function

r(x, a) = µ(x, a) P

i<t d M,πi(x, a) + C µ(x, a)

as written Line 4, we instead apply some regularization and solve the policy optimization problem for the more general L1-Coverage-based reward function defined in Eq. (6), namely

r(x, a) = µ(x, a) P

i<t d M,πi(x, a) + εC µ(x, a), (16)

for a small parameter ε > 0. This has the effect of magnifying the reward difference between visited and unvisited states. We also (linearly) renormalize so that this reward function lies in the interval [0, 1]. We found that regularizing with ε helps our approximate planning subroutine (discussed shortly) solve the policy optimization and recover a better policy for the L1-Coverage-based reward function.

12https://www.gymlibrary.dev/environments/classic_control/mountain_car_continuous/

Scalable Online Exploration via Coverability

Approximating dπ and µ. To approximate dπ and choose µ, we define a discretized state space for Mountain Car. Namely, we discretize the position space [ 1.2, 0.6] in 12 evenly-spaced intervals and the velocity space [ 0.07, 0.07] in 11 evenly-spaced intervals. This gives a tabular discretization with 12 11 = 132 states. We then approximate dπ(x, a) via a simple count-based estimate on the discretized space. That is, letting the discretized space be denoted by B = {b1, . . . , b132} where each bi is a bin, and given trajectories (xn h, an h)n [N],h [H], we estimate

dπ disc(bi, a) := 1

h=1 dπ disc,h(bi, a), where dπ disc,h(bi, a) = 1

n=1 I{x n h bi, a n h = a}

and then for any x, a we assign dπ(x, a) = dπ disc(bix, a),

where ix is the index of the bin bi in which state x lies. For the distribution µ and the L coverability parameter C , we take µ to be uniform over the discretized state-action space, that is we take

µ(x, a) = 1 |B||A| = 1 396,

and C = |B||A| = 396.

Approximate planner. As an approximate planner, we take a simple implementation of the REINFORCE algorithm (Sutton et al., 1999) given in the Py Torch (Paszke et al., 2019) Git Hub repository13, which only differs from the classical REINFORCE in that it applies variance-smoothing (with some parameter σ) to the returns. When solving the policy optimization problem, we allow REINFORCE to use the original stochastic reset distribution for Mountain Car, that is the reset distribution which samples ξ [ 0.6, 0.4] uniformly and has ρ = 0.

Architecture, optimizer, hyperparameters. The policy class we use, and which REINFORCE optimizes over, is obtained by a set of fully-connected feedforward neural nets with Re LU activation, 1 hidden layer, and 128 hidden units. The input is 2-dimensional (corresponding to the 2-dimensional state space of Mountain Car) and the output is a 3-dimensional vector; we obtain a distribution over the action set { 1, 0, 1} by taking a softmax over the output. We use Xavier initialization (Glorot & Bengio, 2010).

For REINFORCE, we take a discount factor of 0.99, and a variance smoothing parameter of σ = 0.05. We train REINFORCE with horizons of length 400. We take πt, the policy which approximates Line 4 of Algorithm 1, to be the policy returned after 1000 REINFORCE updates, with one update after each rollout. The update in REINFORCE use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10 3.

We estimate all occupancies with N = 100 rollouts of length H = 200. We calculate the mixture occupancies dp, for p = Unif(π1, . . . , πt), by estimating the occupancy for each dπi separately and averaging via

dp(x, a) = 1

i=1 dπi(x, a).

We train for 20 epochs, corresponding to T = 20 in the loop of Line 3 of Algorithm 1. For the regularized reward of Eq. (16), we take ε = 10 4.

B.2. Additional Experimental Results

B.2.1. MOUNTAINCAR

In addition to the results of Section 6, in Figure 2 we report the entropy and L1-Coverability of the state distributions for the policy covers found by the three algorithms throughout training. Namely, for each cover p, we estimate its occupancy dp via the procedure defined above, and measure the entropy of our estimate for dp. In the second plot, we take the estimate of dp

13https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py

Scalable Online Exploration via Coverability

0 5 10 15 Epochs

Policy Cover Entropy

0 5 10 15 Epochs

L1-Coverability

Policy Cover L1-Coverability

L1-Cov Max Ent Uniform

Figure 2. Entropy and L1-Coverability measured on each policy cover obtained from L1-Coverage (Algorithm 1), Max Ent, and uniform exploration on the Mountain Car environment. We plot the mean and standard error across 10 runs. Each epoch corresponds to a single policy update in Algorithm 1 and Max Ent, obtained through 1000 steps of REINFORCE with rollouts of length 400.

and measure its objective value ΨM µ;h,ε(p), where ΨM µ;h,ε(p) is the L -Coverability relaxation of L1-Coverability which we recall is defined via

Ψ M µ;h,ε(p) = sup π Π E M,π µ(xh, ah) d M,p h (xh, ah) + ε C µ(xh, ah)

We measure this quantity with ε = 10 4, and in the same way that we approximate Line 4 in Algorithm 1, namely by calling REINFORCE with a stochastic starting distribution as an approximate planner, with the same parameters and architecture. We note that there are two sources of approximation error in our measurements of ΨM µ;h,ε(p), namely the estimation error of dp

(via a count-based estimate on the discretized space) and the optimization error of REINFORCE, and thus values we report should be taken as approximations of the true L1-Coverability values.14

Results. We find that L1-Coverage and Max Ent recover policy covers with similar entropy values, with the entropy of the Max Ent cover being slightly larger (despite the Max Ent cover visiting fewer states, as seen in our results in Section 6). Interestingly, the Max Ent baseline has higher entropy while visiting fewer states (as seen in our results in Section 6), indicating that entropy may not be the best proxy for exploration. For L1-Coverage, we find that Algorithm 1 attains the smallest L1-Coverage values, indicating a better cover.

B.2.2. PENDULUM

Environment. We evaluate on the Pendulum-v0 environment (henceforth simply Pendulum) from the Open AI Gym (Brockman et al., 2016). The state space of the environment is two-dimensional, with an angle value (denoted by θ) that is in the interval [ π, π] and a velocity value (denoted by ρ) that is in the interval [ 8, 8]. The dynamics are deterministic and defined by the physics of an inverted pendulum. We modify the starting distribution to be a deterministic starting state with a position of θ = π and a velocity of ρ = 0. This means that occupancies and covers are evaluated when rolling out from this deterministic starting state. The action space is continuous in the interval [ 2, 2], with negative values corresponding to torque applied to the left and positive values corresponding to torque applied to the right. To simplify, we only allow actions in { 1, 0, 1}, so that |A| = 3. Finally, we take a horizon of H = 200, meaning that we terminate rollouts after 200 steps.

Discretization and other hyperparameters. We apply all the same implementation details as in Appendix B.1. We take a discretization resolution for Pendulum of 8 8. We take a REINFORCE planner with a uniform starting distribution with θ [ π, π] and ρ [ 1, 1]. The architecture, optimizer, and hyperparameters are the same as for Mountain Car.

14For instance, notice the variability over epochs of the uniform random baseline, which has a constant ΨM µ;h,ε(p) value.

Scalable Online Exploration via Coverability

0 5 10 15 Epochs

Number of states

Unique States Visited

Occupancy Heatmap

L1-Cov Max Ent Uniform

Figure 3. Number of discrete states visited (mean and standard error over 10 runs) and occupancy heatmaps for each policy cover obtained from L1-Coverage (Algorithm 1), Max Ent, and uniform exploration in the Pendulum environment. Each epoch comprises a single policy update in Algorithm 1 and Max Ent, obtained through 1000 steps of REINFORCE with rollouts of length 400. Heatmap axes: torque (x-axis) and angle (y-axis). Start state indicated by .

0 5 10 15 Epochs

Policy Cover Entropy

0 5 10 15 Epochs

L1-Coverability

Policy Cover L1-Coverability

L1-Cov Max Ent Uniform

Figure 4. Entropy and L1-Coverability measured on each policy cover obtained from L1-Coverage (Algorithm 1), Max Ent, and uniform exploration on the Pendulum environment. We plot the mean and standard error across 10 runs. Each epoch corresponds to a single policy update in Algorithm 1 and Max Ent, obtained through 1000 steps of REINFORCE with rollouts of length 400.

Scalable Online Exploration via Coverability

Results. As with Mountain Car, we measure the number of unique states visited and the occupancy heatmaps (Figure 3) as well as the entropy and L1-Coverage values (Figure 4). Overall, we find that L1-Coverage and Max Ent obtain similar values and are able to explore every state in the (discretized) Pendulum environment, indicating that exploration is somewhat easier than for the Mountain Car environment.

B.3. Additional Discussion

While the Mountain Car and Pendulum environments are fairly simple, we note that our algorithm exhibits robustness in several different ways, indicating that it may scale favorably to more challenging domains. Firstly, we do not expect that REINFORCE is finding a near-optimal policy for the planning problem in Line 4, which indicates that our method is robust to optimization errors. We also note that the choice of µ used in our implementation of Algorithm 1, which is defined as uniform over the discretized space, is a heuristic and is not guaranteed to have good coverage with respect to the true continuous MDP. This indicates that our method is robust to the choice of µ and C . Lastly, our method worked out-of-the-box with the same hyperparameters used in the Max Ent implementation in Hazan et al. (2019), and was found to behave similarly with different hyperparameters, which indicates that our method is robust to the choice of hyperparameters.

Scalable Online Exploration via Coverability

C. Technical Tools

C.1. Minimax Theorem

Lemma C.1 (Sion s Minimax Theorem (Sion, 1958)). Let X and Y be convex sets in linear topological spaces, and assume X is compact. Let F : X Y R be such that (i) f(x, ) is concave and upper semicontinuous over Y for all x X and (ii) f( , y) is convex and lower semicontinuous over X for all y Y. Then

inf x X sup y Y f(x, y) = sup y Y inf x X f(x, y). (18)

C.2. Concentration

Lemma C.2. For any sequence of real-valued random variables (Xt)t T adapted to a filtration (Ft)t T , it holds that with probability at least 1 δ, for all T T,

t=1 log Et 1 e Xt

t=1 Xt + log(δ 1). (19)

Lemma C.3 (Freedman s inequality (e.g., Agarwal et al. (2014))). Let (Xt)t T be a real-valued martingale difference sequence adapted to a filtration (Ft)t T . If |Xt| R almost surely, then for any η (0, 1/R), with probability at least 1 δ, T X

t=1 Et 1 X2 t + log(δ 1)

Lemma C.4 (Corollary of Lemma C.3). Let (Xt)t T be a sequence of random variables adapted to a filtration (Ft)t T . If 0 Xt R almost surely, then with probability at least 1 δ,

t=1 Et 1[Xt] + 4R log(2δ 1),

t=1 Et 1[Xt] 2

t=1 Xt + 8R log(2δ 1).

C.3. Information Theory

Lemma C.5 (e.g., Foster et al. (2021)). For any pair of random variables (X, Y ),

EX PX D2 H PY |X, QY |X 4D2 H(PX,Y , QX,Y ).

Lemma C.6 (Foster et al. (2021)). Let P and Q be probability measures on (X, F). For all h : X R with 0 h(X) B almost surely under P and Q, we have

EP[h(X)] 3 EQ[h(X)] + 4B D2 H(P, Q). (20)

C.4. Reinforcement Learning

Proofs for the following lemmas can be found in Foster et al. (2021). Lemma C.7 (Global simulation lemma). Let M and M be MDPs with PH h=1 rh [0, 1] almost surely, and let π ΠRNS. Then we have f M(π) f M (π) DH(M(π), M (π)) 1

2D2 H(M(π), M (π)) η > 0. (21)

Lemma C.8 (Local simulation lemma). For any pair of MDPs M = (P M, RM) and M = (P M, RM) with the same initial state distribution and PH h=1 rh [0, 1],

M,π DH P M h (xh, ah), P

M h (xh, ah) + DH R M h (xh, ah), R

M h (xh, ah) . (22)

Scalable Online Exploration via Coverability

C.5. Helper Lemmas

Lemma C.9 (E.g., Xie et al. (2023)). Let d1, d2, . . . , d T be an arbitrary sequence of distributions over a set Z, and let µ (Z) be a distribution such that dt(z)/µ(z) C for all (z, t) Z [T]. Then for all z Z, we have

i<t di(z) + C µ(z) 2 log(2T).

Lemma C.10. For any distribution p (Πrns), π Πrns, µ RX A + , and ε, δ > 0, we have that

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε d M,π h (xh, ah)

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + δ µ(xh, ah)

ε E M,π µ(xh, ah) d M,p h (xh, ah) + δ µ(xh, ah)

Proof of Lemma C.10. The result follows by observing that we can bound

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε d M,π h (xh, ah)

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + δ µ(xh, ah)

(d M,π h (x, a))2 δ µ(x, a) ε d M,π h (x, a)

(d M,p h (x, a) + ε d M,π h (x, a))(d M,p h (x, a) + δ µ(x, a))

(d M,π h (x, a))2(δ µ(x, a)) (d M,p h (x, a) + ε d M,π h (x, a))(d M,p h (x, a) + δ µ(x, a))

d M,π h (x, a)µ(x, a) d M,p h (x, a) + δ µ(x, a).

Lemma C.11. For any π Πrns, d RX +, µ RX +, and ε, δ > 0, we have that

E M,π P M h 1(xh | xh 1, ah 1) d(xh) + ε P M h 1(xh | xh 1, ah 1)

E M,π P M h 1(xh | xh 1, ah 1)

d(xh) + δ µ(xh)

ε E M,π µ(xh) d(xh) + δ µ(xh)

Proof of Lemma C.11. The result follows using similar reasoning to Lemma C.10:

E M,π P M h 1(xh | xh 1, ah 1) d(xh) + ε P M h 1(xh | xh 1, ah 1)

E M,π P M h 1(xh | xh 1, ah 1)

d(xh) + δ µ(xh)

x X,a A,x X d M,π h 1(x, a)(P M h 1(x | x, a))2 δ µ(x ) ε P M h 1(x | x, a)

(d(x ) + ε P M, h 1(x | x, a))(d(x ) + δ µ(x ))

x X,a A,x X d M,π h 1(x, a) (P M h 1(x | x, a))2(δ µ(x )) (d(x ) + ε P M, h 1(x | x, a))(d(x ) + δ µ(x ))

x X,a A,x X d M,π h 1(x, a)P M h 1(x | x, a)µ(x )

d(x ) + δ µ(x )

ε E M,π µ(xh) d(xh) + δ µ(xh)

Scalable Online Exploration via Coverability

Part I Additional Results and Discussion

D. L1-Coverage: Application to Downstream Policy Optimization

In this section, we show how to use data gathered using policy covers with bounded L1-Coverage to perform offline policy optimization. Here, there is an underlying (reward-free) MDP M = X, A, {P M h }H h=0 and a given policy class Π, and we have access to policy covers p1, . . . , p H such that the L1-Coverage objective ΨM h,ε(ph) is small for all h. For an arbitrary

user-specified reward distribution R = {Rh}H h=1 with PH h=1 rh [0, 1] almost surely, we define

J M R (π) := E M ,π " H X

as the value under rh R(xh, ah). Our goal is to use trajectories drawn from p1, . . . , p H to compute a policy bπ such

max π Π J M R (π) J M R (bπ) ϵ

using as few samples as possible.

We present guarantees for two standard offline policy optimization methods: Maximum Likelihood Estimation (MLE) and Fitted Q-Iteration (FQI). While both of these algorithms are fully offline, combining them with the online algorithms for reward-free exploration in Section 5 leads to end-to-end algorithms for online reward-driven exploration. We expect that similar guarantees can be proven for other standard offline policy optimization methods (Munos, 2003; Antos et al., 2008; Chen & Jiang, 2019; Xie & Jiang, 2020; 2021; Jin et al., 2021b; Rashidinejad et al., 2021; Foster et al., 2022; Zhan et al., 2022), as well as offline policy evaluation methods (Liu et al., 2018; Uehara et al., 2020; Yang et al., 2020; Uehara et al., 2021) and hybrid offline/online methods (Bagnell et al., 2003; Agarwal et al., 2021; Song et al., 2022).

Maximum likelihood. Let M be a realizable model class for which M M. We define the Maximum Likelihood algorithm as follows:

For each h [H], gather n trajectories {oh,t}t [n], where oh,t = (x h,t 1 , a h,t 1 , r h,t 1 ), . . . , (x h,t H , a h,t H , r h,t H ) by executing πh,t ph in M with R as the reward distribution.

Set c M = arg max M M PH h=1 Pn t=1 log(M(oh,t | πh,t)), where M(o | π) denotes the likelihood of trajectory o under π in M.

Let bπ = arg maxπ Π J M R (π).

Proposition D.1. Assume that M M and let R = {Rh}H h=1 be a reward distribution with PH h=1 rh [0, 1] almost surely for all M M. For any n N, given policy covers p1, . . . , p H, the Maximum Likelihood algorithm ensures that with probability at least 1 δ,

J M R (π ) J M R (bπ) 8H

max h ΨM h,ε(ph) log(|M|/δ)

n + max h Ψ M h,ε(ph) ε

and uses H n trajectories total.

Fitted Q-iteration. Let a value function class Q = Q1, . . . , QH (X A) [0, 1] be given. For a value function Q, define the Bellman operator for M with reward distribution R by T M R,h Q (x, a) := E M [rh + maxa A Q(xh+1, a ) | xh = x, ah = a] under rh R(xh, ah). We make the Bellman completeness assumption that for all Q Qh+1, T M R,h Q Qh. We define the L1-Coverage objective with respect to the policy class Π = {πQ}Q Q, where πQ,h(x) := arg maxa A Qh(x, a).

The Fitted Q-Iteration algorithm is defined as follows:

Scalable Online Exploration via Coverability

For each h = H, . . . , 1:

Gather n trajectories {oh,t}t [n], where oh,t = (x h,t 1 , a h,t 1 , r h,t 1 ), . . . , (x h,t H , a h,t H , r h,t H ) by executing πh,t ph in M with R as the reward distribution. Set b Qh = arg min Q Qh Pn t=1 Q(x h,t h , a h,t h ) r h,t h maxa A b Qh+1(x h,t h+1, a ) 2.

Let bπh(x) := arg maxa A b Qh(x, a).

Proposition D.2. Let R = {Rh}H h=1 be a reward distribution with rh [0, 1] and PH h=1 rh [0, 1] almost surely, and assume that for all Q Qh+1, T M R,h Q Qh. For any n N, given policy covers p1, . . . , p H, the Fitted Q-Iteration algorithm ensures that with probability at least 1 δ,

J M R (π ) J M R (bπ) O(H)

max h ΨM h,ε(ph) log(|Q|H/δ)

n + max h Ψ M h,ε(ph) ε

and uses H n trajectories total.

D.1. Proofs

Proof of Proposition D.1. By the standard generalization bound for MLE (e.g., Foster & Krishnamurthy (2021); Foster & Rakhlin (2023)), we are guaranteed that with probability at least 1 δ,

h=1 E M ,ph D2 H P c M h (xh, ah), P M h (xh, ah)

h=1 Eπ ph h D2 H c M(π), M (π) i 2log(|M|/δ)

Let π := arg maxπ Π J M R (π). Using Lemma C.8, we have that

J M R (π ) J M R (bπ) J M R (π ) J c M R (π ) + J c M R (bπ) J M R (bπ) 2 max π Π

h=1 EM ,π DH P c M h (xh, ah), P M h (xh, ah) .

Since DH P c M h (xh, ah), P M h (xh, ah) [0,

2], we can use Proposition 3.1 to bound

max π Π E M ,π DH P c M h (xh, ah), P M h (xh, ah) 2 q

ΨM h,ε(ph) E M ,ph D2 H P c M h (xh, ah), P M h (xh, ah)

2Ψ M h,ε(ph)ε

2ΨM h,ε(ph) log(|M|/δ)

2 Ψ M h,ε(ph) ε.

Combining this with the preceding inequalities yields the result.

Proof of Proposition D.2. By a standard generalization bound for FQI (e.g., Xie & Jiang (2020); Xie et al. (2023)), under the Bellman completeness assumption, it holds that with probability at least 1 δ,

h=1 E M ,ph b Qh(xh, ah) T M R,h b Qh+1 (xh, ah) 2 O log(|Q|H/δ)

At the same time, using a finite-horizon adaptation of Xie & Jiang (2020, Lemma 4), we have that

J M R (π ) J M R (bπ) 2 max π Π

h=1 EM ,πh b Qh(xh, ah) T M R,h b Qh+1 (xh, ah) i .

Scalable Online Exploration via Coverability

Since b Qh(xh, ah) T M R,h b Qh+1 (xh, ah) [0, 1], we can use Proposition 3.1 to bound

max π Π E M ,πh b Qh(xh, ah) T M R,h b Qh+1 (xh, ah) i

ΨM h,ε(ph) E M ,phh b Qh(xh, ah) T M R,h b Qh+1 (xh, ah) 2i + Ψ M h,ε(ph) ε

ΨM h,ε(ph) log(|Q|H/δ)

n + Ψ M h,ε(ph) ε

Scalable Online Exploration via Coverability

E. Efficient Model-Free Exploration via L1-Coverage

Our algorithms in the previous section show that the L1-Coverage objective and L1-Coverability parameter enable sampleefficient online reinforcement learning, but one potential drawback is that they require model-based realizability, a strong form of function approximation that may not always be realistic. In this section, we give model-free algorithms to perform reward-free exploration and optimize the L1-Coverage objective that do not require model realizability, and instead require a weaker form of density ratio or weight function realizability, a modeling approach that has been widely used in offline reinforcement learning (Liu et al., 2018; Uehara et al., 2020; Yang et al., 2020; Uehara et al., 2021; Jiang & Huang, 2020; Xie & Jiang, 2020; Zhan et al., 2022; Chen & Jiang, 2022; Rashidinejad et al., 2023; Ozdaglar et al., 2023) and recently adapted to the online setting (Amortila et al., 2024). The main algorithm we present computes a policy cover that achieves a bound on the L1-Coverability objective that scales with the pushforward coverability parameter (Section 4.2), but the weaker modeling assumptions make it applicable in a broader range of settings.

Throughout this section, we take Π = Πns as the set of all non-stationary policies.

E.1. Algorithm

Our model-free algorithm, CODEX.W, is presented in Algorithm 3. The algorithm builds a collection of policy covers p1, . . . , p H (Πrns) layer-by-layer in an inductive fashion. For each layer h [H], given policy covers p1, . . . , ph 1 for the preceding layers, the algorithm computes ph by (approximately) implementing the iterative algorithm for policy cover construction given in Algorithm 5 (Section 4.2), in a data-driven fashion.15

In more detail, recall the pushforward coverability relaxation

Ψ M push;h,ε(p) = sup π Π E M ,π " P M h 1(xh | xh 1, ah 1)

d M ,p h (xh) + ε P M h 1(xh | xh 1ah 1)

for the L1-Coverage objective given in Section 4.2. For layer h, Algorithm 3 approximately minimizes this objective by computing a sequence of policies πh,1, . . . , πh,T, where each policy

π h,t arg max π Π E M ,π " P M h 1(xh | xh 1, ah 1) P

i<t dπh,i h (xh) + P M h 1(xh | xh 1, ah 1)

is computed in a data-driven, online fashion that makes use of the preceding policy covers p1, . . . , ph 1. The algorithm then computes the cover ph via ph = Unif(πh,1 h πunif, . . . , πh,T h πunif).

Our planning analysis in Section 4.2 shows that as long as the approximation error in Eq. (25) is small, ph will indeed be an approximate policy cover that minimizes ΨM push;h,ε(ph). To achieve this, Algorithm 3 makes use of two subroutines. The first subroutine, Estimate Weight Function (Algorithm 4), invoked in Line 6, uses function approximation to estimate a weight function bwt h such that

bw t h(xh | xh 1, ah 1) w t h(xh | xh 1, ah 1) := P M h 1(xh | xh 1, ah 1) P

i<t dπh,i h (xh) + P M h 1(xh | xh 1, ah 1). (26)

The second subroutine, Policy Optimization, is a hyperparameter to the algorithm, and approximately solves the policy optimization problem

π h,t arg max π Π E M ,π[ bw t h(xh | xh 1, ah 1)],

treating the estimated weight function bwt h as a reward. The Policy Optimization subroutine makes use of exploratory data collected using preceding policy covers p1, . . . , ph 1, and hence does not have to perform systematic exploration. Indeed, we show that any hybrid offline/online method (that is, any online method that requires access to an exploratory policy) that satisfies a certain local policy optimization guarantee is sufficient, with PSDP (Algorithm 6; (Bagnell et al., 2003)) and Natural Policy Gradient (Agarwal et al., 2021) being natural choices; for our analysis, we make use of PSDP.

15Note that while the results in Section 4 are presented for the setting in which the underlying MDP M is known to the learner (planning), the algorithmic template in Algorithm 1 can be applied even when M is unknown, as long as the policy optimization step in Line 4 can be implemented in a sample-efficient fashion. This observation is the basis for the algorithms we present in this section.

Scalable Online Exploration via Coverability

Algorithm 3 Coverage-Driven Exploration via Weight Function Estimation (CODEX.W)

1: parameters:

Weight function class W = W1:H.

Policy opt. subroutine Policy Optimizationh(r1:h, p1:h, ϵ, δ).// Optimizes reward rh using policy covers p1:h.

Approximation parameter ε (0, 1/2), failure probability δ (0, e 1).

2: Set T = 1

ε and p1 = πunif.

3: Set ϵw = c (CM push/|A|)1/2ε1/2, ϵopt = c ε2, and δw = δopt = δ/(2HT), for suff. small constants c, c > 0.

4: for h = 2, , H do

5: for t = 1, . . . , T do

6: bwt h Estimate Weight Functionh,t(ph 1, {πh,i}i<t; ϵw, δw, W). // Algorithm 4.

// Estimate for wt h(xh | xh 1, ah 1) = P M h 1(xh|xh 1,ah 1)

P i<t dπh,i h (xh)+P M h 1(xh|xh 1,ah 1) .

7: Define reward function rh,t via r h,t h 1(xh | xh 1, ah 1) = bwt h(xh | xh 1, ah 1) and r h,t h = 0 h = h 1.

// rh,t h 1 can be interpreted as a stochastic reward for layer h 1.

8: πh,t Policy Optimizationh 1(p1:h 1, rh,t; ϵopt, δopt).

// Approximately solve arg maxπ Π EM ,π " P M h 1(xh|xh 1,ah 1)

P i<t dπh,i h (xh)+P M h 1(xh|xh 1,ah 1)

9: Set ph = Unif(πh,1 h πunif, . . . , πh,T h πunif).

10: return Policy covers (p1, . . . , p H).

In what follows, we describe the Estimate Weight Function and Policy Optimization subroutines and the corresponding statistical assumptions in more detail.

Weight function estimation and realizability. To perform weight function estimation, we assume access to a weight function class W = W1:H, with Wh (X A X R+) that is capable of representing the weight function wt h in Eq. (26). While we can directly assume that the weight function in Eq. (26) is realized by W (cf. Assumption K.1), it turns out (cf. Proposition K.1) that the following weaker form of weight function realizability is sufficient.

Assumption E.1 (Weight function realizability). For all h 2 and all π Πns

wπ h(x | x, a) := P M h 1(x | x, a)

d M ,π h (x ) Wh.

Assumption E.1 is new to the best of our knowledge, and is naturally suited to the pushforward coverability objective. While this assumption involves the forward transition probability, it is weaker than model-based realizability because it only requires modeling the relative transition probability, as the following example shows.

Example E.1. For a Block MDP (Du et al., 2019; Misra et al., 2020; Zhang et al., 2022; Mhammedi et al., 2023b) with latent state space S and decoder class Φ (X S), we can satisfy Assumption E.1 with log|W| e O(|S|2|A| + log|Φ|),16

yet any realizable model class must have log|M| = Ω(|X|) |S| in general.

See Amortila et al. (2024) for further discussion around weight function realizability in online RL.

Our weight function estimation subroutine, Estimate Weight Function, is given in Algorithm 4. To motivate the algorithm, consider the following abstract setting. Let Z be a set. We receive samples D1 = {z1 µ, . . . , zn µ} Z and D2 = {z1 ν, . . . , zn ν} Z, where zt µ µt (Z) and zt ν νt (Z). The distributions µt and νt can be chosen in an adaptive fashion based on z1 µ, z1 ν, . . . , zt 1 µ , zt 1 ν . We define µ = 1

n Pn t=1 µt and ν = 1

n Pn t=1 νt, and our goal is to estimate the density ratio w (z) := µ(z)

ν(z). Given realizable weight function class W with w W, Nguyen et al. (2010) (see also

16Formally this requires a standard covering argument; we omit the details.

Scalable Online Exploration via Coverability

Algorithm 4 Weight Function Estimation (Estimate Weight Functionh,t(ph 1, {πi}i<t; ϵ, δ, W))

1: parameters:

Layer h 2, iteration t N.

Distribution ph 1 (Πrns) , policies π1, . . . , πt 1 Πns.

Error tolerance ϵ (0, 1), failure probability δ (0, 1).

Weight function class W = W1:H with Wh (X A X [0, 1]).

2: Let n = nweight(ϵ, δ) := 40 log(|W|δ 1)

3: Let q := 1

2ph 1 + 1 2(t 1) P i<t πi h 1 πunif if t 1 and q := ph 1 otherwise.

4: Let D1 = D2 = .

5: For each j [n], draw π q and sample (x j h 1, a j h 1, x j h) π. Add (x j h 1, a j h 1, x j h) to both D1 and D2.

6: for i < t do

7: Draw n samples (x j h 1, a j h 1, x j h)

j [n] independently by drawing π q and (x j h 1, a j h 1, x j h) π.

8: Draw n samples ( x j h 1, a j h 1, x j h)

j [n] by sampling ( x j h 1, a j h 1, x j h) πi.

9: Add (x j h 1, a j h 1, x j h)

j [n] to D1 and add (x j h 1, a j h 1, x j h)

j [n] to D2.

10: Set bw := arg maxw Wh b ED1[log(w(xh | xh 1, ah 1))] t b ED2[w(xh | xh 1, ah 1)]. // See Eq. (54).

11: return bw.

Katdare et al. (2023)) propose the estimator

bw := arg max w W b ED1[log(w)] b ED2[w], (27)

where b ED[ ] denotes the empirical expectation with respect to a dataset D. We show (Theorem K.2 in Appendix K) that this estimator ensures that

w (z) 2 O( w ) log(|W|δ 1)

with high probability. Algorithm 4 simply applies this technique to estimate the weight function in Eq. (26); to ensure realizability of Eq. (26) under Assumption E.1, we apply the method with an expanded weight function class (defined in Proposition K.1); see Appendix K.1 for details.

Remark E.1 (Comparison to contrastive learning (Misra et al., 2020)). We remark our weight function estimation subroutine (Algorithm 4) can be viewed as a form of contrastive learning, with the target (population) weight function in Eq. (26) bearing strong similarity to the target (Bayes-optimal) function from the regression problem used in the HOMER algorithm (cf. Line 11 of Algorithm 3 and Lemma 9 of Misra et al.). We also note that both CODEX.W and HOMER find policy covers inductively via policy optimization on exploratory reward functions, and thus we can view CODEX.W as a natural generalization of the HOMER algorithm to settings beyond the Block MDP. Additionally, our analysis improves dependencies present in HOMER s sample complexity (notably, the dependence on the minimal visitation probability).

Policy optimization subroutine. The policy optimization subroutine, Policy Optimization, is a hyperparameter to the algorithm, and can be any subroutine that approximately solves the policy optimization problem

π h,t arg max π Π E M ,π[ bw t h(xh | xh 1, ah 1)],

which treats the estimated weight function bwt h as a reward for layer h 1. Policy Optimization does not need to perform global policy optimization instead, we only require a certain form of local guarantee with respect to the approximate policy covers p1, . . . , ph 1; see Assumption K.2 for details.

For concreteness, we make use PSDP (Bagnell et al., 2003), described in Appendix K.4, as the Policy Optimization subroutine. PSDP optimizes a given reward function rh 1 by collecting exploratory data with p1, . . . , ph 1 and applying

Scalable Online Exploration via Coverability

approximate dynamic programming with a user-specified value function class Q. We make the following value-based realizability assumption, which asserts that Q is expressive enough to allow PSDP to optimize any weight function in the class W.

Assumption E.2. We have access to a value function class Q = Q1:H with Qh (X A [0, 1]) such that for all h [H], w Wh, and π Πns, we have Q M ,π ℓ ( , ; w) Qℓfor all ℓ h 1, where

Q M ,π ℓ (x, a; w) = E M ,π[wh(xh | xh 1, ah 1) | xℓ= x, aℓ= a]

is the Q-function for π under the (stochastic) reward function defined via rh 1 = w(xh | xh 1, ah 1) and rh = 0 for h = h 1.

E.2. Main Result

The main guarantee for Algorithm 3 applied with PSDP is as follows.

Theorem E.1 (Main result for Algorithm 3 with PSDP). Let ε (0, 1/2) and δ (0, e 1) be given, and suppose that we have a weight function class W and value function class Q such that Assumptions E.1 and E.2 are satisfied. Then with PSDP (Algorithm 6) as a policy optimization subroutine, and using an expanded weight function class defined in Proposition K.1, Algorithm 3 produces policy covers p1, . . . , p H (Πrns) such that with probability at least 1 δ, for all h [H],

Ψ M push;h,ε(ph) 170H log(ε 1) C M push,

and does so using at most

N e O H|A| log(|W|δ 1)

ε4 + H4|A| log(|Q|δ 1)

Theorem E.1 is a special case of a more general result (Theorem K.1) which allows for general policy optimization algorithms that satisfy a certain local optimality guarantee (Assumption K.2); we obtain the result by verifying that PSDP satisfies this condition. Let us discuss some key features.

Sample efficiency. The sample complexity in Theorem E.1 scales as poly(H, |A|, log|W|, log|Q|, ε 1), and hence is efficient for large state spaces; we expect that the precise polynomial factors can be tightened, as can the approximation ratio in the objective value, though we leave this for future work. As with our results in preceding sections, the resulting policy covers p1, . . . , p H can be directly used for downstream policy optimization.

Computational efficiency. With PSDP as the Policy Optimization subroutine, Algorithm 3 is computationally efficient whenever i) the weight function estimation objective in Line 10 of Algorithm 4 can be solved efficiently over W, and ii) square loss regression over the class Q can be solved efficiently. We consider these to be fairly mild assumptions.

Practicality. In practice, we expect that the subroutine Policy Optimization can be implemented using off-the-shelf deep RL methods (e.g., PPO or SAC), which are known to perform well given access to exploratory data. In this sense, Algorithm 3 can be viewed as a new approach to equipping existing deep RL methods with exploration, with the weight function-based rewards in Line 7 acting as exploration bonuses.

As a concrete example, we can instantiate Theorem E.1 for Block MDPs to derive end-to-end guarantees under standard assumptions.

Example E.2 (Sample complexity for Block MDP). For a Block MDP (Du et al., 2019; Misra et al., 2020; Zhang et al., 2022; Mhammedi et al., 2023b) with latent state space S and decoder class Φ (X S), we can satisfy Assumption E.1 with log|W| e O(|S|2|A| + log|Φ|) and log|Q| e O(|S||A| + log|Φ|), so Theorem E.1 gives sample complexity

N e O H4|S|2|A|2 log(|Φ|δ 1)

We view the results in this section as a proof of concept, showing that the planning methods derived in Section 4 for iteratively optimizing L1-Coverage can be implemented in a data-driven fashion. We leave sample-efficient counterparts for other relaxations for future work.

Scalable Online Exploration via Coverability

F. L1-Coverage: Structural Properties

In this section we draw connections between L1-Coverability, the structural parameter induced by L1-Coverage, and other structural parameters and objectives, focusing on (i) a non-admissible variant of L1-Coverability (Appendix F.1), and (ii) feature coverage (Appendix F.2). We also show that alternative exploration objectives do not induce meaningful structural parameters in the same fashion as L1-Coverage (Appendix F.3).

F.1. Connection to Non-Admissible L1-Coverage

Inspired by Eq. (5), we can also define a non-admissible counterpart to L1-Coverability as follows.

C M avg;h = inf µ (X A) sup π Π E M,π d M,π h (xh, ah) µ(xh, ah)

and CM avg = maxh [H] CM avg;h. This quantity was used to provide sample complexity bounds for online reinforcement learning by Liu et al. (2023a) (generalizing the results of Xie et al. (2023)), and the associated concentrability coefficient

CM avg;h(µ) = supπ Π E M,πh d M,π h (xh,ah) µ(xh,ah) i is widely used in offline reinforcement learning (Farahmand et al., 2010; Xie & Jiang, 2020). The following result, which uses the minimax theorem in a similar fashion to Proposition 3.2, shows that it is possible to bound L1-Coverability by this quantity in spite of non-admissibility, albeit with some loss in rate.

Proposition F.1. For all ε > 0, it holds that Cov M h,ε 1 + 2 q

Note that while the bound in Proposition F.1 grows as q

1 ε, it still leads to non-trivial sample complexity bounds through

our main results (which give meaningful guarantees whenever Cov M h,ε grows sublinearly with ε 1), though the resulting rates are worse than in the case where Cov M h,ε is bounded by an absolute constant.

F.2. Connection to Feature Coverage

Another well-studied notion of coverage from offline reinforcement learning is feature coverage (Jin et al., 2021b; Zanette et al., 2021; Wagenmaker & Pacchiano, 2023). Consider the Low-Rank MDP framework (Rendle et al., 2010; Yao et al., 2014; Agarwal et al., 2020; Modi et al., 2024; Zhang et al., 2022; Mhammedi et al., 2023a), in which the transition distribution is assumed to factorize as

P M h 1(xh | xh 1, ah 1) = ϕh 1(xh 1, ah 1), ψh(xh) , (30)

where ϕh 1(x, a), ψh(x) Rd are (potentially unknown) feature maps. For offline reinforcement learning in Low-Rank MDPs (Jin et al., 2021b; Zanette et al., 2021; Wagenmaker & Pacchiano, 2023), feature coverage for a distribution µ (X A) is given by CM ϕ;h(µ) = supπ Π E M,π[ϕh(xh, ah)] 2 Σ 1 µ . We define the associated feature coverability coefficient by

C M ϕ;h = inf µ (X A) sup π Π E M,π[ϕh(xh, ah)] 2 Σ 1 µ , (31)

where Σµ := E(x,a) µ ϕh(x, a)ϕh(x, a) , and define CM ϕ = maxh [H] CM ϕ;h. Note that one always has CM ϕ d, as a consequence of the existence of G-optimal designs (Kiefer & Wolfowitz, 1960; Huang et al., 2023). The following result shows that L1-Coverability is always bounded by feature coverability.

Proposition F.2. Suppose the MDP M obeys the low-rank structure in Eq. (30). Then for all h [H], we have

CM avg;h |A| CM ϕ;h 1, and consequently Cov M h,ε 1 + 2 q

|A| CM ϕ;h 1 ε .

We remark that if we restrict the distribution µ (X A) in Eq. (31) to be realized as a mixture of occupancies, Proposition F.2 can be strengthened to give Cov M h,ε O |A| CM ϕ;h 1 . Note that through our main results, Proposition F.2, gives guarantees that scale with |A|. This is necessary in the setting where the feature map ϕh is unknown, which is covered by our results, but is suboptimal in the setting where ϕ is known.

Scalable Online Exploration via Coverability

F.3. Insufficiency of Alternative Notions of Coverage

L1-Coverage measures coverage of the mixture policy p (Πrns) with respect to the L1(d M,π h )-norm. It is also reasonable to consider variants of the objective based on Lq-norms for q > 1, which provide stronger coverage, but may have larger optimal value. In particular, a natural L -type analogue of Eq. (2) is given by

Ψ M ;h,ε(p) = sup π Π sup (x,a) X A

d M,π h (x, a) d M,p h (x, a) + ε d M,π h (x, a)

is identical to the L -coverability coefficient CM (5) studied in Xie et al. (2023), except that we restrict the data distribution µ (X A) to be admissible, i.e. realized by a mixture policy p (Πrns) (we also incorporate the term ε d M,π h (x, a) in the denominator to ensure the ratio is well-defined). The following lemma shows, perhaps surprisingly, that in stark contrast to L1-Coverage, it is not possible to bound the optimal value of the admissible L -Coverage objective in Eq. (32) in terms of the non-admissible coverability coefficient CM .

Proposition F.3. There exists an MDP M and policy class Π Πrns with horizon H = 1 such that CM ;h 2 (and hence Cov M h,ε 2 as well), yet for all ε > 0,

inf p (Π) Ψ M ;h,ε(p) 1

and in particular infp (Π) ΨM ;h,0(p) = .

Note that one trivially has infp (Π) ΨM ;h,ε(p) 1

ε, and hence Eq. (33) shows that the optimal value is vacuously large for the MDP in this example. In contrast, we have Cov M h,ε 2 even when ε = 0. More generally, one can show that similar failure modes hold for admissible Lq-Coverage for any q > 1.

Scalable Online Exploration via Coverability

Part II Proofs

G. Proofs from Section 3

Proposition 3.1 (Change of measure for L1-Coverage). For any distribution p (Πrns), we have that for all functions g : X A [0, B], all π Π, and all ε > 0,17

E M,π[g(xh, ah)] (3)

ΨM h,ε(p) E M,p[g2(xh, ah)] + Ψ M h,ε(p) (εB).

Proof of Proposition 3.1. Let p (Πrns) and g : X A R be given. We first prove the following, more general inequality:

E M,π[g(xh, ah)] q

ΨM h,ε(p) (E M,p[g2(xh, ah)] + ε E M,π[g2(xh, ah)]). (34)

Using Cauchy-Schwarz, we have

E M,π[g(xh, ah)] = X

x X,a A d M,π h (x, a)g(x, a)

x X,a A d M,π h (x, a) (dp M,p h (x, a) + ε d M,π h (x, a))1/2

(d M,p h (x, a) + ε d M,π h (x, a))1/2 g(x, a)

(d M,π h (x, a))2

d M,p h (x, a) + ε d M,π h (x, a)

x X,a A (d M,p h (x, a) + ε d M,π h (x, a))g2(x, a)

ΨM h,ε(p) (E M,p[g2(xh, ah)] + ε E M,π[g2(xh, ah)]).

This establishes Eq. (34). To prove Eq. (3), we first bound q

ΨM h,ε(p) (E M,p[g2(xh, ah)] + ε E M,π[g2(xh, ah)]) q

ΨM h,ε(p) E M,p[g2(xh, ah)]

ΨM h,ε(p) ε E M,π[g2(xh, ah)].

Next, we note that if g [0, B], we can use AM-GM to bound

ΨM h,ε(p) ε E M,π[g2(xh, ah)] q

ΨM h,ε(p) (εB) E M,π[g(xh, ah)] ΨM h,ε(p) (εB)

2 E M,π[g(xh, ah)].

The result now follows by rearranging.

Proposition 3.2. For all ε > 0, we have Cov M h,ε CM ;h.

Proof of Proposition 3.2. Let δ > 0 be given. Using Lemma C.10 and the definition of CM ;h, there exists µ (X A) such that

Cov M h,ε 1 + δ

C M ;h inf p (Π) sup π Π E M,π " µ(xh, ah) d M,p h (xh, ah) + δ CM ;hµ(xh, ah)

C M ;h inf p (Π) sup q (Π) Eπ q E M,π " µ(xh, ah) d M,p h (xh, ah) + δ CM ;hµ(xh, ah)

17This result is meaningful in the parameter regime where ΨM h,ε(p) < 1/ε. We refer to this regime as non-trivial, as ΨM h,ε(p) 1/ε holds vacuously for all p.

Scalable Online Exploration via Coverability

Observe that the function

(p, q) 7 Eπ q E M,π " µ(xh, ah) d M,p h (xh, ah) + δ CM ;hµ(xh, ah)

is convex-concave. In addition, it is straightforward to see that the function is jointly Lipschitz with respect to total variation distance whenever ε, δ > 0. Hence, using the minimax theorem (Lemma C.1), we have that

inf p (Π) sup q (Π) Eπ q E M,π " µ(xh, ah) d M,p h (xh, ah) + δ CM ;hµ(xh, ah)

= sup q (Π) inf p (Π) Eπ q E M,π " µ(xh, ah) d M,p h (xh, ah) + δ CM ;hµ(xh, ah)

sup q (Π) Eπ q E M,π " µ(xh, ah) d M,q h (xh, ah) + δ CM ;hµ(xh, ah)

d M,q h (x, a)µ(x, a) d M,q h (x, a) + δ CM ;hµ(x, a) 1.

To conclude, we take δ 0.

H. Proofs and Additional Details from Section 4

H.1. Omitted Algorithms

Algorithm 5 Approximate Policy Cover Computation via Pushforward Coverability Relaxation

1: input: Layer h [H], precision parameter , ε [0, 1], optimization tolerance εopt > 0.

2: Set T = 1

3: for t = 1, 2, , T do 4: Compute πt Π such that

EM,πt " P M h 1(xh | xh 1, ah 1) P

i<t d M,πi h (xh) + P M h 1(xh | xh 1, ah 1)

sup π Π EM,π " P M h 1(xh | xh 1, ah 1) P

i<t d M,πi h (xh) + P M h 1(xh | xh 1, ah 1)

5: Return p = Unif(π1, . . . , πT).

H.2. Examples for Algorithm 1

As discussed in Section 4.1, the L -coverability relaxation (6) used by Algorithm 1 can be optimized efficiently whenever a (non-admissible) state-action distribution µ (X A) with low L -concentrability CM ;h can be computed efficiently for the MDP M. Examples of MDP classes the admit efficiently computable distributions with low concentrability include:

When M is a tabular MDP, the distribution µ(x, a) = 1 |X||A| (which clearly admits a closed form representation) achieves CM ;h(µ) |X||A|.

For a Block MDPs (Du et al., 2019; Misra et al., 2020; Zhang et al., 2022; Mhammedi et al., 2023b) with latent state space S, emission distribution q : S (X), and decoder ϕ : X S, the distribution µ(x, a) := q(x | ϕ (x)) 1 |S||A| achieves CM ;h(µ) |S||A| (Xie et al., 2023). Again, this distribution admits a closed form representation when M is explicitly specified.

For low-rank MDPs with the structure in Eq. (30), the distribution given by µ(x, a) = ψh(x) 2 R ψh(x ) 2dx 1 |A| achieves

CM ;h(µ) B|A| under the standard normalization assumption that R ψh(x ) 2dx B and ϕ(x, a) 2 1 for some

Scalable Online Exploration via Coverability

(typically dimension-dependent) constant B > 0 (Mhammedi et al., 2023a; Golowich et al., 2023). Alternatively we can compute a set of policies π1, . . . , πd that form a barycentric spanner for the set {E M,π[ϕ(xh, ah)]}π Π and choose

µ(x, a) = 1 d|A| Pd i=1 d M,πi

h (x), which achieves CM ;h(µ) d|A| (Huang et al., 2023).

These examples highlight that for many settings of interest, computing a covering distribution µ (X A) when the model is known is significantly simpler than computing an explicit policy cover p (Πrns), showcasing the utility of Algorithm 1.

H.3. Proofs from Section 4.1

Proposition 4.1. For a distribution µ with C CM ;h(µ), it holds that for all p (Πrns),

Ψ M h,ε(p) 2C Ψ M µ;h,ε(p). (7)

Furthermore, Cov M µ;h,ε 1 for all ε > 0.

Proof of Proposition 4.1. Fix µ and abbreviate C CM ;h(µ). Observe that for any π Π and p (Πrns), Lemma C.10 implies that we can bound

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε d M,π h (xh, ah)

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε C µ(xh, ah)

+ C E M,π µ(xh, ah) d M,p h (xh, ah) + ε C µ(xh, ah)

2C E M,π µ(xh, ah) d M,p h (xh, ah) + ε C µ(xh, ah)

For the claim that Cov M µ;h,ε 1, see the proof of Proposition 3.2.

Theorem 4.1. For any ε [0, 1] and h [H], given a distribution µ with C CM ;h(µ), whenever εopt ε log(2ε 1), Algorithm 1 with T = ε 1 produces a distribution p (Π) with |supp(p)| ε 1 such that

Ψ M µ;h,ε(p) 3 log(2ε 1), (9)

and consequently ΨM h,ε(p) 6C log(2ε 1).

Proof of Theorem 4.1. Let us abbreviate edt = P

i<t d M,πi. Observe that for T = 1

Ψ M µ;h,ε(p) = sup π Π E M,π " µ(xh, ah)

Eπ p d M,π h (xh, ah) + C

T µ(xh, ah)

= T sup π Π E M,π " µ(xh, ah) ed T +1 h (xh, ah) + C µ(xh, ah)

and hence it suffices to bound the quantity on the right-hand side. Observe that for all t [T], we have that

sup π Π E M,π " µ(xh, ah) edt h(xh, ah) + C µ(xh, ah)

sup π Π E M,π " µ(xh, ah) ed t 1 h (xh, ah) + C µ(xh, ah)

and consequently

T sup π Π E M,π " µ(xh, ah) ed T +1 h (xh, ah) + C µ(xh, ah)

t=1 sup π Π E M,π " µ(xh, ah) edt h(xh, ah) + C µ(xh, ah)

t=1 E M,πt " µ(xh, ah) edt h(xh, ah) + C µ(xh, ah)

Scalable Online Exploration via Coverability

Finally, we note that

t=1 E M,πt " µ(xh, ah) edt h(xh, ah) + C µ(xh, ah)

t=1 µ(x, a) d M,πt(x, a) edt h(x, a) + C µ(x, a) .

Since supπ Π d M,π h (x, a) C µ(x, a) for all (x, a) X A, Lemma C.9 implies that

t=1 µ(x, a) d M,πt(x, a) edt h(x, a) + C µ(x, a) 2 log(2T),

allowing us to conclude that

Ψ M µ;h,ε(p) 2 log(2T) + εopt T.

H.4. Proofs from Section 4.2

Proposition 4.2. Fix h [H]. For any p (Πrns), if we define p (Πrns) as the distribution induced by sampling π p and executing π h πunif, we have that for all ε > 0,

Ψ M h,ε(p ) |A| Ψ M push;h,ε(p). (11)

Furthermore, Cov M push;h,ε CM push;h for all ε > 0.

Proof of Proposition 4.2. We first note that

Ψ M h,ε(p ) |A| sup π Π E M,π d M,π h (xh) d M,p h (xh) + ε d M,π h (xh)

Next, we write

E M,π d M,π h (xh) d M,p h (xh) + ε d M,π h (xh)

(d M,π h (x))2

d M,p h (x) + ε d M,π h (x).

We now state and prove a basic technical lemma.

Lemma H.1. For all ε, δ > 0, the function f(x) = x2 δ+εx is convex over R+.

Proof of Lemma H.1. This follows by verifying through direct calculation that

f (x) = ε x2

(δ + εx)2 , and f (x) = 4εδ x (δ + εx)3 0.

By Lemma H.1, the function

d M,p h (x) + ε d

is convex for all x. Hence, writing d M,π h (x) = E M,π P M h 1(x | xh 1, ah 1) , Jensen s inequality implies that for all x,

(d M,π h (x))2

d M,p h (x) + ε d M,π h (x) E M,π " (P M h 1(x | xh 1, ah 1))2

d M,p h (x) + ε P M h 1(x | xh 1, ah 1)

Scalable Online Exploration via Coverability

We conclude that

E M,π d M,π h (xh) d M,p h (xh) + ε d M,π h (xh)

(P M h 1(x | xh 1, ah 1))2

d M,p h (x) + ε P M h 1(x | xh 1, ah 1)

= E M,π " P M h 1(xh | xh 1, ah 1) d M,p h (x) + ε P M h 1(xh | xh 1, ah 1)

Ψ M push;h,ε(p).

We now prove the bound on Cov M push;h,ε. Let δ > 0 be given. Using the definition of CM push;h and the same argument as Lemma C.10, there exists µ (X) such that

Cov M push;h,ε 1 + δ

C M push;h inf p (Π) sup π Π E M,π " µ(xh) d M,p h (xh) + δ CM push;hµ(xh)

C M push;h inf p (Π) sup q (Π) Eπ q E M,π " µ(xh) d M,p h (xh) + δ CM push;hµ(xh)

Observe that the function

(p, q) 7 Eπ q E M,π " µ(xh) d M,p h (xh) + δ CM push;hµ(xh)

is convex-concave. In addition, it is straightforward to see that the function is jointly Lipschitz with respect to total variation distance whenever ε, δ > 0. Hence, using the minimax theorem (Lemma C.1), we have that

inf p (Π) sup q (Π) Eπ q E M,π " µ(xh) d M,p h (xh) + δ CM push;hµ(xh)

= sup q (Π) inf p (Π) Eπ q E M,π " µ(xh) d M,p h (xh) + δ CM push;hµ(xh)

sup q (Π) Eπ q E M,π " µ(xh) d M,q h (xh) + δ CM push;hµ(xh)

d M,q h (x)µ(x) d M,q h (x) + δ CM push;hµ(x) 1.

To conclude, we take δ 0.

Theorem 4.2. For any ε [0, 1] and h [H], whenever εopt CM push;h ε log(2ε 1), Algorithm 5 produces a distribution p (Π) with |supp(p)| ε 1 such that

Ψ M push;h,ε(p) 5C M push;h log(2ε 1). (12)

Consequently, if we define p (Πrns) as the distribution induced by sampling π p and executing π h πunif, we have that ΨM h,ε(p ) 5|A|CM push;h log(2ε 1).

Proof of Theorem 4.2. Let us abbreviate edt h = P

h . Observe that for T = 1

Ψ M push;h,ε(p) = sup π Π E M,π " P M h 1(xh | xh 1, ah 1)

Eπ p d M,π h (xh) + 1

T P M h 1(xh | xh 1, ah 1)

= T sup π Π E M,π " P M h 1(xh | xh 1, ah 1) ed T +1 h (xh) + P M h 1(xh | xh 1, ah 1)

Scalable Online Exploration via Coverability

and hence it suffices to bound the quantity on the right-hand side. Observe that for all t [T], we have that

sup π Π E M,π " P M h 1(xh | xh 1, ah 1) edt h(xh) + P M h 1(xh | xh 1, ah 1)

sup π Π E M,π " P M h 1(xh | xh 1, ah 1) ed t 1 h (xh) + P M h 1(xh | xh 1, ah 1)

and consequently

T sup π Π E M,π " P M h 1(xh | xh 1, ah 1) ed T +1 h (xh) + P M h 1(xh | xh 1, ah 1)

t=1 sup π Π E M,π " P M h 1(xh | xh 1, ah 1) edt h(xh) + P M h 1(xh | xh 1, ah 1)

t=1 E M,πt " P M h 1(xh | xh 1, ah 1) edt h(xh) + P M h 1(xh | xh 1, ah 1)

Now, let µ (X) attain the value of CM push;h. Using Lemma C.11 with ε = 1 and δ = CM push;h, we have that for all π Π,

E M,π " P M h 1(xh | xh 1, ah 1) edt h(xh) + P M h 1(xh | xh 1, ah 1)

E M,π " P M h 1(xh | xh 1, ah 1) edt h(xh) + CM push;hµ(xh)

+ C M push;h E M,π " µ(xh) edt h(xh) + CM push;hµ(xh)

2C M push;h E M,π " µ(xh) edt h(xh) + CM push;hµ(xh)

Hence, we can bound

t=1 E M,πt " P M h 1(xh | xh 1, ah 1) edt h(xh) + P M h 1(xh | xh 1, ah 1)

2C M push;h

t=1 E M,πt " µ(xh) edt h(xh) + CM push;hµ(xh)

= 2C M push;h X

t=1 µ(x) d M,πt(x) edt h(x) + CM push;hµ(x) .

Since supπ Π d M,π h (x) supx X,a A P M h 1(x | x , a) CM push;hµ(x) for all x X, Lemma C.9 implies that

t=1 µ(x) d M,πt(x) edt h(x) + C µ(x) 2 log(2T).

We conclude that

t=1 E M,πt " P M h 1(xh | xh 1, ah 1) edt h(xh) + P M h 1(xh | xh 1, ah 1)

4C M push;h log(2T) (35)

and ΨM push;h,ε(p) 4CM push;h log(2T) + εopt T.

I. Proofs and Additional Details from Section 5

This section is organized as follows:

Appendix I.1 presents our most general guarantee for Algorithm 2, Theorem I.1, and derives sample complexity bounds based on L1-Coverability as a consequence.

Appendix I.2 presents applications of these results to downstream policy optimization.

Appendix I.3 presents preliminary technical lemmas.

Appendix I.4 proves Theorem I.1, proving Theorem 5.1 as a corollary.

Scalable Online Exploration via Coverability

I.1. General Guarantees for Algorithm 2

In this section, we present general guarantees for CODEX (Algorithm 2) that (i) make us of online (as opposed to offline) estimation oracles, allowing for faster rates, and (ii) enjoy sample complexity scaling with L1-Coverability, improving upon the L -Coverability-based guarantees in Section 5.

I.1.1. ONLINE ESTIMATION ORACLES

For online estimation, we measure the oracle s estimation performance in terms of cumulative Hellinger error, which we assume is bounded as follows. Assumption I.1 (Online estimation oracle for M). At each time t [T], an online estimation oracle Alg Est for M returns, given Ht 1 = (π1, o1), . . . , (πt 1, ot 1) with oi M (πi) and πi pi, an estimator c M t M such that whenever M M,

Eston H (T) :=

t=1 Eπt pt D2 H c M t(π t), M (π t) Eston H (M, T, δ),

with probability at least 1 δ, where Eston H (M, T, δ) is a known upper bound.

See Section 4 of Foster et al. (2021) or Foster & Rakhlin (2023) for further background on online estimation. Algorithm 2 supports offline and online estimators, but is most straightforward to analyze for online estimators, and gives tighter sample complexity bounds in this case. The requirement in Assumption I.1 that the online estimator is proper (i.e., has c M t M) is quite stringent, as generic online estimation algorithms (e.g., Vovk s aggregating algorithm) are improper, and proper algorithms are only known for specialized MDP classes such as tabular MDPs (see discussion in Foster et al. (2021)).18 This contrasts with offline estimation, where most standard algorithms such as MLE are proper. As such, we present bounds based on online estimators as secondary results, with our bounds based on offline estimation serving as the main results.

Offline-to-online conversion. On the technical side, our interest in proper online estimation arises from the following structural result, which shows that whenever the L1-Coverability parameter is bounded, any algorithm with low offline estimation error also enjoys low online estimation error (with polynomial loss in rate).

Lemma I.1 (Offline-to-online). Any offline estimator c M t that satisfies Assumption 5.2 with estimation error bound Estoff H (M, T, δ) satisfies Assumption I.1 with Eston H (M, T, δ) e O H CM avg (1 + Estoff H (M, T, δ)) 1/3T 2/3 .

Note that CM avg Cov M 0 ; we leave an extension to Cov M ε for ε > 0 to future work. We also make use of a tighter offline-to-online lemma based on the (larger) L -Coverability parameter CM .

Lemma I.2 (Xie et al. (2023)). Any offline estimator c M t that satisfies Assumption 5.2 with estimation error bound Estoff H (M, T, δ) satisfies Assumption I.1 with Eston H (M, T, δ) e O H CM T Estoff H (M, T, δ) 1/2 + HCM .

Both lemmas lead to a degradation in rate with respect to T, but lead to sublinear online estimation error whenever the offline estimation error bound is sublinear.

I.1.2. GENERAL GUARANTEES FOR ALGORITHM 2

Our most general guarantee for Algorithm 2, which assumes access to an online estimation oracle, is as follows. Theorem I.1 (General guarantee for Algorithm 2). With parameters T N, C 1, and ε > 0 and an online estimation oracle satisfying Assumption I.1, whenever the optimization problem in Eq. (13) is feasible at every round, Algorithm 2 produces a policy covers p1, . . . , p H (Π) such that with probability at least 1 δ, h [H]: ΨM h,ε(ph)

H3C Eston H (M, T, δ) T + 8H

ε2 Eston H (M, T, δ)

Theorem 5.1 is derived by combining this result with Lemma I.2. The next result instantiates Theorem I.1 with Lemma I.1, allowing us to give sample complexity guarantees based on L1-Coverability that support offline estimation oracles.

18On the statistical side, it is straightforward to extend the results in this section to accommodate improper online estimators; we impose this restriction for computational reasons, as this enables the application of the efficient planning results in Section 4.

Scalable Online Exploration via Coverability

Corollary I.1 (Main guarantee for Algorithm 2 under L1-Coverability). Let ε > 0 be given. Suppose that (i) we restrict M such that all M M have Cov M ε Cov M ε , and (ii) we solve Eq. (13) with C = Cov M ε for all t (which is always feasible). Then, given access to an offline estimation oracle satisfying Assumptions 5.2 and 5.3, using T = e O H12(Cov M 0 )4dest log(Best/δ)

ε6 episodes, Algorithm 2 produces policy covers p1, . . . , p H (Π) such that

h [H] : Ψ M h,ε(ph) 12H Cov M ε (37)

with probability at least 1 δ. In particular, for a finite class M, if we use MLE as the estimator, we can take T = e O H12(Cov M 0 )4 log(|M|/δ) ε6 .

This result shows that L1-Coverability is itself a sufficiently powerful structural parameter to enable sample-efficient learning with nonlinear function approximation. Note that while Corollary I.1 assumes for simplicity that Eq. (13) is solved with C = Cov M ε , it should be clear that if we solve the objective for C > Cov M ε the result continues to hold with Cov M ε replaced by C in the sample complexity bound and approximation guarantee.

I.2. Applying Algorithm 2 to Downstream Policy Optimization

By Proposition 3.1 (see also Appendix D), the policy covers p1, . . . , p H returned by Algorithm 2 can be used to optimize any downstream reward function using standard offline RL algorithms. This leads to end-to-end guarantees for reward-driven PAC RL. For concreteness, we sketch an example which uses maximum likelihood (MLE) for offline policy optimization; see Appendix D for further examples and details.

Corollary I.2 (Application to reward-free reinforcement learning). Given access to H n trajectories from the policy covers p1, . . . , p H produced by Algorithm 2 (configured as in Theorem 5.1) and a realizable model class with M M, for any reward distribution R = {Rh}H h=1 with PH h=1 rh [0, 1], the Maximum Likelihood Estimation algorithm (described in Appendix D) produces a policy bπ such that with probability at least 1 δ, J M R (π ) J M R (bπ) O(H) q

HCM log(|M|/δ)

n + HCM ε , where J M R (π) := E M ,πh PH h=1 rh i denotes the expected reward in M when R is the reward distribution.

We now sketch some basic examples in which Corollary I.2 can be applied.

Example I.1 (Tabular MDPs). For tabular MDPs with |X| S and |A| A, we can construct online estimators for which Eston H (M, T, δ) = e O(HS2A), so that Theorem 5.1 gives sample complexity T = poly(H,S,A)

ε2 to compute policy covers such that ΨM h,ε(ph) 12H CM .

Example I.2 (Low-Rank MDPs). Consider the Low-Rank MDP model in Eq. (30) with dimension d and suppose, following Agarwal et al. (2020); Uehara et al. (2022), that we have access to classes Φ and Ψ such that ϕh Φ and ψh Ψ. Then MLE achieves Estoff H (M, T, δ) = e O(log(|Φ||Ψ|)), and we can take CM d|A|, so Theorem 5.1 gives sample complexity T = poly(H,d,|A|,log(|Φ||Ψ|))

ε4 to compute policy covers such that ΨM h,ε(ph) 12H CM .

I.3. Technical Lemmas

Lemma I.1 (Offline-to-online). Any offline estimator c M t that satisfies Assumption 5.2 with estimation error bound Estoff H (M, T, δ) satisfies Assumption I.1 with Eston H (M, T, δ) e O H CM avg (1 + Estoff H (M, T, δ)) 1/3T 2/3 .

Proof of Lemma I.1. Let us abbreviate dt h = d M ,πt

h . By Lemma A.11 of Foster et al. (2021), we have that

Eston H (T) =

t=1 Eπt pt h D2 H M (π t), c M t(π t) i

t=1 Eπt pt E M ,πth D2 H P M h (xh, ah), P c Mt h (xh, ah) + D2 H R M h (xh, ah), R c Mt h (xh, ah) i .

Scalable Online Exploration via Coverability

At the same time, for all t, we have by Lemma C.5 that

i<t Eπi pi E M ,πih D2 H P M h (xh, ah), P c Mt h (xh, ah) + D2 H R M h (xh, ah), R c Mt h (xh, ah) i

i<t Eπi pi h D2 H M (π i), c M t(π i) i 4Estoff H (t).

The result now follows by applying Lemma I.3 stated and proven below to the function g(x, a) = D2 H P M h (x, a), P c Mt h (x, a) + D2 H RM h (x, a), R c Mt h (x, a) for each layer h [H], which has B 4.

Lemma I.2 (Xie et al. (2023)). Any offline estimator c M t that satisfies Assumption 5.2 with estimation error bound Estoff H (M, T, δ) satisfies Assumption I.1 with Eston H (M, T, δ) e O H CM T Estoff H (M, T, δ) 1/2 + HCM .

Proof of Lemma I.2. Let us abbreviate dt h = d M ,πt

h . By Lemma A.11 of Foster et al. (2021), we have that

Eston H (T) =

t=1 Eπt pt h D2 H M (π t), c M t(π t) i

t=1 Eπt pt E M ,πth D2 H P M h (xh, ah), P c Mt h (xh, ah) + D2 H R M h (xh, ah), R c Mt h (xh, ah) i .

At the same time, for all t, we have by Lemma C.5 that

i<t Eπi pi E M ,πih D2 H P M h (xh, ah), P c Mt h (xh, ah) + D2 H R M h (xh, ah), R c Mt h (xh, ah) i

i<t Eπi pi h D2 H M (π i), c M t(π i) i 4Estoff H (t).

The result now follows by applying Lemma I.4 stated and proven below to the function g(x, a) = D2 H P M h (x, a), P c Mt h (x, a) + D2 H RM h (x, a), R c Mt h (x, a) for each layer h [H], which has B 4.

Lemma I.3. Fix an MDP M and layer h [H]. Suppose we have a sequence of functions g1, . . . , g T [0, B] and policies π1, . . . , πT such that

i<t E M,πi (g t(xh, ah))2 β2. (38)

Then it holds that

t=1 E M,πt[g t(xh, ah)] = 2(C M avg;h)1/3(β2B + B3)1/3T 2/3. (39)

Proof of Lemma I.3. Let µ (X A) denote a distribution that achieves the value CM avg;h. Let us abbreviate

dt h(x, a) = d M,πt

h (x, a) and

ed t h(x, a) = X

h (x, a). (40)

Observe that by Eq. (38) and the assumption that gt [0, B], we have that X

x X,a A edt+1 h (x, a)(g t(x, a))2 β2 + B2 =: α2. (41)

Scalable Online Exploration via Coverability

To begin, fix a parameter λ > 0 and define

τ(x, a) := min n t | edt+1 h (x, a) λµ(x, a) o . (42)

We can bound

t=1 E M,πt[g t(xh, ah)]

t=1 E M,πt[g t(xh, ah)I{t τ(xh, ah)}] + B

t=1 E M,πt[I{t < τ(xh, ah)}]. (43)

For the second term above, we can write

t=1 E M,πt[I{t < τ(xh, ah)}] = X

t=1 d t h(x, a)I{t < τ(x, a)}

x,a ed τ(x,a) h (x, a) < λ X

x,a µ(x, a) = λ,

where the final inequality uses the definition of τ(x, a).

For the first, term, using Cauchy-Schwarz, we can bound

t=1 E M,πt[g t(xh, ah)I{t τ(xh, ah)}]

x X,a A d t h(x, a)g t(x, a)I{t τ(x, a)}

(ed t+1 h (x, a))1/2 I{t τ(x, a)} (ed t+1 h (x, a))1/2g t(x, a)

(dt h(x, a))2

ed t+1 h (x, a) I{t τ(x, a)}, and B :=

x X,a A ed t+1 h (x, a)(g t(x, a))2. (44)

Eq. (41) implies that

x X,a A ed t+1 h (x, a)(g t(x, a))2 α2T.

It remains to bound term A. From the definition of τ(x, a), we can bound

(dt h(x, a))2

ed t+1 h (x, a) I{t τ(x, a)} 1

(dt h(x, a))2

µ(x, a) CM avg;h T

where the last inequality uses that µ achieves the value of CM avg;h.

Combining the results so far, we have that

t=1 E M,πt[g t(xh, ah)] (λ 1C M avg;h T)1/2(α2T)1/2 + λB = (C M avg;hα2)1/2T/λ1/2 + λB.

We choose λ = (CM avg;hα2)1/3T 2/3/B2/3 to balance the terms, which gives a bound of the form

2(C M avg;hα2B)1/3T 2/3 = 2(C M avg;h)1/3(β2B + B3)1/3T 2/3.

Scalable Online Exploration via Coverability

Lemma I.4. Fix an MDP M and layer h [H]. Suppose we have a sequence of functions g1, . . . , g T [0, B] and policies π1, . . . , πT such that

i<t E M,πi (g t(xh, ah))2 β2. (45)

Then it holds that

t=1 E M,πt[g t(xh, ah)] = O q

CM ;h T log(T) β2 + C M ;h B . (46)

Proof of Lemma I.4. See proof of Theorem 1 in Xie et al. (2023).

I.4. Proofs from Section 5 and Appendix I.1

Theorem I.1 (General guarantee for Algorithm 2). With parameters T N, C 1, and ε > 0 and an online estimation oracle satisfying Assumption I.1, whenever the optimization problem in Eq. (13) is feasible at every round, Algorithm 2 produces a policy covers p1, . . . , p H (Π) such that with probability at least 1 δ, h [H]: ΨM h,ε(ph)

H3C Eston H (M, T, δ) T + 8H

ε2 Eston H (M, T, δ)

Overview of proof. Before diving into the proof of Theorem I.1, we sketch the high-level idea. The crux of the analysis is to show that for each round t, we have that

Cov M h,ε(p t h) max h Cov c Mt h,ε(p t h) + 1

H3C Eπ qt D2 H c M t(π), M (π) + H

ε2 Eπ qt D2 H c M t(π), M (π) . (47)

Averaging across all rounds t, this allows us to conclude that

t=1 Cov M h,ε(p t h) HC + 1

v u u t H3C 1

t=1 Eπ qt D2 H c M t(π), M (π) + H

t=1 Eπ qt D2 H c M t(π), M (π) .

Since Cov M h,ε(ph) 1

T PT t=1 Cov M h,ε(pt h) and PT t=1 Eπ qt D2 H c M t(π), M (π) Eston H (M, T, δ) with probability at least 1 δ, this proves the result.

Eq. (47) can be thought as a reward-free analogue of a bound on the Decision-Estimation Coefficient (DEC) of Foster et al. (2021; 2023) and makes precise the reasoning that by optimizing the plug-in approximation to the L1-Coverage objective, we either 1) cover the true MDP M well, or 2) achieve large information gain (as quantified by the instantaneous estimation error Eπ qt D2 H c M t(π), M (π) ).

Proof of Theorem I.1. Observe that by Jensen s inequality, we have that for all h [H],

T Ψ M h,ε(ph)

t=1 Ψ M h,ε(p t h).

We will show how to bound the right-hand side above. To begin, we state two technical lemmas, both proven in the sequel.

Lemma I.5. Fix h [H]. For all mixture policies p (Π) and MDPs c M = X, A, {P c M h }H h=0 , it holds that

Ψ M h,ε(p) max π E M ,π " d M ,π h (xh, ah) + d c M,π h (xh, ah)

d M ,p h (xh, ah) + ε (d M ,π h (xh, ah) + d c M,π h (xh, ah))

For the next result, for a given reward-free MDP M = X, A, {P M h }H h=0 and reward distribution R = {Rh}H h=1, we define

J M R (π) := E M,π " H X

as the value under rh R(xh, ah).

Scalable Online Exploration via Coverability

Lemma I.6. Consider the reward-free setting. Let an MDP c M = X, A, {P c M h }H h=0 be given, and let p1, . . . , p H (Π) be (C, ε)-policy covers for c M, i.e.

Ψ c M h,ε(ph) C h [H]. (48)

Then the distribution q := Unif(p1, . . . , p H) ensures that for all MDPs M = X, A, {P M h }H h=0 , all reward distributions R = {Rh}H h=1 with PH h=1 rh [0, B] almost surely, and all policies π Π,

J M R (π) J c M R (π) 4B q

H3C Eπ q D2 H c M(π), M(π) +

For the remainder of the proof, we abbreviate d M,π h d M,π h (xh, ah) whenever the argument is clear from context. Let h [H] be fixed. We observe that by Lemma I.5,

t=1 Ψ M h,ε(p t h)

t=1 max π Π E M ,π " d M ,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

+ max π Π E M ,π " d c Mt,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

For any π Π, by applying Lemma I.6 for each step t with the deterministic rewards

r t h(x, a) = d M ,π h (x, a)

d M ,pt h (x, a) + ε (d M ,π h (x, a) + d c Mt,π h (x, a)) I{h = h},

which satisfy PH h=1 rt h 0, ε 1 almost surely, we can bound

E M ,π " d M ,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

E c Mt,π " d M ,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

H3C Eπ qt D2 H c M t(π), M (π) +

= E M ,π " d c Mt,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

H3C Eπ qt D2 H c M t(π), M (π) +

where we have simplified the second term by noting that

2HC. It follows that

t=1 Ψ M h,ε(p t h) 2

t=1 max π Π E M ,π " d c Mt,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

H3C Eπ qt D2 H c M t(π), M (π)

We now appeal to Lemma I.6 once more. For any π, by applying Lemma I.6 again at each step t with the rewards

r t h(x, a) = d c Mt,π h (x, a)

d M ,pt h (x, a) + ε (d M ,π h (x, a) + d c Mt,π h (x, a)) I{h = h},

which satisfy PH h=1 rh [0, ε 1] almost surely, allows us to bound

E M ,π " d c Mt,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

E c Mt,π " d c Mt,π h d M ,pt h h + ε (d M ,π h + d c Mt,π h )

H3C Eπ qt D2 H c M t(π), M (π) +

E c Mt,π " d c Mt,π h d M ,pt h h + ε d c Mt,π h

H3C Eπ qt D2 H c M t(π), M (π) +

Scalable Online Exploration via Coverability

We conclude that

t=1 Ψ M h,ε(p t h) 2

t=1 max π Π E c Mt,π " d c Mt,π h d M ,pt h h + ε d c Mt,π h

H3C Eπ qt D2 H c M t(π), M (π) + 3

t=1 max π Π E c Mt,π " d c Mt,π h d M ,pt h h + ε d c Mt,π h

H3CT Eston H (M, T, δ) + 3

We now appeal to the following lemma.

Lemma I.7. Consider the reward-free setting. For any pair of MDP c M = X, A, {P c M h }H h=0 and M = X, A, {P M h }H h=0 , any policy π Πrns, and any distribution p (Πrns), it holds that

E c M,π " d c M,π h d M ,p h + ε d c M,π h

3 E c M,π " d c M,π h d c M,p h + ε d c M,π h

ε2 Eπ p h D2 H c M(π), M (π) i .

Combining Eq. (49) with Lemma I.7, we have that

t=1 Ψ M h,ε(p t h) 6

t=1 Ψ c Mt h,ε(p t h) + 12

H3CT Eston H (M, T, δ) + 3

t=1 Eπ pt h

h D2 H c M t(π), M (π) i

t=1 Ψ c Mt h,ε(p t h) + 12

H3CT Eston H (M, T, δ) + 3

t=1 Eπ qt h D2 H c M t(π), M (π) i

t=1 Ψ c Mt h,ε(p t h) + 12

H3CT Eston H (M, T, δ) + 3

ε2 Eston H (M, T, δ). (50)

To conclude the proof of Eq. (36), we note that it follows from the definition of pt h that for all h [H] and t [T],

Ψ c Mt h,ε(p t h) C.

Hence, we have that

t=1 Ψ M h,ε(p t h) 6CT + 12

H3CT Eston H (M, T, δ) + 3

ε2 Eston H (M, T, δ)

H3CT Eston H (M, T, δ) + 8H

ε2 Eston H (M, T, δ).

This implies that

Ψ M h,ε(ph) 11HC + 12

H3C Eston H (M, T, δ) T + 8H

ε2 Eston H (M, T, δ)

as desired.

Corollary I.1 (Main guarantee for Algorithm 2 under L1-Coverability). Let ε > 0 be given. Suppose that (i) we restrict M such that all M M have Cov M ε Cov M ε , and (ii) we solve Eq. (13) with C = Cov M ε for all t (which is always feasible). Then, given access to an offline estimation oracle satisfying Assumptions 5.2 and 5.3, using T = e O H12(Cov M 0 )4dest log(Best/δ)

ε6 episodes, Algorithm 2 produces policy covers p1, . . . , p H (Π) such that

h [H] : Ψ M h,ε(ph) 12H Cov M ε (37)

with probability at least 1 δ. In particular, for a finite class M, if we use MLE as the estimator, we can take T = e O H12(Cov M 0 )4 log(|M|/δ) ε6 .

Scalable Online Exploration via Coverability

Proof of Corollary I.1. We prove Eq. (37) as a consequence of Theorem I.1. We can take C Cov M ε , and using Lemma I.1, we have

Eston H (M, T, δ) e O H C M avg (1 Estoff H (M, T, δ)) 1/3 T 2/3 e O H Cov M 0 (1 Estoff H (M, T, δ)) 1/3 T 2/3 ,

so that Eq. (36) gives

h [H] : Ψ M h,ε(ph) 11HC + e O

H12(Cov M 0 )4(1 Estoff H (M, T, δ)) T

where the final inequality uses the choice for T in the corollary statement.

Theorem 5.1 (Main guarantee for CODEX). Let ε > 0 be given. Let C CM , and suppose that (i) we restrict M such that all M M have CM C , and (ii) we solve Eq. (13) with C = C for all t.19 Then, given an offline estimation

oracle satisfying Assumptions 5.2 and 5.3, using T = e O H8(CM )3dest log(Best/δ)

ε4 episodes, Algorithm 2 produces policy

covers p1, . . . , p H (Π) such that

h [H] : Ψ M h,ε(ph) 12H C M (15)

w.p. at least 1 δ. For a finite class M, if we use MLE as the estimator, we can take T = e O H8(CM )3 log(|M|/δ)

Proof of Theorem 5.1. We prove Eq. (15) as a consequence of Theorem I.1. We can take C CM , and using Lemma I.2, we have

Eston H (M, T, δ) e O H q

CM T Estoff H (M, T, δ) + H C M

CM T (1 Estoff H (M, T, δ)) ,

so that Eq. (36) gives

h [H] : Ψ M h,ε(ph) 11HC + e O

H8(CM )3(1 Estoff H (M, T, δ)) T

where the final inequality uses the choice for T in the theorem statement.

I.4.1. SUPPORTING LEMMAS

Lemma I.5. Fix h [H]. For all mixture policies p (Π) and MDPs c M = X, A, {P c M h }H h=0 , it holds that

Ψ M h,ε(p) max π E M ,π " d M ,π h (xh, ah) + d c M,π h (xh, ah)

d M ,p h (xh, ah) + ε (d M ,π h (xh, ah) + d c M,π h (xh, ah))

Proof of Lemma I.5. To keep notation compact, let us suppress the dependence on xh and ah. For all π Πrns and p (Πrns), we have that

E M ,π " d M ,π h d M ,p h + ε d M ,π h

E M ,π " d M ,π h d M ,p h + ε (d M ,π h + d c M,π h )

= E M ,π " d M ,π h ε d c M,π h (d M ,p h + ε d M ,π h )(d M ,p h + ε (d M ,π h + d c M,π h ))

E M ,π " d c M,π h d M ,p h + ε (d M ,π h + d c M,π h )

19We can take CM C w.l.o.g. when C is known. In this case, solving Eq. (13) with C = C is feasible by Proposition 3.2.

Scalable Online Exploration via Coverability

This proves the result.

Lemma I.6. Consider the reward-free setting. Let an MDP c M = X, A, {P c M h }H h=0 be given, and let p1, . . . , p H (Π) be (C, ε)-policy covers for c M, i.e.

Ψ c M h,ε(ph) C h [H]. (48)

Then the distribution q := Unif(p1, . . . , p H) ensures that for all MDPs M = X, A, {P M h }H h=0 , all reward distributions R = {Rh}H h=1 with PH h=1 rh [0, B] almost surely, and all policies π Π,

J M R (π) J c M R (π) 4B q

H3C Eπ q D2 H c M(π), M(π) +

Proof of Lemma I.6. Let an arbitrary MDP M = X, A, {P M h }H h=0 , reward distribution R = {Rh}H h=1, and policy π Π be fixed. To begin, using the simulation lemma (Lemma C.8), we have

J M R (π) J c M R (π) B

h=1 E c M,πh DH P c M h (xh, ah), P M h (xh, ah) i ,

where we have used that both MDPs have the same reward distribution. Let h [H] be fixed. Since DH P c M h (xh, ah), P M h (xh, ah) [0,

2], we can use Proposition 3.1 to bound

E c M,π DH P c M h (xh, ah), P M h (xh, ah) 2 q

Ψ c M h,ε(ph) E c M,ph D2 H P c M h (xh, ah), P M h (xh, ah) +

2 Ψ c M h,ε(ph) ε

HΨ c M h,ε(ph) E c M,q D2 H P c M h (xh, ah), P M h (xh, ah) +

2 Ψ c M h,ε(ph) ε

HC E c M,q D2 H P c M h (xh, ah), P M h (xh, ah) +

HC Eπ q D2 H c M(π), M(π) +

where the last inequality follows form Lemma C.5. Summing across all layers, we conclude that

J M R (π) J c M R (π) 4B q

H3C Eπ q D2 H c M(π), M(π) +

Lemma I.7. Consider the reward-free setting. For any pair of MDP c M = X, A, {P c M h }H h=0 and M = X, A, {P M h }H h=0 , any policy π Πrns, and any distribution p (Πrns), it holds that

E c M,π " d c M,π h d M ,p h + ε d c M,π h

3 E c M,π " d c M,π h d c M,p h + ε d c M,π h

ε2 Eπ p h D2 H c M(π), M (π) i .

Scalable Online Exploration via Coverability

Proof of Lemma I.7. Consider any c 1. For any π Πrns and p (Πrns), we can write

E c M,π " d c M,π h d M ,p h + ε d c M,π h

c E c M,π " d c M,π h d c M,p h + ε d c M,π h

d c M,π h d c M,p h + ε d c M,π h c d M ,p h + ε d c M,π h

d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h

d c M,π h d c M,p h c d M ,p h

d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h

(d c M,π h )2 d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h

(d c M,π h )2 d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h

Observe that

(d c M,π h )2 d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h 1

almost surely. Consequently, Lemma C.6 implies that for c = 3,

(d c M,π h )2 d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h

(d c M,π h )2 d M ,p h + ε d c M,π h d c M,p h + ε d c M,π h

ε2 D2 H d c M,p h , d M ,p h .

Finally, note that by joint convexity of Hellinger distance and the data processing inequality, we have that

D2 H d c M,p h , d M ,p h Eπ p h D2 H d c M,π h , d M ,π h i Eπ p h D2 H c M(π), M (π) i .

Scalable Online Exploration via Coverability

J. Proofs from Appendix F

Proposition F.1. For all ε > 0, it holds that Cov M h,ε 1 + 2 q

Proof of Proposition F.1. Let µ (X A) be the distribution that attains the value of CM avg;h. Using Lemma C.10, we have that for all π Π and p (Π),

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε d M,π h (xh, ah)

2 E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε (d M,π h (xh, ah) + µ(xh, ah))

+ E M,π µ(xh, ah) d M,p h (xh, ah) + ε (d M,π h (xh, ah) + µ(xh, ah))

2 E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε (d M,π h (xh, ah) + µ(xh, ah))

+ E M,π µ(xh, ah) d M,p h (xh, ah) + ε µ(xh, ah))

Observe that we can bound

E M,π d M,π h (xh, ah) d M,p h (xh, ah) + ε (d M,π h (xh, ah) + µ(xh, ah))

(d M,π h (x, a))2

d M,p h (x, a) + ε (d M,π h (x, a) + µ(x, a))

d M,π h (x, a)µ1/2(x, a) d M,p h (x, a) + ε (d M,π h (x, a) + µ(x, a)) d M,π h (x, a) µ1/2(x, a)

x X,a A µ(x, a) (d M,π h (x, a))2

(d M,p h (x, a) + ε (d M,π h (x, a) + µ(x, a)))2

(d M,π h (x, a))2

x X,a A µ(x, a) d M,π h (x, a) d M,p h (x, a) + ε (d M,π h (x, a) + µ(x, a))

C M avg;h 1/2

ε E M,π µ(xh, ah) d M,p h (xh, ah) + ε µ(xh, ah)

1/2 C M avg;h 1/2.

Hence, if we define

V := inf p (Π) sup π Π E M,π µ(xh, ah) d M,p h (xh, ah) + ε µ(xh, ah)

this argument establishes that

Cov M h,ε V + 2

We now claim that V 1. To see this, observe that the function

(p, q) 7 Eπ q E M,π µ(xh, ah) d M,p h (xh, ah) + ε µ(xh, ah)

is convex-concave. In addition, it is straightforward to see that the function is jointly Lipschitz with respect to total variation

Scalable Online Exploration via Coverability

distance whenever ε > 0. Hence, using the minimax theorem (Lemma C.1), we have that

V = inf p (Π) sup q (Π) Eπ q E M,π µ(xh, ah) d M,p h (xh, ah) + ε µ(xh, ah)

= sup q (Π) inf p (Π) Eπ q E M,π µ(xh, ah) d M,p h (xh, ah) + ε µ(xh, ah)

sup q (Π) Eπ q E M,π µ(xh, ah) d M,q h (xh, ah) + ε µ(xh, ah)

d M,q h (x, a)µ(x, a) d M,q h (x, a) + ε µ(x, a) 1.

Proposition F.2. Suppose the MDP M obeys the low-rank structure in Eq. (30). Then for all h [H], we have

CM avg;h |A| CM ϕ;h 1, and consequently Cov M h,ε 1 + 2 q

|A| CM ϕ;h 1 ε .

Proof of Proposition F.2. We first note that

C M avg;h |A| inf µ (A) sup π Π E M,π d M,π h (xh) µ(xh)

Let ν (X A) be arbitrary, and let µ := ν h 1 P M h 1 be the distribution induced by sampling (xh 1, ah 1) ν and xh P M h 1( | xh 1, ah 1). Let π Π be arbitrary. We can write

E M,π d M,π h (xh) µ(xh)

E M,π[ϕh 1(xh 1, ah 1)], X

x X ψh(x)d M,π h (x) µ(x) | {z } =:w

Using Cauchy-Schwarz and defining Σν := E(xh 1,ah 1) ν ϕh 1(xh 1, ah 1)ϕh 1(xh 1, ah 1) , we can bound

E M,π[ϕh 1(xh 1, ah 1)], w = D Σ 1/2 ν E M,π[ϕh 1(xh 1, ah 1)], Σ1/2 ν w E

2 E M,π[ϕh 1(xh 1, ah 1)] 2 Σ 1 ν + 1

We can write

w 2 Σν = E(xh 1,ah 1) ν h ϕh 1(xh 1, ah 1), w 2i

= E(xh 1,ah 1) ν

ϕh 1(xh 1, ah 1), X

x X ψh(x)d M,π h (x) µ(x)

= E(xh 1,ah 1) ν

" E M d M,π h (xh) µ(xh) | xh 1, ah 1

" d M,π h (xh) µ(xh)

" d M,π h (xh) µ(xh)

= E M,π d M,π h (xh) µ(xh)

Hence, we have shown that

E M,π d M,π h (xh) µ(xh)

2 E M,π[ϕh 1(xh 1, ah 1)] 2 Σ 1 ν + 1

2 E M,π d M,π h (xh) µ(xh)

To conclude, we rearrange and recall that 1) π is arbitrary, and 2) we are free to choose ν to minimize the right-hand side. From here, the claim follows from Proposition F.1.

Scalable Online Exploration via Coverability

Proposition F.3. There exists an MDP M and policy class Π Πrns with horizon H = 1 such that CM ;h 2 (and hence Cov M h,ε 2 as well), yet for all ε > 0,

inf p (Π) Ψ M ;h,ε(p) 1

and in particular infp (Π) ΨM ;h,0(p) = .

Proof of Proposition F.3. Consider an MDP M with horizon H = 1, a singleton state space S = {s}, and action space A = N { }. For each i N, we define a randomized policy πi via

π i(a | s) =

1 1 2i2 , a = , 1 2i2 , a = i, 0, o.w. ,

d M,πi 1 (s, a) =

1 1 2i2 , z = , 1 2i2 , z = i, 0, o.w.

We set Π = {πi}i N, and abbreviate di(s, a) = d M,πi 1 (s, a) going forward.

We first bound CM . We choose µ by setting µ(s, ) = 1

2 and µ(s, i) = 3 π2 1

i2 , which has P a A µ(s, a) = 1. It is fairly

immediate to see that for all i, we have di(s, )

µ(s, ) 2 and

µ(s, i) = π2

This shows that CM 2. On the other hand, for any p (N), we have

Ψ M ;h,ε(p) sup i N sup j N

di(s, j) Ek p[dk(s, j)] + ε di(s, j)

di(s, i) Ek p[dk(s, i) + ε di(s, i)]

(p(i) + ε) (1/2i2)

1 p(i) + ε = 1

where the conclusion holds because P

i N p(i) 1, which means for all δ > 0, there exists i such that p(i) δ.

K. Proofs from Appendix E

K.1. General Guarantees for Algorithm 3

We first present general assumptions on the weight function estimation and policy optimization subroutines under which Algorithm 3 can be analyzed, then present our most general result, Theorem K.1.

K.1.1. WEIGHT FUNCTION REALIZABILITY

Theorem E.1 is analyzed under the weight function realizability assumption in Assumption E.1. However, Algorithm 3 is most directly analyzed in terms of the following, slightly stronger weight function assumption, which we show is implied by Assumption E.1. To motivate the assumption, recall that we seek to estimate a weight function bwt h approximating Eq. (26). Assumption K.1 (Weight function realizability strong version). For a parameter T N, we assume that for all h 2, all t [T], and all policies π1, . . . , πt 1 Πns, we have that

h (x | x, a) := P M h 1(x | x, a) P

i<t d M ,πi h (x ) + P M h 1(x | x, a) Wh.

Scalable Online Exploration via Coverability

We assume without loss of generality that w 1 for all w Wh.

To compute a policy cover that approximately solves ph = arg minp (Πrns) ΨM push;h,ε(ph) for parameter ε > 0, we require that Assumption K.1 holds for T = 1

Algorithm 3 enjoys tighter guarantees when Assumption K.1 is satisfied, but the following result shows that Assumption E.1 implies Assumption K.1 at the cost of a small degradation in rate. Proposition K.1. For any T N, given a weight function class W satisfying Assumption E.1, the induced class W given by

(x, a, x ) 7 1 1 + P

i<t 1 wi h(x |x,a) | w 1 h, . . . , w t 1 h Wh, t [T]

satisfies Assumption K.1, and has log|W h| O(T log|Wh|).

ε this increases the weight function class size from log|W| to O 1

ε log|W| , leading to an extra 1

ε factor in the final sample complexity bound for our main result (Theorem E.1). Remark K.1 (Sufficiency of Assumption K.1). When invoked with layer h 2 and iteration t 2 within Algorithm 3, Estimate Weight Function (Algorithm 4) collects datasets D1 µ and D2 ν such that

µ(x | x, a) = P M h 1(x | x, a), ν(x | x, a) = 1

i<t d M ,πh,i

h (x ) + P M h 1(x | x, a)

µ(x, a) = ν(x, a) = 1

d M ,ph 1 h 1 (x, a) + 1 t 1

i<t d M ,πh,i h 1πunif h 1 (x, a)

Then, in Line 10, the algorithm computes the estimator Eq. (27) with respect to the class t W, which is guaranteed to have µ(x,a,x ) ν(x,a,x ) = t wt h(x | x, a) t W under Assumption K.1.

K.1.2. POLICY OPTIMIZATION SUBROUTINE

This section presents general conditions for the subroutine Policy Optimization under which CODEX.W obtains the same guarantees as in Theorem E.1, and establishes that PSDP satisfies this assumption.

To formalize the requirement of Policy Optimization, recall that for each layer h 2, iteration t [T], and each ℓ h 1, we define

Q M ,π ℓ (x, a; bw t h) = E M ,π[ bw t h(xh | xh 1, ah 1) | xℓ= x, aℓ= a]

as the Q-function for a policy π Πns under the (stochastic) reward rt h 1 = bwt h(xh | xh 1, ah 1) in Algorithm 3. We assume that the policy πt,h = Policy Optimizationh 1(rh,t, p1:h 1, ϵ, δ) approximately maximizes this Q-function under p1, . . . , ph 1. Assumption K.2 (Local optimality for policy optimization). For any fixed iteration h 2 and t [T], the subroutine Policy Optimizationh 1(rh,t, p1:h 1, ϵ, δ) produces a policy πh,t such that with probability at least 1 δ,

ℓ=1 E M ,pℓ max a A Q M ,πh,t

ℓ (xℓ, a; bw t h) Q M ,πh,t

ℓ (xℓ, π h,t(xℓ); bw t h) ϵ, (52)

and does so using Nopt(ϵ, δ) episodes.

This assumption asserts that πt,h cannot be substantially improved, but only with respect to the state distribution induced by p1, . . . , ph 1.20 This is a weak guarantee that can be achieved using only data collected from p1, . . . , ph 1 (e.g., via offline RL methods or hybrid offline/online methods), and does not require systematic exploration.

20Note that if p1, . . . , ph 1 uniformly cover all policies, then Assumption K.2 implies that πh,t is globally optimal by the performance difference lemma. However, Eq. (52) can still lead to useful guarantees in the presence of partial coverage, which our analysis critically exploits.

Scalable Online Exploration via Coverability

The next result shows that PSDP satisfies Assumption K.2 under the value function realizability assumption in Assumption E.2.

Lemma K.1 (Local optimality for PSDP). Suppose Assumption E.2 holds. Then for any ϵ, δ (0, 1), the subroutine πh,t =

PSDPh 1(rh,t, p1:h 1, ϵ, δ) satisfies Assumption K.2, and does so using at most Npsdp(ϵ, δ) = O H3|A| log(|Q|Hδ 1)

See Appendix K.4 for details. We expect that similar guarantees can be proven for Natural Policy Gradient, Conservative Policy Iteration, and other standard local search methods. Different subroutines may allow one to make use of weaker function approximation requirements.

K.1.3. GENERAL GUARANTEE FOR CODEX.W (ALGORITHM 3)

Our most general guarantee for CODEX.W is given below.

Theorem K.1 (General guarantee for Algorithm 3). Let ε (0, 1/2) and δ (0, e 1) be given, and suppose that Assumption K.1 and Assumption K.2 are satisfied. Then Algorithm 3 produces policy covers p1, . . . , p H (Πrns) such that with probability at least 1 δ, for all h [H],

Ψ M push;h,ε(ph) 170H log(ε 1) C M push,

and does so using at most

N e O H|A| log(|W|δ 1)

ε Nopt(cε2, δ/2HT)

episodes, where c > 0 is a sufficiently small absolute constant.

In particular, this result shows that we can optimize the pushforward coverability objective (and consequently the L1Coverage objective, via Proposition 4.2), up to small O(H log(ε 1)) approximation factor. The sample complexity is polynomial in all relevant problem parameters whenever the subroutine Policy Optimization has polynomial sample complexity. Note that the sample complexity for the first term is of order 1/ε3 (as opposed to the slower 1/ε4 in Theorem E.1) since we are stating this result under the stronger weight function realizability assumption (Assumption K.1).

Combining Theorem K.1 with Proposition K.1 and the guarantee for PSDP (Lemma K.1), we obtain Theorem E.1.

K.2. Technical Preliminaries

Lemma K.2. For any distribution ω (Z) and any pair of functions w, w : Z R+,

Eω[w] 3 Eω[w ] + 2 Eω ( w

Proof of Lemma K.2. By AM-GM, we have

|Eω[w] Eω[w ]| Eω h | w

w )2 Eω ( w

2(Eω[w] + Eω[w ]) + 1

Rearranging, we conclude that

Eω[w] 3 Eω[w ] + 2 Eω ( w

Scalable Online Exploration via Coverability

Lemma K.3 (e.g., Xie et al. (2023); Mhammedi et al. (2023a)). Consider a set Z and a sequence of distributions d1, . . . , d T (Z) for which there exists a distribution µ (Z) such that supz Z n dt(z)

µ(z) o C for all t [T]. For any

sequence of functions g1, . . . , g T (Z [ B, B]), it holds that

t=1 Ez dt[g(z)]

v u u t2C log(2T)

i<t Ez di[(gt(z))2] + 2CB. (53)

K.3. Weight Function Estimation

In this section, we give self-contained guarantees for the statistical problem of estimating the density ratio (or, weight function ) for a pair of distributions.

Consider the following setting. Let Z be a set. We receive samples z1 µ, . . . , zn µ Z and z1 ν, . . . , zn ν Z, where zt µ µt (Z) and zt ν νt (Z). The distributions µt and νt can be chosen in an adaptive fashion based on z1 µ, z1 ν, . . . , zt 1 µ , zt 1 ν . We define µ = 1

n Pn t=1 µt and ν = 1

n Pn t=1 νt, and our goal is to estimate the density ratio

w (z) := µ(z)

We assume that w B, and assume access to a realizable weight function class W with w W. Following Nguyen et al. (2010) (see also Katdare et al. (2023)), we consider the estimator

bw := arg max w W b Eµ[log(w)] b Eν[w], (54)

where b Eµ[ ] denotes the empirical expectation with respect to z1 µ, . . . , zn µ and b Eν[ ] denotes the empirical expectation with respect to z1 ν, . . . , zn ν. The following theorem gives a finite-sample bound for this estimator, which may be of independent interest.

Theorem K.2. Suppose that w W and supw W w B. The estimator in Eq. (54) ensures that with probability at least 1 δ,

D2 H,ν( bw, w ) 20B log(|W|δ 1)

where D2 H,ν(w, w ) := Eν w

Remark K.2 (Extension to contextual weight function estimation). An immediate corollary for Theorem K.2 concerns the following contextual setting. Suppose that Z = X Y, and that for all t, zt µ = (xt, yt µ) and zν = (xt, yt ν) have the same marginal distribution for xt, i.e. µt(x, y) = µt(y | x)ω(x) and νt(x, y) = νt(y | x)ω(x) for some ω (X). Define µ(y | x) = 1

n Pn t=1 µt(y | x) and ν(y | x) = 1

n Pn t=1 νt(y | x), and let w (y | x) = µ(y|x)

ν(y|x). Then, given a class of weight functions W with w W, where each w W has the form w(y | x) and w B, the estimator in Eq. (54) ensures that

Ex ω,y ν( |x)[| bw(y | x) w (y | x)|] 10B

log(|W|δ 1)

Proof of Theorem K.2. Define V (w) = Eµ[log(w)] Eν[w] and b V (w) = b Eµ[log(w)] b Eν[w], and note that w = µ ν = arg maxw V (w). We begin by performing concentration on the log-loss terms. Define Xt(w) = 1 2 log(w (zt µ)) log(w(zt µ)) . By Lemma C.2 and a union bound, we have that with probability at least 1 δ, for all w W,

t=1 log Eµt exp 1

2 log(w/w ) 1

b Eµ[log(w )] b Eµ[log(w)] + log(|W|δ 1)

Note that Eµt exp 1

2 log(w/w ) = Eµt hp

w/w i . We now state and prove a basic technical lemma.

Scalable Online Exploration via Coverability

Lemma K.4. For all x > 0, log(x) 1 x.

Proof of Lemma K.4. Let f(x) = log(x). Since f is convex, we have that for all x > 0,

f(x) f(1) + f (1)(x 1).

Noting that f(1) = 0 and f (1) = 1, the result is established.

Using Lemma K.4, we have

t=1 log Eµt exp 1

2 log(w/w ) 1 1

w/w i = 1 Eν h

where the last line uses that Eµ hp

w/w i = Eν h w p

w w . By direct calculation, we have that

D2 H,ν(w, w ) = Eν

w 2 = Eν[w ] + Eν[w] 2 Eν h

w w i = 1 + Eν[w] 2 Eν h

2D2 H,ν(w, w ) + 1

2(1 Eν[w]).

Specializing to bw, we have

1 2D2 H,ν( bw, w ) + 1

2(1 Eν[ bw]) 1

b Eµ[log(w )] b Eµ[log( bw)] + log(|W|δ 1)

b V (w ) b V ( bw) + 1

b Eν[w ] b Eν[ bw] + log(|W|δ 1)

Since b V (w ) b V ( bw) 0, rearranging gives

D2 H,ν( bw, w ) b Eν[w ] b Eν[ bw] 1 b Eν[ bw] + 2 log(|W|δ 1)

= b Eν[w ] b Eν[ bw] (Eν[w ] Eν[ bw]) + 2 log(|W|δ 1)

Using Lemma C.3, we have that for all η 1/2B, with probability at least 1 δ, for all w W

b Eν[w ] b Eν[w] (Eν[w ] Eν[w]) η

t=1 Eνt (w w )2 + log(|W|δ 1)

= η Eν (w w )2 + log(|W|δ 1)

We further observe that

Eν (w w )2 = Eν h ( w

w )2i 4B Eν h ( w

w )2i = 4B D2 H,ν(w, w ),

so choosing η = 1 8B gives

b Eν[w ] b Eν[w] (Eν[w ] Eν[w]) 1

2D2 H,ν(w, w ) + 8B log(|W|δ 1)

Combining this with Eq. (55), we conclude that

D2 H,ν( bw, w ) 1

2D2 H,ν( bw, w ) + 8B log(|W|δ 1)

n + 2log(|W|δ 1)

Scalable Online Exploration via Coverability

which implies that

D2 H,ν( bw, w ) 20B log(|W|δ 1)

after simplifying.

K.4. Policy Optimization Subroutines

K.4.1. POLICY SEARCH BY DYNAMIC PROGRAMMING (PSDP)

Algorithm 6 PSDPh(p1:h, r1:h; ϵ, δ, Q1:h): Policy Search by Dynamic Programming (cf. Bagnell et al. (2003))

Target layer h [H], policy covers p1:h, reward functions r1:h.

Accuracy parameters ϵ, δ [0, 1].

Function classes Q1:h.

2: Let n = npsdp(ϵ, δ) := c H2|A| log(|Q|Hδ 1)

ϵ2 for a sufficiently large numerical constant c > 0.

3: for ℓ= h, . . . , 1 do

5: for npsdp times do

6: Sample π pℓ.

7: Sample (xℓ, aℓ, Ph ℓ =ℓrℓ (xℓ , aℓ )) π ℓπunif ℓ+1 bπ.

8: Update dataset: Dℓ Dℓ xℓ, aℓ, Ph ℓ =ℓrℓ (xℓ , aℓ ) .

9: Solve regression: b Qℓ arg min Q Qℓ

(x,a,R) Dℓ (Q(x, a) R)2.

10: Define bπℓ(x) = arg maxa A b Qℓ(x, a).

11: return: Policy bπ.

This section presents self-contained guarantees for Policy Search by Dynamic Programming (PSDP, Algorithm 6) (Bagnell et al., 2003), which performs local policy optimization given access to exploratory distributions p1:h (Πrns). PSDP takes as input an arbitrary reward functions r1:h : X A [0, 1] and a function class Q = Q1:h, where Qℓ {Q : X A [0, 1]}, that can represent certain Q-functions for these rewards.

We prove that with high probability, the output bπ = PSDPh(p1:h, r1:h; ϵ, δ, Q) is an approximate maximizer of the objective

max π Πns E M ,π " h X

ℓ=1 rℓ(xℓ, aℓ)

in a local sense with respect to p1:h (cf. Eq. (64)).

To prove guarantees for PSDP, we make use of the following realizability assumption for the class Q = Q1:h.

Definition K.1. We say that function classes Q1:h, where Qℓ {Q : X A R+} for ℓ [h], realize the reward functions r1:h : X A R+ if for all t [h] and all π Πns,

Q M ,π ℓ ( , ; r) Qℓ, where Q M ,π ℓ (x, a; r) := E M ,π " h X

ℓ =ℓ rℓ (xℓ , aℓ )

xℓ= x, aℓ= a

Scalable Online Exploration via Coverability

Lemma K.5 (Main guarantee for PSDP). For any ϵ, δ (0, 1) and reward function {rℓ}ℓ [h] with Ph ℓ=1 rℓ [0, 1] that is realizable in the sense of Definition K.1, PSDP ensures that with probability at least 1 δ, the output bπ = PSDPh(p1:h, r1:h; ϵ, δ, Q1:h) satisfies

ℓ=1 E M ,pℓ max a A Q M ,bπ ℓ (xℓ, a; r) Q M ,bπ ℓ (xℓ, bπ(xℓ); r) ϵ, (60)

and does so using at most Npsdp(ϵ, δ) = O H3|A| log(|Q|Hδ 1)

ϵ2 episodes.

Proof of Lemma K.5. See the proof of Theorem D.1 in Mhammedi et al. (2023b).

K.5. Proof of Theorem K.1

Theorem K.1 (General guarantee for Algorithm 3). Let ε (0, 1/2) and δ (0, e 1) be given, and suppose that Assumption K.1 and Assumption K.2 are satisfied. Then Algorithm 3 produces policy covers p1, . . . , p H (Πrns) such that with probability at least 1 δ, for all h [H],

Ψ M push;h,ε(ph) 170H log(ε 1) C M push,

and does so using at most

N e O H|A| log(|W|δ 1)

ε Nopt(cε2, δ/2HT)

episodes, where c > 0 is a sufficiently small absolute constant.

Proof of Theorem K.1. To keep notation compact, throughout this section we abbreviate dπ h d M ,π h , Ph( | ) P M h ( | ), Eπ[ ] E M,π[ ], and so on when the MDP is clear from context.

w t h(xh | xh 1, ah 1) = Ph 1(xh | xh 1, ah 1) P

i<t dπh,i h (xh) + Ph 1(xh | xh 1, ah 1) (61)

We define two notions of estimation error for the weight function estimates produced by the subroutine Estimate Weight Function (Algorithm 4):

(ε t w,off;h)2 = Eph 1 " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

(ε t w,on;h)2 = 1 t 1

i<t Eπh,i h 1πunif " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

We define a notion of local suboptimality for the policies πh,t produced by the subroutine Policy Optimization as follows:

ε t opt;h =

ℓ=1 Epℓ max a A Qπh,t ℓ (xℓ, a; bw t h) Qπh,t ℓ (xℓ, π h,t(xℓ); bw t h) . (64)

Lemma K.6. Let ε (0, 1/2) and set T = 1 ε. Suppose that for all h 2 and t [T], it holds that εt w,off;h c1(CM push/|A|t)1/2ε1/2, εt w,on;h c2(CM push/|A|t)1/2ε1/2, and εt opt;h c3ε2 for absolute constants c1, c2, c3 > 0. Then for all h 2, we have that

Ψ M push;h,ε(ph) 170H log(ε 1) C M push.

Scalable Online Exploration via Coverability

Let Nweight(t, ϵ, δ) denote the number of episodes used by Estimate Weight Function to ensure that (εt w,off;h)2, (εt w,on;h)2 ϵ2/t with probability at least 1 δ when invoked at iteration t [T] for layer h 2, and let Nopt(ϵ, δ) be the number of trajectories used by Policy Optimization to ensures that εt opt;h ϵ with probability at least 1 δ when invoked at iteration t [T] for layer h 2. It follows from Lemma K.6 that with the parameter settings in Algorithm 3, we are guaranteed that with probability at least 1 δ, for all h [H]

Ψ M push;h,ε(ph) 170H log(ε 1) C M push.

and the total number of episodes used is at most

N HT(Nweight(T, ϵw, δw) + Nopt(ϵopt, δopt))

HT Nweight(T, c(C M push/|A|)1/2ε1/2, δ/2HT) + Nopt(c ε2, δ/2HT) .

for absolute constants c, c > 0. It remains to bound Nweight(ϵ, δ), for which we appeal to the following lemma, a corollary of Theorem K.2. Lemma K.7. Let h 2 and t [T] be given. For any ϵ, δ (0, 1), distribution ph 1 (Πrns) and πh,1, . . . , πh,t 1 Πns, Estimate Weight Function ensures that with probability at least 1 δ, the output bwt h Estimate Weight Functionh,t(ph 1, {πh,i}i<t; ϵ, δ, W) satisfies

(ε t w,off;h)2 = E M ,ph 1 " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

ϵ2/t, and (65)

(ε t w,on;h)2 = 1 t 1

i<t E M ,πh,i h 1πunif " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

and does so using at most Nweight(t, ϵ, δ) = 80t log(|W|δ 1)

ϵ2 episodes.

Appealing to this result, we conclude that the total number of episodes is at most

N e O H|A| log(|W|δ 1)

ε Nopt(cε2, δ/2HT) .

K.5.1. PROOF OF LEMMA K.6 (OUTER-LEVEL ANALYSIS)

Extended MDP. Let α 1 be a parameter to the proof, whose value will be chosen at the end as a function of ε. Following Mhammedi et al. (2023b;a), we define an extended MDP M by augmenting A = A {t} and X = X {t}. M has identical dynamics to M , except that taking action t causes the state to transition to t deterministically; t is a self-looping terminal state.

We define ΠRNS as the set of all randomized non-stationary policies from X to A. For π ΠRNS, we abbreviate dπ h d

M,π h , P h( | ) P M h ( | ), P π[ ] PM,π[ ], E π[ ] E

M,π[ ], and so on. We adopt the convention that all policies in ΠRNS select action t in the terminal state.

Truncated benchmark policy class. Given the policy class Π = Πns, we inductively define a sequence of policy classes Πα,1, . . . , Πα,H based on the extended MDP and the output p1, . . . , p H of the algorithm as follows.

First, Πα,0 = Π.

Next, for h = 1, . . . , H, we construct Πα,h from Πα,h 1. For each π Πα,h 1, add a policy π to Πα,h defined as follows: For all h = h, π h = πh , and for layer h,

πh(x), dπ h(x) d ph h (x) α,

t, dπ h(x) d ph h (x) > α. (67)

Finally, we adopt the shorthand Πα := Πα,H.

Scalable Online Exploration via Coverability

Main analysis. Let h 2 be fixed. For the remainder of the proof, we abbreviate πt πt,h to keep notation compact. Define

Ψpush;h,ε(p) = sup π Πα

E P h 1(xh | xh 1, ah 1) dp h(xh) + ε P h 1(xh | xh 1, ah 1)Ixh =t

as a counterpart to the pushforward coverage relaxation for the extended MDP M. We define three quantities,

t=1 sup π Πα

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 Ixh =t

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 Ixh =t

t=1 sup π Πα

E π[ bw t h(xh | xh 1, ah 1)Ixh =t] E πt [ bw t h(xh | xh 1, ah 1)Ixh =t].

which measure the quality of the weight function estimates produced by Estimate Weight Function and the optimization quality of the policies produced by Policy Optimization in the extended MDP.

We now state three technical lemmas, all proven in the sequel. The first lemma shows that we can bound the value of Ψpush;h,ε(ph) in terms of the pushforward coverability parameter CM push;h, up to additive error terms depending on the quantities defined above. Lemma K.8. Let h 2 be fixed and ε (0, 1/2). Then, with T = 1

ε, it holds that

Ψpush;h,ε(ph) 72C M push;h log(T) + 2 w,off;h + 6 w,on;h + 3 opt;h. (69)

The next two lemmas relate the weight estimation and policy optimization errors in M to their counterparts in the true MDP M , leveraging key properties of the truncated policy class Πα. Lemma K.9. The following bounds hold for all h 2, as long as w 1 for all w Wh:

w,off;h α|A|

t=1 (ε t w,off;h)2, and (70)

v u u t8|A|CM push log(T)

t=1 (t 1)(εt w,on;h)2 + 4C M push. (71)

Lemma K.10. The following bound holds for all h 2:

t=1 ε t opt;h. (72)

Appealing to Lemmas K.8 to K.10, we conclude that as long as εt w,off;h c1(α|A|t/CM push) 1/2, εt w,on;h c2(|A|Tt/CM push) 1/2, and εt opt;h c3(αT) 1 for all h 2, t [T], where c1, c2, c3 > 0 are absolute constants, we are guaranteed that for all h,

Ψpush;h,ε(ph) 85C M push log(T). (73)

It remains to translate this back to a bound on the L1-Coverage objective for the true MDP M . To do so, we start with the following technical lemma, also proven in the sequel. Lemma K.11. Consider any reward function {rh}h [H] with rh : X A [0, 1] such that PH h=1 rh(xh, ah) [0, 1] for all sequences (x1, a1), . . . , (x H, a H), and such that rh(t, a) = 0 and rh(x, t) = 0. It holds that

sup π Π Eπ " H X

h=1 sup π Πα

P π dπ h(xh) dph h (xh) > α, xh = t .

Scalable Online Exploration via Coverability

Let h 2 be fixed and define a reward function {rℓ}ℓ h 1 with rℓ: X A [0, ε 1] via

rh 1(x, a) = E Ph 1(xh | xh 1, ah 1) dph h (xh) + ε Ph 1(xh | xh 1, ah 1) | xh 1 = x, ah 1 = a Ix,a =t

and rℓ= 0 for ℓ< h 1. Using Lemma K.11, we have that

Ψ M push;h,ε(ph) = sup π Π

ℓ=1 sup π Πα

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t . (74)

To bound the right-hand side, we first note that

E π E Ph 1(xh | xh 1, ah 1) dph h (xh) + ε Ph 1(xh | xh 1, ah 1) | xh 1, ah 1

Ixh 1,ah 1 =t

E P h 1(xh | xh 1, ah 1) dph h (xh) + ε P h 1(xh | xh 1, ah 1) | xh 1, ah 1

Ixh 1,ah 1 =t

E π P h 1(xh | xh 1, ah 1) dph h (xh) + ε P h 1(xh | xh 1, ah 1)Ixh 1,ah 1 =t

E π P h 1(xh | xh 1, ah 1) dph h (xh) + ε P h 1(xh | xh 1, ah 1)Ixh =t

= Ψpush;h,ε(ph). (75)

Here, the second equality uses that i) M and M have identical transition dynamics whenever xh 1, ah 1t and ii) policies in the support of ph never take the terminal action. Meanwhile, the second-to-last inequality uses that xh = t if and only if xh 1 = t and ah 1 = t in M. To bound the second term on the right-hand side of Eq. (74), we use a variant of Proposition 4.2. Lemma K.12. For all α > 0 and ℓ 1, it holds that

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t 2

E π P ℓ 1(xℓ| xℓ 1, aℓ 1) dpℓ ℓ(xℓ) + α 1 P ℓ 1(xℓ| xℓ 1, aℓ 1)Ixℓ =t

We set α = ε 1, so that combining Eq. (74) with Eq. (75) and Lemma K.12 yields

Ψ M push;h,ε(ph) Ψpush;h,ε(ph) + 2

Ψpush;ℓ,ε(pℓ). (77)

Consequently, for the choice T = ε 1 and α = ε 1, Eq. (73) and Eq. (77) imply that for all h 2,

Ψ M push;h,ε(ph) 170H log(ε 1) C M push.

The final approximation requirements for these choices are εt w,off;h c1(CM push/|A|t)1/2ε1/2, εt w,on;h c2(CM push/|A|t)1/2ε1/2, and εt opt;h c3ε2 for absolute constants c1, c2, c3 > 0.

K.5.2. PROOFS FOR SUPPORTING LEMMAS FOR LEMMA K.6

Lemma K.8. Let h 2 be fixed and ε (0, 1/2). Then, with T = 1

ε, it holds that

Ψpush;h,ε(ph) 72C M push;h log(T) + 2 w,off;h + 6 w,on;h + 3 opt;h. (69)

Proof of Lemma K.8. This is a slightly modified variant of the proof of Theorem 4.2. Let edt h = P

i<t dπi h and ˇdt h = P

i<t dπi h . Observe that for T = 1

Ψpush;h,ε(ph) = sup π Πα

E P h 1(xh | xh 1, ah 1) dph h (xh) + ε P h 1(xh | xh 1, ah 1)Ixh =t

= T sup π Πα

P h 1(xh | xh 1, ah 1) ˇd T +1 h (xh) + P h 1(xh | xh 1, ah 1)Ixh =t

Scalable Online Exploration via Coverability

and hence it suffices to bound the quantity on the right-hand side. Next, note that for all t [T], we have that

P h 1(xh | xh 1, ah 1) ˇd t+1 h (xh) + P h 1(xh | xh 1, ah 1)Ixh =t

P h 1(xh | xh 1, ah 1) ˇd t+1 h (xh) + P h 1(xh | xh 1, ah 1)Ixh =t

and consequently

P h 1(xh | xh 1, ah 1) ˇd T +1 h (xh) + P h 1(xh | xh 1, ah 1)Ixh =t

t=1 sup π Πα

P h 1(xh | xh 1, ah 1) ˇdt h(xh) + P h 1(xh | xh 1, ah 1)Ixh =t

Now, note that if xh = t, the dynamics of M imply that we must have xh 1, ah 1 = t as well. In this case, we have

P h 1(xh | xh 1, ah 1) ˇdt h(xh) + P h 1(xh | xh 1, ah 1) = w t h(xh | xh 1, ah 1), (78)

since P h 1( | xh 1, ah 1) = Ph 1( | xh 1, ah 1) with xh 1, ah 1 = t, and since dπi h (xh) = dπi h (xh) when xh = t (as the policies π1, . . . , πT never take the terminal action). As a result, using Lemma K.2, we have that

t=1 sup π Πα

P h 1(xh | xh 1, ah 1) ˇdt h(xh) + P h 1(xh | xh 1, ah 1)Ixh =t

t=1 sup π Πα

E π[w t h(xh | xh 1, ah 1)Ixh =t]

t=1 sup π Πα

E π[ bw t h(xh | xh 1, ah 1)Ixh =t]

t=1 sup π Πα

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 Ixh =t

| {z } = w,off;h

Next, we can bound

t=1 sup π Πα

E π[ bw t h(xh | xh 1, ah 1)Ixh =t]

E πt [ bw t h(xh | xh 1, ah 1)Ixh =t] + opt;h

by definition. Applying Lemma K.2 once more, we have that

E πt [ bw t h(xh | xh 1, ah 1)Ixh =t]

E πt [w t h(xh | xh 1, ah 1)Ixh =t] + 2

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 Ixh =t

| {z } = w,on;h

Finally, note that since πt Π never select the terminal action (and in particular never reach the terminal state), we have

E πt [w t h(xh | xh 1, ah 1)Ixh =t] =

E πt" Ph 1(xh | xh 1, ah 1) edt h(xh) + Ph 1(xh | xh 1, ah 1) Ixh =t

t=1 Eπt " Ph 1(xh | xh 1, ah 1) edt h(xh) + Ph 1(xh | xh 1, ah 1)

4C M push;h log(2T),

Scalable Online Exploration via Coverability

where the final bound follows from is Eq. (35). To simplify the constants, we note that log(2T) 2 log(T) whenever ε 1/2.

Lemma K.9. The following bounds hold for all h 2, as long as w 1 for all w Wh:

w,off;h α|A|

t=1 (ε t w,off;h)2, and (70)

v u u t8|A|CM push log(T)

t=1 (t 1)(εt w,on;h)2 + 4C M push. (71)

Proof of Lemma K.9. We first bound the quantity

t=1 sup π Πα

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 Ixh =t

Observe that if xh = t, then the dynamics of the MDP M imply that xh 1 = t, ah 1 = t. Consider an arbitrary policy π Πα. Since π(xh 1) = t, the dynamics in Eq. (67) imply that dπ h 1(xh 1)/ dph 1 h 1 (xh 1) α. Consequently, for any t [T], we can bound

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 Ixh =t

x X:π(x) =t,a A

dπ h 1(x, a) E

bwt h(xh | x, a) q

wt h(xh | x, a) 2 | xh 1 = x, ah 1 = a

x X:π(x) =t

dπ h 1(x) E

bwt h(xh | x, a) q

wt h(xh | x, a) 2 | xh 1 = x, ah 1 = a

x X dph 1 h 1 (x) E

bwt h(xh | x, a) q

wt h(xh | x, a) 2 | xh 1 = x, ah 1 = a

= α|A| E ph 1 h 1πunif " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

= α|A| E ph 1 " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

= α|A| Eph 1 " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

= α|A| (ε t w,off;h)2,

where the second-to-last equality uses that ph 1 h 1 πunif = ph 1 by construction, and the final equality uses that policies in the support of ph 1 never take the optimal action. Summing over t completes the proof.

We now bound the quantity w,on;h. Note that since the policy πt never chooses the terminal action t, we can write

t=1 Eπt " q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

Observe that as a consequence of pushforward coverability, there exists a distribution µ (X) such that dπ h 1(x) CM pushµ(x) for all x X, π Πrns. Hence, by applying Lemma K.3 with gt(x) =

Scalable Online Exploration via Coverability

bwt h(xh | x, πt(x)) p

wt h(xh | x, πt(x)) 2 | xh 1 = x, ah 1 = πt(x) i , which has gt [0, 2] whenever

bwt , wt 1, we have

v u u u t2CM push log(T) X

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 | xh 1, ah 1 = πt(xh 1) i 2

+ 4C M push

v u u t2CM push log(T)

i<t Eπi h 1πt q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 4 + 4C M push

v u u t2CM push|A| log(T)

i<t Eπi h 1πunif q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 4 + 4C M push

v u u t8CM push|A| log(T)

i<t Eπi h 1πunif q

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2 + 4C M push.

This proves the result.

Lemma K.10. The following bound holds for all h 2:

t=1 ε t opt;h. (72)

Proof of Lemma K.10. Let t [T] be fixed, and recall that we can write

E π[ bw t h(xh | xh 1, ah 1)Ixh =t] E πt [ bw t h(xh | xh 1, ah 1)Ixh =t]

E π bw t h(xh | xh 1, ah 1)Ixh 1,ah 1 =t E πt bw t h(xh | xh 1, ah 1)Ixh 1,ah 1 =t ,

since xh = t if and only if xh 1, ah 1 = t. If we define Qπ as the state-action value function for policy π in M under the reward function given by rh 1(x, a) = E[ bwt h(xh | x, a) | xh 1 = x, ah 1 = a]Ix,a =t and rh (x, a) = 0 for h < h 1, the performance difference lemma (Kakade, 2003) implies that

E π bw t h(xh | xh 1, ah 1)Ixh 1,ah 1 =t E πt bw t h(xh | xh 1, ah 1)Ixh 1,ah 1 =t

Qπt ℓ(xℓ, π(xℓ)) Qπt ℓ(xℓ, π t(xℓ)) i .

Consider an arbitrary policy π Πα, and fix ℓ [h 1], and write

Qπt ℓ(xℓ, π(xℓ)) Qπt ℓ(xℓ, π t(xℓ)) i = X

x X dπ ℓ(x)

Qπt ℓ(x, π(x)) Qπt ℓ(x, π t(x))

For any x X, since π Πα, if dπ ℓ(x)/ dpℓ ℓ(x) > α, then π(x) = t, which implies that

Qπt ℓ(x, π(x)) Qπt ℓ(x, π t(x)) Qπt ℓ(x, π t(x)) 0

Scalable Online Exploration via Coverability

since the reward is non-negative, and since we can receive non-zero reward only if π(xℓ) = t for all ℓ h 1. Hence, we can bound

x X dπ ℓ(x)

Qπt ℓ(x, π(x)) Qπt ℓ(x, π t(x)) X

x X dπ ℓ(x)

Qπt ℓ(x, π(x)) Qπt ℓ(x, π t(x)) I dπ ℓ(x)/ dpℓ ℓ(x) α

x X dπ ℓ(x) max a A Qπt ℓ(x, a) Qπt ℓ(x, π t(x)) I dπ ℓ(x)/ dpℓ ℓ(x) α

x X dpℓ ℓ(x) max a A Qπt ℓ(x, a) Qπt ℓ(x, π t(x)) I dπ ℓ(x)/ dpℓ ℓ(x) α

x X dpℓ ℓ(x) max a A Qπt ℓ(x, a) Qπt ℓ(x, π t(x))

= α E pℓ max a A Qπt ℓ(x, a) Qπt ℓ(x, π t(x)) .

Above, the second inequality uses that i) Qπt ℓ(x, a) = 0 for all a A if x = t and ii) π(x) A if x = t but dπ ℓ(x)/ dpℓ ℓ(x) α. Finally, we note that

E pℓ max a A Qπt ℓ(xℓ, a) Qπt ℓ(xℓ, π t(xℓ)) = Epℓ max a A Qπt ℓ(xℓ, a; bw t h) Qπt ℓ(xℓ, π t(xℓ); bw t h) ,

since i) policies in the support of pℓnever take the terminal action, and ii) Qπt ℓ(x, a) = Qπt ℓ(x, a; bwt h) whenever x, a = t, since πt never takes the terminal action.

Lemma K.11. Consider any reward function {rh}h [H] with rh : X A [0, 1] such that PH h=1 rh(xh, ah) [0, 1] for all sequences (x1, a1), . . . , (x H, a H), and such that rh(t, a) = 0 and rh(x, t) = 0. It holds that

sup π Π Eπ " H X

h=1 sup π Πα

P π dπ h(xh) dph h (xh) > α, xh = t .

Proof of Lemma K.11. Since policies π Π never take the terminal action, we can write

sup π Π Eπ " H X

= sup π Πα,0

sup π Πα,ℓ 1

by telescoping. Fix ℓ [H]. Let π Πα,ℓ 1 be arbitrary, and let π Πα,ℓdenote the policy with π h = πh for h = ℓ, and with

πℓ(x), dπ ℓ(x) d pℓ ℓ(x) α,

t, dπ ℓ(x) d pℓ ℓ(x) > α. (79)

Let Q π h(x, a) := E πh PH h =h rh | xh = x, ah = a i denote the Q-function for {rh} in M. Then by the performance

Scalable Online Exploration via Coverability

difference lemma (Kakade, 2003), we have

= E π " H X

h (xh, π(xh)) Q π

h (xh, π (xh))

ℓ(xℓ, π(xℓ)) Q π

ℓ(xℓ, π (xℓ)) i ,

since the policies agree unless h = ℓ. Since Q π

ℓ [0, 1] by the normalization assumption on the rewards and Q π

ℓ(t, a) = 0 for all a A, we have

ℓ(xℓ, π(xℓ)) Q π

ℓ(xℓ, π (xℓ)) i P π[π (xℓ) = π(xℓ), xℓ = t]

= P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t ,

where the final equality follows from Eq. (79). Since this result holds uniformly for all π Πα,ℓ 1, we conclude that

sup π Πα,ℓ 1

sup π Πα,ℓ 1

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t ,

where the final equality uses that every policy in Πα,ℓ 1 has a counterpart in Πα = Πα,H that takes identical actions for layers 1, . . . , ℓ 1.

Lemma K.12. For all α > 0 and ℓ 1, it holds that

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t 2

E π P ℓ 1(xℓ| xℓ 1, aℓ 1) dpℓ ℓ(xℓ) + α 1 P ℓ 1(xℓ| xℓ 1, aℓ 1)Ixℓ =t

Proof of Lemma K.12. We follow a proof similar to Proposition 4.2. For any ℓ, we can write

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t

P π 2 dπ ℓ(xℓ) dpℓ ℓ(xℓ) + α 1 dπ ℓ(xℓ) > α, xℓ = t

E π dπ ℓ(xℓ) dpℓ ℓ(xℓ) + α 1 dπ ℓ(xℓ)Ixℓ =t

( dπ ℓ(x))2 dpℓ ℓ(x) + α 1 dπ ℓ(x)Ix =t.

By Lemma H.1, the function

d 7 (d)2 dpℓ ℓ(x) + α 1 d

is convex for all x. Hence, writing dπ ℓ(x)Ix =t = E π

P ℓ 1(x | xℓ 1, aℓ 1)Ix =t , Jensen s inequality implies that for all x,

( dπ ℓ(x))2 dpℓ ℓ(x) + α 1 dπ ℓ(x)Ix =t E π (P ℓ 1(x | xℓ 1, aℓ 1))2

dpℓ ℓ(x) + α 1 P ℓ 1(x | xℓ 1, aℓ 1)Ix =t

Scalable Online Exploration via Coverability

We conclude that

P π dπ ℓ(xℓ) dpℓ ℓ(xℓ) > α, xℓ = t 2

E π P ℓ 1(xℓ| xℓ 1, aℓ 1) dpℓ ℓ(xℓ) + α 1 P ℓ 1(xℓ| xℓ 1, aℓ 1)Ixℓ =t

K.5.3. PROOF OF LEMMA K.7 (GUARANTEE FOR Estimate Weight Function)

Let wt h := t wt h and let ˇwt h := t bwt h. Observe that solving the optimization problem in Line 10 of Algorithm 4 is equivalent to solving the optimization problem in Eq. (54) over the class t Wh, which has w t for all w Wh. As such, we can appeal to Theorem K.2 (in particular, Remark K.2) with

µ(x | x, a) = P M h 1(x | x, a), ν(x | x, a) = 1

i<t d M ,πh,i

h (x ) + P M h 1(x | x, a)

ω(x, a) = 1

d M ,ph 1 h 1 (x, a) + 1 t 1

i<t d M ,πh,i h 1πunif h 1 (x, a)

Under Assumption K.1, we have

µ(x | x, a) ν(x | x, a) = w t h(x | x, a) t Wh,

so Theorem K.2 and Remark K.2 imply that

E(xh 1,ah 1) ω

ˇwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

20t log(|W|δ 1)

tn = 20 log(|W|δ 1)

or equivalently,

E(xh 1,ah 1) ω

bwt h(xh | xh 1, ah 1) q

wt h(xh | xh 1, ah 1) 2#

20 log(|W|δ 1)

From the definition of ω, it follows that setting n = nweight(ϵ, δ) = 40 log(|W|δ 1)

ϵ2 is sufficient to achieve the desired bound. The total number of episodes is at most 2t nweight(ϵ, δ).