# Efficient Model-Free Exploration in Low-Rank MDPs

Zakaria Mhammedi (MIT) mhammedi@mit.edu · Adam Block (MIT) ablock@mit.edu · Dylan J. Foster (Microsoft Research) dylanfoster@microsoft.com · Alexander Rakhlin (MIT) rakhlin@mit.edu

A major challenge in reinforcement learning is to develop practical, sample-efficient algorithms for exploration in high-dimensional domains where generalization and function approximation are required. Low-Rank Markov Decision Processes, in which transition probabilities admit a low-rank factorization based on an unknown feature embedding, offer a simple yet expressive framework for RL with function approximation, but existing algorithms either (1) are computationally intractable, or (2) require restrictive statistical assumptions such as latent variable structure or access to model-based function approximation. In this work, we propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs that is both computationally efficient and model-free, allowing for general function approximation while requiring no structural assumptions beyond a reachability condition that we show is substantially weaker than that assumed in prior work. Our algorithm, SpanRL, uses the notion of a barycentric spanner for the feature embedding as an efficiently computable basis for exploration, performing efficient spanner computation by interleaving representation learning and policy optimization subroutines. Our analysis, which is appealingly simple and modular, carefully combines several techniques, including a new approach to error-tolerant barycentric spanner computation, and a new analysis of a certain minimax representation learning objective found in prior work.

1 Introduction

In reinforcement learning and control, many of the most promising application domains require the agent to navigate complex, high-dimensional state and action spaces, where generalization and function approximation are necessary. The last decade has witnessed impressive empirical success in domains where data are abundant [34, 38, 26, 28, 27], but when data are limited, ensuring efficient exploration in large domains is a major research question. For statistical efficiency, the foundations have recently begun to take shape, with a line of research providing structural conditions that facilitate sample-efficient exploration, as well as fundamental limits [37, 21, 39, 42, 14, 24, 17, 18]. Computational efficiency, however, remains a major challenge: outside of simple settings [7, 23], existing algorithms with provable sample complexity guarantees are computationally inefficient, and typically require solving intractable non-convex optimization problems [21, 11, 24, 10]. The prospect of developing practical algorithms for exploration in high-dimensional state spaces that are both computationally and statistically efficient raises three fundamental questions:

1. What are the right computational primitives for exploration? That is, how can one efficiently represent and compute exploratory policies that allow the learner to explore the state space and gather useful data?
2. How should one leverage function approximation (for example, via representation learning) to discover such primitives in a computationally and statistically efficient fashion?
3. Given answers to the first two questions, how can one efficiently interleave function approximation and exploration to provide provably efficient algorithms?
In this paper, we investigate these questions through the Low-Rank MDP model [36, 45, 1]. In a Low-Rank MDP, the state space is large and potentially continuous, but the transition probabilities admit an (unknown) low-rank factorization. Concretely, for a finite-horizon Low-Rank MDP with horizon $H$, the transition densities for layer $h \in [H]$ satisfy

$$T_h(x_{h+1} \mid x_h, a_h) = \mu^\star_{h+1}(x_{h+1})^\top \phi^\star_h(x_h, a_h), \tag{1}$$

where $\phi^\star_h(\cdot,\cdot) \in \mathbb{R}^d$ and $\mu^\star_{h+1}(\cdot) \in \mathbb{R}^d$ are state-action and next-state embeddings. The low-rank structure in (1) facilitates tractable exploration: if the embedding $\phi^\star_h$ is known to the learner, one can efficiently learn a near-optimal policy with sample complexity polynomial in the feature dimension $d$, and independent of the size of the state space [23]; in this regard, $\phi^\star_h$ can be thought of as a low-dimensional representation that enables sample-efficient RL. Following Agarwal et al. [1], we consider the challenging setting in which both $\phi^\star_h$ and $\mu^\star_{h+1}$ are unknown to the learner. This formulation generalizes well-known frameworks such as the Block MDP (BMDP) model [12, 32], and necessitates the use of representation learning: the agent must learn an embedding that approximates $\phi^\star_h$ as it explores the environment, and must use this learned embedding to drive subsequent exploration. This form of function approximation allows for great flexibility, as $\phi^\star_h$ can be an arbitrary, nonlinear function of the state; in practice, it is common to model $\phi^\star_h$ as a neural net [49].

The Low-Rank MDP is perhaps the simplest MDP structure that demands systematic exploration and nonlinear function approximation while allowing for a continuum of states, yet understanding of efficient algorithm design for this model is surprisingly limited. Existing algorithms suffer from at least one of the following drawbacks:

1. Computational intractability [21, 24, 14, 9, 43].
2. Strong modeling assumptions (e.g., the ability to model $\mu^\star_{h+1}(\cdot)$, which facilitates application of model-based RL techniques) [1, 40, 10]; in this work, we aim for model-free methods that only require learning $\phi^\star_h$.
3. Restrictive structural assumptions (e.g., non-negativity or latent variable structure for the embeddings in (1)) [35, 49].

At the root of these limitations is the complex interplay between exploration and representation learning: the agent must learn a high-quality representation to guide exploration of the state space, but learning such a representation requires gathering diverse and informative data, which is difficult to acquire without having already explored the state space to begin with. Overcoming this challenge, particularly where computational efficiency is concerned, requires (1) representation learning procedures that lead to sufficiently expressive representations for downstream applications, (2) efficient exploration procedures that are robust to errors in learned representations, and (3) understanding the interaction between these procedures, which must be interleaved. In this work, we propose an algorithm that addresses each of these challenges, as detailed below.

Contributions. We provide the first provably computationally efficient and model-free algorithm for general Low-Rank MDPs. Our algorithm, SpanRL, uses the notion of a barycentric spanner [6] for the embedding $\phi^\star_h$ as an efficiently computable basis for exploration, and combines this with a minimax representation learning approach [35, 49].
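To make the factorization in (1) concrete, the following minimal sketch (ours, not from the paper) builds a toy finite-state instance in which $\nu$ is the counting measure and every transition kernel has the form $\mu^\star(\cdot)^\top \phi^\star(x,a)$ with $d$-dimensional embeddings. For simplicity the toy embeddings are non-negative (a latent variable instance); the general model in (1) also allows signed embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration; the general model allows continuous states.
n_states, n_actions, d = 6, 3, 2

# Build mu* and phi* so that mu*(x')^T phi*(x, a) is a valid transition kernel:
# phi*(x, a) is a distribution over d latent factors, and each factor has an
# associated next-state distribution (the rows of M).
phi_star = rng.dirichlet(np.ones(d), size=(n_states, n_actions))  # (x, a) -> simplex over [d]
M = rng.dirichlet(np.ones(n_states), size=d)                      # factor  -> next-state dist.
mu_star = M.T                                                     # mu*(x') = M[:, x']

def transition(x: int, a: int) -> np.ndarray:
    """T_h(. | x, a) = mu*(.)^T phi*(x, a), as in Eq. (1)."""
    return mu_star @ phi_star[x, a]  # vector of length n_states

# Sanity checks: every T(. | x, a) is a probability distribution, and the stacked
# transition matrix over state-action pairs has rank at most d.
T = np.stack([[transition(x, a) for a in range(n_actions)] for x in range(n_states)])
assert np.allclose(T.sum(axis=-1), 1.0) and np.all(T >= 0)
assert np.linalg.matrix_rank(T.reshape(-1, n_states)) <= d

# Sampling a next state given (x, a):
x_next = rng.choice(n_states, p=transition(x=0, a=1))
```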
SpanRL interleaves exploration with representation learning in a layer-wise fashion, learning a new representation at each layer $h$ using exploratory data gathered at previous layers, then uses this representation to facilitate computation of a collection of exploratory policies (a policy cover), which act as an approximate barycentric spanner for the features at layer $h+1$, ensuring good coverage for subsequent iterations. SpanRL is simple and modular, and its analysis is surprisingly compact given the greater generality compared to prior work [49, 35, 31]. SpanRL can accommodate general-purpose function approximation to learn the representation $\phi^\star$ (e.g., neural nets or other flexible classes) whenever a certain minimax representation learning objective [35, 49] can be solved efficiently for the function class of interest.

Compared to efficient algorithms from prior work, SpanRL: (1) is model-free (i.e., only requires access to a function class $\Phi$ capable of modeling $\phi^\star$, and does not need to model $\mu^\star_{h+1}$), and (2) applies to general Low-Rank MDPs, replacing strong additional assumptions such as non-negativity of the feature embeddings (so-called latent variable structure) or block structure (see Table 1) with a reachability assumption that we show is substantially weaker than that assumed in prior work (see Appendix H). As a secondary benefit, the algorithm is reward-free.

Table 1: Comparison of the sample complexity required to learn an ε-optimal policy. $\Phi$ denotes the feature class, and $\Upsilon$ denotes an additional feature class capturing model-based function approximation. For approaches that require non-negative (latent variable) structure, $d_{\mathrm{LV}}$ [resp. $\gamma$] denotes the latent variable dimension [resp. the reachability parameter in the latent representation], and for BMDPs, $|S|$ denotes the size of the latent state space. For SpanRL, $\eta$ denotes the reachability parameter.

| Algorithm | Comp. efficient | Model-free | General low-rank | Sample comp. |
|---|---|---|---|---|
| OLIVE [21] (see also [24, 14, 9, 43]) | ✗ | ✓ | ✓ | $d^3 A H^5 \log\lvert\Phi\rvert$ |
| FLAMBE [1]¹ | ✓ | ✗ | ✓ | $d^7 A^9 H^{22} \log(\lvert\Phi\rvert\lvert\Upsilon\rvert)/\varepsilon^{10}$ |
| Rep-UCB [40] (see also [10]) | ✓ | ✗ | ✓ | $d^4 A^2 H^5 \log(\lvert\Phi\rvert\lvert\Upsilon\rvert)$ |
| MOFFLE [35]² | ✓ | ✓ | Non-negative / latent variable | $d_{\mathrm{LV}}^{19} A^{32} H^{19} \log\lvert\Phi\rvert$ |
| BRIEE [49] | ✓ | ✓ | Block MDP | $\lvert S\rvert^{8} A^{14} H^{9} \log\lvert\Phi\rvert$ |
| SpanRL (this paper) | ✓ | ✓ | ✓ | $A^4 d^9 H^4 (d + \log\lvert\Phi\rvert)/\eta^2$ |

¹For the stated sample complexity, FLAMBE requires access to a sampling oracle for the learned model. Without this oracle, the results require additional latent variable structure and a reachability assumption.
²We compare to the variant of MOFFLE that uses the same representation learning objective we consider. Other variants have improved sample complexity, but make use of stronger oracles.

Our analysis carefully combines several new techniques, including (1) an error-tolerant variant of the classical barycentric spanner computation algorithm of Awerbuch and Kleinberg [6], and (2) a new analysis of a minimax representation learning objective introduced in Modi et al. [35], Zhang et al. [49], which shows for the first time that this objective can lead to meaningful guarantees in general Low-Rank MDPs without latent variable structure; this increased generality is meaningful, as we show in Appendix H that there is an exponential separation between our guarantees and those that require such a structure.

Organization. Section 2 formally introduces the Low-Rank MDP model and the online reinforcement learning framework we consider. In Section 3, we highlight challenges faced by previous approaches, introduce our main algorithm, SpanRL, show how it overcomes these challenges, and then present its main sample complexity guarantee.

Comparison to arXiv version. After the initial submission of this work, we developed a substantially improved version of the algorithm that removes the reachability assumption at the cost of a larger (but still polynomial) sample complexity guarantee. We have included this algorithm and its analysis in the arXiv version of this paper [30].
2 Problem Setting

2.1 Low-Rank MDP Model

We work in an episodic, finite-horizon reinforcement learning framework, where $H \in \mathbb{N}$ denotes the horizon. A Low-Rank MDP [36, 45, 1] is a tuple $M = (\mathcal{X}, \mathcal{A}, (\phi^\star_h)_{h\in[H]}, (\mu^\star_h)_{h\in[H]}, \rho)$ consisting of a state space $\mathcal{X}$, an action space $\mathcal{A}$ with $|\mathcal{A}| = A$, a distribution over initial states $\rho \in \Delta(\mathcal{X})$, and mappings $\mu^\star_{h+1}: \mathcal{X} \to \mathbb{R}^d$ and $\phi^\star_h: \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$.³ Beginning with $x_1 \sim \rho$, an episode proceeds in $H$ steps, where for each step $h \in [H]$, the state $x_h$ evolves as a function of the agent's action $a_h$ via $x_{h+1} \sim T_h(\cdot \mid x_h, a_h)$, where $T_h$ is a probability transition kernel, which is assumed to factorize based on $\phi^\star_h$ and $\mu^\star_{h+1}$. In detail, we assume that there exists a σ-finite measure $\nu$ on $\mathcal{X}$ such that for all $1 \le h \le H-1$, and for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$, the function $x' \mapsto \mu^\star_{h+1}(x')^\top \phi^\star_h(x,a)$ is a probability density with respect to $\nu$ (i.e., the function is everywhere non-negative and integrates to 1 under $\nu$). For any $\mathcal{X}' \subseteq \mathcal{X}$, the probability that $x_{h+1} \in \mathcal{X}'$ under $x_{h+1} \sim T_h(\cdot \mid x_h, a_h)$ is then assumed to follow the law

$$T_h(\mathcal{X}' \mid x_h, a_h) = \int_{\mathcal{X}'} \mu^\star_{h+1}(x)^\top \phi^\star_h(x_h, a_h)\, d\nu(x). \tag{2}$$

For notational compactness, we assume (following, e.g., Jiang et al. [21]) that the MDP $M$ is layered, so that $\mathcal{X} = \mathcal{X}_1 \cup \cdots \cup \mathcal{X}_H$ with $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for all $i \neq j$, where $\mathcal{X}_h \subseteq \mathcal{X}$ is the subset of states in $\mathcal{X}$ that are reachable at layer $h \in [H]$. This can be seen to hold without loss of generality (modulo dependence on $H$), by augmenting the state space to include the layer index.

³We emphasize that neither $\mu^\star_h$ nor $\phi^\star_h$ is known to the agent, in contrast to the linear MDP setting [44, 23].

Remark 2.1 (Comparison to previous formulations). Our formulation, in which the transition dynamics (2) are stated with respect to a base measure $\nu$, is a rigorous generalization of Low-Rank MDP formulations found in previous works [23, 1], which tend to implicitly assume the state space is countable and avoid rigorously defining integrals. We adopt this more general formulation to emphasize the applicability of our results to continuous domains. However, in the special case where the state space is countable, choosing $\nu$ as the counting measure yields $T_h(\mathcal{X}' \mid x_h, a_h) = \sum_{x \in \mathcal{X}'} \mu^\star_{h+1}(x)^\top \phi^\star_h(x_h, a_h)$, which is consistent with prior work.

Policies and occupancy measures. We define $\Pi_M = \{\pi: \mathcal{X} \to \Delta(\mathcal{A})\}$ as the set of all randomized, Markovian policies. For a policy $\pi \in \Pi_M$, we let $\mathbb{P}^\pi$ denote the law of $(x_1,a_1),\dots,(x_H,a_H)$ under $a_h \sim \pi(x_h)$, and let $\mathbb{E}^\pi$ denote the corresponding expectation. For any $\mathcal{X}' \subseteq \mathcal{X}_h$, we let $\mathbb{P}^\pi_h[\mathcal{X}'] = \mathbb{P}^\pi[x_h \in \mathcal{X}']$ denote the marginal law of $x_h$ under $\pi$. For $x \in \mathcal{X}_h$, we define the occupancy measure $d^\pi(x) = \frac{d\mathbb{P}^\pi_h}{d\nu}(x)$ as the density of $\mathbb{P}^\pi_h$ with respect to $\nu$.

2.2 Online Reinforcement Learning and Reward-Free Exploration

We consider a standard online reinforcement learning framework in which the Low-Rank MDP $M$ is unknown, and the learning agent interacts with it in episodes; at each episode, the agent executes a policy of the form $\pi: \mathcal{X} \to \Delta(\mathcal{A})$ and observes the resulting trajectory $(x_1,a_1),\dots,(x_H,a_H)$.
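For intuition, in the countable-state case ($\nu$ the counting measure) the occupancy measures defined above satisfy the forward recursion $d^{\pi}_{h+1}(x') = \sum_{x,a} d^{\pi}_h(x)\,\pi(a \mid x)\,T_h(x' \mid x, a)$. The sketch below (our own illustration; it uses a single layer-independent kernel rather than the layered state space of the formal setup) computes $d^\pi$ this way.

```python
import numpy as np

def occupancy_measures(T, policy, rho, H):
    """
    Forward recursion for the occupancy measures d^pi_h over a finite state space.

    T      : (n_states, n_actions, n_states) array, T[x, a, x'] = T(x' | x, a).
    policy : (H, n_states, n_actions) array, policy[h, x, a] = pi(a | x) at layer h.
    rho    : (n_states,) initial state distribution.
    Returns a (H, n_states) array with d[h, x] = P^pi[x_h = x].
    """
    d = np.zeros((H, rho.shape[0]))
    d[0] = rho
    for h in range(H - 1):
        sa_occ = d[h][:, None] * policy[h]            # state-action occupancy at layer h
        d[h + 1] = np.einsum("xa,xay->y", sa_occ, T)  # push forward through the kernel
    return d

# Example usage with the uniform policy pi_unif:
rng = np.random.default_rng(1)
n_states, n_actions, H = 6, 3, 4
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
rho = rng.dirichlet(np.ones(n_states))
policy = np.full((H, n_states, n_actions), 1.0 / n_actions)
d = occupancy_measures(T, policy, rho, H)
assert np.allclose(d.sum(axis=1), 1.0)  # each d^pi_h is a probability distribution
```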
While the ultimate goal of reinforcement learning is to optimize a policy with respect to a possibly unknown reward function, here we focus on the problem of reward-free exploration, which entails learning a collection of policies that almost optimally covers the state space, and can be used to efficiently optimize any downstream reward function [12, 33, 15, 31]. To wit, we aim to construct a policy cover, a collection of policies that can reach any state with near-optimal probability.

Definition 2.1 (Policy cover). For $\alpha \in (0,1]$, a subset $\Psi \subseteq \Pi_M$ is an α-policy cover for layer $h$ if

$$\forall x \in \mathcal{X}_h, \quad \max_{\pi \in \Psi} d^\pi(x) \ge \alpha \cdot \max_{\pi' \in \Pi_M} d^{\pi'}(x). \tag{3}$$

We show (Appendix G) that given access to such a policy cover with constant $\alpha$, it is possible to optimize any downstream reward function with polynomial sample complexity.

Assumptions. To facilitate learning a policy cover, we make the following reachability assumption.

Assumption 2.1 (η-reachability). For any $h \in [H]$ and $x \in \mathcal{X}_h$, $\max_{\pi \in \Pi_M} d^\pi(x) \ge \eta \cdot \|\mu^\star_h(x)\|$.

Reachability is necessary if one aims to build a policy cover that satisfies (3) uniformly for all states; without such a condition, one gives up on covering hard-to-reach states. Some notion of reachability is required in essentially all prior work on efficient model-free algorithms for Low-Rank MDPs [32, 35, 5], and was only very recently removed in the (more restrictive) BMDP setting [31, 49].

Remark 2.2 (Comparison to other reachability-like assumptions). Assumption 2.1 generalizes and subsumes all previous reachability-like conditions of which we are aware [33, 46, 1, 35]. Notably, reachability is implied by the notion of feature coverage [5] (used in the context of transfer learning in Low-Rank MDPs), which asserts that $\sup_{\pi \in \Pi_M} \lambda_{\min}\!\left(\mathbb{E}^\pi[\phi^\star_h(x_h,a_h)\phi^\star_h(x_h,a_h)^\top]\right) \ge \eta$ for some $\eta > 0$. It is also implied by explorability [46], which is similar to feature coverage, but involves the first moments of $\phi^\star_h$. Our reachability assumption is also weaker than that used in [1, 35] under the latent variable model, and generalizes that made for BMDPs [33]. See Appendix H for details, as well as an exponential separation between our assumptions and analogous assumptions in [1, 35].

Beyond reachability, we assume (following [1, 35]) for normalization that, for all $h \in [H]$ and $(x,a) \in \mathcal{X}_h \times \mathcal{A}$, $\|\phi^\star_h(x,a)\| \le 1$, and that for all $g: \mathcal{X}_h \to [0,1]$, $\left\|\int_{\mathcal{X}_h} \mu^\star_h(x)\, g(x)\, d\nu(x)\right\| \le \sqrt{d}$.

Function approximation and desiderata. We do not assume that the true features $(\phi^\star_h)_{h\in[H]}$ or the mappings $(\mu^\star_h)_{h\in[H]}$ are known to the learner. To provide sample-efficient learning guarantees, we make use of function approximation as in prior work [3, 35], and assume access to a feature class $\Phi \subseteq \{\phi: \mathcal{X}\times\mathcal{A} \to \mathbb{R}^d\}$ that contains $\phi^\star_h$ for $h \in [H-1]$.

Assumption 2.2 (Realizability). The feature class $\Phi \subseteq \{\phi: \mathcal{X}\times\mathcal{A} \to \mathbb{R}^d\}$ satisfies $\phi^\star_h \in \Phi$ for all $h \in [H]$. Moreover, for all $\phi \in \Phi$, $x \in \mathcal{X}$, and $a \in \mathcal{A}$, it holds that $\|\phi(x,a)\| \le 1$.

The class $\Phi$ may consist of linear functions, neural networks, or other standard models depending on the application, and reflects the learner's prior knowledge of the underlying MDP. We assume that $\Phi$ is finite to simplify presentation, but extension to infinite classes is straightforward, as our results only invoke finiteness through standard uniform convergence arguments. Note that unlike model-based approaches [1, 40, 10, 2], we do not assume access to a class capable of realizing the features $\mu^\star_h$, and our algorithm does not attempt to learn these features; this is why we refer to our results as model-free. For constant $\alpha$, our goal is to learn an α-policy cover using $\mathrm{poly}(d, A, H, \log|\Phi|, \eta^{-1})$ episodes of interaction.
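As a quick illustration of Definition 2.1, the following sketch checks the coverage condition (3) at a single layer in a tabular instance, using a finite set of comparator policies as a stand-in for $\Pi_M$ (an illustration-only simplification; the definition quantifies over all Markovian policies).

```python
import numpy as np

def is_alpha_policy_cover(d_cover, d_comparators, alpha, tol=1e-12):
    """
    Check condition (3) of Definition 2.1 at one layer h.

    d_cover       : (k, n_states) array; row i is the occupancy d^{pi_i} at layer h
                    for the i-th policy in the candidate cover Psi.
    d_comparators : (m, n_states) array of occupancies for comparator policies,
                    standing in for the sup over Pi_M (illustration only).
    alpha         : coverage parameter in (0, 1].
    """
    best_cover = d_cover.max(axis=0)        # max_{pi in Psi}  d^pi(x)
    best_any = d_comparators.max(axis=0)    # max_{pi in Pi_M} d^pi(x)  (proxy)
    return bool(np.all(best_cover + tol >= alpha * best_any))
```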
This guarantee scales with the dimension $d$ of the feature map and the complexity $\log|\Phi|$ of the feature class but, critically, does not depend on the size of the state space $\mathcal{X}$; by [10], dependence on $H$ and $A = |\mathcal{A}|$ is necessary when $\phi^\star$ is unknown. Given such a guarantee, we show in Appendix G how to optimize any downstream reward function to error ε with polynomial sample complexity.

Additional preliminaries. For any $m, n \in \mathbb{N}$, we denote by $[m..n]$ the integer interval $\{m, \dots, n\}$. We also let $[n] = [1..n]$. For any sequence of objects $o_1, o_2, \dots$, we define $o_{m:n} = (o_i)_{i \in [m..n]}$. A partial policy is a policy defined over a contiguous subset of layers $[\ell..r] \subseteq [H]$. We denote by $\Pi^{\ell:r}_M = \{\pi: \bigcup_{h=\ell}^{r} \mathcal{X}_h \to \Delta(\mathcal{A})\}$ the set of all partial policies over layers $\ell$ to $r$; note that $\Pi_M = \Pi^{1:H}_M$. For a policy $\pi \in \Pi^{\ell:r}_M$ and $h \in [\ell..r]$, $\pi(x_h)$ denotes the action distribution for the policy at layer $h$ when $x_h \in \mathcal{X}_h$ is the current state. For $1 \le t \le h \le H$ and any pair of partial policies $\pi \in \Pi^{1:t-1}_M$, $\pi' \in \Pi^{t:h}_M$, we define $\pi \circ_t \pi' \in \Pi^{1:h}_M$ as the partial policy given by $(\pi \circ_t \pi')(x_\ell) = \pi(x_\ell)$ for all $\ell < t$ and $(\pi \circ_t \pi')(x_\ell) = \pi'(x_\ell)$ for all $\ell \in [t..h]$. We use $x_h \sim \pi$ as shorthand to indicate that $x_h$ is drawn from the law $\mathbb{P}^\pi$, and likewise for $(x_h, a_h) \sim \pi$ and so on. For a set of partial policies $\Psi = \{\pi^{(i)}: i \in [N]\}$, we define $\mathrm{unif}(\Psi)$ as the random partial policy obtained by sampling $i \sim \mathrm{unif}([N])$ and playing $\pi^{(i)}$. We define $\pi_{\mathrm{unif}} \in \Pi_M$ as the random policy that selects actions in $\mathcal{A}$ uniformly at random at each layer. We use $\|\cdot\|$ to denote the Euclidean norm, $\|\cdot\|_\infty$ to denote the supremum norm on functions, and let $\mathcal{B}(r) \subseteq \mathbb{R}^d$ denote the Euclidean ball of radius $r$. We refer to a scalar $c > 0$ as an absolute constant to indicate that it is independent of all problem parameters, and use $\widetilde{O}(\cdot)$ to denote a bound that holds up to factors polylogarithmic in parameters appearing in the expression.

3 SpanRL: Algorithm and Main Results

In this section, we present the SpanRL algorithm. We begin by describing challenges in deriving efficient, model-free algorithms using existing approaches (Section 3.1). We then formally describe SpanRL (Section 3.2) and build intuition as to how it is able to overcome these challenges, and finally state our main sample complexity guarantee (Section 3.3).

3.1 Challenges and Related Work

Designing algorithms with provable guarantees in the Low-Rank MDP setting is challenging because of the complicated interplay between representation learning and exploration. Indeed, while there are many efficient algorithms for the so-called linear MDP setting in which the feature maps $(\phi^\star_h)_{h\in[H]}$ are known (removing the need for representation learning) [23, 47, 4, 41], these approaches do not readily generalize to accommodate unknown features. For Low-Rank MDPs, previous algorithms suffer from at least one of the following three drawbacks: (1) the algorithms are computationally inefficient; (2) the algorithms are model-based; or (3) the algorithms place strong assumptions on the MDP that are unlikely to hold in practice. To motivate the SpanRL algorithm, we briefly survey these results, highlighting several key challenges in avoiding these pitfalls. Let us first discuss the issue of computational efficiency.
While there are a number of algorithms, all based on the principle of optimism in the face of uncertainty, that provide tight sample complexity guarantees for Low-Rank MDPs in reward-based [21, 24, 14] and reward-free [9, 43] settings, these algorithms involve intractable optimization problems, and cannot be implemented efficiently even when the learner has access to an optimization oracle for the representation class $\Phi$ [11]. This intractability arises because these algorithms implement optimism via a global approach, in which the algorithm explores at each round by choosing the most optimistic value function in a certain version space of candidate value functions; optimizing over this version space is challenging, as it involves satisfying non-convex constraints with a complicated dependence on the learned representation, and because the constraints are coupled globally across layers $h \in [H]$.

To avoid the intractability of global optimism, several works have restricted attention to a simpler model-based setting. Here, in addition to assuming that the feature maps $(\phi^\star_h)_{h\in[H]}$ are realizable with respect to $\Phi$, one assumes access to a second feature class $\Upsilon$ capable of modeling the mappings $(\mu^\star_h)_{h\in[H]}$; this facilitates direct estimation of the transition probability kernel $T_h(\cdot \mid x, a)$. For the model-based setting, it is possible to efficiently implement certain local forms of optimism [40, 10, 48], as well as certain non-optimistic exploration techniques based on policy covers [1]. Unfortunately, model-based realizability is a restrictive assumption, and falls short of the model-free guarantees we aim for in this work; indeed, in general, one cannot hope to estimate the feature map $\mu^\star_{h+1}$ without sample complexity scaling with the number of states.⁴

When one moves from model-based learning to model-free learning, representation learning becomes substantially more challenging, both for optimistic and non-optimistic approaches. Here, a key challenge is to develop representation learning procedures that are (1) efficient, yet (2) provide meaningful guarantees when the learned features are used downstream for exploration. To our knowledge, the only proposal for a representation learning procedure satisfying both desiderata comes from the work of Modi et al. [35], who introduced a promising minimax representation learning objective (described in detail in the sequel; cf. Algorithm 5), which Zhang et al. [49] subsequently showed to have encouraging empirical performance. However, to provide guarantees for this objective, both works place substantial additional restrictions on the low-rank factorization. In particular, Modi et al. [35] make the so-called latent variable assumption [1], which asserts that $\phi^\star_h$ and $\mu^\star_h$ are non-negative coordinate-wise, and Zhang et al. [49] further restrict to the Block MDP model [12, 32]. Non-negativity is a substantial restriction, as the best non-negative factorization can have exponentially large dimension relative to the best unrestricted factorization [1], even when reachability is assumed (cf. Appendix H.1). The source of this restriction is the problem of quantifying how close a learned representation $\hat\phi$ is to the ground truth $\phi^\star$, which depends strongly on the downstream exploration strategy. In what follows, we show that with the right exploration strategy, this challenge can be ameliorated, but prior to our work it was unclear whether the minimax objective could lead to meaningful guarantees in the absence of non-negativity.
3.2 The SpanRL Algorithm

Our algorithm, SpanRL, is presented in Algorithm 1. The algorithm proceeds by building a policy cover layer-by-layer in an inductive fashion. For each layer $h \ge 2$, SpanRL uses a policy cover $\Psi^{(h)}$ built at a previous iteration within a subroutine, RepLearn (Algorithm 5; deferred to Appendix B), to produce a feature map $\hat\phi^{(h)}$ that approximates $\phi^\star_h$. Using this feature map, the algorithm invokes a second subroutine, RobustSpanner (Algorithm 2 in Appendix B), to produce a collection of policies $\pi_1, \dots, \pi_d$ that act as a barycentric spanner for the feature map, ensuring maximal coverage in a certain sense; given these policies, a new policy cover for layer $h+2$ is formed via $\Psi^{(h+2)} = \{\pi_i \circ_{h+1} \pi_{\mathrm{unif}} : i \in [d]\}$. To invoke the RobustSpanner subroutine, SpanRL makes use of additional subroutines for policy optimization (PSDP; Algorithm 3 in Appendix B) and estimation of certain vector-valued functionals (EstVec; Algorithm 7 in Appendix B). We now describe each component of the algorithm in detail, highlighting how they allow us to overcome the challenges in the prequel.

Barycentric spanners. At the heart of SpanRL is the notion of a barycentric spanner [6], which serves as an efficient basis for exploration. We begin by defining a barycentric spanner for an abstract set.

Definition 3.1 (Awerbuch and Kleinberg [6]). Given a set $\mathcal{W} \subseteq \mathbb{R}^d$ such that $\mathrm{span}(\mathcal{W}) = \mathbb{R}^d$, we say that a set $\{w_1, \dots, w_d\} \subseteq \mathcal{W}$ is a $(C, \varepsilon)$-approximate barycentric spanner for $\mathcal{W}$ if for every $w \in \mathcal{W}$, there exist $\beta_1, \dots, \beta_d \in [-C, C]$ such that $\left\|w - \sum_{i=1}^{d} \beta_i w_i\right\| \le \varepsilon$.⁵

The utility of barycentric spanners for reward-free exploration is highlighted in the following lemma.

Lemma 3.1. Suppose that Assumption 2.1 holds. If $\Psi \subseteq \Pi_M$ is a collection of policies such that $\{\mathbb{E}^\pi[\phi^\star_h(x_h, a_h)] : \pi \in \Psi\} \subseteq \mathbb{R}^d$ is a $(C, \varepsilon)$-approximate barycentric spanner for $\mathcal{W}_h = \{\mathbb{E}^\pi[\phi^\star_h(x_h, a_h)] : \pi \in \Pi_M\}$ with $\varepsilon \le \frac{\eta}{2}$, then $\Psi$ is an α-policy cover for layer $h+1$ with $\alpha = (2dC)^{-1}$.

Lemma 3.1, proven in Appendix F.6, shows that to compute a policy cover for layer $h+1$, it suffices to find a barycentric spanner for the set $\mathcal{W}_h = \{\mathbb{E}^\pi[\phi^\star_h(x_h, a_h)] : \pi \in \Pi_M\} \subseteq \mathbb{R}^d$. Of course, even if $\phi^\star_h$ is known, this observation is only useful if we can compute a spanner without explicitly enumerating over the set $\Pi_M$, since our goal is to develop an efficient algorithm. In what follows, we will show:⁶

1. Using RobustSpanner, a novel adaptation of the classical algorithm of Awerbuch and Kleinberg [6], spanner computation for the set $\{\mathbb{E}^\pi[\phi(x_h, a_h)] : \pi \in \Pi_M\}$ can, for any $\phi \in \Phi$, be performed efficiently whenever, for any $\theta \in \mathcal{B}(1)$, one can (approximately) solve linear optimization problems of the form

$$\arg\max_{\pi \in \Pi_M} \mathbb{E}^\pi\left[\theta^\top \phi(x_h, a_h)\right]. \tag{5}$$

2. Given access to policy covers $\Psi^{(1:h)}$ for layers 1 to $h$, one can efficiently solve the optimization problem in (5) by appealing to the PSDP algorithm for policy optimization (Algorithm 3).

To handle the fact that $\phi^\star_h$ is unknown, Algorithm 1 computes policies $\pi_{1:d}$ that induce a barycentric spanner for the set $\{\mathbb{E}^\pi[\hat\phi^{(h)}(x_h, a_h)] : \pi \in \Pi_M\}$, where $\hat\phi^{(h)} \in \Phi$ is a learned feature map. In what follows, we first give a detailed explanation of the two points above, before showing how to complete the argument by learning a feature map through representation learning.

⁴For example, in the special case of the Block MDP setting [12, 32], model-based realizability entails modeling a certain emission process, which is not required by model-free approaches.
⁵Note that our definition is a slight generalization of [6, Definition 2.1]; the latter is recovered with ε = 0.
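To make Definition 3.1 concrete, the sketch below verifies the $(C,\varepsilon)$-spanner property for a finite set of vectors by fitting the coefficients $\beta$ with least squares. It is a one-sided verification aid only (our own, not a subroutine of SpanRL): if the unconstrained fit happens to satisfy both the residual bound and the $[-C, C]$ box, the property holds; a violation does not certify failure.

```python
import numpy as np

def check_barycentric_spanner(spanner, W, C, eps, tol=1e-12):
    """
    One-sided check of Definition 3.1 for a finite W subset of R^d: for each row w of W,
    fit beta minimizing ||w - spanner^T beta|| and test the residual and box constraints.

    spanner : (d, d) array; rows are the candidate spanner vectors w_1, ..., w_d.
    W       : (m, d) array; rows are the vectors that should be covered.
    """
    S = spanner.T                                       # columns are the spanner vectors
    beta, *_ = np.linalg.lstsq(S, W.T, rcond=None)      # (d, m) coefficient matrix
    residuals = np.linalg.norm(W.T - S @ beta, axis=0)  # (m,) approximation errors
    in_box = np.abs(beta).max(axis=0) <= C + tol
    return bool(np.all((residuals <= eps + tol) & in_box))

# Example: the standard basis is a (1, 0)-spanner for the vertices of the cube in R^2.
cube = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
assert check_barycentric_spanner(np.eye(2), cube, C=1, eps=0.0)
```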
Algorithm 1 SpanRL: Volumetric Exploration and Representation Learning via Barycentric Spanner

Require: Feature class $\Phi$ and parameters $\varepsilon, c > 0$ and $\delta \in (0,1)$.
1: Set $\Psi^{(1)} = \emptyset$, $\Psi^{(2)} = \{\pi_{\mathrm{unif}}\}$.
2: Set $n_{\mathrm{RepLearn}} = c\, \varepsilon^{-2} A^2 d \log(|\Phi|/\delta)$ and $n_{\mathrm{EstVec}} = c\, \varepsilon^{-2} \log(1/\delta)$.
3: Set $n_{\mathrm{PSDP}} = c\, \varepsilon^{-2} A^2 d^3 H^2 (d + \log(|\Phi|/\delta))$.
4: Define $\mathcal{F} = \{f: x \mapsto \max_{a \in \mathcal{A}} \theta^\top \phi(x,a) \mid \theta \in \mathcal{B}(1), \phi \in \Phi\}$.
5: Define $\mathcal{G} = \{g: (x,a) \mapsto \phi(x,a)^\top w \mid \phi \in \Phi, w \in \mathcal{B}(2\sqrt{d})\}$.
6: for $h = 1, \dots, H-2$ do
     /* Learn feature representation for layer h. */
7:   Set $\hat\phi^{(h)} = \mathrm{RepLearn}(h, \mathcal{F}, \Phi, P^{(h)}, n_{\mathrm{RepLearn}})$, with $P^{(h)} = \mathrm{unif}(\Psi^{(h)})$.  // Algorithm 6.
     /* Compute an approximate spanner using learned features. */
8:   For $\theta \in \mathbb{R}^d$ and $(x,a) \in \mathcal{X} \times \mathcal{A}$, define $r_t(x,a;\theta) = \hat\phi^{(h)}(x,a)^\top \theta$ if $t = h$, and $r_t(x,a;\theta) = 0$ otherwise.
9:   For each $t \in [h]$, set $\mathcal{G}_t = \mathcal{G}$ and $P^{(t)} = \mathrm{unif}(\Psi^{(t)})$.
10:  For $\theta \in \mathbb{R}^d$, define $\mathrm{LinOpt}(\theta) = \mathrm{PSDP}(h, r_{1:h}(\cdot,\cdot;\theta), \mathcal{G}_{1:h}, P^{(1:h)}, n_{\mathrm{PSDP}})$.  // Algorithm 3.
11:  For $\pi \in \Pi_M$, define $\mathrm{LinEst}(\pi) = \mathrm{EstVec}(h, \hat\phi^{(h)}, \pi, n_{\mathrm{EstVec}})$.  // Algorithm 7.
12:  Set $(\pi_1, \dots, \pi_d) = \mathrm{RobustSpanner}(\mathrm{LinOpt}(\cdot), \mathrm{LinEst}(\cdot), 2, \varepsilon)$.  // Algorithm 2.
13:  Set $\Psi^{(h+2)} = \{\pi_i \circ_{h+1} \pi_{\mathrm{unif}} : i \in [d]\}$.
14: Return: Policy covers $\Psi^{(1:H)}$.

Barycentric spanner computation via approximate linear optimization. To describe spanner computation in SpanRL, we take a brief detour and consider an abstract approach to barycentric spanner computation, which generalizes our problem. Suppose that we wish to compute a spanner for an implicitly specified set $\mathcal{W} = \{w_z\}_{z \in \mathcal{Z}} \subseteq \mathbb{R}^d$ indexed by an abstract set $\mathcal{Z}$. The set $\mathcal{Z}$ (which will be set to $\Pi_M$ when we return to RL) may be exponentially large and cannot be efficiently enumerated. In addition, given $z \in \mathcal{Z}$, we cannot explicitly compute $w_z$, and have to settle for a noisy approximation. To allow for efficient spanner computation, we assume access to two oracles for the set $\mathcal{W}$: a linear optimization oracle $\mathrm{LinOpt}: \mathcal{B}(1) \to \mathcal{Z}$ and an index-to-vector oracle $\mathrm{LinEst}: \mathcal{Z} \to \mathbb{R}^d$. We assume that for some $\varepsilon' > 0$:

1. For all $\theta \in \mathbb{R}^d$ with $\|\theta\| = 1$, the output $\hat z_\theta = \mathrm{LinOpt}(\theta)$ satisfies $\theta^\top w_{\hat z_\theta} \ge \sup_{z \in \mathcal{Z}} \theta^\top w_z - \varepsilon'$.
2. For all $z \in \mathcal{Z}$, the output $\hat w_z = \mathrm{LinEst}(z)$ satisfies $\|\hat w_z - w_z\| \le \varepsilon'$.

The RobustSpanner algorithm (Algorithm 2) computes a $(C, \varepsilon)$-approximate spanner for $\mathcal{W}$ using $O(d \log(d/\varepsilon))$ total calls to LinOpt and LinEst. RobustSpanner is an error-tolerant variant of the classical spanner computation algorithm of Awerbuch and Kleinberg [6], which was originally introduced and analyzed for spanner computation with an exact linear optimization oracle. Tolerance to approximation errors in the linear optimization oracle is critical for our application to RL, where additive errors arise both from sampling trajectories and from estimating the feature maps $(\phi^\star_h)_{h\in[H]}$. RobustSpanner achieves error tolerance by perturbing the vectors returned by $\mathrm{LinOpt}(\theta)$ in the direction of $\theta$, which amounts to running the classical algorithm on an ε-fattening of $\mathcal{W}$, and is necessary in order to ensure that the approximation error of LinOpt does not swamp the signal in directions $\theta$ in which $\mathcal{W}$ is too skinny. This technique may be of independent interest; see Appendix C for additional details and formal guarantees.

⁶While barycentric spanners have been used in a number of recent works on sample-efficient RL [19, 20], the motivation for their use within our algorithm and analysis is quite different; see Appendix A.
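To illustrate the oracle interface, here is a minimal sketch of the classical (exact-oracle) Awerbuch-Kleinberg procedure that RobustSpanner builds on; the error-tolerant perturbation step described above is deliberately omitted, and the function names and sweep-based control flow are ours rather than the paper's.

```python
import numpy as np

def barycentric_spanner(lin_opt, lin_est, d, C=2.0, max_sweeps=100):
    """
    Simplified sketch of the classical Awerbuch-Kleinberg spanner computation with
    *exact* oracles; the eps-perturbations that make the paper's RobustSpanner tolerant
    to approximate oracles are omitted here.

    lin_opt(theta) -> an index z (approximately) maximizing <theta, w_z> over z in Z.
    lin_est(z)     -> the vector w_z in R^d.
    Returns d indices whose vectors form a C-approximate barycentric spanner.
    """
    W = np.eye(d)          # columns are progressively replaced by spanner vectors
    Z = [None] * d

    def cofactor_direction(i):
        # theta with the property  <theta, w> = det(W with column i replaced by w)
        theta = np.empty(d)
        for j in range(d):
            Wj = W.copy()
            Wj[:, i] = np.eye(d)[j]
            theta[j] = np.linalg.det(Wj)
        return theta / max(np.linalg.norm(theta), 1e-12)

    def signed_best(theta):
        # query both directions and keep whichever gives the larger |<theta, w>|
        z_plus, z_minus = lin_opt(theta), lin_opt(-theta)
        w_plus, w_minus = lin_est(z_plus), lin_est(z_minus)
        if abs(theta @ w_plus) >= abs(theta @ w_minus):
            return z_plus, w_plus
        return z_minus, w_minus

    for i in range(d):                         # phase 1: build a non-degenerate basis
        Z[i], W[:, i] = signed_best(cofactor_direction(i))
    for _ in range(max_sweeps):                # phase 2: swap until C-approximately optimal
        swapped = False
        for i in range(d):
            theta = cofactor_direction(i)
            z, w = signed_best(theta)
            if abs(theta @ w) > C * abs(theta @ W[:, i]):   # |det| would improve by > C
                Z[i], W[:, i] = z, w
                swapped = True
        if not swapped:
            return Z
    return Z
```

The paper's RobustSpanner (Algorithm 2) additionally pads these comparisons and shifts the stored vectors in the direction of $\theta$, which is the ε-fattening described above; that is what allows the approximate oracles from the two conditions above to suffice.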
Representation learning. Ideally, we would like to use RobustSpanner to construct a barycentric spanner for the set $\{\mathbb{E}^\pi[\phi^\star_h(x_h, a_h)] : \pi \in \Pi_M\}$ with $\mathcal{Z} = \Pi_M$. Because we do not have access to $\phi^\star_h$, we instead apply RobustSpanner with $\mathcal{W} = \{\mathbb{E}^\pi[\hat\phi^{(h)}(x_h, a_h)] : \pi \in \Pi_M\}$, where $\hat\phi^{(h)}$ is a learned representation. We now describe how the feature map $\hat\phi^{(h)}$ is learned, then show how to use these learned features to efficiently implement the oracles $\mathrm{LinOpt}(\cdot)$ and $\mathrm{LinEst}(\cdot)$.

To learn a representation for layer $h$, we use the RepLearn algorithm (Algorithm 5), which was originally introduced in Modi et al. [35], Zhang et al. [49]. The algorithm gathers a collection of triples $(x_h, a_h, x_{h+1})$ by rolling in to $x_h$ with a policy sampled uniformly from the policy cover $\Psi^{(h)}$ and selecting $a_h$ uniformly at random. Using this dataset, the algorithm solves a sequence of adversarial training sub-problems (Line 9 of Algorithm 5) which involve the feature class $\Phi$ and an auxiliary discriminator class $\mathcal{F} \subseteq \{f: \mathcal{X} \to \mathbb{R}\}$. As we discuss in detail in the sequel, these sub-problems, described in (7), are amenable to standard gradient-based training methods. The sub-problems are designed to approximate the following idealized max-min-max representation learning objective:

$$\hat\phi^{(h)} \in \arg\min_{\phi \in \Phi} \sup_{f \in \mathcal{F}} \inf_{w} \mathbb{E}^{\mathrm{unif}(\Psi^{(h)}) \circ_h \pi_{\mathrm{unif}}}\left[\left(\phi(x_h, a_h)^\top w - \mathbb{E}[f(x_{h+1}) \mid x_h, a_h]\right)^2\right]. \tag{6}$$

The intuition for this objective comes from the fact that in a Low-Rank MDP, for any function $f: \mathcal{X} \to \mathbb{R}$, the quantity $\mathbb{E}[f(x_{h+1}) \mid x_h = x, a_h = a]$ is linear in $\phi^\star_h(x, a)$. Thus, if $\mathcal{F}$ is sufficiently expressive, we may hope that $\hat\phi^{(h)}$ and $\phi^\star_h$ are close. We adopt the simple discriminator class $\mathcal{F} = \{x \mapsto \max_{a \in \mathcal{A}} \theta^\top \phi(x, a) \mid \theta \in \mathcal{B}(1), \phi \in \Phi\}$. We show that solving (6) with this choice for $\mathcal{F}$, which is simpler than that in Modi et al. [35], Zhang et al. [49], yields an approximation guarantee for $\hat\phi^{(h)}$ that is suitable for downstream use in spanner computation for general Low-Rank MDPs.

Remark 3.1 (Improved analysis of RepLearn). To facilitate an analysis of SpanRL that does not require reachability assumptions, we use slightly different parameter values for RepLearn than in Modi et al. [35], Zhang et al. [49], and provide a tighter sample complexity bound (Theorem E.1) which may be of independent interest. In more detail, prior work shows that the RepLearn algorithm solves a variant of (6) with $w \in \mathcal{B}(d^{1/2} \cdot \mathrm{poly}(\varepsilon^{-1}))$, where $\varepsilon > 0$ is the desired bound on the mean-squared error. Due to the polynomial dependence on $\varepsilon^{-1}$, such a guarantee would lead to vacuous conclusions when invoked within our analysis of SpanRL. Our improved analysis of RepLearn, which is based on a determinantal potential argument, shows that $w \in \mathcal{B}(\mathrm{poly}(d))$ suffices. A secondary benefit of our improved bound is a faster rate with respect to the number of trajectories.

Putting everything together. Having learned $\hat\phi^{(h)}$ using RepLearn, SpanRL applies RobustSpanner with $\mathcal{W} = \{\mathbb{E}^\pi[\hat\phi^{(h)}(x_h, a_h)] : \pi \in \Pi_M\}$, $\mathcal{Z} = \Pi_M$, and $C = 2$; that is, we plug in the learned representation $\hat\phi^{(h)}$ for the true representation $\phi^\star_h$.⁷ With this choice, implementing LinOpt entails (approximately) solving $\arg\max_{\pi \in \Pi_M} \mathbb{E}^\pi[\theta^\top \hat\phi^{(h)}(x_h, a_h)]$ for a given $\theta \in \mathcal{B}(1)$, and implementing the LinEst oracle entails estimating $\mathbb{E}^\pi[\hat\phi^{(h)}(x_h, a_h)]$ for a given $\pi \in \Pi_M$. We instantiate $\mathrm{LinEst}(\pi)$ as the Monte Carlo algorithm EstVec (Algorithm 7), which simply samples trajectories according to $\pi$ and returns the sample average of $\hat\phi^{(h)}(x_h, a_h)$.

⁷Though the policies produced by the algorithm may not necessarily induce a spanner for $\mathcal{W}_h = \{\mathbb{E}^\pi[\phi^\star_h(x_h, a_h)] : \pi \in \Pi_M\}$ (this would require point-wise representation learning guarantees, which we do not have), our analysis shows that they still suffice to build a policy cover for layer $h+2$.
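The LinEst oracle just described is a plain Monte Carlo average; a minimal sketch follows (the episode-sampling interface `rollout` is our own abstraction of the environment interaction, not the paper's API).

```python
import numpy as np

def lin_est(phi_hat, rollout, pi, h, n, rng):
    """
    Monte Carlo estimate of E^pi[ phi_hat(x_h, a_h) ], in the spirit of EstVec:
    run n episodes with pi and average the learned features at layer h.

    phi_hat : callable (x, a) -> np.ndarray of shape (d,).
    rollout : callable (pi, rng) -> list of (x, a) pairs, one per layer of an episode.
    """
    feats = []
    for _ in range(n):
        x_h, a_h = rollout(pi, rng)[h - 1]   # the (x_h, a_h) pair from one episode of pi
        feats.append(phi_hat(x_h, a_h))
    return np.mean(feats, axis=0)
```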
To implement $\mathrm{LinOpt}(\theta)$, we appeal to PSDP (Algorithm 3). Given an arbitrary reward function $r_{1:h}: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ and a function class $\mathcal{G} \subseteq \{g: \mathcal{X} \times \mathcal{A} \to \mathbb{R}\}$ capable of realizing all possible value functions induced by these rewards, PSDP can use the policy covers $\Psi^{(1:h)}$ to efficiently compute a policy $\hat\pi = \mathrm{PSDP}(h, r_{1:h}, \mathcal{G}, \Psi^{(1:h)}, n)$ that approximately solves $\arg\max_{\pi \in \Pi_M} \mathbb{E}^\pi\left[\sum_{t=1}^{h} r_t(x_t, a_t)\right]$, and does so using polynomially many episodes; see Appendix D for details and formal guarantees.⁸ Thus, implementing $\mathrm{LinOpt}(\theta)$ is as simple as invoking PSDP with the rewards

$$r_t(x,a;\theta) = \begin{cases} \hat\phi^{(h)}(x,a)^\top \theta, & t = h,\\ 0, & \text{otherwise.}\end{cases}$$

With this, we have all the ingredients needed for spanner computation, and the algorithm is complete.

⁸This is the main place where the analysis uses the inductive hypothesis that $\Psi^{(1:h)}$ are policy covers.

3.3 Main Guarantee for SpanRL

The following result is the main sample complexity guarantee for SpanRL (Algorithm 1).

Theorem 3.2 (Main theorem for SpanRL). Let $\delta \in (0,1)$ be given, and suppose that realizability (Assumption 2.2) holds and that reachability (Assumption 2.1) is satisfied with parameter $\eta > 0$. If $\varepsilon = \frac{\eta}{36 d^{5/2}}$ and $c = \mathrm{polylog}(A, H, d, \log(|\Phi|/\delta))$ is sufficiently large, then the policies $\Psi^{(1:H)}$ produced by $\mathrm{SpanRL}(\Phi, \varepsilon, c, \delta)$ form a $\frac{1}{4Ad}$-policy cover with probability at least $1 - \delta$. The total number of episodes used by SpanRL is at most

$$\widetilde{O}\!\left(A^4 d^9 H^4 \left(d + \log(|\Phi|/\delta)\right) \cdot \frac{1}{\eta^2}\right).$$

Theorem 3.2 is the first provable, model-free sample complexity guarantee for general Low-Rank MDPs that is attained by an efficient algorithm. Prior to our work, all efficient model-free algorithms required non-negative features (latent variable structure) or stronger assumptions [35, 49], even in the presence of similar reachability assumptions; see Table 1.

Remark 3.2 (On the reachability assumption). While the reachability assumption is shared by the best prior efficient algorithms [35], which require non-negativity in addition to this assumption, it is natural to ask to what extent reachability restricts the generality of the Low-Rank MDP model. In Appendix H, we show that even when reachability holds, the embedding dimension in our model can be exponentially smaller than the dimension of the best non-negative (latent variable) embedding [35]. Hence, our results are meaningfully more general than prior work. While our guarantee is polynomial in all relevant problem parameters, improving the dependence further (e.g., to match that of the best known inefficient algorithms) is an interesting direction for future research, as is removing the reachability assumption.

Application to reward-based RL. By using the policy cover produced by SpanRL within PSDP (Algorithm 3), we can optimize any downstream reward function to error ε using $\mathrm{poly}(d, A, H, \log|\Phi|) \cdot \frac{1}{\varepsilon^2}$ episodes. See Appendix G for details.

Efficiency and practicality. We observe that SpanRL is simple and practical. Defining

$$L_D(\phi, w, f) = \sum_{(x,a,x') \in D} \left(\phi(x,a)^\top w - f(x')\right)^2 + \lambda \|w\|^2,$$

where $D$ is a dataset consisting of $(x_h, a_h, r_h, x_{h+1})$ tuples, the algorithm is provably efficient whenever the adversarial objective

$$f^{(t)} \in \arg\max_{f \in \mathcal{F}} \max_{\tilde\phi \in \Phi} \left\{\min_{w} L_D(\phi^{(t)}, w, f) - \min_{\tilde w} L_D(\tilde\phi, \tilde w, f)\right\} \tag{7}$$

in Line 9 of RepLearn (Algorithm 5) can be implemented efficiently (note that by the definition of $L_D$, the inner minima over $w$ and $\tilde w$ in (7) can be solved in closed form). This objective was also assumed to be efficiently solvable in Modi et al. [35], Zhang et al. [49], and was empirically shown to be practical in [49]; note that the objective is amenable to standard gradient-based optimization techniques, and that $\mathcal{F}$ can be over-parameterized.
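To illustrate the closed-form inner minimization noted above, the following sketch computes $\min_w L_D(\phi, w, f)$ by ridge regression and evaluates the discriminator's gap from (7) for a fixed candidate $\phi^{(t)}$ over a finite feature class (the function names and data layout are ours).

```python
import numpy as np

def inner_min_LD(phi, f, data, lam):
    """
    min_w L_D(phi, w, f): ridge regression of the targets f(x') onto features phi(x, a)
    over the dataset D = [(x, a, x'), ...].  Returns (w_star, attained loss).
    """
    X = np.stack([phi(x, a) for (x, a, _) in data])         # (n, d) feature matrix
    y = np.array([f(x_next) for (_, _, x_next) in data])    # (n,) regression targets
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return w, float(np.sum((X @ w - y) ** 2) + lam * w @ w)

def discriminator_gap(phi_t, f, feature_class, data, lam):
    """The inner quantity maximized over f in (7), for the current candidate phi_t."""
    loss_t = inner_min_LD(phi_t, f, data, lam)[1]
    best_alt = min(inner_min_LD(phi, f, data, lam)[1] for phi in feature_class)
    return loss_t - best_alt
```

In practice, the remaining maximization over $f \in \mathcal{F}$ (and $\tilde\phi \in \Phi$) would be carried out with gradient-based training, as the text notes.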
While a detailed experimental evaluation is outside the scope of this paper, we are optimistic about the empirical performance of the algorithm in light of the encouraging results based on the same objective in Zhang et al. [49]. Outside of representation learning, the only computational overhead in SpanRL is the RobustSpanner subroutine, which has polynomial runtime. Indeed, RobustSpanner requires only polynomially many calls to the linear optimization oracle, instantiated as PSDP, which is efficient whenever standard least-squares regression problems based on the class $\Phi$ can be solved efficiently, analogous to [33, 31].

Analysis and proof techniques. The proof of Theorem 3.2, which is given in Appendix F, is appealing in its simplicity and modularity. The crux of the proof is to show that the representation learning guarantee in (6) is strong enough to ensure that the downstream spanner computation in RobustSpanner succeeds. It is straightforward to show that spanner computation would succeed if we had access to an estimated representation $\hat\phi^{(h)}$ that approximates $\phi^\star_h$ point-wise (i.e., uniformly over all $(x,a)$ pairs), but the key challenge is that the guarantee in (6) only holds on average under the roll-in distribution $\mathrm{unif}(\Psi^{(h)})$. Prior works that make use of the same representation learning objective (BRIEE [49] and MOFFLE [35]) do not make use of spanners; instead, they appeal to exploration strategies based on elliptic bonuses, addressing the issue of approximation errors through additional assumptions (non-negativity of the factorization for MOFFLE, and Block MDP structure for BRIEE). Perhaps the most important observation in our proof is that barycentric spanners are robust to the average-case approximation error guarantee in (6) as-is, without additional structural assumptions. Intuitively, this benefit seems to arise from the fact that the spanner property only concerns the first moment of the feature map $\phi^\star$, while algorithms based on elliptic bonuses require approximation guarantees for the second moment; understanding this issue more deeply is an interesting question for future work. Another useful feature of our proof is to show that the notion of reachability in Assumption 2.1, which generalizes and extends all previous reachability conditions in the Low-Rank MDP and Block MDP literature [46, 5, 13, 33, 1, 35, 31], is sufficient to build a policy cover. We anticipate that this observation will find broader use.

Acknowledgments and Disclosure of Funding

We thank Noah Golowich, Dhruv Rohatgi, and Ayush Sekhari for several helpful discussions. ZM and AR acknowledge support from the ONR through awards N00014-20-1-2336 and N00014-20-1-2394, and from the ARO through award W911NF-21-1-0328. AB acknowledges support from the National Science Foundation Graduate Research Fellowship under Grant No. 1122374.

References

[1] A. Agarwal, S. Kakade, A. Krishnamurthy, and W. Sun. FLAMBE: Structural complexity and representation learning of low rank MDPs. arXiv preprint arXiv:2006.10814, 2020.
[2] A. Agarwal, S. Kakade, and L. F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67-83. PMLR, 2020.
[3] A. Agarwal, S. M. Kakade, A. Krishnamurthy, and W. Sun. FLAMBE: Structural complexity and representation learning of low rank MDPs.
In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. [4] A. Agarwal, Y. Jin, and T. Zhang. Vo q l: Towards optimal regret in model-free rl with nonlinear function approximation. ar Xiv preprint ar Xiv:2212.06069, 2022. [5] A. Agarwal, Y. Song, W. Sun, K. Wang, M. Wang, and X. Zhang. Provable benefits of representational transfer in reinforcement learning. Co RR, abs/2205.14571, 2022. [6] B. Awerbuch and R. Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97 114, 2008. [7] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263 272, 2017. [8] J. Bagnell, S. M. Kakade, J. Schneider, and A. Ng. Policy search by dynamic programming. Advances in neural information processing systems, 16, 2003. [9] F. Chen, Y. Bai, and S. Mei. Partially observable rl with b-stability: Unified structural condition and sharp sample-efficient algorithms. ar Xiv preprint ar Xiv:2209.14990, 2022. [10] Y. Cheng, R. Huang, Y. Liang, and J. Yang. Improved sample complexity for reward-free reinforcement learning under low-rank mdps. In The Eleventh International Conference on Learning Representations, 2023. [11] C. Dann, N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. On oracleefficient PAC RL with rich observations. In Advances in neural information processing systems, pages 1422 1432, 2018. [12] S. S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Provably efficient RL with rich observations via latent state decoding. ar Xiv preprint ar Xiv:1901.09018, 2019. [13] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, pages 8060 8070, 2019. [14] S. S. Du, S. M. Kakade, J. D. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in RL. ar Xiv preprint ar Xiv:2103.10897, 2021. [15] Y. Efroni, D. Misra, A. Krishnamurthy, A. Agarwal, and J. Langford. Provably filtering exogenous distractors using multistep inverse dynamics. In International Conference on Learning Representations, 2021. [16] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005. [17] D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021. [18] D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making with the decision-estimation coefficient. Conference on Learning Theory (COLT), 2023. [19] N. Golowich, A. Moitra, and D. Rohatgi. Learning in observable pomdps, without computationally intractable oracles. In Advances in Neural Information Processing Systems, 2022. [20] A. Huang, J. Chen, and N. Jiang. Reinforcement learning in low-rank mdps with density features. International Conference on Machine Learning (ICML), 2023. [21] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pages 1704 1713, 2017. [22] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. 
A short note on concentration inequalities for random vectors with subgaussian norm. ar Xiv preprint ar Xiv:1902.03736, 2019. [23] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137 2143, 2020. [24] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. ar Xiv preprint ar Xiv:2102.00815, 2021. [25] S. M. Kakade. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003. [26] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238 1274, 2013. [27] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao. Deep reinforcement learning for dialogue generation. In EMNLP, 2016. [28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. [29] Z. Mhammedi, D. J. Foster, M. Simchowitz, D. Misra, W. Sun, A. Krishnamurthy, A. Rakhlin, and J. Langford. Learning the linear quadratic regulator from nonlinear observations. Advances in Neural Information Processing Systems, 33:14532 14543, 2020. [30] Z. Mhammedi, A. Block, D. J. Foster, and A. Rakhlin. Efficient model-free exploration in low-rank mdps. ar Xiv preprint ar Xiv:2307.03997, 2023. [31] Z. Mhammedi, D. J. Foster, and A. Rakhlin. Representation learning with multi-step inverse kinematics: An efficient and optimal approach to rich-observation rl. International Conference on Machine Learning (ICML), 2023. [32] D. Misra, M. Henaff, A. Krishnamurthy, and J. Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. ar Xiv preprint ar Xiv:1911.05815, 2019. [33] D. Misra, M. Henaff, A. Krishnamurthy, and J. Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961 6971. PMLR, 2020. [34] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015. [35] A. Modi, J. Chen, A. Krishnamurthy, N. Jiang, and A. Agarwal. Model-free representation learning and exploration in low-rank mdps. Co RR, abs/2102.07035, 2021. [36] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, pages 811 820. ACM, 2010. [37] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256 2264, 2013. [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016. [39] W. Sun, N. Jiang, A. Krishnamurthy, A. Agarwal, and J. Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on learning theory, pages 2898 2933. PMLR, 2019. [40] M. 
Uehara, X. Zhang, and W. Sun. Representation learning for online and offline RL in low-rank mdps. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. [41] R. Wang, S. S. Du, L. Yang, and R. R. Salakhutdinov. On reward-free reinforcement learning with linear function approximation. Advances in neural information processing systems, 33: 17816 17826, 2020. [42] R. Wang, R. Salakhutdinov, and L. F. Yang. Provably efficient reinforcement learning with general value function approximation. ar Xiv preprint ar Xiv:2005.10804, 2020. [43] T. Xie, D. J. Foster, Y. Bai, N. Jiang, and S. M. Kakade. The role of coverage in online reinforcement learning. ar Xiv preprint ar Xiv:2210.04157, 2022. [44] L. Yang and M. Wang. Sample-optimal parametric q-learning using linearly additive features. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 6995 7004. PMLR, 2019. [45] H. Yao, C. Szepesvári, B. Á. Pires, and X. Zhang. Pseudo-mdps and factored linear action models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2014, Orlando, FL, USA, December 9-12, 2014, pages 1 9. IEEE, 2014. [46] A. Zanette, A. Lazaric, M. J. Kochenderfer, and E. Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. Advances in Neural Information Processing Systems, 33: 11756 11766, 2020. [47] T. Zhang. Feel-good thompson sampling for contextual bandits and reinforcement learning. ar Xiv preprint ar Xiv:2110.00871, 2021. [48] T. Zhang, T. Ren, M. Yang, J. Gonzalez, D. Schuurmans, and B. Dai. Making linear mdps practical via contrastive representation learning. In International Conference on Machine Learning, pages 26447 26466. PMLR, 2022. [49] X. Zhang, Y. Song, M. Uehara, M. Wang, A. Agarwal, and W. Sun. Efficient reinforcement learning in block mdps: A model-free representation learning approach. In International Conference on Machine Learning, pages 26517 26547. PMLR, 2022. Contents of Appendix A Additional Related Work 14 B Omitted Algorithms 15 C Generic Guarantee for Robust Spanner 18 D Generic Guarantee for PSDP 20 E Guarantee for Rep Learn 23 F Analysis 26 F.1 Proof Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 F.2 Guarantee for PSDP as a Subroutine for Robust Spanner . . . . . . . . . . . . . . . . . 27 F.3 Guarantee for Robust Spanner as a Subroutine for Span RL . . . . . . . . . . . . . . . 28 F.4 Guarantee for Rep Learn as a Subroutine for Span RL . . . . . . . . . . . . . . . . . . . 29 F.5 Concluding the Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 F.6 Proof of Lemma 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 G Application to Reward-Based RL 32 H Properties of Reachability Assumption 33 H.1 Comparison to Latent Variable Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 H.2 Relation to Other Reachability Assumptions . . . . . . . . . . . . . . . . . . . . . . . . 35 A Additional Related Work In this section, we discuss relevant related work not already covered. Block MDPs. A particularly well-studied special case low-rank MDPs with the latent variable assumed in Modi et al. [35] (defined in Definition H.1) is the Block MDP (BMDP) model Du et al. [13], Misra et al. [32], Zhang et al. [49], Mhammedi et al. [31]. 
For this setting, Du et al. [13], Misra et al. [32] provide algorithms that conduct exploration in a provably oracle-efficient manner under a reachability assumption. This reachability assumption was removed by subsequent work of Zhang et al. [49] (with a suboptimal rate) and Mhammedi et al. [31] (with optimal error dependence), but the analysis in these works is tailored to the BMDP model. Barycentric spanners. Huang et al. [20] consider a variant of the Low-Rank MDP framework in which we are given a class Υ that realizes the next-state feature map µ , but do not have access to a class Φ for the feature map ϕ , which is unknown. Their algorithm, like Span RL, is based on barycentric spanners, though the algorithm design considerations and analysis are significantly different. Notably, their algorithm is not computationally efficient, and their analysis takes advantage of the fact that realizability of µ facilitates estimation of the occupancies {dπ( )}π ΠM in ℓ1-error. Barycentric spanners were also in the work of Golowich et al. [19] for reinforcement learning in Partially Observable MDPs (POMDPs). Their analysis is substantially different from ours, and their algorithm appeals to the barycentric spanner computation approach in Awerbuch and Kleinberg [6] in an off-the-shelf fashion. B Omitted Algorithms Algorithm 2 Robust Spanner: Barycentric Spanner via Approximate Linear Optimization Require: Approximate linear optimization subroutine Lin Opt Rd Z. /* See Section 3.2 */ Approximate index-to-vector subroutine Lin Est Z Rd. Parameters C,ε > 0. 1: Set W = (w1,...,wd) = (e1,...,ed). 2: for i = 1,...,d do 3: Set θi = (det(ej,W i))j [d] Rd. // W i is defined to be W without the ith column 4: Set z+ i = Lin Opt(θi/ θi ) and w+ i = Lin Est(z+ i ). 5: Set z i = Lin Opt( θi/ θi ) and w i = Lin Est(z i ). 6: if θ i w+ i θ i w i then 7: Set wi = w+ i , zi = z+ i , and wi = wi + εθi/ θi . 9: Set wi = w i , zi = z i , and wi = wi εθi/ θi . 10: for n = 1,2,... do 11: Set i = 1. 12: while i d do 13: Set θi = (det(ej,W i))j [d] Rd. 14: Set z+ i = Lin Opt(θi/ θi ) and w+ i = Lin Est(z+ i ). 15: Set z i = Lin Opt( θi/ θi ) and w i = Lin Est(z i ). 16: if θ i w+ i + ε θi C det(wi,W i) then 17: Set wi = w+ i , zi = z+ i , and wi = wi + ε θi/ θi . 19: else if θ i w i + ε θi C det(wi,W i) then 20: Set wi = w i , zi = z i , and wi = wi ε θi/ θi . 22: Set i = i + 1. 23: if i = d + 1 then 25: Return: (z1,...,zd). Algorithm 3 PSDP(h,r1 h,G,Ψ(1 h),n): Policy Search by Dynamic Programming (variant of [8]) Require: Target layer h [H]. Reward functions r1 h. Function class G. Policy covers Ψ(1),...,Ψ(h). Number of samples n N. 1: for t = h,...,1 do 3: for n times do 4: Sample (xt,at, h ℓ=t rℓ(xℓ,aℓ)) unif(Ψ(t)) t πunif t+1 ˆπ(t+1). 5: Update dataset: D(t) D(t) {(xt,at, h ℓ=t rℓ(xℓ,aℓ))}. 6: Solve regression: g(t) arg min g G (x,a,R) D(t) (g(x,a) R)2. 7: Define ˆπ(t) Πt h M via ˆπ(t)(x) = { arg maxa A g(t)(x,a), x Xt, ˆπ(t+1)(x), x Xℓ, ℓ [t + 1..h]. (8) 8: Return: Near-optimal policy ˆπ(1) ΠM. Algorithm 4 Est Vec(h,F,π,n): Estimate Eπ[F(xh,ah)] for policy π and function F X A Rd. Require: Target layer h [H]. Vector-valued function F X A Rd. Policy π ΠM. Number of samples n N. 2: for n times do 3: Sample (xh,ah) π. 4: Update dataset: D D {(xh,ah)}. 5: Return: F = 1 n (x,a) D F(x,a). Algorithm 5 Rep Learn(h,F,Φ,Ψ,n): Representation Learning for Low-Rank MDPs [35, 49]. Require: Target layer h [H]. Discriminator class F. Feature class Φ. A set of policies Ψ. Number of samples n N. 
1: Set ε = 100 d log n+log( Φ /δ) n , ε1 = 16 2d3/2 ε1/2. 2: Initialize λ = n, T = d 2 ε, and ξ = 3 2ε1 + ε + 2λd n , ϕ(0) Φ. 4: for n times do 5: Sample (xh,ah,xh+1) unif(Ψ) h πunif. 6: Update dataset: D D {(xh,ah,xh+1)}. 7: Define LD(ϕ,w,f) = (x,a,x ) D(ϕ(x,a) w f(x ))2 + λ w 2. 8: for t = 0,...,T 1 do /* Discriminator selection */ f (t) arg max f F (f), where (f) = max ϕ Φ {min w LD(ϕ(t),w,f) min w LD( ϕ, w,f)}. 10: if (f (t)) ξ then 11: Return ˆϕ = ϕ(t). /* Feature selection via least-squares minimization */ ϕ(t+1) arg min ϕ Φ min w0 t t i=0 LD(ϕ,wi,f (i)). C Generic Guarantee for Robust Spanner In this section, we give a generic guarantee for the Robust Spanner algorithm when invoked with oracles Lin Opt and Lin Est satisfying the following assumption. Assumption C.1 (Lin Opt and Lin Est as approximate Linear Optimization Oracles). For some abstract set Z and a collection of vectors {wz Rd z Z} indexed by elements in Z, there exists ε > 0 such that for any θ Rd {0} and z Z, the outputs ˆzθ = Lin Opt(θ/ θ ) and ˆwz = Lin Est(z) satisfy sup z Z θ wz θ wˆzθ + ε θ , and ˆwz wz ε . Letting W = {wz z Z} and assuming that W B(1), the next theorem bounds the number of iterations of Robust Spanner(Lin Opt( ),Lin Est( ), , ) under Assumption C.1, and shows that the output is an approximate barycentric spanner for W (Definition 3.1). Our result extends those of Awerbuch and Kleinberg [6], in that it only requires an approximate linear optimization oracle, which is potentially of independent interest. Proposition C.1. Fix C > 1 and ε (0,1) and suppose that {wz z Z} B(1). If Robust Spanner (Algorithm 2) is run with parameters C,ε > 0 and oracles Lin Opt, Lin Est satisfying Assumption C.1 with ε = ε/2, then it terminates after d + d 2 log C 100d ε2 iterations, and requires at most twice that many calls to each of Lin Opt and Lin Est. Furthermore, the output z1 d has the property that for all z Z, there exist β1,...,βd [ C,C], such that wz d i=1 βiwzi 3Cd ε Proof of Proposition C.1. The proof will follows similar steps to those in Awerbuch and Kleinberg [6, Lemma 2.6], with modifications to account for the fact that linear optimization over the set W = {wz z Z} is only performed approximately. Part I: Bounding the number of iterations. In Algorithm 2, there are two loops, both of which require two calls to Lin Opt and Lin Est per iteration. As the first loop has exactly d iterations, it suffices to bound the number of iterations in the second loop. Let M (i) = (w1,...,wi,ei+1,...,ed) be the matrix whose columns are the vectors at end of the ith iteration of the first loop (Line 2) of Algorithm 2; note that columns i + 1 through d are unchanged at this point in the algorithm. For i [d], we define ℓi(w) = det(w,M (i) i ) and θi = (det(ej,M (i) i ))j [d] Rd, where we recall that for any matrix A, the matrix A i is defined as the result of removing the ith column from A. Note that ℓi is linear in w, and in particular ℓi(w) = w θi. Let W (0) = M (d) = (w1,...,wd), and let W (j) denote the resulting matrix after j iterations of the second loop (Line 10) of Algorithm 2. We will show that for any J 1, det(W (J)) det(W (0)) (100d By construction of the loop, we have det(W (j)) C det(W (j 1)) for each j [J], and thus det(W (J)) det(W (0)) CJ. Combining these two facts will establish the bound on the iteration complexity. We now prove (9). Let ui = e i (M (i)) 1 (note that ui is a row vector) and let U denote the matrix whose ith row is ui. 
We observe that for all w Rd, uiw = ℓi(w) where we note that ℓi(wi) 0 by construction; indeed, the columns of M (i) are a basis for Rd because det(M (i)) 0, and the equality holds on the columns, so the two linear functions must be equal. Now, since Assumption C.1 holds with ε = ε/2, we have θ i w+ i sup z Z θ i wz ε 2 θi , and θ i w i inf z Z θ i wz + ε 2 θi , (10) where w i = Lin Opt(z i ). We will now show that 2 θi . (11) There are two cases. First, suppose that θ i w+ i θ i w i , corresponding to the conditional in Line 6 of Algorithm 2 being satisfied. Combining this with (10), we have θ i w+ i (sup z Z θ i wz ε 2 θi ) ( θ i w i ), (sup z Z θ i wz ε 2 θi ) (sup z Z θ i wz ε 2 θi ), (by (10)) = (sup z Z θ i wz) (sup z Z θ i wz) ε 2 θi . (12) Because the conditional is satisfied, wi = w+ i + ε θi θi , and so by plugging this into (12), we have ℓi(wi) = θ i wi ε The case that θ i w+ i θ i w i is essentially identical, establishing (11). Now, recall that W = {wz z Z} and let W B ( 3ε 2 ) = {w + b w W and b B ( 3ε 2 )} denote the Minkowski sum with B ( 3ε 2 ). By Cauchy-Schwarz, it holds that for all w = w + b W B ( 3ε ℓi(w ) = θ i w = θ i w + θ i b (1 + 3ε where we used that W B(1) (by assumption). Thus, for any w W B ( 3ε 2 ), we have uiw = ℓi(w ) ℓi(wi) 1 + 3ε We now observe that by construction and the fact that Assumption C.1 holds with ε = ε/2, the kth column w k of W (J) belongs to W B ( 3ε 2 ), for any k [d]. Thus, the (i,k) entry uiw k of UW (J) satisfies uiw k [ 1 3ε 2 ], and so the columns of UW (J) have Euclidean norm at most 10 d ε . Since the magnitude of the determinant of a matrix is upper bounded by the product of the Euclidean norms of its columns, it holds that det(UW (J)) ( 100d On the other hand, again by construction, we see that the columns w1,...,wd of W (0) satisfy uiwj = 0, for j < i, and uiwi = 1. Thus, UW (0) is an upper-triangular matrix with 1s on the diagonal, and hence has determinant 1. Because determinants are multiplicative, this implies that det(U) 0. We now compute: det(W (J)) = det(UW (J)) det(U) = det(UW (J)) det(UW (0)) (100d Thus, the upper bound on det(W (J)) holds and the claim is proven. Therefore, we have 2 log C ( 100d Part II: Spanner property for the output. Having shown that the algorithm terminates, we now show that the result is an approximate barycentric spanner for W. Let W = (w1,...,wd) be the matrix at termination of the algorithm. By definition, if the second loop (Line 10) has terminated, then for all i [d], max(θ i w+ i , θ i w i ) + ε θi C det(wi,W i) , where θi = (det(ej,W i))j [d] Rd. On the other hand, by Assumption C.1, (10) holds, and so z Z, i [d], det(wz,W i) = θ i wz max(θ i w+ i , θ i w i ) + ε θi , C det(wi,W i) . (13) Now, fix z Z. Since det(W) 0, there exist β1 d R such that wz = d i=1 βiwi. By plugging this into (13) and using the linearity of the determinant, we have i [d], C det(wi,W i) det(wz,W i) = RRRRRRRRRRR d j=1 βi det(wj,W i) RRRRRRRRRRR = βi det(wi,W i) . Therefore, βi C, for all i [d]. Now, by definition of w1 d and w1 d, for all i [d], we have that wi wi ε. Furthermore, by Assumption C.1, we also have that wi wzi ε/2. Therefore, by the triangle inequality, we have wz d i=1 βiwzi wz d i=1 βiwi + d i=1 βi wi wzi + d i=1 βi wi wi 3d Cε/2. This completes the proof. D Generic Guarantee for PSDP In this section, we present a generic guarantee for PSDP (Algorithm 3). 
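Before stating the guarantee, the following minimal Python sketch of the PSDP routine (Algorithm 3) may help fix ideas. The regression oracle and the rollout sampler are placeholders standing in for the least-squares step over the class G and the roll-in distribution of Line 4 of Algorithm 3, respectively; none of the names below are part of the paper's formal development.

```python
def psdp(h, rewards, regress, rollout, policy_covers, n, actions):
    """Sketch of PSDP (Algorithm 3): backward dynamic programming where each step fits a
    regressor to Monte Carlo sums of rewards and acts greedily with respect to it.

    rewards[t](x, a)  -- reward at layer t
    regress(data)     -- least-squares oracle over the class G; maps a list of (x, a, R)
                         triples to a function g(x, a)
    rollout(t, cover, partial_policy, rewards)
                      -- samples one trajectory that rolls in with a policy drawn
                         uniformly from cover, plays a uniform action at layer t, follows
                         partial_policy afterwards, and returns
                         (x_t, a_t, sum_{l=t}^{h} rewards[l](x_l, a_l))
    policy_covers[t]  -- the policy cover Psi^(t) used for rolling in to layer t
    actions           -- the (finite) action set A
    """
    partial_policy = {}                          # layer -> greedy policy at that layer
    for t in range(h, 0, -1):                    # t = h, h-1, ..., 1
        data = [rollout(t, policy_covers[t], partial_policy, rewards) for _ in range(n)]
        g_t = regress(data)                      # \hat g^{(t)}: estimate of the Q-function
        partial_policy[t] = (lambda g: (lambda x: max(actions, key=lambda a: g(x, a))))(g_t)
    return partial_policy                        # \hat\pi^{(1)}: greedy at every layer t <= h
```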
We show that given any reward functions $r_{1:h}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ and a function class $\mathcal{G}\subseteq\{g:\mathcal{X}\times\mathcal{A}\to\mathbb{R}\}$ that realizes these reward functions (we formalize this in the next definition), if $\Psi^{(1:h)}$ are $\alpha$-policy covers for layers $1$ through $h$, then for sufficiently large $n\ge1$ and with high probability, the output $\hat\pi = \texttt{PSDP}(h,r_{1:h},\mathcal{G},\Psi^{(1:h)},n)$ is an approximate maximizer of the objective $\max_{\pi\in\Pi_{\mathrm{M}}}\mathbb{E}^\pi[\sum_{t=1}^h r_t(x_t,a_t)]$. To formalize this result, we define the notion of realizability we require for the function class $\mathcal{G}$.

Definition D.1. We say that the function class $\mathcal{G}\subseteq\{g:\mathcal{X}\times\mathcal{A}\to\mathbb{R}\}$ realizes the reward functions $r_{1:h}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ if for all $t\in[h]$ and all $\pi\in\Pi^{t:h}_{\mathrm{M}}$, $Q^\pi_t\in\mathcal{G}$, where
$Q^\pi_t(x,a) = r_t(x,a) + \mathbb{E}^\pi\big[\sum_{\ell=t+1}^{h} r_\ell(x_\ell,a_\ell)\mid x_t=x, a_t=a\big]$. (14)

Note that $Q^\pi_t$ in (14) represents the state-action value function (Q-function) at layer $t\in[h]$ with respect to the rewards $r_{1:h}$ and partial policy $\pi$. In what follows, given a function class $\mathcal{G}\subseteq\{g:\mathcal{X}\times\mathcal{A}\to\mathbb{R}\}$, we use $N_{\mathcal{G}}(\varepsilon)$ to denote the $\varepsilon$-covering number of $\mathcal{G}$ in $\ell_\infty$ distance. With this, we now state a guarantee for PSDP.

Theorem D.2. Let $\varepsilon,\delta\in(0,1)$, $B>0$, and $h\in[H]$. Suppose reward functions $r_{1:h}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$, a function class $\mathcal{G}$, a collection of policies $\Psi^{(1:h)}$, and a parameter $n\ge1$ satisfy the following:
The function class $\mathcal{G}$ realizes the reward functions $r_{1:h}$ (in the sense of Definition D.1), functions in $\mathcal{G}$ are uniformly bounded by $B$, and $\lim_{n\to\infty} n^{-1}\log N_{\mathcal{G}}(1/n) = 0$.
For some $0<\alpha\le1$ and each $1\le t\le h$, it holds that $\Psi^{(t)}$ is an $\alpha$-policy cover for layer $t$, and moreover $|\Psi^{(t)}|\le d$.
The parameter $n$ is chosen such that $c\,dH\alpha^{-1}\varepsilon_{\mathrm{stat}}(n,\delta/H)\le\varepsilon$, where $\varepsilon_{\mathrm{stat}}(n,\delta') = \sqrt{B^2 n^{-1}(\log N_{\mathcal{G}}(1/n) + \log(n/\delta'))}$ and $c>0$ is a large enough absolute constant.
Then, with probability at least $1-\delta$, the policy $\hat\pi = \texttt{PSDP}(h,r_{1:h},\mathcal{G},\Psi^{(1:h)},n)$ produced by Algorithm 3 satisfies the following guarantee:
$\max_{\pi\in\Pi_{\mathrm{M}}}\mathbb{E}^\pi\big[\sum_{t=1}^{h} r_t(x_t,a_t)\big] \le \mathbb{E}^{\hat\pi}\big[\sum_{t=1}^{h} r_t(x_t,a_t)\big] + \varepsilon.$

To prove the theorem, we need two intermediate results. The first shows that the Q-function is the Bayes-optimal predictor of the sum of rewards when rolling out with policy $\pi$.

Lemma D.1. Let $t\in[H]$, let $r_{1:h}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ be reward functions, let $\pi\in\Pi_{\mathrm{M}}$, and let $g^\pi_{\mathrm{bayes}}$ denote the Bayes-optimal predictor9 for the sum of rewards under policy $\pi$; that is,
$g^\pi_{\mathrm{bayes}} \in \arg\min_{g:\mathcal{X}_t\times\mathcal{A}\to\mathbb{R}}\ \mathbb{E}^\pi\Big[\big(g(x_t,a_t) - \sum_{\ell=t}^{h} r_\ell(x_\ell,a_\ell)\big)^2\Big]$. (15)
Then $g^\pi_{\mathrm{bayes}} = Q^\pi_t$, where $Q^\pi_t$ is the Q-function at layer $t\in[h]$ with respect to the policy $\pi$ and rewards $r_{1:h}$ defined in (14).

Proof of Lemma D.1. The least-squares solution $g^\pi_{\mathrm{bayes}}$ of the problem in (15) satisfies, for all $a\in\mathcal{A}$ and $x\in\mathcal{X}_t$,
$g^\pi_{\mathrm{bayes}}(x,a) = \mathbb{E}^\pi\big[\sum_{\ell=t}^{h} r_\ell(x_\ell,a_\ell)\mid x_t=x, a_t=a\big] = \mathbb{E}[r_t(x_t,a_t)\mid x_t=x,a_t=a] + \mathbb{E}^\pi\big[\sum_{\ell=t+1}^{h} r_\ell(x_\ell,a_\ell)\mid x_t=x,a_t=a\big] = r_t(x,a) + \mathbb{E}^\pi\big[\sum_{\ell=t+1}^{h} r_\ell(x_\ell,a_\ell)\mid x_t=x,a_t=a\big] = Q^\pi_t(x,a),$
where the last step follows by the definition of the Q-function in (14).

We now show that the solution $\hat g^{(t)}$ of the least-squares problem in (6) of Algorithm 3 is close to the Q-function in the appropriate sense.

Lemma D.2. Let $\varepsilon,\delta\in(0,1)$, $B>0$, and $1\le t\le h\le H$. Further, let $(\varepsilon_{\mathrm{stat}}, r_{1:h},\mathcal{G},\Psi^{(1:h)},n)$ be as in Theorem D.2. Then the solution $\hat g^{(t)}$ of the least-squares problem in (6) in Algorithm 3, when calling $\texttt{PSDP}(h,r_{1:h},\mathcal{G},\Psi^{(1:h)},n)$, satisfies with probability at least $1-\delta$,
$\mathbb{E}^{\mathrm{unif}(\Psi^{(t)})}\Big[\max_{a\in\mathcal{A}}\big(\hat g^{(t)}(x_t,a) - Q^{\hat\pi^{(t+1)}}_t(x_t,a)\big)^2\Big] \le c^2\,\varepsilon^2_{\mathrm{stat}}(n,\delta),$
where $\hat\pi^{(t+1)}\in\Pi^{t+1:h}_{\mathrm{M}}$ is as in Algorithm 3 and $c>0$ is an absolute constant.

Proof of Lemma D.2. Fix $t\in[h]$ and let $g^{(t)}_{\mathrm{bayes}} = g^\pi_{\mathrm{bayes}}$ be as in Lemma D.1 with $\pi = \mathrm{unif}(\Psi^{(t)})\circ_t\pi_{\mathrm{unif}}\circ_{t+1}\hat\pi^{(t+1)}$ and reward functions $r_{1:h}$ as in the lemma's statement.
By Lemma D.1, g(t) bayes is the Bayes-optimal solution of the least-squares problem in (6) of Algorithm 3. Thus, since G realizes the reward functions r1 h, a standard uniform-convergence guarantee for least-square regression (see e.g. Mhammedi et al. [29, Proposition B.1] with e = 0 almost surely) implies that there exists an absolute constant c > 0 (independent of t,h, and any other problem parameters) such that with probability at least 1 δ, Eunif(Ψ(t)) [max a A (ˆg(t)(xt,a) g(t) bayes(xt,a)) 2 ] c2B2 log NG(1/n) + log(n/δ) The desired result follows by the fact that g(t) bayes Qˆπ(t+1) t ; see Lemma D.1. We also require the classical performance difference lemma from Kakade [25]. 9Observe that because the loss is strongly convex, this predictor is unique up to sets of measure zero, justifying our usage of the Bayes optimal reward. Lemma D.3 (Performance Difference Lemma). Let π ,π ΠM be arbitrary, and Qπ t be as defined in (14). Then, for any h 1 Eπ [ h t=1 rt(xt,at)] Eπ [ h t=1 rt(xt,at)] = h t=1 Eπ [Qπ t (xt,π (xt)) Qπ t (xt,π(xt))]. Using these results, we now prove Theorem D.2. Proof of Theorem D.2. We first show that for any t [h], there is an event Et of probability at least 1 δ/H under which the learned partial policies ˆπ(t), ˆπ(t+1) are such that Eπ [Qˆπ(t+1) t (xt,π (xt)) Qˆπ(t+1) t (xt, ˆπ(t)(xt))] ε where π arg maxπ ΠM Eπ[ h t=1 rt(xt,at)] is the optimal policy and Qπ t is the Q-function defined in (14). Once we establish (16) for all t [h], we will apply the performance difference lemma (Lemma D.3) and the union bound to obtain the desired result. For any t [h], we have Eπ [Qˆπ(t+1) t (xt,π (xt)) Qˆπ(t+1) t (xt, ˆπ(t)(xt))] = Eπ [Qˆπ(t+1) t (xt,π (xt)) ˆg(t)(xt,π (xt)) + ˆg(t)(xt,π (xt)) Qˆπ(t+1) t (xt, ˆπ(t)(xt))], Eπ [Qˆπ(t+1) t (xt,π (xt)) ˆg(t)(xt,π (xt)) + ˆg(t)(xt, ˆπ(t)(xt)) Qˆπ(t+1) t (xt, ˆπ(t)(xt))], 2 Eπ [max a A Qˆπ(t+1) t (xt,a) ˆg(t)(xt,a) ], (17) where the penultimate inequality follows by the fact that ˆπ(t)(x) arg maxa A ˆg(t)(x,a), for all (x,a) Xt A by its construction in (8). Now, using the assumption on Ψ(t), we have for any t [t], Eπ [Qˆπ(t+1) t (xt,π (xt)) Qˆπ(t+1) t (xt, ˆπ(t)(xt))] 2 Xt (max a A Qˆπ(t+1) t (x,a) ˆg(t)(x,a) )dπ (x)dν(x), 2α 1 Xt (max a A Qˆπ(t+1) t (x,a) ˆg(t)(x,a) ) max π Ψ(t) dπ(x)dν(x), 2α 1d Xt (max a A Qˆπ(t+1) t (x,a) ˆg(t)(x,a) )dunif(Ψ(t))(x)dν(x), ( Ψ(t) d) = 2α 1d Eunif(Ψ(t)) [max a A Qˆπ(t+1) t (xt,a) ˆg(t)(xt,a) ], (18) where the first inequality is by (17), the second inequality follows from the fact that Ψ(t) is an α-policy cover, the third inequality follows from the fact that Ψ(t) d, and the equality is by definition. Now, by Lemma D.2, we have that for any t [h], there is an absolute constant c > 0 (independent of t and other problem parameters) and an event Et of probability at least 1 δ/H under which the solution ˆg(t) of the least-squares regression problem on (6) of Algorithm 3 satisfies, Eunif(Ψ(t)) [max a A ˆg(t)(xt,a) Qˆπ(t+1) t (xt,a) ] c εstat(n, δ 2d H , (19) where the last inequality follows by the choice of n in the theorem s statement. Combining (19) with (18) establishes (16) under the event Et. Now, by the performance difference lemma (Lemma D.3), we have Eπ [ h t=1 rt(xt,at)] Eˆπ [ h t=1 rt(xt,at)] = h t=1 Eπ [Qˆπ(t+1) t (xt,π (xt)) Qˆπ(t+1) t (xt, ˆπ(t)(xt))]. Algorithm 6 Rep Learn(h,F,Φ,P,n): Representation Learning for Low-Rank MDPs [35] Require: Target layer h [H]. Discriminator class F. Feature class Φ. Policy distribution P (ΠM). Number of samples n N. 
1: Set εstat = O( cd2n 1 log( Φ /δ)) for sufficiently absolute constant c > 0 (see Appendix E). 2: Let ϕ(1) Φ be arbitrary. 4: for n times do 5: Sample π P. 6: Sample (xh,ah,xh+1) π h πunif. 7: Update dataset: D D {(xh,ah,xh+1)}. 8: Define LD(ϕ,w,f) = (x,a,x ) D(ϕ(x,a) w f(x ))2. 9: for t = 1,2,... do /* Discriminator selection */ f (t) arg max f F (f), where (f) = max ϕ Φ { min w B(3d3/2)LD(ϕ(t),w,f) min w B(2 d) LD( ϕ, w,f)}. 11: if (f (t)) 16dtε2 stat then 12: Return ϕ(t). /* Feature selection via least-squares minimization */ ϕ(t+1) arg min ϕ Φ min (w1,...,wt) B(2 t ℓ=1 LD(ϕ,wℓ,f (ℓ)). Thus, under the event E = h t=1 Et, we have that Eπ [ h t=1 rt(xt,at)] Eˆπ [ h t=1 rt(xt,at)] ε. The desired result follows by the fact that a union bound implies P[E] 1 δ. E Guarantee for Rep Learn In this section, we give a generic guarantee for Rep Learn (Algorithm 6). Compared to previous guarantees in Modi et al. [35], Zhang et al. [49], we prove a fast 1/n-type rate of convergence for Rep Learn, and show that the algorithm succeeds even when the norm of the weight w in Line 10 does not grow with the number of iterations. We also use the slightly simpler discriminator class: F = {f x max a A θ ϕ(x,a) θ B(1),ϕ Φ}. (20) The main guarantee for Rep Learn is as follows. Theorem E.1. Let h [H], δ (0,e 1), and n N be given, and suppose that µ h+1 satisfies the normalization assumption in Eq. (4). For any function f F, define wf = Xh+1 f(x)µ h+1(x)dν(x). Let P (ΠM) be a distribution over policies, F be as (20), and Φ be a feature class satisfying Assumption 2.2. With probability at least 1 δ, Rep Learn with input (h,F,Φ,P,n) terminates after t T = dlog3/2(2nd 1/2) iterations, and its output ϕ(t) satisfies sup f F inf w B(3d3/2)Eπ P Eπ hπunif [(w ϕ(t)(xh,ah) w fϕ h(xh,ah)) 2] ε2 Rep Learn(n,δ), (21) where ε2 Rep Learn(n,δ) = c Td3n 1 log( Φ /δ), for some sufficiently large absolute constant c > 0. To prove the theorem, we need a technical lemma, which follows from Modi et al. [35, Lemma 14]. Lemma E.1. Consider a call to Rep Learn(h,F,Φ,P,n) (Algorithm 6) in the setting of Theorem E.1. Further, let LD be as in Algorithm 6 and define (ϕ(t), w(t) 1 ,..., w(t) t 1) arg min ϕ Φ,(w1,...,wt 1) B(2 t 1 ℓ=1 LD(ϕ,wℓ,f (ℓ)). (22) For any δ (0,1), there is an event E(t)(δ) of probability at least 1 δ such that under E(t)(δ), if Algorithm 6 does not terminate at iteration t 1, then for w(ℓ) = wf (ℓ): t 1 ℓ=1 Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w(t) ℓ ϕ h(xh,ah) w(ℓ)) 2 ] tε2 stat(n,δ), (23) 2 B(d3/2)Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w ϕ h(xh,ah) w(t)) 2] > 8dtε2 stat(n,δ), where ε2 stat(n,δ) = cd2n 1 log( Φ /δ) and c 1 is a sufficiently large absolute constant. With this, we prove Theorem E.1. Proof of Theorem E.1. Let us abbreviate ε = εstat(n,δ), with εstat(n,δ) defined as in Lemma E.1. Further, let N = 1 + dlog3/2(2d3/2/ε) , δ = δ 2N , and define εstat = εstat(n,δ ). (24) Note that ε εstat and N 1 T, where T is the number of iterations in the theorem statement; the latter inequality follows by the facts that the absolute constant c in Lemma E.1 is at least 1 and log( Φ /δ) 1. We define an event E = E(1)(δ ) E(N)(δ ), where (Et( ))t are the success events in Lemma E.1. Note that P[E] 1 δ/2 by the union bound. Throughout this proof, we condition on the event E. To begin the proof, we define a sequence of vectors (v(ℓ) 1 d)ℓ 0 in an inductive fashion, with v(ℓ) i Rd for all i [d] and ℓ 0. For ℓ= 0, we let v(0) i = εei/d, for all i [d]. 
For ℓ 1, we consider two cases: J (ℓ) = {j [d] det(V (ℓ 1) j ,w(ℓ)) > (1 + C) det(V (ℓ 1)) } , where V (ℓ 1) = (v(ℓ 1) 1 ,...,v(ℓ 1) d ) Rd d and w(ℓ) = wf (ℓ), then we let j = arg minj J (ℓ) j and define v(ℓ) i = { w(ℓ), if i = j, v(ℓ 1) i , otherwise. Case II: If J (ℓ) = , we let v(ℓ) i = v(ℓ 1) i , for all i [d]. We first show that J (t) at any iteration t [N] where Rep Learn does not terminate. Let t [N] be an iteration where the algorithm does not terminate, and suppose that J (t) = . This means that j [d], det(V (t 1) j ,w(t)) (1 + C) det(V (t 1)) . (25) Now, since det(V (t 1)) 0 (note that det(V (t)) is non-decreasing with t), we have that span(V (t 1)) = Rd. Thus, there exist β1,...,βd R be such that w(t) = d i=1 βiv(t 1) i . By the linearity of the determinant and (25), we have j [d], (1 + C) det(V (t 1)) det(V (t 1) j ,w(t)) , = det(V (t 1) j , d i=1 βiv(t 1) i ) , = RRRRRRRRRRRR i [d] βi det(V (t 1) j ,v(t 1) i ) RRRRRRRRRRRR , = βj det(V (t 1)) . This implies that βj (1 + C) for all j [d]. Now, note that by the definition of (v(t 1) i ), we have that for any i [d] such that v(t 1) i εei/d, there exists ℓ [t 1] such that w(ℓ) = v(t 1) i . Let I(t) = {i [d] v(t 1) i εei/d}, and for any i I(t), let ℓi [t 1] be such that w(ℓi) = v(t 1) i . Further, define w(t) = i I(t) βiw(ℓi) = i I(t) βiv(t 1) i , (26) and note that by the triangle inequality and the fact that w(t) = d i=1 βiv(t 1) i , we have w(t) w(t) (1 + C)εstat. (27) Finally, with the notation in (22), define w(t) t = i I(t) βi w(t) ℓi , (28) and note that w(t) t (1 + C)B(2d3/2), since βi (1 + C) for all i [d], I(t) d, and w(t) ℓ B(2 d), for all ℓ [t 1]. Now, by Lemma E.1, in particular (23), we have i I(t) Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w(t) ℓi ϕ h(xh,ah) w(ℓi)) 2 ] t ε2 stat, (29) where εstat is as in (24). Using the expressions in Eqs. (26) and (28) with (29) and Jensen s inequality, we have that under E(t), Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w(t) t ϕ h(xh,ah) w(t)) 2 ] j I(t) βj i I(t) Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w(t) ℓi ϕ h(xh,ah) w(ℓi)) 2 ], (1 + C)dt ε2 stat. Now, using (27) and the facts that (a + b)2 2a2 + 2b2 and ϕ h 2 1, we have that Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w(t) t ϕ h(xh,ah) w(t)) 2 ] 2(1 + C)2ε2 + 2(1 + C)dt ε2 stat, 2(1 + C)2 ε2 stat + 2(1 + C)dt ε2 stat. Using that C = 1/2, we conclude that the right-hand side of this inequality is bounded by 8dt ε2 stat which is a contradiction, since w(t) t (1+C)B(2d3/2) = B(3d3/2) and by Lemma E.1, we must have inf w B(3d3/2)Eπ P Eπ hπunif [(ϕ(t)(xh,ah) w ϕ h(xh,ah) w(t)) 2] > 8t εstat if Rep Learn does not terminate at round t. Therefore, we have that J (t) , for any iteration t [2..N] where Rep Learn does not terminate. We now bound the iteration count and prove that the guarantee in Eq. (21) holds at termination. Note that whenever J (ℓ) for ℓ> 1, we have by construction: det(V (ℓ)) > 3/2 det(V (ℓ 1)) . Thus, if Rep Learn runs for t [2..N] iterations, then det(V (t)) > (3/2)t 1 det(V (1)) . (30) On the other hand, since the determinant of a matrix is bounded by the product of the norms of its columns and v(t) 1 d B(2 d), we have det(V (t)) 2ddd/2. Note also that det(V (0)) = (ε/d)d. Plugging this into (30), we conclude that (3/2)t 1 < (2d3/2/ε)d. Taking the logarithm on both sides and rearranging yields t < 1 + dlog3/2(2d3/2/ε) N. Thus, the algorithm must terminate after at most N 1 iterations. 
Furthermore, by [35, Lemma 14], we have that with probability at least $1-\frac{\delta}{2N}$, if the algorithm terminates at iteration $t$, then
$\max_{f\in\mathcal{F}}\ \inf_{w\in\mathcal{B}(3d^{3/2})}\ \mathbb{E}^{\pi\sim P}\mathbb{E}^{\pi\circ_h\pi_{\mathrm{unif}}}\big[(w^\top\phi^{(t)}(x_h,a_h) - w_f^\top\phi^\star_h(x_h,a_h))^2\big] \le 32\,t\,\bar\varepsilon^2_{\mathrm{stat}} \le 32(N-1)\,\bar\varepsilon^2_{\mathrm{stat}} \le 32\,T\,\bar\varepsilon^2_{\mathrm{stat}}.$
Applying a union bound completes the proof.

F Proof of Theorem 3.2

In this section, we prove the main guarantee for Span RL (Theorem 3.2). First, we outline our proof strategy in Appendix F.1. Then, in Appendix F.2 and Appendix F.3, we present guarantees for the instances of PSDP (Algorithm 3) and Robust Spanner (Algorithm 2) used within Span RL. We then combine these results in Appendix F.5 to complete the proof of Theorem 3.2. A self-contained guarantee for Robust Spanner (Lemma 3.1) is given in Appendix F.6.

F.1 Proof Strategy

Like our algorithm, our analysis is inductive. For fixed $h$, we assume that the policy sets $\Psi^{(1:h+1)}$ produced by Span RL satisfy the property:
$\Psi^{(1)},\dots,\Psi^{(h+1)}$ are $(\tfrac{1}{4Ad},0)$-policy covers for layers $1$ through $h+1$, and $\max_{t\in[h+1]}|\Psi^{(t)}|\le d$. (31)
Conditioned on this claim, we show that with high probability, the set $\Psi^{(h+2)}$ is a $(\tfrac{1}{4Ad},0)$-policy cover for layer $h+2$. To prove this, we use the inductive assumption to show that PSDP acts as an approximate linear optimization oracle over $\mathcal{W} = \{\mathbb{E}^\pi[\phi^{(h)}(x_h,a_h)] : \pi\in\Pi_{\mathrm{M}}\}$ (Appendix F.2). Using this, we then instantiate the guarantee for Robust Spanner from Lemma F.3, with Lin Opt and Lin Est instantiated by PSDP and Est Vec. To conclude the proof of the inductive step, we use the main guarantee for Robust Spanner together with the main guarantee for Rep Learn (Theorem E.1), along with a change-of-measure argument enabled by the assumption that $\Psi^{(1:h)}$ are policy covers (i.e., (31)).

F.2 Guarantee for PSDP as a Subroutine for Robust Spanner

We begin by showing that PSDP instantiates the approximate linear optimization oracle required by Robust Spanner. In particular, we fix a layer $h$, assume that $\Psi^{(1:h+1)}$ satisfy (31), and apply the results of Appendix D. More precisely, we need to show that, for any $\theta\in\mathbb{R}^d\setminus\{0\}$ and $\phi\in\Phi$, PSDP approximately solves
$\max_{\pi\in\Pi_{\mathrm{M}}}\ \theta^\top\mathbb{E}^\pi[\phi(x_h,a_h)]$. (32)
We can equivalently formulate (32) as, for fixed $\theta\in\mathbb{R}^d\setminus\{0\}$ and $\phi\in\Phi$, maximizing the sum of the reward functions $r_{1:h}(\cdot,\cdot;\theta,\phi)$ given by: for all $(x,a)\in\mathcal{X}\times\mathcal{A}$,
$r_t(x,a;\theta,\phi) = \phi(x,a)^\top\frac{\theta}{\|\theta\|}$ for $t = h$, and $r_t(x,a;\theta,\phi) = 0$ otherwise. (33)
Note that this matches the choice of reward functions in Span RL (Algorithm 1) at iteration $h$ with $\phi = \hat\phi^{(h)}$, the feature map returned by Rep Learn in Line 8.
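To make this reduction concrete, the following hedged sketch shows how one might wire PSDP and Est Vec into Robust Spanner as the Lin Opt and Lin Est oracles at layer h. The names `psdp` and `robust_spanner` refer to the illustrative sketches given earlier in this appendix, and all remaining arguments are placeholders for the sampling and regression access used by Span RL rather than part of Algorithm 1.

```python
import numpy as np

def make_linear_rewards(h, phi, theta):
    """Rewards r_{1:h}(., .; theta, phi) from Eq. (33): zero everywhere except layer h,
    where the reward is the projection of phi(x, a) onto the direction theta."""
    unit = theta / np.linalg.norm(theta)
    zero = lambda x, a: 0.0
    r_h = lambda x, a: float(np.dot(phi(x, a), unit))
    return {t: (r_h if t == h else zero) for t in range(1, h + 1)}

def spanner_at_layer(h, phi_hat, d, policy_covers, regress, rollout, actions,
                     n_psdp, n_est, sample_layer_h):
    """Compute (pi_1, ..., pi_d) whose feature expectations approximately span
    {E^pi[phi_hat(x_h, a_h)] : pi}. Everything other than h, phi_hat, d is an
    illustrative stand-in."""
    def lin_opt(theta):
        rewards = make_linear_rewards(h, phi_hat, theta)
        # PSDP returns a (partial) policy approximately maximizing
        # theta^T E^pi[phi_hat(x_h, a_h)]; the policy plays the role of the index z.
        return psdp(h, rewards, regress, rollout, policy_covers, n_psdp, actions)

    def lin_est(policy):
        # Monte Carlo estimate of E^pi[phi_hat(x_h, a_h)], as in Est Vec (Algorithm 4).
        samples = [phi_hat(*sample_layer_h(policy)) for _ in range(n_est)]
        return np.mean(samples, axis=0)

    return robust_spanner(lin_opt, lin_est, d, C=2.0)
```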
With these reward functions and the function class
$\mathcal{G} = \{g:(x,a)\mapsto\phi(x,a)^\top w : \phi\in\Phi,\ w\in\mathcal{B}(\sqrt{d})\},$ (34)
we show that the output $\hat\pi = \texttt{PSDP}(h,r_{1:h}(\cdot,\cdot;\theta,\phi),\mathcal{G},\Psi^{(1:h)},n)$ approximately solves (32) with high probability if $n\ge1$ is sufficiently large. We first verify that the class $\mathcal{G}$ realizes the reward functions specified in (33) in the sense of Definition D.1.

Lemma F.1. Under Assumption 2.2, the function class $\mathcal{G}$ in (34) realizes the reward functions in (33), for any $\phi\in\Phi$ and $\theta\in\mathbb{R}^d\setminus\{0\}$. Furthermore, functions in $\mathcal{G}$ are uniformly bounded by $\sqrt{d}$, and $\log N_{\mathcal{G}}(\varepsilon) \le \log|\Phi| + d\log(H/\varepsilon)$, where we recall that $N_{\mathcal{G}}(\varepsilon)$ denotes the $\varepsilon$-covering number of $\mathcal{G}$ in $\ell_\infty$ distance.

Proof. Fix $\phi\in\Phi$ and $\theta\in\mathbb{R}^d\setminus\{0\}$, and let $r_\ell(\cdot,\cdot)\equiv r_\ell(\cdot,\cdot;\theta,\phi)$ for $\ell\in[h]$. For $t=h$, we clearly have that for any $\pi\in\Pi^{h:h}_{\mathrm{M}}$, $Q^\pi_t(\cdot,\cdot) = r_t(\cdot,\cdot)\in\mathcal{G}$. For $t<h$ and $\pi\in\Pi^{t:h}_{\mathrm{M}}$, we have by the low-rank structure that
$Q^\pi_t(x,a) = \int_{\mathcal{X}_{t+1}}\mathbb{E}^\pi[r_h(x_h,a_h)\mid x_{t+1}=y,a_{t+1}=\pi(y)]\,\phi^\star_t(x,a)^\top\mu^\star_{t+1}(y)\,\mathrm{d}\nu(y) = \phi^\star_t(x,a)^\top\Big(\int_{\mathcal{X}_{t+1}}\mathbb{E}^\pi[r_h(x_h,a_h)\mid x_{t+1}=y,a_{t+1}=\pi(y)]\,\mu^\star_{t+1}(y)\,\mathrm{d}\nu(y)\Big).$ (35)
Now, by the fact that $\mathbb{E}^\pi[r_h(x_h,a_h)\mid x_{t+1}=y,a_{t+1}=\pi(y)]\in[-1,1]$ for all $y\in\mathcal{X}_{t+1}$ (since $\phi(\cdot,\cdot)\in\mathcal{B}(1)$ for all $\phi\in\Phi$), and the normalization assumption made on $(\mu^\star_h)_{h\in[H]}$ in Section 2.2 (i.e., that for all $g:\mathcal{X}_{t+1}\to[0,1]$, $\|\int_{\mathcal{X}_{t+1}}\mu^\star_{t+1}(y)g(y)\,\mathrm{d}\nu(y)\|\le\sqrt{d}$), we have that
$w_t := \int_{\mathcal{X}_{t+1}}\mathbb{E}^\pi[r_h(x_h,a_h)\mid x_{t+1}=y,a_{t+1}=\pi(y)]\,\mu^\star_{t+1}(y)\,\mathrm{d}\nu(y) \in \mathcal{B}(\sqrt{d}).$
This, together with (35) and the fact that $\phi^\star_t\in\Phi$ (by Assumption 2.2), implies that $Q^\pi_t\in\mathcal{G}$.

Combining Lemma F.1 with Theorem D.2 results in the following bound on the quality of PSDP as an approximate linear optimization oracle over the space of policies.

Corollary F.1. Let $\varepsilon,\delta\in(0,1)$ and $h\in[H]$. Further, let $\theta\in\mathbb{R}^d\setminus\{0\}$, $\phi\in\Phi$, and $\hat\pi = \texttt{PSDP}(h,r_{1:h}(\cdot,\cdot;\theta,\phi),\mathcal{G},\Psi^{(1:h)},n)$, where:
The reward functions $r_{1:h}(\cdot,\cdot;\theta,\phi)$ are as in (33).
The function class $\mathcal{G}$ is as in (34).
The collection of policies $\Psi^{(1:h)}$ satisfies (31).
The parameter $n$ is chosen such that $c\,HACd^2\,\varepsilon_{\mathrm{stat}}(n,\delta/H)\le\varepsilon$, where $\varepsilon_{\mathrm{stat}}(n,\delta') = \sqrt{dn^{-1}(d\log(nH) + \log(|\Phi|/\delta'))}$ and $c>0$ is some large enough absolute constant.
Then, under Assumption 2.2, with probability at least $1-\delta$, we have that
$\max_{\pi\in\Pi_{\mathrm{M}}}\theta^\top\mathbb{E}^\pi[\phi(x_h,a_h)] \le \theta^\top\mathbb{E}^{\hat\pi}[\phi(x_h,a_h)] + \varepsilon/2.$

We emphasize that the inductive assumption that $\Psi^{(1:h)}$ is a policy cover of bounded size enters only in the statement of Theorem D.2. We now give a guarantee for Robust Spanner as used in Span RL.

Algorithm 7 Est Vec$(h,F,\pi,n)$: Estimate $\mathbb{E}^\pi[F(x_h,a_h)]$ for a given policy $\pi$ and function $F$.
Require: Target layer $h\in[H]$. Vector-valued function $F:\mathcal{X}\times\mathcal{A}\to\mathbb{R}^d$. Policy $\pi\in\Pi_{\mathrm{M}}$. Number of samples $n\in\mathbb{N}$.
1: Initialize dataset $\mathcal{D} = \emptyset$.
2: for $n$ times do
3: Sample $(x_h,a_h)\sim\pi$.
4: Update dataset: $\mathcal{D}\leftarrow\mathcal{D}\cup\{(x_h,a_h)\}$.
5: Return: $\hat F = \frac{1}{n}\sum_{(x,a)\in\mathcal{D}} F(x,a)$.

F.3 Guarantee for Robust Spanner as a Subroutine for Span RL

In this section, we prove a guarantee for the instantiation of Robust Spanner in Span RL, which we require in the proof of the main theorem (Theorem 3.2). We first show that the Lin Est subroutine passed to Robust Spanner can be taken to be Est Vec (Algorithm 7), which simply estimates the expected feature embedding of $(x_h,a_h)$ under policy $\pi$ by sampling sufficiently many trajectories and taking the empirical mean.

Lemma F.2 (Guarantee for Est Vec). Let $\delta\in(0,1)$ and $\varepsilon>0$. For $h\in[H]$, $\phi\in\Phi$, $\pi\in\Pi_{\mathrm{M}}$, and $n\in\mathbb{N}$ such that $n\ge c\,\varepsilon^{-2}\log(d/\delta)$ for some large enough absolute constant $c>0$, the output $\hat\phi_h = \texttt{EstVec}(h,\phi,\pi,n)$ (Algorithm 7) satisfies, with probability at least $1-\delta$, $\|\hat\phi_h - \mathbb{E}^\pi[\phi(x_h,a_h)]\|\le\varepsilon/2$.

Proof. By Hoeffding's inequality (see for example [22, Corollary 7]) and the fact that $\|\phi(x,a)\|\le1$ for all $x\in\mathcal{X}$ and $a\in\mathcal{A}$, there exists an absolute constant $c>0$ such that with probability at least $1-\delta$,
$\|\hat\phi_h - \mathbb{E}^\pi[\phi(x_h,a_h)]\| \le c\sqrt{\frac{\log(d/\delta)}{n}}.$
Setting $n$ as in the statement of the lemma concludes the proof.

In Span RL, we instantiate Robust Spanner by passing PSDP as Lin Opt and Est Vec as Lin Est. Combining Corollary F.1 and Lemma F.2 with the general guarantee for Robust Spanner in Proposition C.1, we have the following result.

Lemma F.3. Consider iteration $h\in[H]$ of Span RL$(\Phi,\varepsilon,c,\delta)$ (Algorithm 1) with $\varepsilon,c>0$, $\delta\in(0,1)$, and feature class $\Phi$ satisfying Assumption 2.2. Further, let $\hat\phi^{(h)}$ denote the feature map returned by Rep Learn in Algorithm 1 at iteration $h$. If $\Psi^{(1:h)}$ satisfy (31) and $c = \mathrm{polylog}(A,d,H,\log(|\Phi|/\delta))$ is large enough, then there is an event $\mathcal{E}_h$ of probability at least $1-\frac{\delta}{2H}$ such that:
The number of iterations of Robust Spanner in Line 12 of Algorithm 1 is at most $N = \frac{d}{2}\log_2\!\big(\frac{100d}{\varepsilon}\big)$, and
The output $(\pi_1,\dots,\pi_d)$ of Robust Spanner has the property that for all $\pi\in\Pi_{\mathrm{M}}$, there exist $\beta_1,\dots,\beta_d\in[-2,2]$ such that $\big\|\hat\phi^{(h),\pi} - \sum_{i=1}^d\beta_i\hat\phi^{(h),\pi_i}\big\|\le 3d\varepsilon$, where $\hat\phi^{(h),\pi} := \mathbb{E}^\pi[\hat\phi^{(h)}(x_h,a_h)]$.
Proof.
By Proposition C.1, on the event that the instances of PSDP and Est Vec used by Robust Spanner satisfy Assumption C.1 with ε = ε 2, the two desiderata of the lemma hold.10 We 10Here, we instantiate the guarantee in Proposition C.1 with C = 2; this is what C is set to in Algorithm 1. claim that each call to PSDP and to Est Vec satisfies Assumption C.1 with probability at least 1 δ 8d NH . Because each of PSDP and Est Vec get called at most 4d N times per iteration of Robust Spanner, a union bound concludes the proof contingent on the above claim. We now prove the claim. First, note that the instance of PSDP that Robust Spanner uses within Algorithm 1 is of the form: PSDP(h,r1 h( , ,θ),G,Ψ1 h,n PSDP) with r1 h and G as in Algorithm 1; this matches the form in Corollary F.1 (PSDP s guarantee) with ϕ = ˆϕ(h). Thus, by choosing n PSDP = c A2d5H2 (dlog(H) + log(8d H2N Φ /δ)) for c = polylog(A,d,H,log( Φ /δ)) sufficiently large, the conditions of Corollary F.1 are satisfied, and its conclusion implies the claim for the PSDP instance used by Robust Spanner. Similarly, the choice of n Est Vec in Algorithm 1 ensures that the claim holds for the instance of Est Vec that Robust Spanner uses by Lemma F.2. The result follows. F.4 Guarantee for Rep Learn as a Subroutine for Span RL In this section, we prove a guarantee for the invocation of Rep Learn within Span RL Recall that P (h) = unif(Ψ(h)) is the distribution over policies that Span RL passes to Rep Learn at iteration h [H 2] to compute feature map ϕ(h). Thus, by invoking Theorem E.1 in Appendix E and using the choice of n Rep Learn in Algorithm 1, we immediately obtain the following corollary. Corollary F.2. Let δ,ε (0,1), and F be as in Algorithm 1, and fix h [H 2]. Suppose that the feature class Φ satisfies Assumption 2.2. Then, with probability at least 1 δ 2H , the instance of Rep Learn in Line 9 of Algorithm 1 runs for t c d iterations for c = polylog(A,d,H,log( Φ /δ)) sufficiently large, and returns output ϕ(h) such that for all f F, there exists w(h) f B(3d3/2) satisfying Eunif(Ψ(h)) [ a A (ϕ(h)(xh,a) w(h) f ϕ h(xh,a) wf) 2 ] η2 where wf = Xh+1 f(y)µ h+1(y)dν(y). F.5 Concluding the Proof of Theorem 3.2 In this section, we conclude the proof of the main guarantee (Theorem 3.2). We derive the guarantee from the following inductive claim. Theorem F.1. Consider iteration h [H] of Span RL(Φ,ε,c,δ) (Algorithm 1) with parameters ε,c > 0, δ (0,1) and a feature class Φ satisfying Assumption 2.2. Further, assume that: The collection of policies Ψ(1 h+1) at the start of the hth iteration of Span RL satisfy (31). Assumption 2.1 (reachability) holds with η > 0. The input parameter ε to Span RL is set to ε = η 36d5/2 . The input parameter c = polylog(A,d,H,log( Φ /δ)) is sufficiently large. Then, with probability at least 1 δ H , the set of policies Ψ(h+2) produced by Span RL(Φ,ε,c,δ) at the end of iteration h is an ( 1 4Ad,0)-policy cover for layer h + 2. With this, we can now prove Theorem 3.2. Proof of Theorem 3.2. Note that it suffices to prove that (31) holds for h = H 1 with probability at least 1 δ. To do this, we proceed by induction over h = 1,...,H 1. The base case of h = 1 trivially holds because Ψ(1) = and Ψ(2) = {πunif}. The induction step now follows by Theorem F.1 and the union bound (see e.g. [30, Lemma I.2]). The number of trajectories used by Span RL is dominated by calls to PSDP. 
Since PSDP is called O(dlog(d/ε)) times at each iteration of Span RL (Lemma F.3), and each call to PSDP requires at most Hn PSDP trajectories, the total number of trajectories after H iterations of Span RL is bounded by O(H2dn PSDP). By plugging the choices for n PSDP and ε from the theorem statement, we obtain the claimed sample complexity. Before proving Theorem F.1, we make the following simple observation. Lemma F.4. For any π ΠM, h [H 1], any x Xh+1, we have µ h+1(x) Eπ[ϕ h(xh,ah)] = dπ(x) 0. Proof of Lemma F.4. The equality follows by construction. The non-negativity of dπ(x) follows by definition of a probability density. We now prove Theorem F.1. Proof of Theorem F.1. Let Eh and E h denote the success events in Lemma F.3 and Corollary F.2, respectively, and note that by the union bound, we have P[Eh E h] 1 δ/H. For the rest of this proof, we will condition on E = Eh E h. Throughout, we denote ϕ ,π h = Eπ[ϕ h(xh,ah)], h [H], π ΠM. Because Ψ(1 h+1) satisfy (31) (i.e., are a policy cover) it holds by Lemma F.4 that for all x Xh, max π Ψ(h) µ h(x) ϕ ,π h 1 α sup π ΠM µ h(x) ϕ ,π h 1, for α = 1 4Ad. (36) We will show that with probability at least 1 δ H , the policy set Ψ(h+2) has the same property for layer h + 2; that is, for all x Xh+1, max π Ψ(h+2) µ h+2(x) ϕ ,π h+1 α sup π ΠM µ h+2(x) ϕ ,π h+1. (37) Again, by Lemma F.4 this is equivalent to the statement that Ψ(h+2) is an ( 1 4Ad,0)-policy cover for layer h + 2. For the remainder of the proof, we will fix x Xh+2 and let πx arg maxπ ΠM µ h+2(x) ϕ ,π h+1. Our goal is to show that the inequality Eq. (37) holds for x. Preliminaries. Note that since x Xh+2, we have µ h+2(x) > 0. It will be convenient to introduce a function f Xh+1 R defined by f(y) = θ xϕ h+1(y,πx(y)), where θx = µ h+2(x) µ h+2(x) . Further, we define wx = Xh+1 f(y)µ h+1(y)dν(y). (38) By definition of πx, we have that for all y Xh+1, θ xϕ h+1(y,πx(y)) = max a A θ xϕ h+1(y,a). This together with the fact that θx = 1 implies that f F = {x max a A θ ϕ(x,a) θ B(1),ϕ Φ}; (39) the discriminator class in Line 4 of Span RL. Note also that since x Xh+2, we have by reachability that w xϕ ,πx h = θ xϕ ,πx h+1 = 1 µ h+2(x) max π ΠM µ h+2(x) ϕ ,π h+1 η > 0. (40) Applying the guarantee for Rep Learn. Moving forward, let ϕ(h) be the feature map returned by Rep Learn at the hth iteration of Algorithm 1, and define ϕ(h),π = Eπ[ϕ(h)(xh,ah)], for any π ΠM. Further, let w(h) x be the vector w(h) f in Corollary F.2 with f = fx, and note that w(h) x 3d3/2. (41) By Jensen s inequality, we compute ( w(h) x ϕ(h),πx wx ϕ ,πx h ) 2 Eπx [(ϕ(h)(xh,ah) w(h) x ϕ h(xh,ah) wx) 2], (Jensen s inequality) = Xh (ϕ(h)(y,πx(y)) w(h) x ϕ h(y,πx(y)) wx) 2 µ h(y) ϕ ,πx h 1 dν(y), (Low-Rank MDP) α 1 max π Ψ(h) Xh (ϕ(h)(y,πx(y)) w(h) x ϕ h(y,πx(y)) wx) 2 µ h(y) ϕ , π h 1dν(y), (by (36)) α 1 π Ψ(h) Xh (ϕ(h)(y,πx(y)) w(h) x ϕ h(y,πx(y)) wx) 2 µ h(y) ϕ , π h 1dν(y), (by Lemma F.4) α 1 π Ψ(h) a A Xh (ϕ(h)(y,a) w(h) x ϕ h(y,a) wx) 2 µ h(y) ϕ , π h 1dν(y), = Aα 1d Eunif(Ψ(h)) [(ϕ(h)(xh,ah) w(h) x ϕ h(xh,ah) wx) 2], (42) where the last step follows by the definition of Ψ(h) in Algorithm 1 and that Ψ(h) = d. Now, since wx = Xh+1 f(y)µ h+1(y)dν(y) (see (38)) and f F (see (39)); the guarantee for Rep Learnin Corollary F.2 together with (42) implies that (conditioned on the event E) w(h) x ϕ(h),πx wx ϕ ,πx h Applying the guarantee for Robust Spanner. 
Letting π1,...,πd be the policies returned by Robust Spanner at iteration h of Span RL, the guarantee of Robust Spanner in Lemma F.3 implies that there exist β1,...,βd [ 2,2] such that ϕ(h),πx d i=1 βiϕ(h),πi 3dε η 12d3/2 , (44) where the last inequality follows by the fact that ε = η 36d5/2 . Combining (44) with (43) and using the triangle inequality, we get that w xϕ ,πx h d i=1 βiw xϕ ,πi h + w(h) x η 12d3/2 + η d i=1 βiw xϕ ,πi h + η 4, (by (41)) 2dmax i [d] w xϕ ,πi h + η Combining this with (40) and rearranging implies w xϕ ,πx h 4d max i [d] w xϕ ,πi h . (45) On the other hand, by definition of wx, we have max i [d] w xϕ ,πi h = max i [d] θ xϕ ,πi h+1πx h+1 , = 1 µ h+2(x) max i [d] Eπi h+1πx [µ h+2(x) ϕ h+1(xh+1,ah+1)], A µ h+2(x) max i [d] Eπi h+1πunif [µ h+2(x) ϕ h+1(xh+1,ah+1)], (see below) = A µ h+2(x) max π Ψ(h+2) µ h+2(x) ϕ ,π h+1, (46) where the inequality follows from the non-negativity of µ h+1( ) ϕ h+1(x,a), for all (x,a) Xh A (due to Lemma F.4), and (46) follows from the definition of Ψ(h+2) in Line 13 of Algorithm 1. Combining (45) and (46) then implies that 1 µ h+2(x) µ h+2(x) ϕ ,πx h+1 = θ xϕ ,πx h+1 = w xϕ ,πx h 4d max i [d] w xϕ ,πi h , 4Ad µ h+2(x) max π Ψ(h+2) µ h+2(x) ϕ ,π h+1. This, together with Lemma F.4, implies that (37) holds. Since this argument holds uniformly for all x Xh+2, this completes the proof. F.6 Proof of Lemma 3.1 By definition for x Xh+1, we have dπ(x) = Eπ [µ h+1(x) ϕ h(xh,ah)]. Let πx denote the policy maximizing dπ(x) (if no such maximizer exists, we may pass to a maximizing sequence) and let Ψ = {π1,...,πd}. Then, we have for some β1,...,βd [ C,C], dπx(x) = µ h+1(x) ( d i=1 βiϕ ,πi h ) + µ h+1(x) (ϕ ,πx h d i=1 βiϕ ,πi h ), Cd max i [d] µ h+1(x) ϕ ,πi h + ε µ h+1(x) , (Cauchy-Schwarz) Cd max i [d] µ h+1(x) ϕ ,πi h + 1 where the inequality follows by the fact that Assumption 2.1 holds with ε η/2. The result now follows by rearranging. G Application to Reward-Based RL In this section, we explain how the output Ψ(1 H) of Span RL (Algorithm 1) can be used to optimize downstream reward functions r1 H; our treatment is standard. Since the output of Span RL is a policy cover, one way to optimize the sum of rewards SH = H h=1 rh is by first generating trajectories using policies in Ψ(1 H), then applying an offline RL algorithm, e.g. Fitted Q-Iteration (FQI) [16], to optimize SH. It is also possible to use PSDP with the policy cover Ψ(1 H) to achieve the same goal. We will showcase the latter approach since we have already stated a guarantee for PSDP. As in Appendix D, we assume access to a function class G {g X A R} that realizes the rewards r1 H in the following sense: for all h [H] and all π Πh H M , Qπ h G, where Qπ h(x,a) = rh(x,a) + Eπ [ H t=h+1 rt(xt,at) xh = x,ah = a]. Note that when the reward functions r1 H are linear in the feature map ϕ h; that is, when for all h [H] and (x,a) Xh A, rh(x,a) = θ hϕ h(x,a) for some θh B(1) (a common assumption in the context of RL in Low-Rank MDPs [32, 31, 49, 35]), then the function class G = {g (x,a) ϕ(x,a) w ϕ Φ,w B(2H realizes r1 H. Note that G is the same function class we used for the PSDP subroutine in Algorithm 3, albeit with a larger ball for the w s. For the sake of generality, we state the next result (which shows how to use a policy cover to optimize a downstream reward function) for general r1 H and a function class G that realizes r1 H. Theorem G.1. Let ε > 0 and δ (0,1). 
Suppose reward functions $r_{1:H}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$, a collection of policies $\Psi^{(1:H)}$, and a parameter $n\ge1$ satisfy the following:
The function class $\mathcal{G}$ realizes the reward functions $r_{1:H}$ (in the sense of Definition D.1), and $\lim_{n\to\infty}n^{-1}\log N_{\mathcal{G}}(1/n) = 0$, where $N_{\mathcal{G}}(1/n)$ denotes the $\frac1n$-covering number of $\mathcal{G}$ in the supremum norm. Furthermore, we suppose that functions in $\mathcal{G}$ are uniformly bounded by $H\sqrt{d}$.
For some $0<\alpha\le1$ and each $1\le h\le H$, it holds that $\Psi^{(h)}$ is an $\alpha$-policy cover for layer $h$, and moreover $|\Psi^{(h)}|\le d$.
The parameter $n$ is chosen such that $c\,dH\alpha^{-1}\varepsilon_{\mathrm{stat}}(n,\delta/H)\le\varepsilon$, where $\varepsilon_{\mathrm{stat}}(n,\delta') = \sqrt{dH^2 n^{-1}(\log N_{\mathcal{G}}(1/n) + \log(1/\delta'))}$ and $c>0$ is a large enough absolute constant.
Then, with probability at least $1-\delta$, the policy $\hat\pi = \texttt{PSDP}(H,r_{1:H},\mathcal{G},P^{(1:H)},n)$ (where $P^{(t)} = \mathrm{unif}(\Psi^{(t)})$ for each $t\in[H]$) produced by Algorithm 3 satisfies the following guarantee:
$\max_{\pi\in\Pi_{\mathrm{M}}}\mathbb{E}^\pi\big[\sum_{h=1}^{H} r_h(x_h,a_h)\big] \le \mathbb{E}^{\hat\pi}\big[\sum_{h=1}^{H} r_h(x_h,a_h)\big] + \varepsilon.$
Moreover, the number of episodes used by PSDP in this case is $O\big(A^2 d^5 H^5(\log N_{\mathcal{G}}(\varepsilon) + \log(1/\delta))\,\varepsilon^{-2}\big)$.

Proof. This is simply a restatement of Theorem D.2 with $h = H$. The number of trajectories follows from the fact that each call to PSDP requires $Hn$ trajectories.

H Properties of Reachability Assumption

In this section, we compare $\eta$-reachability (Assumption 2.1) to different reachability assumptions used in the literature in the context of RL in Low-Rank MDPs, and show that ours is the weakest among those commonly assumed. In Appendix H.1, we demonstrate an exponential separation between our notion of reachability and that considered with respect to the popular latent variable model [1, 35]. In Appendix H.2, we consider a number of other reachability assumptions made outside the latent variable model and show how they imply Assumption 2.1.

H.1 Comparison to Latent Variable Model

In this subsection, we show that our reachability assumption is implied by a reachability assumption used by [1, 35] in the latent variable/non-negative feature model, and show that our reachability assumption can hold even when the best possible latent variable embedding dimension is exponential in the dimension $d$. We begin by defining the latent variable model.

Definition H.1 (Latent variable representation). Given a transition operator $T:\mathcal{X}\times\mathcal{A}\to\Delta(\mathcal{X})$, a latent variable representation consists of a countable latent space $\mathcal{Z}$ and functions $\psi:\mathcal{X}\times\mathcal{A}\to\Delta(\mathcal{Z})$ and $q:\mathcal{Z}\to\Delta(\mathcal{X})$ such that $T(\cdot\mid x,a) = \sum_{z\in\mathcal{Z}} q(\cdot\mid z)\,\psi(z\mid x,a)$. The latent variable dimension of $T$, denoted $d_{\mathrm{LV}}$, is the cardinality of the smallest latent space $\mathcal{Z}$ for which $T$ admits a latent variable representation.

The interpretation of the latent variable model is as follows: 1. Each $(x,a)$ pair induces a distribution $\psi(x,a)\in\Delta(\mathcal{Z})$ over $z\in\mathcal{Z}$. 2. The latent variable is sampled as $z\sim\psi(x,a)$. 3. The next state is sampled as $x'\sim q(\cdot\mid z)$.

Note that in discrete state spaces, all transition operators admit a trivial latent variable representation, as we may take $\psi(x,a) = T(\cdot\mid x,a)$, but the dimension of such a representation is potentially infinite. A latent variable representation certifies that there exists a factorization $T(x'\mid x,a) = \psi(x,a)^\top q(x')$ with embedding dimension $|\mathcal{Z}|$, and so the rank of $T$ is at most $d_{\mathrm{LV}}$; hence the latent variable dimension gives an upper bound on the rank of the transition operator. On the other hand, compared with the general low-rank factorization, the latent variable factorization additionally requires that $\psi(x,a)$ and $q(\cdot\mid z)$ are probability distributions, and thus non-negative, for all $z\in\mathcal{Z}$ and $(x,a)\in\mathcal{X}\times\mathcal{A}$, implying that $d_{\mathrm{LV}}$ is equivalent to the non-negative rank [1] of the transition operator.
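As a small self-contained numerical illustration of Definition H.1 (not part of the paper's constructions), the following snippet builds a latent variable representation $(\psi, q)$ for a toy discrete transition operator and checks that the induced $T$ is row-stochastic with ordinary rank at most $|\mathcal{Z}|$; the sizes and random seed are arbitrary. Note that the toy does not exhibit a gap between rank and non-negative rank, which is the point of Theorem H.3 below.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sa, n_next, n_latent = 6, 5, 3        # |X x A|, |X'|, |Z| (toy sizes)

# psi[(x,a), z]: distribution over the latent z for each state-action pair
psi = rng.dirichlet(np.ones(n_latent), size=n_sa)
# q[z, x']: emission distribution over next states for each latent z
q = rng.dirichlet(np.ones(n_next), size=n_latent)

# Latent variable representation: T(x' | x, a) = sum_z psi(z | x, a) * q(x' | z)
T = psi @ q

assert np.allclose(T.sum(axis=1), 1.0)  # each row of T is a probability distribution
print("rank(T) =", np.linalg.matrix_rank(T), "<= |Z| =", n_latent)
```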
Assuming that a latent variable representation exists, [1, 35] consider the following notion of reachability.

Definition H.2 (Reachability in the latent variable model). There exists $\eta>0$ such that for all $h\in[H-1]$ and all $z\in\mathcal{Z}_{h+1}$, $\sup_{\pi\in\Pi_{\mathrm{M}}}\mathbb{P}^\pi[z_{h+1}=z]\ge\eta$. (47)

We first show that the latent variable reachability condition above implies our more general assumption.

Lemma H.1. Consider a Low-Rank MDP $M$ with rank $d\ge1$. Under the latent variable model in Definition H.1, if the latent variable reachability condition in (47) is satisfied for some $\eta>0$, then, for all $h\in[H]$, the transition kernel $T_h$ in $M$ admits a factorization $T_h(\cdot\mid x,a) = \bar\mu_{h+1}(\cdot)^\top\bar\phi_h(x,a)$, where $\bar\mu_{h+1}(\cdot)\in\mathbb{R}^{d_{\mathrm{LV}}}$ and $\bar\phi_h(\cdot,\cdot)\in\mathbb{R}^{d_{\mathrm{LV}}}$, such that $d_{\mathrm{LV}}\le dA^2/\eta^2$ and $\frac{\eta^2}{\sqrt{d}}$-reachability (in the sense of Assumption 2.1) is satisfied.

Proof of Lemma H.1. Suppose that the latent variable reachability condition (47) is satisfied with parameter $\eta>0$. By Agarwal et al. [1, Proposition 4], the non-negative rank of $M$ is bounded as $d_{\mathrm{LV}}\le dA^2/\eta^2$. Letting $q$ and $\psi$ be as in the definition of the latent variable representation in Definition H.1, we define $\bar\mu_{h+1}$ and $\bar\phi_h$ as: for all $h\in[H-1]$, $\bar\mu_{h+1}(\cdot) = (q(\cdot\mid z))_{z\in\mathcal{Z}}\in\mathbb{R}^{d_{\mathrm{LV}}}$ and $\bar\phi_h(\cdot,\cdot) = (\psi(z\mid\cdot,\cdot))_{z\in\mathcal{Z}}\in\mathbb{R}^{d_{\mathrm{LV}}}$. Now, fix $h\in[H-1]$ and $x\in\mathcal{X}_{h+1}$. For $z_0\in\arg\max_{z\in\mathcal{Z}_{h+1}} q(x\mid z)$, we have
$\sup_{\pi\in\Pi_{\mathrm{M}}} d^\pi(x) = \sup_{\pi\in\Pi_{\mathrm{M}}}\mathbb{P}^\pi[x_{h+1}=x] = \sup_{\pi\in\Pi_{\mathrm{M}}}\sum_{z\in\mathcal{Z}_{h+1}} q(x\mid z)\,\mathbb{E}^\pi[\psi(z\mid x_h,a_h)] \ge \sup_{\pi\in\Pi_{\mathrm{M}}} q(x\mid z_0)\,\mathbb{E}^\pi[\psi(z_0\mid x_h,a_h)] = \|\bar\mu_{h+1}(x)\|_\infty\,\sup_{\pi\in\Pi_{\mathrm{M}}}\mathbb{P}^\pi[z_{h+1}=z_0] \ge \eta\,\|\bar\mu_{h+1}(x)\|_\infty \quad(\text{using reachability}) \ge \frac{\eta}{\sqrt{d_{\mathrm{LV}}}}\,\|\bar\mu_{h+1}(x)\|.$

We now complement the result above by showing that there exist Low-Rank MDPs for which our notion of reachability (Assumption 2.1) is satisfied with $\eta$ polynomially small, yet the best possible latent variable embedding has dimension $d_{\mathrm{LV}} = 2^{\Omega(d)}$. This contrasts with the results in [1, Proposition 2], which show that latent variable reachability implies a polynomial bound on the latent variable dimension.

Theorem H.3. There exists a one-step Low-Rank MDP of rank $d\ge1$ where $\eta$-reachability (Assumption 2.1) is satisfied with $\eta = \frac{1}{2\sqrt{d}}$, but where the non-negative rank satisfies $d_{\mathrm{LV}} = 2^{\Omega(d)}$.

Proof of Theorem H.3. Let $n\in\mathbb{N}$ and $d = \binom{n}{2}+1$. As shown in the proof of Agarwal et al. [1, Proposition 2], there exists a horizon-two MDP $M$ with the following properties:
The state spaces $\mathcal{X}_1$ and $\mathcal{X}_2$ at layers 1 and 2, respectively, are finite.
The cardinality of $\mathcal{A}$ is $d$; i.e., $\mathcal{A} = \{a_1,\dots,a_d\}$.11
The transition kernel $T_1$ admits the factorization $T_1(\cdot\mid x,a) = \mu^\star_2(\cdot)^\top\phi^\star_1(x,a)\in\Delta(\mathcal{X}_2)$ for all $(x,a)\in\mathcal{X}_1\times\mathcal{A}$, where $\mu^\star_2(x')\in\mathbb{R}^d_{\ge0}$ for all $x'\in\mathcal{X}_2$, and $\phi^\star_1(x,a)\in\mathbb{R}^d_{\ge0}$ for all $(x,a)\in\mathcal{X}_1\times\mathcal{A}$.
The non-negative rank of $M$ is $d_{\mathrm{LV}} = 2^{\Omega(d)}$.
We augment this MDP by adding an extra state $x_0$, and let $\mathcal{X}'_1 = \mathcal{X}_1\cup\{x_0\}$. We define $\bar\phi_1:\mathcal{X}'_1\times\mathcal{A}\to\mathbb{R}^d_{\ge0}$ to be the extension of $\phi^\star_1$ given by $\bar\phi_1(x_0,a_i) = e_i$ for all $i\in[d]$, and $\bar\phi_1(x,a_i) = \phi^\star_1(x,a_i)$ for all $x\in\mathcal{X}_1$, where $e_i$ is the $i$th basis element of $\mathbb{R}^d$. We define the initial state distribution to have $\rho(x_0) = \frac12$ and $\rho(x) = \frac{1}{2|\mathcal{X}_1|}$ for all $x\in\mathcal{X}_1$.12 We let $M' = (\mathcal{X}'_1\cup\mathcal{X}_2,\mathcal{A},\bar\phi_1,(\mu^\star_h)_{h\in[2]},\rho)$ denote the resulting MDP.
11Technically, the example in the proof of [1, Proposition 2] does not explicitly specify the number of actions. Instead, the example assigns a number of state-action pairs to vectors in $\mathbb{R}^d$, without specifying the number of actions. The number of actions in their example is a degree of freedom, which we set to $d$ here without loss of generality.
Note that adding an extra state at layer 1 in this fashion only adds $d$ additional rows to the transition matrix $T$ (viewed as a $(|\mathcal{X}_1|\cdot|\mathcal{A}|)\times|\mathcal{X}_2|$ matrix). Therefore, the non-negative rank of $M'$ is at least that of $M$. We now show that reachability is satisfied in $M'$. Let $\pi_i$ denote the policy that always plays action $a_i$.
With this, we have that for any $x'\in\mathcal{X}_2$,
$\sup_{\pi\in\Pi_{\mathrm{M}}} d^\pi(x') \ge \max_{i\in[d]} d^{\pi_i}(x') = \max_{i\in[d]}\mu^\star_2(x')^\top\mathbb{E}[\bar\phi_1(x_1,a_i)] = \max_{i\in[d]}\big\{\mathbb{E}[\mathbb{I}\{x_1=x_0\}\,\mu^\star_2(x')^\top\bar\phi_1(x_1,a_i)] + \mathbb{E}[\mathbb{I}\{x_1\ne x_0\}\,\mu^\star_2(x')^\top\bar\phi_1(x_1,a_i)]\big\} \ge \max_{i\in[d]}\rho(x_0)\,\mu^\star_2(x')^\top\bar\phi_1(x_0,a_i),$ (48)
where the last inequality follows by the fact that, for all $(x,a)\in\mathcal{X}_1\times\mathcal{A}$, $\mu^\star_2(x')^\top\bar\phi_1(x,a) = \mu^\star_2(x')^\top\phi^\star_1(x,a) \ge 0$ (since $\mu^\star_2(\cdot)^\top\phi^\star_1(x,a)$ is a conditional density). On the other hand, from the construction of $\bar\phi_1$ and the fact that $\mu^\star_2(x')\in\mathbb{R}^d_{\ge0}$, we have
$\max_{i\in[d]}\mu^\star_2(x')^\top\bar\phi_1(x_0,a_i) = \|\mu^\star_2(x')\|_\infty \ge \|\mu^\star_2(x')\|/\sqrt{d}.$
Combining this with (48) and using that $\rho(x_0) = 1/2$ implies that $\frac{1}{2\sqrt{d}}$-reachability is satisfied in $M'$.
12We note that [1] did not specify the initial distribution, which is not needed for the conclusion of their result.

H.2 Relation to Other Reachability Assumptions

In this subsection, we show that Assumption 2.1 is implied by a notion of feature coverage used in the context of transfer learning in Low-Rank MDPs [5], as well as by a notion of explorability used in the context of reward-free RL in linear MDPs [46].

H.2.1 Feature Coverage

We first consider the coverage condition used by Agarwal et al. [5], which involves the second moments of the feature map $\phi^\star_h$.

Definition H.4 ($\eta$-feature coverage). We say that the linear MDP with featurization $\phi^\star_h$ satisfies $\eta$-feature coverage if for all $h\in[H]$,
$\sup_{\pi\in\Pi_{\mathrm{M}}}\lambda_{\min}\big(\mathbb{E}^\pi[\phi^\star_h(x_h,a_h)\phi^\star_h(x_h,a_h)^\top]\big) \ge \eta.$

We show that $\eta$-feature coverage implies $(\eta/2)^{3/2}$-reachability. Thus, up to polynomial dependence, $\eta$-feature coverage is a special case of Assumption 2.1.

Lemma H.2. Suppose that an MDP satisfies $\eta$-feature coverage as in Definition H.4 for some $\eta>0$. If $\phi^\star_h(x,a)\in\mathcal{B}(1)$ for all $x,a$, then the MDP satisfies $(\eta/2)^{3/2}$-reachability in the sense of Assumption 2.1.

Proof of Lemma H.2. Let $h\in[H]$ and $x\in\mathcal{X}_{h+1}$ be given, and define $\theta = \frac{\mu^\star_{h+1}(x)}{\|\mu^\star_{h+1}(x)\|}$. To keep notation compact, we write $\phi^\star_h \equiv \phi^\star_h(x_h,a_h)$. By $\eta$-feature coverage, there exists $\pi\in\Pi_{\mathrm{M}}$ such that
$\eta \le \mathbb{E}^\pi[(\theta^\top\phi^\star_h)^2] = \mathbb{E}^\pi[\mathbb{I}\{(\theta^\top\phi^\star_h)^2 < \eta/2\}\,(\theta^\top\phi^\star_h)^2] + \mathbb{E}^\pi[\mathbb{I}\{(\theta^\top\phi^\star_h)^2 \ge \eta/2\}\,(\theta^\top\phi^\star_h)^2] \le \eta/2 + \mathbb{P}^\pi[(\theta^\top\phi^\star_h)^2 \ge \eta/2],$ (49)
where we have used that $\|\theta\| = 1$ and $\|\phi^\star_h(x,a)\|\le1$ for all $(x,a)\in\mathcal{X}_h\times\mathcal{A}$. Rearranging (49) and using that $\theta^\top\phi^\star_h\ge0$ (it is a scaled conditional density), we have
$\mathbb{P}^\pi\big[\theta^\top\phi^\star_h \ge \sqrt{\eta/2}\big] = \mathbb{P}^\pi\big[(\theta^\top\phi^\star_h)^2 \ge \eta/2\big] \ge \eta/2.$
Now, by Markov's inequality, we have that
$\theta^\top\phi^{\star,\pi}_h = \mathbb{E}^\pi[\theta^\top\phi^\star_h] \ge \sqrt{\eta/2}\cdot\mathbb{P}^\pi\big[\theta^\top\phi^\star_h \ge \sqrt{\eta/2}\big] \ge (\eta/2)^{3/2},$
where we have once more used that $\theta^\top\phi^\star_h\ge0$ almost surely.

H.2.2 Explorability

We now consider the explorability assumption of [46], which involves the first moment of the feature map $\phi^\star_h$. This notion is defined as follows.

Definition H.5 ($\eta$-explorability). We say that a linear MDP satisfies $\eta$-explorability if for any $h\in[H]$ and any $\theta\in\mathbb{R}^d\setminus\{0\}$, it holds that $\sup_{\pi\in\Pi_{\mathrm{M}}}\big|\theta^\top\mathbb{E}^\pi[\phi^\star_h(x_h,a_h)]\big| \ge \eta\|\theta\|$.

We now show that $\eta$-explorability is a special case of $\eta$-reachability:

Lemma H.3. Suppose that the explorability condition in Definition H.5 is satisfied with $\eta>0$. Then $\eta$-reachability is satisfied.

Proof of Lemma H.3. Let $x\in\mathcal{X}_{h+1}$ and define $\theta = \mu^\star_{h+1}(x)$. By explorability, we have that
$\sup_{\pi\in\Pi_{\mathrm{M}}} d^\pi(x) = \sup_{\pi\in\Pi_{\mathrm{M}}}\mathbb{E}^\pi[\mu^\star_{h+1}(x)^\top\phi^\star_h(x_h,a_h)] = \sup_{\pi\in\Pi_{\mathrm{M}}}\big|\mathbb{E}^\pi[\mu^\star_{h+1}(x)^\top\phi^\star_h(x_h,a_h)]\big| \quad(\mu^\star_{h+1}(\cdot)^\top\phi^\star_h(x,a)\text{ is a conditional density, hence non-negative}) = \sup_{\pi\in\Pi_{\mathrm{M}}}\big|\theta^\top\mathbb{E}^\pi[\phi^\star_h(x_h,a_h)]\big| \ge \eta\|\theta\| \quad(\text{by explorability}) = \eta\,\|\mu^\star_{h+1}(x)\|.$
This shows that Assumption 2.1 is satisfied with parameter $\eta$.
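As a closing illustration of Definition H.4 (again, not part of the paper's formal development), the following sketch estimates $\lambda_{\min}(\mathbb{E}^\pi[\phi^\star_h\phi^{\star\top}_h])$ for a single policy from Monte Carlo samples of the features; `sample_phi` is a placeholder for rolling out $\pi$ to layer $h$ and evaluating $\phi^\star_h$, and the toy feature distribution is arbitrary.

```python
import numpy as np

def feature_coverage(sample_phi, n=10_000):
    """Monte Carlo estimate of lambda_min(E_pi[phi phi^T]) for one policy pi,
    where sample_phi() returns phi(x_h, a_h) for a single rollout of pi."""
    d = len(sample_phi())
    second_moment = np.zeros((d, d))
    for _ in range(n):
        phi = sample_phi()
        second_moment += np.outer(phi, phi)
    second_moment /= n
    return float(np.linalg.eigvalsh(second_moment)[0])  # smallest eigenvalue

# Toy usage: features drawn from the probability simplex (so they lie in the unit ball)
# give strictly positive coverage, so Definition H.4 holds for this "policy".
rng = np.random.default_rng(1)
eta = feature_coverage(lambda: rng.dirichlet(np.ones(4)))
print(f"estimated eta-feature coverage: {eta:.3f}")
```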