# When Is Generalizable Reinforcement Learning Tractable?

Dhruv Malik, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
Yuanzhi Li, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213
Pradeep Ravikumar, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Agents trained by reinforcement learning (RL) often fail to generalize beyond the environment they were trained in, even when presented with new scenarios that seem similar to the training environment. We study the query complexity required to train RL agents that generalize to multiple environments. Intuitively, tractable generalization is only possible when the environments are similar or close in some sense. To capture this, we introduce Weak Proximity, a natural structural condition that requires the environments to have highly similar transition and reward functions and share a policy providing optimal value. Despite such shared structure, we prove that tractable generalization is impossible in the worst case. This holds even when each individual environment can be efficiently solved to obtain an optimal linear policy, and when the agent possesses a generative model. Our lower bound applies to the more complex task of representation learning for efficient generalization to multiple environments. On the positive side, we introduce Strong Proximity, a strengthened condition which we prove is sufficient for efficient generalization.

1 Introduction

Reinforcement learning (RL) is the dominant paradigm for sequential decision making in machine learning, and has achieved success in a variety of domains such as competitive gaming [33, 40] and robotic control [21, 22]. Despite this success, many issues prevent RL from being regularly used in the real world. For example, one typically trains and tests RL agents in the same environment. In such cases, an agent can memorize behavior that achieves high reward, without acquiring the true behavior that the system designer desires. This has raised concerns about RL agents overfitting to a single environment, instead of learning meaningful skills [17]. Indeed, a long line of work has noted the brittleness of RL agents: slight changes in the environment, such as those incurred by modeling or simulator design errors, or slight perturbations of the agent's trajectory, can lead to catastrophic declines in performance [36, 49, 23]. Furthermore, although RL agents can solve difficult tasks, they struggle to transfer the skills they learned in one task to perform well in a different but similar task [37, 48]. Yet, in the real world, it is reasonable to expect that RL agents will see scenarios that are at least mildly different from the specific scenarios they trained for. Hence, a desirable property of RL agents is that of generalization, broadly defined as the ability to discern the correct notion of behavior and perform well in semantically similar environments.

We focus on two popular generalization settings. The Average Performance setting assumes there is an underlying distribution over the environments that an agent might encounter. The agent's goal is to perform well on average across this distribution [35, 34, 11]. The Meta Reinforcement Learning setting is closely related [20, 10, 37]. Here an agent first learns from a suite of training environments sampled from a distribution.
Then at test time the agent must leverage this experience to adapt to a new environment sampled from the same distribution, via only a few queries in the new environment. Of course, in full generality, both notions of generalization are impossible to achieve efficiently. This is especially true in the RL function approximation setting, where the cardinality of the state space is potentially infinite, and so we desire query complexity that scales (polynomially) with the dimensionality of the state space [13, 38, 14, 31, 46]. Hence, key to both lines of inquiry is the premise that the environments are structurally similar. For example, a robot may face the differing tasks of screwing a bottle cap and turning a doorknob, but both tasks involve turning the wrist [37]. The hope is that if the environments are sufficiently similar, then RL can exploit this structure to efficiently discover policies that generalize. Yet, it remains unclear what kind of structure is necessary, and what it means for different environments to be close or similar. Motivated by this, we ask the following question:

What are the structural conditions on the environments that permit efficient generalization?

This question underlies the analysis of our paper. We focus on environments that share state-action spaces, since even this basic case is not well understood in the literature. Indeed, even in this simplified setting, efficient generalization can be highly non-trivial. We make the following contributions.

Our Contributions. We introduce Weak Proximity, a natural structural condition that is motivated by classical RL results, and requires the environments to have highly similar transition and reward functions and share optimal trajectories. We prove a statistical lower bound demonstrating that tractable generalization is impossible, despite this shared structure. This lower bound holds even when each individual environment can be efficiently solved to obtain an optimal linear policy, and when the agent possesses a generative model. Consequently, we show that a classical metric for measuring the relative closeness of MDPs is not the right metric for modern RL generalization settings. Our lower bound implies that learning a state representation for the purpose of efficiently generalizing to multiple environments is worst case sample inefficient, even when such a representation exists, the environments are ostensibly similar, and any single environment can be efficiently solved. To provide a sufficient condition for efficient generalization, we introduce Strong Proximity. This structural condition strengthens Weak Proximity by additionally constraining the environments to share an optimal policy. We provide an algorithm which exploits Strong Proximity to provably and efficiently generalize, when the environments share deterministic transitions.

2 Related Work

Simulation Lemma. Many prior works define notions of statistical distance between Markov decision processes (MDPs), and measure the relative value of policies when deployed in different MDPs that are close under such metrics. The Simulation Lemma, which uses total variation distance between transitions and the absolute difference of rewards as this metric, is a well known formalization of this and has been very useful in classical prior work [26, 27, 6, 25, 1]. These works do not directly tackle generalization, but their analyses construct an approximate MDP that models the true MDP under the aforementioned metric.
Solving this approximate MDP then corresponds to solving the true MDP. It is natural to ask whether this metric is useful for measuring the similarity of MDPs in modern RL generalization settings. We show this metric is not appropriate for the settings we study.

Transfer & Multitask Learning. There are varying formalisms of both settings, so we do not directly study them. However, they are broadly relevant, and we expect our theory to be useful for future studies of these settings. The works [32, 7, 24, 43] all study metrics for measuring variation between MDPs that are different from the metrics we study. A metric similar to the one used in the Simulation Lemma has also been studied [18], and we show that this is inappropriate for our settings.

Average Performance & Meta RL Settings. We directly study these two settings, which have seen much empirical work [35, 11, 12, 37, 48]. On the theoretical side, [5, 42] study an Average Performance setting where the agent receives a noisy observation in lieu of the actual state. We focus on the simpler setting where the agent knows its state. Recent works [16, 44] analyze the MAML algorithm [20] in the context of Meta RL. In the worst case, their complexity bounds scale exponentially with the horizon, and they do not discuss structure which permits tractable Meta RL.

Representation Learning. A large body of work has focused on extracting a representation useful for a single MDP [19, 8, 30, 50]. Some works extend this to multiple MDPs [9, 3, 41], but they are about learning shared representations for MDPs that appear similar (but not from a sample efficiency perspective), while we formalize what it means for MDPs to be similar (in a sample efficient sense). Indeed, these works study the general case when the environments have distinct state spaces, but our lower bounds show generalization is non-trivial even when each MDP shares the same state space.

3 Problem Formulation

Notation & Preliminaries. Before describing our settings of interest, we establish notation and briefly review preliminaries. We always use $M$ to denote a Markov decision process (MDP). Recall that an undiscounted finite horizon MDP is specified by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $T$ which maps from state-action pairs to distributions over states, a reward function $R$ which maps state-action pairs to nonnegative real numbers, and a finite planning horizon $H$. We assume that the state-action pairs are featurized, so that $\mathcal{S} \times \mathcal{A} \subseteq \mathbb{R}^d$, and that $\|(s,a)\|_2 = 1$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$. Any MDP we consider is undiscounted and has a finite action space, but could have an uncountable state space. If we need to refer to the transition or reward function of a specific MDP $M$, then we shall denote this via $T^M$ or $R^M$. We will denote a distribution over MDPs as $\mathcal{D}$. We also assume that $\mathcal{S}$ can be partitioned into $H$ different levels. This means that for each $s \in \mathcal{S}$ there exists a unique $h \in \{0, 1, \ldots, H-1\}$ such that it takes $h$ timesteps to arrive at $s$ from $s_0$. We say that such a state $s$ lies on level $h$, and denote $\mathcal{S}_h$ to be the set of states on level $h$. This assumption is without loss of generality, since we can always make the final coordinate of each state-action pair encode the number of timesteps that elapsed to reach the state. A deterministic MDP is one with deterministic transitions. For any MDP, we assume a single initial state $s_0$, which strengthens our lower bounds. A policy maps each state to a corresponding distribution over actions, and shall typically be denoted by $\pi$.
The total expected reward accumulated by policy $\pi$ when initialized at state $s$ in MDP $M$ is given by $\mathbb{E}\big[\sum_{h=\mathrm{level}(s)}^{H-1} R^M(s_h, a_h) \mid \pi\big]$ and will be denoted by $V^s_M(\pi)$. Here the expectation is over the trajectory $\{(s_h, a_h)\}_{h=\mathrm{level}(s)}^{H-1}$ given that the first state in the trajectory is $s$. So $V^s_M(\pi)$ is the value of the policy $\pi$ in MDP $M$ with respect to (w.r.t) initial state $s$. Analogously, if a policy is parameterized by $\theta = \{\theta_h\}_{h=0}^{H-1}$, then we denote it as $\pi(\theta)$, and the notation $V^s_M(\pi)$ is then replaced by $V^s_M(\theta)$. We assume that the cumulative reward collected by any trajectory from any initial state $s$ in any MDP $M$ is always bounded by 1. Hence the value of any policy in any MDP lies in the interval $[0, 1]$. $\mathrm{TV}(P, Q)$ denotes the total variation (TV) distance between probability distributions $P$ and $Q$.

3.1 Problem Settings

Average Performance Setting. There is a fixed distribution $\mathcal{D}$ over a family of MDPs. One can sample MDPs from $\mathcal{D}$. The algorithm can query states in the sampled MDPs, to learn some common structure. The goal is to solve

$$\max_{\pi} \; \mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi)\big] \qquad (1)$$

Meta Reinforcement Learning Setting. There is a fixed distribution $\mathcal{D}$ over a family of MDPs. At training time, one can sample MDPs from $\mathcal{D}$. The algorithm can query states in the sampled MDPs, to learn some common structure between all the MDPs. Then at test time, an MDP $M_{\mathrm{test}}$ is sampled from the same distribution $\mathcal{D}$. The goal of the algorithm is to learn a subroutine, which with non-trivial probability over the selection of $M_{\mathrm{test}}$, can solve

$$\max_{\pi} \; V^{s_0}_{M_{\mathrm{test}}}(\pi) \qquad (2)$$

significantly more efficiently than trying to solve $M_{\mathrm{test}}$ without having seen any training MDPs.

In both settings, sampling an MDP means drawing an MDP i.i.d. from $\mathcal{D}$, so that the agent can then interact with it by performing trajectories in it. Note that in Eqs. (1) & (2), in full generality the initial state $s_0$ is random and depends on $M$, $M_{\mathrm{test}}$. We focus on the case when the MDPs supporting $\mathcal{D}$ share a state-action space, and hence share the same single initial $s_0$ since we assume a single initial state for any MDP. While such assumptions are already strong, they only strengthen our lower bounds. Furthermore, it is necessary to understand this simpler setting, before looking at more complex scenarios. To the best of our knowledge, such a study has not appeared in prior work. To solve the problems described by Eqs. (1) & (2), we need to define an appropriate query model for the algorithm. We consider two query models, the first of which is strictly stronger than the second.

Strong Query Model (SQM). Sampling an MDP from $\mathcal{D}$ incurs no cost. The agent has a generative model of any sampled MDP $M$. To interact with $M$, the agent inputs a state-action pair $(s, a)$ of $M$ into the model, and receives $R^M(s, a)$ and a state sampled from $T^M(s, a)$. This incurs a query cost of one. The goal is to solve Eqs. (1) & (2) with total query cost that is at most polynomial in $|\mathcal{A}|, H, d$.

Weak Query Model (WQM). Sampling an MDP from $\mathcal{D}$ incurs a query cost of $q_{\mathcal{D}} \geq 1$. Within a sampled MDP $M$, the agent operates in the standard episodic RL setup. Concretely, during each episode the agent interacts with the MDP by starting from $s_0$, taking an action and observing the next state and reward, and repeating. Each action taken during an episode incurs a query cost of one. The goal is to solve Eqs. (1) & (2) with total query cost that is at most polynomial in $q_{\mathcal{D}}, |\mathcal{A}|, H, d$.
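To make the two query models concrete, the following minimal interface sketch (not from the paper; the class and method names, and the MDP methods `reward`, `sample_transition`, `initial_state`, are illustrative assumptions) shows how query cost would be accounted under SQM and WQM.

```python
from dataclasses import dataclass

@dataclass
class QueryBudget:
    """Tracks the total query cost charged to an algorithm."""
    cost: int = 0

class GenerativeModel:
    """SQM: query any (state, action) pair of a sampled MDP; each call costs 1."""
    def __init__(self, mdp, budget: QueryBudget):
        self.mdp, self.budget = mdp, budget

    def query(self, state, action):
        self.budget.cost += 1
        reward = self.mdp.reward(state, action)                   # R^M(s, a)
        next_state = self.mdp.sample_transition(state, action)    # s' ~ T^M(s, a)
        return reward, next_state

class EpisodicEnv:
    """WQM: standard episodic access starting from s0; each action costs 1."""
    def __init__(self, mdp, budget: QueryBudget):
        self.mdp, self.budget = mdp, budget
        self.state = None

    def reset(self):
        self.state = self.mdp.initial_state()                     # s0
        return self.state

    def step(self, action):
        self.budget.cost += 1
        reward = self.mdp.reward(self.state, action)
        self.state = self.mdp.sample_transition(self.state, action)
        return self.state, reward
```

Under SQM, sampling an MDP itself is free; under WQM, a driver loop would additionally add $q_{\mathcal{D}}$ to the budget each time it draws a fresh MDP from $\mathcal{D}$.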
Note that under both SQM and WQM, we desire query cost that is polynomial in the dimension $d$ of the state-action space, as opposed to the cardinality of the state space. This is standard for our function approximation setting [13, 38, 14, 31, 46], since the cardinality of the state space could be infinite. Also, we separate SQM and WQM because it is well known that different query models can lead to various subtleties in analysis and sample complexity guarantees [13, 38, 14, 31, 46]. The generative model that defines SQM assumes that we can simulate any state of our choice without performing a trajectory, which is unrealistic in practice, and is one of the strongest oracle models considered in prior literature [28, 4, 39, 14, 31, 2]. We shall present our lower bounds under SQM, which makes these results stronger, but shall present our upper bound under the natural and standard WQM.

Without any conditions on $\mathcal{D}$, the Average Performance & Meta RL settings can be intractable, even under SQM. This will occur if the MDPs supporting $\mathcal{D}$ do not share structure. This will also occur if any individual MDP cannot be solved efficiently. Nevertheless, in practice one often deals with MDPs which share meaningful structure [11, 37]. For instance, the transition distributions of the MDPs may be close in a suitable metric. Similarly, the reward functions of the MDPs might be close in an appropriate norm, or each MDP may share a set of optimal trajectories. And in practice, individual MDPs can usually be optimized efficiently [35, 48]. In such cases, it is reasonable to expect tractable generalization. We are interested in formalizing conditions that permit efficient generalization. We will particularly focus on conditions which capture shared structure of the MDPs and the tractability of individual MDPs. We now formally state the problem we consider throughout our paper.

Which conditions on $\mathcal{D}$ allow us to solve the Average Performance & Meta RL settings efficiently?

As mentioned above, there are two types of requirements. The first requirement should ensure that the MDPs are meaningfully similar. We formalize such conditions in Section 3.2. The second requirement should ensure that any individual MDP is efficiently solvable, else there is no hope to efficiently find policies that generalize for many MDPs. We formalize such properties in Section 3.3.

3.2 Strong & Weak Proximity

We now identify conditions that capture when the MDPs supporting $\mathcal{D}$ share meaningful structure. Since MDPs are defined in terms of rewards and transitions, it is very natural to impose conditions directly on the rewards and transitions. To this end, we state the following condition.

Condition 1 (Similar Rewards & Transitions) The distribution $\mathcal{D}$ satisfies this condition with parameters $\xi_r, \xi_{tr} \geq 0$ when:
(a) Each MDP supporting $\mathcal{D}$ shares the same state-action space $\mathcal{S} \times \mathcal{A}$.
(b) For all $M_i, M_j$ supporting $\mathcal{D}$ and all $(s, a) \in \mathcal{S} \times \mathcal{A}$ we have $|R^{M_i}(s, a) - R^{M_j}(s, a)| \leq \xi_r$.
(c) For all $M_i, M_j$ supporting $\mathcal{D}$ and all $(s, a) \in \mathcal{S} \times \mathcal{A}$ we have $\mathrm{TV}(T^{M_i}(s, a), T^{M_j}(s, a)) \leq \xi_{tr}$.

The parameters $\xi_r, \xi_{tr}$ naturally quantify the similarity of different MDPs. Conditions of this form are canonical and have yielded fruitful research in classical RL literature [26, 27, 6, 25, 1], in the guise of the Simulation Lemma (see Section 2).
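For a finite family of tabular MDPs on a shared state-action space, the parameters $\xi_r$ and $\xi_{tr}$ of Condition 1 can be computed directly. A minimal sketch (illustrative only; it assumes rewards and transitions are supplied as NumPy arrays indexed by the shared state-action space):

```python
import numpy as np
from itertools import combinations

def condition1_parameters(rewards, transitions):
    """Compute the smallest (xi_r, xi_tr) for which Condition 1 holds.

    rewards:     list of arrays, rewards[m][s, a] = R^{M_m}(s, a)
    transitions: list of arrays, transitions[m][s, a, s_next] = T^{M_m}(s_next | s, a)
    """
    xi_r, xi_tr = 0.0, 0.0
    for i, j in combinations(range(len(rewards)), 2):
        # Part (b): largest pointwise reward gap over all (s, a).
        xi_r = max(xi_r, np.abs(rewards[i] - rewards[j]).max())
        # Part (c): largest TV distance, TV(P, Q) = 0.5 * sum over s_next of |P - Q|.
        tv = 0.5 * np.abs(transitions[i] - transitions[j]).sum(axis=-1)
        xi_tr = max(xi_tr, tv.max())
    return xi_r, xi_tr
```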
To concretize this condition with an example, consider a suite of simulated robotic goal reaching tasks [48], where the physics simulator is the same in each task, so the transitions are fixed and $\xi_{tr} = 0$, but the goal location changes from task to task, implying that $\xi_r > 0$. We now establish our Weak Proximity condition, which strictly strengthens Condition 1.

Condition 2 (Weak Proximity) The distribution $\mathcal{D}$ satisfies Weak Proximity with parameters $\xi_r, \xi_{tr}, \alpha \geq 0$ when:
(a) $\mathcal{D}$ satisfies Condition 1 with parameters $\xi_r, \xi_{tr} \geq 0$.
(b) There exists a deterministic policy $\pi^\star$ which for any MDP $M$ satisfies $V^{s_0}_M(\pi^\star) \geq \max_{\pi'} V^{s_0}_M(\pi') - \alpha$.

Weak Proximity strengthens Condition 1 by additionally requiring (via part (b)) that there exists some policy $\pi^\star$ which provides $\alpha$-suboptimal value for each MDP supporting $\mathcal{D}$. Intuitively, this condition implicitly constrains the MDPs to be similar, since there is a single policy which provides (nearly) optimal value, irrespective of the MDP it is deployed in. Furthermore, recall from Eqs. (1) & (2) that the objectives of the Average Performance & Meta RL settings are defined in terms of value w.r.t the initial state $s_0$. So it is natural to assume, as we do in part (b), that there is one policy which provides good value w.r.t $s_0$ for all MDPs. From an algorithmic perspective, this is helpful, because it ensures that we can restrict our search to those policies which perform well for many MDPs supporting $\mathcal{D}$. To concretize this condition in our aforementioned example of simulated robotic goal reaching tasks [48], consider a suite of tasks where each task has multiple different equivalent goals (so the task is complete when the robot reaches any single one of these goals), but there is only one goal location that is shared and invariant across each task. The trajectory that leads to this goal location from $s_0$ defines a policy $\pi^\star$, such that for any task $M$ we have $V^{s_0}_M(\pi^\star) = \max_{\pi'} V^{s_0}_M(\pi')$, implying that Weak Proximity is satisfied with $\alpha = 0$.

Although Condition 1 is natural and well motivated by classical RL literature, it (and Weak Proximity) may seem strong. This is because it requires that each MDP supporting $\mathcal{D}$ shares the same state space, which may not hold in practice. We stress that we will prove a lower bound under Weak Proximity, showing that efficient generalization is impossible even in the simpler regime of a shared state space. We now present Strong Proximity, a condition which strictly strengthens Weak Proximity. We will later show that unlike its Weak counterpart, Strong Proximity indeed permits efficient generalization (when the environments are deterministic).

Condition 3 (Strong Proximity) The distribution $\mathcal{D}$ satisfies Strong Proximity with parameters $\xi_r, \xi_{tr}, \alpha \geq 0$ when:
(a) $\mathcal{D}$ satisfies Condition 1 with parameters $\xi_r, \xi_{tr} \geq 0$.
(b) There exists a deterministic policy $\pi^\star$ which is a near optimal policy for each MDP. Concretely, the policy $\pi^\star$ satisfies $V^s_M(\pi^\star) \geq \max_{\pi'} V^s_M(\pi') - \alpha$ for each state $s$ and each MDP $M$.

Let us compare Weak with Strong Proximity. Part (a) remains identical. But Weak Proximity (b) only requires a shared policy which provides $\alpha$-suboptimal value with respect to $s_0$. This is in contrast to the shared policy in part (b) of Strong Proximity, which provides $\alpha$-suboptimal value for any state.

3.3 Tractability of Individual Optimization

As discussed previously, in order to efficiently solve Eqs. (1) & (2), we require the property that each individual MDP supporting $\mathcal{D}$ can be efficiently solved. It is natural to expect such a property to hold in practice.
For instance, in the context of our earlier example of simulated robotic goal reaching tasks [48], any individual task can be efficiently solved via policy gradient methods. We now state two such properties, the first of which is strictly stronger than the second. Since these properties require a notion of query cost, we state both of them with reference to a generic query model QM, and when we later present our results we will instantiate QM to be either SQM or WQM. To avoid complicating notation in these statements, we assume in this subsection (as is our focus throughout the paper) that all MDPs supporting $\mathcal{D}$ are defined on the same state-action space $\mathcal{S} \times \mathcal{A} \subseteq \mathbb{R}^d$. Recall that a linear policy $\pi$ is parameterized by $\theta = \{\theta_h\}_{h=0}^{H-1}$, where $\theta_h \in \mathbb{R}^d$ and $\|\theta_h\|_2 = 1$ for all $0 \leq h \leq H-1$, such that $\pi(s) \in \operatorname{argmax}_{a \in \mathcal{A}} (s, a)^T \theta_h$ for any $s \in \mathcal{S}_h$. Here $x^T y$ denotes the Euclidean inner product of $x, y \in \mathbb{R}^d$. We use $\pi^\star_M$ to denote an arbitrary deterministic optimal policy of MDP $M$.

Property 1 (Strong Individual Optimization (SIO)) Let the query model be QM. The distribution $\mathcal{D}$ satisfies SIO with parameters $k > 0$ and $0 \leq \beta < 1/4$ when:
(a) Any MDP $M$ supporting $\mathcal{D}$ admits an optimal linear policy. Concretely, given any $M$, there exists $\theta^\star = \{\theta^\star_h\}_{h=0}^{H-1}$ such that for every state $s \in \mathcal{S}_h$ we have $\pi^\star_M(s) \in \operatorname{argmax}_{a \in \mathcal{A}} (s, a)^T \theta^\star_h$.
(b) There exists a fixed and known algorithm, such that given any MDP $M$ and any state $s$, this algorithm uses at most $O(|\mathcal{A}| H^k)$ query cost (under QM) on $M$ to identify (almost surely) a linear policy $\pi(\theta)$ parameterized by $\theta = \{\theta_h\}_{h=0}^{H-1}$ which satisfies $\max_{\pi'} V^s_M(\pi') \geq V^s_M(\theta) \geq \max_{\pi'} V^s_M(\pi') - \beta$. This algorithm then outputs $\pi(\theta)$ as well as $V^s_M(\theta)$.

Let us discuss this property. Part (a) requires that for any MDP supporting $\mathcal{D}$, there exists an optimal linear policy. Part (b) requires that the user has knowledge of an algorithm, which can efficiently find a linear policy providing $\beta$-suboptimal value from any input state $s$ in any MDP $M$. The exponent $k$ describes the (polynomially sized) complexity of this algorithm. We stated the SIO property with respect to a generic efficient algorithm, since MDPs with different structures can require different types of algorithms to solve efficiently. Nevertheless, in our lower bound construction, the algorithm we provide to satisfy SIO is extremely simple and natural. It is simply a greedy version of Monte Carlo Tree Search, which is extremely popular in practice [29, 40]. SIO is a fairly strong property, since it says that a linear policy is sufficient to optimize any individual MDP, whereas in practice one typically requires nonlinear neural network policies. SIO also heavily constrains each individual MDP supporting $\mathcal{D}$ to be efficiently solvable from any initial state. We stress that we will prove our lower bounds under SIO, which makes our result stronger. Meanwhile, we prove our upper bounds under the following property, which is significantly weaker than SIO.

Property 2 (Weak Individual Optimization (WIO)) Let the query model be QM. The distribution $\mathcal{D}$ satisfies WIO with parameter $0 \leq \beta < 1/4$ when the following holds. There exists an oracle $\widehat{V}$, which takes as input a state $s$ and MDP $M$, and outputs $\widehat{V}^s_M$ satisfying $\max_{\pi'} V^s_M(\pi') \geq \widehat{V}^s_M \geq \max_{\pi'} V^s_M(\pi') - \beta$, via query cost (under QM) on $M$ that is polynomial in $|\mathcal{A}|, H, d$.

WIO postulates the existence of an oracle $\widehat{V}$, which can efficiently approximate the optimal value that is achievable from an input state and MDP.
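The linear policy class appearing in SIO is simple to picture. Below is a minimal sketch (illustrative, not the paper's code) of a per-level linear policy that acts greedily on the state-action features; the `features(state, action)` callable, which returns the $d$-dimensional featurization of a pair, is an assumed interface.

```python
import numpy as np

class LinearPolicy:
    """Per-level linear policy: pi(s) in argmax_a features(s, a) . theta_h for s on level h."""
    def __init__(self, thetas):
        # thetas: array of shape (H, d); SIO assumes each row has unit Euclidean norm.
        self.thetas = thetas

    def act(self, state, actions, features, level):
        theta_h = self.thetas[level]
        scores = [features(state, a) @ theta_h for a in actions]
        return actions[int(np.argmax(scores))]
```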
To see that WIO is strictly weaker than SIO, simply note we can implement $\widehat{V}$ by running the algorithm described in part (b) of SIO. Note that in certain states, a user may use domain knowledge to implement $\widehat{V}$ without solving an entire RL problem. Also note that WIO does not place (arguably unrealistic) linearity restrictions on the MDPs supporting $\mathcal{D}$.

4 Main Results

We shall present our results in two subsections. In Section 4.1, we prove lower bounds which demonstrate that even under Weak Proximity, SQM and SIO, tractable generalization is worst case impossible. In Section 4.2, we prove that efficient generalization is possible under Strong Proximity, WQM and WIO, when the MDPs supporting $\mathcal{D}$ share a deterministic transition function.

4.1 Lower Bounds

Before stating our own results, we first state the following classical result which is known as the Simulation Lemma [26, 27, 6, 25, 1]. Recall that $\xi_r, \xi_{tr}$ are parameters used to satisfy Condition 1.

Lemma 1 Consider any $\mathcal{D}$ satisfying Condition 1 with $\xi_r, \xi_{tr} \geq 0$. For any policy $\pi$ and any $M_1, M_2$ supporting $\mathcal{D}$, we have that $|V^{s_0}_{M_1}(\pi) - V^{s_0}_{M_2}(\pi)| \leq \xi_r H + \xi_{tr} H$.

This result is almost identical to the one given by [25], although there are some (minor) differences in assumptions so we provide a proof in Appendix D. This lemma shows that when $\mathcal{D}$ satisfies Condition 1 and $\xi_r, \xi_{tr}$ are each $o(\frac{1}{H})$, then efficient generalization is trivial, at least in problems where $H$ is large and we want to optimize to within $o(1)$ tolerance. Concretely, take any $M$ supporting $\mathcal{D}$ and use a standard RL method to find $\pi$ which satisfies $V^{s_0}_M(\pi) \approx \max_{\pi'} V^{s_0}_M(\pi')$. Then Lemma 1 ensures $V^{s_0}_{M'}(\pi) \geq \max_{\pi'} V^{s_0}_{M'}(\pi') - o(1)$ for any other MDP $M'$ supporting $\mathcal{D}$. This implies $\mathbb{E}_{M \sim \mathcal{D}}[V^{s_0}_M(\pi)] \geq \max_{\pi'} \mathbb{E}_{M \sim \mathcal{D}}[V^{s_0}_M(\pi')] - o(1)$ and $V^{s_0}_{M_{\mathrm{test}}}(\pi) \geq \max_{\pi'} V^{s_0}_{M_{\mathrm{test}}}(\pi') - o(1)$. Since Weak Proximity implies Condition 1, Lemma 1 and all the above statements remain true when $\mathcal{D}$ satisfies Weak Proximity. Naturally then, in our settings it is only interesting to consider problems when at least one of either $\xi_r$ or $\xi_{tr}$ is $\Omega(\frac{1}{H})$.

Our next result is a lower bound which shows that when $\xi_r = \Theta(\frac{1}{H})$ and $\xi_{tr} = 0$, then Weak Proximity is not sufficient to efficiently generalize in the Average Performance Setting. For the statement of this result, recall that $\xi_r, \xi_{tr}, \alpha$ are parameters used to satisfy Weak Proximity, while $\beta, k$ are parameters used to satisfy SIO.

Theorem 1 Let the query model be SQM. For any $k \geq 3$, there exists $\mathcal{D}$ satisfying Weak Proximity with $\xi_r = \Theta(\frac{1}{H})$, $\xi_{tr} = 0$ & $\alpha = 0$ and SIO with $\beta = 0$ & $k$, such that the MDPs supporting $\mathcal{D}$ are deterministic and the following holds. Any (possibly randomized) algorithm requires $\Omega\big(\min\{|\mathcal{A}|^H, 2^d\}\big)$ total query cost to find (with probability at least 1/2 over the randomness of the algorithm) a policy $\pi$ satisfying
$$\mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi)\big] \geq \max_{\text{linear policy } \pi'} \mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi')\big] - 1/4.$$

We defer the proof to Appendix A.1. Let us discuss this theorem, which is stated for the Average Performance Setting, when the MDPs supporting $\mathcal{D}$ all share a deterministic transition function. Recall that SQM is the stronger query model we consider, which strengthens this lower bound, and trivially implies a lower bound for when WQM is the query model. Also recall that SIO is the stronger individual optimization property that we consider, and it ensures that the user can efficiently find a linear policy providing optimal value w.r.t any initial state for any individual MDP, since $\beta = 0$.
Moreover, Weak Proximity (b) ensures that each MDP supporting $\mathcal{D}$ shares a policy that provides optimal value (w.r.t $s_0$), since $\alpha = 0$. And Weak Proximity (a) explicitly requires that the reward functions are (non-trivially) close, in the sense defined by Condition 1, because $\xi_r = \Theta(\frac{1}{H})$. Despite this significant structure, the theorem demonstrates that one can still require an exponential query cost to find a policy that is nearly as good as the best linear policy (which is of course easier than finding the best generic policy). Note that this lower bound holds with $\alpha = \beta = \xi_{tr} = 0$, and so implies a lower bound for when any of $\alpha, \beta, \xi_{tr}$ are strictly positive. As we discuss at the end of Section 4.1, Theorem 1 (and its forthcoming corollaries) immediately applies to the task of learning a feature mapping which maps similar states to the same vector, for the purpose of efficiently solving Average Performance and Meta RL settings. We note that in the construction used to prove the lower bound of Theorem 1, the algorithm we provide to satisfy the SIO property is extremely simple and natural. It is simply a greedy version of Monte Carlo Tree Search, which is extremely popular in practice [29, 40].

Let us provide some intuition for our proof of Theorem 1. In the $|\mathcal{A}|$-ary tree hard instance used in our proof, there are $\Omega(|\mathcal{A}|^H)$ possible trajectories. The fact that $\xi_r = \Theta(\frac{1}{H})$ allows us enough degrees of freedom to hide the policy that generalizes across $\mathcal{D}$, so that identifying it requires querying each of the $\Omega(|\mathcal{A}|^H)$ trajectories. We leverage recent techniques [14, 45] to construct a suitable featurization of the state-action space, that is expressive enough to allow for efficiently finding an optimal linear policy for any single MDP, but does not leak any further information.
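To convey the flavor of this needle-in-a-haystack structure, here is a deliberately simplified toy family of deterministic $|\mathcal{A}|$-ary tree MDPs. It is illustrative only: it captures the exponential number of candidate trajectories and the $\Theta(1/H)$ pointwise reward gaps, but it does not reproduce the careful featurization or the Weak Proximity structure of the actual construction in Appendix A.1.

```python
import itertools, random

def make_reward(secret_path, horizon):
    """Reward function of one tree MDP in the toy family, indexed by a secret action sequence.

    States are action prefixes; the shared deterministic transition simply appends
    the chosen action, so xi_tr = 0 across the family. Taking the 'correct' action
    at level h earns 1/horizon, so rewards of any two members differ pointwise by
    at most 1/horizon, yet only the secret path collects total reward 1.
    """
    def reward(prefix, action):
        h = len(prefix)
        on_path = prefix == secret_path[:h]
        return 1.0 / horizon if on_path and action == secret_path[h] else 0.0
    return reward

def episodes_until_found(reward, num_actions, horizon, rng):
    """Count episodes of uniformly random trajectories until the rewarding path is hit."""
    episodes = 0
    while True:
        episodes += 1
        prefix, total = (), 0.0
        for _ in range(horizon):
            a = rng.randrange(num_actions)
            total += reward(prefix, a)
            prefix += (a,)
        if total >= 1.0 - 1e-9:
            return episodes

A, H = 3, 6
family = [make_reward(z, H) for z in itertools.product(range(A), repeat=H)]
rng = random.Random(0)
trials = [episodes_until_found(random.choice(family), A, H, rng) for _ in range(3)]
print(f"|A|^H = {A**H} candidate trajectories; episodes needed in 3 trials: {trials}")
```

Naive search over trajectories needs on the order of $|\mathcal{A}|^H$ episodes to identify the rewarding path of a randomly chosen member, which mirrors the exhaustive querying forced by the lower bound.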
A similar result holds for the Meta RL setting. Recall that by SIO (b), the user has access to an algorithm which can solve any $M_{\mathrm{test}}$ at test time in $O(|\mathcal{A}| H^k)$ queries, even if it does no training. So it only makes sense to train, if one can use this training to solve $M_{\mathrm{test}}$ in $o(|\mathcal{A}| H^k)$ queries. The following corollary to Theorem 1 demonstrates that this may require exponential query cost during training time. Its proof is presented in Appendix A.2.

Corollary 1 Let the query model be SQM. For any $k \geq 3$, there exists $\mathcal{D}$ satisfying Weak Proximity with $\xi_r = \Theta(\frac{1}{H})$, $\xi_{tr} = 0$ & $\alpha = 0$ and SIO with $\beta = 0$ & $k$, such that the MDPs supporting $\mathcal{D}$ are deterministic and the following holds. If a (possibly randomized) algorithm at test time can identify $\pi$ satisfying
$$V^{s_0}_{M_{\mathrm{test}}}(\pi) \geq \max_{\text{linear policy } \pi'} V^{s_0}_{M_{\mathrm{test}}}(\pi') - 1/4$$
in $o(|\mathcal{A}| H^k)$ queries, with probability at least 1/2 over the selection of $M_{\mathrm{test}}$ (and the randomness of the algorithm), then this algorithm must have required $\Omega\big(\min\{|\mathcal{A}|^H, 2^d\}\big)$ total query cost at training time.

So far we have presented results for when the MDPs supporting $\mathcal{D}$ share a deterministic transition function but have (slightly) varying rewards. For the remainder of Section 4.1, we present analogous results for when the MDPs share a reward function but have (slightly) varying transitions, again under both SIO and SQM. Recall from our discussion of Lemma 1 that when $\xi_r = 0$, it is only interesting to consider problems when $\xi_{tr}$ is $\Omega(\frac{1}{H})$. Unfortunately, our next result is a corollary of Theorem 1 which shows that efficiently solving the Average Performance Setting is impossible in this regime.

Corollary 2 Let the query model be SQM. For any $k \geq 3$, there exists $\mathcal{D}$ satisfying Weak Proximity with $\xi_r = 0$, $\xi_{tr} = \Theta(\frac{1}{H})$ & $\alpha = 0$ and SIO with $\beta = 0$ & $k$, such that the following holds. Any (possibly randomized) algorithm requires $\Omega\big(\min\{|\mathcal{A}|^H, 2^d\}\big)$ total query cost to find (with probability at least 1/2 over the randomness of the algorithm) a policy $\pi$ satisfying
$$\mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi)\big] \geq \max_{\text{linear policy } \pi'} \mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi')\big] - 1/4.$$

We defer the proof to Appendix A.3. We recall the discussion of Theorem 1, and note that the same discussion applies here, after swapping $\xi_{tr}$ with $\xi_r$. An analogous result holds for the Meta RL setting. As we discussed before presenting Corollary 1, it only makes sense to train, if one can use this training in order to solve $M_{\mathrm{test}}$ in $o(|\mathcal{A}| H^k)$ queries. The following result shows that this is impossible without exponential total query cost at training time. Its proof is presented in Appendix A.4.

Corollary 3 Let the query model be SQM. For any $k \geq 3$, there exists $\mathcal{D}$ satisfying Weak Proximity with $\xi_r = 0$, $\xi_{tr} = \Theta(\frac{1}{H})$ & $\alpha = 0$ and SIO with $\beta = 0$ & $k$, such that the following holds. If a (possibly randomized) algorithm at test time can identify $\pi$ satisfying
$$V^{s_0}_{M_{\mathrm{test}}}(\pi) \geq \max_{\text{linear policy } \pi'} V^{s_0}_{M_{\mathrm{test}}}(\pi') - 1/4$$
in $o(|\mathcal{A}| H^k)$ queries, with probability at least 1/2 over the selection of $M_{\mathrm{test}}$ (and the randomness of the algorithm), then this algorithm must have required $\Omega\big(\min\{|\mathcal{A}|^H, 2^d\}\big)$ total query cost at training time.

In conjunction with Lemma 1, the results of Theorem 1 and Corollaries 1, 2 & 3 suggest that the classical (and quite natural) way of measuring variation in MDPs using Condition 1 is not the right metric for the modern Average Performance & Meta RL settings. When both $\xi_r$ and $\xi_{tr}$ are $o(\frac{1}{H})$, then these settings are trivially solvable. But when either $\xi_r$ or $\xi_{tr}$ is $\Theta(\frac{1}{H})$ then these settings become exponentially hard, even under the additional Weak Proximity condition as well as SIO & SQM. Note that Theorem 1 and Corollaries 1, 2 & 3 all hold in the setting where each MDP supporting $\mathcal{D}$ shares a state-action space. So these lower bounds immediately apply to more complex settings where the MDPs are defined on disjoint state-action spaces, and where learning an appropriate representation is necessary. Indeed, it is popular in practice to learn a feature mapping which maps similar states to the same vector. Our results show that if such a mapping enables efficient solution of the Average Performance & Meta RL settings, then learning the mapping itself is worst case inefficient.

4.2 Upper Bound

We now show that Strong Proximity permits efficient generalization when the MDPs supporting $\mathcal{D}$ share deterministic transitions. While this setting is restricted, we study it because our Theorem 1 shows that even this setting can be worst case inefficient under strong assumptions. Furthermore, past literature on even traditional RL with a single MDP has often focused on the deterministic setting [47, 15]. Notably, to prove our upper bound we only require the weaker WQM and weaker WIO. Our method is defined in Algorithm 1. It exploits Strong Proximity, which requires the existence of a policy which provides optimal value for each MDP from any given initial state, even though the objectives in Eqs. (1) & (2) are defined only in terms of value w.r.t $s_0$. Let us describe Algorithm 1. It represents policy $\pi$ as a vector which stores one action for each timestep in $\{0, 1, \ldots, H-1\}$.
It initializes an arbitrary $\pi$ and incrementally updates it at each timestep.

Algorithm 1
Inputs: horizon length $H$, distribution $\mathcal{D}$, sample size $n$, oracle $\widehat{V}$ as defined in WIO
1: Initialize $\pi$ as an arbitrary function from $\{0, 1, \ldots, H-1\}$ to $\mathcal{A}$
2: for $t \in \{0, 1, \ldots, H-1\}$ do
3:   for $i \in \{1, 2, \ldots, n\}$ do
4:     Sample $M_i \sim \mathcal{D}$
5:     for $a \in \mathcal{A}$ do
6:       Begin a new episode in $M_i$ at $s_0$
7:       if $t > 0$ then execute action sequence $\{\pi(t')\}_{0 \leq t' < t}$ in $M_i$ to arrive at a state $s_t$
8:       Take action $a$ from $s_t$, observe the reward $r_{i,a}$ and the next state $s'_{i,a}$
9:       Query the oracle to obtain $\widehat{V}^{s'_{i,a}}_{M_i}$
10:  Update $\pi(t) \in \operatorname{argmax}_{a \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^{n} \big( r_{i,a} + \widehat{V}^{s'_{i,a}}_{M_i} \big)$
11: Output $\pi$

At timestep $t > 0$, $\pi$ has been constructed to play the action $\pi(t') = a_{t'}$ at each timestep $t' < t$. The algorithm then executes $\{\pi(t')\}_{0 \leq t' < t}$ to arrive at a state on level $t$, tries each action $a$ from that state, and uses the oracle $\widehat{V}$ to estimate the value obtained by taking $a$ and behaving near optimally thereafter. It assigns $\pi(t)$ to the action whose estimate, averaged over the $n$ MDPs sampled from $\mathcal{D}$, is largest. We now state the guarantee satisfied by Algorithm 1.

Theorem 2 Let the query model be WQM. Assume that $\mathcal{D}$ satisfies Strong Proximity with parameters $\xi_r, \xi_{tr}, \alpha \geq 0$ and WIO with parameter $\beta$, and that the MDPs supporting $\mathcal{D}$ share a deterministic transition function. Fix any $\epsilon, \delta > 0$, and let $\pi$ be the output of Algorithm 1 when run with $n = \frac{H^2}{\epsilon^2} \log\frac{2H|\mathcal{A}|}{\delta}$ samples. Then with probability at least $1 - \delta$, we are guaranteed that
$$\mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi)\big] \geq \max_{\pi'} \mathbb{E}_{M \sim \mathcal{D}}\big[V^{s_0}_M(\pi')\big] - \epsilon - 3\alpha H - 3\beta H.$$
Hence the total query cost under WQM required to achieve this guarantee is polynomial in $q_{\mathcal{D}}, |\mathcal{A}|, H, d$.

We defer the proof to Appendix B. A few comments are in order. First, note that Theorem 2 directly provides a guarantee for the Average Performance setting. It also provides a guarantee for the Meta RL setting, since the $\pi$ found by Algorithm 1 will on average perform well for $M_{\mathrm{test}}$, and the user can use $\pi$ to warm start any finetuning or adaptation at test time. Second, the specified value of $n$ depends only on quantities that are either known a priori or chosen by the user. This makes Algorithm 1 parameter free: the user does not need to know the values of $\alpha, \beta, \xi_r, \xi_{tr}$ to run this method. Third, note Theorem 2 holds under WIO. By contrast, Weak Proximity was insufficient for efficient generalization even when paired with SIO. This suggests that a condition that is both necessary and sufficient for efficient generalization lies somewhere between Weak and Strong Proximity, assuming, of course, that we do not assume an individual optimization property that is even stronger than SIO. Indeed, SIO is already quite strong, since SIO says that a linear policy is sufficient to optimize any individual MDP, but in practice one typically employs nonlinear neural network policies. Finally, observe that $\xi_r$ does not appear in the error bound. So $\xi_r$ can be arbitrarily large, and Theorem 2 requires no explicit conditions on the reward functions of the MDPs supporting $\mathcal{D}$, as in the sense of Condition 1. Instead, the implicit reward structure induced by the shared nearly optimal policy required by Strong Proximity is sufficient. Comparing this observation with the result of Theorem 1 suggests that the classical explicit constraints on rewards and transitions are not appropriate for modern RL generalization settings. Instead, implicit constraints of the sort afforded by Strong Proximity offer a more fine grained characterization of when efficient generalization is possible.

Recall from Theorem 1 that a lower bound holds for Weak Proximity and SIO even with $\alpha = \beta = 0$. However, Strong Proximity and WIO provide enough structure that the error bound of Theorem 2 can tolerate $\alpha, \beta > 0$. But these $\alpha, \beta$ terms in the error bound of Theorem 2 scale linearly with $H$. It is natural to question whether this scaling is due to a suboptimality of Algorithm 1 or looseness in our analysis. We provide a partial answer to this question in Appendix C, where we prove that the dependency on $\beta$ given in the result of Theorem 2 is tight to within a logarithmic factor in $H$.
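For concreteness, a minimal Python sketch of Algorithm 1 follows. It is illustrative rather than the authors' implementation: the episodic interface (`sample_mdp`, `reset`, `step`) and the oracle signature `v_hat(state, mdp)` are assumptions standing in for WQM access and the WIO oracle.

```python
import random

def algorithm1(H, actions, sample_mdp, v_hat, n, seed=0):
    """Level-wise greedy construction of an open-loop policy pi: {0,...,H-1} -> A.

    sample_mdp() draws an MDP from D and returns an episodic handle with
    reset() -> s0 and step(a) -> (next_state, reward); v_hat(state, mdp)
    returns the WIO estimate of the best value attainable from `state`.
    """
    rng = random.Random(seed)
    pi = [rng.choice(actions) for _ in range(H)]          # step 1: arbitrary init

    for t in range(H):                                    # step 2
        totals = {a: 0.0 for a in actions}
        for _ in range(n):                                # step 3
            mdp = sample_mdp()                            # step 4: M_i ~ D
            for a in actions:                             # step 5
                state = mdp.reset()                       # step 6: new episode at s0
                for t_prime in range(t):                  # step 7: replay the prefix of pi
                    state, _ = mdp.step(pi[t_prime])
                next_state, reward = mdp.step(a)          # step 8: try action a from s_t
                totals[a] += reward + v_hat(next_state, mdp)   # step 9: oracle estimate
        pi[t] = max(actions, key=lambda a: totals[a])     # step 10: greedy update
    return pi                                             # step 11
```

Each (i, a) pair costs roughly t + 1 actions under WQM plus the oracle's polynomial query cost, so the total query cost stays polynomial in $q_{\mathcal{D}}, |\mathcal{A}|, H, d$, consistent with Theorem 2.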
5 Discussion

In this paper, we studied the design of RL agents that generalize. We proved that efficient generalization is worst case impossible, even under structural conditions like Weak Proximity and strong assumptions on the query model and tractability of individual MDPs. This result extends to the task of learning representations for the purpose of efficient generalization. On the positive side, we provided Strong Proximity, which permits efficient generalization, even under mild assumptions on the query model and individual tractability. Our analysis highlights that classical metrics for measuring similarity of MDPs are inappropriate for modern RL. It also suggests that a condition which is both necessary and sufficient for efficient generalization lies between Weak & Strong Proximity, unless we make (arguably unreasonable) assumptions on the tractability of individual MDPs.

Negative Societal Impacts. Our work is theoretical, and we do not foresee any direct societal impacts, at least in the short term. In the long term, our work may increase the technological feasibility of developing agents that can be deployed in society. In this scenario, a bad actor may deploy harmful, malicious agents. This must be prevented by properly understanding the technology (which our work aims to do), and working with policy makers to prevent bad actors from accessing it.

Limitations of Our Work. The primary limitation of our work is that our upper bound has limited applicability. It holds only when the MDPs share a state-action space, and when the MDPs are deterministic, which is very restrictive in practice. Our rationale for working in this restricted setting was due to our lower bounds, which show that even this toy setting can be worst case inefficient, and because it is necessary to understand the toy setting before looking at more complex scenarios. Nevertheless, our upper bound is several steps removed from the practice of RL. It is best interpreted as a preliminary sufficient condition for when efficient generalization is possible, albeit in a toy setting, and is far from conclusive on this matter.

Future Work. Note that our upper bound might apply if we are a priori given a feature mapping which maps similar states of different MDPs to the same state space. For example, in self driving, learning to drive in different countries might be difficult because the images of traffic signs are different. But if a known feature map extracts the underlying meaning of these signs, then our upper bound could conceivably apply. The key direction for future work is how to learn such a feature mapping efficiently, while ensuring that it is still useful for generalization.

Acknowledgments and Disclosure of Funding

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745016. We further acknowledge the support of NSF via RI 2007517 and IIS-1909816. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2005.
[2] A. Agarwal, S. Kakade, and L. F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Proceedings of the Conference on Learning Theory, 2020.
[3] R. Agarwal, M. C. Machado, P. S. Castro, and M. G. Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations, 2021.
[4] M. G. Azar, R. Munos, and H. J. Kappen. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, 2012.
[5] M. Bertran, N. Martinez, M. Phielipp, and G. Sapiro. Instance-based generalization in reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
[6] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.
[7] E. Brunskill and L. Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2013.
[8] P. S. Castro. Scalable methods for computing state similarity in deterministic markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[9] P. S. Castro and D. Precup. Using bisimulation for policy transfer in mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, 2010.
[10] I. Clavera, A. Nagabandi, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. In International Conference on Learning Representations, 2019.
[11] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. Quantifying generalization in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2019.
[12] K. Cobbe, C. Hesse, J. Hilton, and J. Schulman. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.
[13] S. S. Du, Y. Luo, R. Wang, and H. Zhang. Provably efficient q-learning with function approximation via distribution shift error checking oracle. In Advances in Neural Information Processing Systems, 2019.
[14] S. S. Du, S. M. Kakade, R. Wang, and L. F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
[15] S. S. Du, J. D. Lee, G. Mahajan, and R. Wang. Agnostic q-learning with function approximation in deterministic systems: Near-optimal bounds on approximation error and sample complexity. In Advances in Neural Information Processing Systems, 2020.
[16] A. Fallah, A. Mokhtari, and A. Ozdaglar. Provably convergent policy gradient methods for model-agnostic meta-reinforcement learning. arXiv preprint arXiv:2002.05135, 2020.
[17] J. Farebrother, M. C. Machado, and M. Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.
[18] F. Feng, W. Yin, and L. F. Yang. Does knowledge transfer always help to learn a better policy? arXiv preprint arXiv:1912.02986, 2019.
[19] N. Ferns, P. Panangaden, and D. Precup. Metrics for finite markov decision processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.
[20] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, 2017.
[21] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of the IEEE International Conference on Robotics and Automation, 2017.
[22] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine. Composable deep reinforcement learning for robotic manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation, 2018.
[23] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[24] N. Jiang. PAC reinforcement learning with an imperfect model. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[25] S. Kakade, M. Kearns, and J. Langford. Exploration in metric state spaces. In Proceedings of the International Conference on Machine Learning, 2003.
[26] M. Kearns and D. Koller. Efficient reinforcement learning in factored mdps. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[27] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209-232, 2002.
[28] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[29] L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. In Proceedings of the European Conference on Machine Learning, 2006.
[30] C. L. Lan, M. G. Bellemare, and P. S. Castro. Metrics and continuity in reinforcement learning. arXiv preprint arXiv:2102.01514, 2021.
[31] T. Lattimore, C. Szepesvári, and G. Weisz. Learning with good feature representations in bandits and in RL with a generative model. In Proceedings of the International Conference on Machine Learning, 2020.
[32] A. Lazaric and M. Ghavamzadeh. Bayesian multi-task reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2010.
[33] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518:529-533, 2015.
[34] A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman. Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720, 2018.
[35] C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.
[36] A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade. Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems, 2017.
[37] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In Proceedings of the International Conference on Machine Learning, 2019.
[38] B. V. Roy and S. Dong. Comments on the du-kakade-wang-yang lower bounds. arXiv preprint arXiv:1911.07910, 2019.
[39] A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, 2018.
[40] D. Silver et al. Mastering the game of go without human knowledge. Nature, 550:354-359, 2017.
[41] A. Sonar, V. Pacelli, and A. Majumdar. Invariant policy optimization: Towards stronger generalization in reinforcement learning. arXiv preprint arXiv:2006.01096, 2020.
[42] X. Song, Y. Jiang, S. Tu, Y. Du, and B. Neyshabur. Observational overfitting in reinforcement learning. In International Conference on Learning Representations, 2020.
[43] H. Wang, S. Zheng, C. Xiong, and R. Socher. On the generalization gap in reparameterizable reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2019.
[44] L. Wang, Q. Cai, Z. Yang, and Z. Wang. On the global optimality of model-agnostic meta-learning. In Proceedings of the International Conference on Machine Learning, 2020.
[45] R. Wang, S. S. Du, L. F. Yang, and R. Salakhutdinov. On reward-free reinforcement learning with linear function approximation. In Advances in Neural Information Processing Systems, 2020.
[46] G. Weisz, P. Amortila, B. Janzer, Y. Abbasi-Yadkori, N. Jiang, and C. Szepesvári. On query-efficient planning in mdps under linear realizability of the optimal state-value function. arXiv preprint arXiv:2102.02049, 2021.
[47] Z. Wen and B. Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, 2013.
[48] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Proceedings of the Conference on Robot Learning, 2019.
[49] A. Zhang, N. Ballas, and J. Pineau. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937, 2018.
[50] A. Zhang, R. T. McAllister, R. Calandra, Y. Gal, and S. Levine. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 5.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 5.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 4 for the complete statements of all the theoretical results, see the beginning of Section 3 for notation and preliminaries, and see Sections 3.2 & 3.3 for the Conditions and Properties that are fundamental to our results.
(b) Did you include complete proofs of all theoretical results? [Yes] See Appendices A, B, C, & D.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A] No experiments were run.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A] No experiments were run.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] No experiments were run.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] No experiments were run.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [N/A] No such assets were used or created.
(b) Did you mention the license of the assets? [N/A] No such assets were used or created.
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A] No such assets were used or created.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] No such assets were used or created.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] No such assets were used or created.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] We did not crowdsource or conduct research with human subjects.
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] We did not crowdsource or conduct research with human subjects.
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] We did not crowdsource or conduct research with human subjects.