# Heuristic-Guided Reinforcement Learning

Ching-An Cheng, Microsoft Research, Redmond, WA. chinganc@microsoft.com
Andrey Kolobov, Microsoft Research, Redmond, WA. akolobov@microsoft.com
Adith Swaminathan, Microsoft Research, Redmond, WA. adswamin@microsoft.com

We provide a framework for accelerating reinforcement learning (RL) algorithms by heuristics constructed from domain knowledge or offline data. Tabula rasa RL algorithms require environment interactions or computation that scales with the horizon of the sequential decision-making task. Using our framework, we show how heuristic-guided RL induces a much shorter-horizon subproblem that provably solves the original task. Our framework can be viewed as a horizon-based regularization for controlling bias and variance in RL under a finite interaction budget. On the theoretical side, we characterize properties of a good heuristic and its impact on RL acceleration. In particular, we introduce the novel concept of an improvable heuristic, a heuristic that allows an RL agent to extrapolate beyond its prior knowledge. On the empirical side, we instantiate our framework to accelerate several state-of-the-art algorithms in simulated robotic control tasks and procedurally generated games. Our framework complements the rich literature on warm-starting RL with expert demonstrations or exploratory datasets, and introduces a principled method for injecting prior knowledge into RL.

## 1 Introduction

Many recent empirical successes of reinforcement learning (RL) require solving problems with very long decision-making horizons. OpenAI Five [1] used episodes that were 20000 timesteps on average, while AlphaStar [2] used roughly 5000 timesteps. Long-term credit assignment is a very challenging statistical problem, with the sample complexity growing quadratically (or worse) with the horizon [3]. Long horizons (or, equivalently, large discount factors) also increase RL's computational burden, leading to slow optimization convergence [4]. This makes RL algorithms require prohibitively large amounts of interactions and compute: even with tuned hyperparameters, AlphaStar needed over $10^8$ samples and OpenAI Five needed over $10^7$ PFLOPS of compute.

A popular approach to mitigate the statistical and computational issues of tabula rasa RL methods is to warm-start or regularize learning with prior knowledge [1, 2, 5-10]. For instance, AlphaStar learned a policy and value function from human demonstrations and regularized the RL agent using imitation learning (IL). AWAC [9] warm-started a policy using batch policy optimization on exploratory datasets. While these approaches have been effective in different domains, none of them explicitly address RL's complexity dependence on the horizon.

In this paper, we propose a complementary regularization technique that relies on heuristic value functions, or heuristics¹ for short, to effectively shorten the problem horizon faced by an online RL agent for fast learning. We call this approach Heuristic-Guided Reinforcement Learning (HuRL). The core idea is simple: given a Markov decision process (MDP) $M = (S, A, P, r, \gamma)$ and a heuristic $h: S \to \mathbb{R}$, we select a mixing coefficient $\lambda \in [0, 1]$ and have the agent solve a new MDP $\widetilde{M} = (S, A, P, \tilde{r}, \tilde{\gamma})$ with a reshaped reward and a smaller discount (i.e. a shorter horizon):

$$\tilde{r}(s, a) := r(s, a) + (1 - \lambda)\gamma\,\mathbb{E}_{s' \sim P(\cdot|s,a)}[h(s')] \quad\text{and}\quad \tilde{\gamma} := \lambda\gamma. \tag{1}$$

¹We borrow this terminology from the planning literature to refer to guesses of $V^*$ in an MDP [11].
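For intuition, substituting the two extreme values of $\lambda$ into (1) spells out the special cases discussed below:

```latex
\lambda = 1:\;\; \tilde r = r,\;\; \tilde\gamma = \gamma
  \;\Longrightarrow\; \widetilde{M} = M \ \text{(no reshaping)};
\qquad
\lambda = 0:\;\; \tilde r(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}[h(s')],\;\; \tilde\gamma = 0
  \;\Longrightarrow\; \text{a one-step problem with } h \text{ as the terminal value.}
```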
HuRL effectively introduces horizon-based regularization that determines whether long-term value information should come from collected experiences or from the heuristic. By modulating the effective horizon via $\lambda$, we trade off the bias and the complexity of solving the reshaped MDP. HuRL with $\lambda = 1$ recovers the original problem, and with $\lambda = 0$ it creates an easier contextual bandit problem [12].

A heuristic $h$ in HuRL represents a prior guess of the desired long-term return of states, which ideally is the optimal value function $V^*$ of the unknown MDP $M$. When the heuristic $h$ captures the state ordering of $V^*$ well, conceptually, it becomes possible to make good long-term decisions by short-horizon planning or even acting greedily. How do we construct a good heuristic? In the planning literature, this is typically achieved by solving a relaxation of the original problem [13-15]. Alternatively, one can learn it from batch data collected by exploratory behavioral policies (as in offline RL [16]) or from expert policies (as in IL [17]).² For some dense-reward problems, a zero heuristic can be effective in reducing RL complexity, as exploited by the guidance discount framework [18-23]. In this paper, we view heuristics as a unified representation of various forms of prior knowledge, such as expert demonstrations, exploratory datasets, and engineered guidance.

Although the use of heuristics to accelerate search has been popular in planning and control algorithms, e.g., A* [24], MCTS [25], and MPC [7, 26-28], its theory is less developed for settings where the MDP is unknown. The closest work in RL is potential-based reward shaping (PBRS) [29], which reshapes the reward into $\bar{r}(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[h(s')] - h(s)$ while keeping the original discount. PBRS can use any heuristic to reshape the reward while preserving the ordering of policies. However, giving PBRS rewards to an RL algorithm does not necessarily lead to faster learning, because the base RL algorithm would still seek to explore to resolve long-term credit assignment. HuRL allows common RL algorithms to leverage the short-horizon potential provided by a heuristic to learn faster.

In this work, we provide a theoretical foundation of HuRL to enable adopting heuristics and horizon reduction for accelerating RL, combining advances from the PBRS and the guidance discount literatures. On the theoretical side, we derive a bias-variance decomposition of HuRL's horizon-based regularization in order to characterize the solution quality as a function of $\lambda$ and $h$. Using this insight, we provide sufficient conditions for achieving an effective trade-off, including properties required of a base RL algorithm that solves the reshaped MDP $\widetilde{M}_\lambda$. Furthermore, we define the novel concept of an improvable heuristic and prove that good heuristics for HuRL can be constructed from data using existing pessimistic offline RL algorithms (such as pessimistic value iteration [30, 31]).

The effectiveness of HuRL depends on the heuristic quality, so we design HuRL to employ a sequence of mixing coefficients (i.e. $\lambda$s) that increases as the agent gathers more data from the environment. Such a strategy induces a learning curriculum that enables HuRL to remain robust to non-ideal heuristics. HuRL starts off by guiding the agent's search direction with a heuristic. As the agent becomes more experienced, it gradually removes the guidance and lets the agent directly optimize the true long-term return.
We empirically validate HuRL in MuJoCo [32] robotic control problems and Procgen games [33] with various heuristics and base RL algorithms. The experimental results demonstrate the versatility and effectiveness of HuRL in accelerating RL algorithms.

## 2 Preliminaries

### 2.1 Notation

We focus on discounted infinite-horizon Markov decision processes (MDPs) for ease of exposition. The technique proposed here can be extended to other MDP settings.³ A discounted infinite-horizon MDP is denoted as a 5-tuple $M = (S, A, P, r, \gamma)$, where $S$ is the state space, $A$ is the action space, $P(s'|s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. Without loss of generality, we assume $r: S \times A \to [0, 1]$. We allow the state and action spaces $S$ and $A$ to be either discrete or continuous. Let $\Delta(\cdot)$ denote the space of probability distributions. A decision-making policy $\pi$ is a conditional distribution $\pi: S \to \Delta(A)$, which can be deterministic. We define some shorthand for writing expectations: for a state distribution $d \in \Delta(S)$ and a function $V: S \to \mathbb{R}$, we define $V(d) := \mathbb{E}_{s \sim d}[V(s)]$; similarly, for a policy $\pi$ and a function $Q: S \times A \to \mathbb{R}$, we define $Q(s, \pi) := \mathbb{E}_{a \sim \pi(\cdot|s)}[Q(s, a)]$. Lastly, we define $\mathbb{E}_{s'|s,a} := \mathbb{E}_{s' \sim P(\cdot|s,a)}$.

²We consider the RL setting for imitation, where we suppose the rewards of expert trajectories are available.
³The results here can be readily applied to finite-horizon MDPs; for other infinite-horizon MDPs, we need further, e.g., mixing assumptions for limits to exist.

Central to solving MDPs are the concepts of value functions and average distributions. For a policy $\pi$, we define its state value function $V^\pi$ as $V^\pi(s) := \mathbb{E}_{\rho^\pi_s}\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$, where $\rho^\pi_s$ denotes the trajectory distribution of $s_0, a_0, s_1, \dots$ induced by running $\pi$ starting from $s_0 = s$. We define the state-action value function (or the Q-function) as $Q^\pi(s, a) := r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V^\pi(s')]$. We denote the optimal policy as $\pi^*$ and its state value function as $V^* := V^{\pi^*}$. Under the assumption that rewards are in $[0, 1]$, we have $V^\pi(s), Q^\pi(s, a) \in [0, \frac{1}{1-\gamma}]$ for all $\pi$, $s \in S$, and $a \in A$. We denote the initial state distribution of interest as $d_0 \in \Delta(S)$ and the state distribution of policy $\pi$ at time $t$ as $d^\pi_t$, with $d^\pi_0 = d_0$. Given $d_0$, we define the average state distribution of a policy $\pi$ as $d^\pi := (1 - \gamma)\sum_{t=0}^\infty \gamma^t d^\pi_t$. With a slight abuse of notation, we also write $d^\pi(s, a) := d^\pi(s)\pi(a|s)$.

### 2.2 Setup: Reinforcement Learning with Heuristics

We consider RL with prior knowledge expressed in the form of a heuristic value function. The goal is to find a policy $\pi$ that has a high return through interactions with an unknown MDP $M$, i.e., $\max_\pi V^\pi(d_0)$. While the agent here does not fully know $M$, we suppose that, before interactions start, the agent is provided with a heuristic $h: S \to \mathbb{R}$ which it can query throughout learning.

The heuristic $h$ represents a prior guess of the optimal value function $V^*$ of $M$. Common sources of heuristics are domain knowledge as typically employed in planning, and logged data collected by exploratory or expert behavioral policies. In the latter case, a heuristic guess of $V^*$ can be computed from the data by offline RL algorithms. For instance, when we have trajectories of an expert behavioral policy, a Monte-Carlo regression estimate of the observed returns may be a good guess of $V^*$. Using heuristics to solve MDP problems has been popular in planning and control, but their usage is rather limited in RL.
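Since the experiments in Section 5 construct heuristics in exactly this Monte-Carlo fashion (least-squares regression of observed returns onto states; see Appendix C.1), here is a minimal sketch of that construction. It assumes logged trajectories of (states, rewards) pairs; the network size, optimizer settings, and function names (`fit_heuristic`, `monte_carlo_returns`) are illustrative choices and not taken from the released code.

```python
import numpy as np
import torch
import torch.nn as nn

def monte_carlo_returns(rewards, gamma):
    """Discounted reward-to-go for one trajectory (truncated at its end)."""
    returns, g = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def fit_heuristic(trajectories, gamma, state_dim, epochs=50, lr=1e-3):
    """Least-squares regression of Monte-Carlo returns onto states.

    trajectories: list of (states, rewards) with states of shape [T, state_dim], rewards of shape [T].
    Returns a callable h(s) giving the learned heuristic value for a batch of states.
    """
    xs = np.concatenate([s for s, _ in trajectories])
    ys = np.concatenate([monte_carlo_returns(r, gamma) for _, r in trajectories])
    x = torch.as_tensor(xs, dtype=torch.float32)
    y = torch.as_tensor(ys, dtype=torch.float32)

    net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x).squeeze(-1), y)  # least-squares fit
        loss.backward()
        opt.step()

    def h(s):
        with torch.no_grad():
            return net(torch.as_tensor(s, dtype=torch.float32)).squeeze(-1).numpy()
    return h
```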
The closest provable technique in RL is PBRS [29], where the reward is modified into $\bar{r}(s, a) := r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[h(s')] - h(s)$. It can be shown that this transformation does not introduce bias into the policy ordering, and therefore solving the new MDP $\bar{M} := (S, A, P, \bar{r}, \gamma)$ would yield the same optimal policy $\pi^*$ as $M$. Conceptually, when the heuristic is the optimal value function $h = V^*$, the agent should be able to find the optimal policy $\pi^*$ of $M$ by acting myopically, as $V^*$ already contains all necessary long-term information for good decision making. However, running an RL algorithm with the PBRS reward (i.e. solving $\bar{M} := (S, A, P, \bar{r}, \gamma)$) does not take advantage of this shortcut. To make learning efficient, we need to also let the base RL algorithm know that acting greedily (i.e., using a smaller discount) with the shaped reward can yield good policies. An intuitive idea is to run the RL algorithm to maximize $V^\pi_\lambda(d_0)$, where $V^\pi_\lambda$ denotes the value function of $\pi$ in the MDP $M_\lambda := (S, A, P, \bar{r}, \lambda\gamma)$ for some $\lambda \in [0, 1]$. However, this does not always work. For example, when $\lambda = 0$, $\max_\pi V^\pi_\lambda(d_0)$ only optimizes for the initial states $d_0$, but obviously the agent is going to encounter other states in $M$. We next propose a provably correct version, HuRL, to leverage this short-horizon insight.

## 3 Heuristic-Guided Reinforcement Learning

We propose a general framework, HuRL, for leveraging heuristics to accelerate RL. In contrast to tabula rasa RL algorithms that attempt to directly solve the long-horizon MDP $M$, HuRL uses a heuristic to guide the agent in solving a sequence of short-horizon MDPs so as to amortize the complexity of long-term credit assignment. In effect, HuRL creates a heuristic-based learning curriculum to help the agent learn faster.

### 3.1 Algorithm

HuRL takes a reduction-based approach to realize the idea of heuristic guidance. As summarized in Algorithm 1, HuRL takes a heuristic $h: S \to \mathbb{R}$ and a base RL algorithm $L$ as input, and outputs an approximately optimal policy for the original MDP $M$. During training, HuRL iteratively runs the base algorithm $L$ to collect data from the MDP $M$ and then uses the heuristic $h$ to modify the agent's collected experiences.

Algorithm 1: Heuristic-Guided Reinforcement Learning (HuRL)
Require: MDP $M = (S, A, P, r, \gamma)$, RL algorithm $L$, heuristic $h$, mixing coefficients $\{\lambda_n\}$.
1: for $n = 1, \dots, N$ do
2:   $D_n \leftarrow L.\mathrm{CollectData}(M)$
3:   Get $\lambda_n$ from $\{\lambda_n\}$ and construct $\widetilde{M}_n = (S, A, P, \tilde{r}_n, \tilde{\gamma}_n)$ according to (2) using $h$ and $\lambda_n$
4:   $\pi_n \leftarrow L.\mathrm{Train}(D_n, \widetilde{M}_n)$
5: end for
6: return $\pi_N$

Namely, in iteration $n$, the agent interacts with the original MDP $M$ and saves the raw transition tuples⁴ $D_n = \{(s, a, r, s')\}$ (line 2). HuRL then defines a reshaped MDP $\widetilde{M}_n := (S, A, P, \tilde{r}_n, \tilde{\gamma}_n)$ (line 3) by changing the rewards and lowering the discount factor:

$$\tilde{r}_n(s, a) := r(s, a) + (1 - \lambda_n)\gamma\,\mathbb{E}_{s'|s,a}[h(s')] \quad\text{and}\quad \tilde{\gamma}_n := \lambda_n\gamma, \tag{2}$$

where $\lambda_n \in [0, 1]$ is the mixing coefficient. The new discount $\tilde{\gamma}_n$ effectively gives $\widetilde{M}_n$ a shorter horizon than $M$'s, while the heuristic $h$ is blended into the new reward in (2) to account for the missing long-term information. We call $\tilde{\gamma}_n = \lambda_n\gamma$ in (2) the guidance discount to be consistent with prior literature [20], which can be viewed in terms of our framework as using a zero heuristic. In the last step (line 4), HuRL calls the base algorithm $L$ to perform updates with respect to the reshaped MDP $\widetilde{M}_n$.

⁴If $L$ learns only with trajectories, we transform each tuple and assemble them to get the modified trajectory.
This is realized by 1) setting the discount factor used in $L$ to $\tilde{\gamma}_n$, and 2) setting the sampled reward to $r + (\gamma - \tilde{\gamma}_n)h(s')$ for every transition tuple $(s, a, r, s')$ collected from $M$. We remark that the base algorithm $L$ in line 2 always collects trajectories of lengths proportional to the original discount $\gamma$, while internally the optimization in line 4 is done with the lower discount $\tilde{\gamma}_n$. Over the course of training, HuRL repeats the above steps with a sequence of increasing mixing coefficients $\{\lambda_n\}$. From (2) we see that, as the agent interacts with the environment, the effect of the heuristic in MDP reshaping decreases and the effective horizon of the reshaped MDP increases.

### 3.2 HuRL as Horizon-based Regularization

We can think of HuRL as introducing a horizon-based regularization for RL, where the regularization center is defined by the heuristic and its strength diminishes as the mixing coefficient increases. As the agent collects more experiences, HuRL gradually removes the effects of regularization and the agent eventually optimizes for the original MDP.

HuRL's regularization is designed to reduce learning variance, similar to the role of regularization in supervised learning. Unlike the typical weight decay imposed on function approximators (such as the agent's policy or value networks), our proposed regularization leverages the structure of MDPs to regulate the complexity of the MDP the agent faces, which scales with the MDP's discount factor (or, equivalently, the horizon). When the guidance discount $\tilde{\gamma}_n$ is lower than the original discount $\gamma$ (i.e. $\lambda_n < 1$), the reshaped MDP $\widetilde{M}_n$ given by (2) has a shorter horizon and requires fewer samples to solve. However, the reduced complexity comes at the cost of bias, because the agent is now incentivized toward maximizing the performance with respect to the heuristic rather than the original long-term returns of $M$. In the extreme case of $\lambda_n = 0$, HuRL would solve a zero-horizon contextual bandit problem with contexts (i.e. states) sampled from $d^\pi$ of $M$.

### 3.3 A Toy Example

We illustrate this idea in a chain MDP environment in Fig. 1. The optimal policy $\pi^*$ for this MDP's original $\gamma = 0.9$ always picks the right action, as shown in Fig. 1b-(1), giving the optimal value $V^*$ in Fig. 1a-(2). Suppose we used a smaller guidance discount $\tilde{\gamma} = 0.5\gamma$ to accelerate learning. This is equivalent to HuRL with a zero heuristic $h = 0$ and $\lambda = 0.5$. Solving this reshaped MDP yields a policy $\tilde{\pi}$ that acts very myopically in the original MDP, as shown in Fig. 1b-(2); the value function of $\tilde{\pi}$ in the original MDP is visualized in Fig. 1a-(4).

Now, suppose we use Fig. 1a-(4) as a heuristic in HuRL instead of $h = 0$. This is a bad choice of heuristic (Bad $h$), as it introduces a large bias with respect to $V^*$ (cf. Fig. 1a-(2)). On the other hand, we can roll out a random policy in the original MDP and use its value function as the heuristic (Good $h$), shown in Fig. 1a-(3). Though the random policy has an even lower return at the initial state $s = 3$, it gives a better heuristic because this heuristic shares the same trend as $V^*$ (Fig. 1a-(2)). HuRL run with Good $h$ and Bad $h$ yields the policies in Fig. 1b-(3,4), and the quality of the resulting solutions in the original MDP, $V^{\tilde{\pi}^*_\lambda}(d_0)$, is reported in Fig. 1c for different $\lambda$. Observe that HuRL with a good heuristic can achieve $V^*(d_0)$ with a much smaller horizon, $\lambda \approx 0.5$. Using a bad $h$ does not lead to $\pi^*$ at all when $\lambda = 0.5$ (Fig. 1b-(4)) but is guaranteed to do so when $\lambda$ converges to 1 (Fig. 1b-(5)).
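To make lines 3-4 of Algorithm 1 and the relabeling above concrete, here is a minimal sketch for a replay-buffer-style base learner. The `learner.collect_data`/`learner.train` interface and the terminal-state masking are assumptions made for illustration; this is not the paper's released implementation.

```python
import numpy as np

def reshape_transitions(batch, h, gamma, lam):
    """HuRL relabeling (Eq. 2): r_tilde = r + (gamma - gamma_tilde) * h(s'), gamma_tilde = lam * gamma.

    batch: dict of arrays 's', 'a', 'r', 's_next', 'done' collected from the original MDP M.
    h: callable mapping a batch of states to heuristic values.
    Returns relabeled rewards and the guidance discount to hand to the base RL algorithm.
    """
    gamma_tilde = lam * gamma
    # (gamma - gamma_tilde) = (1 - lam) * gamma, matching Eq. (2).
    # Masking terminal next-states is an implementation choice, not specified in the paper.
    bonus = (gamma - gamma_tilde) * h(batch["s_next"]) * (1.0 - batch["done"])
    return batch["r"] + bonus, gamma_tilde

def hurl(env, learner, h, gamma, lambdas):
    """Outer loop of Algorithm 1 under the assumed learner interface."""
    policy = None
    for lam in lambdas:                       # increasing schedule {lambda_n}
        batch = learner.collect_data(env)     # line 2: interact with the original MDP M
        r_tilde, gamma_tilde = reshape_transitions(batch, h, gamma, lam)  # line 3
        batch = dict(batch, r=r_tilde)
        policy = learner.train(batch, discount=gamma_tilde)               # line 4
    return policy                             # line 6: return pi_N
```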
Figure 1: Example of HuRL in a chain MDP. (a) Heatmap of different values. (b) Different policy behaviors. (c) HuRL with different $h$ and $\lambda$. Each cell in a row in each diagram represents a state from $S = \{1, \dots, 10\}$. The agent starts at state 3 ($s_0$), and states 1 and 10 are absorbing (Abs in subfigure a-(1)). Actions move the agent left or right in the chain unless the agent is in an absorbing state. Subfig. a-(1) shows the reward function: nonzero rewards of 0.1, 0.2, and 0.1 are given for one particular action at states 2, 4, and 5, respectively; all other state-action pairs yield $r = 0$. Subfig. a-(2) shows $V^*$ for $\gamma = 0.9$. Subfig. a-(3) shows a good heuristic, $h \approx V^\pi$ for a random $\pi$. Subfig. a-(4) shows a bad heuristic, $h \approx V^\pi$ for a myopic $\pi$. Subfig. b-(1): $\pi^*$ for $V^*$ from a-(2). Subfig. b-(2): the policy from HuRL with $h = 0$, $\lambda = 0.5$. Subfig. b-(3): the policy from HuRL with the good $h$ from a-(3) and $\lambda = 0.5$. Subfig. b-(4): the policy from the bad $h$ from a-(4), $\lambda = 0.5$. Subfig. b-(5): the policy from the bad $h$ and $\lambda = 1$. Subfig. (c) illustrates the takeaway message: using HuRL with a good $h$ can find $\pi^*$ from $s_0$ even with a small $\lambda$ (see the x-axis), while HuRL with a bad $h$ requires a much higher $\lambda$ to discover $\pi^*$.

## 4 Theoretical Analysis

When can HuRL accelerate learning? Similar to typical regularization techniques, the horizon-based regularization of HuRL leads to a bias-variance decomposition that can be optimized for better finite-sample performance compared to directly solving the original MDP. However, a non-trivial trade-off is possible only when the regularization biases the learning in a good direction. In HuRL's case this is determined by the heuristic, which resembles a prior we encode into learning.

In this section we provide HuRL's theoretical foundation. We first describe the bias-variance trade-off induced by HuRL. Then we show how suboptimality in solving the reshaped MDP translates into performance in the original MDP, and identify the assumptions HuRL needs the base RL algorithm to satisfy. In addition, we explain how HuRL relates to PBRS, and characterize the quality of heuristics and sufficient conditions for constructing good heuristics from batch data using offline RL.

For clarity, we will focus on the reshaped MDP $\widetilde{M} = (S, A, P, \tilde{r}, \tilde{\gamma})$ for a fixed $\lambda \in [0, 1]$, where $\tilde{r}, \tilde{\gamma}$ are defined in (1). We can view this MDP as the one in a single iteration of HuRL. For a policy $\pi$, we denote its state value function in $\widetilde{M}$ as $\widetilde{V}^\pi$, and the optimal policy and value function of $\widetilde{M}$ as $\tilde{\pi}^*$ and $\widetilde{V}^*$, respectively. The missing proofs of the results from this section can be found in Appendix A.

### 4.1 Short-Horizon Reduction: Performance Decomposition

Our main result is a performance decomposition, which characterizes how a heuristic $h$ and suboptimality in solving the reshaped MDP $\widetilde{M}$ relate to performance in the original MDP $M$.

**Theorem 4.1.** For any policy $\pi$, heuristic $h: S \to \mathbb{R}$, and mixing coefficient $\lambda \in [0, 1]$,
$$V^*(d_0) - V^\pi(d_0) = \mathrm{Regret}(h, \lambda, \pi) + \mathrm{Bias}(h, \lambda, \pi),$$
where we define
$$\mathrm{Regret}(h, \lambda, \pi) := \lambda\big(\widetilde{V}^*(d_0) - \widetilde{V}^\pi(d_0)\big) + \tfrac{1-\lambda}{1-\gamma}\big(\widetilde{V}^*(d^\pi) - \widetilde{V}^\pi(d^\pi)\big) \tag{3}$$
$$\mathrm{Bias}(h, \lambda, \pi) := V^*(d_0) - \widetilde{V}^*(d_0) + \tfrac{\gamma(1-\lambda)}{1-\gamma}\,\mathbb{E}_{s,a \sim d^\pi}\mathbb{E}_{s'|s,a}\big[h(s') - \widetilde{V}^*(s')\big] \tag{4}$$
Furthermore, for all $b \in \mathbb{R}$, $\mathrm{Bias}(h, \lambda, \pi) = \mathrm{Bias}(h+b, \lambda, \pi)$ and $\mathrm{Regret}(h, \lambda, \pi) = \mathrm{Regret}(h+b, \lambda, \pi)$.

The theorem shows that the suboptimality of a policy $\pi$ in the original MDP $M$ can be decomposed into 1) a bias term due to solving a reshaped MDP $\widetilde{M}$ instead of the original MDP $M$, and 2) a regret term (i.e. the learning variance) due to $\pi$ being suboptimal in the reshaped MDP $\widetilde{M}$. Moreover, it shows that heuristics are equivalent up to constant offsets.
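The constant-offset claim can be checked directly from (1), (3), and (4); the following short calculation sketches the cancellation, using $\sum_{t \ge 0}(\lambda\gamma)^t = \tfrac{1}{1-\lambda\gamma}$:

```latex
\begin{align*}
\tilde r_{h+b}(s,a) &= \tilde r_{h}(s,a) + (1-\lambda)\gamma b
  \quad\Longrightarrow\quad \widetilde V^{\pi}_{h+b} = \widetilde V^{\pi}_{h} + c,
  \qquad c := \tfrac{(1-\lambda)\gamma b}{1-\lambda\gamma}, \\
\text{Regret (3):}\quad & \text{only differences } \widetilde V^{*} - \widetilde V^{\pi}
  \text{ appear, so the common shift } c \text{ cancels;} \\
\text{Bias (4):}\quad & \underbrace{-c}_{\text{from } V^{*}(d_0)-\widetilde V^{*}(d_0)}
  \;+\; \tfrac{\gamma(1-\lambda)}{1-\gamma}\,\underbrace{(b - c)}_{=\,\frac{(1-\gamma)b}{1-\lambda\gamma}}
  \;=\; -c + \tfrac{\gamma(1-\lambda)b}{1-\lambda\gamma} \;=\; 0 .
\end{align*}
```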
In other words, only the relative ordering between states that a heuristic induces matters in learning, not the absolute values. Balancing the two terms trades off bias and variance in learning. Using a smaller $\lambda$ replaces the long-term information with the heuristic and makes the horizon of the reshaped MDP $\widetilde{M}$ shorter. Therefore, given a finite interaction budget, the regret term in (3) can be more easily minimized, though the bias term in (4) can potentially be large if the heuristic is bad. On the contrary, with $\lambda = 1$ the bias is completely removed, as the agent solves the original MDP $M$ directly.

### 4.2 Regret, Algorithm Requirement, and Relationship with PBRS

The regret term in (3) characterizes the performance gap due to $\pi$ being suboptimal in the reshaped MDP $\widetilde{M}$, because $\mathrm{Regret}(h, \lambda, \tilde{\pi}^*) = 0$ for any $h$ and $\lambda$. For learning, we need the base RL algorithm $L$ to find a policy $\pi$ such that the regret term in (3) is small. By the definition in (3), the base RL algorithm $L$ is required not only to find a policy $\pi$ such that $\widetilde{V}^*(s) - \widetilde{V}^\pi(s)$ is small for states from $d_0$, but also for states $\pi$ visits when rolling out in the original MDP $M$. In other words, it is insufficient for the base RL algorithm to only optimize for $\widetilde{V}^\pi(d_0)$ (the performance in the reshaped MDP with respect to the initial state distribution; see Section 2.2). For example, suppose $\lambda = 0$ and $d_0$ concentrates on a single state $s_0$. Then maximizing $\widetilde{V}^\pi(d_0)$ alone would only optimize $\pi(\cdot|s_0)$, and the policy $\pi$ need not know how to act in other parts of the state space. To use HuRL, we need the base algorithm to learn a policy $\pi$ that has small action gaps in the reshaped MDP $\widetilde{M}$ but along trajectories in the original MDP $M$, as we show below. This property is satisfied by off-policy RL algorithms such as Q-learning [34].

**Proposition 4.1.** For any policy $\pi$, heuristic $h: S \to \mathbb{R}$, and mixing coefficient $\lambda \in [0, 1]$,
$$\mathrm{Regret}(h, \lambda, \pi) = -\mathbb{E}_{\rho^\pi(d_0)}\Big[\textstyle\sum_{t=0}^\infty \gamma^t \widetilde{A}^*(s_t, a_t)\Big],$$
where $\rho^\pi(d_0)$ denotes the trajectory distribution of running $\pi$ from $d_0$, and $\widetilde{A}^*(s, a) = \tilde{r}(s, a) + \tilde{\gamma}\,\mathbb{E}_{s'|s,a}[\widetilde{V}^*(s')] - \widetilde{V}^*(s) \le 0$ is the action gap with respect to the optimal policy $\tilde{\pi}^*$ of $\widetilde{M}$.

Another way to comprehend the regret term is through studying its dependency on $\lambda$. When $\lambda = 1$, $\mathrm{Regret}(h, 1, \pi) = V^*(d_0) - V^\pi(d_0)$, which is identical to the policy regret in $M$ for a fixed initial distribution $d_0$. On the other hand, when $\lambda = 0$, $\mathrm{Regret}(h, 0, \pi) = \max_{\pi'} \frac{1}{1-\gamma}\mathbb{E}_{s \sim d^\pi}[\tilde{r}(s, \pi') - \tilde{r}(s, \pi)]$, which is the regret of a non-stationary contextual bandit problem where the context distribution is $d^\pi$ (the average state distribution of $\pi$). In general, for $\lambda \in (0, 1)$, the regret notion mixes a short-horizon non-stationary problem and a long-horizon stationary problem.

One natural question is whether the reshaped MDP $\widetilde{M}$ has a more complicated and larger value landscape than the original MDP $M$, because these characteristics may affect the regret rate of a base algorithm. We show that $\widetilde{M}$ preserves the value bounds and linearity of the original MDP.

**Proposition 4.2.** Reshaping the MDP as in (1) preserves the following characteristics: 1) If $h(s) \in [0, \frac{1}{1-\gamma}]$, then $\widetilde{V}^\pi(s) \in [0, \frac{1}{1-\gamma}]$ for all $\pi$ and $s \in S$. 2) If $M$ is a linear MDP with feature vector $\phi(s, a)$ (i.e. $r(s, a)$ and $\mathbb{E}_{s'|s,a}[g(s')]$ for any $g$ can be linearly parametrized in $\phi(s, a)$), then $\widetilde{M}$ is also a linear MDP with feature vector $\phi(s, a)$.

On the contrary, the MDP $M_\lambda := (S, A, P, \bar{r}, \lambda\gamma)$ in Section 2.2 does not have these properties.
We can show that $M_\lambda$ is equivalent to $\widetilde{M}$ up to a PBRS transformation (i.e., $\bar{r}(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[h(s')] - h(s)$). Thus, HuRL incorporates the guidance discount into PBRS while retaining the nicer properties above.

### 4.3 Bias and Heuristic Quality

The bias term in (4) characterizes suboptimality due to using a heuristic $h$ in place of long-term state values in $M$. What is the best heuristic in this case? From the definition of the bias term in (4), we see that the ideal heuristic is the optimal value $V^*$, as $\mathrm{Bias}(V^*, \lambda, \pi) = 0$ for all $\lambda \in [0, 1]$. By continuity, we can expect that if $h$ deviates from $V^*$ a little, then the bias is small.

**Corollary 4.1.** If $\inf_{b \in \mathbb{R}} \|h + b - V^*\|_\infty \le \epsilon$, then $\mathrm{Bias}(h, \lambda, \pi) = O\big(\tfrac{(1-\lambda)\gamma}{(1-\gamma)(1-\lambda\gamma)}\,\epsilon\big)$.

To better understand how the heuristic $h$ affects the bias, we derive an upper bound on the bias by replacing the first term in (4) with an upper bound that depends only on $\pi^*$.

**Proposition 4.3.** For $g: S \to \mathbb{R}$ and $\eta \in [0, 1]$, define $C(\pi, g, \eta) := \mathbb{E}_{\rho^\pi(d_0)}\big[\sum_{t=1}^\infty \eta^{t-1} g(s_t)\big]$. Then $\mathrm{Bias}(h, \lambda, \pi) \le (1-\lambda)\gamma\big(C(\pi^*, V^* - h, \lambda\gamma) + C(\pi, h - \widetilde{V}^*, \gamma)\big)$.

In Proposition 4.3, the term $(1-\lambda)\gamma\,C(\pi^*, V^* - h, \lambda\gamma)$ is the underestimation error of the heuristic $h$ on the states visited by the optimal policy $\pi^*$ in the original MDP $M$. Therefore, to minimize the first term in the bias, we would want the heuristic $h$ to be large along the paths that $\pi^*$ generates. However, Proposition 4.3 also discourages the heuristic from being arbitrarily large, because the second term in the bias in (4) (or, equivalently, the second term in Proposition 4.3) incentivizes the heuristic to underestimate the optimal value of the reshaped MDP, $\widetilde{V}^*$.

More precisely, the second term requires the heuristic to obey some form of spatial consistency. A quick intuition is the observation that if $h(s) = V^{\pi'}(s)$ for some $\pi'$ or $h(s) = 0$, then $h(s) \le \widetilde{V}^*(s)$ for all $s \in S$. More generally, we show that if the heuristic is improvable with respect to the original MDP $M$ (i.e. the heuristic value is lower than the max of its Bellman backup), then $h(s) \le \widetilde{V}^*(s)$. By Proposition 4.3, learning with an improvable heuristic in HuRL therefore has a much smaller bias.

**Definition 4.1.** Define the Bellman operator $(\mathcal{B}h)(s, a) := r(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[h(s')]$. A heuristic function $h: S \to \mathbb{R}$ is said to be improvable with respect to an MDP $M$ if $\max_a (\mathcal{B}h)(s, a) \ge h(s)$ for all $s \in S$.

**Proposition 4.4.** If $h$ is improvable with respect to $M$, then $\widetilde{V}^*(s) \ge h(s)$ for all $s \in S$ and all $\lambda \in [0, 1]$.

### 4.4 Pessimistic Heuristics are Good Heuristics

While Corollary 4.1 shows that HuRL can handle an imperfect heuristic, this result is not ideal. The corollary depends on the $\ell_\infty$ approximation error, which can be difficult to control in large state spaces. Here we provide a more refined sufficient condition for good heuristics. We show that the concept of pessimism in the face of uncertainty provides a finer mechanism for controlling the approximation error of a heuristic and allows us to remove the $\ell_\infty$-type error. This result is useful for constructing heuristics from data that does not have sufficient support.

From Proposition 4.3 we see that the source of the $\ell_\infty$ error is the second term in the bias upper bound, as it depends on the states that the agent's policy visits, which can change during learning. To remove this dependency, we can use improvable heuristics (see Proposition 4.4), as they satisfy $h(s) \le \widetilde{V}^*(s)$. Below we show that Bellman-consistent pessimism yields improvable heuristics.

**Proposition 4.5.** Suppose $h(s) = Q(s, \pi')$ for some policy $\pi'$ and function $Q: S \times A \to \mathbb{R}$ such that $Q(s, a) \le (\mathcal{B}h)(s, a)$ for all $s \in S$, $a \in A$. Then $h$ is improvable and $h(s) \le V^{\pi'}(s)$ for all $s \in S$.
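A short sketch of why the condition in Proposition 4.5 gives both improvability and pessimism (the formal proof is deferred to Appendix A):

```latex
% Improvability: evaluate the assumed inequality at \pi' and relax to the max over actions.
h(s) \;=\; Q(s,\pi') \;\le\; (\mathcal{B}h)(s,\pi') \;\le\; \max_a (\mathcal{B}h)(s,a).
% Pessimism w.r.t. \pi': h(s) \le r(s,\pi') + \gamma\,\mathbb{E}_{s'|s,\pi'}[h(s')]; unrolling the inequality gives
h(s) \;\le\; \mathbb{E}_{\rho^{\pi'}_{s}}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t} r(s_t,a_t)\Big] \;=\; V^{\pi'}(s).
```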
The Bellman-consistent pessimism in Proposition 4.5 essentially says that $h$ is pessimistic with respect to the Bellman backup. This condition has been used as the foundation for designing pessimistic off-policy RL algorithms, such as pessimistic value iteration [30] and algorithms based on pessimistic absorbing MDPs [31]. In other words, these pessimistic algorithms can be used to construct, from offline data, good heuristics with small bias in the sense of Proposition 4.3. With such a heuristic, the bias upper bound simplifies to $\mathrm{Bias}(h, \lambda, \pi) \le (1-\lambda)\gamma\,C(\pi^*, V^* - h, \lambda\gamma)$. Therefore, as long as enough batch data are sampled from a distribution that covers the states that $\pi^*$ visits, these pessimistic algorithms can construct, with high probability, good heuristics with nearly zero bias for HuRL.

## 5 Experiments

We validate our framework HuRL experimentally in MuJoCo (commercial license) [32] robotic control problems and Procgen games (MIT License) [33], where soft actor-critic (SAC) [35] and proximal policy optimization (PPO) [36] were used as the base RL algorithms, respectively.⁵ The goal is to study whether HuRL can accelerate learning by shortening the horizon with heuristics. In particular, we conduct studies to investigate the effects of different heuristics and mixing coefficients. Since the main focus here is on the possibility of leveraging a given heuristic to accelerate RL algorithms, in these experiments we used vanilla techniques to construct heuristics for HuRL. Experimentally studying the design of heuristics for a domain or a batch of data is beyond the scope of the current paper, but it is an important future research direction. Due to space limitations, here we report only the results of the MuJoCo experiments. The results on Procgen games, along with other experimental details, can be found in Appendix C.

⁵Code to replicate all experiments is available at https://github.com/microsoft/HuRL.
⁶The reward is zero at the goal and -1 otherwise.

We consider four MuJoCo environments with dense rewards (Hopper-v2, HalfCheetah-v2, Humanoid-v2, and Swimmer-v2) and a sparse-reward version of Reacher-v2 (denoted as Sparse-Reacher-v2).⁶ We design the experiments to simulate two learning scenarios. First, we use Sparse-Reacher-v2 to simulate the setting where an engineered heuristic based on domain knowledge is available; since this is a goal-reaching task, we designed a heuristic $h(s) = r(s, a) - 100\,\|e(s) - g(s)\|$, where $e(s)$ and $g(s)$ denote the robot's end-effector position and the goal position, respectively. Second, we use the dense-reward environments to model scenarios where a batch of data collected by multiple behavioral policies is available before learning, and a heuristic is constructed by an offline policy evaluation algorithm from the batch data (see Appendix C.1 for details). In brief, we generated these behavioral policies by running SAC from scratch and saving the intermediate policies generated in training. We then use least-squares regression to fit a neural network to predict the empirical Monte-Carlo returns of the trajectories in the sampled batch of data. We also use behavior cloning (BC) to warm-start all RL agents based on the same batch dataset in the dense-reward experiments.

The base RL algorithm here, SAC, is based on the standard implementation in Garage (MIT License) [37]. The policy and value networks are fully connected, independent neural networks. The policy is Tanh-Gaussian and the value network has a linear head.

**Algorithms.** We compare the performance of the algorithms below.
1) BC
2) SAC
3) SAC with BC warm start (SAC w/ BC)
4) HuRL with the engineered heuristic (HuRL)
5) HuRL with a zero heuristic and BC warm start (HuRL-zero)
6) HuRL with the Monte-Carlo heuristic and BC warm start (HuRL-MC)
7) SAC with the PBRS reward (and BC warm start, if applicable) (PBRS)

For the HuRL algorithms, the mixing coefficient was scheduled as $\lambda_n = \lambda_0 + (1 - \lambda_0)\,c_\omega \tanh(\omega(n-1))$, for $n = 1, \dots, N$, where $\lambda_0 \in [0, 1]$, $\omega > 0$ controls the increasing rate, and $c_\omega$ is a normalization constant chosen so that the schedule ends at $\lambda = 1$ and $\lambda_n \in [0, 1]$. We chose these algorithms to study the effect of each additional warm-start component (BC and heuristics) added on top of vanilla SAC. HuRL-zero is SAC w/ BC but with the extra $\lambda$ schedule above that further lowers the discount, whereas SAC and SAC w/ BC keep a constant discount factor.

**Evaluation and Hyperparameters.** In each iteration, the RL agent has a fixed sample budget for environment interactions, and its performance is measured in terms of undiscounted cumulative returns of the deterministic mean policy extracted from SAC. The hyperparameters used in the algorithms above were selected as follows. First, the learning rates and the discount factor of the base RL algorithm, SAC, were tuned for each environment. The tuned discount factor was used as the discount factor $\gamma$ of the original MDP $M$. Fixing the hyperparameters above, we additionally tuned $\lambda_0$ and $\omega$ for the $\lambda$ schedule of HuRL for each environment and each heuristic. Finally, after all these hyperparameters were fixed, we conducted additional testing runs with 30 different random seeds and report their statistics here. Sources of randomness included the data collection process of the behavioral policies, training the heuristics from batch data, BC, and online RL. However, the behavioral policies were fixed across all testing runs. We chose this hyperparameter tuning procedure to make sure that the baselines (i.e. SAC) compared in these experiments are their best versions.
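As a concrete illustration of the $\lambda$ schedule above, here is a minimal sketch. It picks $c_\omega$ so that the schedule reaches $\lambda = 1$ at the final iteration $N$; this is one way to satisfy the stated normalization and may differ from the released code.

```python
import numpy as np

def lambda_schedule(n, N, lam0, omega):
    """Mixing-coefficient schedule: lambda_n = lam0 + (1 - lam0) * c_omega * tanh(omega * (n - 1)).

    c_omega is chosen here so that lambda_N = 1 (an assumed normalization).
    """
    c_omega = 1.0 / np.tanh(omega * (N - 1))
    lam = lam0 + (1.0 - lam0) * c_omega * np.tanh(omega * (n - 1))
    return float(np.clip(lam, 0.0, 1.0))

# Example: a schedule starting at lambda_0 = 0.95 and increasing to 1 by iteration N = 100.
schedule = [lambda_schedule(n, N=100, lam0=0.95, omega=0.05) for n in range(1, 101)]
```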
### 5.2 Results Summary

Fig. 2 shows the results on the MuJoCo environments. Overall, we see that HuRL is able to leverage engineered and learned heuristics to significantly improve the learning efficiency. This trend is consistent across all environments that we tested on.

Figure 2: Experimental results. Panels: (a) Sparse-Reacher-v2, (b) Humanoid-v2, (c) Hopper-v2, (d) Swimmer-v2, (e) HalfCheetah-v2, (f) $\lambda_0$ ablation. Panel (a) uses an engineered heuristic for a sparse-reward problem; (b)-(e) use heuristics learned from offline data and share the same legend; (f) shows ablation results for different initial $\lambda_0$ in Hopper-v2. The plots show the 25th, 50th, and 75th percentiles of algorithm performance over 30 random seeds.

For the sparse-reward experiments, we see that SAC and PBRS struggle to learn, while HuRL converges to the optimal performance much faster. For the dense-reward experiments, HuRL-MC similarly converges much faster, though the gain in HalfCheetah-v2 is minor and it might have converged to a worse local maximum in Swimmer-v2. In addition, we see that warm-starting SAC using BC (i.e. SAC w/ BC) can improve the learning efficiency compared with vanilla SAC, but using BC alone does not result in a good policy. Lastly, we see that using the zero heuristic (HuRL-zero) with the extra $\lambda$-scheduling does not further improve the performance of SAC w/ BC. This comparison verifies that the learned Monte-Carlo heuristic provides non-trivial information.

Interestingly, we see that applying PBRS to SAC leads to even worse performance than running SAC with the original reward. There are two reasons why SAC+PBRS is less desirable than SAC+HuRL, as we discussed before: 1) PBRS changes the reward/value scales in the induced MDP, and popular RL algorithms like SAC are very sensitive to such changes; in contrast, HuRL induces values on the same scale, as we show in Proposition 4.2. 2) In HuRL, we are effectively providing the algorithm more side-information, letting SAC shorten the horizon when the heuristic is good.

The results in Fig. 2 also have another notable aspect. Because the datasets used in the dense-reward experiments contain trajectories collected by a range of policies, it is likely that BC suffers from disagreement in action selection among different policies. Nonetheless, training a heuristic using a basic Monte-Carlo regression seems to be less sensitive to these conflicts and still results in a helpful heuristic for HuRL. One explanation can be that heuristics are only functions of states, not of states and actions, and therefore the conflicts are minor. Another plausible explanation is that HuRL only uses the heuristic to guide learning and does not completely rely on it to make decisions. Thus, HuRL can be more robust to the heuristic quality, or, equivalently, to the quality of prior knowledge.

### 5.3 Ablation: Effects of Horizon Shortening

To further verify that the acceleration in Fig. 2 is indeed due to horizon shortening, we conducted an ablation study for HuRL-MC on Hopper-v2, whose results are presented in Fig. 2f. HuRL-MC's best $\lambda$-schedule hyperparameters on Hopper-v2, which are reflected in its performance in the aforementioned Fig. 2c, induced a near-constant schedule at $\lambda = 0.95$; to obtain the curves in Fig. 2f, we ran HuRL-MC with constant-$\lambda$ schedules for several more $\lambda$ values. Fig. 2f shows that increasing $\lambda$ above 0.98 leads to a performance drop. Since using a large $\lambda$ decreases bias and makes the reshaped MDP more similar to the original MDP, we conclude that the increased learning speed on Hopper-v2 is due to HuRL's horizon shortening (coupled with the guidance provided by its heuristic).

## 6 Related Work

**Discount regularization.** The horizon-truncation idea can be traced back to Blackwell optimality in the known-MDP setting [18]. Reducing the discount factor amounts to running HuRL with a zero heuristic. Petrik and Scherrer [19] and Jiang et al. [20, 21] study the MDP setting; Chen et al. [22] study POMDPs. Amit et al. [23] focus on discount regularization for Temporal Difference (TD) methods, while Van Seijen et al. [6] use a logarithmic mapping to lower the discount for online RL.

**Reward shaping.** Reward shaping has a long history in RL, from the seminal PBRS work [29] to recent bilevel-optimization approaches [38]. Tessler and Mannor [5] consider a problem complementary to HuRL: given a discount $\gamma'$, they find a reward $r'$ that preserves the trajectory ordering in the original MDP. Meanwhile, there is a vast literature on the bias-variance trade-off for online RL with horizon truncation. TD($\lambda$) [39, 40] and Generalized Advantage Estimation [41] blend value estimates across discount factors, while Sherstan et al. [42] use the discount factor as an input to the value function estimator. TD($\Delta$) [43] computes differences between value functions across discount factors.
**Heuristics in model-based methods.** Classic uses of heuristics include A* [24], Monte-Carlo Tree Search (MCTS) [25], and Model Predictive Control (MPC) [44]. Zhong et al. [26] shorten the horizon in MPC using a value function approximator. Hoeller et al. [27] additionally use an estimate of the running cost to trade off solution quality and amount of computation. Bejjani et al. [28] show heuristic-accelerated truncated-horizon MPC on actual robots and tune the value function throughout learning. Bhardwaj et al. [7] augment MPC with a terminal value heuristic, which can be viewed as an instance of HuRL where the base algorithm is MPC. Asai and Muise [45] learn an MDP expressible in the STRIPS formalism that can benefit from relaxation-based planning heuristics. But HuRL is more general, as it does not assume model knowledge and can work in unknown environments.

**Pessimistic extrapolation.** Offline RL techniques employ pessimistic extrapolation for robustness [30], and their learned value functions can be used as heuristics in HuRL. Kumar et al. [46] penalize out-of-distribution actions in off-policy optimization, while Liu et al. [31] additionally use a variational auto-encoder (VAE) to detect out-of-distribution states. We experimented with VAE-filtered pessimistic heuristics in Appendix C. Even pessimistic offline evaluation techniques [16] can be useful in HuRL, since function approximation often induces extrapolation errors [47].

**Heuristic pessimism vs. admissibility.** Our concept of heuristic pessimism can easily be confused with the well-established notion of admissibility [48], but in fact they are opposites. Namely, an admissible heuristic never underestimates $V^*$ (in the return-maximization setting), while a pessimistic one never overestimates $V^*$. Similarly, our notion of improvability is distinct from consistency: they express related ideas, but with regard to pessimistic and admissible value functions, respectively. Thus, counter-intuitively from the planning perspective, our work shows that for policy learning, inadmissible heuristics are desirable. Pearl [49] is one of the few works that has analyzed desirable implications of heuristic inadmissibility in planning.

**Other warm-starting techniques.** HuRL is a new way to warm-start online RL methods. Bianchi et al. [50] use a heuristic policy to initialize agents' policies. Vinyals et al. [2] and Hester et al. [10] train a value function and policy using batch IL and then use them as regularization in online RL. Nair et al. [9] use off-policy RL on batch data and fine-tune the learned policy. Recent hybrid IL-RL approaches have strong connections to HuRL [17, 51, 52]. In particular, Cheng et al. [17] is a special case of HuRL with a max-aggregation heuristic. Farahmand et al. [8] use several related tasks to learn a task-dependent heuristic and perform shorter-horizon planning or RL. Knowledge distillation approaches [53] can also be used to warm-start learning, but in contrast to them, HuRL expects prior knowledge in the form of state-value estimates, not features, and does not attempt to make the agent internalize this knowledge. A HuRL agent learns from its own environment interactions, using prior knowledge only as guidance. Reverse curriculum approaches [54] create short-horizon RL problems by initializing the agent close to the goal and moving it further away as the agent improves. This gradual increase in the horizon inspires the HuRL approach.
However, HuRL does not require the agent to be initialized on expert states and can work with many different base RL algorithms.

## 7 Discussion and Limitations

This work is an early step towards theoretically understanding the role and potential of heuristics in guiding RL algorithms. We propose a framework, HuRL, that can accelerate RL when an informative heuristic is provided. HuRL induces a horizon-based regularization of RL, complementary to existing warm-starting schemes, and we provide theoretical and empirical analyses to support its effectiveness. While this is a conceptual work without foreseeable societal impacts yet, we hope that it will help counter some of AI's risks by making RL more predictable via incorporating prior knowledge into learning.

We remark nonetheless that the effectiveness of HuRL depends on the available heuristic. While HuRL can eventually solve the original RL problem even with a non-ideal heuristic, using a bad heuristic can slow down learning. Therefore, an important future research direction is to adaptively tune the mixing coefficient based on the heuristic quality with curriculum or meta-learning techniques. In addition, while our theoretical analysis points out a strong connection between good heuristics for HuRL and pessimistic offline RL, techniques for the latter are not yet scalable and robust enough for high-dimensional problems. Further research on offline RL can unlock the full potential of HuRL.

## References

[1] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
[2] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019.
[3] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In NIPS, 2015.
[4] Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In NeurIPS, 2018.
[5] Chen Tessler and Shie Mannor. Maximizing the total reward via reward tweaking. arXiv preprint arXiv:2002.03327, 2020.
[6] Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In NeurIPS, 2019.
[7] Mohak Bhardwaj, Sanjiban Choudhury, and Byron Boots. Blending MPC & value function approximation for efficient reinforcement learning. In ICLR, 2021.
[8] Amir-massoud Farahmand, Daniel N. Nikovski, Yuji Igarashi, and Hiroki Konaka. Truncated approximate dynamic programming with task-dependent terminal value. In AAAI, 2016.
[9] Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
[10] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In AAAI, 2018.
[11] Mausam and Andrey Kolobov. Planning with Markov decision processes: An AI perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6:1-210, 2012.
[12] Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In ICML, 2020.
[13] J. Hoffmann and B. Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253-302, 2001.
[14] S. Richter and M. Westphal. The LAMA planner: Guiding cost-based anytime planning with landmarks. Journal of Artificial Intelligence Research, 39:127-177, 2010.
[15] Andrey Kolobov, Mausam, and Daniel S. Weld. Classical planning in MDP heuristics: With a little help from generalization. In AAAI, 2010.
[16] Caglar Gulcehre, Sergio Gómez Colmenarejo, Ziyu Wang, Jakub Sygnowski, Thomas Paine, Konrad Zolna, Yutian Chen, Matthew Hoffman, Razvan Pascanu, and Nando de Freitas. Regularized behavior value estimation. arXiv preprint arXiv:2103.09575, 2021.
[17] Ching-An Cheng, Andrey Kolobov, and Alekh Agarwal. Policy improvement via imitation of multiple oracles. In NeurIPS, 2020.
[18] David Blackwell. Discrete dynamic programming. The Annals of Mathematical Statistics, pages 719-726, 1962.
[19] Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower discount factor. In NIPS, 2008.
[20] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In AAMAS, 2015.
[21] Nan Jiang, Satinder Singh, and Ambuj Tewari. On structural properties of MDPs that bound loss due to shallow planning. In IJCAI, 2016.
[22] Yi-Chun Chen, Mykel J. Kochenderfer, and Matthijs T. J. Spaan. Improving offline value-function approximations for POMDPs by reducing discount factors. In IROS, 2018.
[23] Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In ICML, 2020.
[24] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100-107, 1968.
[25] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1-43, 2012.
[26] Mingyuan Zhong, Mikala Johnson, Yuval Tassa, Tom Erez, and Emanuel Todorov. Value function approximation and model predictive control. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 100-107, 2013.
[27] David Hoeller, Farbod Farshidian, and Marco Hutter. Deep value model predictive control. In CoRL, 2020.
[28] Wissam Bejjani, Rafael Papallas, Matteo Leonetti, and Mehmet R. Dogar. Planning with a receding horizon for manipulation in clutter using a learned value function. In Humanoids, pages 1-9, 2018.
[29] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, 1999.
[30] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline RL? arXiv preprint arXiv:2012.15085, 2020.
[31] Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Provably good batch off-policy reinforcement learning without great exploration. In NeurIPS, 2020.
[32] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.
[33] Sharada Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty, Gražvydas Šemetulskis, João Schapke, Jonas Kubilius, Jurgis Pašukonis, Linas Klimas, Matthew Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc, John Schulman, and Karl Cobbe. Measuring sample efficiency and generalization in reinforcement learning benchmarks: NeurIPS 2020 Procgen benchmark. arXiv preprint arXiv:2103.15332, 2021.
[34] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In NeurIPS, 2018.
[35] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018.
[36] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[37] The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
[38] Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, and Changjie Fan. Learning to utilize shaping rewards: A new approach of reward shaping. In NeurIPS, 2020.
[39] Harm Seijen and Rich Sutton. True online TD(λ). In ICML, 2014.
[40] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Beyond the one-step greedy approach in reinforcement learning. In ICML, 2018.
[41] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
[42] Craig Sherstan, Shibhansh Dohare, James MacGlashan, Johannes Günther, and Patrick M. Pilarski. Gamma-nets: Generalizing value estimation over timescale. In AAAI, 2020.
[43] Joshua Romoff, Peter Henderson, Ahmed Touati, Emma Brunskill, Joelle Pineau, and Yann Ollivier. Separating value functions across time-scales. In ICML, 2019.
[44] Jacques Richalet, André Rault, J. L. Testud, and J. Papon. Model predictive heuristic control. Automatica, 14(5):413-428, 1978.
[45] Masataro Asai and Christian Muise. Learning neural-symbolic descriptive planning models via cube-space priors: The voyage home (to STRIPS). In IJCAI, 2020.
[46] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In NeurIPS, 2020.
[47] Tyler Lu, Dale Schuurmans, and Craig Boutilier. Non-delusional Q-learning and value iteration. In NeurIPS, 2018.
[48] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 4th edition, 2020.
[49] Judea Pearl. Heuristic search theory: Survey of recent results. In IJCAI, 1981.
[50] Reinaldo A. C. Bianchi, Murilo F. Martins, Carlos H. C. Ribeiro, and Anna H. R. Costa. Heuristically-accelerated multiagent reinforcement learning. IEEE Transactions on Cybernetics, 44(2):252-265, 2013.
[51] Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In ICML, 2017.
[52] Wen Sun, J. Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. In ICLR, 2018.
[53] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[54] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In CoRL, 2017.
[55] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.
[56] Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In ICML, 2020.
[57] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In ICML, 2018.
[58] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, 2018.