# Online Planning with Lookahead Policies

Yonathan Efroni, Mohammad Ghavamzadeh, Shie Mannor

Affiliations: Microsoft Research, New York, NY; Technion, Israel; Google Research; Nvidia Research. Part of this work was done during an internship at Facebook AI Research.

Abstract. Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles the exploration. This fact makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this work we devise a multi-step greedy RTDP algorithm, which we call h-RTDP, that replaces the 1-step greedy policy with an h-step lookahead policy. We analyze h-RTDP in its exact form and establish that increasing the lookahead horizon, h, results in an improved sample complexity, at the cost of additional computation. This is the first work that proves improved sample complexity as a result of increasing the lookahead horizon in online planning. We then analyze the performance of h-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of h-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors.

1 Introduction

Dynamic Programming (DP) algorithms return an optimal policy, given a model of the environment. Their convergence in the presence of lookahead policies [4, 13] and their performance in different approximate settings [4, 25, 27, 17, 1, 14] have been well studied. Standard DP algorithms require simultaneous access to the entire state space at run time and, as such, cannot be used in practice when the number of states is too large. Real Time Dynamic Programming (RTDP) [3, 29] is a DP-based algorithm that mitigates the need to access all states simultaneously. Similarly to DP, RTDP updates are based on the Bellman operator, calculated by accessing the model of the environment. However, unlike DP, RTDP learns how to act by interacting with the environment. In each episode, RTDP interacts with the environment, acts according to the greedy action w.r.t. the Bellman operator, and samples a trajectory. RTDP is, therefore, an online planning algorithm. Despite the popularity and simplicity of RTDP and its extensions [5, 6, 24, 8, 29, 22], a precise characterization of its convergence was only recently established for finite-horizon MDPs [15]. While lookahead policies in RTDP are expected to improve convergence in some of these scenarios, as they do for DP [4, 13], to the best of our knowledge, these questions have not been addressed in previous literature. Moreover, previous research has not addressed the question of how lookahead policies should be used in RTDP, nor studied RTDP's sensitivity to possible approximation errors. Such errors can arise due to a misspecified model, or exist in value function updates when, e.g., function approximation is used.

In this paper, we initiate a comprehensive study of lookahead-policy-based RTDP with approximation errors in online planning. We start by addressing the computational complexity of calculating lookahead policies and study its advantages in approximate settings.
Lookahead policies can be computed naively by exhaustive search in $O(A^h)$ for deterministic environments or $O((AS)^h)$ for stochastic environments. Since such an approach is infeasible, we offer in Section 3 an alternative approach for obtaining a lookahead policy with a computational cost that depends linearly on a natural measure: the total number of states reachable from a state in h time steps. The suggested approach is applicable in both deterministic and stochastic environments. In Section 5, we introduce and analyze h-RTDP, an RTDP-based algorithm that replaces the 1-step greedy policy used in RTDP with an h-step lookahead policy. The analysis of h-RTDP reveals that the sample complexity is improved by increasing the lookahead horizon h. To the best of our knowledge, this is the first theoretical result that relates sample complexity to the lookahead horizon in the online planning setting. In Section 6, we analyze h-RTDP in the presence of three types of approximation: when (i) an inexact model is used instead of the true one, (ii) the value updates contain error, and (iii) approximate state abstraction is used. Interestingly, for approximate state abstraction, the convergence and computational complexity of h-RTDP depend on the size of the abstract state space.

In a broader context, this work shows that RTDP-like algorithms could be a good alternative to Monte Carlo tree search (MCTS) [7] algorithms, such as upper confidence trees (UCT) [21], an issue that was empirically investigated in [22]. We establish strong convergence guarantees for extensions of h-RTDP: under no assumption other than an initially optimistic value, RTDP-like algorithms combined with lookahead policies converge in polynomial time to an optimal policy (see Table 1), and their approximations inherit the asymptotic performance of approximate DP (ADP). Unlike RTDP, MCTS acts by using a $\sqrt{\log N / N}$ bonus term instead of optimistic initialization. However, in general, its convergence can be quite poor, even worse than uniformly random sampling [9, 26].

2 Preliminaries

Finite-Horizon MDPs. A finite-horizon MDP [4] with time-independent dynamics (the results also apply to time-dependent MDPs, but the notation becomes more involved) is a tuple $M = (\mathcal{S}, \mathcal{A}, r, p, H)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces with cardinalities $S$ and $A$, respectively, $r(s,a) \in [0,1]$ is the immediate reward of taking action $a$ at state $s$, and $p(s'|s,a)$ is the probability of transitioning to state $s'$ upon taking action $a$ at state $s$. The initial state in each episode is arbitrarily chosen and $H \in \mathbb{N}$ is the MDP's horizon. For any $N \in \mathbb{N}$, denote $[N] := \{1,\dots,N\}$. A deterministic policy $\pi : \mathcal{S} \times [H] \to \mathcal{A}$ is a mapping from states and time-step indices to actions. We denote by $a_t := \pi_t(s)$ the action taken at time $t$ at state $s$ according to a policy $\pi$. The quality of a policy $\pi$ from a state $s$ at time $t$ is measured by its value function, i.e., $V^\pi_t(s) := \mathbb{E}\big[\sum_{t'=t}^{H} r(s_{t'}, \pi_{t'}(s_{t'})) \mid s_t = s\big]$, where the expectation is over all the randomness in the environment. An optimal policy maximizes this value for all states $s \in \mathcal{S}$ and time steps $t \in [H]$, i.e., $V^*_t(s) := \max_\pi V^\pi_t(s)$, and satisfies the optimal Bellman equation,

$$V^*_t(s) = T V^*_{t+1}(s) := \max_a \big[ r(s,a) + p(\cdot|s,a) V^*_{t+1} \big] = \max_a \mathbb{E}\big[ r(s_1,a) + V^*_{t+1}(s_2) \mid s_1 = s \big]. \quad (1)$$

By repeatedly applying the optimal Bellman operator $T$, for any $h \in [H]$, we have

$$V^*_t(s) = T^h V^*_{t+h}(s) = \max_a \big[ r(s,a) + p(\cdot|s,a) T^{h-1} V^*_{t+h} \big] = \max_{\pi_t,\dots,\pi_{t+h-1}} \mathbb{E}\Big[ \sum_{t'=1}^{h} r(s_{t'}, \pi_{t+t'-1}(s_{t'})) + V^*_{t+h}(s_{h+1}) \,\Big|\, s_1 = s \Big]. \quad (2)$$

We refer to $T^h$ as the h-step optimal Bellman operator.
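To make the operators in Eqs. (1) and (2) concrete, the following is a minimal tabular sketch in Python. It is our illustration, not code from the paper; the array layout and names such as `bellman_backup` are assumptions. It applies the optimal Bellman operator T once and composes it h times to obtain T^h.

```python
import numpy as np

# A tabular finite-horizon MDP: S states, A actions.
# r[s, a] in [0, 1] is the reward; p[s, a, s2] = Pr(s2 | s, a).

def bellman_backup(r, p, v_next):
    """One application of the optimal Bellman operator T (Eq. 1).

    v_next[s2] is the value at the next time step; returns (T v_next)[s].
    """
    q = r + p @ v_next          # q[s, a] = r(s, a) + sum_s2 p(s2 | s, a) v_next(s2)
    return q.max(axis=1)        # maximize over actions

def h_step_backup(r, p, v_tail, h):
    """h-fold composition T^h v_tail (Eq. 2), i.e., backward induction over h steps."""
    v = v_tail
    for _ in range(h):
        v = bellman_backup(r, p, v)
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, h = 5, 3, 4
    r = rng.uniform(0.0, 1.0, size=(S, A))
    p = rng.dirichlet(np.ones(S), size=(S, A))   # each row p[s, a, :] sums to 1
    v_tail = np.zeros(S)                         # e.g., a zero value at the tail time step
    print(h_step_backup(r, p, v_tail, h))
```

With h = H and a zero tail value, `h_step_backup` recovers the full backward-induction solution of the finite-horizon MDP.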
A similar Bellman recursion is defined for the value of a given policy $\pi$, i.e., $V^\pi$, as $V^\pi_t(s) = T^h_\pi V^\pi_{t+h}(s) := r(s,\pi_t(s)) + p(\cdot|s,\pi_t(s))\, T^{h-1}_\pi V^\pi_{t+h}$, where $T^h_\pi$ is the h-step Bellman operator of policy $\pi$.

h-Lookahead Policy. An h-lookahead policy w.r.t. a value function $V \in \mathbb{R}^S$ returns the optimal first action in an h-horizon MDP. For a state $s \in \mathcal{S}$, it returns

$$a_h(s) \in \arg\max_a \big[ r(s,a) + p(\cdot|s,a) T^{h-1} V \big] = \arg\max_{\pi_1(s)} \max_{\pi_2,\dots,\pi_h} \mathbb{E}\Big[\sum_{t=1}^{h} r(s_t,\pi_t(s_t)) + V(s_{h+1}) \,\Big|\, s_1 = s\Big]. \quad (3)$$

The value function $V$ represents our prior knowledge of the problem. For example, it is possible to show [4] that if $V$ is close to $V^*$, then the value of an h-lookahead policy w.r.t. $V$ is close to $V^*$. For a state $s \in \mathcal{S}$ and a number of time steps $h \in [H]$, we define the set of reachable states from $s$ in $h$ steps as $\mathcal{S}_h(s) = \{s' \mid \exists \pi : p^\pi(s_{h+1} = s' \mid s_1 = s) > 0\}$, and denote by $S_h(s)$ its cardinality. We define the set of reachable states from $s$ in up to $h$ steps as $\mathcal{S}^{\mathrm{Tot}}_h(s) := \cup_{t=1}^{h} \mathcal{S}_t(s)$, its cardinality as $S^{\mathrm{Tot}}_h(s) := \sum_{t=1}^{h} S_t(s)$, and the maximum of this quantity over the entire state space as $S^{\mathrm{Tot}}_h = \max_s S^{\mathrm{Tot}}_h(s)$. Finally, we denote by $N := S^{\mathrm{Tot}}_1$ the maximum number of states accessible in one step (neighbors) from any state.

Regret and Uniform-PAC. We consider an agent that repeatedly interacts with an MDP in a sequence of episodes $[K]$. We denote by $s^k_t$ and $a^k_t$ the state and action taken at time step $t$ of the $k$-th episode. We denote by $\mathcal{F}_{k-1}$ the filtration that includes all the events (states, actions, and rewards) until the end of the $(k-1)$-th episode, as well as the initial state of the $k$-th episode. Throughout the paper, we denote by $\pi_k$ the policy executed during the $k$-th episode and assume it is $\mathcal{F}_{k-1}$-measurable. The performance of an agent is measured by its regret, defined as $\mathrm{Regret}(K) := \sum_{k=1}^{K} \big(V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1)\big)$, as well as by the Uniform-PAC criterion [10], which we generalize to deal with approximate convergence. Let $\epsilon, \delta > 0$ and let $N_\epsilon = \sum_{k=1}^{\infty} \mathbb{1}\big\{V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1) \ge \epsilon\big\}$ be the number of episodes in which the algorithm outputs a policy whose value is $\epsilon$-inferior to the optimal value. An algorithm is called Uniform-PAC if $\Pr\big(\exists \epsilon > 0 : N_\epsilon \ge F(S, 1/\epsilon, \log(1/\delta), H)\big) \le \delta$, where $F(\cdot)$ depends (at most) polynomially on its parameters. Note that Uniform-PAC implies $(\epsilon, \delta)$-PAC, and thus it is a stronger property. As we analyze algorithms with inherent errors in this paper, we use a more general notion of $\Delta$-Uniform-PAC by defining the random variable $N^{\Delta}_\epsilon = \sum_{k=1}^{\infty} \mathbb{1}\big\{V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1) \ge \Delta + \epsilon\big\}$, where $\Delta > 0$. Finally, we use $\tilde O(x)$ to represent $x$ up to constants and poly-logarithmic factors in $\delta$, and $O(x)$ to represent $x$ up to constants.

3 Computing h-Lookahead Policies

Computing an action returned by an h-lookahead policy at a given state is a main component of the RTDP-based algorithms we analyze in Sections 5 and 6. A naive procedure that returns such an action is exhaustive search. Its computational cost is $O(A^h)$ and $O((AS)^h)$ for deterministic and stochastic systems, respectively. Such an approach is impractical even for moderate values of $h$ or $S$. Instead of the naive approach, we formulate a Forward-Backward DP (FB-DP) algorithm, whose pseudo-code is given in Appendix 9. FB-DP returns an action of an h-lookahead policy from a given state $s$.
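The sketch below illustrates, in Python, the forward-backward idea behind FB-DP as we read it from this description and the complexity discussion that follows: a forward pass collects the reachable sets $\mathcal{S}_1(s), \dots, \mathcal{S}_h(s)$, and a backward pass applies the Bellman backups of Eq. (3) only on those states before returning a maximizing first action. It is an illustration under our own naming (`reachable_layers`, `lookahead_action`), not the authors' implementation.

```python
import numpy as np

def reachable_layers(p, s0, h):
    """Forward pass: layers[t] holds the states reachable from s0 in t steps (layers[0] = {s0})."""
    layers = [np.array([s0])]
    for _ in range(h):
        prev = layers[-1]
        # states reachable in one step from the previous layer under *some* action
        mask = p[prev].max(axis=(0, 1)) > 0.0
        layers.append(np.flatnonzero(mask))
    return layers

def lookahead_action(r, p, v, s0, h):
    """Backward pass: return an h-lookahead action at s0 w.r.t. the value v (Eq. 3)."""
    S = p.shape[0]
    layers = reachable_layers(p, s0, h)
    vals = np.zeros(S)
    vals[layers[h]] = v[layers[h]]          # plug in v on the states reached after h steps
    for t in range(h - 1, 0, -1):           # stages h-1, ..., 1 (everything but the root)
        new_vals = np.zeros(S)
        states = layers[t]
        q = r[states] + p[states] @ vals    # Bellman backup on reachable states only
        new_vals[states] = q.max(axis=1)
        vals = new_vals
    q_root = r[s0] + p[s0] @ vals           # final backup at the root state
    return int(np.argmax(q_root))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A, h = 8, 2, 3
    r = rng.uniform(size=(S, A))
    p = rng.dirichlet(np.ones(S), size=(S, A))
    v = np.zeros(S)                          # prior value estimate after the lookahead
    print(lookahead_action(r, p, v, 0, h))
```

Each backup touches only the states in the current layer, which is what makes the cost scale with the number of reachable states rather than with the full state space.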
Importantly, in both deterministic and stochastic systems, the computation cost of FB-DP depends linearly on the total number of reachable states from $s$ in up to $h$ steps, i.e., $S^{\mathrm{Tot}}_h(s)$. In the worst case, we may have $S_h(s) = O(\min\{A^h, S\})$. However, when $S^{\mathrm{Tot}}_h(s)$ is small, a significant improvement is achieved by avoiding unnecessary repeated computations. FB-DP has two subroutines. It first constructs the set of reachable states from state $s$ in up to $h$ steps, $\{\mathcal{S}_t(s)\}_{t=1}^{h}$, in the forward-pass. Given this set, in the second, backward-pass, it simply applies backward induction (Eq. 3) and returns an action suggested by the h-lookahead policy, $a_h(s)$. Note that at each stage $t \in [h]$ of the backward induction (applied on the set $\{\mathcal{S}_t(s)\}_{t=1}^{h}$) there are $S_t(s)$ states on which the Bellman operator is applied. Since applying the Bellman operator costs $O(NA)$ computations, the computational cost of the backward-pass is $O(NAS^{\mathrm{Tot}}_h(s))$. In Appendix 9, we describe a DP-based approach to efficiently implement the forward-pass and analyze its complexity. Specifically, we show that the computational cost of the forward-pass is equivalent to that of the backward-pass (see Proposition 8). This means the computational cost of FB-DP is $O(NAS^{\mathrm{Tot}}_h(s))$, i.e., of the same order as the cost of backward induction given the set $\mathcal{S}^{\mathrm{Tot}}_h(s)$.

4 Real-Time Dynamic Programming

Real-time dynamic programming (RTDP) [3] is a well-known online planning algorithm that assumes access to a transition model and a reward function. Unlike DP algorithms (policy iteration, value iteration, or asynchronous value iteration) [4], which solve an MDP using offline calculations and sweeps over the entire state space (possibly in random order), RTDP solves it in real time, using samples from the environment (either simulated or real) and DP-style Bellman updates at the current state. Furthermore, unlike DP algorithms, RTDP needs to trade off exploration and exploitation, since it interacts with the environment by sampling trajectories. This makes RTDP a good candidate for problems in which having access to the entire state space is not possible, but interaction is.

Algorithm 1 contains the pseudo-code of RTDP in finite-horizon MDPs. The value is initialized optimistically, $\bar V^0_{t+1}(s) = H - t \ge V^*_{t+1}(s)$. At each time step $t \in [H]$ and episode $k \in [K]$, the agent updates the value of the current state $s^k_t$ by the optimal Bellman operator. It then acts greedily w.r.t. the current value at the next time step, $\bar V^{k-1}_{t+1}$. Finally, the next state, $s^k_{t+1}$, is sampled either from the model or from the real world. When the model is exact, there is no difference between sampling from the model and from the real world, but the two differ when the model is inexact, as in Section 6.1.

The following high-probability bound on the regret of a Decreasing Bounded Process (DBP), proved in [15], plays a key role in our analysis of exact and approximate RTDP with lookahead policies in Sections 5 and 6. An adapted process $\{X_k, \mathcal{F}_k\}_{k \ge 0}$ is a DBP if, for all $k \ge 0$, (i) $X_k \le X_{k-1}$ almost surely (a.s.), (ii) $X_k \ge C_2$, and (iii) $X_0 = C_1 \ge C_2$. Interestingly, contrary to standard regret bounds (e.g., in bandits), this bound does not depend on the number of rounds $K$.

Theorem 1 (Regret Bound of a DBP [15]). Let $\{X_k, \mathcal{F}_k\}_{k \ge 0}$ be a DBP and $R_K = \sum_{k=1}^{K} X_{k-1} - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}]$ be its K-round regret. Then, $\Pr\{\exists K > 0 : R_K \ge 9(C_1 - C_2)\ln(3/\delta)\} \le \delta$.
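As a quick numerical illustration of Theorem 1 (our own, not from the paper), the snippet below simulates a simple decreasing bounded process with $C_1 = 10$ and $C_2 = 0$, for which the conditional expectation is available in closed form, and shows that its regret $R_K$ saturates at roughly $C_1 - C_2$ regardless of the number of rounds $K$.

```python
import numpy as np

def dbp_regret(K, q=0.1, C1=10, seed=0):
    """Simulate a decreasing bounded process X_k and its K-round regret R_K.

    X_0 = C1 and X_k = X_{k-1} - B_k with B_k ~ Bernoulli(q) while X_{k-1} > 0,
    so X_k is non-increasing and bounded below by C2 = 0.  For this process
    E[X_k | F_{k-1}] = X_{k-1} - q * 1{X_{k-1} > 0}, which gives R_K in closed form.
    """
    rng = np.random.default_rng(seed)
    x, regret = C1, 0.0
    for _ in range(K):
        if x > 0:
            regret += q                      # X_{k-1} - E[X_k | F_{k-1}]
            x -= rng.binomial(1, q)          # realized decrease
    return regret

if __name__ == "__main__":
    for K in [10, 100, 1_000, 100_000]:
        # R_K saturates around C1 - C2 = 10, independently of K (cf. Theorem 1)
        print(K, round(dbp_regret(K), 2))
```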
5 RTDP with Lookahead Policies

In this section, we devise and analyze a lookahead-based RTDP algorithm, called h-RTDP, whose pseudo-code is shown in Algorithm 2. Without loss of generality, we assume that $H/h \in \mathbb{N}$. We divide the horizon $H$ into $H/h$ intervals, each of length $h$ time steps. h-RTDP stores $HS/h$ values in memory, i.e., the values at the time steps $\mathcal{H} = \{1, h+1, \dots, H+1\}$ (in fact, h-RTDP does not need to store $\bar V_1$ and $\bar V_{H+1}$; they are only used in the analysis). For each time step $t \in [H]$, we denote by $h_c \in \mathcal{H}$ the next time step for which a value is stored in memory, and by $t_c = h_c - t$ the number of time steps until then (see Figure 1). At each time step $t$ of an episode $k \in [K]$, given the current state $s^k_t$, h-RTDP selects an action $a^k_t$ returned by the $t_c$-lookahead policy w.r.t. $\bar V^{k-1}_{h_c}$,

$$a^k_t = a_{t_c}(s^k_t) \in \arg\max_{\pi_1(s^k_t)} \max_{\pi_2,\dots,\pi_{t_c}} \mathbb{E}\Big[\sum_{t'=1}^{t_c} r(s_{t'}, \pi_{t'}(s_{t'})) + \bar V^{k-1}_{h_c}(s_{t_c+1}) \,\Big|\, s_1 = s^k_t\Big]. \quad (4)$$

[Figure 1: Varying lookahead horizon of the h-greedy policy in h-RTDP (see Eq. 4) with h = 3 and H = 6. The blue arrows show the lookahead horizon from a specific time step t, and the red bars mark the time steps for which a value is stored in memory, i.e., $\mathcal{H} = \{1, h+1 = 4, 2h+1 = H+1 = 7\}$.]

Thus, h-RTDP uses a varying lookahead horizon $t_c$ that depends on how far the current time step is from the next one for which a value is stored. Throughout the paper, with an abuse of notation, we refer to this policy as an h-lookahead policy. Finally, it can be seen that h-RTDP generalizes RTDP, as the two coincide for $h = 1$.

Algorithm 1 Real-Time DP (RTDP)
  init: $\forall s \in \mathcal{S}, \forall t \in \{0\} \cup [H]$: $\bar V^0_{t+1}(s) = H - t$
  for $k \in [K]$ do
    Initialize $s^k_1$ arbitrarily
    for $t \in [H]$ do
      $\bar V^k_t(s^k_t) = T \bar V^{k-1}_{t+1}(s^k_t)$
      $a^k_t \in \arg\max_a r(s^k_t, a) + p(\cdot|s^k_t, a)\,\bar V^{k-1}_{t+1}$
      Act by $a^k_t$, observe $s^k_{t+1} \sim p(\cdot \mid s^k_t, a^k_t)$
    end for
  end for

Algorithm 2 RTDP with Lookahead (h-RTDP)
  init: $\forall s \in \mathcal{S}, \forall n \in \{0\} \cup [H/h]$: $\bar V^0_{nh+1}(s) = H - nh$
  for $k \in [K]$ do
    Initialize $s^k_1$ arbitrarily
    for $t \in [H]$ do
      if $(t-1) \bmod h = 0$ then
        $\bar V^k_t(s^k_t) = T^h \bar V^{k-1}_{h_c}(s^k_t)$
      end if
      $a^k_t \in \arg\max_a r(s^k_t, a) + p(\cdot|s^k_t, a)\, T^{h_c - t - 1} \bar V^{k-1}_{h_c}$
      Act by $a^k_t$, observe $s^k_{t+1} \sim p(\cdot \mid s^k_t, a^k_t)$
    end for
  end for

We are now ready to establish finite-sample performance guarantees for h-RTDP; see Appendix 10 for the detailed proofs. We start with two lemmas from which we derive the main convergence result of this section.

Lemma 2. For all $s \in \mathcal{S}$, $n \in \{0\} \cup [H/h]$, and $k \in [K]$, the value function of h-RTDP is (i) Optimistic: $V^*_{nh+1}(s) \le \bar V^k_{nh+1}(s)$, and (ii) Non-increasing: $\bar V^k_{nh+1}(s) \le \bar V^{k-1}_{nh+1}(s)$.

Lemma 3 (Optimality Gap and Expected Decrease). The expected cumulative value update at the $k$-th episode of h-RTDP satisfies $\bar V^k_1(s^k_1) - V^{\pi_k}_1(s^k_1) = \sum_{n=1}^{H/h-1} \sum_{s \in \mathcal{S}} \bar V^{k-1}_{nh+1}(s) - \mathbb{E}[\bar V^k_{nh+1}(s) \mid \mathcal{F}_{k-1}]$.

Properties (i) and (ii) in Lemma 2 show that $\{\bar V^k_{nh+1}(s)\}_{k\ge0}$ is a DBP, for any $s$ and $n$. Lemma 3 relates $\bar V^k_1(s^k_1) - V^{\pi_k}_1(s^k_1)$ (LHS) to the expected decrease in $\bar V^k$ at the $k$-th episode (RHS). When the LHS is small, then $\bar V^k_1(s^k_1) \approx V^*_1(s^k_1)$, due to the optimism of $\bar V^k_1$, and h-RTDP is about to converge to the optimal value. This is why we refer to the LHS as the optimality gap. Using these two lemmas and the regret bound of a DBP (Theorem 1), we prove a finite-sample convergence result for h-RTDP (see Appendix 10 for the full proof).

Theorem 4 (Performance of h-RTDP). Let $\epsilon, \delta > 0$. The following holds for h-RTDP:
1. With probability $1 - \delta$, for all $K > 0$, $\mathrm{Regret}(K) \le \frac{9SH(H-h)\ln(3/\delta)}{h}$.
2. $\Pr\big\{\exists \epsilon > 0 : N_\epsilon \ge \frac{9SH(H-h)\ln(3/\delta)}{h\epsilon}\big\} \le \delta$.

Proof Sketch. Applying Lemmas 2 and 3, we may write

$$\sum_{k=1}^{K} \bar V^k_1(s^k_1) - V^{\pi_k}_1(s^k_1) = \sum_{k=1}^{K} \sum_{n=1}^{H/h-1} \sum_{s} \bar V^{k-1}_{nh+1}(s) - \mathbb{E}[\bar V^k_{nh+1}(s) \mid \mathcal{F}_{k-1}] = \sum_{k=1}^{K} X_{k-1} - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}], \quad (5)$$

where we define $X_k := \sum_{n=1}^{H/h-1} \sum_{s} \bar V^k_{nh+1}(s)$ and use linearity of expectation.
By Lemma 2, $\{X_k\}_{k\ge0}$ is decreasing and bounded from below by $\sum_{n=1}^{H/h-1} \sum_{s} V^*_{nh+1}(s) \ge 0$. We conclude the proof by observing that $X_0 \le \sum_{n=1}^{H/h-1} \sum_{s} \bar V^0_{nh+1}(s) \le SH(H-h)/h$, and applying Theorem 1.

Remark 1 (RTDP and Good Value Initialization). A closer look into the proof of Theorem 4 shows that we can easily obtain a stronger result, which depends on the initial value $\bar V^0$. The regret can be bounded by $\mathrm{Regret}(K) \le \tilde O\big(\sum_{n=1}^{H/h-1} \sum_{s} \bar V^0_{nh+1}(s) - V^*_{nh+1}(s)\big)$, which formalizes the intuition that the algorithm improves as the initial value $\bar V^0$ better estimates $V^*$. For clarity, we provide the worst-case bound.

Remark 2 (Computational Complexity of h-RTDP). Using FB-DP (Section 3) as a solver of the h-lookahead policy, the per-episode computation cost of h-RTDP amounts to applying FB-DP for $H$ time steps, i.e., it is bounded by $O(HNAS^{\mathrm{Tot}}_h)$. Since $S^{\mathrm{Tot}}_h$, the total number of reachable states in up to $h$ time steps, is an increasing function of $h$, the computation cost of h-RTDP increases with $h$, as expected. When $S^{\mathrm{Tot}}_h$ is significantly smaller than $S$, the per-episode computational complexity of h-RTDP is S-independent. As discussed in Section 3, using FB-DP in place of exhaustive search can significantly improve the computational cost of h-RTDP.

Remark 3 (Improved Sample Complexity of h-RTDP). Theorem 4 shows that h-RTDP improves the sample complexity of RTDP by a factor of 1/h. This is consistent with the intuition that a larger horizon of the applied lookahead policy results in faster convergence (fewer samples). Thus, if RTDP is used in a real-time manner, one way to boost its performance is to combine it with lookahead policies.

Remark 4 (Sparse Sampling Approaches). In this work, we assume h-RTDP has access to an h-lookahead policy solver (Eq. 3), such as the FB-DP presented in Section 3. We leave studying sparse sampling approaches [19, 28] for approximately solving the h-lookahead policy for future work.

6 Approximate RTDP with Lookahead Policies

In this section, we consider three approximate versions of h-RTDP in which the update deviates from its exact form described in Section 5. We consider the cases in which there are errors in (1) the model and (2) the value updates, and the case in which (3) approximate state abstraction is used. We prove finite-sample bounds on the performance of h-RTDP in the presence of these approximations. Furthermore, in Section 6.3, given access to an approximate state abstraction, we show that the convergence of h-RTDP depends on the cardinality of the abstract state space, which can be much smaller than that of the original one. The proofs of this section generalize that of Theorem 4, while following the same "recipe". This shows the generality of the proof technique, as it works for both exact and approximate settings.

6.1 h-RTDP with Approximate Model (h-RTDP-AM)

In this section, we analyze a more practical scenario in which the transition model used by h-RTDP to act and update the values is not exact. We assume it is close to the true model in the total variation (TV) norm, i.e., $\forall (s,a) \in \mathcal{S} \times \mathcal{A}$, $\|p(\cdot|s,a) - \hat p(\cdot|s,a)\|_1 \le \epsilon_P$, where $\hat p$ denotes the approximate model. Throughout this section and the relevant appendix (Appendix 11), we denote by $\hat T$ and $\hat V^*$ the optimal Bellman operator and optimal value of the approximate model $\hat p$, respectively. Note that $\hat T$ and $\hat V^*$ satisfy (1) and (2) with $p$ replaced by $\hat p$. h-RTDP-AM is exactly the same as h-RTDP (Algorithm 2) with the model $p$ and optimal Bellman operator $T$ replaced by their approximations $\hat p$ and $\hat T$. We report the pseudocode of h-RTDP-AM in Appendix 11.
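To make the substitution $p \to \hat p$ concrete, here is a minimal tabular sketch (ours, with assumed names such as `h_rtdp_episode` and `q_lookahead`) of one h-RTDP episode in the spirit of Algorithm 2. Passing an approximate kernel `p_plan` = $\hat p$ while sampling transitions from the true kernel `p_env` = $p$ yields the h-RTDP-AM behavior described above; for simplicity, the lookahead is solved by full-width backups over the whole state space rather than by the FB-DP solver of Section 3.

```python
import numpy as np

def q_lookahead(r, p, v_hc, s, steps):
    """Return the Q-vector of a `steps`-step lookahead at s w.r.t. terminal value v_hc."""
    v = v_hc
    for _ in range(steps - 1):                    # T^{steps-1} v_hc, full-width backups
        v = (r + p @ v).max(axis=1)
    return r[s] + p[s] @ v                        # final backup at the queried state

def h_rtdp_episode(r, p_plan, p_env, v_bar, h, H, s1, rng):
    """One episode of h-RTDP (Algorithm 2); p_plan = p_env gives exact h-RTDP,
    while an approximate p_plan gives h-RTDP-AM.

    v_bar[n] is the stored value at time nh+1, n = 0..H/h, initialized to H - nh.
    """
    s = s1
    for t in range(1, H + 1):
        n_next = (t - 1) // h + 1                 # index of the next stored value
        hc = n_next * h + 1
        tc = hc - t                               # varying lookahead horizon
        if (t - 1) % h == 0:                      # value update at stored time steps
            v_bar[n_next - 1][s] = q_lookahead(r, p_plan, v_bar[n_next], s, h).max()
        a = int(np.argmax(q_lookahead(r, p_plan, v_bar[n_next], s, tc)))
        s = rng.choice(p_env.shape[2], p=p_env[s, a])   # act in the *true* environment
    return v_bar

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, h, H, K = 6, 2, 2, 6, 50
    r = rng.uniform(size=(S, A))
    p = rng.dirichlet(np.ones(S), size=(S, A))
    v_bar = [np.full(S, float(H - n * h)) for n in range(H // h + 1)]   # optimistic init
    for _ in range(K):
        v_bar = h_rtdp_episode(r, p, p, v_bar, h, H, s1=0, rng=rng)
    print(np.round(v_bar[0], 2))
```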
Although we are given an approximate model, $\hat p$, we are still interested in the performance of (approximate) h-RTDP on the true MDP, $p$, and relative to its optimal value, $V^*$. If we solve the approximate model and act by its optimal policy, the Simulation Lemma [20, 30] implies that the regret is bounded by $O(H^2 \epsilon_P K)$. For h-RTDP-AM, the situation is more involved, as its updates are based on the approximate model while the samples are gathered by interacting with the true MDP. Nevertheless, by properly adjusting the techniques from Section 5, we derive performance bounds for h-RTDP-AM. These bounds reveal that the asymptotic regret increases by at most $O(H^2 \epsilon_P K)$, similarly to the regret of the optimal policy of the approximate model. Interestingly, the proof technique follows that of the exact case in Theorem 4. We generalize Lemmas 2 and 3 from Section 5 to the case where the update rule uses an inexact model (see Lemmas 9 and 10 in Appendix 11). This allows us to establish the following performance bound for h-RTDP-AM (proof in Appendix 11).

Theorem 5 (Performance of h-RTDP-AM). Let $\epsilon, \delta > 0$. The following holds for h-RTDP-AM:
1. With probability $1 - \delta$, for all $K > 0$, $\mathrm{Regret}(K) \le \frac{9SH(H-h)\ln(3/\delta)}{h} + H(H-1)\epsilon_P K$.
2. Let $\Delta_P = H(H-1)\epsilon_P$. Then, $\Pr\big\{\exists \epsilon > 0 : N^{\Delta_P}_\epsilon \ge \frac{9SH(H-h)\ln(3/\delta)}{h\epsilon}\big\} \le \delta$.

These bounds show the approximate convergence that results from the approximate model. However, the asymptotic performance gaps of h-RTDP-AM, both in terms of regret and Uniform-PAC, approach those experienced by an optimal policy of the approximate model. Interestingly, although h-RTDP-AM updates using the approximate model while interacting with the true MDP, its convergence rate (to the asymptotic performance) is similar to that of h-RTDP (Theorem 4).

6.2 h-RTDP with Approximate Value Updates (h-RTDP-AV)

Another important question in the analysis of approximate DP algorithms is their performance under approximate value updates, motivated by the need to use function approximation. This is often modeled by an extra noise term, $\epsilon_V(s)$ with $|\epsilon_V(s)| \le \epsilon_V$, added to the update rule [4]. Following this approach, we study such a perturbation in h-RTDP. Specifically, in h-RTDP-AV the value update rule is modified so that it contains an error term (cf. Algorithm 2), $\bar V^k_t(s^k_t) = \epsilon_V(s^k_t) + T^h \bar V^{k-1}_{h_c}(s^k_t)$. For $\epsilon_V(s^k_t) = 0$, exact h-RTDP is recovered. The pseudocode of h-RTDP-AV is supplied in Appendix 12. Similarly to the previous section, we follow the same proof technique as for Theorem 4 to establish the following performance bound for h-RTDP-AV (proof in Appendix 12).

Theorem 6 (Performance of h-RTDP-AV). Let $\epsilon, \delta > 0$. The following holds for h-RTDP-AV:
1. With probability $1 - \delta$, for all $K > 0$, $\mathrm{Regret}(K) \le \frac{9SH(H-h)(1 + H\epsilon_V/h)\ln(3/\delta)}{h} + \frac{2H\epsilon_V K}{h}$.
2. Let $\Delta_V = 2H\epsilon_V$. Then, $\Pr\big\{\exists \epsilon > 0 : N^{\Delta_V/h}_\epsilon \ge \frac{9SH(H-h)(1 + H\epsilon_V/h)\ln(3/\delta)}{h\epsilon}\big\} \le \delta$.

As in Section 6.1, the results of Theorem 6 exhibit an asymptotic linear regret $O(H\epsilon_V K/h)$. As proven in Proposition 20 in Appendix 15, such a performance gap exists in ADP with approximate value updates. Furthermore, the convergence rate in $S$ to the asymptotic performance of h-RTDP-AV is similar to that of its exact version (Theorem 4). Unlike in h-RTDP-AM, the asymptotic performance of h-RTDP-AV improves with $h$. This quantifies a clear benefit of using lookahead policies in online planning when the value function is approximate.
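Continuing the sketch given at the end of Section 6.1, the only change h-RTDP-AV requires is to perturb the stored-value update by a bounded error term; the snippet below (ours, purely illustrative) emulates such an error of magnitude at most `eps_v`.

```python
import numpy as np

def av_value_update(q_h, rng, eps_v=0.05):
    """h-RTDP-AV update: the exact backup max_a (T^h V)(s) plus a bounded error term.

    q_h is an h-step lookahead Q-vector at the visited state (cf. q_lookahead in the
    earlier sketch); the noise emulates |eps_V(s)| <= eps_v, e.g., coming from function
    approximation.  With eps_v = 0 the exact h-RTDP update is recovered.
    """
    return q_h.max() + rng.uniform(-eps_v, eps_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(av_value_update(np.array([0.3, 0.7, 0.5]), rng))
```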
6.3 h-RTDP with Approximate State Abstraction (h-RTDP-AA)

We conclude the analysis of approximate h-RTDP by exploring the advantages of combining it with approximate state abstraction [1]. The central result of this section establishes that, given an approximate state abstraction, h-RTDP converges with sample, computation, and space complexity independent of the size of the state space $S$, as long as $S^{\mathrm{Tot}}_h$ is smaller than $S$ (i.e., when computing the h-lookahead is S-independent; see Remark 2). This is in contrast to the computational complexity of ADP in this setting, which is still $O(HSA)$ (see Appendix 15.3 for further discussion). State abstraction has been widely investigated in approximate planning [12, 11, 16, 1] as a means to deal with large state-space problems. Among existing approximate abstraction settings, we focus on the following one. For any $n \in \{0\} \cup [H/h - 1]$, we define $\phi_{nh+1} : \mathcal{S} \to \mathcal{S}_\phi$ to be a mapping from the state space $\mathcal{S}$ to a reduced space $\mathcal{S}_\phi$, with $S_\phi = |\mathcal{S}_\phi| \ll S$. We make the following assumption:

Assumption 1 (Approximate Abstraction, [23], Definition 3.3). For any $s, s' \in \mathcal{S}$ and $n \in \{0\} \cup [H/h - 1]$ for which $\phi_{nh+1}(s) = \phi_{nh+1}(s')$, we have $|V^*_{nh+1}(s) - V^*_{nh+1}(s')| \le \epsilon_A$.

Let us denote by $\{\bar V^k_{\phi,nh+1}\}_{n=0}^{H/h}$ the values stored in memory by h-RTDP-AA at the $k$-th episode. Unlike in previous sections, the value function per time step contains $S_\phi$ entries, $\bar V^k_{\phi,nh+1} \in \mathbb{R}^{S_\phi}$. Note that if $\epsilon_A = 0$, then the optimal value function can be represented in the reduced state space $\mathcal{S}_\phi$. However, if $\epsilon_A$ is positive, an exact representation of $V^*$ is not possible. Nevertheless, the asymptotic performance of h-RTDP-AA will be close, up to an error of $\epsilon_A$, to that of the optimal policy. Furthermore, the definitions of the multi-step Bellman operator (2) and the h-greedy policy (3) should be revised and, with some abuse of notation, are defined as

$$a^k_t \in \arg\max_{\pi_0(s^k_t)} \max_{\pi_1,\dots,\pi_{t_c-1}} \mathbb{E}\Big[\sum_{t'=0}^{t_c-1} r_{t'} + \bar V^{k-1}_{\phi,h_c}(\phi_{h_c}(s_{t_c})) \,\Big|\, s_0 = s^k_t\Big], \quad (6)$$

$$T^h_\phi \bar V^{k-1}_{\phi,h_c}(s^k_t) := \max_{\pi_0,\dots,\pi_{h-1}} \mathbb{E}\Big[\sum_{t'=0}^{h-1} r_{t'} + \bar V^{k-1}_{\phi,t+h}(\phi_{t+h}(s_h)) \,\Big|\, s_0 = s^k_t\Big]. \quad (7)$$

Eqs. (6) and (7) indicate that, similarly to (3), the h-lookahead policy uses the given model to plan for $h$ time steps ahead. Differently from (3), the value after $h$ time steps is the one defined in the reduced state space $\mathcal{S}_\phi$. Note that the definition of the h-greedy policy for h-RTDP-AA in (6) is equivalent to the one used in Algorithm 8, obtained by a similar recursion as for the optimal Bellman operator (2). h-RTDP-AA modifies both the value update and the calculation of the h-lookahead policy (the value update and action choice in Algorithm 2). The h-lookahead policy is replaced by the h-lookahead defined in (6). The value update is substituted by (7), i.e., $\bar V^k_{\phi,t}(\phi_t(s^k_t)) = T^h_\phi \bar V^{k-1}_{\phi,h_c}(s^k_t)$. The full pseudocode of h-RTDP-AA is supplied in Appendix 13. By a similar technique as in the proof of Theorem 4, we establish the following performance guarantees for h-RTDP-AA (proof in Appendix 13).

| Setting | h-RTDP Regret (this work) | ADP Regret [4] | UCT |
|---|---|---|---|
| Exact (Sec. 5) | $\tilde O\big(SH(H-h)/h\big)$ | 0 | $\Omega(\exp(\exp(H)))$ [9] |
| App. Model (Sec. 6.1) | $\tilde O\big(SH(H-h)/h + \Delta_P K\big)$ | $\Delta_P K$ | N.A. |
| App. Value (Sec. 6.2) | $\tilde O\big(SH(H-h)\,g^\epsilon_{H/h}/h + \Delta_V K/h\big)$ | $\Delta_V K/h$ | N.A. |
| App. Abstraction (Sec. 6.3) | $\tilde O\big(S_\phi H(H-h)/h + \Delta_A K/h\big)$ | $\Delta_A K/h$ | N.A. |

Table 1: The lookahead horizon is $h$ and the horizon of the MDP is $H$. We denote $g^\epsilon_{H/h} = 1 + H\epsilon_V/h$, $\Delta_P = H(H-1)\epsilon_P$, $\Delta_V = 2H\epsilon_V$, and $\Delta_A = H\epsilon_A$. The table summarizes the regret bounds of the h-RTDP settings studied in this work and compares them to those of their corresponding ADP approaches. The performance of ADP is based on standard analysis, supplied in Propositions 19, 20, and 21 in Appendix 15.

Theorem 7 (Performance of h-RTDP-AA). Let $\epsilon, \delta > 0$. The following holds for h-RTDP-AA:
1. With probability $1 - \delta$, for all $K > 0$, $\mathrm{Regret}(K) \le \frac{9 S_\phi H(H-h)\ln(3/\delta)}{h} + \frac{H\epsilon_A K}{h}$.
2. Let $\Delta_A = H\epsilon_A$.
Then, $\Pr\big\{\exists \epsilon > 0 : N^{\Delta_A/h}_\epsilon \ge \frac{9 S_\phi H(H-h)\ln(3/\delta)}{h\epsilon}\big\} \le \delta$.

Theorem 7 establishes S-independent performance bounds that depend on the size of the reduced state space $S_\phi$. The asymptotic regret and Uniform-PAC guarantees are approximate, as the state abstraction is approximate. Furthermore, they improve with the quality of the approximation $\epsilon_A$, i.e., their asymptotic gap relative to the optimal policy is $O(H\epsilon_A/h)$. Moreover, the asymptotic performance of h-RTDP-AA improves as $h$ is increased. Importantly, since the per-episode computational complexity of h-RTDP is independent of $S$ (Section 3), the computation required to reach the approximate solution in h-RTDP-AA is also S-independent. This is in contrast to the computational cost of DP, which depends on $S$ and is $O(SHA)$ (see Appendix 15.3 for further discussion).

7 Discussion and Conclusions

RTDP vs. DP. The results of Sections 5 and 6 established finite-time convergence guarantees for the exact h-RTDP and its three approximations. In the approximate settings, as expected, the regret has a linear term of the form $\Delta K$, where $\Delta$ is linear in the approximation errors $\epsilon_P$, $\epsilon_V$, and $\epsilon_A$; thus, the performance is continuous in these parameters, as we would desire. We refer to $\Delta K$ as the asymptotic regret, since it dominates the regret as $K \to \infty$. A natural way to evaluate the quality of h-RTDP in the approximate settings is to compare its regret to that of its corresponding approximate DP (ADP) algorithm. Table 1 summarizes the regrets of the approximate h-RTDP variants studied in this paper and their corresponding ADPs. ADP calculates approximate values $\{V_{nh+1}\}_{n=0}^{H/h}$ by backward induction. Based on these values, the same h-lookahead policy by which h-RTDP acts is evaluated. In the analysis of ADP, we use standard techniques developed for the discounted case in [4]. From Table 1, we reach the following conclusion: the asymptotic performance (in terms of regret) of approximate h-RTDP is equivalent to that of a corresponding approximate DP algorithm. Furthermore, it is important to note that the asymptotic error decreases with $h$ for the approximate value updates and approximate abstraction settings, for both the RTDP and DP algorithms. In these settings, the error is caused by approximation in the value function. By increasing the lookahead horizon $h$, the algorithm uses fewer such values and relies more on the model, which is assumed to be correct. Thus, the algorithm becomes less affected by the value function approximation.

Conclusions. In this paper, we formulated h-RTDP, a generalization of RTDP that acts by a lookahead policy instead of the 1-step greedy policy used in RTDP. We analyzed the finite-sample performance of h-RTDP in its exact form, as well as in three approximate settings. The results indicate that h-RTDP converges in a very strong sense. Its regret is constant w.r.t. the number of episodes, unlike in, e.g., reinforcement learning, where a lower bound of $\Omega(\sqrt{SAHT})$ exists [2, 18]. Furthermore, the analysis reveals that the sample complexity of h-RTDP improves by increasing the lookahead horizon $h$ (Remark 3). Moreover, the asymptotic performance of h-RTDP was shown to be equivalent to that of ADP (Table 1), which, under no further assumptions on the approximation error, is the best we can hope for.
We believe this work opens interesting research venues, such as studying alternatives to the solution of the h-greedy policy (see Appendix 9), studying a receding-horizon extension of RTDP, RTDP with function approximation, and formulating a Thompson-Sampling version of RTDP, as standard RTDP is an optimistic algorithm. As the analysis developed in this work was shown to be quite generic, we hope that it can assist in answering some of these questions. On the experimental side, more needs to be understood, especially comparing RTDP with MCTS and studying how RTDP can be combined with deep neural networks as the value function approximator.

8 Broader Impact

Online planning algorithms, such as A* and RTDP, have been extensively studied and applied in AI for well over two decades. Our work quantifies the benefits of using lookahead policies in this class of algorithms. Although lookahead policies have been widely used in online planning algorithms, their theoretical justification was lacking. Our study sheds light on the benefits of lookahead policies. Moreover, the results we provide in this paper suggest improved ways of applying lookahead policies in online planning, with benefits when dealing with various types of approximations. This work opens the door for practitioners to improve their algorithms and base lookahead policies on solid theoretical ground.

Acknowledgements. We thank the reviewers for their helpful comments and feedback. Y.E. and S.M. were partially supported by the ISF under contract 2199/20.

References

[1] David Abel, D. Hershkowitz, and Michael Littman. Near optimal behavior via approximate state abstraction. In Proceedings of the 33rd International Conference on Machine Learning, pages 2915–2923, 2016.
[2] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.
[3] Andrew Barto, Steven Bradtke, and Satinder Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.
[4] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] Blai Bonet and Hector Geffner. Planning with incomplete information as heuristic search in belief space. In Proceedings of the Fifth International Conference on Artificial Intelligence Planning Systems, pages 52–61. AAAI Press, 2000.
[6] Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21, 2003.
[7] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[8] Vadim Bulitko and Greg Lee. Learning in real-time search: A unifying framework. Journal of Artificial Intelligence Research, 25:119–157, 2006.
[9] Pierre-Arnaud Coquelin and Rémi Munos. Bandit algorithms for tree search. arXiv preprint cs/0703062, 2007.
[10] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
[11] Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence, pages 124–131, 1997.
[12] Richard Dearden and Craig Boutilier. Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1-2):219–283, 1997.
[13] Y. Efroni, G. Dalal, B. Scherrer, and S. Mannor. How to combine tree-search methods in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3494–3501, 2019.
[14] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pages 5238–5247, 2018.
[15] Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, pages 12203–12213, 2019.
[16] Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pages 581–594, 2003.
[17] Matthieu Geist and Olivier Pietquin. Algorithmic survey of parametric value function approximation. IEEE Transactions on Neural Networks and Learning Systems, 24(6):845–867, 2013.
[18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
[19] Michael Kearns, Yishay Mansour, and Andrew Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
[20] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
[21] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293, 2006.
[22] Andrey Kolobov, Daniel S. Weld, et al. LRTDP versus UCT for online probabilistic planning. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[23] L. Li, T. Walsh, and M. Littman. Towards a unified theory of state abstraction for MDPs. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, pages 531–539, 2006.
[24] Brendan McMahan, Maxim Likhachev, and Geoffrey Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proceedings of the 22nd International Conference on Machine Learning, pages 569–576. ACM, 2005.
[25] Rémi Munos. Performance bounds in L_p-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.
[26] Rémi Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1–129, 2014.
[27] Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, and Matthieu Geist. Approximate modified policy iteration. In Proceedings of the 29th International Conference on Machine Learning, pages 1207–1214, 2012.
[28] Aaron Sidford, Mengdi Wang, Xian Wu, and Yinyu Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 770–787, 2018.
[29] A. Strehl, L. Li, and M. Littman. PAC reinforcement learning bounds for RTDP and rand-RTDP. In Proceedings of the AAAI Workshop on Learning for Search, 2006.
[30] Alexander Strehl, Lihong Li, and Michael Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.