# Planning and Learning with Adaptive Lookahead

Aviv Rosenberg^1*, Assaf Hallak^2, Shie Mannor^{2,3}, Gal Chechik^{2,4}, Gal Dalal^2
^1 Amazon Science, ^2 Nvidia Research, ^3 Technion, ^4 Bar-Ilan University
avivros@amazon.com
*Research conducted while the author was an intern at Nvidia Research.

Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.

1 Introduction

The celebrated Policy Iteration (PI) and Value Iteration (VI) (Sutton and Barto 2018) algorithms are the basis for most state-of-the-art reinforcement learning (RL) algorithms. Since both PI and VI are based on a one-step greedy approach for policy improvement, so are the most commonly used policy-gradient (Schulman et al. 2017; Haarnoja et al. 2018) and Q-learning (Mnih et al. 2013; Hessel et al. 2018) based approaches. In each iteration, they perform an improvement of their current policy by looking one step forward and acting greedily. While this is the simplest and most common paradigm, stronger performance was recently achieved using multi-step lookahead. Notably, in AlphaGo (Silver et al. 2018) and MuZero (Schrittwieser et al. 2020), the multi-step lookahead is implemented via Monte Carlo Tree Search (MCTS) (Browne et al. 2012). In MCTS, the search depth is not chosen adaptively but gradually increases with the aggregation of state visitations.

| Algorithm | #Iterations | Iteration complexity |
|---|---|---|
| PI (Scherrer 2016) | $1$ | $c(1)$ |
| h-PI (Efroni et al. 2018) | $\frac{1}{H}$ | $c(H)$ |
| TLPI (this paper) | $\frac{1}{h(\kappa)-1}$ | $c(1) + \theta(\kappa)\,c(h(\kappa))$ |
| QLPI (this paper) | $\frac{\log \gamma}{\log \kappa(\theta)}$ | $O\big(\sum_{h=1}^{H} \theta_h\, c(h)\big)$ |

Table 1: Algorithm comparison summary. The iteration count (number of iterations) is divided by $S(A-1)\log\frac{1}{1-\gamma}\,/\,\log\frac{1}{\gamma}$. The iteration complexity (computational complexity per iteration) is divided by $S$. $c(k)$ is the computational complexity of $k$-step planning. $h(\kappa)$ is the smallest integer $h$ such that $\gamma^{h} \le \kappa$. $\theta(\kappa)$ is the fraction of states that are not $\kappa$-contracted (Definition 4.1). $\theta_1, \dots, \theta_H$ are the sizes of the contraction quantiles.

Several recent works rigorously analyzed the properties of multi-step lookahead in common RL schemes (Efroni et al. 2018; Moerland et al. 2020; Hallak et al. 2021; Sikchi, Zhou, and Held 2022). These and other related works studied a fixed planning horizon chosen in advance. However, both in simulated and real-world environments, a large variety of states benefit differently from various lookahead horizons. A grasping robot far from its target will learn very little from looking a few steps into the future, but if the target is within reach, much more precision and planning are required to grasp the object correctly.
Similarly, at the beginning of a chess game, lookahead grants little information about which move is better, while agents in intricate mid-game situations benefit immensely from considering all future possibilities for the next few moves. Indeed, in this work, we devise a methodology for adaptive selection of the planning horizon in each state and show that it achieves a significant speed-up of the learning process.

We propose two complementing approaches to determine the suitable horizon per state in each PI iteration. To do so, we keep track of the room for improvement of the value function. Our first algorithm, Threshold-based Lookahead PI (TLPI), ensures a desired convergence rate and minimizes the computational complexity of each iteration. Alternatively, our second algorithm, Quantile-based Lookahead PI (QLPI), takes the per-iteration computational complexity as a given budget and aims for the best possible convergence rate. We prove that both TLPI and QLPI converge to the optimum and achieve a significantly lower computational cost than their fixed-horizon alternative (see Table 1).

Next, we devise QL-DQN: a DQN (Mnih et al. 2013) variant of QLPI, where the policy chooses an action by employing an exhaustive tree search (Hallak et al. 2021) looking h steps into the future. The tree depth h is chosen adaptively per state to achieve an overall improved convergence rate at a reduced computational cost. To sustain on-policy consistency while generalizing over the multiple depths, we use a different value network per depth, where the first layers are shared across networks. We test our method on Atari and show it improves upon a fixed-depth tree search.

To summarize, our contributions are the following. First, we propose to use adaptive state-dependent lookahead and devise corresponding algorithms. Our analysis shows they converge with improved computational complexity. Second, we extend our approach to online learning with a DQN variant that uses an exhaustive tree search of adaptive depth and a per-depth value network. Third, we evaluate the proposed methods on maze and Atari environments and show better results than a fixed lookahead horizon.

2 Preliminaries

We consider a discounted MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a finite state space of size $S$, $\mathcal{A}$ is a finite action space of size $A$, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the reward function, $P : \mathcal{S} \times \mathcal{A} \to \Delta_{\mathcal{S}}$ is the transition function, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi : \mathcal{S} \to \mathcal{A}$ be a stationary policy, and let $V^\pi \in \mathbb{R}^{\mathcal{S}}$ be the value function of $\pi$, defined by $V^\pi(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s\big]$. The goal of a planning algorithm is to find the optimal policy $\pi^*$ such that, for every $s \in \mathcal{S}$,
$$V^*(s) = V^{\pi^*}(s) = \max_{\pi : \mathcal{S} \to \mathcal{A}} V^\pi(s).$$
Given a policy $\pi$, let $T^\pi : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ be the Bellman operator:
$$T^\pi[V] = r^\pi + \gamma P^\pi V,$$
where $r^\pi(s) = r(s, \pi(s))$ and $P^\pi(s' \mid s) = P(s' \mid s, \pi(s))$. It is well known that the value of policy $\pi$ is the unique solution to the linear equations $T^\pi[V^\pi] = V^\pi$. Let $T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ be the optimal Bellman operator, defined as
$$T[V](s) = \max_{a \in \mathcal{A}} \Big\{ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V(s') \Big\}.$$
Then, the optimal value $V^*$ is the unique solution to the nonlinear equations $T[V] = V$, and $T$ is a $\gamma$-contraction in the max-norm over the state space:
$$\|V^* - T[V^\pi]\|_\infty \le \gamma \|V^* - V^\pi\|_\infty.$$
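To make these operators concrete, here is a minimal tabular sketch (ours, not from the paper; the array shapes and names are our own) of the optimal Bellman operator $T$, together with a numerical check of its $\gamma$-contraction property on a small random MDP.

```python
import numpy as np

def optimal_bellman(V, P, r, gamma):
    """Apply the optimal Bellman operator T to a value vector V.

    P: transition tensor of shape (S, A, S); r: reward matrix of shape (S, A).
    Returns T[V](s) = max_a { r(s, a) + gamma * sum_s' P(s' | s, a) V(s') }.
    """
    return np.max(r + gamma * P @ V, axis=1)

# Tiny random MDP to illustrate the contraction property numerically.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

# Value iteration: repeated application of T converges to V*.
V_star = np.zeros(S)
for _ in range(1000):
    V_star = optimal_bellman(V_star, P, r, gamma)

# Since T[V*] = V*, the contraction gives ||T[U] - V*||_inf <= gamma * ||U - V*||_inf.
U = rng.random(S)
lhs = np.max(np.abs(optimal_bellman(U, P, r, gamma) - V_star))
rhs = gamma * np.max(np.abs(U - V_star))
assert lhs <= rhs + 1e-9
```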
2.1 PI and h-PI

PI starts from an arbitrary policy $\pi_0$ and performs iterations that consist of: (1) an evaluation step that evaluates the value of the current policy, and (2) an improvement step that performs a 1-step improvement based on the computed value. That is, for $n = 0, 1, 2, \dots$,
$$\pi_{n+1}(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^{\pi_n}(s') \Big\}.$$
By the contraction property of the Bellman operator, one can prove that PI finds the optimal policy after at most $\big(\log\frac{1}{\gamma}\big)^{-1} S(A-1) \log\frac{1}{1-\gamma}$ iterations (Scherrer 2016).

Figure 1: Chain MDP example ($s_0 \to s_1 \to \dots \to s_n$) with deterministic transitions and rewards 0 everywhere except for $r(s_n, u) = 1 - \gamma$. Using a fixed horizon $h$ per state in each PI iteration leads to the same convergence rate as using horizon $\ell$ in only a single state for $\ell = 2, \dots, h$, but at a much higher computational cost.

The PI algorithm can be extended to h-PI by performing h-step improvements (instead of 1-step). Formally, define the Q-function of policy $\pi$ with an h-step lookahead as
$$Q^\pi_h(s, a) = \max_{\pi_t} \mathbb{E}_{s,a}\Big[\sum_{t=0}^{h-1} \gamma^t r(s_t, \pi_t(s_t)) + \gamma^h V^\pi(s_h)\Big],$$
where $\mathbb{E}_{s,a}[\cdot] = \mathbb{E}[\cdot \mid s_0 = s, \pi_0(s) = a]$. Then, the update rule of h-PI is $\pi_{n+1}(s) = \arg\max_{a \in \mathcal{A}} Q^{\pi_n}_h(s, a)$. The operator induced by the h-step lookahead is a $\gamma^h$-contraction, which reduces the bound on the number of iterations until convergence by a factor of $h$ (Efroni et al. 2018); i.e., it is bounded by $\big\lceil \big(h \log\frac{1}{\gamma}\big)^{-1} S(A-1) \log\frac{1}{1-\gamma} \big\rceil$.

Multi-step lookahead guarantees that the number of iterations until convergence is smaller than with a 1-step lookahead, but it comes with a computational cost: computing the h-step improvement may take time exponential in $h$. In tabular MDPs, this can be mitigated with dynamic programming (Efroni, Ghavamzadeh, and Mannor 2020), while in MDPs with large (or infinite) state spaces, MCTS (Browne et al. 2012) or the alternative exhaustive tree search (Hallak et al. 2021) are used in a forward-looking fashion. To compare our algorithms, in the rest of the paper we measure the computational complexity as follows:

Definition 2.1. Let $c(h)$ be the computational cost of performing an h-step improvement in a single state.

For example, in a deterministic full A-ary tree we have $c(h) = O(A^h)$.

3 Motivating Example

To show the potential of our approach, consider the chain MDP example in Figure 1:

Example 3.1 (Chain MDP). Let $\mathcal{M}$ be an MDP with $n + 1$ states $s_0, s_1, \dots, s_n$ and a single sink state $s_{n+1}$. Each of the $n+1$ states transitions to the consecutive state with action $u$, and to the sink state with action $d$. All rewards are 0 except for state $s_n$, in which $u$ yields reward $1 - \gamma$.

Now consider the standard PI algorithm initialized with $\pi_0(s_i) = d$ for all $i \in \{0, \dots, n\}$. Since the reward at the end of the chain needs to propagate backward, in each iteration the value of only a single state is updated. Thus, PI takes exactly $n$ iterations to converge to the optimal policy $\pi^*(s_i) = u$ for all $i$. Instead, with a fixed horizon $h = 2$, the reward propagates through two states in each iteration (instead of one) and therefore convergence takes $n/2$ iterations. Generally, PI with a fixed horizon $h$, i.e., h-PI, converges in $n/h$ iterations.

While h-PI converges faster (in terms of iterations) as $h$ increases, in most states performing an h-step lookahead does not contribute to the speed-up at all. In our example, we can achieve exactly the same convergence rate as 2-PI by using a 2-step lookahead in only a single state in each iteration (and 1-step in all other states). Specifically, we need to pick the state that is exactly 2 steps behind the last updated state in the chain. For general $h$, consider applying an $\ell$-step lookahead in only one state (the one that is $\ell$ steps behind the last updated state in the chain) for $\ell = 2, \dots, h$, and 1-step in the others. This guarantees the same number of iterations until convergence as h-PI, but with much less computation time. Namely, while the per-iteration computational cost of h-PI is $O(n \cdot c(h))$, we can achieve the same convergence rate with just $O\big(n \cdot c(1) + \sum_{\ell=2}^{h} c(\ell)\big)$. In practice, when $n$ is large and $c(h)$ scales exponentially with $h$, this gap can be immense: $O(n \cdot 2^h)$ versus $O(n + 2^h)$.
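To make the iteration counts above concrete, the following self-contained sketch (our own code, not from the paper; the environment follows Example 3.1) runs h-PI on the chain MDP and counts iterations until the policy stops changing. One assumption we make to reproduce the example's behavior is that ties in the greedy step are broken in favor of the current policy; otherwise the initial all-$d$ policy would flip to $u$ everywhere immediately.

```python
import numpy as np

def chain_mdp(n, gamma):
    """Example 3.1: states 0..n form a chain, state n+1 is an absorbing sink.
    Action 0 ('u') advances along the chain, action 1 ('d') falls into the sink.
    All rewards are 0 except r(s_n, u) = 1 - gamma."""
    S = n + 2
    P = np.zeros((S, 2, S))
    r = np.zeros((S, 2))
    for s in range(n + 1):
        P[s, 0, min(s + 1, n + 1)] = 1.0
        P[s, 1, n + 1] = 1.0
    P[n + 1, :, n + 1] = 1.0  # sink is absorbing
    r[n, 0] = 1.0 - gamma
    return P, r

def policy_value(pi, P, r, gamma):
    """Exact policy evaluation: V = (I - gamma * P_pi)^(-1) r_pi."""
    S = len(pi)
    P_pi = P[np.arange(S), pi]
    r_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def h_step_greedy(pi, V, P, r, gamma, h, tol=1e-12):
    """h-step lookahead improvement via backward DP; switch actions only on
    strict improvement so ties keep the current policy."""
    W = V.copy()
    for _ in range(h - 1):
        W = np.max(r + gamma * P @ W, axis=1)
    Q = r + gamma * P @ W
    improve = Q.max(axis=1) > Q[np.arange(len(pi)), pi] + tol
    new_pi = pi.copy()
    new_pi[improve] = np.argmax(Q[improve], axis=1)
    return new_pi

def h_pi_iterations(n=30, gamma=0.98, h=1):
    P, r = chain_mdp(n, gamma)
    pi = np.ones(n + 2, dtype=int)  # pi_0 chooses 'd' everywhere
    for t in range(1, 10_000):
        new_pi = h_step_greedy(pi, policy_value(pi, P, r, gamma), P, r, gamma, h)
        if np.array_equal(new_pi, pi):
            return t - 1
        pi = new_pi

# h-PI needs roughly n / h iterations on the chain.
for h in (1, 2, 3, 5):
    print(f"h = {h}: {h_pi_iterations(h=h)} iterations")
```

Running it shows the iteration count dropping roughly as $n/h$ while the per-iteration cost of the full h-step improvement grows with $h$, which is exactly the trade-off the adaptive schemes below exploit.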
4 Contraction-Based Adaptive Lookahead

In this section, we introduce the concept of dynamically adapting the planning lookahead horizon during runtime, based on the contraction obtained online. In Example 3.1, the h-PI convergence rate can be achieved by using a lookahead larger than 1 in just $h$ states. The question is how to choose these states. In the example, the chosen states are evidently those with the maximal distance between their 1-step improvement and the optimal value, i.e., $\arg\max_{s \in \mathcal{S}} |V^*(s) - T[V^{\pi_t}](s)|$. In this section, we show that this approach also leads to theoretical guarantees on the convergence of the PI algorithm.

To understand how the convergence rate depends on the distance of the 1-step improvement from the approximate optimal value, we delve into the theoretical properties of PI. Since the standard 1-step improvement yields a contraction of $\gamma$ while the h-step improvement gives $\gamma^h$, h-PI converges $h$ times faster than standard PI (Efroni et al. 2018). Importantly, this contraction is with respect to the $L_\infty$ norm; i.e., the states with the worst (largest) contraction coefficient determine the convergence rate of PI. This behavior is the source of weakness of a fixed lookahead. Example 3.1 shows that one state may slow down convergence, but it also hints at an elegant solution: use a larger lookahead in states with a larger contraction coefficient.

We leverage this observation and present two new algorithms: TLPI, which aims to achieve a fixed contraction in all states with a reduced computational cost, and QLPI, which aims to achieve maximal contraction in every iteration within a fixed computational budget. While both algorithms seek to optimize a similar problem, their analyses differ and shed light on the problem from different perspectives: TLPI depends on the contraction factor per state, while QLPI considers only the ordering of the states with respect to their contraction factors.

Algorithm 1: TLPI
1: Input: $\mathcal{S}, \mathcal{A}, r, P, \gamma, \kappa, \beta, \tilde V$.
2: Initialization: arbitrary $\pi_0$, $t \leftarrow 0$.
3: while $\pi_t$ changes do
4:   Evaluation: compute $V^{\pi_t}$, and set $U(s, a) \leftarrow -\infty$.
5:   1-step improvement: $U(s, a) \leftarrow Q^{\pi_t}_1(s, a)$ for all $(s, a)$.
6:   $h(\kappa)$-step improvement: $U(s, a) \leftarrow Q^{\pi_t}_{h(\kappa)}(s, a)$ for every $(s, a)$ such that (here $U(s) = \max_a U(s, a)$):
$$|\tilde V(s) - U(s)| > \kappa \|\tilde V - V^{\pi_t}\|_\infty - \beta. \qquad (1)$$
7:   Set $\pi_{t+1}(s) \leftarrow \arg\max_{a \in \mathcal{A}} U(s, a)$ for every $s \in \mathcal{S}$.
8: end while

Since we do not have access to the optimal value $V^*(s)$, the algorithms rely on warm-starts or an approximation of the optimal value $V^*$, denoted by $\tilde V$. While obtaining a good approximation of the value function is hard, we aim at a simpler task: find an approximation that is informative for allocating the depth resources. The approximated values may be far from optimal, as long as they yield a similar allocation across depths. In many cases, we can obtain such an approximation through, e.g., state aggregation, training agents on similar tasks, or running PI for a small number of iterations. In Sections 5 and 6, we show that these approximation methods are indeed effective in practice.
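The following is a minimal tabular sketch of one TLPI improvement step corresponding to Algorithm 1. It is our own illustrative code, not the authors' implementation: it reuses the `policy_value` helper from the chain-MDP sketch above, and `q_h` and `h_of_kappa` are names we introduce here.

```python
import numpy as np

def h_of_kappa(kappa, gamma):
    """Smallest integer h such that gamma**h <= kappa."""
    return int(np.ceil(np.log(kappa) / np.log(gamma)))

def q_h(V, P, r, gamma, h):
    """Q^{pi}_h(s, a): h-step lookahead values on top of V = V^{pi}, via backward DP."""
    W = V.copy()
    for _ in range(h - 1):
        W = np.max(r + gamma * P @ W, axis=1)
    return r + gamma * P @ W

def tlpi_iteration(pi, P, r, gamma, kappa, beta, V_tilde):
    """One TLPI improvement step (Algorithm 1, lines 4-7), tabular setting."""
    V = policy_value(pi, P, r, gamma)              # evaluation step
    U = q_h(V, P, r, gamma, h=1)                   # 1-step improvement for all (s, a)
    gap = np.abs(V_tilde - U.max(axis=1))          # |V_tilde(s) - U(s)|
    need_deep = gap > kappa * np.max(np.abs(V_tilde - V)) - beta   # Equation (1)
    if need_deep.any():
        U_deep = q_h(V, P, r, gamma, h_of_kappa(kappa, gamma))
        U[need_deep] = U_deep[need_deep]           # h(kappa)-step improvement where needed
    return np.argmax(U, axis=1)                    # pi_{t+1}
```

The cheap 1-step pass touches every state, while the expensive $h(\kappa)$-step pass only touches the states flagged by Equation (1), which is the source of the complexity saving.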
4.1 Threshold-Based Lookahead Policy Iteration

TLPI (Algorithm 1) takes as input the approximated value $\tilde V$, a desired contraction factor $\kappa$, and a correction term $\beta$. We assume that $\|V^* - \tilde V\|_\infty \le \epsilon$. This implies we can measure contraction up to an approximation error that scales with $\epsilon$. The algorithm ensures that in each iteration, the value in every state contracts by at least $\kappa$. This is achieved by first performing a 1-step improvement in all states and then performing an $h(\kappa)$-step improvement in states whose measured contraction is less than $\kappa$, where $h(\kappa)$ is the smallest integer $h$ such that $\gamma^h \le \kappa$. Since we do not have an accurate estimate of the optimal value, we use the correction term $\beta$ to make sure that no states falsely seem to achieve the desired contraction due to the approximation error $\epsilon$ (see Equation (1)).

The following result states that TLPI converges at least as fast as h-PI (Efroni et al. 2018) with $h$ set to $h(\kappa) - 1$, and with improved computational complexity. To measure the trade-off between the contraction factor (which determines the convergence rate) and the computational complexity needed to achieve it, Definition 4.1 presents $\theta(\kappa)$ as the fraction of states in which we perform a large lookahead.

Definition 4.1 (Def. of $\theta(\kappa)$). Let $\{\pi_t\}_{t=1}^T$ be the sequence of policies generated by TLPI with correction term $\beta$ and approximated value $\tilde V$. Let $\kappa \in (0, 1)$, and define
$$X_t = \big\{ s : |\tilde V(s) - T[V^{\pi_t}](s)| \le \kappa \|\tilde V - V^{\pi_t}\|_\infty - \beta \big\}$$
as the set of states that, after the 1-step improvement in iteration $t$, are $\beta$-close to being contracted by $\kappa$ with respect to $\tilde V$. Then, denote by $\theta(\kappa) = \max_{1 \le t \le T} |\mathcal{S} \setminus X_t| / S$ the largest fraction of states with contraction less than $\kappa$, observed along all policy updates.

Theorem 4.2. The TLPI algorithm with approximated value $\tilde V$ and correction term $\beta = \epsilon(\kappa + 1)$ converges in at most $\big\lceil \big((h(\kappa) - 1) \log\frac{1}{\gamma}\big)^{-1} S(A-1) \log\frac{1}{1-\gamma} \big\rceil$ iterations. Moreover, its per-iteration computational complexity is bounded by $S\big(c(1) + \theta(\kappa)\, c(h(\kappa))\big)$.

Proof sketch. The proof of the iteration bound follows Scherrer (2016) while utilizing two key observations. First, the convergence analysis of PI only uses the contraction property of the Bellman operator w.r.t. $V^*$, and not w.r.t. an arbitrary pivot vector. The distance to $V^*$ can be approximated using $\tilde V$, and the approximation error is handled by the correction term $\beta$. Second, by the construction of the algorithm, a contraction of at least $\kappa$ in every state is guaranteed. The computational complexity follows because we perform a 1-step lookahead in all states and an $h(\kappa)$-step lookahead in a $\theta(\kappa)$ fraction of the states, by Definition 4.1. For the complete proof, see Appendix A.1.
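As a small companion to Definition 4.1 and Theorem 4.2, the sketch below (our own code and naming, not from the paper) measures the fraction of states that fail the $\kappa$-contraction test after a 1-step improvement and plugs it into the per-iteration cost bound $S\,(c(1) + \theta(\kappa)\,c(h(\kappa)))$ with $c(h) = A^h$ as in Definition 2.1. The concrete numbers are only illustrative, loosely matching the maze sizes used later.

```python
import numpy as np

def fraction_not_contracted(V_tilde, V_pi, TV_pi, kappa, beta):
    """Fraction of states outside X_t (Definition 4.1): their 1-step improvement
    TV_pi = T[V^{pi_t}] is not kappa-contracted (with margin beta) towards V_tilde."""
    contracted = np.abs(V_tilde - TV_pi) <= kappa * np.max(np.abs(V_tilde - V_pi)) - beta
    return 1.0 - contracted.mean()

def tlpi_cost_per_iteration(S, A, theta_kappa, h_kappa):
    """Per-iteration cost bound of Theorem 4.2 with c(h) = A**h (Definition 2.1)."""
    c = lambda h: A ** h
    return S * (c(1) + theta_kappa * c(h_kappa))

# Illustrative numbers: a 900-state maze with 4 actions where only ~10% of the
# states need the deep h(kappa) = 4 lookahead.
print(tlpi_cost_per_iteration(S=900, A=4, theta_kappa=0.1, h_kappa=4))  # TLPI bound
print(900 * 4 ** 4)                                                     # fixed 4-PI: S * c(4)
```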
To illustrate the merits of TLPI and Theorem 4.2, consider the chain MDP in Example 3.1 where we set $\kappa = \gamma^h$ for some $h \in \mathbb{N}$ (and assume $\tilde V = V^*$ for simplicity). In every iteration, the states not contracted by $\kappa$ after the 1-step improvement are the $h$ states closest to the end of the chain that have not been updated yet (recall that each state reaches the correct optimal value after just one non-idle update). Thus, $\theta(\kappa) = h/S$ and the per-iteration computational complexity is $S \cdot c(1) + h \cdot c(h)$.

4.2 Quantile-Based Lookahead Policy Iteration

QLPI (Algorithm 2) resembles TLPI, but instead of a contraction coefficient $\kappa$, it takes as input a vector of quantiles (budgets) $(\theta_1, \dots, \theta_H) \in [0, 1]^H$ for some predetermined maximal considered lookahead $H$. Instead of the actual distance to the optimal value, QLPI relies only on the ordering of the states in terms of distance from the optimum. This allows for weaker requirements on the approximated value $\tilde V$, as it only needs to preserve the order.

Algorithm 2: QLPI
1: Input: $\mathcal{S}, \mathcal{A}, r, P, \gamma, (\theta_1, \dots, \theta_H), m, \tilde V$.
2: Initialization: arbitrary $\pi_0$, $t \leftarrow 0$.
3: while $\pi_t$ changes do
4:   Evaluation: compute $V^{\pi_t}$, and set $U(s, a) \leftarrow -\infty$.
5:   for $h = 1, 2, \dots, H$ do
6:     Compute $q_h$ as the $(1 - \theta_h - m/S)$ quantile of $\{|\tilde V(s) - \max_a U(s, a)|\}_{s \in \mathcal{S}}$.
7:     $h$-step improvement: $U(s, a) \leftarrow Q^{\pi_t}_h(s, a)$ for every $(s, a)$ s.t. $|\tilde V(s) - \max_{a \in \mathcal{A}} U(s, a)| \ge q_h$.
8:   end for
9:   Set $\pi_{t+1}(s) \leftarrow \arg\max_{a \in \mathcal{A}} U(s, a)$ for every $s \in \mathcal{S}$.
10: end while

Definition 4.3. Let $p_s$ and $\tilde p_s$ be the positions of state $s$ in the orderings of $\{|V^*(s) - V^{\pi_t}(s)|\}_{s \in \mathcal{S}}$ and $\{|\tilde V(s) - V^{\pi_t}(s)|\}_{s \in \mathcal{S}}$, respectively. We define the approximation $\tilde V$ to be m-order-preserving if, for every $s \in \mathcal{S}$, $|p_s - \tilde p_s| \le m$.

State aggregation is an example of an approximation that preserves orders and is available in many domains where states are based on locality (like the maze environment considered in Section 5). Assume that we have access to a state-aggregation scheme that splits the state space into $S/m$ groups of size $m$ such that for every two states $s_1, s_2$ in the same group $|V^*(s_1) - V^*(s_2)| \le \epsilon$, and for any state $s_3$ from a different group $|V^*(s_1) - V^*(s_3)| > 2\epsilon$. Then the optimal value of the aggregated MDP, $V^{\mathrm{agg}}$, is m-order-preserving as long as $|V^{\mathrm{agg}}(s) - V^*(s)| \le \epsilon$ for every $s \in \mathcal{S}$, since the position of a state can be shifted by the aggregation by at most the size of its group, $m$.

QLPI attempts to maximize the contraction in every iteration while using an $\ell$-step lookahead in at most $\theta_\ell S + m$ states. This is achieved by performing an $\ell$-step improvement on the $(\theta_\ell + m/S)$ portion of states that are furthest away from $\tilde V$. The following result is complementary to Theorem 4.2: now, instead of choosing the desired iteration complexity (via $\kappa$ in TLPI), we choose the desired computational complexity per iteration via the budgets $(\theta_1, \dots, \theta_H)$. For the resulting iteration complexity we define the induced contraction factor:

Definition 4.4 (Def. of $\kappa(\theta)$). Let $\{\pi_t\}_{t=1}^T$ be the sequence of policies generated by QLPI. Let $h^\theta_t(s)$ be the largest lookahead applied in state $s$ in iteration $t$ when running QLPI with quantiles $(\theta_1, \dots, \theta_H)$. For a given $\kappa$, define
$$Y_t(\kappa) = \big\{ s : |V^*(s) - T^{h^\theta_t(s)}[V^{\pi_t}](s)| \le \kappa \|V^* - V^{\pi_t}\|_\infty \big\}$$
as the set of states contracted by $\kappa$ in iteration $t$. The induced contraction factor $\kappa(\theta)$ is defined as the minimal $\kappa$ such that $Y_t(\kappa) = \mathcal{S}$ for every $t$.

Though its formal definition may seem complex, $\kappa(\theta)$ is simply the effective contraction obtained by QLPI.

Theorem 4.5. The QLPI algorithm converges in at most $\big\lceil \big(\log\frac{1}{\kappa(\theta)}\big)^{-1} S(A-1) \log\frac{1}{1-\gamma} \big\rceil$ iterations. Moreover, its per-iteration computational complexity is bounded by $S \sum_{h=1}^{H} (\theta_h + m/S)\, c(h)$.

We provide the proof in Appendix A.2; it is based on similar ideas as the proof of Theorem 4.2. To illustrate the merits of QLPI and Theorem 4.5, consider the chain MDP in Example 3.1 where we set $\theta_1 = 1$ and $\theta_2 = \dots = \theta_h = 1/S$ for some $h \in \mathbb{N}$ (again with $\tilde V = V^*$ for simplicity). In every iteration, QLPI first performs a 1-step lookahead in all states, and then, for each $\ell = 2, \dots, h$, it performs an $\ell$-step lookahead in exactly one state: the state that is $\ell$ steps behind the last updated state in the chain. As explained in Section 3, the induced contraction is $\kappa(\theta) = \gamma^h$ and QLPI converges in $n/h$ iterations with an optimal per-iteration complexity of $S \cdot c(1) + \sum_{\ell=2}^{h} c(\ell)$.

Finally, we highlight the complementary nature of the two algorithms: while in TLPI the complexity parameter $\theta(\kappa)$ is governed by the desired contraction coefficient, in QLPI the induced contraction $\kappa(\theta)$ is the outcome of the predetermined computational budget.
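Below is a minimal tabular sketch of one QLPI improvement step corresponding to Algorithm 2. The code and naming are ours, not the authors' implementation; it reuses the `policy_value` and `q_h` helpers from the earlier sketches, and, as a simplification, before any improvement the gap is measured against $V^{\pi_t}$ itself rather than an uninitialized $U$.

```python
import numpy as np

def qlpi_iteration(pi, P, r, gamma, thetas, m, V_tilde):
    """One QLPI improvement step (Algorithm 2, lines 4-9), tabular setting.

    thetas[h-1] is the budget theta_h: roughly the fraction of states allowed an
    h-step improvement, plus slack m/S for the order-preservation error.
    """
    S, A = r.shape
    V = policy_value(pi, P, r, gamma)             # evaluation step
    U = np.tile(V[:, None], (1, A))               # placeholder values before any improvement
    for h, theta_h in enumerate(thetas, start=1):
        gap = np.abs(V_tilde - U.max(axis=1))     # distance of the current estimate to the proxy
        level = float(np.clip(1.0 - theta_h - m / S, 0.0, 1.0))
        deep = gap >= np.quantile(gap, level)     # the ~(theta_h + m/S) furthest states
        if deep.any():
            Qh = q_h(V, P, r, gamma, h)           # h-step lookahead values Q^{pi_t}_h
            U[deep] = Qh[deep]
    return np.argmax(U, axis=1)
```

With $\theta_1 = 1$ the first pass reduces to the standard 1-step improvement in every state; each later pass only refines the states whose values are still furthest from $\tilde V$, which is where the budget is spent.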
Figure 2: Snapshot of our maze environment, a 30x30 grid world. The agent is spawned in the top-left corner (blue) and needs to reach one of four randomly chosen goals (green), while avoiding the trap (red). White pixels denote walls. Upon reaching a goal state, the agent re-appears in a new random location.

5 Maze Experiments

In the first set of experiments, we evaluate our adaptive lookahead algorithms, TLPI and QLPI, on a grid world with walls (Tennenholtz et al. 2022). Specifically, we used a 30x30 grid world that is divided into four rooms with doors between them; see Figure 2. The agent is spawned in the top-left corner (blue) and needs to reach one of four randomly chosen goals (green), where the reward is 1, while avoiding the trap (red) that incurs a reward of -1. There are four deterministic actions (up, down, right, left). Upon reaching a goal, the agent is moved to a random state. We set $\gamma = 0.98$.

Fixed lookahead. We begin by testing the fixed-horizon h-PI with values $h = 1, 2, \dots, 7$. To corroborate that larger lookahead values reduce the number of PI iterations required for convergence, in Figure 3 we show the distance from the solution as a function of the iteration for the different depths. The plot demonstrates the effect of the lookahead in a less pathological example than Example 3.1. In Figure 5, we compare the overall computational complexity, and not only the number of iterations, of the different fixed lookahead values. To measure performance, we count the number of queries to the simulator (environment) until convergence to the optimal value. More efficient lookahead horizons require fewer overall calls to the simulator. Beginning with the fixed-lookahead results in the leftmost plot, we see the trade-off when picking the lookahead. For a lookahead that is too short (1 in this case), convergence requires so many iterations that even the low computational complexity of each iteration does not compensate for the total compute time. Note that h-PI with h = 1 is the standard PI algorithm, which evidently performs worse than the best fixed lookahead, although it is overwhelmingly the most widely used version of PI. On the other extreme of a very large lookahead, each iteration is too computationally expensive, despite the smaller number of iterations.

Figure 3: The distance to the optimum, captured by $\|V^* - V^{\pi_t}\|_\infty$, as a function of the iteration number $t$ for fixed lookahead values $h = 1, 2, \dots, 7$ in the maze environment.

TLPI. To verify our observation that a long lookahead is wasteful in large parts of the state space, we first plot a histogram of the contraction factor along several PI iterations in Figure 4. Here we see that indeed the effective contraction factor $\kappa$ is much smaller than $\gamma$ (i.e., more contractive than a 1-step lookahead) in roughly 90% of the states. Next, we run TLPI with $\kappa = \gamma^2, \gamma^3, \dots, \gamma^7$ and an accurate approximated value $\tilde V = V^*$. The results are given in Figure 5, second plot.
By Theorem 4.2, when setting $\kappa = \gamma^h$, we expect the same number of iterations until convergence as h-PI but with better computational complexity. In fact, the results reveal even stronger behavior: TLPI($\gamma^h$) for all $h = 1, 2, \dots, 7$ achieves computational complexity similar to the best fixed lookahead witnessed in h-PI.

QLPI. In all our experiments we run QLPI with $\theta_1 = 1$ and $\theta_3 = \theta_5 = \theta_6 = \theta_7 = 0$ (again $\tilde V = V^*$). For $(\theta_2, \theta_4, \theta_8)$ we set the following values: (0.3, 0.2, 0.1), (0.2, 0.15, 0.05), (0.2, 0.05, 0.02), and (0.1, 0.05, 0.02), which represent decreasing budgets for depths 2, 4, and 8, respectively. The results are presented in Figure 5, third plot. Again we can see that for all the parameters, QLPI performs as well as the best fixed lookahead. Moreover, notice that for some of the choices of $\theta$ vectors, the performance significantly improves upon the best fixed horizon.

Approximate $V^*$ via state aggregation. We again run QLPI with budget values (0.1, 0.05, 0.02), but replace $V^*$ with an approximation obtained via state aggregation. Namely, we merge squares of k x k states into a single state, solve the smaller aggregated MDP, and use its optimal value as an approximation of $V^*$. We perform this experiment with k = 2, 3, 4, 5 and include the aggregated MDP solution process in the total simulator query count. This way, our final algorithm does not have any prior knowledge of $V^*$. The results are presented in Figure 5, last plot. As expected, the performance is slightly worse than that of the original QLPI that uses the accurate $V^*$, but for all the aggregation choices the algorithm still performs as well as the best fixed lookahead in h-PI.
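To illustrate the state-aggregation approximation used here, the sketch below (our own code; the grouping and wall handling are simplified and hypothetical) merges groups of states of a tabular MDP, solves the smaller MDP, and lifts its optimal value back to the original states as $\tilde V$.

```python
import numpy as np

def aggregate_mdp(P, r, groups, n_groups):
    """Build a smaller MDP by averaging dynamics and rewards within each group.

    P: (S, A, S) transitions, r: (S, A) rewards, groups: (S,) group index per state.
    Returns (P_agg, r_agg) of shapes (G, A, G) and (G, A)."""
    S, A, _ = P.shape
    member = np.zeros((n_groups, S))
    member[groups, np.arange(S)] = 1.0
    weights = member / member.sum(axis=1, keepdims=True)     # uniform within a group
    P_agg = np.einsum('gs,sat,ht->gah', weights, P, member)  # group -> action -> group
    r_agg = weights @ r
    return P_agg, r_agg

def value_iteration(P, r, gamma, iters=2000):
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(r + gamma * P @ V, axis=1)
    return V

def aggregated_value_proxy(P, r, gamma, groups, n_groups):
    """Solve the aggregated MDP and lift its optimal value back as V_tilde."""
    P_agg, r_agg = aggregate_mdp(P, r, groups, n_groups)
    return value_iteration(P_agg, r_agg, gamma)[groups]

# For a 30x30 grid with k x k squares merged, a natural grouping (assuming
# row-major state indexing and 30 divisible by k) is:
#   groups[row * 30 + col] = (row // k) * (30 // k) + (col // k)
```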
Figure 4: Histograms of the fraction of states per effective lookahead along several iterations of PI. The effective lookahead of contraction factor $\kappa$ is defined as $h = \log_\gamma(\kappa)$, i.e., $\gamma^h = \kappa$.

Figure 5: Number of simulator queries until convergence. Lower is better. Results are averaged across 10 runs and error bars represent the standard deviation. QLPI is run with lookaheads 1, 2, 4, and 8, where the quantiles on the x-axis represent $(\theta_2, \theta_4, \theta_8)$.

To summarize, the maze experiments show that adaptive planning lookaheads reach the solution with better sample complexity (i.e., number of simulator queries) than fixed-horizon h-PI. More importantly, our methods are robust to hyperparameter choices: the improved results are obtained uniformly across all tested parameters of TLPI and QLPI. This alleviates the heavy burden of finding the best fixed horizon for a given environment.

6 QL-DQN and Atari Experiments

In this section, we extend our adaptive lookahead algorithm QLPI to neural network function approximation. We present Quantile-based Lookahead DQN (QL-DQN): the first DQN algorithm that uses a state-dependent lookahead chosen dynamically throughout the learning process. QL-DQN is visualized in Figure 7 and operates as follows (a simplified sketch of the depth-selection step is given at the end of this section):

1. We introduce a per-depth Q-function. Technically, we maintain H parallel Q-networks (where H is the maximal tree depth) and use the h-th network to predict the value of a leaf at depth h. To improve generalization and data reuse, the networks share their initial layers (feature extractors).
2. We maintain a replay buffer sorted according to the distance between the current value estimate and the approximated optimal value (for efficiency we utilize a priority queue). For $\tilde V$, we train a standard DQN (depth 0) agent for only 1M steps, a relatively quick process.
3. For a replay buffer of size N, we maintain H quantiles based on the above ordering. The quantile sizes are $\theta_1 N, \dots, \theta_H N$. The values $\theta_i$ are hyper-parameters.
4. Choosing the tree depth: per state during simulation, we start by spanning the first level of the tree. If the distance of the best leaf value from the approximate optimal value falls in the first quantile, we end the tree search. Otherwise, we go one level deeper, compare against the second quantile, and re-iterate with the same logic, possibly until the maximal depth H is reached. Notice that the tree search is feasible in reasonable run-time thanks to highly efficient parallel Atari simulation on GPU (Hallak et al. 2021).
5. Choosing the action given the depth: after spanning the tree as described above, we return the first action (from the root) on the path to the highest-value leaf.
6. We store (state, action, cumulative reward to leaf, leaf state, tree depth) in the replay buffer.
7. For training, we set the bootstrap target to be the cumulative reward from the tree search plus the Q-value of the corresponding tree depth, evaluated at the leaf state.

We note that the per-depth Q-function is crucial in order to keep online consistency and achieve convergence. We found that in practice, the Q-function learned by DQN is not the true cumulative reward of rollouts. Instead, it is some function that minimizes the Bellman error. This phenomenon is orthogonal to our work and was also recently studied in (Fujimoto et al. 2022). In our context of a multiple-depth Q-network, if we bootstrap using the target from one depth for another, this phenomenon causes inconsistencies that lead to divergence. To handle this inconsistency, we introduced the multi-head Q-network for multiple depths and found that it solves the issue. All other parts of the algorithm and the hyper-parameter choices are taken as-is from the original DQN paper (Mnih et al. 2013).

Figure 6: Average and standard deviation of the training reward of QL-DQN (in red) and DQN with fixed tree depths 0, 1, 2, 3 in various Atari environments. The x-axis is time in hours (a plot with a step-based x-axis is given in Appendix B).

Figure 7: The QL-DQN algorithm. When choosing actions, the policy uses a tree depth based on the ranking of $s_0$'s contraction coefficient in the replay buffer: the search stops at level i if $s_0$ falls in the i-th contraction quantile. The per-depth Q-network has a shared basis (shared layers) and depth-specific heads $Q(s_1, a), Q(s_2, a), \dots, Q(s_h, a)$.

We train QL-DQN on several Atari environments (Bellemare et al. 2013). Since our work aims to improve sample complexity over fixed-horizon baselines, our metric of interest here is the reward as a function of training time. Hence, in Figure 6 we present the convergence of QL-DQN versus DQN with fixed depths 0 through 3, as a function of time. The plots show the average score across 5 seeds together with standard deviation values. Note that depth 0 corresponds to standard DQN (the baseline). As seen, QL-DQN achieves better performance on Video Pinball and Tutankham, while on Solaris and Berzerk it is on par with the best fixed lookahead. The conclusion here is again that we obtain an agent that performs better than, or similarly to, one with a pre-determined fixed planning horizon. This comes with the benefit of robustness to the expensive hyper-parameter choice of the best fixed horizon for a given environment.
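To make the depth-selection rule concrete, here is a heavily simplified, framework-agnostic sketch of steps 3-5 above. It is our own illustrative code, not the authors' implementation: `span_tree_level`, `v_tilde`, and `quantile_thresholds` are hypothetical stand-ins for the tree search, the pretrained value proxy, and the quantile thresholds maintained over the sorted replay buffer.

```python
def choose_depth_and_action(root_state, span_tree_level, v_tilde, quantile_thresholds, H):
    """Adaptive tree-depth selection (QL-DQN steps 3-5, simplified).

    span_tree_level(state, h) -> (best_leaf_value, best_root_action): expands the
        exhaustive tree to depth h and scores leaves with the depth-h Q-head.
    v_tilde(state) -> float: approximate optimal value from the pretrained DQN.
    quantile_thresholds[h-1]: distance threshold of the h-th contraction quantile.
    """
    for h in range(1, H + 1):
        best_leaf_value, best_action = span_tree_level(root_state, h)
        gap = abs(v_tilde(root_state) - best_leaf_value)
        # Stop at the first depth whose quantile already "covers" this state.
        if gap <= quantile_thresholds[h - 1] or h == H:
            return h, best_action
```

The transition stored in the buffer then records the chosen depth h, so that training bootstraps from the matching depth-h Q-head (step 7).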
7 Discussion

In this paper we propose the first planning and learning algorithms that dynamically adapt the multi-step lookahead horizon as a function of the state and the current value function estimate. We demonstrate the significant potential of adaptive lookahead both theoretically, by proving convergence with improved computational complexity, and empirically, by demonstrating favorable performance in a maze and in Atari. Our algorithms perform as well as the best fixed horizon in hindsight in almost all the experiments, and in some cases they surpass it. Future work warrants an investigation of whether the best fixed horizon can always be outperformed by an adaptive horizon.

Theoretically, our guarantees rely on prior knowledge of an approximate optimal value, raising the question of whether one can choose lookahead horizons adaptively without any prior knowledge, e.g., using transfer learning based on similarity between domains. Moreover, when the forward model performing the lookahead is inaccurate or learned from data, the adaptive state-dependent lookahead itself may serve as a quantifier for the level of trust in the value function estimate (short lookahead) versus the model (long lookahead). This can offer a way for state-wise regularization of the learning or planning problem. Our work is also related to the growing Sim2Real literature. In particular, one may have several simulators with different computational costs and fidelity levels; the lookahead problem then translates to choosing in which states to use which simulator with which lookahead.

Our focus in this paper was reducing iteration and overall complexity; we thus ignored more intricate details of the forward search itself. Additional practical aspects such as CPU-GPU planning efficiency trade-offs (Hallak et al. 2021) can also affect the lookahead selection problem. One promising direction is to expand at each step only the few most promising nodes, and keep the search width fixed after a certain value. This gives linear complexity in the search depth instead of exponential, at the risk of missing relevant paths.

References

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: 253-279.
Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1): 1-43.
Efroni, Y.; Dalal, G.; Scherrer, B.; and Mannor, S. 2018. Beyond the one-step greedy approach in reinforcement learning. In International Conference on Machine Learning, 1387-1396. PMLR.
Efroni, Y.; Ghavamzadeh, M.; and Mannor, S. 2020. Online planning with lookahead policies. Advances in Neural Information Processing Systems, 33.
Fujimoto, S.; Meger, D.; Precup, D.; Nachum, O.; and Gu, S. S. 2022. Why should I trust you, Bellman? The Bellman error is a poor replacement for value error. arXiv preprint arXiv:2201.12417.
Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 1861-1870. PMLR.
Hallak, A.; Dalal, G.; Dalton, S.; Mannor, S.; and Chechik, G. 2021. Improve agents without retraining: Parallel tree search with off-policy correction. Advances in Neural Information Processing Systems, 34.
Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; and Silver, D. 2018. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moerland, T. M.; Deichler, A.; Baldi, S.; Broekens, J.; and Jonker, C. M. 2020. Think too fast nor too slow: The computational trade-off between planning and reinforcement learning. arXiv preprint arXiv:2005.07404.
Scherrer, B. 2016. Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 41(3): 758-774.
Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839): 604-609.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sikchi, H.; Zhou, W.; and Held, D. 2022. Learning off-policy with online planning. In Conference on Robot Learning, 1622-1633. PMLR.
Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419): 1140-1144.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
Tennenholtz, G.; Hallak, A.; Dalal, G.; Mannor, S.; Chechik, G.; and Shalit, U. 2022. On covariate shift of latent confounders in imitation and reinforcement learning. In International Conference on Learning Representations (ICLR).