# Policy Mirror Descent with Lookahead

Kimon Protopapas    Anas Barakat

Department of Computer Science, ETH Zürich, Switzerland. Contact: kprotopapas@student.ethz.ch, barakat9anas@gmail.com. Most of this work was completed when both authors were affiliated with ETH Zürich, K.P. as a Master student and A.B. as a postdoctoral fellow. A.B. is currently affiliated with Singapore University of Technology and Design as a research fellow.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Policy Mirror Descent (PMD) stands as a versatile algorithmic framework encompassing several seminal policy gradient algorithms such as natural policy gradient, with connections to state-of-the-art reinforcement learning (RL) algorithms such as TRPO and PPO. PMD can be seen as a soft Policy Iteration algorithm implementing regularized 1-step greedy policy improvement. However, 1-step greedy policies might not be the best choice, and recent remarkable empirical successes in RL such as AlphaGo and AlphaZero have demonstrated that greedy approaches with respect to multiple steps outperform their 1-step counterparts. In this work, we propose a new class of PMD algorithms called h-PMD which incorporates multi-step greedy policy improvement with lookahead depth h into the PMD update rule. To solve discounted infinite-horizon Markov Decision Processes with discount factor γ, we show that h-PMD, which generalizes standard PMD, enjoys a faster dimension-free γ^h-linear convergence rate, contingent on the computation of multi-step greedy policies. We propose an inexact version of h-PMD where lookahead action values are estimated. Under a generative model, we establish a sample complexity for h-PMD which improves over prior work. Finally, we extend our result to linear function approximation to scale to large state spaces. Under suitable assumptions, our sample complexity only involves dependence on the dimension of the feature map space instead of the state space size.

1 Introduction

Policy Mirror Descent (PMD) is a general class of algorithms for solving reinforcement learning (RL) problems. Motivated by the surge of interest in understanding popular Policy Gradient (PG) methods, PMD has been recently investigated in a line of works [25, 53, 29, 20]. Notably, PMD encompasses several PG methods as particular cases via its flexible mirror map. A prominent example is the celebrated Natural PG method. PMD also has close connections to state-of-the-art methods such as TRPO and PPO [50], which have achieved widespread empirical success [41, 42], including most recently in fine-tuning Large Language Models via RL from human feedback [35]. Interestingly, PMD has also been inspired by one of the most fundamental algorithms in RL: Policy Iteration (PI). While PI alternates between policy evaluation and policy improvement, PMD regularizes the latter step to address the instability of PI with inexact policy evaluation. Policy Iteration and its variants have been extensively studied in the literature, see e.g. [5, 32, 39]. In particular, PI has been studied in conjunction with the lookahead mechanism [12], i.e. using multi-step greedy policy improvement instead of single-step greedy policies. Intuitively, the idea is that applying the Bellman operator multiple times before computing a greedy policy leads to a more accurate approximation of the optimal value function. Implemented via Monte Carlo Tree Search (MCTS), multi-step greedy policy improvement contributed to the empirical success of some of the most impressive applications of RL, including AlphaGo, AlphaZero and MuZero [44, 46, 45, 40] (note that this is different from n-step return approaches, which are rather used for policy evaluation [47]).
Besides this practical success, a body of work [12, 13, 49, 52, 51] has investigated the role of lookahead in improving the performance of RL algorithms, reporting a convergence rate speed-up when enhancing PI with h-step greedy policy improvement at a reasonable computational overhead. In this work, we propose to cross-fertilize the PMD class of algorithms with multi-step greedy policy improvement to obtain a novel class of PMD algorithms enjoying the benefits of lookahead. Our contributions are as follows:

- We propose a novel class of algorithms called h-PMD enhancing PMD with multi-step greedy policy updates, where h is the depth of the lookahead. This class collapses to standard PMD when h = 1. When the stepsize parameter goes to infinity, we recover the PI algorithm with multiple-step greedy policy improvement previously analyzed in [12]. When solving a Markov Decision Process with h-PMD, we show that the value suboptimality gap converges to zero geometrically with a contraction factor of γ^h. This rate is faster than the standard convergence rate of PI and than the similar rate for PMD recently established in [20], for any h ≥ 2. This rate improvement requires the computation of an h-step lookahead value function at each iteration, which can be performed using planning methods such as tree search. These results for exact h-PMD, when value functions can be evaluated exactly, are exposed in Sec. 4.
- We examine the inexact setting where the h-step action value functions are not available. In this setting, we propose a Monte Carlo sample-based procedure to estimate the unknown h-step lookahead action-value function at each state-action pair. We provide a sample complexity for this Monte Carlo procedure, improving over standard PMD thanks to the use of lookahead. A larger lookahead depth translates into a better sample complexity and a faster suboptimality gap convergence rate. However, this improvement comes at a more intensive computational effort to perform the lookahead steps. Sec. 5 discusses this inexact setting and the aforementioned tradeoff.
- We extend our results to the function approximation setting to address the case of large state spaces where tabular methods are not tractable. In particular, we design an h-PMD algorithm where the h-step lookahead action-value function is approximated using a linear combination of state-action feature vectors in the policy evaluation step. Under linear function approximation and using a generative sampling model, we provide a performance bound for our h-PMD algorithm. Our resulting sample complexity only depends on the dimension of the feature map space instead of the size of the state space and improves over prior work analyzing PMD with function approximation.
- We perform simulations to verify our theoretical findings empirically on the standard Deep Sea RL environment from DeepMind's bsuite [34]. Our experiments illustrate the convergence rate improvement of h-PMD with increasing lookahead depth h in both the exact and inexact settings.

2 Preliminaries

Notation. For any integer n ∈ ℕ \ {0}, the set of integers from 1 to n is denoted by [n]. For a finite set X, we denote by Δ(X) the probability simplex over X.
For any d ∈ ℕ, we endow the Euclidean space ℝ^d with the norm defined for every v ∈ ℝ^d by ‖v‖_∞ := max_{i ∈ [d]} |v_i|, where v_i is the i-th coordinate. The relative interior of a set Z is denoted by ri(Z).

Markov Decision Process (MDP). We consider an infinite-horizon discounted Markov Decision Process M = (S, A, r, γ, P, ρ) where S and A are finite sets of states and actions with cardinalities S = |S| and A = |A| respectively, P : S × A → Δ(S) is the Markovian probability transition kernel, r : S × A → [0, 1] is the reward function, γ ∈ (0, 1) is the discount factor and ρ ∈ Δ(S) is the initial state distribution. A randomized stationary policy is a mapping π : S → Δ(A) specifying the probability π(a|s) of selecting action a in state s for any s ∈ S, a ∈ A. The set of all such policies is denoted by Π. At each time step, the learning agent takes an action a with probability π(a|s), the environment transitions from the current state s ∈ S to the next state s' ∈ S with probability P(s'|s, a), and the agent receives a reward r(s, a).

Value functions and optimal policy. For any policy π ∈ Π, we define the state value function V^π : S → ℝ for every s ∈ S by V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s ]. The state-action value function Q^π : S × A → ℝ is defined similarly for every s ∈ S, a ∈ A by Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a ], where E_π is the expectation over the state-action Markov chain (s_t, a_t) induced by the MDP and the policy π generating the actions. The goal is to find a policy π maximizing V^π. A classic result shows that there exists an optimal deterministic policy π* ∈ Π maximizing V^π simultaneously for all states [36]. In this work, we focus on searching for an ε-optimal policy, i.e. a policy π satisfying ‖V* − V^π‖_∞ ≤ ε where V* := V^{π*}.

Bellman operators and Bellman equations. We will often represent the reward function r by a vector in ℝ^{SA} and the transition kernel by the operator P ∈ ℝ^{SA×S} acting on vectors v ∈ ℝ^S (which can also be seen as functions defined over S) as follows: (Pv)(s, a) = Σ_{s'∈S} P(s'|s, a) v(s') for any s ∈ S, a ∈ A. We further define the mean operator M^π : ℝ^{SA} → ℝ^S mapping any vector Q ∈ ℝ^{SA} to a vector in ℝ^S whose components are given by (M^π Q)(s) = Σ_{a∈A} π(a|s) Q(s, a), and the maximum operator M : ℝ^{SA} → ℝ^S defined by (M Q)(s) = max_{a∈A} Q(s, a) for any s ∈ S, Q ∈ ℝ^{SA}. Using these notations, we introduce for any given policy π ∈ Π the associated expected Bellman operator T^π : ℝ^S → ℝ^S defined for every V ∈ ℝ^S by T^π V := M^π(r + γPV), and the Bellman optimality operator T : ℝ^S → ℝ^S defined by TV := M(r + γPV) for every V ∈ ℝ^S. Using the notations r^π := M^π r and P^π := M^π P, we can also simply write T^π V = r^π + γP^π V and TV = max_{π∈Π} T^π V for any V ∈ ℝ^S. The Bellman optimality operator T and the expected Bellman operator T^π for any given policy π ∈ Π are γ-contraction mappings w.r.t. the ∞-norm and hence admit unique fixed points V* and V^π respectively. In particular, these operators satisfy the following inequalities: ‖T^π V − V^π‖_∞ ≤ γ‖V − V^π‖_∞ and ‖TV − V*‖_∞ ≤ γ‖V − V*‖_∞ for any V ∈ ℝ^S. Moreover, we recall that for e = (1, 1, ..., 1) ∈ ℝ^S and any c ∈ ℝ_{≥0}, V ∈ ℝ^S, we have T^π(V + c·e) = T^π V + γc·e and T(V + c·e) = TV + γc·e. We recall that the set G(V*) := {π ∈ Π : T^π V* = TV*} of 1-step greedy policies w.r.t. the optimal value V* coincides with the set of stationary optimal policies.
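To make the operator notation above concrete, here is a minimal NumPy sketch (not taken from the authors' codebase) of the transition operator P, the Bellman operators T^π and T, and exact policy evaluation via the linear system (I − γP^π)V = r^π. The array layout (P of shape (S, A, S), r of shape (S, A), policies of shape (S, A)) is an assumption of this sketch.

```python
# Tabular Bellman operators and exact policy evaluation, as defined above.
import numpy as np

def bellman_pi(V, P, r, pi, gamma):
    """Expected Bellman operator T^pi applied to V.
    P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy."""
    Q = r + gamma * P @ V          # Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    return (pi * Q).sum(axis=1)    # (M^pi Q)(s) = sum_a pi(a|s) Q(s, a)

def bellman_opt(V, P, r, gamma):
    """Bellman optimality operator T applied to V."""
    return (r + gamma * P @ V).max(axis=1)   # (M Q)(s) = max_a Q(s, a)

def policy_evaluation(P, r, pi, gamma):
    """Exact V^pi, the fixed point of T^pi, via (I - gamma P^pi) V = r^pi."""
    S = r.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, P)    # P^pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    r_pi = (pi * r).sum(axis=1)              # r^pi(s) = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```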
3 A Refresher on PMD and PI with Lookahead

3.1 PMD and its Connection to PI

Policy Gradient methods consist in performing gradient ascent updates on the objective E_ρ[V^π(s)] with respect to the policy π, parametrized by its values π(a|s) for (s, a) ∈ S × A in the case of a tabular policy parametrization. Interestingly, policy gradient methods have recently been shown to be closely connected to a class of Policy Mirror Descent algorithms, obtained by dynamically reweighting the Bregman divergence with discounted visitation distribution coefficients in a classical mirror descent update rule using policy gradients. We refer the reader to section 4 in [53] or section 3.1 in [20] for a detailed exposition. The resulting PMD update rule is given by

π^{k+1}_s ∈ argmax_{π_s ∈ Δ(A)} { η_k ⟨Q^{π_k}_s, π_s⟩ − D_ϕ(π_s, π^k_s) } ,   (PMD)

where we use the shorthand notations π^k_s = π^k(·|s) and π_s = π(·|s), (η_k) is a sequence of positive stepsizes, Q^{π_k}_s ∈ ℝ^A is a vector containing the state-action values of π_k at state s, and D_ϕ is the Bregman divergence induced by a mirror map ϕ : dom ϕ → ℝ such that Δ(A) ⊆ dom ϕ, i.e. for any π, π' ∈ Π, s ∈ S,

D_ϕ(π_s, π'_s) = ϕ(π_s) − ϕ(π'_s) − ⟨∇ϕ(π'_s), π_s − π'_s⟩ ,

where we suppose throughout this paper that the function ϕ is of Legendre type, i.e. strictly convex and essentially smooth in the relative interior of dom ϕ (see [37], section 26). The mirror map choice gives rise to a large class of algorithms. Two special cases are noteworthy: (a) when ϕ is the squared 2-norm, the Bregman divergence is the squared Euclidean distance and the resulting algorithm is a projected Q-ascent, and (b) when ϕ is the negative entropy, the corresponding Bregman divergence is the Kullback-Leibler (KL) divergence and (PMD) becomes the popular Natural Policy Gradient algorithm (see e.g. [53], section 4 for details). Now, using our operator notations, one can observe that the (PMD) update rule above is equivalent to the following update rule (see Lemma A.1 for a proof),

π^{k+1} ∈ argmax_{π∈Π} { η_k T^π V^{π_k} − D_ϕ(π, π^k) } ,   (1)

where D_ϕ(π, π^k) ∈ ℝ^S is a vector whose components are given by D_ϕ(π_s, π^k_s) for s ∈ S. Note that the maximization can be carried out independently for each state component.

Connection to Policy Iteration. As previously noticed in the literature [53, 20], PMD can be interpreted as a soft PI. Indeed, taking infinite stepsizes η_k in PMD (or a null Bregman divergence) immediately leads to the following update rule:

π^{k+1} ∈ argmax_{π∈Π} { T^π V^{π_k} } ,   (PI)

which corresponds to synchronous Policy Iteration (PI). The method alternates between a policy evaluation step to estimate the value function V^{π_k} of the current policy π_k and a policy improvement step implemented as a one-step greedy policy with respect to the current value function estimate.
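As an illustration of the (PMD) update with the negative-entropy mirror map (the Natural Policy Gradient special case mentioned above), the following sketch performs one exact tabular PMD step. It reuses the policy_evaluation helper from the previous sketch and is a simplified illustration rather than the authors' implementation.

```python
# One (PMD) step with the KL Bregman divergence:
# pi_{k+1}(a|s) is proportional to pi_k(a|s) * exp(eta * Q^{pi_k}(s, a)).
import numpy as np

def q_values(P, r, pi, gamma):
    """Q^pi(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s')."""
    return r + gamma * P @ policy_evaluation(P, r, pi, gamma)

def pmd_kl_step(pi, P, r, gamma, eta):
    """Multiplicative-weights (NPG-style) PMD update in the tabular setting."""
    Q = q_values(P, r, pi, gamma)
    logits = np.log(pi) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```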
3.2 Policy Iteration with Lookahead

As discussed in [12], Policy Iteration can be generalized by performing h ≥ 1 greedy policy improvement steps at each iteration instead of a single step. The resulting h-PI update rule is:

π^{k+1} ∈ argmax_{π∈Π} { T^π T^{h−1} V^{π_k} } ,   (h-PI)

where h ∈ ℕ \ {0}. The h-greedy policy π^{k+1} w.r.t. V^{π_k} selects the first optimal action of a non-stationary h-horizon optimal control problem. It can also be seen as the 1-step greedy policy w.r.t. T^{h−1} V^{π_k}. In the rest of this paper, we will denote by G_h(V) the set of h-step greedy policies w.r.t. any value function V ∈ ℝ^S, i.e. G_h(V) := argmax_{π∈Π} T^π T^{h−1} V = {π ∈ Π : T^π T^{h−1} V = T^h V}. Notice that G_1(V) = G(V). Overall, the h-PI method alternates between finding an h-greedy policy and estimating the value of this policy. Interestingly, similarly to PI, an h-greedy policy guarantees monotonic improvement, i.e. V^{π'} ≥ V^π component-wise for any π ∈ Π and π' ∈ G_h(V^π). Moreover, since T^h is a γ^h-contraction, it can be shown that the sequence (‖V* − V^{π_k}‖_∞)_k contracts with coefficient γ^h. See section 3 in [12] for further details about h-PI.

Remark 3.1. While the iteration complexity always improves with a larger depth h (see Theorem 3 in [12]), each iteration becomes more computationally demanding. As mentioned in [14], when the model is known, the h-greedy policy can be computed with Dynamic Programming (DP) in time linear in the depth h. Another possibility is to implement a tree search of depth h starting from the root state s to compute the h-greedy policy in deterministic MDPs. We refer to [14] for further details. Sampling-based tree search methods such as Monte Carlo Tree Search (MCTS) (see [9]) can be used to circumvent the exponential complexity in the depth h of tree search methods.

4 Policy Mirror Descent with Lookahead

In this section we propose our novel Policy Mirror Descent (PMD) algorithm which incorporates h-step lookahead for policy improvement: h-PMD.

4.1 h-PMD: Using h-Step Greedy Updates

Similarly to the generalization of PI to h-PI discussed in the previous section, we obtain our algorithm h-PMD by incorporating lookahead into PMD as follows:

π^{k+1} ∈ argmax_{π∈Π} { η_k T^π T^{h−1} V^{π_k} − D_ϕ(π, π^k) } .   (h-PMD)

Notice that we recover (1), i.e. (PMD), by setting h = 1. We now provide an alternative way to write the h-PMD update rule which is more convenient for its implementation. We introduce a few additional notations for this purpose. For any policy π ∈ Π, let V^π_h := T^{h−1} V^π for any integer h ≥ 1. Consider the corresponding action value function defined for every (s, a) ∈ S × A by:

Q^π_h(s, a) := r(s, a) + γ(PV^π_h)(s, a) .   (2)

Using these notations, it can be easily shown that the h-PMD rule above is equivalent to:

π^{k+1}_s ∈ argmax_{π_s ∈ Δ(A)} { η_k ⟨Q^{π_k}_h(s, ·), π_s⟩ − D_ϕ(π_s, π^k_s) } .   (h-PMD')

Before moving to the analysis of this scheme, we provide two concrete examples of the implementation of h-PMD corresponding to two standard mirror maps.

(1) Projected Q_h-ascent. When the mirror map ϕ is the squared 2-norm, the corresponding Bregman divergence is the squared Euclidean distance and h-PMD becomes

π^{k+1}_s = Proj_{Δ(A)}( π^k_s + η_k Q^{π_k}_h(s, ·) ) ,   (3)

for every s ∈ S, where Proj_{Δ(A)} is the projection operator onto the simplex Δ(A).

(2) Multiplicative Q_h-ascent. When the mirror map ϕ is the negative entropy, the Bregman divergence D_ϕ is the Kullback-Leibler divergence and for every (s, a) ∈ S × A,

π^{k+1}(a|s) = π^k(a|s) exp(η_k Q^{π_k}_h(s, a)) / Z_k(s) ,   Z_k(s) := Σ_{a'∈A} π^k(a'|s) exp(η_k Q^{π_k}_h(s, a')) .   (4)
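The following sketch illustrates one exact (h-PMD) step with the KL mirror map, i.e. the multiplicative Q_h-ascent (4): the only change with respect to standard PMD is that Q^π is replaced by the lookahead value Q^π_h of (2), obtained by applying the Bellman optimality operator h − 1 times to V^π. It reuses the tabular helpers sketched above and is an illustration under the same assumptions, not the authors' implementation.

```python
# Exact h-PMD step: lookahead values Q^pi_h, then multiplicative Q_h-ascent.
import numpy as np

def lookahead_q(P, r, pi, gamma, h):
    """Q^pi_h(s, a) = r(s, a) + gamma * (P V^pi_h)(s, a), with V^pi_h = T^{h-1} V^pi."""
    V = policy_evaluation(P, r, pi, gamma)
    for _ in range(h - 1):
        V = bellman_opt(V, P, r, gamma)     # apply the optimality operator T repeatedly
    return r + gamma * P @ V

def h_pmd_kl_step(pi, P, r, gamma, eta, h):
    """Multiplicative Q_h-ascent, equation (4)."""
    Qh = lookahead_q(P, r, pi, gamma, h)
    logits = np.log(pi) + eta * Qh
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```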
Connection to AlphaZero. Before switching gears to the analysis, we point out an interesting connection between h-PMD and the successful class of AlphaZero algorithms [46, 45, 40], which are based on heuristics. It has been shown in [16] that AlphaZero can be seen as a regularized policy optimization algorithm, drawing a connection to the standard PMD algorithm (with h = 1). We argue that AlphaZero is even more naturally connected to our h-PMD algorithm. Indeed, the Q-values in [16, Eq. (7), p. 3] correspond more closely to the lookahead action value function Q^π_h approximated via tree search than to the standard Q^π function with h = 1. This connection clearly delineates the dependence on the lookahead depth (or tree search depth) h, which was not made explicit in [16]. We have implemented a version of our h-PMD algorithm with DeepMind's MCTS implementation (see section D.6 and the code provided for details). Conducting further experiments to compare our algorithm to AlphaZero in similar large-scale settings would be interesting. We are not aware of any theoretical convergence guarantee for AlphaZero, which relies on many heuristics. In contrast, our h-PMD algorithm enjoys the theoretical guarantees that we establish in the next sections.

4.2 Convergence Analysis for Exact h-PMD

Using the contraction properties of the Bellman operators, it can be shown that the suboptimality gap V* − V^{π_k} for PI iterates converges to zero at a linear rate with a contraction factor γ, regardless of the underlying MDP. This instance-independent convergence rate was generalized to PMD in [20]. In this section, we establish a linear convergence rate for the value function suboptimality gap of exact h-PMD with a contraction factor of γ^h, where h is the lookahead depth. We assume for now that the value function V^{π_k} and a greedy policy π_{k+1} in (h-PMD) can be computed exactly. These assumptions will be relaxed in the next section dealing with inexact h-PMD.

Theorem 4.1 (Exact h-PMD). Let (c_k) be a sequence of positive reals and let the stepsize η_k in (h-PMD) satisfy η_k ≥ (1/c_k) min_{π ∈ G_h(V^{π_k})} D_ϕ(π, π^k), where we recall that G_h(V^{π_k}) is the set of greedy policies with respect to T^{h−1} V^{π_k} and that the minimum is computed component-wise (so that the condition holds in every state). Initialized at π_0 ∈ ri(Π), the iterates (π_k) of (h-PMD) with h ∈ ℕ \ {0} satisfy for every k ∈ ℕ,

‖V* − V^{π_k}‖_∞ ≤ γ^{hk} ‖V* − V^{π_0}‖_∞ + (1/(1−γ)) Σ_{t=1}^{k} γ^{h(k−t)} c_{t−1} .

A few remarks are in order regarding this result:

- Compared to [20], our new algorithm achieves a faster γ^h-rate, where h is the depth of the lookahead. It should be noted that the approach in [20] is claimed to be optimal over all methods that are guaranteed to increase the probability of the greedy action at each timestep. Adding lookahead to the policy improvement step circumvents this restriction.
- Unlike prior work [53, 25] featuring distribution mismatch coefficients, which can scale with the size of the state space, our convergence rate does not depend on any other instance-dependent quantities. This is thanks to an analysis which circumvents the use of the performance difference lemma, similarly to [20].
- Choosing c_t = γ^{2h(t+1)} c_0 for a positive constant c_0 and every integer t, the bound becomes (after bounding the geometric sum): ‖V* − V^{π_k}‖_∞ ≤ γ^{hk} ( ‖V* − V^{π_0}‖_∞ + c_0/(1−γ)^2 ). As c_0 → 0, we recover the linear convergence result of h-PI established in [12].
- This faster rate comes at the cost of a more complex value function computation at each iteration. However, our experiments suggest that the benefits of the faster convergence rate greatly outweigh the extra cost of computing the lookahead, in terms of both overall running time until convergence and sample complexity (in the inexact case). See section 7 and Appendix D for evidence.
- The cost of computing the adaptive stepsizes is typically simply that of computing a Bregman divergence between two policies. See Appendices A.3 and D.4 for a more detailed discussion.

We defer the proof of Theorem 4.1 to Appendix A. Our proof highlights the relationship between h-PMD and h-PI through the explicit use of Bellman operators. Notice that even in the particular case h = 1, our proof is more compact than the one in [20] (see sec. 6 and Lemmas A.1 to A.3 therein for comparison) thanks to our convenient notations.
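For concreteness, the sketch below shows one way to compute the adaptive stepsize of Theorem 4.1 with the KL mirror map and the choice c_k = γ^{2h(k+1)}: pick one h-greedy policy w.r.t. T^{h−1}V^{π_k}, measure its KL divergence to π_k in every state, and take the maximum over states divided by c_k. This is a hedged illustration of the condition as we read it, reusing the tabular helpers from the previous sketches.

```python
# Adaptive stepsize for h-PMD with the KL Bregman divergence.
import numpy as np

def h_greedy_policy(P, r, pi, gamma, h):
    """A deterministic policy in G_h(V^pi): greedy w.r.t. Q^pi_h."""
    Qh = lookahead_q(P, r, pi, gamma, h)        # from the previous sketch
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), Qh.argmax(axis=1)] = 1.0
    return greedy

def adaptive_stepsize(pi, P, r, gamma, h, k, eps=1e-12):
    """eta_k >= (1/c_k) * max_s D_KL(greedy_s, pi_k_s), with c_k = gamma^(2h(k+1))."""
    greedy = h_greedy_policy(P, r, pi, gamma, h)
    # For a deterministic greedy policy, KL(greedy_s || pi_s) = -log pi(a*|s).
    kl_per_state = -(greedy * np.log(pi + eps)).sum(axis=1)
    c_k = gamma ** (2 * h * (k + 1))
    return kl_per_state.max() / c_k
```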
5 Inexact and Stochastic h-Policy Mirror Descent

In this section, we discuss the case where the lookahead action value Q^{π_k}_h defined in (2) is unknown. We propose a procedure to estimate it using Monte Carlo sampling and briefly discuss an alternative based on tree search. The resulting inexact h-PMD update rule for every s ∈ S is

π^{k+1}_s ∈ argmax_{π_s ∈ Δ(A)} { η_k ⟨Q̂^{π_k}_h(s, ·), π_s⟩ − D_ϕ(π_s, π^k_s) } ,   (5)

where Q̂^{π_k}_h is the estimated version of the lookahead action value Q^{π_k}_h induced by policy π_k. We conclude this section by providing a convergence analysis for inexact h-PMD and discussing its sample complexity using the proposed Monte Carlo estimator.

5.1 Monte Carlo h-Greedy Policy Evaluation

Estimating the lookahead action value function Q^π_h for a given policy π involves solving an h-horizon planning problem using samples from the MDP. Our estimation procedure combines a standard planning method with Monte Carlo estimation under a generative model of the MDP. We give an algorithmic description of the procedure below, and refer the reader to Appendix B.1 for a precise definition of the procedure. Applying the recursive algorithm below with k := h returns an estimate of the action value function Q^π_h(s, a) at a given state-action pair (s, a) ∈ S × A.

Algorithm 1: Lookahead Q-function Estimation via Monte Carlo Planning
Procedure Q(k, s, a, π):
  if k = 1 then
    return r(s, a) + (γ/M) Σ_{i=1}^{M} V̂^π(s'_i)   (where s'_i ∼ P(·|s, a) for i ∈ [M])
  else
    return r(s, a) + (γ/M) Σ_{i=1}^{M} max_{a'∈A} Q(k−1, s'_i, a', π)   (where s'_i ∼ P(·|s, a) for i ∈ [M])
  end if

Note that the base case of the recursion estimates the value function using Monte Carlo rollouts, see Appendix B.1.

Remark 5.1. Notice that actions are exhaustively selected in our procedure, as in a planning method. Using bandit ideas to guide Monte Carlo planning (see e.g. [23]) would be interesting to investigate for more efficiency. We leave the investigation of such selective action sampling and exploration for future work. In practice, lookahead policies are often computed using tree search techniques [12].

Remark 5.2. When h = 1, the procedure collapses to a simple Monte Carlo estimate of the Q-value function Q^{π_k} at each iteration k of the h-PMD algorithm.
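The sketch below turns Algorithm 1 and its base-case rollout estimator (Appendix B.1) into plain Python. The generative-model interface (sample_next_state, sample_action) and the tabular reward array r are assumptions of this sketch and not part of the paper's code.

```python
# Monte Carlo lookahead estimation of Q^pi_h(s, a), following Algorithm 1.
import numpy as np

def mc_value_estimate(s, pi, r, sample_next_state, sample_action, gamma, M0, H):
    """hat V^pi(s): average discounted return of M0 rollouts of length H."""
    total = 0.0
    for _ in range(M0):
        state, ret, disc = s, 0.0, 1.0
        for _ in range(H):
            a = sample_action(pi, state)
            ret += disc * r[state, a]
            disc *= gamma
            state = sample_next_state(state, a)
        total += ret
    return total / M0

def lookahead_q_estimate(k, s, a, pi, r, sample_next_state, sample_action,
                         gamma, M, M0, H, n_actions):
    """Recursive estimator of Q^pi_k(s, a) as in Algorithm 1 (call with k = h)."""
    next_states = [sample_next_state(s, a) for _ in range(M)]
    if k == 1:
        vals = [mc_value_estimate(sp, pi, r, sample_next_state, sample_action,
                                  gamma, M0, H) for sp in next_states]
    else:
        vals = [max(lookahead_q_estimate(k - 1, sp, ap, pi, r, sample_next_state,
                                         sample_action, gamma, M, M0, H, n_actions)
                    for ap in range(n_actions)) for sp in next_states]
    return r[s, a] + gamma * float(np.mean(vals))
```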
5.2 Analysis of Inexact h-PMD

The next theorem is a generalization of Theorem 4.1 to the inexact setting, in which the lookahead function Q^{π_k}_h can only be estimated with some error at each iteration k of h-PMD.

Theorem 5.3 (Inexact h-PMD). Suppose there exists b ∈ ℝ_+ s.t. ‖Q̂^{π_k}_h − Q^{π_k}_h‖_∞ ≤ b, where the maximum norm is over both the state and action spaces. Let G_h := argmax_{π∈Π} M^π Q̂^{π_k}_h, let (c_k) be a sequence of positive reals and let the stepsize η_k in (h-PMD) satisfy η_k ≥ (1/c_k) min_{π ∈ G_h} D_ϕ(π, π^k), where we recall that the minimum is computed component-wise. Initialized at π_0 ∈ ri(Π), the iterates (π_k) of inexact (h-PMD) with h ∈ ℕ \ {0} satisfy for every k ∈ ℕ,

‖V* − V^{π_k}‖_∞ ≤ γ^{hk} ‖V* − V^{π_0}‖_∞ + (1/(1−γ)) Σ_{t=1}^{k} γ^{h(k−t)} c_{t−1} + 2b / ((1−γ)(1−γ^h)) .

The proof of this result is similar to the proof of Theorem 4.1 and consists in propagating the error b through the analysis. We defer it to Appendix B. As can be directly seen from this bound, the γ^h convergence rate from the exact setting generalizes to the inexact setting. Therefore, higher lookahead depths lead to faster convergence rates. Another improvement with respect to [20] is that the additive error term has a milder dependence on the effective horizon 1/(1−γ): the term 2b/((1−γ)(1−γ^h)) is smaller, for all values of h ≥ 2, than 4b/(1−γ)^2. Larger lookahead depths thus yield a smaller asymptotic error in terms of the suboptimality gap. We are now ready to establish the sample complexity of inexact h-PMD under a generative model using Theorem 5.3 together with concentration results for our Monte Carlo lookahead action value estimator.

Theorem 5.4 (Sample complexity of h-PMD). Assume that inexact h-PMD is run for a number of iterations K > (1/(h(1−γ))) log( 4 / (ε(1−γ)(1−γ^h)) ), using the stepsize choice defined by the sequence c_k := γ^{2h(k+1)}. Additionally, suppose we are given a target suboptimality value ε > 0 and a probability threshold δ > 0. Finally, assume the lookahead value function Q̂^{π_k}_h is approximated at each iteration with the Monte Carlo estimation procedure described in section 5.1, with the following parameter values: M_0 = O( γ^{2h} / ((1−γ)^4 (1−γ^h)^2 ε^2) ) and, for all j ∈ [h], M_j = M = O( 1 / ((1−γ)^6 ε^2) ). Then, with probability at least 1 − δ, the suboptimality at all iterations k satisfies the following bound:

‖V* − V^{π_k}‖_∞ ≤ γ^{hk} ‖V* − V^{π_0}‖_∞ + 2b (1 − γ^{hk}) / ((1−γ)(1−γ^h)) ,

where b denotes the uniform estimation error ‖Q̂^{π_k}_h − Q^{π_k}_h‖_∞ guaranteed by the above parameter choices, which is of order ε(1−γ)(1−γ^h) up to constants. Using this procedure, h-PMD uses at most K·M_0·H·S + h·K·M·S·A samples in total, where H is the length of the Monte Carlo rollouts. The overall sample complexity of inexact h-PMD is then given by O( S / (h ε^2 (1−γ)^6 (1−γ^h)^2) + SA / (ε^2 (1−γ)^7) ), where the notation O(·) hides at most polylogarithmic factors.

Compared to Theorem 5.1 in [20], the dependence on the effective horizon improves by a factor of order 1/(1−γ) for h > 1/(A(1−γ)), thanks to our Monte Carlo lookahead estimator. Their sample complexity is of order O(1/(ε^2(1−γ)^8)), whereas for h > 1/(A(1−γ)) ours becomes O(1/(ε^2(1−γ)^7)). Our sample complexity does not depend on quantities such as the distribution mismatch coefficients used in prior work [53, 25], which may scale with the state space size.

6 Extension to Function Approximation

In MDPs with prohibitively large state-action spaces, we represent each state-action pair (s, a) ∈ S × A by a feature vector ψ(s, a) ∈ ℝ^d, where typically d ≪ SA and ψ : S × A → ℝ^d is a feature map, also represented as a matrix Ψ ∈ ℝ^{SA×d}. Assuming that this feature map holds enough information about each state-action pair (more rigorous assumptions will be provided later), it is possible to approximate any value function using a matrix-vector product Ψθ, where θ ∈ ℝ^d. In this section, we propose an inexact h-PMD algorithm using action value function approximation.

6.1 Inexact h-PMD with Function Approximation

Using the notations above, approximating Q^{π_k}_h by Ψθ_k, the inexact h-PMD update rule becomes:

π^{k+1}_s ∈ argmax_{π_s ∈ Δ(A)} { η_k ⟨(Ψθ_k)_s, π_s⟩ − D_ϕ(π_s, π^k_s) } .   (7)

We note that the policy update above can be kept implicit to avoid computing a policy for every state. It is possible to implement this algorithm using our Monte Carlo planning method in a similar fashion to [27]: use the procedure in section 5.1 to estimate Q^{π_k}_h(s, a) for all state-action pairs (s, a) in some set C ⊆ S × A, and use these estimates as targets for a least squares regression to approximate Q^{π_k}_h by Ψθ_k at time step k. See Appendix C for further details.
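A minimal sketch of the function-approximation variant just described: estimate the lookahead targets on a core set C of state-action pairs, fit θ_k by least squares against the feature matrix Ψ, and plug Ψθ_k into the KL-mirror-map version of update (7). The flat indexing of state-action pairs and the way C and the targets are produced are assumptions of this sketch.

```python
# Least-squares fit of lookahead targets and the resulting policy update (7).
import numpy as np

def fit_theta(Psi, C, q_targets):
    """theta_k = argmin_theta sum_{z in C} (psi(z)^T theta - hatQ(z))^2.
    Psi: (S*A, d) feature matrix, C: array of flat state-action indices,
    q_targets: estimated lookahead values hat Q^{pi_k}_h(z) for z in C."""
    Psi_C = Psi[C]                                   # (|C|, d)
    theta, *_ = np.linalg.lstsq(Psi_C, q_targets, rcond=None)
    return theta

def h_pmd_fa_kl_step(pi, Psi, theta, eta, n_states, n_actions):
    """KL-mirror-map update with Q^{pi_k}_h replaced by Psi theta_k."""
    Q_approx = (Psi @ theta).reshape(n_states, n_actions)
    logits = np.log(pi) + eta * Q_approx
    logits -= logits.max(axis=1, keepdims=True)
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```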
We conclude this section with a convergence analysis of h-PMD with linear function approximation.

6.2 Convergence Analysis of h-PMD with Linear Function Approximation

Our analysis follows the approach of [27]. We make the following standard assumptions.

Assumption 6.1. The feature matrix Ψ ∈ ℝ^{SA×d}, where d ≤ SA, is full rank.

Assumption 6.2 (Approximate universal value function realizability). There exists ε_FA > 0 s.t. for any π ∈ Π, inf_{θ∈ℝ^d} ‖Q^π_h − Ψθ‖_∞ ≤ ε_FA.

We defer a discussion of Assumption 6.2 to Appendix C.2.

Theorem 6.3 (Convergence of h-PMD with linear function approximation). Let (θ_k) be the sequence of iterates produced by h-PMD with linear function approximation, run for K iterations, starting with policy π_0 ∈ ri(Π) and using stepsizes defined by the sequence (c_k). Assume that the targets Q̂^{π_k}_h were computed using the procedure described in sec. 5.1 such that, with probability at least 1 − δ, for all k ≤ K and all z ∈ C ⊆ S × A, the targets satisfy |Q̂^{π_k}_h(z) − Q^{π_k}_h(z)| ≤ ε (this can be achieved by choosing the parameters as in Theorem 5.4; see Appendix C for details). Then, there exists a choice of C for which the policy iterates satisfy at each iteration:

‖V* − V^{π_k}‖_∞ ≤ γ^{hk} ‖V* − V^{π_0}‖_∞ + (1/(1−γ)) Σ_{t=1}^{k} γ^{h(k−t)} c_{t−1} + ( √d ε + 2(1 + √d) ε_FA ) / ((1−γ)(1−γ^h)) .

The proof of Theorem 6.3 can be found in Appendix C.3. Notice that our performance bound does not depend on the state space size as in sec. 5, but instead depends on the dimension d of the feature space. We should note, though, that the choice of the set C is crucial for this result: it is given by the Kiefer-Wolfowitz theorem [22], which guarantees that |C| = O(d^2), as discussed in [27]. Moreover, our method for function approximation yields a sample complexity that is instance-independent. Existing results for PMD with function approximation [3, 54] depend on distribution mismatch coefficients which can scale with the size of the state space. In terms of computational complexity, at each iteration, h-PMD uses O((MA)^h) elementary operations, where M is the size of the minibatch of trajectories used, A is the size of the action space and h is the lookahead depth. This computational complexity, which scales exponentially in h, is inherent to tree search methods. Despite this seemingly prohibitive computational scaling, tree search has been instrumental in practice for achieving state-of-the-art results in some environments [40, 19], and modern implementations (notably using GPU-based search on vectorized environments) make tree search much more efficient in practice.
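The core set C above is tied to the Kiefer-Wolfowitz theorem [22, 27]. As one possible illustration (not necessarily the authors' construction), the sketch below approximates a G-optimal design over the rows of Ψ with Frank-Wolfe iterations and takes the support of the resulting design as C.

```python
# Approximate G-optimal design via Frank-Wolfe, used to pick a core set C.
import numpy as np

def approximate_g_optimal_design(Psi, n_iters=1000, reg=1e-8):
    """Psi: (n, d) feature matrix. Returns design weights rho over the n rows."""
    n, d = Psi.shape
    rho = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        A = Psi.T @ (rho[:, None] * Psi) + reg * np.eye(d)   # sum_z rho_z psi_z psi_z^T
        A_inv = np.linalg.inv(A)
        g = np.einsum("nd,de,ne->n", Psi, A_inv, Psi)        # psi_z^T A^{-1} psi_z
        z = int(g.argmax())
        step = (g[z] / d - 1.0) / (g[z] - 1.0 + 1e-12)       # Frank-Wolfe step size
        rho = (1.0 - step) * rho
        rho[z] += step
    return rho

def core_set(Psi, tol=1e-4):
    """Support of the approximate design, used as the core set C."""
    rho = approximate_g_optimal_design(Psi)
    return np.where(rho > tol)[0]
```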
7 Simulations

We conduct simulations to investigate the effect of the lookahead depth on the convergence rate and to illustrate our theoretical findings. We run the h-PMD algorithm for different values of h in both the exact and inexact settings on the Deep Sea environment from DeepMind's bsuite [34], using a grid size of 64 by 64 and a discount factor γ = 0.99. Additional experiments are provided in Appendix D. Our codebase, where all our experiments can be replicated, is available here: https://gitlab.com/kimon.protopapa/pmd-lookahead.

Exact h-PMD. We run the exact h-PMD algorithm for 100 iterations for increasing values of h using the KL divergence. Similar results were observed for the Euclidean divergence. We tested two different stepsize schedules: (a) in dotted lines in Fig. 1 (left), η_k equal to its lower bound in sec. 4, with the choice c_k := γ^{2h(k+1)} (note the dependence on h); and (b) in solid lines, an identical stepsize schedule across all values of h with c_k := γ^{2(k+1)}, to isolate the effect of the lookahead. We clearly observe under both stepsize schedules that h-PMD converges in fewer iterations when lookahead value functions are used instead of regular Q-functions.

Figure 1: Suboptimality value function gap for h-PMD in the exact (left) and inexact (middle/right) settings, for h ∈ {1, 5, 10, 15, 20}, plotted against iterations in the exact case (left) and against both iterations (middle) and runtime in seconds (right) in the inexact case. 16 runs were performed for each h; the mean is shown as a solid line and the standard deviation as a shaded area. In dotted lines (left), the stepsize η_k is equal to its lower bound in sec. 4, with the choice c_k := γ^{2h(k+1)} (note the dependence on h), and in solid lines, the stepsize η_k is set using an identical schedule across all values of h with c_k := γ^{2(k+1)} to isolate the effect of the lookahead. Notice that higher values of h still perform better even in terms of runtime.

Inexact h-PMD. In this setting, we estimated the value function Q_h using the vanilla Monte Carlo planning procedure detailed in section 5 in a stochastic variant of the Deep Sea environment, and all the parameters except for h were kept identical across runs. As predicted by our results, h-PMD converges in fewer iterations when h is increased (see Fig. 1 (middle)). We also observed in our simulations that h-PMD uses fewer samples overall (see Appendix D for the total number of samples used at each iteration), and usually converges after less overall computation time (see Fig. 1 (right)). We also refer the reader to Appendix D, where we have performed additional experiments with a larger lookahead depth h = 100. In this case, the algorithm converges in a single iteration. This is theoretically expected, as computing the lookahead values with very large h essentially amounts to computing the optimal values, as in value iteration with a large number of iterations. We also performed additional experiments in continuous control tasks to illustrate the general applicability of our algorithm (see Fig. 8 in Appendix D).

Remark 7.1 (Choice of the lookahead depth h). The lookahead depth is a hyperparameter of the algorithm and can be tuned similarly to other hyperparameters such as the stepsize. Of course, the best value depends on the environment and the structure of the reward at hand. Sparse and delayed reward settings will likely benefit from lookahead with larger depth values. We have performed several simulations with different values of h for each environment setting, and the performance can potentially improve drastically with a better lookahead depth value (see also Appendix D for further simulations). In addition, we observe that a larger lookahead depth is not always better: see Fig. 7 in the appendix for an example where a large lookahead depth becomes slower and does not perform better; in this more challenging practical setting, the best performance is obtained for intermediate values of h. This illustrates the tradeoff in choosing the depth h between an improved convergence rate and the computational cost induced by a larger h. We believe further investigation regarding the selection of the depth parameter might be useful to further improve the practical performance of the algorithm.

8 Related Work

Policy Mirror Descent and Policy Gradient Methods. Motivated by their empirical success [41, 42], the analysis of PG methods has recently attracted a lot of attention [1, 6, 21, 7, 53, 25, 20]. Among these works, a line of research has focused on the analysis of PMD [53, 25, 28, 29, 20, 3] as a flexible algorithmic framework, and on its particular instances such as the celebrated natural policy gradient [21, 54]. In particular, for solving unregularized MDPs (which is our main focus), Xiao [53] established linear convergence of PMD in the tabular setting with a rate depending on instance-dependent distribution mismatch coefficients, and extended these results to the inexact setting using a generative model.
Similar results were established for the particular case of the natural policy gradient algorithm [6, 21]. More recently, Johnson, Pike-Burke, and Rebeschini [20] showed a dimension-free rate for PMD with exact policy evaluation using adaptive stepsizes, closing the gap with PI. Notably, they further show that this rate is optimal and their analysis inspired from PI circumvents the use of the performance difference lemma which was prevalent in prior work. We improve over these results using lookahead and we extend our analysis beyond the tabular setting by using linear function approximation. To the best of our knowledge, the use of multi-step greedy policy improvement in PMD and its analysis are novel in the literature. Policy Iteration with Multiple-Step Policy Improvement. Multi-step greedy policies have been successfully used in applications such as the game of Go and have been studied in a body of works [12, 13, 52, 15, 49]. In particular, multi-step greedy policy improvement has been studied in conjunction with PI in [12, 52]. Efroni et al. [12] introduced h-PI which incorporates h-lookahead to PI and generalized existing analyses on PI with 1-step greedy policy improvement to h-step greedy policies. However, their work requires access to an h-greedy policy oracle, and the h-PI algorithm is unstable in the stochastic setting even for h = 1 because of potentially large policy updates. We address all these issues in the present work with our h-PMD algorithm. Winnicki and Srikant [52] showed that a first-visit version of a PI scheme using a single sample path for policy evaluation converges to the optimal policy provided that the policy improvement step uses lookahead instead of a 1-step greedy policy. Unlike our present work, the latter work assumes access to a lookahead greedy policy oracle. More recently, Alegre et al. [2] proposed to use lookahead for policy improvement in a transfer learning setting. Extending Generalized Policy Improvement (GPI) [4] which identifies a new policy that simultaneously improves over all previously learned policies each solving a specific task, they introduce h-GPI which is a multi-step extension of GPI. We rather consider a lookahead version of PMD including policy gradient methods instead of PI. We focus on solving a single task and provide policy optimization guarantees in terms of iteration and sample complexities. Lookahead-Based Policy Improvement with Tree Search. Lookahead policy improvement is usually implemented via tree search methods [9, 43, 40, 24, 18, 14]. Recently, Dalal et al. [11] proposed a method combining policy gradient and tree search to reduce the variance of stochastic policy gradients. The method relies on a softmax policy parametrization incorporating a tree expansion. Our general h-PMD framework which incorporates a different lookahead policy improvement step also brings together PG and tree search methods. Morimura et al. [31] designed a method with a mixture policy of PG and MCTS for non-Markov Decision Processes. Grill et al. [17] showed that Alpha Zero as well as several other MCTS algorithms compute approximate solutions to a family of regularized policy optimization problems which bear similarities with our h-PMD algorithm update rule for h = 1 (see sec. 3.2 therein). 9 Conclusion In this work, we introduced a novel class of PMD algorithms with lookahead inspired by the success of lookahead policies in practice. We have shown an improved γh-linear convergence rate depending on the lookahead depth in the exact setting. 
We proposed a stochastic version of our algorithm enjoying an improved sample complexity. We further extended our results to scale to large state spaces via linear function approximation with a performance bound independent of the state space size. Our paper offers several interesting directions for future research. A possible avenue for future work is to investigate the use of more general function approximators such as neural networks to approximate the lookahead value functions and scale to large state-action spaces. Recent work for PMD along these lines [3] might be a good starting point. Enhancing our algorithm with exploration mechanisms (see e.g. the recent work [33]) is an important future direction. Our PMD update rule might offer some advantage in this regard as was recently observed in the literature [28]. Constructing more efficient fully online estimators for the lookahead action values using MCTS and designing adaptive lookahead strategies to select the tree search horizon [38] are promising avenues for future work. Our algorithm brings together two popular and successful families of RL algorithms: PG methods and tree search methods. Looking forward, we hope our work further stimulates research efforts to investigate practical and efficient methods enjoying the benefits of both classes of methods. Acknowledgments and Disclosure of Funding We would like to thank the anonymous area chair and reviewers for their useful comments which helped us to improve our paper. Most of this work was completed when A.B. was affiliated with ETH Z urich as a postdoctoral fellow. This work was partially supported by an ETH Foundations of Data Science (ETH-FDS) postdoctoral fellowship. [1] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431 4506, 2021. 9 [2] L. N. Alegre, A. Bazzan, A. Nowe, and B. C. da Silva. Multi-step generalized policy improvement by leveraging approximate models. In Advances in Neural Information Processing Systems, volume 36, pages 38181 38205, 2023. 10 [3] C. Alfano, R. Yuan, and P. Rebeschini. A novel framework for policy mirror descent with general parameterization and linear convergence. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 8, 9, 10 [4] A. Barreto, S. Hou, D. Borsa, D. Silver, and D. Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079 30087, 2020. 10 [5] D. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996. 1 [6] J. Bhandari and D. Russo. On the linear convergence of policy gradient methods for finite mdps. In International Conference on Artificial Intelligence and Statistics, pages 2386 2394. PMLR, 2021. 9 [7] J. Bhandari and D. Russo. Global optimality guarantees for policy gradient methods. Operations Research, 2024. 9 [8] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym, 2016. 31 [9] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1 43, 2012. 4, 10 [10] G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using bregman functions. SIAM Journal on Optimization, 3(3):538 543, 1993. 
17 [11] G. Dalal, A. Hallak, S. Mannor, and G. Chechik. Softtreemax: Policy gradient with tree search, 2022. 10 [12] Y. Efroni, G. Dalal, B. Scherrer, and S. Mannor. Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1387 1396. PMLR, 10 15 Jul 2018. 1, 2, 4, 5, 6, 10 [13] Y. Efroni, G. Dalal, B. Scherrer, and S. Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, volume 31, 2018. 2, 10 [14] Y. Efroni, G. Dalal, B. Scherrer, and S. Mannor. How to combine tree-search methods in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3494 3501, 2019. 4, 10 [15] Y. Efroni, M. Ghavamzadeh, and S. Mannor. Online planning with lookahead policies, 2020. 10 [16] J.-B. Grill, F. Altch e, Y. Tang, T. Hubert, M. Valko, I. Antonoglou, and R. Munos. Montecarlo tree search as regularized policy optimization. In International Conference on Machine Learning, pages 3769 3778. PMLR, 2020. 5 [17] J.-B. Grill, F. Altch e, Y. Tang, T. Hubert, M. Valko, I. Antonoglou, and R. Munos. Monte-Carlo tree search as regularized policy optimization. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3769 3778. PMLR, 13 18 Jul 2020. 10 [18] A. Hallak, G. Dalal, S. Dalton, I. Frosio, S. Mannor, and G. Chechik. Improve agents without retraining: Parallel tree search with off-policy correction, 2023. 10 [19] M. Hessel, I. Danihelka, F. Viola, A. Guez, S. Schmitt, L. Sifre, T. Weber, D. Silver, and H. van Hasselt. Muesli: Combining improvements in policy optimization, 2022. 8 [20] E. Johnson, C. Pike-Burke, and P. Rebeschini. Optimal convergence rate for exact policy mirror descent in discounted markov decision processes. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 1, 2, 3, 4, 5, 6, 7, 9, 16, 17, 18, 19, 29 [21] S. Khodadadian, P. R. Jhunjhunwala, S. M. Varma, and S. T. Maguluri. On linear and superlinear convergence of natural policy gradient algorithm. Systems & Control Letters, 164:105214, 2022. 9 [22] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363 366, 1960. 8, 27 [23] L. Kocsis and C. Szepesv ari. Bandit based monte-carlo planning. In Machine Learning: ECML 2006, pages 282 293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. 6 [24] L. Kocsis and C. Szepesv ari. Bandit based monte-carlo planning. In J. F urnkranz, T. Scheffer, and M. Spiliopoulou, editors, Machine Learning: ECML 2006, pages 282 293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. 10 [25] G. Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical programming, 198(1):1059 1106, 2023. 1, 5, 7, 9, 18 [26] R. T. Lange. gymnax: A JAX-based reinforcement learning environment library, 2022. 30 [27] T. Lattimore and C. Szepesvari. Learning with good feature representations in bandits and in rl with a generative model. In International Conference on Machine Learning, 2019. 8, 27, 28 [28] Y. Li and G. Lan. Policy mirror descent inherently explores action space. ar Xiv preprint ar Xiv:2303.04386, 2023. 9, 10 [29] Y. Li, G. Lan, and T. Zhao. 
Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity. Mathematical Programming, pages 1 57, 2023. 1, 9, 18 [30] J. Mei, B. Dai, A. Agarwal, M. Ghavamzadeh, C. Szepesv ari, and D. Schuurmans. Orderingbased conditions for global convergence of policy gradient methods. Advances in Neural Information Processing Systems, 36, 2024. 27 [31] T. Morimura, K. Ota, K. Abe, and P. Zhang. Policy gradient algorithms with monte-carlo tree search for non-markov decision processes, 2022. 10 [32] R. Munos. Error bounds for approximate policy iteration. In Internation Conference on Machine Learning, volume 3, pages 560 567. Citeseer, 2003. 1 [33] B. O Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization, 2023. 10 [34] I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. Mc Kinney, T. Lattimore, C. Szepesv ari, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. van Hasselt. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2020. 2, 8, 28, 30, 31 [35] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730 27744, 2022. 1 [36] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. 3 [37] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970. 3 [38] A. Rosenberg, A. Hallak, S. Mannor, G. Chechik, and G. Dalal. Planning and learning with adaptive lookahead. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8):9606 9613, Jun. 2023. 10 [39] B. Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Advances in Neural Information Processing Systems, 26, 2013. 1 [40] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. P. Lillicrap, and D. Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588:604 609, 2019. 2, 5, 8, 10, 31 [41] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889 1897. PMLR, 2015. 1, 9 [42] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. 1, 9 [43] D. Shah, Q. Xie, and Z. Xu. Non-asymptotic analysis of monte carlo tree search, 2020. 10 [44] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. 2 [45] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140 1144, 2018. 2, 5 [46] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354 359, 2017. 2, 5 [47] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018. 1 [48] C. Szepesv ari. CMPUT 653: Theoretical Foundations of Reinforcement Learning. Lecture 6. 
Online planning - Part II., 2023. 21, 23, 26 [49] M. Tomar, Y. Efroni, and M. Ghavamzadeh. Multi-step greedy reinforcement learning algorithms. In International Conference on Machine Learning, pages 9504 9513. PMLR, 2020. 2, 10 [50] M. Tomar, L. Shani, Y. Efroni, and M. Ghavamzadeh. Mirror descent policy optimization. In International Conference on Learning Representations, 2022. 1 [51] A. Winnicki, J. Lubars, M. Livesay, and R. Srikant. The role of lookahead and approximate policy evaluation in reinforcement learning with linear value function approximation. ar Xiv preprint ar Xiv:2109.13419, 2021. 2 [52] A. Winnicki and R. Srikant. On the convergence of policy iteration-based reinforcement learning with monte carlo policy evaluation. In International Conference on Artificial Intelligence and Statistics, pages 9852 9878. PMLR, 2023. 2, 10 [53] L. Xiao. On the convergence rates of policy gradient methods. The Journal of Machine Learning Research, 23(1):12887 12922, 2022. 1, 3, 4, 5, 7, 9, 15, 17, 18 [54] R. Yuan, S. S. Du, R. M. Gower, A. Lazaric, and L. Xiao. Linear convergence of natural policy gradient methods with log-linear policies. In The Eleventh International Conference on Learning Representations, 2023. 8, 9 1 Introduction 1 2 Preliminaries 2 3 A Refresher on PMD and PI with Lookahead 3 3.1 PMD and its Connection to PI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Policy Iteration with Lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 Policy Mirror Descent with Lookahead 4 4.1 h-PMD: Using h-Step Greedy Updates . . . . . . . . . . . . . . . . . . . . . . . . 4 4.2 Convergence Analysis for Exact h-PMD . . . . . . . . . . . . . . . . . . . . . . . 5 5 Inexact and Stochastic h-Policy Mirror Descent 6 5.1 Monte Carlo h-Greedy Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . 6 5.2 Analysis of Inexact h-PMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 6 Extension to Function Approximation 7 6.1 Inexact h-PMD with Function Approximation . . . . . . . . . . . . . . . . . . . . 7 6.2 Convergence Analysis of h-PMD with Linear Function Approximation . . . . . . . 8 7 Simulations 8 8 Related Work 9 9 Conclusion 10 A Additional Details and Proofs for Section 4: Exact h-PMD 15 A.1 Auxiliary Lemma for the Proof of Theorem 4.1: Exact h-PMD . . . . . . . . . . . 15 A.2 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.3 Additional Comments About the Stepsizes in Theorem 4.1 . . . . . . . . . . . . . 18 B Additional Details and Proofs for Section 5: Inexact h-PMD 19 B.1 Details and Precise Notation for Lookahead Computation . . . . . . . . . . . . . . 19 B.2 Proof of Theorem 5.3: Inexact h-PMD . . . . . . . . . . . . . . . . . . . . . . . . 19 B.3 Proof of Theorem 5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C Additional Details and Proofs for Section 6: h-PMD with Function Approximation 24 C.1 Inexact h-PMD with Linear Function Approximation . . . . . . . . . . . . . . . . 25 C.2 Discussion of Assumption 6.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 C.3 Convergence Analysis of Inexact h-PMD using Function Approximation . . . . . . 27 D Further Simulations 28 D.1 Simulations for h-PMD with Euclidean Regularization . . . . . . . . . . . . . . . 28 D.2 Sample Complexity Comparison for Different Lookahead Depths . . . . . . . . . . 28 D.3 Total Running Time Comparison for Different Lookahead Depths . . . . . . . . . 
28 D.4 About the Stepsizes in Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . 29 D.5 Larger lookahead depths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 D.6 Experiments with MCTS-based Lookahead Estimation . . . . . . . . . . . . . . . 30 D.7 Continuous Control Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 D.8 Additional experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 A Additional Details and Proofs for Section 4: Exact h-PMD Lemma A.1. For any policies π, π Π and every state s S, we have Qπ s , πs = T πV π (s) . As a consequence, the update rules (PMD) and (1) in the main part are equivalent. Proof. Let π, π Π. For every s S, we have Qπ s , πs = X a A Qπ (s, a) π(a|s) (by definition of the scalar product) r(s, a) + γ X s S P(s |s, a)V π (s ) π(a|s) (Qπ as a function of V π ) = rπ(s) + γ X a A P(s |s, a)π(a|s) (by rearranging and using the definition of rπ) = rπ(s) + γP πV π (s) (using the definition of P π) = T πV π (s) . (8) The second part of the statement holds by setting π = πk . A.1 Auxiliary Lemma for the Proof of Theorem 4.1: Exact h-PMD Lemma A.2 (Triple Point Descent Lemma for h-PMD). Let h N \ {0} . Initialized at π0 ri(Π), the iterates of h-step greedy PMD remain in ri(Π) and satisfy the following inequality for any k N and any policy π Π, ηk T πT h 1V πk ηk T πk+1T h 1V πk + Dϕ(π, πk) Dϕ(π, πk+1) Dϕ(πk+1, πk) Proof. Recall the definition of the h-step greedy action value function denoted by Qπ h : S A R in (2). Then, recall that for each state s S and every k N, we have Qπk h (s, ), πs = (T πT h 1V πk)(s) . (9) Applying the three point descent lemma (see Lemma 6 in [53] with ϕ(π) = Qπk h (s, ), π for any π Π and C = (A) with the notation therein), we obtain for any policy π Π, ηk Qπk d (s, ), πk+1 s + Dϕ(πk+1 s , πk s ) ηk Qπk d (s, ), πs + Dϕ(πs, πk s ) Dϕ(πs, πk+1 s ) . We obtain the desired result by rearranging the above last inequality and using (9). Lemma A.3. Let h N \ {0} . Initialized at π0 ri(Π), the iterates of h-step greedy PMD satisfy for every k N, T πk+1T h 1V πk V πk+1 + γ 1 γ cke , where (ck) is the sequence of positive integers introduced in Theorem 4.1 such that dk ηk ck and dk = minπ Gh(V πk ) Dϕ(π, πk) . Proof. Let k N . Recall that any policy π Gh(V πk) satisfies T πT h 1V πk = T h V πk . Using Lemma A.2 for π Gh(V πk) together with the previous identity and the nonnegativity of the Bregman divergence implies that for any for any policy π Gh(V πk), T h V πk T πk+1T h 1V πk + dk ηk e T πk+1T h 1V πk + cke , (10) where the last inequality follows from the assumption on the stepsizes ηk . We now prove by recurrence that for every n N, T πk+1T h 1V πk (T πk+1)n+1T h 1V πk + i=1 γicke , (11) with Pn i=1 γi = 0 for n = 0 . The base case for n = 0 is immediate. We now suppose that the result holds for n and we show that it remains true for n + 1 . We have T πk+1T h 1V πk = T πk+1T h 1T πk V πk (T πk V πk = V πk) T πk+1T h V πk (T πk V T V for any V RS and monotonicity of T πk+1) T πk+1(T πk+1T h 1V πk + cke) (using (10) and monotonocity of T πk+1) = (T πk+1)2T h 1V πk + γcke (T π(V + ce) = T πV + γc e for any V RS, c R) (T πk+1)n+1T h 1V πk + (using recurrence hypothesis and monotonicity of T πk+1) = (T πk+1)n+2T h 1V πk + i=1 γi+1cke + γcke = (T πk+1)n+2T h 1V πk + i=1 γicke , (12) which concludes the recurrence proof of (11) for every n N . 
To conclude, we take the limit in (11) and use the fact that for any V RS, the sequence ((T πk+1)n V )n converges to the fixed point V πk+1 of the contractive operator T πk+1 . Hence, we obtain T πk+1T h 1V πk V πk+1 + i=1 γicke = V πk+1 + γ 1 γ cke . (13) This concludes the proof. Remark A.4. We note that although in our work the stepsise ηk is chosen to be the same for all states for simplicity, the same result holds for a stepsize that is different for each state as long as ηk(s) 1 ck Dϕ(πs, πk s ) for each s S. The same is true for all results depending on this lemma (or the inexact variant). A.2 Proof of Theorem 4.1 As a warmup, we start by reporting a convergence result for the standard PMD algorithm, recently established in [20]. Using the contraction properties of the Bellman operators, it can be shown that the suboptimality gap V πk V for PI iterates converges to zero at a linear rate with a contraction factor γ, regardless of the underlying MDP. This instance-independent convergence rate was generalized to PMD in [20]. Before stating this result, we report a useful three-point descent lemma to quantify the improvement of the PMD policy resulting from the proximal step compared to an arbitrary policy. Lemma A.5 (Three Point Descent Lemma). Using the PMD update rule, we have for every k N and every π Π, ηk T πk+1V πk ηk T πV πk Dϕ(πk+1, πk) + Dϕ(π, πk) Dϕ(π, πk+1) . (14) This lemma results from an application of Lemma 6 in [53] which is a variant of Lemma 3.2 in [10]. This same result was also used in [20]. Theorem A.6 (Exact PMD, [20]). Let (ck) be a sequence of positive reals and let the stepsize ηk in (PMD) satisfy ηk 1 ck minπ G(V πk ) Dϕ(π, πk) where the minimum is computed componentwise. Initialized at π0 ri(Π), the iterates (πk) of (PMD) satisfy for every k N, t=1 γ tct 1 We provide here a complete proof of this result as a warmup for our upcoming main result. Although equivalent, notice that our proof is more compact compared to the one provided in [20] (see section 6 and Lemma A.1 to A.3 therein) thanks to the use of our convenient notations, and further highlights the relationship between PMD and PI through the explicit use of Bellman operators. Proof. Using Lemma A.5 with a policy π G(V πk) for which T πV πk = T V πk, we have ηk T V πk ηk T πk+1V πk Dϕ(πk+1, πk) + Dϕ(π, πk) Dϕ(π, πk+1) . (15) Setting dk := minπ G(V πk ) Dϕ(π, πk) and recalling that Dϕ(π, πk) 0 for any policy π Π, we obtain T V πk T πk+1V πk + dk ηk e . (16) Notice that for Policy Iteration we have T V πk = T πk+1V πk i.e. πk+1 G(V πk). Instead, for PMD we have that πk+1 is approximately greedy (see (16)). Applying Lemma A.5 again with π = πk, recalling that T πk V πk = V πk and that Dϕ(π, π ) 0 for any π, π Π, we immediately obtain V πk T πk+1V πk . Iterating this relation and using the monotonicity of the Bellman operator, we obtain V πk T πk+1V πk (T πk+1)2V πk (T πk+1)n V πk V πk+1 where the last inequality follows from the fact that T πk+1 has a unique fixed point V πk+1 . We are now ready to control the suboptimality gap as follows: V V πk+1 V T πk+1V πk = V T V πk + T V πk T πk+1V πk V T V πk + dk V T V πk + cke , where the first inequality follows from the inequality V πk+1 T πk+1V πk proved above, the second one stems from (16), and the last one is a consequence of our assumption dk/ηk ck . We conclude the proof by taking the max norm, applying the triangle inequality and using the contractiveness of the Bellman optimality operator T . Proof of Theorem 4.1 Proof. 
Similarly to Lemma A.5, we show that for any $\pi \in \mathcal{G}_h(V^{\pi_k})$,
$$
\eta_k\, T^{\pi} T^{h-1} V^{\pi_k} \le \eta_k\, T^{\pi_{k+1}} T^{h-1} V^{\pi_k} - D_{\phi}(\pi_{k+1}, \pi_k) + D_{\phi}(\pi, \pi_k) - D_{\phi}(\pi, \pi_{k+1})\,, \qquad (17)
$$
see Lemma A.2 for a proof. Setting $d_k := \min_{\pi \in \mathcal{G}_h(V^{\pi_k})} D_{\phi}(\pi, \pi_k)$ (note the difference with $d_k$ in the proof of Theorem A.6), and recalling that $\pi \in \mathcal{G}_h(V^{\pi_k})$ implies that $\pi \in \mathcal{G}(T^{h-1} V^{\pi_k})$, we obtain from (17) an inequality similar to (16),
$$
T^h V^{\pi_k} \le T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + \frac{d_k}{\eta_k}\, e\,. \qquad (18)
$$
Then, we observe that
$$
T^{\pi_{k+1}} T^{h-1} V^{\pi_k} = T^{\pi_{k+1}} T^{h-1} T^{\pi_k} V^{\pi_k} \le T^{\pi_{k+1}} T^{h} V^{\pi_k} \le T^{\pi_{k+1}}\Big(T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + \frac{d_k}{\eta_k} e\Big) = (T^{\pi_{k+1}})^2 T^{h-1} V^{\pi_k} + \gamma \frac{d_k}{\eta_k}\, e \le (T^{\pi_{k+1}})^2 T^{h-1} V^{\pi_k} + \gamma\, c_k\, e\,, \qquad (19)
$$
where the second inequality follows from (18) and the monotonicity of $T^{\pi_{k+1}}$, and the last inequality uses the assumption on the stepsizes. Then we show by recursion, iterating (19), taking the limit and using the fact that $T^{\pi_{k+1}}$ has a unique fixed point $V^{\pi_{k+1}}$, that
$$
T^{\pi_{k+1}} T^{h-1} V^{\pi_k} \le V^{\pi_{k+1}} + \frac{\gamma}{1-\gamma}\, c_k\, e\,, \qquad (20)
$$
see Lemma A.3 for a complete proof. We now control the suboptimality gap as follows,
$$
V^{\star} - V^{\pi_{k+1}} \le V^{\star} - T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + \frac{\gamma}{1-\gamma}\, c_k\, e = V^{\star} - T^h V^{\pi_k} + \frac{\gamma}{1-\gamma}\, c_k\, e + T^h V^{\pi_k} - T^{\pi_{k+1}} T^{h-1} V^{\pi_k} \le V^{\star} - T^h V^{\pi_k} + \frac{1}{1-\gamma}\, c_k\, e\,, \qquad (21)
$$
where the first inequality stems from (20) and the last one from reusing (18). Similarly to Theorem A.6, we conclude the proof by taking the max norm, applying the triangle inequality and using the contractiveness of the Bellman optimality operator $h$ times to obtain
$$
\|V^{\star} - V^{\pi_{k+1}}\|_\infty \le \gamma^h\, \|V^{\star} - V^{\pi_k}\|_\infty + \frac{c_k}{1-\gamma}\,.
$$
A recursion gives the desired result.

A.3 Additional Comments About the Stepsizes in Theorem 4.1

About the cost of computing adaptive stepsizes. There is typically only one greedy policy at each iteration, and in the case that there is more than one, it is possible to pick any greedy policy and the lower bound condition on the stepsize will still be satisfied. The cost of computing the stepsize is then essentially the cost of computing a Bregman divergence between two policies. We elaborate on this in the following, using the discussion in [20] (p. 7, "Computing the step size" paragraph), which also applies to our setting. We use the stepsize condition
$$
\eta_k \ge \frac{1}{c_k}\, \max_{s \in \mathcal{S}}\, \min_{\pi \in \mathcal{G}_h(V^{\pi_k})} D_{\phi}(\pi_s, \pi^k_s)\,.
$$
The minimum in the condition means we can just pick a single greedy policy $\tilde\pi \in \mathcal{G}_h(V^{\pi_k})$ (which can easily be computed given a lookahead value function in the exact setting, or its estimate in the inexact setting). As for the maximum over the state space, it appears because we use the same stepsize in all states. This condition can readily be replaced by a state-dependent stepsize $\eta_k(s) \ge \frac{1}{c_k} D_{\phi}(\tilde\pi(\cdot|s), \pi_k(\cdot|s))$, which is enough for our results to hold.

About the condition on the stepsizes. First, notice that PMD has been analyzed with an increasing stepsize in the exact setting (see e.g. [25, 53, 29]), and this is natural as it corresponds to emulating Policy Iteration. In the present work, we do not require the stepsize to go to infinity, and the stepsize does not typically go to infinity under our condition. Recall that we only require the stepsize to satisfy $\eta_k \ge \frac{1}{c_k} \max_{s\in\mathcal{S}} \min_{\pi \in \mathcal{G}_h(V^{\pi_k})} D_{\phi}(\pi_s, \pi^k_s)$. As the policy gets closer to the optimal one, it also gets closer to a greedy policy (the optimal policy is greedy with respect to its own value function), so the Bregman divergence term vanishes. This effect balances the decreasing $c_k$. Notice that we typically choose $c_k = \gamma^{2h(k+1)}$. We refer the reader to Section D.4 for additional observations regarding stepsizes in experiments.
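To make the exact h-PMD iteration and this adaptive stepsize rule concrete, here is a minimal tabular sketch, assuming access to the true transition kernel `P` and reward `r` (hypothetical NumPy arrays of shapes `(S, A, S)` and `(S, A)`), a KL (negative entropy) mirror map, and the choice $c_k = \gamma^{2h(k+1)}$ discussed above. It is an illustration of the update structure under these assumptions, not the authors' reference implementation.

```python
import numpy as np

def h_pmd_exact(P, r, gamma, h, iters):
    """Sketch of exact h-PMD with a KL mirror map on a tabular MDP.

    P: (S, A, S) transition kernel, r: (S, A) rewards in [0, 1].
    """
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)            # pi_0 in the relative interior of the simplex

    def policy_value(pi):
        # Exact policy evaluation: V^pi = (I - gamma * P^pi)^{-1} r^pi
        P_pi = np.einsum("sap,sa->sp", P, pi)
        r_pi = np.einsum("sa,sa->s", r, pi)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

    def bellman_opt(V):
        # (T V)(s) = max_a r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        return np.max(r + gamma * P @ V, axis=1)

    for k in range(iters):
        V = policy_value(pi)
        W = V.copy()
        for _ in range(h - 1):                # W = T^{h-1} V^{pi_k}
            W = bellman_opt(W)
        Q_h = r + gamma * P @ W               # Q_h^{pi_k} = r + gamma * P * T^{h-1} V^{pi_k}
        greedy_a = Q_h.argmax(axis=1)         # an h-step greedy policy (deterministic)

        # Adaptive stepsize: eta_k >= (1/c_k) * max_s KL(greedy_s || pi_k(.|s)), c_k = gamma^{2h(k+1)}
        c_k = gamma ** (2 * h * (k + 1))
        d_k = np.max(-np.log(pi[np.arange(S), greedy_a] + 1e-12))
        eta_k = d_k / c_k + 1e-12

        # KL-PMD proximal step in closed form: pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta_k * Q_h(s, a))
        logits = np.log(pi + 1e-12) + eta_k * Q_h
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```

For the KL mirror map the proximal step has the multiplicative-weights closed form used above; other mirror maps (e.g. the Euclidean one used in Appendix D.1) would replace that step by the corresponding Bregman projection.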
B Additional Details and Proofs for Section 5: Inexact h-PMD

B.1 Details and Precise Notation for Lookahead Computation

We describe here our approach to estimate $Q^{\pi}_h$ in more detail and introduce the notation necessary for the analysis in the following sections. Assume that we are estimating $Q^{\pi}_h(s,a)$ at a given fixed state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}$. We define a sequence $(S_k)$ of sets of visited states as follows. Set $S_h = \{s\}$ and define, for every $k \in [0, h-1]$, for every $\bar s \in S_{k+1}$ and for every $a \in \mathcal{A}$, the multiset $S_k(\bar s, a) := \{s'_j \sim P(\cdot|\bar s, a) : j \in [M_{k+1}]\}$ (including duplicated states), where $M_{k+1}$ is the number of states sampled from each state $\bar s \in S_{k+1}$, and $S_k = \bigcup_{\bar s \in S_{k+1},\, a \in \mathcal{A}} S_k(\bar s, a) \cup S_{k+1}$ (without duplicated states).

The estimation procedure is recursive. First we define the value function $\hat V_1$ by $\hat V_1(\bar s) := \hat V^{\pi}(\bar s)$ for $\bar s \in S_0$, where $\hat V^{\pi}(\bar s)$ is an estimate of $V^{\pi}(\bar s)$ computed using $M_0$ rollouts of the policy $\pi$ from state $\bar s$. More precisely, we sample $M_0$ trajectories $(\tau_j(\bar s))_{j \in [M_0]}$ of length $H$ rolling out policy $\pi$, i.e. $\tau_j(\bar s) := \{(s^j_k, a^j_k, r(s^j_k, a^j_k))\}_{0 \le k \le H-1}$ with $s^j_0 = \bar s$ and $a^j_k \sim \pi(\cdot|s^j_k)$. Then the Monte Carlo estimator is computed as $\hat V^{\pi}(\bar s) = \frac{1}{M_0}\sum_{j=1}^{M_0}\sum_{k=0}^{H-1} \gamma^k r(s^j_k, a^j_k)$. For each stage $k \in [1, h]$ (for $h \ge 1$), we define the following estimates for every $\bar s \in S_k$, $a \in \mathcal{A}$,
$$
\hat Q_k(\bar s, a) := r(\bar s, a) + \frac{\gamma}{M_k} \sum_{s' \in S_{k-1}(\bar s, a)} \hat V_k(s')\,, \qquad \hat V_{k+1}(\bar s) := \max_{a \in \mathcal{A}} \hat Q_k(\bar s, a)\,. \qquad (22)
$$
The desired estimate $\hat Q^{\pi}_h(s,a)$ is obtained with $k = h$.

The recursive procedure described above can be represented as a (partial, incomplete) tree where the root node is the state-action pair $(s,a)$ and each sampled action leads to a new state node $s'$ followed by a new state-action node $(s', a')$. To trigger the recursion, we assign the estimated values $\hat V^{\pi}(\bar s)$ of the true value function $V^{\pi}(\bar s)$ to each of the leaf state nodes $\bar s$ of this tree, i.e. the deepest state level of the tree, which corresponds to the last states visited when building the tree. The desired estimate $\hat Q^{\pi}_h(s,a)$ is given by the estimate obtained at the root state-action node $(s,a)$. Recalling that $Q^{\pi}_h(s,a) := r(s,a) + \gamma (P V^{\pi}_h)(s,a)$ for any $s \in \mathcal{S}$, $a \in \mathcal{A}$, where $V^{\pi}_h := T^{h-1} V^{\pi}$, the overall estimation procedure consists in using Monte Carlo rollouts to estimate the value function $V^{\pi}$, then approximating $V^{\pi}_h = T^{h-1} V^{\pi}$ using approximate Bellman operator steps, and finally computing the estimate $\hat Q^{\pi}_h(s,a)$ as a sampled version of $Q^{\pi}_h(s,a) = r(s,a) + \gamma (P V^{\pi}_h)(s,a)$.

B.2 Proof of Theorem 5.3: Inexact h-PMD

Throughout this section, we suppose that the assumptions of Theorem 5.3 hold.

Lemma B.1. Let $\pi \in \Pi$, $k \in \mathbb{N}$. Suppose that the estimator $\hat Q^{\pi_k}_h$ is such that $\|\hat Q^{\pi_k}_h - Q^{\pi_k}_h\|_\infty \le b$ for some positive scalar $b$. Then the approximate h-step lookahead value function $\hat V^{\pi_k}_h(\pi) := M^{\pi} \hat Q^{\pi_k}_h$ is a $b$-approximation of $T^{\pi} T^{h-1} V^{\pi_k}$, i.e.,
$$
\|\hat V^{\pi_k}_h(\pi) - T^{\pi} T^{h-1} V^{\pi_k}\|_\infty \le b\,. \qquad (23)
$$

Proof. The Hölder inequality implies that for any policy $\pi \in \Pi$:
$$
\|\hat V^{\pi_k}_h(\pi) - T^{\pi} T^{h-1} V^{\pi_k}\|_\infty = \max_{s\in\mathcal{S}} \big|\langle \hat Q^{\pi_k}_h(s,\cdot) - Q^{\pi_k}_h(s,\cdot), \pi_s\rangle\big| \le \max_{s\in\mathcal{S}} \|\hat Q^{\pi_k}_h(s,\cdot) - Q^{\pi_k}_h(s,\cdot)\|_\infty \|\pi_s\|_1 \le b\,.
$$

Lemma B.2 (Approximate h-Step Greedy Three-Point Descent Lemma). For any policy $\pi \in \Pi$, the iterates $\pi_k$ of approximate h-PMD satisfy the following inequality for every $k \ge 0$:
$$
\eta_k\, T^{\pi} T^{h-1} V^{\pi_k} \le \eta_k\, T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + D_{\phi}(\pi, \pi_k) - D_{\phi}(\pi, \pi_{k+1}) - D_{\phi}(\pi_{k+1}, \pi_k) + 2\eta_k b\, e\,. \qquad (24)
$$

Proof. We apply again the Three-Point Descent Lemma (see e.g.
Lemma A.1 in [20]) to obtain:
$$
-\eta_k \langle \hat Q^{\pi_k}_h(s,\cdot), \pi^{k+1}_s\rangle + D_{\phi}(\pi^{k+1}_s, \pi^k_s) \le -\eta_k \langle \hat Q^{\pi_k}_h(s,\cdot), \pi_s\rangle + D_{\phi}(\pi_s, \pi^k_s) - D_{\phi}(\pi_s, \pi^{k+1}_s)\,, \qquad (25)
$$
which, by rearranging and expressing in vector form, gives:
$$
\eta_k\, \hat V^{\pi_k}_h(\pi) \le \eta_k\, \hat V^{\pi_k}_h(\pi_{k+1}) + D_{\phi}(\pi, \pi_k) - D_{\phi}(\pi, \pi_{k+1}) - D_{\phi}(\pi_{k+1}, \pi_k)\,. \qquad (26)
$$
Lemma B.1 implies that $\hat V^{\pi_k}_h(\pi) \ge T^{\pi} T^{h-1} V^{\pi_k} - b e$ and $\hat V^{\pi_k}_h(\pi_{k+1}) \le T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + b e$, yielding the lemma.

The following lemma is an analogue of (18) in the proof of Theorem 4.1 above, modified for the inexact case.

Lemma B.3. Assuming access at each iteration to an approximate $Q_h$ value function $\hat Q^{\pi_k}_h$ with $\|\hat Q^{\pi_k}_h - Q^{\pi_k}_h\|_\infty \le b$, the iterates of inexact h-PMD satisfy for every $k \ge 0$ (with $\tilde d_k$ as defined in Theorem 5.3 and Lemma B.4 below):
$$
T^h V^{\pi_k} \le T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + 2 b\, e + \frac{\tilde d_k}{\eta_k}\, e\,. \qquad (27)
$$

Proof. Let $\pi$ be any true h-step greedy policy, $\pi \in \mathcal{G}_h(V^{\pi_k})$, and let $\tilde\pi$ be an approximate h-step greedy policy, $\tilde\pi \in \arg\max_{\pi'\in\Pi} \hat V^{\pi_k}_h(\pi')$, attaining $\tilde d_k$. By definition of $\tilde\pi$ and equation (26), we have:
$$
\hat V^{\pi_k}_h(\pi) \le \hat V^{\pi_k}_h(\tilde\pi) \le \hat V^{\pi_k}_h(\pi_{k+1}) + \frac{1}{\eta_k} D_{\phi}(\tilde\pi, \pi_k) - \frac{1}{\eta_k} D_{\phi}(\tilde\pi, \pi_{k+1}) - \frac{1}{\eta_k} D_{\phi}(\pi_{k+1}, \pi_k) \qquad (28)
$$
$$
\le \hat V^{\pi_k}_h(\pi_{k+1}) + \frac{\tilde d_k}{\eta_k}\, e\,. \qquad (29)
$$
Finally, we obtain the desired result using Lemma B.1, recalling from the end of the proof of Lemma B.2 that $\hat V^{\pi_k}_h(\pi) \ge T^{\pi} T^{h-1} V^{\pi_k} - b e = T^h V^{\pi_k} - b e$ and $\hat V^{\pi_k}_h(\pi_{k+1}) \le T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + b e$.

Lemma B.4. Given $\hat Q^{\pi_k}_h$ with $\|\hat Q^{\pi_k}_h - Q^{\pi_k}_h\|_\infty \le b$, let $\tilde{\mathcal{G}}_h = \arg\max_{\pi\in\Pi} \hat V^{\pi_k}_h(\pi)$ denote the set of approximate h-step greedy policies and $\tilde d_k = \min_{\tilde\pi\in\tilde{\mathcal{G}}_h} D_{\phi}(\tilde\pi, \pi_k)$. Then the iterates of inexact h-PMD satisfy the following bound:
$$
\eta_k\, T^{\pi_{k+1}} T^{h-1} V^{\pi_k} \le \eta_k\, V^{\pi_{k+1}} + 2\eta_k \frac{\gamma}{1-\gamma}\, b\, e + \frac{\gamma}{1-\gamma}\, \tilde d_k\, e\,. \qquad (30)
$$

Proof. Analogously to Lemma A.3, we proceed by induction using Lemma B.3. We write, for every $k \ge 0$,
$$
\begin{aligned}
\eta_k\, T^{\pi_{k+1}} T^{h-1} V^{\pi_k}
&\overset{(a)}{\le} \eta_k\, T^{\pi_{k+1}} T^{h} V^{\pi_k}\\
&\overset{(b)}{\le} \eta_k (T^{\pi_{k+1}})^2 T^{h-1} V^{\pi_k} + 2\gamma \eta_k b\, e + \gamma \tilde d_k\, e\\
&\le \eta_k (T^{\pi_{k+1}})^2 T^{h} V^{\pi_k} + 2\gamma \eta_k b\, e + \gamma \tilde d_k\, e\\
&\le \eta_k (T^{\pi_{k+1}})^3 T^{h-1} V^{\pi_k} + 2\gamma^2 \eta_k b\, e + \gamma^2 \tilde d_k\, e + 2\gamma \eta_k b\, e + \gamma \tilde d_k\, e\\
&\;\;\vdots\\
&\le \eta_k\, V^{\pi_{k+1}} + 2\eta_k \frac{\gamma}{1-\gamma}\, b\, e + \frac{\gamma}{1-\gamma}\, \tilde d_k\, e\,, \qquad (31)
\end{aligned}
$$
where (a) follows from the monotonicity of both Bellman operators $T^{\pi_{k+1}}, T$ and the fact that $V^{\pi_k} \le T V^{\pi_k}$, and (b) uses Lemma B.3 and the monotonicity of the Bellman operators. The last inequality stems from the fact that for every $V \in \mathbb{R}^{\mathcal{S}}$, $\lim_{n\to+\infty} (T^{\pi_{k+1}})^n V = V^{\pi_{k+1}}$, since $V^{\pi_{k+1}}$ is the unique fixed point of the Bellman operator $T^{\pi_{k+1}}$, which is a $\gamma$-contraction.

End of the proof of Theorem 5.3. We are now ready to conclude the proof of the theorem. Letting $\tilde d_k = \min_{\tilde\pi\in\tilde{\mathcal{G}}_h} D_{\phi}(\tilde\pi, \pi_k)$ as in the theorem, we have:
$$
\begin{aligned}
V^{\star} - V^{\pi_{k+1}}
&\overset{(i)}{\le} V^{\star} - T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + \frac{\gamma}{1-\gamma} \frac{\tilde d_k}{\eta_k}\, e + \frac{2\gamma}{1-\gamma}\, b\, e\\
&= V^{\star} - T^h V^{\pi_k} + T^h V^{\pi_k} - T^{\pi_{k+1}} T^{h-1} V^{\pi_k} + \frac{\gamma}{1-\gamma} \frac{\tilde d_k}{\eta_k}\, e + \frac{2\gamma}{1-\gamma}\, b\, e\\
&\overset{(ii)}{\le} V^{\star} - T^h V^{\pi_k} + \frac{1}{1-\gamma} \frac{\tilde d_k}{\eta_k}\, e + \frac{2}{1-\gamma}\, b\, e\,,
\end{aligned}
$$
where (i) follows from Lemma B.4 and (ii) from Lemma B.3. By the triangle inequality and the contraction property of $T$, we obtain a recursion similar to the previous theorems:
$$
\|V^{\star} - V^{\pi_{k+1}}\|_\infty \le \gamma^h\, \|V^{\star} - V^{\pi_k}\|_\infty + \frac{1}{1-\gamma} \frac{\tilde d_k}{\eta_k} + \frac{2}{1-\gamma}\, b\,.
$$
Unfolding this recursive inequality concludes the proof.

B.3 Proof of Theorem 5.4

We first recall the notation used in our lookahead value function estimation procedure before providing the proof. The procedure we use is a standard online planning method and our proof follows similar steps to the proof provided, for example, in the lecture notes [48]. Notice though that we are not approximating the optimal Q-function but rather the lookahead value function $Q^{\pi}_h$ induced by a given policy $\pi \in \Pi$ at any state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}$. Recall that for any $k \in [2, h]$, $\hat V_k$ and $\hat Q_k$ are defined as follows:
$$
\hat Q_k(\bar s, a) = r(\bar s, a) + \frac{\gamma}{M_k} \sum_{s' \in S_{k-1}(\bar s, a)} \hat V_k(s')\,, \quad \bar s \in S_k\,, \qquad (32)
$$
$$
\hat V_k(\bar s) = \max_{a\in\mathcal{A}} \hat Q_{k-1}(\bar s, a)\,, \quad \bar s \in S_{k-1}\,. \qquad (33)
$$
At the bottom of the recursion, recall that $\hat V_1$ is defined for every $\bar s \in S_0$ by
$$
\hat V_1(\bar s) = \hat V^{\pi}(\bar s) = \frac{1}{M_0}\sum_{j=1}^{M_0}\sum_{k=0}^{H-1} \gamma^k r(s^j_k, a^j_k)\,,
$$
using $M_0$ trajectories $(\tau_j(\bar s))_{j\in[M_0]}$ of length $H$ rolling out policy $\pi$, i.e. $\tau_j(\bar s) := \{(s^j_k, a^j_k, r(s^j_k, a^j_k))\}_{0\le k\le H-1}$ with $s^j_0 = \bar s$ and $a^j_k \sim \pi(\cdot|s^j_k)$. Note that the cardinality of each set $S_k$ is evidently upper bounded by $S$, but also by $A^{h-k}\prod_{i\in[k+1,h]} M_i$, as the number of states in each set grows by a factor of at most $A M_k$ at step $k$.

In this section we prove some concentration bounds on the above estimators, which will be useful for the rest of the analysis.

Lemma B.5 (Concentration Bounds of Lookahead Value Function Estimators). Let $\delta \in (0,1)$. For any state $\bar s \in S_0$, with probability $1-\delta$ we have:
$$
|\hat V_1(\bar s) - V^{\pi}(\bar s)| \le C^V_1(\delta) := \frac{\gamma^H}{1-\gamma} + \frac{1}{1-\gamma}\sqrt{\frac{\log(2/\delta)}{2 M_0}}\,.
$$
For any $k \in [1,h]$, for any state $\bar s \in S_k$ and for any action $a \in \mathcal{A}$, with probability $1-\delta$ we have:
$$
|\hat Q_k(\bar s, a) - (J \hat V_k)(\bar s, a)| \le C^Q_k(\delta) := \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2/\delta)}{2 M_k}}\,.
$$
Finally, for any $k \in [1, h-1]$ and any $\bar s \in S_k$, also with probability $1-\delta$ we have:
$$
|\hat V_{k+1}(\bar s) - (T \hat V_k)(\bar s)| \le C^V_{k+1}(\delta) := \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2A/\delta)}{2 M_k}}\,.
$$

Proof. The proof of this result is standard and relies on Hoeffding's inequality. We provide it for completeness. Let $\Pr$ be the probability measure induced by the interaction of the planner and the MDP simulator on an adequate probability space. We denote by $\mathbb{E}$ the corresponding expectation.

First we show the bound for the value function estimate $\hat V_1$. This estimate is biased due to the truncation of the rollouts at a fixed horizon $H$, which is the source of the first error term in the bound. For any state $\bar s \in S_0$, the bias of $\hat V_1(\bar s)$ is bounded as
$$
\big| \mathbb{E}\big[\hat V_1(\bar s)\big] - V^{\pi}(\bar s) \big| = \Big| \mathbb{E}\Big[ \sum_{t=H}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = \bar s,\ a_t \sim \pi(\cdot|s_t) \Big] \Big| \le \frac{\gamma^H}{1-\gamma}\,.
$$
The second term is due to Hoeffding's inequality: the trajectories are i.i.d. and the estimator is bounded with probability 1 by $\frac{1}{1-\gamma}$. Therefore $\hat V_1(\bar s)$ concentrates around its mean, and with probability $1-\delta$ we have:
$$
\big| \hat V_1(\bar s) - \mathbb{E}\big[\hat V_1(\bar s)\big] \big| \le \frac{1}{1-\gamma}\sqrt{\frac{\log(2/\delta)}{2 M_0}}\,.
$$
Now we prove the bound for the $\hat Q_k$ function. Let $k \in [1,h]$ be arbitrary and $\bar s \in S_k$. For any action $a \in \mathcal{A}$, we can bound the error of $\hat Q_k$ as follows: with probability $1-\delta$,
$$
|\hat Q_k(\bar s, a) - (J\hat V_k)(\bar s, a)| = \gamma\, \Big| \frac{1}{M_k}\sum_{i=1}^{M_k} \hat V_k(s'_i) - (P\hat V_k)(\bar s, a) \Big| \le \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2/\delta)}{2 M_k}}\,,
$$
again by Hoeffding's inequality, since the sampled next states are i.i.d. and $\hat V_k$ is bounded by $\frac{1}{1-\gamma}$.

Finally, we bound the $\hat V_{k+1}$ function. Let $k \in [1, h-1]$ be arbitrary and $\bar s \in S_k$. With probability at least $1-\delta$ we have:
$$
|\hat V_{k+1}(\bar s) - (T\hat V_k)(\bar s)| \le \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2A/\delta)}{2 M_k}}\,.
$$
To show this last result, we use a standard union bound. For every $t \ge 0$, we have:
$$
\begin{aligned}
\Pr\big(|\hat V_{k+1}(\bar s) - (T\hat V_k)(\bar s)| \ge t\big)
&= \Pr\big(|\max_{a'\in\mathcal{A}}\hat Q_k(\bar s, a') - \max_{a'\in\mathcal{A}}(J\hat V_k)(\bar s, a')| \ge t\big)\\
&\le \Pr\big(\max_{a'\in\mathcal{A}}\hat Q_k(\bar s, a') - \max_{a'\in\mathcal{A}}(J\hat V_k)(\bar s, a') \ge t\big) + \Pr\big(\max_{a'\in\mathcal{A}}(J\hat V_k)(\bar s, a') - \max_{a'\in\mathcal{A}}\hat Q_k(\bar s, a') \ge t\big)\\
&\le \Pr\big(\max_{a'\in\mathcal{A}}\big[\hat Q_k(\bar s, a') - (J\hat V_k)(\bar s, a')\big] \ge t\big) + \Pr\big(\max_{a'\in\mathcal{A}}\big[(J\hat V_k)(\bar s, a') - \hat Q_k(\bar s, a')\big] \ge t\big)\\
&\le \sum_{a\in\mathcal{A}}\Pr\big(\hat Q_k(\bar s, a) - (J\hat V_k)(\bar s, a) \ge t\big) + \sum_{a\in\mathcal{A}}\Pr\big((J\hat V_k)(\bar s, a) - \hat Q_k(\bar s, a) \ge t\big)\\
&\le 2A\exp\Big(-\frac{2(1-\gamma)^2}{\gamma^2}\, t^2 M_k\Big)\,. \qquad (41)
\end{aligned}
$$

B.3.1 Estimation Error Bound for $\hat Q_h$

We start by defining some useful Bellman operators. The Bellman operator $J : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ and its sampled version $\hat J : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ are defined for every $V \in \mathbb{R}^{\mathcal{S}}$ and any $(s,a) \in \mathcal{S}\times\mathcal{A}$ by
$$
J V(s,a) := r(s,a) + \gamma (P V)(s,a)\,, \qquad (42)
$$
$$
\hat J V(s,a) := r(s,a) + \frac{\gamma}{m} \sum_{s' \in C(s,a)} V(s')\,, \qquad (43)
$$
where $C(s,a)$ is a set of $m$ states sampled i.i.d. from the distribution $P(\cdot|s,a)$.

Lemma B.6. The operators $J$ and $\hat J$ satisfy, for every $U, V \in \mathbb{R}^{\mathcal{S}}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$,
$$
|J U(s,a) - J V(s,a)| \le \gamma\, \|U - V\|_\infty\,, \qquad (44)
$$
$$
|\hat J U(s,a) - \hat J V(s,a)| \le \gamma \max_{s' \in C(s,a)} |U(s') - V(s')|\,. \qquad (45)
$$

Proof. The proof is immediate and is hence omitted.

We now show how to bound the error of $\hat Q_k$ for any $k \in [2,h]$. We can bound the total error on $\hat Q_k$ by decomposing it into parts, accounting individually for the contribution of each stage in the estimation procedure.
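Before turning to this decomposition, here is a minimal code sketch of the recursive lookahead estimator of Section B.1, i.e. the recursion (32)–(33) with rollout estimates at the leaves. The helpers `sample_next(s, a, m)` and `sample_rollout(s, policy, H)` are hypothetical generative-model interfaces (not part of the paper's code), the reward `r` is assumed to be a known function, the per-stage widths are taken equal to a constant `M`, and for clarity the sketch re-samples children independently rather than building and de-duplicating the sets $S_k$.

```python
import numpy as np

def estimate_q_h(s, a, policy, r, sample_next, sample_rollout, gamma, h, M, M0, H, n_actions):
    """Monte Carlo lookahead estimate of Q^pi_h(s, a) (sketch of Section B.1).

    sample_next(s, a, m): returns m next states drawn i.i.d. from P(.|s, a).
    sample_rollout(s, policy, H): returns the length-H reward sequence of a rollout
        of `policy` started at state s.
    r(s, a): reward function.
    """
    def v_hat_1(state):
        # Leaf estimate: truncated Monte Carlo rollouts of pi, estimating V^pi(state)
        returns = []
        for _ in range(M0):
            rewards = sample_rollout(state, policy, H)
            returns.append(sum(gamma ** t * rew for t, rew in enumerate(rewards)))
        return float(np.mean(returns))

    def v_hat(state, k):
        # \hat V_k(state): rollout estimate at the bottom, otherwise max_a \hat Q_{k-1}(state, a)
        if k == 1:
            return v_hat_1(state)
        return max(q_hat(state, b, k - 1) for b in range(n_actions))

    def q_hat(state, action, k):
        # \hat Q_k(state, action) = r(state, action) + (gamma / M) * sum over sampled children of \hat V_k
        children = sample_next(state, action, M)
        return r(state, action) + gamma * float(np.mean([v_hat(c, k) for c in children]))

    return q_hat(s, a, h)   # root estimate \hat Q^pi_h(s, a)
```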
We will make use of the operators $\hat J$ as defined above and we will index them by the sets $S_k$ ($k \in [1,h]$) for clarity. We use the notation $\|\cdot\|_X$ for the infinity norm over any finite set $X$. Recall that $Q^{\pi}_k = r + \gamma P T^{k-1} V^{\pi} = J T^{k-1} V^{\pi}$ for every $k \in [1,h]$. The following decomposition holds for any $k \in [2,h]$, using the aforementioned contraction properties:
$$
\begin{aligned}
\max_{a\in\mathcal{A}}\|\hat Q_k(\cdot,a) - Q^{\pi}_k(\cdot,a)\|_{S_k}
&= \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}}\hat V_k)(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}\\
&\le \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}}\hat V_k)(\cdot,a) - (\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a)\|_{S_k} + \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}\\
&\le \gamma\, \|\hat V_k - T^{k-1}V^{\pi}\|_{S_{k-1}} + \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}\\
&= \gamma\, \|M\hat Q_{k-1} - M J T^{k-2}V^{\pi}\|_{S_{k-1}} + \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}\\
&\le \gamma\, \max_{a\in\mathcal{A}}\|\hat Q_{k-1}(\cdot,a) - (J T^{k-2}V^{\pi})(\cdot,a)\|_{S_{k-1}} + \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}\\
&= \gamma\, \max_{a\in\mathcal{A}}\|\hat Q_{k-1}(\cdot,a) - Q^{\pi}_{k-1}(\cdot,a)\|_{S_{k-1}} + \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}\,, \qquad (46)
\end{aligned}
$$
which, after defining $\Delta_k = \max_{a\in\mathcal{A}}\|\hat Q_k(\cdot,a) - Q^{\pi}_k(\cdot,a)\|_{S_k}$ and $E_k = \max_{a\in\mathcal{A}}\|(\hat J_{S_{k-1}} T^{k-1}V^{\pi})(\cdot,a) - (J T^{k-1}V^{\pi})(\cdot,a)\|_{S_k}$, yields the following recursion:
$$
\Delta_k \le \gamma\, \Delta_{k-1} + E_k\,, \qquad (47)
$$
and unrolling this recursion leads to the following bound on $\Delta_h$:
$$
\Delta_h \le \gamma^{h-1}\Delta_1 + \sum_{i=2}^{h}\gamma^{h-i} E_i\,. \qquad (48)
$$
We begin by bounding $\Delta_1$ by partially applying the recursion above:
$$
\Delta_1 = \max_{a\in\mathcal{A}}\|\hat Q_1(\cdot,a) - J V^{\pi}(\cdot,a)\|_{S_1} = \max_{a\in\mathcal{A}}\|(\hat J_{S_0}\hat V_1)(\cdot,a) - J V^{\pi}(\cdot,a)\|_{S_1} \le \gamma\,\|\hat V_1 - V^{\pi}\|_{S_0}\,,
$$
which from Lemma B.5 (and a union bound over the states of $S_0$) we know to be bounded, with probability at least $1 - \delta|S_0|$, by:
$$
\Delta_1 \le \gamma\,\|\hat V_1 - V^{\pi}\|_{S_0} \le \frac{\gamma^{H+1}}{1-\gamma} + \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2|S_0|/\delta)}{2 M_0}}\,. \qquad (50)
$$
Finally, each term $E_j$ for $j \le k$ can be bounded using the lemma below. The proof relies on a union bound over the states reached during the construction of the search tree; see the lemma towards the end of [48] for a detailed explanation and a proof.

Lemma B.7. Let $n(j) = |S_j|$ for $j \in [k]$ and any $k \in [h]$, let $S^1$ be the root state of the tree search and, for each $1 < i \le n(j)$, let $S^i \in S_j$ be the state reached at the $(i-1)$-th simulator call, so that $S_j = \{S^i\}_{i=1}^{n(j)}$. For $0 < \delta \le 1$, with probability $1 - A\,n(j)\,\delta$, for any $1 \le i \le n(j)$ we have:
$$
\max_{a\in\mathcal{A}}\big|(\hat J_{S_{j-1}} T^{k-1}V^{\pi})(S^i, a) - (J T^{k-1}V^{\pi})(S^i, a)\big| \le \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2/\delta)}{2 M_j}}\,.
$$
Combining this result with (50) and the unrolled recursion gives the final concentration bound, with probability at least $1 - \delta_V - \delta_J$ for $\delta_V, \delta_J \in (0,1)$,
$$
\max_{a\in\mathcal{A}}\|\hat Q_k(\cdot,a) - Q^{\pi}_k(\cdot,a)\|_{S_k} \le \frac{\gamma^{H+k}}{1-\gamma} + \frac{\gamma^{k}}{1-\gamma}\sqrt{\frac{\log(2|S_0|/\delta_V)}{2 M_0}} + \frac{\gamma^{2}(1-\gamma^{k-1})}{(1-\gamma)^2}\sqrt{\frac{\log(2A|S_0|(k-1)/\delta_J)}{2 M}}\,, \qquad (52)
$$
by a union bound over $j \le k$, assuming $M_j = M$ for all $1 \le j \le h-1$.

B.3.2 End of the Proof of Theorem 5.4

For any state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}$, define the set $S_h = \{s\}$ and the sets $S_k$ accordingly for $0 \le k \le h-1$. Here we set $M_t := M > 0$ for $t \in [1,h]$. Assuming the worst-case growth $|S_k| \le |S_{k+1}|\, A\, M_{k+1}$, we can bound $|S_0|$ deterministically as $|S_0| \le A^h M^h$. Using the bound of the previous section, it is possible to bound the error of $\hat Q_h(s,a)$ as follows:
$$
|\hat Q_h(s,a) - Q^{\pi}_h(s,a)| \le \max_{a'\in\mathcal{A}}\|\hat Q_h(\cdot,a') - Q^{\pi}_h(\cdot,a')\|_{S_h} \le \frac{\gamma^{H+h}}{1-\gamma} + \frac{\gamma^{h}}{1-\gamma}\sqrt{\frac{\log(2|S_0|/\delta_V)}{2 M_0}} + \frac{\gamma^{2}(1-\gamma^{h-1})}{(1-\gamma)^2}\sqrt{\frac{\log(2A|S_0|h/\delta_J)}{2 M}} \qquad (53)
$$
with probability at least $1-\delta_V-\delta_J$. Assume now that we want to estimate $\hat Q_h(s,a)$ for all state-action pairs $(s,a)$ in some set $C \subseteq \mathcal{S}\times\mathcal{A}$: we want to achieve $\max_{(s,a)\in C}|\hat Q_h(s,a) - Q^{\pi}_h(s,a)| \le b$ for some positive accuracy $b$, with probability at least $1-\delta_V-\delta_J$. This can be achieved by choosing the following parameters:
$$
\forall\, 1 \le j \le h\,, \quad M_j := M > \frac{9\gamma^4(1-\gamma^{h-1})^2}{(1-\gamma)^4 b^2}\,\log(2hA|S_0||C|/\delta_J)\,, \qquad (54)
$$
$$
M_0 > \frac{9\gamma^{2h}}{(1-\gamma)^2 b^2}\,\log(2|S_0||C|/\delta_V)\,, \qquad (55)
$$
$$
H > \frac{1}{1-\gamma}\,\log\Big(\frac{3\gamma^h}{(1-\gamma)\, b}\Big)\,. \qquad (56)
$$
Picking $C = \mathcal{S}\times\mathcal{A}$ gives $\|\hat Q_h - Q^{\pi}_h\|_\infty \le b$. If estimates for $\{\hat V_k\}_{k=1}^{h}$ are reused across state-action pairs, and the components of each value function are computed at most once for every state $s \in \mathcal{S}$, then the overall number of samples is of the order of $K M_0 H |S_0| + K h |S_0| |C| M$.
In particular, with $C = \mathcal{S}\times\mathcal{A}$, the number of samples becomes of the order of $K M_0 H S + K h S A M$, where we reuse estimators (which are in this case computed for each state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}$). Let $K > \frac{1}{h(1-\gamma)}\log\big(\frac{4}{\epsilon(1-\gamma)(1-\gamma^h)}\big)$ as in Theorem 5.4. Choosing $\delta_V = \delta_J = \delta/2$ for some $\delta \in (0,1)$ and $b = \frac{\epsilon(1-\gamma)(1-\gamma^h)}{4}$ yields the following bound with probability at least $1-\delta$ (by Theorem 5.3):
$$
V^{\star} - V^{\pi_k} \le \gamma^{hk}\,\|V^{\star} - V^{\pi_0}\|_\infty\, e + \frac{1-\gamma^{hk}}{(1-\gamma)(1-\gamma^h)}\, 2b\, e\,.
$$
To conclude, note that when $k = K$ we have:
$$
\gamma^{hK}\,\|V^{\star} - V^{\pi_0}\|_\infty + \frac{1-\gamma^{hK}}{(1-\gamma)(1-\gamma^h)}\, 2b \le \frac{\gamma^{hK}}{1-\gamma} + \frac{2(1-\gamma^{hK})}{(1-\gamma)(1-\gamma^h)}\, b \le \frac{\epsilon(1-\gamma^h)}{4} + \frac{\epsilon(1-\gamma^{hK})}{2} \le \epsilon\,.
$$

C Additional Details and Proofs for Section 6: h-PMD with Function Approximation

To incorporate function approximation into h-PMD, the overall approach is as follows: compute an estimate of the lookahead value function by fitting a linear estimator using least squares, then perform an update step as in inexact h-PMD, taking advantage of the same error propagation analysis, which is agnostic to the source of the errors in the value function. A problem that arises, however, is storing the new policy, since a classic tabular storage method would require $O(S)$ memory, which we would like to avoid in order to scale to large state spaces. A solution to this problem is presented later in this section. Since we are using lookahead in our h-PMD algorithm, we modify the targets in our least squares regression procedure in order to approximate $Q^{\pi}_h$ rather than $Q^{\pi}$ as in the standard PMD algorithm (with $h = 1$).

C.1 Inexact h-PMD with Linear Function Approximation

Here we provide a detailed description of the h-PMD algorithm using function approximation to estimate the $Q_h$-function. Firstly, we consider the lookahead depth $h$ as a fixed input to the algorithm. We assume that the underlying MDP has finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, with transition kernel $P : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$, reward vector $R \in [0,1]^{\mathcal{S}\times\mathcal{A}}$ and discount factor $\gamma$. We also assume we are given a feature map $\psi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$, also written as $\Psi \in \mathbb{R}^{SA\times d}$. We adapt the inexact h-PMD algorithm from Section 5 as follows:

1. Take as input any policy $\pi_0 \in \mathrm{ri}(\Pi)$, for example one that is uniform over actions in all states.
2. Initialize our memory containing $\pi_0$, and an empty list $\Theta$ which will hold the parameters of the value function estimates computed at each iteration.
3. At each iteration $k = 0, \dots, K$, perform the following steps:
(a) Compute the targets $\hat R(z) := \hat Q_h(z)$ for all $z \in C$ using the procedure described in Section 5.1, using $\pi_k$ as the policy.
(b) Compute $\theta_k := \arg\min_{\theta \in \mathbb{R}^d} \sum_{z \in C} \rho(z)\big(\psi(z)^\top\theta - \hat R(z)\big)^2$ and append it to the list $\Theta$.
(c) Using $\Psi\theta_k$ as an estimate for $Q^{\pi_k}_h$, compute $\pi_{k+1}$ using the usual update from inexact h-PMD. This step can be merged with step 3a; the details are described below.

Computing the full policy $\pi_k$ for every state at every iteration would require $\Omega(S)$ operations, although the number of samples from the environment still depends only on $d$. If this is not a limitation, the policy update step can be performed as in inexact h-PMD simply by using $\Psi\theta_k$ as an estimate for $Q^{\pi_k}_h$, as described in step 3c. Otherwise, we describe below a method to compute $\pi_k$ on demand when it is needed in the computation of the targets in step 3a, eliminating step 3c entirely.

1. The policy $\pi_0$ is already given for all states.
2. For any $k > 0$, given $\pi_k$ we can compute $\pi_{k+1}$ at any state $s$ using the following steps (a sketch of this on-demand update for the KL mirror map is given after this list):
(a) Compute any greedy policy $\tilde\pi(\cdot|s)$ using $\Psi\theta_k$, for example $\tilde\pi(a'|s) \propto \mathbb{1}\{a' \in \arg\max_{a\in\mathcal{A}}\psi(s,a)^\top\theta_k\}$.
(b) Compute $\eta_k := \frac{1}{\gamma^{2h(k+1)}}\, D_{\phi}(\tilde\pi(\cdot|s), \pi_k(\cdot|s))$.
(c) Finally, compute $\pi_{k+1}(\cdot|s) \in \arg\max_{\pi\in\Pi}\big\{\eta_k\langle(\Psi\theta_k)_s, \pi(\cdot|s)\rangle - D_{\phi}(\pi(\cdot|s), \pi_k(\cdot|s))\big\}$.
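The following is a minimal sketch of this on-demand policy computation with a KL (negative entropy) Bregman divergence, for which the argmax in step (c) has the usual multiplicative-weights closed form. Here `phi(s, a)` stands for the feature map $\psi$, `thetas` for the stored parameter list $\Theta$, and `pi0(s)` for the initial policy; these names and the caching strategy are assumptions of the sketch rather than the paper's exact memory layout.

```python
import numpy as np

def policy_at(s, k, thetas, phi, pi0, n_actions, gamma, h, cache=None):
    """Return pi_k(.|s) on demand from Theta = [theta_0, ..., theta_{k-1}] (KL mirror map sketch)."""
    if cache is None:
        cache = {}
    if (s, k) in cache:
        return cache[(s, k)]
    if k == 0:
        p = np.asarray(pi0(s), dtype=float)               # pi_0 is given (e.g. uniform)
    else:
        prev = policy_at(s, k - 1, thetas, phi, pi0, n_actions, gamma, h, cache)
        q_hat = np.array([phi(s, a) @ thetas[k - 1] for a in range(n_actions)])  # (Psi theta_{k-1})_s
        # Step (a): a greedy policy w.r.t. the estimated lookahead values
        greedy = np.zeros(n_actions)
        greedy[int(np.argmax(q_hat))] = 1.0
        # Step (b): adaptive stepsize, D_KL(greedy || prev) / gamma^{2 h ((k-1)+1)}
        d = float(np.sum(greedy * (np.log(greedy + 1e-12) - np.log(prev + 1e-12))))
        eta = d / (gamma ** (2 * h * k)) + 1e-12
        # Step (c): KL proximal step, pi_k(a|s) ∝ pi_{k-1}(a|s) * exp(eta * q_hat(a))
        logits = np.log(prev + 1e-12) + eta * q_hat
        logits -= logits.max()
        p = np.exp(logits)
        p /= p.sum()
    cache[(s, k)] = p
    return p
```

Caching the computed distributions (the `cache` dictionary) mirrors the memorization discussed in Remark C.2: a policy is only materialized at the states where it is actually queried.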
Remark C.1. This procedure can be used to compute $\pi_{k+1}(\cdot|s)$ on demand for any $k > 0$ and $s \in \mathcal{S}$ by recursively computing $\pi_k(\cdot|s')$ at any state $s'$ required for the computation of $\Psi\theta_k$ (since $\pi_0$ is given). The memory requirement of this method is of the order of $O(KA\min\{|S_0|H, S\})$ in total, which is always less than computing the full tabular policy at each iteration.

Remark C.2. Above, $(\Psi\theta_k)_s$ denotes a vector in $\mathbb{R}^A$ that can be computed in $O(Ad)$ operations: each component $a$ is equal to $\psi(s,a)^\top\theta_k$, which can be computed in $O(d)$ operations. Also, the stepsize $\eta_k$ is chosen to be equal to its lower bound in the analysis of Theorem 5.3, with the value of $c_k$ chosen to be equal to $\gamma^{2h(k+1)}$, as needed for linear convergence. We also note that by memorizing policies as they are computed, we perform only as many policy updates as are needed by the algorithm.

C.2 Discussion of Assumption 6.2

We make two preliminary comments before discussing the relevance of our assumption and alternatives to relax it. First, notice that we do not require $Q^{\pi}_h$ to be exactly representable as a linear function: our assumption only asks for an approximate linear representation of the lookahead action-value function. When the lookahead depth is $h = 1$, we recover the standard PMD algorithm, and in this case our assumption is standard in the RL literature on linear function approximation. Now, using our notations, notice that for any policy $\pi \in \Pi$, $Q^{\pi}_h = r + \gamma P T^{h-1} V^{\pi} = r + \gamma P T^{h-1} M^{\pi} Q^{\pi} = J T^{h-1} M^{\pi} Q^{\pi}$. We have two alternatives:

– Directly approximate $Q^{\pi}_h$, as we do in our assumption. In that case the targets used in our regression procedure are estimates $\hat Q_h$ of $Q^{\pi}_h$. Note also that this approach is meaningful from a practical point of view, as we may directly use a neural network to approximate the lookahead Q-function used in our h-PMD algorithm, rather than approximating the Q-function first and then trying to approximate the lookahead Q-function $Q^{\pi}_h$ from it.

– Approximate $Q^{\pi}$ with linear function approximation and use our propagation analysis (see Section 5.1 and the proofs in Appendices B.3.1 and B.3.2) based on the link between $Q^{\pi}_h$ and $Q^{\pi}$ highlighted in the formula above. This approach, which propagates the function approximation error on $Q^{\pi}$ to $Q^{\pi}_h$, constitutes an alternative to Assumption 6.2 on $Q^{\pi}_h$ via an assumption on $Q^{\pi}$. We discuss it further in the rest of this section.

In the following exposition we provide more details regarding estimating $Q^{\pi}$ using linear function approximation and relaxing Assumption 6.2 using the second approach discussed above. Let Assumption 6.1 hold (as in Theorem 6.3). Instead of Assumption 6.2, we introduce the following assumption on the Q-function $Q^{\pi}$ rather than on the h-step lookahead Q-function $Q^{\pi}_h$.

Assumption C.3. There exists $\epsilon > 0$ such that for every $\pi \in \Pi$, $\inf_{\theta\in\mathbb{R}^d}\|Q^{\pi} - \Psi\theta\|_\infty \le \epsilon$.

We modify the h-PMD with function approximation algorithm as follows: instead of directly approximating $Q^{\pi}_h$ using linear function approximation, we first use linear function approximation to approximate $Q^{\pi}$, from which we construct an estimate of $Q^{\pi}_h$ similarly to our method for inexact h-PMD. This requires us to replace the regression targets with simpler ones, namely Monte Carlo estimates of $Q^{\pi}$ instead of $Q^{\pi}_h$. Specifically, these new targets are defined as:
$$
R_{M_0}(z) := \frac{1}{M_0}\sum_{i=1}^{M_0}\sum_{t=0}^{H-1}\gamma^t\, r\big(s^{(i)}_t, a^{(i)}_t\big)\,,
$$
where $(s^{(i)}_t, a^{(i)}_t)$ are state-action pairs sampled according to the policy $\pi$, $z = (s,a) \in C \subseteq \mathcal{S}\times\mathcal{A}$ and $(s^{(i)}_0, a^{(i)}_0) = z$.
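A minimal sketch of these Monte Carlo regression targets is given below, assuming a hypothetical `rollout(s, a, policy, H)` helper that returns the reward sequence of a trajectory started by taking action `a` in state `s` and following $\pi$ afterwards, and assuming the same truncation at horizon $H$ as the rollout estimator of Section B.1.

```python
import numpy as np

def mc_target(z, policy, rollout, gamma, M0, H):
    """R_{M0}(z): truncated Monte Carlo estimate of Q^pi(z) for a state-action pair z = (s, a)."""
    s, a = z
    total = 0.0
    for _ in range(M0):
        rewards = rollout(s, a, policy, H)                      # r(s_t, a_t), t = 0, ..., H-1
        total += sum(gamma ** t * rew for t, rew in enumerate(rewards))
    return total / M0

# Regression responses for all design points z in C:
# targets = {z: mc_target(z, policy, rollout, gamma, M0, H) for z in C}
```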
For any policy $\pi$ we can use the same function approximation procedure as described in our paper (see Appendix C.1) to obtain $\hat\theta \in \mathbb{R}^d$ with $\|Q^{\pi} - \Psi\hat\theta\|_\infty \le \epsilon_0$ (for some value of $\epsilon_0$ defined below), by simply plugging in $h = 1$. Note that this particular case was previously analysed in [48] (Lecture 8). We reuse an intermediate result therein (equation (7)) which guarantees the bound:
$$
\|Q^{\pi} - \Psi\hat\theta\|_\infty \le \epsilon_\pi(1+\sqrt d) + \sqrt d\left(\frac{\gamma^H}{1-\gamma} + \frac{1}{1-\gamma}\sqrt{\frac{\log(2|C|/\delta)}{2 M_0}}\right) =: \epsilon_0(\delta) \qquad (59)
$$
with probability at least $1-\delta$ for some $0 < \delta < 1$. Now we reuse most of our estimation procedure from Section 5.1, simply changing the definition of $\hat V_1$ to $\hat V_1(\bar s) := \sum_{a\in\mathcal{A}}\pi(a|\bar s)\,(\Psi\hat\theta)_{\bar s, a}$ for all $\bar s \in S_0$. This evidently retains the guarantee that $|\hat V_1(\bar s) - V^{\pi}(\bar s)| \le \epsilon_0(\delta)$ with probability at least $1-\delta$ for each $\bar s \in S_0$. Finally, using the same error propagation analysis as in Appendix B.3.1, replacing $C^V_1\big(\tfrac{\delta^V_1}{|S_0|}\big)$ with $\epsilon_0(\delta^V_1)$, we obtain the following bound for any individual state-action pair $(s,a)$:
$$
|\hat Q^{\pi}_h(s,a) - Q^{\pi}_h(s,a)| \le \frac{\gamma}{1-\gamma}\sqrt{\frac{\log(2/\delta^Q_h)}{2 M_h}} + \frac{\gamma^2(1-\gamma^{h-1})}{(1-\gamma)^2}\sqrt{\frac{\log(2(h-1)Z/\delta^V_t)}{2 M}} + \gamma^h\,\epsilon_0(\delta^V_1) \qquad (60)
$$
with probability at least $1-\delta^V_1-\delta^V_t-\delta^Q_h$. Finally, the update rule as described in Appendix C.1 is replaced by $\pi_{k+1}(\cdot|s) \in \arg\max_{\pi_s\in\Delta(\mathcal{A})}\big(\eta_k\langle\hat Q^{\pi_k}_h(s,\cdot), \pi_s\rangle - D_{\phi}(\pi_s, \pi^k_s)\big)$. Note that we only compute $\pi_k$ in the states where we need it to compute $\pi_{k+1}$, as described in Appendix C.1.

Unfortunately, it is not clear how to directly obtain a result similar to Theorem 6.3 from this approach (controlling $\|V^{\star} - V^{\pi_k}\|_\infty$ at each iteration, i.e. uniformly over all states) beyond the bound that we provide above. A naive union bound over the state-action space would result in a bound with a logarithmic dependence on the size of the state space, unlike Theorem 6.3, which is completely independent of the size of the state space.

We would like to add an additional comment regarding possible ways to relax Assumption 6.2, even in the standard case where $h = 1$. Recently, Mei et al. [30] proposed an interesting alternative approach relying on order-based conditions (instead of approximation-error-based ones) to analyze policy gradient methods with linear function approximation. We highlight, though, that their work only applies to the bandit setting to the best of our knowledge. The extension to MDPs is far from straightforward and would require a specific investigation. It would be interesting to see whether one can relax Assumption 6.2 using similar ideas.

C.3 Convergence Analysis of Inexact h-PMD using Function Approximation

Our analysis follows the same lines as the proof in [27] under our assumptions from Section 6. We compute $\hat Q_h(z)$ for a subset $C$ of state-action pairs using the vanilla Monte Carlo planning algorithm presented in Section 5. These will be our targets for estimating the value function using least squares. In order to control the extrapolation of these estimates to the full $Q^{\pi}_h$ function, we will need the following theorem.

Theorem C.4 ([22]). Assume $Z$ is finite and $\psi : Z \to \mathbb{R}^d$ is such that the underlying matrix $\Psi \in \mathbb{R}^{|Z|\times d}$ has rank $d$. Then there exist $C \subseteq Z$ and a distribution $\rho : C \to [0,1]$ (i.e. $\sum_{z\in C}\rho(z) = 1$) satisfying the following conditions:
1. $|C| \le d(d+1)/2$,
2. $\sup_{z\in Z}\|\psi(z)\|^2_{G^{-1}_\rho} \le d$, where $G_\rho := \sum_{z\in C}\rho(z)\,\psi(z)\psi(z)^\top$.

Let $C \subseteq \mathcal{S}\times\mathcal{A}$ and $\rho : C \to [0,1]$ be as described in the Kiefer–Wolfowitz theorem. Define $\hat\theta$ as follows:
$$
\hat\theta \in \arg\min_{\theta\in\mathbb{R}^d}\sum_{z\in C}\rho(z)\big(\psi(z)^\top\theta - \hat R_m(z)\big)^2\,. \qquad (61)
$$
Assuming that $G_\rho$ is nonsingular, a closed form solution for the above least squares problem is given by $\hat\theta = G^{-1}_\rho\sum_{z\in C}\rho(z)\,\hat R_m(z)\,\psi(z)$.
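A minimal sketch of this weighted least-squares fit (61) and its closed form is given below, assuming the design set $C$ and weights $\rho$ from the Kiefer–Wolfowitz theorem are given (computing an optimal design is a separate problem not shown here); `features` stacks $\psi(z)$ for $z \in C$ and `targets` holds the Monte Carlo estimates $\hat R(z)$.

```python
import numpy as np

def fit_theta(features, weights, targets, ridge=0.0):
    """Weighted least squares: theta_hat = G_rho^{-1} sum_z rho(z) R_hat(z) psi(z).

    features: (n, d) matrix whose rows are psi(z) for z in C.
    weights:  (n,) design distribution rho(z), nonnegative and summing to 1.
    targets:  (n,) Monte Carlo targets R_hat(z).
    ridge:    optional small regularizer in case G_rho is singular (sketch-level safeguard).
    """
    n, d = features.shape
    G = features.T @ (weights[:, None] * features) + ridge * np.eye(d)   # G_rho
    b = features.T @ (weights * targets)                                 # sum_z rho(z) R_hat(z) psi(z)
    return np.linalg.solve(G, b)

# The fitted lookahead values at any state-action pair z are then psi(z) @ theta,
# i.e. the estimate (Psi theta)_z used in the h-PMD update.
```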
The resulting estimator $\Psi\hat\theta$ will be used to estimate our lookahead value function. We can control its extrapolation error as follows:

Theorem C.5. Let $\delta \in (0,1)$ and let $\theta \in \mathbb{R}^d$ be such that $\|Q^{\pi}_h - \Psi\theta\|_\infty \le \epsilon_\pi$. Assume that we estimate the targets $\hat Q_h$ using the Monte Carlo planning procedure described in Section 5.1 with parameters $H$, $M$ and $M_0$. There exist a subset $C \subseteq Z$ and a distribution $\rho : C \to \mathbb{R}_{\ge 0}$ for which, with probability at least $1-\delta$, the least squares estimator $\hat\theta$ satisfies:
$$
\|Q^{\pi}_h - \Psi\hat\theta\|_\infty \le \epsilon_\pi(1+\sqrt d) + \sqrt d\left(\frac{\gamma^{H+h}}{1-\gamma} + \frac{\gamma^{h}}{1-\gamma}\sqrt{\frac{\log(4Zd^2/\delta)}{2 M_0}} + \frac{\gamma^{2}(1-\gamma^{h-1})}{(1-\gamma)^2}\sqrt{\frac{\log(4hZd^2/\delta)}{2 M}}\right),
$$
where $Z := A^h M_h M^{h-1}$.

Proof. The proof of this theorem is identical to the one in [27], up to using our new targets and error propagation analysis instead of the typical function approximation targets for Q-function estimation (without lookahead).

The subset $C \subseteq \mathcal{S}\times\mathcal{A}$ and the distribution $\rho : C \to [0,1]$ are given by the Kiefer–Wolfowitz theorem stated above. We suppose we have access to such a set. Note that this assumption is key to controlling the extrapolation error of the least squares estimate $\Psi\hat\theta$.

We have now fully described how to estimate the lookahead value function $Q^{\pi}_h$ using a number of trajectories that does not depend on the size of the state space. In the previous section, we described how to use this estimate in an h-PMD style update while maintaining a memory size that does not depend on $S$. We therefore obtain an h-PMD algorithm whose memory, computation and sample requirements do not depend on $S$.

D Further Simulations

D.1 Simulations for h-PMD with Euclidean Regularization

Here we provide details of an additional experiment running h-PMD with Euclidean regularization. As in Section 7, experiments were run on the bsuite Deep Sea environment [34] with $\gamma = 0.99$ and a grid size of 64x64. The figures are identical in structure to the ones in Section 7; 32 runs were performed for each value of $h$ in $\{1, 5, 10, 15, 20\}$. In Fig. 3, we report the number of samples to the environment at each iteration for the same runs. When compared to the figures showing the convergence, it is clearly seen that running the algorithm with higher values of lookahead results in convergence using fewer overall samples.

Figure 2: Suboptimality value function gap for h-PMD using the Euclidean Bregman divergence in the exact (left) and inexact (right) settings. (Right) We performed 32 runs for each value of $h$; the mean is shown as a solid line and the standard deviation as a shaded area.

D.2 Sample Complexity Comparison for Different Lookahead Depths

We have recorded the sample complexity of our algorithm for different values of $h$ in Figure 3. Notice that the number of samples used at each iteration increases with $h$. However, this increase is small: h-PMD with $h = 20$ requires only 1.5 times more samples per iteration than $h = 1$, as shown in Figure 3. Furthermore, increasing $h$ greatly improves the convergence rate of the algorithm, as described by our theory, which results in a much smaller number of iterations required until convergence (for example, $h = 20$ converges in at least 10 times fewer iterations than $h = 1$; see Figure 1 (right) in Section 7). Overall, the algorithm is more sample efficient for higher values of $h$.

D.3 Total Running Time Comparison for Different Lookahead Depths

We provide a plot accounting for running time instead of the number of iterations (see Figure 4). Observe that the overall performance is very similar to the same plot with the number of iterations on the x-axis.
Figure 3: Samples used by inexact h-PMD at each iteration. When compared with the first figure, it is clear that the algorithm needs fewer samples to converge for higher values of $h$. Since far fewer iterations are needed with higher values of lookahead, the benefit of the higher convergence rate greatly outweighs the additional cost per iteration.

Figure 4: Value function gap against running time. Note that the algorithms with higher values of $h$ still converge faster in terms of runtime.

D.4 About the Stepsizes in Simulations

In our experiments we use the exact same stepsizes described in our theory. We observe in our simulations that the stepsize does not go to infinity but rather decreases at some point and/or stabilizes. This is a consequence of the adaptivity of the stepsize discussed above. The simulations reported in [20] (Appendix D, p. 18) also confirm that the stepsizes do not go to infinity (see the right plots therein), and we observe a similar behavior for the stepsizes in our experiments (see Figure 5).

D.5 Larger lookahead depths

In order to practically investigate the tradeoff of increasing the lookahead depth $h$, we performed additional experiments where $h$ was increased to 100 in the Deep Sea bsuite environment. The plots in Figure 6 show that increasing $h$ past the values in our initial experiments still results in better performance.

Figure 5: Behaviour of our adaptive stepsize $\eta_k$ at each iteration of the algorithm for different values of $h$. Note that the stepsize does not diverge towards infinity; instead it usually vanishes after a certain number of iterations.

Figure 6: Value function gap plotted against the number of iterations (left) and against running time (right) of h-PMD in the Deep Sea bsuite environment. Note that increasing the lookahead depth seems to result in better performance in terms of both runtime and number of iterations.

D.6 Experiments with MCTS-based Lookahead Estimation

We have performed additional experiments in an even more challenging and practical setting, on two environments from the bsuite [34] suite: Deep Sea and Catch. We have implemented our PMD algorithm with a Monte Carlo Tree Search (MCTS) algorithm (which we call PMD-MCTS) to compute the lookahead Q-function. We used DeepMind's MCTS implementation (MCTX)4 in order to run this algorithm on any deterministic gym-style environment implemented in JAX (we use gymnax [26] to implement these environments). Note that any tree search implementation can be used to estimate the lookahead function and plugged into our PMD method. See the figures below. Note that this training procedure is more practical: we no longer evaluate the value function at each state-action pair at each time step; we only evaluate the value function (and hence only update the policy) in states that we encounter using the current policy. This significantly relaxes the generative model assumption. Note that in this more challenging setting the best performance is not obtained for the highest values of $h$: intermediate values of $h$ perform better, illustrating the tradeoff in choosing the depth $h$.
4 https://github.com/google-deepmind/mctx

Figure 7: Training curves of PMD-MCTS for the bsuite Catch (left) and Deep Sea (right) environments. Note that the case $h = 1$ fails to converge in any reasonable number of iterations in either environment.

D.7 Continuous Control Experiments

Finally, we performed experiments in classic continuous control environments in order to investigate the performance of h-PMD when applied to MDPs with large or infinite state spaces. For this continuous state space setting, we implemented a fully online variant of h-PMD based on DeepMind's MuZero [40] algorithm. Experiments show that increasing the lookahead depth can lead to better performance up to a certain point, and the tradeoff seems to be environment-specific.

Figure 8: Performance of h-PMD in two of OpenAI's gym [8] environments, CartPole-v1 (left) and Acrobot-v1 (right). In both figures the reward is plotted against the total number of environment steps, which can be understood as the number of samples used. Note that higher values of $h$ do lead to better performance in the CartPole-v1 environment, but not necessarily in the Acrobot-v1 environment.

D.8 Additional experiments

We conclude with results from a larger set of experiments designed to verify the robustness of our experimental results to different environment parameters. We report the performance of the h-PMD algorithm in the inexact setting in two different environments: Deep Sea from bsuite [34], and the classic Gridworld environment. The effect of $\gamma$ is as expected: higher values of $\gamma$ result in slower convergence, while lower values of $\gamma$ yield faster convergence. The size of the environment (chain length for Deep Sea and number of columns/rows for Gridworld) affects the number of iterations until convergence predictably: larger environments take more iterations to solve. Note that in all cases the effect of $h$ is as expected, i.e. higher $h$ leads to faster convergence.

Figure panels: Deep Sea with chain lengths 16 and 32 and $\gamma \in \{0.9, 0.99, 0.999\}$, and Gridworld with 16x8 and 32x16 grids and $\gamma \in \{0.9, 0.99, 0.999\}$, each panel comparing $h \in \{1, 5, 10, 15, 20\}$.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Please see the abstract, the introduction and the results.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations.
A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Please see the future work discussion in the conclusion and the discussion about the computational tradeoff in Section 4. Assumptions for function approximation are discussed in details in the Appendix. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate Limitations section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Please see the complete proofs in the different appendix sections. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. 
Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: Details provided in the main part in the simulation sections and in the appendix, we also provide a link to an anonymous repository for the code. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Please see link to an anonymous repository for the code. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. 
Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: This is discussed in the description of simulations in the main part and in the appendix as well as in the code provided. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We provide error bars (shaded area) based on runnning our simulations using different seeds and specify it is computed using the standard deviation. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer Yes if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. 
Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We used CPUs and we also tracked running time (see relevant figures in the appendix). We performed several simulations for tuning which are not all reported in the paper. We reported several of them to show the robustness of our results to varying hyperparameters. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We checked the code and our paper conforms to it. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Our contributions are mainly theoretical and focus on improving the performance of existing policy gradient methods. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. 
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Our work does not involve such risks. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We mention whenever we use specific existing code from some repositories and give credit appropriately. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: does not apply to our paper. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. 
Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: does not apply to our work. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: does not apply to our work. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.