# Non-Stationary Approximate Modified Policy Iteration

Boris Lesner BORIS.LESNER.DATEXIM@GMAIL.COM
Bruno Scherrer BRUNO.SCHERRER@INRIA.FR
Inria, Villers-lès-Nancy, F-54600, France
Université de Lorraine, LORIA, UMR 7503, Vandœuvre-lès-Nancy, F-54506, France

Abstract

We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration, a family of algorithms that can interpolate between Value and Policy Iteration, with an error ϵ at each iteration is known to lead to stationary policies that are at least $\frac{2\gamma\epsilon}{(1-\gamma)^2}$-optimal. Variations of Value and Policy Iteration, which build ℓ-periodic non-stationary policies, have recently been shown to display a better $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimality guarantee. We describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers $m \ge 0$ and $\ell \ge 1$ that generalizes all the above mentioned algorithms. While m allows one to interpolate between Value-Iteration-style and Policy-Iteration-style updates, ℓ specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and ℓ; this tightness was to our knowledge only known in two specific cases, Value Iteration ($m = 0$, $\ell = 1$) and Policy Iteration ($m = \infty$, $\ell = 1$).

1. Introduction

Dynamic Programming (DP) is an elegant approach for addressing γ-discounted infinite-horizon optimal control problems formalized as Markov Decision Processes (MDP) (Puterman, 1994). The two most well-known DP algorithms in this framework are Value Iteration (VI) and Policy Iteration (PI). While the former typically has lighter iterations, the latter usually converges much faster. Modified Policy Iteration (MPI), which interpolates between the two, was introduced to improve the convergence rate of VI while remaining lighter than PI (Puterman & Shin, 1978). When the optimal control problem one considers is large, an option is to consider approximate versions of these DP algorithms, where each iteration may be corrupted with some noise ϵ. An important question is the sensitivity of such an approach to the noise. Bertsekas & Tsitsiklis (1996) gather several results regarding approximate versions of VI and PI (thereafter named AVI and API). It is known that the policy output by such procedures is guaranteed to be $\frac{2\gamma\epsilon}{(1-\gamma)^2}$-optimal. In particular, when the perturbation ϵ tends to 0, one recovers an optimal solution. This analysis was recently generalized to an approximate implementation of MPI (AMPI) independently by Canbolat & Rothblum (2012) and Scherrer et al. (2012). The better guarantee, obtained by the latter, $\frac{2\gamma\epsilon}{(1-\gamma)^2}$-optimality, exactly matches that of AVI and API. The algorithmic scheme AMPI can be implemented in various ways, reducing the original control problem to a series of (more standard) regression and classification problems (Scherrer et al., 2012), and led to state-of-the-art results on large benchmark problems, in particular on the Tetris domain (Gabillon et al., 2013).
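To give a concrete sense of the gap between the two guarantees mentioned above, here is a small illustrative computation (not from the paper) comparing the asymptotic constants $\frac{2\gamma}{(1-\gamma)^2}$ and $\frac{2\gamma}{(1-\gamma)(1-\gamma^\ell)}$ for a representative discount factor and a period ℓ of the order of $\frac{1}{1-\gamma}$.

```python
# Illustrative comparison of the two asymptotic guarantee constants (sketch;
# gamma and the choice of l are arbitrary example values, not from the paper).
import math

gamma = 0.99                           # example discount factor
l = math.ceil(1.0 / (1.0 - gamma))     # period of the order of 1/(1 - gamma)

stationary = 2 * gamma / (1 - gamma) ** 2                      # 2g/(1-g)^2
non_stationary = 2 * gamma / ((1 - gamma) * (1 - gamma ** l))  # 2g/((1-g)(1-g^l))

print(f"stationary constant     : {stationary:.0f}")
print(f"non-stationary constant : {non_stationary:.0f}  (l = {l})")
print(f"improvement factor      : {stationary / non_stationary:.1f}")
```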
An apparent weakness of these sensitivity analyses is that the dependence with respect to the discount factor γ is bad: since γ is typically close to 1, the denominator of the constant $\frac{2\gamma}{(1-\gamma)^2}$ often makes the guarantee uninformative in practice. Unfortunately, it turns out that this is not so much a weakness of the analyses as a weakness of the very algorithmic approach, since Bertsekas & Tsitsiklis (1996) and Scherrer & Lesner (2012) showed that the bound $\frac{2\gamma\epsilon}{(1-\gamma)^2}$ is tight respectively for API and AVI, and thus cannot be improved in general. Interestingly, the authors of the latter article described a trick for modifying AVI and API so as to improve the guarantee: even though one knows that there exists a stationary policy that is optimal, Scherrer & Lesner (2012) showed that variations of AVI and API that compute ℓ-periodic non-stationary policies (thereafter named NS-AVI and NS-API) lead to an improved bound of $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$. For values of ℓ of the order of $\frac{1}{\log\frac{1}{\gamma}}$, equivalent to $\frac{1}{1-\gamma}$ when γ is close to 1, the guarantee is improved by a significant factor (of order $\frac{1}{1-\gamma}$). With respect to the standard AVI and API schemes, the only extra algorithmic price to pay is memory, which is then O(ℓ) instead of O(1). As often in computer science, one gets a clear trade-off between quality and memory.

To the best of our knowledge, it is not known whether the non-stationary trick also applies to a modified algorithm that would interpolate between NS-AVI and NS-API. Perhaps more importantly, it is not known whether the improved bound $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$ is tight for NS-AVI or NS-API, or even whether the standard $\frac{2\gamma\epsilon}{(1-\gamma)^2}$ bound is tight for AMPI. In this article, we fill the missing parts of this topic in the literature. We shall describe NS-AMPI, a new non-stationary MPI algorithm that generalizes all previously mentioned algorithms (AVI, API, AMPI, NS-AVI and NS-API), and prove that it returns a policy that is $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimal. Furthermore, we will show that for any value of the period ℓ and any degree of interpolation between NS-AVI and NS-API, such a bound is tight. Thus, our analysis not only unifies all previous works, but it provides a complete picture of the sensitivity analysis for this large class of algorithms.

The paper is organized as follows. In Section 2 we describe the optimal control problem. Section 3 describes the state-of-the-art algorithms AMPI, NS-AVI and NS-API along with their known sensitivity analysis. In Section 4, we describe the new algorithm, NS-AMPI, and our main results: a performance guarantee (Theorem 3) and a matching lower bound (Theorem 4). Section 5 follows by providing the proof sketches of both results. Section 6 describes a small numerical illustration of our new algorithm, which gives some insight on the choice of its parameters. Section 7 concludes and mentions potential future research directions.

2. Problem Setting

We consider a discrete-time dynamic system whose state transition depends on a control. Let X be a state space. When at some state, an action is chosen from a finite action space A. The current state $x \in X$ and action $a \in A$ characterize, through a homogeneous probability kernel $P(dx'|x, a)$, the distribution of the next state $x'$. At each transition, the system is given a reward $r(x, a, x') \in \mathbb{R}$, where $r : X \times A \times X \to \mathbb{R}$ is the instantaneous reward function.
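For a finite state and action space, the objects just introduced can be stored as simple arrays. The following sketch shows one possible tabular representation (the sizes and the random kernel are placeholders of this illustration, not part of the paper).

```python
# One possible tabular representation of the MDP objects of this section
# (sketch with placeholder sizes): P[a, x, y] stores the kernel P(y | x, a)
# and R[x, a, y] stores the reward r(x, a, y); gamma is the discount factor.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # each P[a, x, :] is a distribution
R = rng.random((n_states, n_actions, n_states))

# Expected immediate reward of action a in state x: sum_y P(y|x,a) r(x,a,y).
expected_r = np.einsum("axy,xay->xa", P, R)
print(expected_r)
```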
In this context, the goal is to determine a sequence of actions $(a_t)$, adapted to the past of the process until time t, that maximizes the expected discounted sum of rewards from any starting state x:

$$E\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, a_k, x_{k+1}) \;\middle|\; x_0 = x,\; x_{t+1} \sim P(\cdot|x_t, a_t)\right],$$

where $0 < \gamma < 1$ is a discount factor. The tuple $\langle X, A, P, r, \gamma\rangle$ is called a Markov Decision Process (MDP) and the associated optimization problem infinite-horizon stationary discounted optimal control (Puterman, 1994; Bertsekas & Tsitsiklis, 1996).

An important result of this setting is that there exists at least one stationary deterministic policy, that is a function $\pi : X \to A$ that maps states into actions, that is optimal (Puterman, 1994). As a consequence, the problem is usually recast as looking for the stationary deterministic policy π that maximizes for every state x the quantity

$$v_\pi(x) := E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r(x_k, \pi(x_k), x_{k+1})\right] \qquad (1)$$

called the value of policy π at state x. The notation $E_\pi$ means that we condition on trajectories such that $x_{t+1} \sim P_\pi(\cdot|x_t)$, where $P_\pi(dx'|x)$ is the stochastic kernel $P(dx'|x, \pi(x))$ that chooses actions according to policy π. We shall similarly write $r_\pi : X \to \mathbb{R}$ for the function giving the immediate reward while following policy π:

$$\forall x,\quad r_\pi(x) = E\left[r(x_0, \pi(x_0), x_1) \mid x_0 = x,\; x_1 \sim P_\pi(\cdot|x_0)\right].$$

Two linear operators are associated with the stochastic kernel $P_\pi$: a left operator on functions $f \in \mathbb{R}^X$,

$$\forall x,\quad (P_\pi f)(x) = \int f(y)\, P_\pi(dy|x) = E\left[f(x_1) \mid x_0 = x,\; x_1 \sim P_\pi(\cdot|x_0)\right],$$

and a right operator on distributions µ,

$$(\mu P_\pi)(dy) = \int P_\pi(dy|x)\,\mu(dx).$$

In words, $(P_\pi f)(x)$ is the expected value of f after following policy π for a single time-step starting from x, and $\mu P_\pi$ is the distribution of states after a single time-step starting from µ.

Given a policy π, it is well known that the value $v_\pi$ is the unique solution of the following Bellman equation: $v_\pi = r_\pi + \gamma P_\pi v_\pi$. In other words, $v_\pi$ is the fixed point of the affine operator $T_\pi v := r_\pi + \gamma P_\pi v$. The optimal value starting from state x is defined as

$$v_*(x) := \max_\pi v_\pi(x).$$

It is also well known that $v_*$ is characterized by the following Bellman equation:

$$v_* = \max_\pi \left(r_\pi + \gamma P_\pi v_*\right) = \max_\pi T_\pi v_*,$$

where the max operator is componentwise. In other words, $v_*$ is the fixed point of the nonlinear operator $Tv := \max_\pi T_\pi v$. Finally, for any function $v \in \mathbb{R}^X$, we say that a policy π is greedy with respect to v if it satisfies

$$\pi \in \arg\max_{\pi'} T_{\pi'} v \quad \text{or equivalently} \quad T_\pi v = Tv.$$

We write, with some abuse of notation¹, G(v) for any policy that is greedy with respect to v. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following standard property: any policy π that is greedy with respect to the optimal value is an optimal policy and its value $v_\pi$ is equal to $v_*$. Thus, the main problem amounts to computing the optimal value function $v_*$. The next section describes algorithmic approaches from the literature.

3. State-of-the-Art Algorithms

We begin by describing the Approximate Modified Policy Iteration (AMPI) algorithmic scheme (Scherrer et al., 2012). Starting from an arbitrary value function $v_0$, AMPI generates a sequence of value-policy pairs

$$\pi_{k+1} = G(v_k) \qquad \text{(greedy step)}$$
$$v_{k+1} = (T_{\pi_{k+1}})^{m+1} v_k + \epsilon_{k+1} \qquad \text{(evaluation step)}$$

where $m \ge 0$ is a free parameter. At each iteration, the term $\epsilon_{k+1}$ accounts for a possible approximation of the evaluation step. AMPI generalizes the well-known approximate DP algorithms Value Iteration (AVI) and Policy Iteration (API) for the values $m = 0$ and $m = \infty$, respectively.
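As an illustration of the scheme above, here is a minimal tabular sketch of AMPI on a small random MDP. The MDP, the noise model and the helper names are assumptions of this sketch (the reward is simplified to depend on the state-action pair only); it is not the authors' implementation.

```python
# Sketch of AMPI on a small random tabular MDP:
#   greedy step:      pi_{k+1} = G(v_k)
#   evaluation step:  v_{k+1} = (T_{pi_{k+1}})^{m+1} v_k + eps_{k+1}
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, m, eps, n_iter = 6, 3, 0.9, 2, 0.01, 50

P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)          # P[a, x, y] = P(y | x, a)
r = rng.random((S, A))                     # simplified reward r(x, a)

def T(pi, v):
    """Bellman operator T_pi of the stationary policy pi applied to v."""
    return r[np.arange(S), pi] + gamma * P[pi, np.arange(S)] @ v

def greedy(v):
    """A policy that is greedy with respect to v."""
    q = r + gamma * np.einsum("axy,y->xa", P, v)
    return q.argmax(axis=1)

v = np.zeros(S)
for k in range(n_iter):
    pi = greedy(v)                         # greedy step
    for _ in range(m + 1):                 # evaluation: apply T_pi (m+1) times
        v = T(pi, v)
    v += rng.uniform(-eps, eps, size=S)    # approximation error
print("greedy policy:", pi, "value estimate:", np.round(v, 2))
```

With m = 0 the inner loop reduces to a single application of $T_{\pi_{k+1}}$ (VI-style updates), while a large m approaches a full policy evaluation (PI-style updates).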
In the exact case ($\epsilon_k = 0$), MPI requires less computation per iteration than PI (in a way similar to VI) and enjoys the faster convergence (in terms of number of iterations) of PI (Puterman & Shin, 1978; Puterman, 1994). It was recently shown that controlling the errors $\epsilon_k$ when running AMPI is sufficient to ensure some performance guarantee (Scherrer et al., 2012; Canbolat & Rothblum, 2012). For instance, we have the following performance bound, which is remarkably independent of the parameter m.²

¹ There might be several policies that are greedy with respect to v.

² Note that in practice, the term $\epsilon_k$ will generally depend on m. The exact dependence may strongly depend on the precise implementation, and we refer the reader to (Scherrer et al., 2012) for examples of such analyses. In this paper, we only consider the situation of a uniform bound on the errors, all the more so as the extension to more complicated errors is straightforward.

Theorem 1 (Scherrer et al. (2012, Remark 2)). Consider AMPI with any parameter $m \ge 0$. Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_\infty < \epsilon$ for all k. Then, the loss due to running policy $\pi_k$ instead of the optimal policy $\pi_*$ satisfies

$$\|v_* - v_{\pi_k}\|_\infty \le \frac{2(\gamma - \gamma^k)}{(1-\gamma)^2}\,\epsilon + \frac{2\gamma^k}{1-\gamma}\,\|v_* - v_0\|_\infty.$$

In the specific cases corresponding to AVI ($m = 0$) and API ($m = \infty$), this bound matches performance guarantees that have been known for a long time (Singh & Yee, 1994; Bertsekas & Tsitsiklis, 1996). The asymptotic constant $\frac{2\gamma}{(1-\gamma)^2}$ can be very big, in particular when γ is close to 1. Unfortunately, it cannot be improved: Bertsekas & Tsitsiklis (1996, Example 6.4) showed that the bound is tight for PI, Scherrer & Lesner (2012) proved that it is tight for VI,³ and we will prove in this article⁴ the, to our knowledge unknown, fact that it is also tight for AMPI. In other words, improving the performance bound requires changing the algorithms.

³ Though the MDP instance used to show the tightness of the bound for VI is the same as that for PI (Bertsekas & Tsitsiklis, 1996, Example 6.4), Scherrer & Lesner (2012) seem to be the first to argue about it in the literature.

⁴ Theorem 4 with $\ell = 1$.

Even though the theory of optimal control states that there exists a stationary policy that is optimal, Scherrer & Lesner (2012) recently showed that the performance bound of Theorem 1 could be improved in the specific cases $m = 0$ and $m = \infty$ by considering variations of AVI and API that build periodic non-stationary policies (instead of stationary policies). Surprisingly, the Non-Stationary AVI (NS-AVI) algorithm proposed there works almost exactly like AVI: it builds the exact same sequence of value-policy pairs from any initialization $v_0$ (compare with AMPI with $m = 0$):

$$\pi_{k+1} = G(v_k) \qquad \text{(greedy step)}$$
$$v_{k+1} = T_{\pi_{k+1}} v_k + \epsilon_{k+1} \qquad \text{(evaluation step)}$$

The only difference is in what is output: while AVI would return the last policy, say $\pi_k$ after k iterations, NS-AVI returns the periodic non-stationary policy $\pi_{k,\ell}$ that loops in reverse order over the last ℓ generated policies:

$$\pi_{k,\ell} = \underbrace{\pi_k\, \pi_{k-1}\, \cdots\, \pi_{k-\ell+1}}_{\text{last } \ell \text{ policies}}\;\underbrace{\pi_k\, \pi_{k-1}\, \cdots\, \pi_{k-\ell+1}}_{\text{last } \ell \text{ policies}}\;\cdots$$

Following the policy $\pi_{k,\ell}$ means that the first action is selected by $\pi_k$, the second one by $\pi_{k-1}$, until the ℓ-th one by $\pi_{k-\ell+1}$; then the policy loops and the next actions are selected by $\pi_k$, $\pi_{k-1}$, and so on and so forth. Note that when $\ell = 1$, we recover the output of AVI: the last policy $\pi_k$, which is used for all actions.
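The following small sketch (illustrative names, not the authors' code) shows how the periodic policy $\pi_{k,\ell}$ returned by NS-AVI can be represented: the last ℓ policies are stored, and actions along a trajectory are selected by cycling over them in reverse order of computation.

```python
# Sketch: representing and executing the periodic non-stationary policy
# pi_{k,l}, which repeats the last l computed policies pi_k, pi_{k-1}, ...
from collections import deque

def make_cyclic_policy(policies, l):
    """Return a function (time-step t, state x) -> action for pi_{k,l}.

    `policies` holds the policies in the order they were computed; only the
    last l are used, the most recent one acting first.
    """
    cycle = list(policies)[-l:][::-1]      # pi_k, pi_{k-1}, ..., pi_{k-l+1}
    return lambda t, x: cycle[t % l][x]

# Toy usage: three iterations of some algorithm produced these policies
# (arrays mapping each of 4 states to an action); only the last l = 2 matter.
policies = deque(maxlen=2)
for pi in ([0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]):
    policies.append(pi)

pi_kl = make_cyclic_policy(policies, l=2)
print([pi_kl(t, 1) for t in range(6)])     # actions in state 1: 1, 0, 1, 0, ...
```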
To describe the other algorithm proposed by Scherrer & Lesner (2012), Non-Stationary API (NS-API), we shall introduce the linear Bellman operator $T_{\pi_{k,\ell}}$ associated with $\pi_{k,\ell}$:

$$\forall v \in \mathbb{R}^X,\quad T_{\pi_{k,\ell}} v = T_{\pi_k} T_{\pi_{k-1}} \cdots T_{\pi_{k-\ell+1}} v.$$

It is indeed straightforward to show that the value $v_{\pi_{k,\ell}}$ obtained by following $\pi_{k,\ell}$ is the unique fixed point of $T_{\pi_{k,\ell}}$. Then, from any initial set of ℓ policies $(\pi_0, \pi_{-1}, \ldots, \pi_{-\ell+1})$, NS-API generates the following sequence of value-policy pairs:

$$v_k = v_{\pi_{k,\ell}} + \epsilon_k \qquad \text{(evaluation step)}$$
$$\pi_{k+1} = G(v_k) \qquad \text{(greedy step)}$$

While computing the value $v_k$ requires (approximately) solving the fixed point equation $v_{\pi_{k,\ell}} = T_{\pi_{k,\ell}} v_{\pi_{k,\ell}}$ of the non-stationary policy $\pi_{k,\ell}$ made of the last ℓ computed policies, the new policy $\pi_{k+1}$ that is computed in the greedy step is (as usual) a simple stationary policy. After k iterations, similarly to NS-AVI, the algorithm returns the periodic non-stationary policy $\pi_{k,\ell}$. Here again, setting $\ell = 1$ recovers the standard API algorithm.

On the one hand, using these non-stationary variants may require more memory, since one must store ℓ policies instead of one. On the other hand, the following result shows that this extra memory allows us to improve the performance guarantee.

Theorem 2 (Scherrer & Lesner (2012, Theorems 2 and 4)). Consider NS-AVI or NS-API with any parameter $\ell \ge 1$. Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_\infty < \epsilon$ for all k. Then, the loss due to running the non-stationary policy $\pi_{k,\ell}$ instead of the optimal policy $\pi_*$ satisfies

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty \le \frac{2(\gamma - \gamma^k)}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon + \gamma^k g_0,$$

where $g_0 = \frac{2}{1-\gamma^\ell}\,\|v_* - v_0\|_\infty$ for NS-AVI and $g_0 = \|v_* - v_{\pi_{0,\ell}}\|_\infty$ for NS-API.

For any $\ell \ge 1$, this bound is a factor $\frac{1-\gamma}{1-\gamma^\ell}$ smaller than that of Theorem 1. Using $\ell = \left\lceil \frac{1}{1-\gamma} \right\rceil$ yields⁵ an asymptotic performance bound of $\frac{3.164\,\gamma}{1-\gamma}\,\epsilon$, which constitutes an improvement of order $O\!\left(\frac{1}{1-\gamma}\right)$; this is significant in typical situations where γ is close to 1.

⁵ Using the facts that $1 - \gamma \le -\log \gamma$ and $\log \gamma \le 0$, we have $\log \gamma^\ell = \ell \log \gamma \le \frac{1}{1-\gamma}\log\gamma \le \frac{\log\gamma}{-\log\gamma} = -1$, hence $\gamma^\ell \le e^{-1}$. Therefore $\frac{2}{1-\gamma^\ell} \le \frac{2}{1-e^{-1}} < 3.164$.

4. Main results

We are now ready to present the first contribution of this paper. We shall introduce a new algorithm, Non-Stationary AMPI (NS-AMPI), that generalizes NS-AVI and NS-API (in the same way the standard AMPI algorithm generalizes standard AVI and API) and AMPI (in the same way NS-VI and NS-PI respectively generalize AVI and API). Given some free parameters $m \ge 0$ and $\ell \ge 1$, an arbitrary value function $v_0$ and an arbitrary set of $\ell - 1$ policies $\pi_0, \pi_{-1}, \ldots, \pi_{-\ell+2}$, consider the algorithm that builds a sequence of value-policy pairs as follows:

$$\pi_{k+1} = G(v_k) \qquad \text{(greedy step)}$$
$$v_{k+1} = (T_{\pi_{k+1},\ell})^m\, T_{\pi_{k+1}} v_k + \epsilon_{k+1} \qquad \text{(evaluation step)}$$

While the greedy step is identical to that of all the algorithms above, the evaluation step involves the non-stationary Bellman operator $T_{\pi_{k+1},\ell}$ (composed with itself m times) that we introduced in the previous section, composed with the standard Bellman operator $T_{\pi_{k+1}}$. As in NS-AVI and NS-API, after k iterations, the output of the algorithm is the periodic non-stationary policy $\pi_{k,\ell}$. For the values $m = 0$ and $m = \infty$, it is easy to see that one respectively recovers NS-AVI and NS-API. When $\ell = 1$, one recovers AMPI (which itself generalizes the standard AVI and API algorithms, obtained if we further set respectively $m = 0$ and $m = \infty$).

At this point, a natural question is whether the previous sensitivity results extend to this more general setting. As the following original result states, the answer is yes.

Theorem 3. Consider NS-AMPI with any parameters $m \ge 0$ and $\ell \ge 1$.
Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_\infty < \epsilon$ for all k. Then, the loss due to running policy $\pi_{k,\ell}$ instead of the optimal policy $\pi_*$ satisfies

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty \le \frac{2(\gamma - \gamma^k)}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon + \frac{2\gamma^k}{1-\gamma^\ell}\,\|v_* - v_0\|_\infty.$$

Theorem 3 asymptotically generalizes both Theorem 1 for $\ell > 1$ (the bounds match when $\ell = 1$) and Theorem 2 for $m > 0$ (the bounds are very close when $m = 0$ or $m = \infty$). As already observed for AMPI, it is remarkable that this performance bound is independent of m.

The second main result of this article is that the bound of Theorem 3 is tight, in the precise sense formalized by the following theorem.

Theorem 4. For all parameter values $m \ge 0$ and $\ell \ge 1$, for all discount factors γ, and for all $\epsilon > 0$, there exist an MDP instance, an initial value function $v_0$, a set of initial policies $\pi_0, \pi_{-1}, \ldots, \pi_{-\ell+2}$ and a sequence of error terms $(\epsilon_k)_{k \ge 1}$ satisfying $\|\epsilon_k\|_\infty \le \epsilon$, such that for all iterations k, the bound of Theorem 3 is satisfied with equality.

This theorem generalizes the (separate) tightness results for PI ($m = \infty$, $\ell = 1$) (Bertsekas & Tsitsiklis, 1996) and for VI ($m = 0$, $\ell = 1$) (Scherrer & Lesner, 2012), which are the only such results we are aware of. To our knowledge, this result is new even for the standard AMPI algorithm (m arbitrary but $\ell = 1$), and for the non-trivial instances of NS-VI ($m = 0$, $\ell > 1$) and NS-PI ($m = \infty$, $\ell > 1$) proposed by Scherrer & Lesner (2012).

Since it is well known that there exists an optimal policy that is stationary, our result, as well as those of Scherrer & Lesner (2012) suggesting to consider non-stationary policies, may appear strange. There exists, however, a very simple approximation scheme of discounted infinite-horizon control problems, to our knowledge never documented in the literature, that sheds some light on the deep reason why non-stationary policies may be an interesting option. Given an infinite-horizon problem, consider approximating it by a finite-horizon discounted control problem by cutting the horizon after some sufficiently big instant T (that is, assume there is no reward after time T). Contrary to the original infinite-horizon problem, the resulting finite-horizon problem is non-stationary, and therefore naturally has a non-stationary solution that is built by dynamic programming in reverse order. Moreover, it can be shown (by adapting the proof of Theorem 2.5.1 of Kakade, 2003) that solving this finite-horizon problem with VI, with a potential error of ϵ at each iteration, will induce at most a performance error of $2\sum_{t=0}^{T-1}\gamma^t \epsilon = \frac{2(1-\gamma^T)}{1-\gamma}\,\epsilon$. If we add the error due to truncating the horizon ($\gamma^T \max_{s,a}|r(s,a)|$), we get an overall error of order $O\!\left(\frac{1}{1-\gamma}\epsilon\right)$ for a memory T of the order of⁶ $O\!\left(\frac{1}{1-\gamma}\right)$. Though this approximation scheme may require a significant amount of memory (when γ is close to 1), it achieves the same $O\!\left(\frac{1}{1-\gamma}\right)$ improvement over the performance bound of AVI/API/AMPI as NS-AVI/NS-API/NS-AMPI do. In comparison, the non-stationary algorithms with a fixed period ℓ can be seen as a more flexible way to make the trade-off between memory and quality.

⁶ We use the fact that $\gamma^T K < \frac{\epsilon}{1-\gamma}$ as soon as $T > \frac{1}{1-\gamma}\log\frac{(1-\gamma)K}{\epsilon}$, with $K = \max_{s,a}|r(s,a)|$.
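Before moving to the analysis, here is a minimal tabular sketch of the NS-AMPI iteration defined above, on a small random MDP. The MDP, the noise and the helper names are placeholder assumptions of this illustration (the point is only the structure of the greedy and evaluation steps); it is not the authors' implementation.

```python
# Sketch of NS-AMPI: pi_{k+1} = G(v_k), then
# v_{k+1} = (T_{pi_{k+1},l})^m T_{pi_{k+1}} v_k + eps_{k+1},
# where T_{pi,l} composes the Bellman operators of the last l policies.
from collections import deque
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, m, l, eps, n_iter = 6, 3, 0.9, 2, 3, 0.01, 60

P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)            # P[a, x, y] = P(y | x, a)
r = rng.random((S, A))                       # simplified reward r(x, a)

def T(pi, v):
    return r[np.arange(S), pi] + gamma * P[pi, np.arange(S)] @ v

def T_ns(last_policies, v):
    """Non-stationary operator T_{pi,l}: apply the Bellman operators of the
    last l policies, oldest first (innermost), most recent last (outermost)."""
    for pi in reversed(last_policies):       # last_policies: most recent first
        v = T(pi, v)
    return v

def greedy(v):
    q = r + gamma * np.einsum("axy,y->xa", P, v)
    return q.argmax(axis=1)

v = np.zeros(S)
last = deque(maxlen=l)                       # stores the last l policies
for _ in range(l - 1):                       # arbitrary initial l-1 policies
    last.appendleft(greedy(v))
for k in range(n_iter):
    pi = greedy(v)                           # greedy step
    last.appendleft(pi)                      # pi_{k+1} becomes the most recent
    v = T(pi, v)                             # innermost T_{pi_{k+1}}
    for _ in range(m):                       # then (T_{pi_{k+1},l})^m
        v = T_ns(last, v)
    v += rng.uniform(-eps, eps, size=S)      # error term
print("value estimate:", np.round(v, 2))
```

Setting m = 0 skips the inner loop and recovers the NS-AVI update, letting m grow approaches the NS-API evaluation, and l = 1 makes $T_{\pi,\ell}$ coincide with $T_\pi$, recovering AMPI.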
5. Proof sketches

We begin by considering Theorem 3. While the performance guarantee was obtained through three independent proofs for NS-VI, NS-PI and AMPI, the more general setting that we consider here admits a totally unified proof, which we describe in the remainder of this section.

We write $P_k$ (resp. $P_*$) for the transition kernel $P_{\pi_k}$ (resp. $P_{\pi_*}$) induced by the stationary policy $\pi_k$ (resp. $\pi_*$), and $T_k$ (resp. $T_*$) for the associated Bellman operator. Similarly, we write $P_{k,\ell}$ for the transition kernel associated with the non-stationary policy $\pi_{k,\ell}$ and $T_{k,\ell}$ for its associated Bellman operator. For $k \ge 0$ we define the following quantities:

$$b_k = T_{k+1} v_k - T_{k+1,\ell}\, T_{k+1} v_k, \qquad s_k = v_k - v_{\pi_{k,\ell}} - \epsilon_k, \qquad d_k = v_* - v_k + \epsilon_k, \qquad l_k = v_* - v_{\pi_{k,\ell}}.$$

The last quantity, the loss $l_k$ of using policy $\pi_{k,\ell}$ instead of $\pi_*$, is the quantity we ultimately want to upper bound. The core of the proof consists in deriving the following recursive relations.

Lemma 1. The quantities $b_k$, $s_k$ and $d_k$ satisfy:

$$b_k \le \gamma P_{k+1}\left[(\gamma^\ell P_{k,\ell})^m b_{k-1} + (I - \gamma^\ell P_{k,\ell})\,\epsilon_k\right],$$
$$d_k \le \gamma P_* d_{k-1} - \gamma P_* \epsilon_{k-1} + \sum_{i=0}^{m-1} (\gamma^\ell P_{k,\ell})^i\, b_{k-1},$$
$$s_k = (\gamma^\ell P_{k,\ell})^m \sum_{j=0}^{\infty} (\gamma^\ell P_{k,\ell})^j \left(T_k v_{k-1} - T_{k,\ell}\, T_k v_{k-1}\right).$$

Since ϵ is a uniform upper bound on the pointwise absolute value of the errors $|\epsilon_k|$, the first inequality implies that $b_k \le O(\epsilon)$, and as a result, the second and third relations give us $d_k \le O(\epsilon)$ and $s_k \le O(\epsilon)$. This means that the loss $l_k = d_k + s_k$ will also satisfy $l_k \le O(\epsilon)$, and the result is obtained by taking the norm $\|\cdot\|_\infty$. The actual bound given in the theorem requires a careful expansion of these three relations, where we make precise what we have just hidden in the O-notation. The details of this expansion are tedious and deferred to Appendix B of the Supplementary Material. We thus now concentrate on the proof of these relations.

Proof of Lemma 1. We will repeatedly use the fact that since policy $\pi_{k+1}$ is greedy with respect to $v_k$, we have

$$\forall \pi',\quad T_{k+1} v_k \ge T_{\pi'} v_k. \qquad (2)$$

For a non-stationary policy $\pi_{k,\ell}$, the induced ℓ-step transition kernel is $P_{k,\ell} = P_k P_{k-1} \cdots P_{k-\ell+1}$. As a consequence, for any function $f : S \to \mathbb{R}$, the operator $T_{k,\ell}$ may be expressed as

$$T_{k,\ell} f = r_k + \gamma P_{k,1} r_{k-1} + \gamma^2 P_{k,2} r_{k-2} + \cdots + \gamma^\ell P_{k,\ell} f,$$

and, for any functions $f, g : S \to \mathbb{R}$, we have

$$T_{k,\ell} f - T_{k,\ell}\, g = \gamma^\ell P_{k,\ell} (f - g) \qquad (3)$$

and

$$T_{k,\ell}(f + g) = T_{k,\ell} f + \gamma^\ell P_{k,\ell}\, g. \qquad (4)$$

Let us now bound $b_k$. We have

$$\begin{aligned}
b_k &= T_{k+1} v_k - T_{k+1,\ell}\, T_{k+1} v_k\\
&\overset{\text{Eq.}(2)}{\le} T_{k+1} v_k - T_{k+1,\ell}\, T_{k-\ell+1} v_k\\
&= T_{k+1} v_k - T_{k+1} T_{k,\ell} v_k\\
&= \gamma P_{k+1}\left(v_k - T_{k,\ell} v_k\right)\\
&= \gamma P_{k+1}\left((T_{k,\ell})^m T_k v_{k-1} + \epsilon_k - T_{k,\ell}\left((T_{k,\ell})^m T_k v_{k-1} + \epsilon_k\right)\right)\\
&\overset{\text{Eq.}(4)}{=} \gamma P_{k+1}\left((T_{k,\ell})^m T_k v_{k-1} - (T_{k,\ell})^{m+1} T_k v_{k-1} + (I - \gamma^\ell P_{k,\ell})\,\epsilon_k\right)\\
&\overset{\text{Eq.}(3)}{=} \gamma P_{k+1}\left((\gamma^\ell P_{k,\ell})^m\left(T_k v_{k-1} - T_{k,\ell}\, T_k v_{k-1}\right) + (I - \gamma^\ell P_{k,\ell})\,\epsilon_k\right)\\
&= \gamma P_{k+1}\left((\gamma^\ell P_{k,\ell})^m b_{k-1} + (I - \gamma^\ell P_{k,\ell})\,\epsilon_k\right).
\end{aligned}$$

We now turn to the bound on $d_k$:

$$\begin{aligned}
d_k &= v_* - v_k + \epsilon_k = v_* - (T_{k,\ell})^m T_k v_{k-1}\\
&= v_* - T_k v_{k-1} + \sum_{i=0}^{m-1}\left((T_{k,\ell})^i T_k v_{k-1} - (T_{k,\ell})^{i+1} T_k v_{k-1}\right)\\
&\overset{\text{Eq.}(3)}{=} T_* v_* - T_k v_{k-1} + \sum_{i=0}^{m-1} (\gamma^\ell P_{k,\ell})^i\left(T_k v_{k-1} - T_{k,\ell}\, T_k v_{k-1}\right)\\
&\overset{\text{Eq.}(2)}{\le} T_* v_* - T_* v_{k-1} + \sum_{i=0}^{m-1} (\gamma^\ell P_{k,\ell})^i b_{k-1}\\
&\overset{\text{Eq.}(3)}{=} \gamma P_* (v_* - v_{k-1}) + \sum_{i=0}^{m-1} (\gamma^\ell P_{k,\ell})^i b_{k-1}\\
&= \gamma P_* d_{k-1} - \gamma P_* \epsilon_{k-1} + \sum_{i=0}^{m-1} (\gamma^\ell P_{k,\ell})^i b_{k-1}.
\end{aligned}$$

Finally, we prove the relation for $s_k$:

$$\begin{aligned}
s_k &= v_k - v_{\pi_{k,\ell}} - \epsilon_k = (T_{k,\ell})^m T_k v_{k-1} - v_{\pi_{k,\ell}}\\
&= (T_{k,\ell})^m T_k v_{k-1} - (T_{k,\ell})^\infty\, T_{k,\ell}\, T_k v_{k-1}\\
&= (\gamma^\ell P_{k,\ell})^m \sum_{j=0}^{\infty} (\gamma^\ell P_{k,\ell})^j\left(T_k v_{k-1} - T_{k,\ell}\, T_k v_{k-1}\right).
\end{aligned}$$

We now turn to the tightness result given in Theorem 4. The proof considers a generalization of the MDP instance used to prove the tightness of the bound for VI (Scherrer & Lesner, 2012) and PI (Bertsekas & Tsitsiklis, 1996, Example 6.4). Precisely, this MDP consists of states $\{1, 2, \ldots\}$ and two actions: left (←) and right (→); the reward function r and transition kernel P are characterized as follows for any state $i \ge 2$:

$$r(i, \rightarrow) = 0, \qquad r(i, \leftarrow) = -\frac{2(\gamma - \gamma^i)}{1-\gamma}\,\epsilon, \qquad P(i \mid i+1, \rightarrow) = 1, \qquad P(i+\ell-1 \mid i, \leftarrow) = 1,$$

and $r(1) = 0$ and $P(1 \mid 1) = 1$ for state 1 (all the other transitions having zero probability mass).

Figure 1. The deterministic MDP matching the bound of Theorem 3 (states $1, 2, 3, \ldots, \ell, \ell+1, \ell+2, \ldots$; left-action rewards $-2\gamma\epsilon$, $-2(\gamma+\gamma^2)\epsilon$, ...; all other rewards 0).
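To make the construction concrete, the following sketch builds the transition and reward tables of the MDP of Figure 1 for a finite number of states. The truncation to n states and the explicit reward formula $r_i = -\frac{2(\gamma-\gamma^i)}{1-\gamma}\epsilon$ (reconstructed from the description and figure above) are assumptions of this illustration.

```python
# Sketch of the deterministic MDP of Figure 1 (illustrative, finite version).
# Action "right" moves i -> i-1 with zero reward; action "left" moves
# i -> i+l-1 with reward r_i = -2*(gamma - gamma**i)/(1 - gamma)*eps
# (a reconstruction, equal to -2*(gamma + ... + gamma**(i-1))*eps); state 1
# is absorbing with zero reward.
import numpy as np

def build_figure1_mdp(n, l, gamma, eps):
    RIGHT, LEFT = 0, 1
    nxt = np.zeros((n + 1, 2), dtype=int)    # nxt[i, a]: next state, 1-based
    rew = np.zeros((n + 1, 2))               # rew[i, a]: reward
    nxt[1, :] = 1                            # state 1 is absorbing
    for i in range(2, n + 1):
        nxt[i, RIGHT] = i - 1
        nxt[i, LEFT] = min(i + l - 1, n)     # clipped at n in this finite copy
        rew[i, LEFT] = -2 * (gamma - gamma ** i) / (1 - gamma) * eps
    return nxt, rew

nxt, rew = build_figure1_mdp(n=10, l=3, gamma=0.9, eps=1.0)
print(rew[2, 1], rew[3, 1])   # -2*gamma*eps and -2*(gamma + gamma^2)*eps
```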
As a shortcut, we will use the notation $r_i$ for the non-zero reward $r(i, \leftarrow)$ in state i. Figure 1 depicts the general structure of this MDP. It is easily seen that the optimal policy $\pi_*$ is to take → in all states $i \ge 2$, as doing otherwise would incur a negative reward. Therefore, the optimal value $v_*(i)$ is 0 in all states i.

The proof of the above theorem considers that we run NS-AMPI with $v_0 = v_* = 0$, $\pi_0 = \pi_{-1} = \cdots = \pi_{-\ell+2} = \pi_*$, and the following sequence of error terms:

$$\epsilon_k(i) = \begin{cases} -\epsilon & \text{if } i = k,\\ \epsilon & \text{if } i = k+\ell,\\ 0 & \text{otherwise.} \end{cases}$$

In such a case, one can prove that the sequence of policies $\pi_1, \pi_2, \ldots, \pi_k$ generated up to iteration k is such that for all $i \le k$, the policy $\pi_i$ takes → in all states but i, where it takes ←. As a consequence, the non-stationary policy $\pi_{k,\ell}$ built from this sequence takes ← in state k (as dictated by $\pi_k$), which transfers the system into state $k+\ell-1$, incurring a reward of $r_k$. Then the policies $\pi_{k-1}, \pi_{k-2}, \ldots, \pi_{k-\ell+1}$ are followed, each indicating to take → with 0 reward. After ℓ steps, the system is again in state k and, by the periodicity of the policy, must again use the action $\pi_k(k) = \leftarrow$. The system is thus stuck in a loop, where every ℓ steps a negative reward of $r_k$ is received. Consequently, the value of this policy from state k is

$$v_{\pi_{k,\ell}}(k) = \frac{r_k}{1-\gamma^\ell} = -\frac{\gamma - \gamma^k}{(1-\gamma)(1-\gamma^\ell)}\,2\epsilon.$$

As a consequence, we get the following lower bound,

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty \ge |v_{\pi_{k,\ell}}(k)| = \frac{\gamma - \gamma^k}{(1-\gamma)(1-\gamma^\ell)}\,2\epsilon,$$

which exactly matches the upper bound of Theorem 3 (since $v_0 = v_* = 0$). The proof of this result involves computing the values $v_k(i)$ for all states i, all steps k of the algorithm, and all values of the parameters m and ℓ, and proving that the policies $\pi_{k+1}$ that are greedy with respect to these values satisfy what we have described above. Due to lack of space, the complete proof is deferred to Appendix B of the Supplementary Material; in Lemma 7 and the associated Figures 4 and 5 there, note the quite complex shape of the value function that is induced by the cyclic nature of the MDP and the NS-AMPI algorithm.

6. Empirical Illustration

In this section, we describe an empirical illustration of the new algorithm NS-AMPI. Note that the goal here is not to convince the reader that the new degrees of freedom for approximate dynamic programming may be interesting in difficult real control problems (we leave this important question to future work) but rather to give some insight, on a small, artificial and well-controlled problem, into the effect of the main parameters m and ℓ.

The problem we consider, the dynamic location problem from Bertsekas & Yu (2012), involves a repairman moving between n sites according to some transition probabilities. To allow him to do his work, a trailer containing supplies for the repair jobs can be relocated to any of the sites at each decision epoch. The problem consists in finding a relocation policy for the trailer, according to the repairman's and trailer's positions, which maximizes the discounted expectation of a reward function. Given n sites, the state space has $n^2$ states comprising the locations of both the repairman and the trailer. There are n actions, each one corresponding to a possible destination of the trailer. Given an action $a = 1, \ldots, n$ and a state $s = (s_r, s_t)$, where the repairman and the trailer are at locations $s_r$ and $s_t$ respectively, we define the reward as

$$r(s, a) = -|s_r - s_t| - |s_t - a|/2.$$
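A small sketch of the reward just defined, with states encoded as pairs $(s_r, s_t)$ of site indices. The encoding and the reading of the formula as a negated cost are assumptions of this illustration.

```python
# Sketch: state encoding and reward of the dynamic location problem.
# A state is a pair (s_r, s_t) in {1..n}^2 and an action a in {1..n} is the
# trailer's destination; the reward penalizes the repairman-trailer distance
# and half the trailer relocation distance.
from itertools import product

n = 8
states = list(product(range(1, n + 1), repeat=2))   # all (s_r, s_t) pairs
actions = range(1, n + 1)

def reward(state, a):
    s_r, s_t = state
    return -abs(s_r - s_t) - abs(s_t - a) / 2.0

print(len(states))            # n^2 = 64 states
print(reward((3, 5), 1))      # repairman at 3, trailer at 5, move trailer to 1
```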
At any time-step the repairman moves from his location $s_r < n$ with uniform probability to any location $s_r \le s'_r \le n$; when $s_r = n$, he moves to site 1 with probability 0.75 and otherwise stays. Since the trailer moves are deterministic, the transition kernel is

$$P\big((s'_r, a) \mid (s_r, s_t), a\big) = \begin{cases} \frac{1}{n - s_r + 1} & \text{if } s_r < n \text{ and } s_r \le s'_r \le n,\\[2pt] 0.75 & \text{if } s_r = n \text{ and } s'_r = 1,\\[2pt] 0.25 & \text{if } s_r = n \text{ and } s'_r = n, \end{cases}$$

and 0 everywhere else.

We evaluated the empirical performance gain of using non-stationary policies by implementing the algorithm with random error vectors $\epsilon_k$, each component being uniformly random between 0 and some user-supplied value ϵ. The adjustable size (with n) of the state and action spaces allowed us to compute an optimal policy to compare with the approximate ones generated by NS-AMPI for all combinations of parameters $\ell \in \{1, 2, 5, 10\}$ and $m \in \{1, 2, 5, 10, 25, \infty\}$. Recall that the cases $m = 1$ and $m = \infty$ correspond respectively to NS-VI and NS-PI, while the case $\ell = 1$ corresponds to AMPI. We used $n = 8$ locations, $\gamma = 0.98$ and $\epsilon = 4$ in all experiments.

Figure 2 shows the average value of the error $\|v_* - v_{\pi_{k,\ell}}\|$ per iteration for the different values of the parameters m and ℓ. For each parameter combination, the results are obtained by averaging over 250 runs.

Figure 2. Average error of policy $\pi_{k,\ell}$ per iteration k of NS-AMPI. Red lines for $\ell = 1$, yellow for $\ell = 2$, green for $\ell = 5$ and blue for $\ell = 10$.

While higher values of ℓ impact computational efficiency (by a factor O(ℓ)), they always result in better asymptotic performance. Especially for the lower values of m, a higher ℓ allows for faster convergence. As m increases, this trend fades and is eventually reversed in favor of faster convergence for small ℓ. However, while small ℓ converges faster, it does so with a greater error after convergence than higher ℓ. It can be seen that convergence is attained shortly after the ℓ-th iteration, which can be explained by the fact that the first policies (involving $\pi_0, \pi_{-1}, \ldots, \pi_{-\ell+2}$) are of poor quality and the algorithm must perform at least ℓ iterations to push them out of $\pi_{k,\ell}$.

We conducted a second experiment to study the relative influence of the parameters ℓ and m. From the observation that, in the very setting we are considering, the time complexity of an iteration of NS-AMPI can be roughly summarized by the number $\ell m + 1$ of applications of a stationary policy's Bellman operator, we ran the algorithm for fixed values of the product ℓm and measured the asymptotic policy error for varying values of ℓ after 150 iterations. These results are depicted in Figure 3.

Figure 3. Policy error and standard deviation after 150 iterations for different values of ℓ. Each plot represents a fixed value of the product ℓm. Data is collected over 250 runs with n = 8.

This setting gives insight on how to set both parameters for a given time budget ℓm. While runs with a lower ℓ are slightly faster to converge, higher values always give the best policies after a sufficient number of iterations, and greatly reduce the variance across runs, showing that non-stationarity adds robustness to the approximation noise. Regarding asymptotic quality, it thus appears that the best setting is to favor ℓ over m.
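For completeness, here is a sketch of the repairman transition distribution described above (site indices are 1-based in this illustration; the trailer part of the transition is deterministic and equal to the chosen destination).

```python
# Sketch: the repairman's transition distribution of the dynamic location
# problem. From s_r < n he jumps uniformly to any site in {s_r, ..., n};
# from s_r = n he goes to site 1 with probability 0.75 and stays otherwise.
import numpy as np

def repairman_next_distribution(s_r, n):
    p = np.zeros(n + 1)                      # index 0 unused (1-based sites)
    if s_r < n:
        p[s_r:n + 1] = 1.0 / (n - s_r + 1)
    else:
        p[1], p[n] = 0.75, 0.25
    return p

n = 8
print(repairman_next_distribution(3, n))     # uniform over sites 3..8
print(repairman_next_distribution(8, n))     # 0.75 on site 1, 0.25 on site 8
```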
Overall, both experiments confirm our theoretical analysis that the main parameter for asymptotic quality is ℓ. Regarding the rate of convergence, the first experiment suggests that too big values of ℓ may be harmful. In practice, a schedule where ℓ progressively grows while m decreases may provide the best compromise. Confirming this, as well as studying approximate implementations designed for real problems, constitutes a matter for future investigation.

7. Conclusion

We have described a new dynamic-programming scheme, NS-AMPI, that extends and unifies several state-of-the-art algorithms of the literature: AVI, API, AMPI, NS-VI, and NS-PI. NS-AMPI has two integer parameters: $m \ge 0$, which allows one to move from a VI-style update to a PI-style update, and $\ell \ge 1$, which characterizes the period of the non-stationary policy that it builds. In Theorem 3, we have provided a performance guarantee for this algorithm that is independent of m and that improves when ℓ increases; since ℓ directly controls the memory of the process, this allows one to make a trade-off between memory and quality. In the literature, similar upper bounds were only known for AMPI (Scherrer et al., 2012) ($\ell = 1$ and m arbitrary) and for NS-AVI/NS-API (Scherrer & Lesner, 2012) (ℓ arbitrary but $m \in \{0, \infty\}$). For most settings ($\ell > 1$ and $1 \le m < \infty$) the result is new. By exhibiting a specially designed MDP, we argued (Theorem 4) that our analysis is tight. Similar lower bounds were only known for AVI and API ($\ell = 1$ and $m \in \{0, \infty\}$). In other words, we have generalized the scarce existing bounds in a unified setting and closed the gap between the upper and lower bounds for all values of $m \ge 0$ and $\ell \ge 1$.

A practical limitation of Theorem 3 is that it assumes that the errors $\epsilon_k$ are controlled in max norm. In practice, the evaluation step of a dynamic programming algorithm is usually done through some regression scheme, see for instance (Bertsekas & Tsitsiklis, 1996; Antos et al., 2007a;b; Scherrer et al., 2012), and is thus controlled through the $L_{2,\mu}$ norm, defined as $\|f\|_{2,\mu} = \sqrt{\int f(x)^2\,\mu(dx)}$. Munos (2003; 2007) originally developed such analyses for AVI and API. Farahmand et al. (2010) and Scherrer et al. (2012) later improved them. Using a technical lemma due to Scherrer et al. (2012, Lemma 3), one can easily deduce⁷ from our analysis (developed in Appendix A of the Supplementary Material) the following performance bound.

Corollary 1. Consider NS-AMPI with any parameters $m \ge 0$ and $\ell \ge 1$. Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_{2,\mu} < \epsilon$ for all k. Then, the expected (with respect to some initial measure ρ) loss due to running policy $\pi_{k,\ell}$ instead of the optimal policy $\pi_*$ satisfies

$$E_{s \sim \rho}\left[v_*(s) - v_{\pi_{k,\ell}}(s)\right] \le \frac{2(\gamma - \gamma^k)\, C^{1,k,\ell}}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon + O(\gamma^k),$$

where

$$C^{j,k,\ell} = \frac{(1-\gamma)(1-\gamma^\ell)}{\gamma^j - \gamma^k} \sum_{i=j}^{k-1} \sum_{n=0}^{\infty} \gamma^{i+\ell n}\, c(i+\ell n)$$

is a convex combination of concentrability coefficients based on the Radon-Nikodym derivatives

$$c(j) = \max_{\pi_1, \ldots, \pi_j} \left\| \frac{d(\rho P_{\pi_1} P_{\pi_2} \cdots P_{\pi_j})}{d\mu} \right\|_\infty.$$

With respect to the previous bound in ∞-norm, this bound involves extra constants $C^{j,k,\ell} \ge 1$. Each such coefficient $C^{j,k,\ell}$ is a convex combination of terms c(i), each of which quantifies the difference between 1) the distribution µ used to control the errors and 2) the distribution obtained by starting from ρ and making i steps with an arbitrary sequence of policies. Overall, this extra constant can be seen as a measure of the stochastic smoothness of the MDP (the smoother, the smaller). Further details on these coefficients can be found in (Munos, 2003; 2007; Farahmand et al., 2010).

We have shown in a small numerical study the significant influence of the parameter ℓ on the asymptotic quality of approximately optimal controllers, and suggested that optimizing the speed of convergence may require a fine schedule between ℓ and m.
Instantiating and analyzing specific implementations of NS-AMPI, as was done recently for AMPI (Scherrer et al., 2012), and applying them to large domains constitutes interesting future work.

⁷ Precisely, Lemma 3 of (Scherrer et al., 2012) should be applied to Equation (8) page 15 in Appendix A of the Supplementary Material.

References

Antos, A., Munos, R., and Szepesvári, C. Fitted Q-iteration in continuous action-space MDPs. In NIPS, 2007a.

Antos, A., Szepesvári, C., and Munos, R. Value-iteration based fitted policy iteration: learning with a single trajectory. In Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), pp. 330-337. IEEE, 2007b.

Bertsekas, D.P. and Tsitsiklis, J.N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Bertsekas, D.P. and Yu, H. Q-learning and enhanced policy iteration in discounted dynamic programming. Mathematics of Operations Research, 37(1):66-94, 2012.

Canbolat, P. and Rothblum, U. (Approximate) iterated successive approximations algorithm for sequential decision processes. Annals of Operations Research, pp. 1-12, 2012. ISSN 0254-5330.

Farahmand, A.M., Munos, R., and Szepesvári, Cs. Error propagation for approximate policy and value iteration (extended version). In NIPS, 2010.

Gabillon, V., Ghavamzadeh, M., and Scherrer, B. Approximate dynamic programming finally performs well in the game of Tetris. In Neural Information Processing Systems (NIPS), South Lake Tahoe, United States, December 2013.

Kakade, S.M. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.

Munos, R. Error bounds for approximate policy iteration. In International Conference on Machine Learning (ICML), pp. 560-567, 2003.

Munos, R. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541-561, 2007.

Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

Puterman, M.L. and Shin, M.C. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11):1127-1137, 1978.

Scherrer, B. and Lesner, B. On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. In Advances in Neural Information Processing Systems 25, pp. 1835-1843, 2012.

Scherrer, B., Ghavamzadeh, M., Gabillon, V., and Geist, M. Approximate modified policy iteration. In Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 1207-1214, July 2012.

Singh, S. and Yee, R. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227-233, 1994.