Approximate Policy Iteration Schemes: A Comparison

Bruno Scherrer BRUNO.SCHERRER@INRIA.FR
Inria, Villers-lès-Nancy, F-54600, France
Université de Lorraine, LORIA, UMR 7503, Vandœuvre-lès-Nancy, F-54506, France

Abstract

We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API) (Bertsekas & Tsitsiklis, 1996), Conservative Policy Iteration (CPI) (Kakade & Langford, 2002), a natural adaptation of the Policy Search by Dynamic Programming algorithm (Bagnell et al., 2003) to the infinite-horizon case (PSDP∞), and the recently proposed Non-Stationary Policy Iteration (NSPI(m)) (Scherrer & Lesner, 2012). For all algorithms, we describe performance bounds with respect to the per-iteration error $\epsilon$, and make a comparison by paying particular attention to the concentrability constants involved, the number of iterations and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API, but this comes at the cost of a relative increase of the number of iterations that is exponential in $\frac{1}{\epsilon}$. 2) PSDP∞ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API, which requires a constant memory, the memory needed by CPI and PSDP∞ is proportional to their number of iterations, which may be problematic when the discount factor $\gamma$ is close to 1 or the approximation error $\epsilon$ is close to 0; we show that the NSPI(m) algorithm allows one to make an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis.

1. Introduction

We consider an infinite-horizon discounted Markov Decision Process (MDP) (Puterman, 1994; Bertsekas & Tsitsiklis, 1996) $(S, A, P, r, \gamma)$, where $S$ is a possibly infinite state space, $A$ is a finite action space, $P(ds'|s, a)$, for all $(s, a)$, is a probability kernel on $S$, $r : S \to [-R_{\max}, R_{\max}]$ is a reward function bounded by $R_{\max}$, and $\gamma \in (0, 1)$ is a discount factor. A stationary deterministic policy $\pi : S \to A$ maps states to actions. We write $P_\pi(ds'|s) = P(ds'|s, \pi(s))$ for the stochastic kernel associated to policy $\pi$. The value $v_\pi$ of a policy $\pi$ is a function mapping states to the expected discounted sum of rewards received when following $\pi$ from these states: for all $s \in S$,

$v_\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \,\middle|\, s_0 = s,\ s_{t+1} \sim P_\pi(\cdot|s_t) \right].$

The value $v_\pi$ is clearly bounded by $V_{\max} = R_{\max}/(1-\gamma)$. It is well-known that $v_\pi$ can be characterized as the unique fixed point of the linear Bellman operator associated to a policy $\pi$: $T_\pi : v \mapsto r + \gamma P_\pi v$. Similarly, the Bellman optimality operator $T : v \mapsto \max_\pi T_\pi v$ has as unique fixed point the optimal value $v_* = \max_\pi v_\pi$. A policy $\pi$ is greedy w.r.t. a value function $v$ if $T_\pi v = Tv$; the set of such greedy policies is written $\mathcal{G}v$. Finally, a policy $\pi_*$ is optimal, with value $v_{\pi_*} = v_*$, iff $\pi_* \in \mathcal{G}v_*$, or equivalently $T_{\pi_*} v_* = v_*$. The goal of this paper is to study and compare several approximate Policy Iteration schemes.
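To make these objects concrete in the finite-state case, here is a minimal sketch, assuming a finite MDP stored as NumPy arrays; the array layout `P[a, s, s']`, the state-only reward vector `r`, and all function names are our own illustration, not notation from the paper:

```python
import numpy as np

def bellman_T_pi(v, pi, P, r, gamma):
    """Linear Bellman operator T_pi v = r + gamma * P_pi v."""
    P_pi = P[pi, np.arange(len(pi))]              # row s of P_pi is P(.|s, pi(s))
    return r + gamma * P_pi @ v

def bellman_T(v, P, r, gamma):
    """Bellman optimality operator T v = max_pi T_pi v (componentwise max over actions)."""
    return (r[None, :] + gamma * P @ v).max(axis=0)

def greedy(v, P, r, gamma):
    """One policy of the greedy set G(v): argmax_a of r(s) + gamma * sum_s' P(s'|s,a) v(s')."""
    return (r[None, :] + gamma * P @ v).argmax(axis=0)

def value_of(pi, P, r, gamma):
    """v_pi, the fixed point of T_pi, computed directly as (I - gamma P_pi)^{-1} r."""
    n = len(pi)
    P_pi = P[pi, np.arange(n)]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)
```

With these routines, the defining identity of a greedy policy, $T_\pi v = Tv$, simply reads `bellman_T_pi(v, greedy(v, P, r, gamma), P, r, gamma)` being equal to `bellman_T(v, P, r, gamma)`.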
In the literature, such schemes can be seen as implementing an approximate greedy operator, $\mathcal{G}_\epsilon$, that takes as input a distribution $\nu$ and a function $v : S \to \mathbb{R}$ and returns a policy $\pi$ that is $(\epsilon, \nu)$-approximately greedy with respect to $v$ in the sense that:

$\nu(Tv - T_\pi v) = \nu\left(\max_{\pi'} T_{\pi'} v - T_\pi v\right) \le \epsilon, \qquad (1)$

where for all $x$, $\nu x$ denotes $E_{s \sim \nu}[x(s)]$. In practice, this approximation of the greedy operator can be achieved through an $\ell_p$-regression of the so-called Q-function (the state-action value function): a direct regression is suggested by Kakade & Langford (2002), a fixed-point LSTD approach is used by Lagoudakis & Parr (2003b); or through a (cost-sensitive) classification problem (Lagoudakis & Parr, 2003a; Lazaric et al., 2010). With this operator in hand, we shall describe several Policy Iteration schemes in Section 2. Then Section 3 will provide a detailed comparative analysis of their performance guarantees, time complexities, and memory requirements. Section 4 will go on by providing experiments that illustrate their behavior and confirm our analysis. Finally, Section 5 will conclude and present future work.

2. Algorithms

API. We begin by describing the standard Approximate Policy Iteration (API) (Bertsekas & Tsitsiklis, 1996). At each iteration $k$, the algorithm switches to the policy that is approximately greedy with respect to the value of the previous policy for some distribution $\nu$:

$\pi_{k+1} \in \mathcal{G}_{\epsilon_{k+1}}(\nu, v_{\pi_k}). \qquad (2)$

If there is no error ($\epsilon_k = 0$) and $\nu$ assigns a positive weight to every state, it can easily be seen that this algorithm generates the same sequence of policies as exact Policy Iteration, since from Equation (1) the policies are exactly greedy.

CPI/CPI(α)/API(α). We now turn to the description of Conservative Policy Iteration (CPI) proposed by Kakade & Langford (2002). At iteration $k$, CPI (described in Equation (3)) uses the distribution $d_{\pi_k,\nu} = (1-\gamma)\nu(I - \gamma P_{\pi_k})^{-1}$, the discounted cumulative occupancy measure induced by $\pi_k$ when starting from $\nu$, for calling the approximate greedy operator, and uses a stepsize $\alpha_k$ to generate a stochastic mixture of all the policies that are returned by the successive calls to the approximate greedy operator, which explains the adjective "conservative":

$\pi_{k+1} \leftarrow (1 - \alpha_{k+1})\pi_k + \alpha_{k+1}\,\mathcal{G}_{\epsilon_{k+1}}(d_{\pi_k,\nu}, v_{\pi_k}). \qquad (3)$

The stepsize $\alpha_{k+1}$ can be chosen in such a way that the above step leads to an improvement of the expected value of the policy, given that the process is initialized according to the distribution $\nu$ (Kakade & Langford, 2002). The original article also describes a criterion for deciding whether to stop or to continue. Though the adaptive stepsize and the stopping condition allow one to derive a nice analysis, they are in practice conservative: the stepsize $\alpha_k$ should be implemented with a line-search mechanism, or be fixed to some small value $\alpha$. We will refer to this latter variation of CPI as CPI(α). It is natural to also consider the algorithm API(α) (mentioned by Lagoudakis & Parr (2003a)), a variation of API that is conservative like CPI(α) in the sense that it mixes the new policy with the previous ones with weights $\alpha$ and $1-\alpha$, but that directly uses the distribution $\nu$ in the approximate greedy step:

$\pi_{k+1} \leftarrow (1 - \alpha)\pi_k + \alpha\,\mathcal{G}_{\epsilon_{k+1}}(\nu, v_{\pi_k}). \qquad (4)$

Because it uses $\nu$ instead of $d_{\pi_k,\nu}$, API(α) is simpler to implement than CPI(α)¹.
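The update rules (2)-(4) differ only in which distribution is fed to the approximate greedy operator and in whether the new policy replaces or is mixed with the previous one. A minimal sketch, assuming hypothetical oracles `approx_greedy(v, dist)` (standing for $\mathcal{G}_\epsilon$), `value(policy)` (policy evaluation) and `occupancy(policy, nu)` (computing $d_{\pi,\nu}$), none of which come from the paper, and representing stochastic policies as $S \times A$ arrays (a deterministic policy being one-hot encoded):

```python
import numpy as np

def api_step(pi_k, approx_greedy, value, nu):
    """API (Eq. (2)): full switch to an approximately greedy policy w.r.t. v_{pi_k}."""
    return approx_greedy(value(pi_k), dist=nu)

def cpi_step(pi_k, approx_greedy, value, occupancy, nu, alpha):
    """CPI(alpha) (Eq. (3)): conservative mixture with the greedy policy,
    where the greedy step is controlled by the occupancy measure d_{pi_k, nu}."""
    d_k = occupancy(pi_k, nu)                 # d_{pi_k, nu} = (1-gamma) nu (I - gamma P_{pi_k})^{-1}
    pi_greedy = approx_greedy(value(pi_k), dist=d_k)
    return (1.0 - alpha) * pi_k + alpha * pi_greedy

def api_alpha_step(pi_k, approx_greedy, value, nu, alpha):
    """API(alpha) (Eq. (4)): same mixture, but the greedy step uses nu directly."""
    pi_greedy = approx_greedy(value(pi_k), dist=nu)
    return (1.0 - alpha) * pi_k + alpha * pi_greedy
```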
PSDP∞. We are now going to describe an algorithm that has a flavour similar to API, in the sense that at each step it does a full step towards a new deterministic policy, but also has a conservative flavour like CPI, in the sense that the policies considered evolve more and more slowly. This algorithm is a natural variation of the Policy Search by Dynamic Programming algorithm (PSDP) of Bagnell et al. (2003), originally proposed to tackle finite-horizon problems, to the infinite-horizon case; we thus refer to it as PSDP∞. To the best of our knowledge, however, this variation has never been used in an infinite-horizon context. The algorithm is based on finite-horizon non-stationary policies. Given a sequence of stationary deterministic policies $(\pi_k)$ that the algorithm will generate, we will write $\sigma_k = \pi_k \pi_{k-1} \dots \pi_1$ for the $k$-horizon policy that takes the first action according to $\pi_k$, then the second action according to $\pi_{k-1}$, etc. Its value is $v_{\sigma_k} = T_{\pi_k} T_{\pi_{k-1}} \dots T_{\pi_1} r$. We will write $\emptyset$ for the empty non-stationary policy. Note that $v_\emptyset = r$ and that any infinite-horizon policy that begins with $\sigma_k = \pi_k \pi_{k-1} \dots \pi_1$, which we will (somewhat abusively) denote $\sigma_k\dots$, has a value $v_{\sigma_k\dots} \ge v_{\sigma_k} - \gamma^k V_{\max}$. Starting from $\sigma_0 = \emptyset$, the algorithm implicitly builds a sequence of non-stationary policies $(\sigma_k)$ by iteratively concatenating the policies that are returned by the approximate greedy operator:

$\pi_{k+1} \in \mathcal{G}_{\epsilon_{k+1}}(\nu, v_{\sigma_k}). \qquad (5)$

While the standard PSDP algorithm of Bagnell et al. (2003) considers a horizon $T$ and makes $T$ iterations, the algorithm we consider here has an indefinite number of iterations. The algorithm can be stopped at any step $k$. The theory that we are about to describe suggests that one may return any policy that starts with the non-stationary policy $\sigma_k$. Since $\sigma_k$ is an approximately good finite-horizon policy, and as we consider an infinite-horizon problem, a natural output that one may want to use in practice is the infinite-horizon policy that loops over $\sigma_k$, which we shall denote $(\sigma_k)^\infty$.

¹ In practice, controlling the greedy step with respect to $d_{\pi_k,\nu}$ requires generating samples from this very distribution. As explained by Kakade & Langford (2002), one such sample can be obtained by running one trajectory starting from $\nu$ and following $\pi_k$, stopping at each step with probability $1-\gamma$. In particular, one sample from $d_{\pi_k,\nu}$ requires on average $\frac{1}{1-\gamma}$ samples from the underlying MDP. In this respect, API(α) is much simpler to implement.

From a practical point of view, PSDP∞ and CPI need to store all the (stationary deterministic) policies generated from the start. The memory required by the algorithmic scheme is thus proportional to the number of iterations, which may be prohibitive. The aim of the next paragraph, which presents the last algorithm of this article, is to describe a solution to this potential memory issue.

NSPI(m). We originally devised the algorithmic scheme of Equation (5) (PSDP∞) as a simplified variation of the Non-Stationary PI algorithm with a growing period (NSPI-growing) (Scherrer & Lesner, 2012)². With respect to Equation (5), the only difference of NSPI-growing resides in the fact that the approximate greedy step is done with respect to the value $v_{(\sigma_k)^\infty}$ of the policy that loops infinitely over $\sigma_k$ (formally, the algorithm does $\pi_{k+1} \in \mathcal{G}_{\epsilon_{k+1}}(\nu, v_{(\sigma_k)^\infty})$) instead of the value $v_{\sigma_k}$ of only the first $k$ steps here. Following the intuition that when $k$ is big these two values will be close to each other, we ended up considering PSDP∞ because it is simpler.
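For a finite MDP the quantity $v_{\sigma_k}$ just discussed is easy to compute: it takes $k$ Bellman backups applied to $r$, and the PSDP∞ loop of Equation (5) then reads as follows. This is a sketch under the same array conventions as before, with `approx_greedy(v, nu)` an assumed oracle for $\mathcal{G}_\epsilon$, not the paper's implementation:

```python
import numpy as np

def value_of_sigma(policies, P, r, gamma):
    """v_sigma for sigma = pi_k pi_{k-1} ... pi_1, with `policies` listed oldest first."""
    v = r.copy()                                   # v_emptyset = r
    for pi in policies:                            # apply T_{pi_1}, then T_{pi_2}, ..., T_{pi_k}
        P_pi = P[pi, np.arange(len(pi))]
        v = r + gamma * P_pi @ v
    return v

def psdp_inf(P, r, gamma, approx_greedy, nu, n_iter):
    sigma = []                                     # grows by one stationary policy per iteration
    for _ in range(n_iter):
        v_sigma = value_of_sigma(sigma, P, r, gamma)
        pi_new = approx_greedy(v_sigma, nu)        # Eq. (5)
        sigma.append(pi_new)                       # sigma_{k+1} = pi_{k+1} sigma_k
    return sigma                                   # execute it newest policy first
```

Replacing `value_of_sigma` by the value of the policy that loops over $\sigma_k$ would give the NSPI-growing variant mentioned above.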
NSPI-growing suffers from the same memory drawback as CPI and PSDP∞. Interestingly, the work of Scherrer & Lesner (2012) contains another algorithm, Non-Stationary PI with a fixed period (NSPI(m)), that has a parameter that directly controls the number of policies stored in memory. Similarly to PSDP∞, NSPI(m) is based on non-stationary policies. It takes as input a parameter $m$. It requires a set of $m$ initial deterministic stationary policies $\pi_{-m+1}, \pi_{-m+2}, \dots, \pi_0$ and iteratively generates new policies $\pi_1, \pi_2, \dots$. For any $k \ge 0$, we shall denote $\sigma^m_k$ the $m$-horizon non-stationary policy that runs in reverse order the last $m$ policies, which one may write formally: $\sigma^m_k = \pi_k \pi_{k-1} \dots \pi_{k-m+1}$. Also, we shall denote $(\sigma^m_k)^\infty$ the $m$-periodic infinite-horizon non-stationary policy that loops over $\sigma^m_k$. Starting from $\sigma^m_0 = \pi_0 \pi_{-1} \dots \pi_{-m+1}$, the algorithm iterates as follows:

$\pi_{k+1} \in \mathcal{G}_{\epsilon_{k+1}}(\nu, v_{(\sigma^m_k)^\infty}). \qquad (6)$

Each iteration requires computing an approximate greedy policy $\pi_{k+1}$ with respect to the value $v_{(\sigma^m_k)^\infty}$ of $(\sigma^m_k)^\infty$, that is, the fixed point of the compound operator³: $\forall v,\ T_{k,m} v = T_{\pi_k} T_{\pi_{k-1}} \dots T_{\pi_{k-m+1}} v$. When one goes from iteration $k$ to $k+1$, the process consists in adding $\pi_{k+1}$ at the front of the $(m-1)$-horizon policy $\pi_k \pi_{k-1} \dots \pi_{k-m+2}$, thus forming a new $m$-horizon policy $\sigma^m_{k+1}$. Doing so, we forget about the oldest policy $\pi_{k-m+1}$ of $\sigma^m_k$ and keep a constant memory of size $m$. At any step $k$, the algorithm can be stopped, and the output is the policy $\pi_{k,m} = (\sigma^m_k)^\infty$ that loops on $\sigma^m_k$ (a code sketch of this scheme is given below). It is easy to see that NSPI(m) reduces to API when $m = 1$. Furthermore, if we assume that the reward function is positive, add "stop" actions in every state of the model that lead to a terminal absorbing state with a null reward, and initialize with an infinite sequence of policies that only take this "stop" action, then NSPI(m) with $m = \infty$ reduces to PSDP∞.

² We later realized that it was in fact a very natural variation of PSDP. To give Caesar his due and God his, we kept the older work as the main reference and gave the name PSDP∞.

³ Implementing this algorithm in practice can trivially be done through cost-sensitive classification in a way similar to Lazaric et al. (2010). It could also be done with a straightforward extension of LSTD(λ) to non-stationary policies.

3. Analysis

For all considered algorithms, we are going to describe bounds on the expected loss $E_{s \sim \mu}[v_{\pi_*}(s) - v_\pi(s)] = \mu(v_{\pi_*} - v_\pi)$ of using the (possibly stochastic or non-stationary) policy $\pi$ output by the algorithms instead of the optimal policy $\pi_*$, from some initial distribution $\mu$ of interest, as a function of an upper bound $\epsilon$ on all errors $(\epsilon_k)$. In order to derive these theoretical guarantees, we will first need to introduce a few concentrability coefficients that relate the distribution $\mu$, with which one wants to have a guarantee, and the distribution $\nu$ used by the algorithms⁴.

Definition 1. Let $c(1), c(2), \dots$ be the smallest coefficients in $[1, \infty) \cup \{\infty\}$ such that for all $i$ and all sets of deterministic stationary policies $\pi_1, \pi_2, \dots, \pi_i$, $\mu P_{\pi_1} P_{\pi_2} \dots P_{\pi_i} \le c(i)\nu$. For all $m, k$, we define the following coefficients in $[1, \infty) \cup \{\infty\}$:

$C^{(1,k)} = (1-\gamma)\sum_{i=0}^{\infty} \gamma^i c(i+k), \qquad C^{(2,m,k)} = (1-\gamma)(1-\gamma^m)\sum_{i=0}^{\infty}\sum_{j=0}^{\infty} \gamma^{i+jm} c(i+jm+k).$

Similarly, let $c_{\pi_*}(1), c_{\pi_*}(2), \dots$ be the smallest coefficients in $[1, \infty) \cup \{\infty\}$ such that for all $i$, $\mu(P_{\pi_*})^i \le c_{\pi_*}(i)\nu$. We define:

$C^{(1)}_{\pi_*} = (1-\gamma)\sum_{i=0}^{\infty} \gamma^i c_{\pi_*}(i).$

Finally, let $C_{\pi_*}$ be the smallest coefficient in $[1, \infty) \cup \{\infty\}$ such that $d_{\pi_*,\mu} = (1-\gamma)\mu(I - \gamma P_{\pi_*})^{-1} \le C_{\pi_*}\nu$.
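To fix ideas, the NSPI(m) iteration of Equation (6) described above can be sketched as follows for a finite MDP: a bounded window keeps the last $m$ policies, and $v_{(\sigma^m_k)^\infty}$ is approached by iterating the compound operator $T_{k,m}$, which is a $\gamma^m$-contraction. As before, `approx_greedy(v, nu)` is an assumed oracle, and the sweep count is an arbitrary illustrative choice:

```python
import numpy as np
from collections import deque

def value_of_loop(policies, P, r, gamma, n_sweeps=500):
    """Approximate value of the periodic policy looping over `policies`
    (listed oldest first, i.e. [pi_{k-m+1}, ..., pi_k])."""
    n = P.shape[1]
    v = np.zeros(n)
    for _ in range(n_sweeps):                    # each sweep applies the compound operator T_{k,m} once
        for pi in policies:                      # innermost backup T_{pi_{k-m+1}} first, T_{pi_k} last
            P_pi = P[pi, np.arange(n)]
            v = r + gamma * P_pi @ v
    return v

def nspi(P, r, gamma, approx_greedy, nu, m, n_iter, init_policies):
    window = deque(init_policies, maxlen=m)      # sigma^m_0: the m initial policies, oldest first
    for _ in range(n_iter):
        v = value_of_loop(list(window), P, r, gamma)
        pi_new = approx_greedy(v, nu)            # Eq. (6)
        window.append(pi_new)                    # the oldest policy is dropped automatically
    return list(window)                          # the output loops over these m policies
```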
With these notations in hand, our first contribution is to provide a thorough comparison of all the algorithms. This is done in Table 1. For each algorithm, we describe some performance bounds and the required number of iterations and memory. To make things clear, we only display the dependence with respect to the concentrability constants, the discount factor $\gamma$, the quality $\epsilon$ of the approximate greedy operator, and, if applicable, the main parameters $\alpha$/$m$ of the algorithms. For API(α), CPI(α), CPI and PSDP∞, the required memory matches the number of iterations. All but two bounds are to our knowledge original. The derivation of the new results is given in Appendix A.

⁴ The expected loss corresponds to some weighted $\ell_1$-norm of the loss $v_{\pi_*} - v_\pi$. Relaxing the goal to controlling the weighted $\ell_p$-norm for some $p \ge 2$ allows one to introduce some finer coefficients (Farahmand et al., 2010; Scherrer et al., 2012). Due to lack of space, we do not consider this here.

| Algorithm | Performance Bound | # Iter. | Memory | Reference |
|---|---|---|---|---|
| API (Eq. (2)) | $\frac{C^{(2,1,0)}}{(1-\gamma)^2}\epsilon$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | 1 | (Lazaric et al., 2010) |
| (= NSPI(1)) | $\frac{C^{(1,0)}}{(1-\gamma)^2}\epsilon\log\frac{1}{\epsilon}$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | 1 |  |
| API(α) (Eq. (4)) | $\frac{C^{(1,0)}}{(1-\gamma)^2}\epsilon$ | $\frac{1}{\alpha(1-\gamma)}\log\frac{1}{\epsilon}$ | # Iter. |  |
| CPI(α) | $\frac{C^{(1,0)}}{(1-\gamma)^3}\epsilon$ | $\frac{1}{\alpha(1-\gamma)}\log\frac{1}{\epsilon}$ | # Iter. |  |
| CPI (Eq. (3)) | $\frac{C^{(1,0)}}{(1-\gamma)^3}\epsilon\log\frac{1}{\epsilon}$ | $\frac{1}{1-\gamma}\frac{1}{\epsilon}\log\frac{1}{\epsilon}$ | # Iter. |  |
|  | $\frac{C_{\pi_*}}{(1-\gamma)^2}\epsilon$ | $\frac{\gamma}{\epsilon^2}$ | # Iter. | (Kakade & Langford, 2002) |
| PSDP∞ (Eq. (5)) | $\frac{C_{\pi_*}}{(1-\gamma)^2}\epsilon\log\frac{1}{\epsilon}$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | # Iter. |  |
| (≈ NSPI(∞)) | $\frac{C^{(1)}_{\pi_*}}{1-\gamma}\epsilon$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | # Iter. |  |
| NSPI(m) (Eq. (6)) | $\frac{C^{(2,m,0)}}{(1-\gamma)(1-\gamma^m)}\epsilon$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | $m$ |  |
|  | $\frac{C^{(2,m,0)}}{m(1-\gamma)^2(1-\gamma^m)}\epsilon\log\frac{1}{\epsilon}$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | $m$ |  |
|  | $\left(C^{(1)}_{\pi_*} + \gamma^m\frac{C^{(2,m,m)}}{1-\gamma^m}\right)\frac{1}{1-\gamma}\epsilon$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | $m$ |  |
|  | $\left(C_{\pi_*} + \gamma^m\frac{C^{(2,m,0)}}{m(1-\gamma^m)}\right)\frac{1}{(1-\gamma)^2}\epsilon\log\frac{1}{\epsilon}$ | $\frac{1}{1-\gamma}\log\frac{1}{\epsilon}$ | $m$ |  |

Table 1. Upper bounds on the performance guarantees for the algorithms. Except when references are given, the bounds are to our knowledge new. A comparison of API and CPI based on the two known bounds was done by Ghavamzadeh & Lazaric (2012). The first bound of NSPI(m) can be seen as an adaptation of that provided by Scherrer & Lesner (2012) for the more restrictive $\ell_\infty$-norm setting.

[Figure 1. Hierarchy of the concentrability constants (the visible nodes include $C^{(2,m,m)}$, $C^{(1,m)}$, $C^{(1)}_{\pi_*}$, $C^{(1,0)}$ and $C_{\pi_*}$). A constant A is better than a constant B (see the text for details) if A is a parent of B in the graph. The best constant is $C_{\pi_*}$.]

Our second contribution, complementary to the comparative list of bounds, is to show that there exists a hierarchy among the constants that appear in the bounds of Table 1. In the directed graph of Figure 1, a constant B is a descendant of A if and only if the implication $\{B < \infty \Rightarrow A < \infty\}$ holds⁵. The "if and only if" is important here: it means that if A is a parent of B, and B is not a parent of A, then there exists an MDP for which A is finite while B is infinite; in other words, an algorithm that has a guarantee with respect to A has a guarantee that can be arbitrarily better than that with constant B. Thus, the overall best concentrability constant is $C_{\pi_*}$, while the worst are $C^{(2,1,0)}$ and $C^{(2,m,0)}$. To make the picture complete, we should add that for any MDP and any distribution $\mu$, it is possible to find an input distribution $\nu$ for the algorithm (recall that the concentrability coefficients depend on $\nu$ and $\mu$) such that $C_{\pi_*}$ is finite, though this is not the case for $C^{(1)}_{\pi_*}$ (and, as a consequence, for all the other coefficients). The derivation of these order relations is done in Appendix B.

⁵ Dotted arrows are used to underline the fact that the comparison of coefficients is restricted to the case where the parameter $m$ is finite.
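For a finite MDP in which $\nu$ has full support, the two policy-specific constants of Definition 1 can be computed exactly, which also lets one check numerically the relation $C_{\pi_*} \le C^{(1)}_{\pi_*}$ stated by the hierarchy. A small sketch (the function name and the truncation horizon are our own choices, not from the paper):

```python
import numpy as np

def concentrability(P, pi_star, mu, nu, gamma, horizon=2000):
    """C_{pi*} and C^{(1)}_{pi*} of Definition 1 for a finite MDP (assumes nu > 0 everywhere)."""
    n = len(pi_star)
    P_star = P[pi_star, np.arange(n)]                  # P_{pi*}
    # C_{pi*}: smallest C with d_{pi*,mu} = (1-gamma) mu (I - gamma P_{pi*})^{-1} <= C nu
    d = (1.0 - gamma) * mu @ np.linalg.inv(np.eye(n) - gamma * P_star)
    C_pistar = np.max(d / nu)
    # C^{(1)}_{pi*} = (1-gamma) sum_i gamma^i c_{pi*}(i), with c_{pi*}(i) = max_s (mu P_{pi*}^i)(s)/nu(s)
    C1_pistar, mu_i = 0.0, mu.copy()
    for i in range(horizon):                           # truncated series, accurate for gamma < 1
        C1_pistar += (1.0 - gamma) * gamma**i * np.max(mu_i / nu)
        mu_i = mu_i @ P_star
    return C_pistar, C1_pistar                         # one always has C_pistar <= C1_pistar
```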
The standard API algorithm has guarantees expressed in terms of $C^{(2,1,0)}$ and $C^{(1,0)}$ only. Since CPI's analysis can be done with respect to $C_{\pi_*}$, it has a performance guarantee that can be arbitrarily better than that of API, though the opposite is not true. This, however, comes at the cost of an exponential increase of time complexity, since CPI may require a number of iterations that scales in $O\left(\frac{1}{\epsilon^2}\right)$, while the guarantee of API only requires $O\left(\log\frac{1}{\epsilon}\right)$ iterations. When the analysis of CPI is relaxed so that the performance guarantee is expressed in terms of the (worse) coefficient $C^{(1,0)}$ (obtained also for API), we can slightly improve the rate to $O\left(\frac{1}{\epsilon}\right)$, though it is still exponentially slower than that of API. This second result for CPI was proved with a technique that was also used for CPI(α) and API(α). We conjecture that it can be improved for CPI(α), which should be as good as CPI when $\alpha$ is sufficiently small.

PSDP∞ enjoys two guarantees that have a fast rate like those of API. One bound has a better dependency with respect to $\frac{1}{1-\gamma}$, but is expressed in terms of the worse coefficient $C^{(1)}_{\pi_*}$. The second guarantee is almost as good as that of CPI, since it only contains an extra $\log\frac{1}{\epsilon}$ term, but it has the nice property that it holds quickly with respect to $\epsilon$: in time $O(\log\frac{1}{\epsilon})$ instead of $O(\frac{1}{\epsilon^2})$, that is, exponentially faster. PSDP∞ is thus theoretically better than both CPI (as good but faster) and API (better and as fast).

Now, from a practical point of view, PSDP∞ and CPI need to store all the policies generated from the start. The memory required by these algorithms is thus proportional to the number of iterations. Even if PSDP∞ may require far fewer iterations than CPI, the corresponding memory requirement may still be prohibitive in situations where $\epsilon$ is small or $\gamma$ is close to 1. We explained that NSPI(m) can be seen as making a bridge between API and PSDP∞. Since (i) both have a nice time complexity, (ii) API has the best memory requirement, and (iii) PSDP∞ has the best performance guarantee, NSPI(m) is a good candidate for making a standard performance/memory trade-off. While the first two bounds of NSPI(m) in Table 1 extend those of API, the other two are made of two terms: the left terms are identical to those obtained for PSDP∞, while the two possible right terms are new, but are controlled by $\gamma^m$, which can thus be made arbitrarily small by increasing the memory parameter $m$. Our analysis thus confirms our intuition that NSPI(m) allows one to make a performance/memory trade-off in between API (small memory) and PSDP∞ (best performance). In other words, as soon as memory becomes a constraint, NSPI(m) is the natural alternative to PSDP∞.

4. Experiments

In this section, we present some experiments in order to illustrate the empirical behavior of the different algorithms discussed in the paper. We considered the standard API as a baseline. CPI, as it is described by Kakade & Langford (2002), is very slow (in one sample experiment on a 100-state problem, it made very slow progress and took several million iterations before it stopped) and we did not evaluate it further. Instead, we considered two variations: CPI+, which is identical to CPI except that it chooses the step $\alpha_k$ at each iteration by doing a line-search towards the policy output by the greedy operator⁶, and CPI(α) with $\alpha = 0.1$, which makes relatively but not too small steps at each iteration.
To assess the utility for CPI of using the distribution $d_{\pi_k,\nu}$ for the approximate greedy step, we also considered API(α) with $\alpha = 0.1$, the variation of API described in Equation (4) that makes small steps and that only differs from CPI(α) by the fact that the approximate greedy step uses the distribution $\nu$ instead of $d_{\pi_k,\nu}$. In addition to these algorithms, we considered PSDP∞ and NSPI(m) for the values $m \in \{5, 10, 30\}$.

⁶ We implemented a crude line-search mechanism that looks on the set $\{2^i \alpha\}$, where $\alpha$ is the minimal step estimated by CPI to ensure improvement.

In order to assess their quality, we consider finite problems where the exact value function can be computed. More precisely, we consider Garnet problems, first introduced by Archibald et al. (1995), which are a class of randomly constructed finite MDPs. They do not correspond to any specific application, but remain representative of the kind of MDP that might be encountered in practice. In brief, we consider Garnet problems with $|S| \in \{50, 100, 200\}$, $|A| \in \{2, 5, 10\}$ and branching factors in $\{1, 2, 10\}$. The greedy step used by all algorithms is approximated by an exact greedy operator applied to a noisy orthogonal projection on a linear space of dimension $\frac{|S|}{10}$, with respect to the quadratic norm weighted by $\nu$ (or by $d_{\pi_k,\nu}$ for CPI+ and CPI(α)), where $\nu$ is uniform. For each of these $3^3 = 27$ parameter instances, we generated 30 i.i.d. Garnet MDPs $(M_i)_{1 \le i \le 30}$. For each such MDP $M_i$, we ran API, API(0.1), CPI+, CPI(0.1), NSPI(m) for $m \in \{5, 10, 30\}$ and PSDP∞ 30 times. For each run $j$ and each algorithm, we compute for all iterations $k \in (1, 100)$ the performance, i.e. the loss $L_{j,k} = \mu(v_{\pi_*} - v_{\pi_k})$ with respect to the optimal policy. Figure 2 displays statistics about these random variables. For each algorithm, we display a learning curve with confidence regions that account for the variability across runs and problems. The supplementary material contains statistics that are respectively conditioned on the values of $n_S$, $n_A$ and $b$, which gives some insight on the influence of these parameters.

From these experiments and statistics, we can make a series of observations. The standard API scheme is much more variable than the other algorithms and tends to provide the worst performance on average. CPI+ and CPI(α) display about the same asymptotic performance on average. While CPI(α) has slightly less variability, it is much slower than CPI+, which always converges in very few iterations (most of the time fewer than 10, and always fewer than 20). API(α), the naive conservative variation of API that is also simpler than CPI(α), is empirically close to CPI(α), while being on average slightly worse. CPI+, CPI(α) and PSDP∞ have a similar average performance, but the variability of PSDP∞ is significantly smaller. PSDP∞ is the algorithm that overall gives the best results. NSPI(m) does indeed provide a bridge between API and PSDP∞. By increasing $m$, the behavior gets closer to that of PSDP∞. With $m = 30$, NSPI(m) is overall better than API(α), CPI+, and CPI(α), and close to PSDP∞. The above relative observations are stable with respect to the number of states $n_S$ and actions $n_A$. Interestingly, the differences between the algorithms tend to vanish when the dynamics of the problem gets more and more stochastic (when the branching factor increases). This is consistent with our analysis based on concentrability coefficients: they are all finite when the dynamics mixes a lot, and their relative differences are the biggest in deterministic instances.
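The Garnet construction used above can be sketched as follows; the exact way rewards and transition probabilities are drawn varies across papers (and is not spelled out here), so the choices below are assumptions made purely for illustration:

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, rng=None):
    """Random Garnet-style MDP in the spirit of Archibald et al. (1995):
    each (state, action) pair leads to `branching` random successor states."""
    rng = np.random.default_rng(rng)
    P = np.zeros((n_actions, n_states, n_states))      # P[a, s, s'] transition probabilities
    for a in range(n_actions):
        for s in range(n_states):
            successors = rng.choice(n_states, size=branching, replace=False)
            probs = rng.dirichlet(np.ones(branching))   # random distribution over the b successors
            P[a, s, successors] = probs
    r = rng.uniform(-1.0, 1.0, size=n_states)           # state-only reward, bounded by R_max = 1
    return P, r

# Example: one of the parameter combinations used in the experiments (|S|=100, |A|=5, b=2).
P, r = make_garnet(n_states=100, n_actions=5, branching=2, rng=0)
```

Such an instance plugs directly into the finite-MDP routines sketched in the earlier sections.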
[Figure 2: eight panels of learning curves (one per algorithm, including a panel labeled "CPI+ (line search)"), each showing the loss over iterations 0 to 100.]

Figure 2. Statistics for all instances. The MDPs $(M_i)_{1 \le i \le 30}$ are i.i.d. with the same distribution as $M_1$. Conditioned on some MDP $M_i$ and some algorithm, the error measures at all iterations $k$ are i.i.d. with the same distribution as $L_{1,k}$. The central line of the learning curves gives the empirical estimate of the overall average error $(E[L_{1,k}])_k$. The three grey regions (from dark to light grey) are estimates of, respectively, the variability (across MDPs) of the average error $(\mathrm{Std}[E[L_{1,k}|M_1]])_k$, the average (across MDPs) of the standard deviation of the error $(E[\mathrm{Std}[L_{1,k}|M_1]])_k$, and the variability (across MDPs) of the standard deviation of the error $(\mathrm{Std}[\mathrm{Std}[L_{1,k}|M_1]])_k$. For ease of comparison, all curves are displayed with the same x and y range.

5. Discussion, Summary and Future Work

We have considered several variations of Policy Iteration schemes for infinite-horizon problems: API, CPI, NSPI(m), API(α) and PSDP∞⁷. We have in particular explained the fact, to our knowledge so far unknown, that the recently introduced NSPI(m) algorithm generalizes API (obtained when $m = 1$) and PSDP∞ (which is very similar when $m = \infty$). Table 1 synthesized the theoretical guarantees about these algorithms. Most of the bounds are to our knowledge new.

⁷ We recall that, to our knowledge, the use of PSDP∞ (PSDP in an infinite-horizon context) is not documented in the literature.

One of the first important messages of our work is that what is usually hidden in the constants of the performance bounds does matter. The constants involved in the bounds for API, CPI, PSDP∞ and for the main (left) terms of NSPI(m) can be sorted from the worst to the best as follows: $C^{(2,1,0)}$, $C^{(1,0)}$, $C^{(1)}_{\pi_*}$, $C_{\pi_*}$. A detailed hierarchy of all constants was depicted in Figure 1. This is to our knowledge the first time that such an in-depth comparison of the bounds is done, and our hierarchy of constants has interesting implications that go beyond the Policy Iteration schemes we have been focusing on in this paper. As a matter of fact, several other dynamic programming algorithms, namely AVI (Munos, 2007), λPI (Scherrer, 2013) and AMPI (Scherrer et al., 2012), come with guarantees involving the worst constant $C^{(2,1,0)}$, which suggests that they should not be competitive with the best algorithms we have described here.

At the purely technical level, several of our bounds come in pairs; this is due to the fact that we have introduced a new proof technique. This led to a new bound for API that improves the state of the art, in the sense that it involves the constant $C^{(1,0)}$ instead of $C^{(2,1,0)}$. It also enabled us to derive new bounds for CPI (and its natural algorithmic variant CPI(α)) that are worse in terms of guarantee but have a better time complexity ($O(\frac{1}{\epsilon})$ instead of $O(\frac{1}{\epsilon^2})$). We believe this new technique may be helpful in the future for the analysis of other MDP algorithms.

Let us sum up the main insights of our analysis. 1) The guarantee for CPI can be arbitrarily stronger than that of API/API(α), because it is expressed with respect to the best concentrability constant $C_{\pi_*}$, but this comes at the cost of a relative increase of the number of iterations that is exponential in $\frac{1}{\epsilon}$.
2) PSDP∞ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API, which requires a constant memory, the memory needed by CPI and PSDP∞ is proportional to their number of iterations, which may be problematic in particular when the discount factor $\gamma$ is close to 1 or the approximation error $\epsilon$ is close to 0; we showed that the NSPI(m) algorithm allows one to make an overall trade-off between memory and performance.

The main assumption of this work is that all algorithms have at their disposal an $\epsilon$-approximate greedy operator. It may be unreasonable to compare all algorithms on this basis, since the underlying optimization problems may have different complexities: for instance, methods like CPI look in a space of stochastic policies while API moves in a space of deterministic policies. Digging into and understanding in more depth what is potentially hidden in the term $\epsilon$, as we have done here for the concentrability constants, constitutes a very natural research direction.

Last but not least, we have run numerical experiments that support our worst-case analysis. On simulations on about 800 Garnet MDPs with various characteristics, CPI(α), CPI+ (CPI with a crude line-search mechanism), PSDP∞ and NSPI(m) were shown to always perform significantly better than the standard API. CPI+, CPI(α) and PSDP∞ performed similarly on average, but PSDP∞ showed much less variability and is thus the best algorithm in terms of overall performance. Finally, NSPI(m) allows one to make a bridge between API and PSDP∞, reaching an overall performance close to that of PSDP∞ with a controlled memory. Implementing other instances of these algorithmic schemes, and running and analyzing experiments on bigger domains, constitutes interesting future work.

A. Proofs for Table 1

PSDP∞: For all $k$, we have

$v_{\pi_*} - v_{\sigma_k} = T_{\pi_*} v_{\pi_*} - T_{\pi_*} v_{\sigma_{k-1}} + T_{\pi_*} v_{\sigma_{k-1}} - T_{\pi_k} v_{\sigma_{k-1}} \le \gamma P_{\pi_*}(v_{\pi_*} - v_{\sigma_{k-1}}) + e_k,$

where we defined $e_k = \max_{\pi'} T_{\pi'} v_{\sigma_{k-1}} - T_{\pi_k} v_{\sigma_{k-1}}$. As $P_{\pi_*}$ is non-negative, we deduce by induction:

$v_{\pi_*} - v_{\sigma_k} \le \sum_{i=0}^{k-1} (\gamma P_{\pi_*})^i e_{k-i} + \gamma^k V_{\max}.$

By multiplying both sides by $\mu$, using the definition of the coefficients $c_{\pi_*}(i)$ and the fact that $\nu e_j \le \epsilon_j \le \epsilon$, we get:

$\mu(v_{\pi_*} - v_{\sigma_k}) \le \sum_{i=0}^{k-1} \mu(\gamma P_{\pi_*})^i e_{k-i} + \gamma^k V_{\max} \qquad (7)$

$\le \sum_{i=0}^{k-1} \gamma^i c_{\pi_*}(i)\,\epsilon_{k-i} + \gamma^k V_{\max} \le \left(\sum_{i=0}^{\infty} \gamma^i c_{\pi_*}(i)\right)\epsilon + \gamma^k V_{\max}.$

The bound with respect to $C^{(1)}_{\pi_*}$ is obtained by using the fact that $v_{\sigma_k\dots} \ge v_{\sigma_k} - \gamma^k V_{\max}$ and taking $k \ge \left\lceil \frac{\log\frac{2V_{\max}}{\epsilon}}{1-\gamma} \right\rceil$. Starting back from Equation (7), using the definition of $C_{\pi_*}$ (in particular the fact that for all $i$, $\mu(\gamma P_{\pi_*})^i \le \frac{1}{1-\gamma} d_{\pi_*,\mu} \le \frac{C_{\pi_*}}{1-\gamma}\nu$) and the fact that $\nu e_j \le \epsilon_j$, we get:

$\mu(v_{\pi_*} - v_{\sigma_k}) \le \sum_{i=0}^{k-1} \mu(\gamma P_{\pi_*})^i e_{k-i} + \gamma^k V_{\max} \le \frac{C_{\pi_*}}{1-\gamma}\sum_{i=1}^{k} \epsilon_i + \gamma^k V_{\max},$

and the other bound is obtained by using the fact that $v_{\sigma_k\dots} \ge v_{\sigma_k} - \gamma^k V_{\max}$, $\sum_{i=1}^{k}\epsilon_i \le k\epsilon$, and considering the number of iterations $k = \left\lceil \frac{\log\frac{2V_{\max}}{\epsilon}}{1-\gamma} \right\rceil$.

API/NSPI(m): API is identical to NSPI(1), and its bounds are particular cases of the first two bounds for NSPI(m), so we only consider NSPI(m). By following the proof technique of Scherrer & Lesner (2012), writing $\Gamma_{k,m} = (\gamma P_{\pi_k})(\gamma P_{\pi_{k-1}})\cdots(\gamma P_{\pi_{k-m+1}})$ and $e_{k+1} = \max_{\pi'} T_{\pi'} v_{\pi_{k,m}} - T_{\pi_{k+1}} v_{\pi_{k,m}}$, one can show that:

$v_{\pi_*} - v_{\pi_{k,m}} \le \sum_{i=0}^{k-1}(\gamma P_{\pi_*})^i (I - \Gamma_{k-i,m})^{-1} e_{k-i} + \gamma^k V_{\max}.$

Multiplying both sides by $\mu$ (and observing that $e_k \ge 0$), and using the fact that $\nu e_j \le \epsilon_j \le \epsilon$, we obtain:

$\mu(v_{\pi_*} - v_{\pi_{k,m}}) \le \sum_{i=0}^{k-1} \mu(\gamma P_{\pi_*})^i (I - \Gamma_{k-i,m})^{-1} e_{k-i} + \gamma^k V_{\max} \qquad (8)$

$\le \sum_{i=0}^{k-1}\sum_{j=0}^{\infty} \gamma^{i+jm} c(i+jm)\,\epsilon_{k-i} + \gamma^k V_{\max} \qquad (9)$

$\le \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} \gamma^{i+jm} c(i+jm)\,\epsilon + \gamma^k V_{\max}, \qquad (10)$

which leads to the first bound by taking $k \ge \left\lceil \frac{\log\frac{2V_{\max}}{\epsilon}}{1-\gamma} \right\rceil$.
Starting back from Equation (9), and assuming for simplicity that $\epsilon_k = 0$ for all $k \le 0$, we get (writing $i = h + lm$ with $h \in \{0, \dots, m-1\}$ and $l \ge 0$):

$\mu(v_{\pi_*} - v_{\pi_{k,m}}) - \gamma^k V_{\max} \le \sum_{l \ge 0}\sum_{h=0}^{m-1}\sum_{j=0}^{\infty} \gamma^{h+(l+j)m} c(h+(l+j)m)\,\epsilon_{k-h-lm}$

$\le \sum_{l \ge 0}\sum_{h=0}^{m-1}\sum_{j=l}^{\infty} \gamma^{h+jm} c(h+jm) \max_{k-(l+1)m+1 \le p \le k-lm} \epsilon_p$

$\le \left(\sum_{h=0}^{m-1}\sum_{j=0}^{\infty} \gamma^{h+jm} c(h+jm)\right)\sum_{l \ge 0} \max_{k-(l+1)m+1 \le p \le k-lm} \epsilon_p, \qquad (11)$

which leads to the second bound by taking $k = \left\lceil \frac{\log\frac{2V_{\max}}{\epsilon}}{1-\gamma} \right\rceil$. Last but not least, starting back from Equation (8), and using the fact that $(I - \Gamma_{k-i,m})^{-1} = I + \Gamma_{k-i,m}(I - \Gamma_{k-i,m})^{-1}$, we see that:

$\mu(v_{\pi_*} - v_{\pi_{k,m}}) - \gamma^k V_{\max} \le \sum_{i=0}^{k-1} \mu(\gamma P_{\pi_*})^i e_{k-i} + \sum_{i=0}^{k-1} \mu(\gamma P_{\pi_*})^i \Gamma_{k-i,m}(I - \Gamma_{k-i,m})^{-1} e_{k-i}.$

The first term of the r.h.s. can be bounded exactly as for PSDP∞. For the second term, we have:

$\sum_{i=0}^{k-1} \mu(\gamma P_{\pi_*})^i \Gamma_{k-i,m}(I - \Gamma_{k-i,m})^{-1} e_{k-i} \le \sum_{i=0}^{k-1}\sum_{j=1}^{\infty} \gamma^{i+jm} c(i+jm)\,\epsilon_{k-i} = \gamma^m\sum_{i=0}^{k-1}\sum_{j=0}^{\infty} \gamma^{i+jm} c(i+(j+1)m)\,\epsilon_{k-i},$

and we follow the same lines as above (from Equation (9) to Equations (10) and (11)) to conclude.

CPI, CPI(α), API(α): Conservative steps are addressed by a tedious generalization of the proof for API by Munos (2003). Due to lack of space, the proof is deferred to the Supplementary Material.

B. Proofs for Figure 1

We here provide details on the order relations for the concentrability coefficients.

$C_{\pi_*} \preceq C^{(1)}_{\pi_*}$: (i) We have $C_{\pi_*} \le C^{(1)}_{\pi_*}$ because

$d_{\pi_*,\mu} = (1-\gamma)\mu(I - \gamma P_{\pi_*})^{-1} = (1-\gamma)\sum_{i=0}^{\infty} \gamma^i \mu (P_{\pi_*})^i \le (1-\gamma)\sum_{i=0}^{\infty} \gamma^i c_{\pi_*}(i)\,\nu = C^{(1)}_{\pi_*}\nu$

and $C_{\pi_*}$ is the smallest coefficient $C$ satisfying $d_{\pi_*,\mu} \le C\nu$. (ii) We may have $C_{\pi_*} < \infty$ and $C^{(1)}_{\pi_*} = \infty$ by designing an MDP on $\mathbb{N}$ where $\pi_*$ induces a deterministic transition from state $i$ to state $i+1$.

$C^{(1)}_{\pi_*} \preceq C^{(1,0)}$: (i) We have $C^{(1)}_{\pi_*} \le C^{(1,0)}$ because for all $i$, $c_{\pi_*}(i) \le c(i)$. (ii) It is easy to obtain $C^{(1)}_{\pi_*} < \infty$ and $C^{(1,0)} = \infty$, since $C^{(1)}_{\pi_*}$ only depends on one policy while $C^{(1,0)}$ depends on all policies.

$C^{(1,0)} \preceq C^{(2,m,0)}$ and $C^{(1,m)} \preceq C^{(2,m,m)}$: (i) $C^{(1,m)} \le \frac{1}{1-\gamma^m}C^{(2,m,m)}$ holds because

$\sum_{i=0}^{\infty} \gamma^i c(i+m) \le \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} \gamma^{i+jm} c(i+(j+1)m) = \frac{1}{(1-\gamma)(1-\gamma^m)}C^{(2,m,m)}.$

(ii) One may have $C^{(1,m)} < \infty$ and $C^{(2,m,m)} = \infty$ when $c(i) = \Theta\left(\frac{1}{i^2\gamma^i}\right)$, since the generic term of $C^{(1,m)}$ is $\Theta\left(\frac{1}{i^2}\right)$ (the sum converges) while that of $C^{(2,m,m)}$ is $\Theta\left(\frac{1}{i}\right)$ (the sum diverges). The reasoning is similar for the other relation.

$C^{(1,m)} \preceq C^{(1,0)}$ and $C^{(2,m,m)} \preceq C^{(2,m,0)}$: We here assume that $m < \infty$. (i) We have $C^{(1,m)} \le \frac{1}{\gamma^m}C^{(1,0)}$ and $C^{(2,m,m)} \le \frac{1}{\gamma^m}C^{(2,m,0)}$. (ii) It suffices that $c(j) = \infty$ for some $j < m$ to have $C^{(2,m,0)} = \infty$ while $C^{(2,m,m)} < \infty$, or to have $C^{(1,0)} = \infty$ while $C^{(1,m)} < \infty$.

$C^{(2,1,0)} \preceq C^{(2,m,0)}$: (i) We clearly have $C^{(2,m,0)} \le \frac{1-\gamma^m}{1-\gamma}C^{(2,1,0)}$. (ii) $C^{(2,m,0)}$ can be rewritten as follows:

$C^{(2,m,0)} = (1-\gamma)(1-\gamma^m)\sum_{i=0}^{\infty}\left(\left\lfloor\frac{i}{m}\right\rfloor + 1\right)\gamma^i c(i).$

Then, using the fact that $\left\lfloor\frac{i}{m}\right\rfloor + 1 \ge \max\left(1, \frac{i}{m}\right) \ge \frac{m}{m+1}\cdot\frac{i}{m} + \frac{1}{m+1}$, we have

$\frac{1-\gamma}{1-\gamma^m}C^{(2,m,0)} \ge (1-\gamma)^2\sum_{i=0}^{\infty}\max\left(1, \frac{i}{m}\right)\gamma^i c(i) \ge \frac{(1-\gamma)^2}{m+1}\sum_{i=0}^{\infty}(i+1)\gamma^i c(i) = \frac{1}{m+1}C^{(2,1,0)}.$

Thus, when $m$ is finite, $C^{(2,m,0)} < \infty \Rightarrow C^{(2,1,0)} < \infty$.

References

Archibald, T., McKinnon, K., and Thomas, L. On the Generation of Markov Decision Processes. Journal of the Operational Research Society, 46:354–361, 1995.

Bagnell, J.A., Kakade, S.M., Ng, A., and Schneider, J. Policy search by dynamic programming. In NIPS, 2003.

Bertsekas, D.P. and Tsitsiklis, J.N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Farahmand, A.M., Munos, R., and Szepesvári, Cs. Error propagation for approximate policy and value iteration (extended version). In NIPS, 2010.

Ghavamzadeh, M. and Lazaric, A. Conservative and Greedy Approaches to Classification-based Policy Iteration. In AAAI, 2012.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In ICML, 2002.

Lagoudakis, M. and Parr, R. Reinforcement Learning as Classification: Leveraging Modern Classifiers. In ICML, 2003a.
Lagoudakis, M.G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 4:1107–1149, 2003b.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of a Classification-based Policy Iteration Algorithm. In ICML, 2010.

Munos, R. Error Bounds for Approximate Policy Iteration. In ICML, 2003.

Munos, R. Performance Bounds in Lp norm for Approximate Value Iteration. SIAM Journal on Control and Optimization, 2007.

Puterman, M. Markov Decision Processes. Wiley, New York, 1994.

Scherrer, B. Performance Bounds for Lambda Policy Iteration and Application to the Game of Tetris. Journal of Machine Learning Research, 14:1175–1221, 2013.

Scherrer, B. and Lesner, B. On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes. In NIPS, 2012.

Scherrer, B., Ghavamzadeh, M., Gabillon, V., and Geist, M. Approximate Modified Policy Iteration. In ICML, 2012.