Accelerating Value Iteration with Anchoring

Jongmin Lee¹  Ernest K. Ryu¹,²
¹Department of Mathematical Sciences, Seoul National University
²Interdisciplinary Program in Artificial Intelligence, Seoul National University

Abstract

Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at an $O(\gamma^k)$-rate, where $\gamma$ is the discount factor. Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an anchoring mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits an $O(1/k)$-rate for $\gamma \approx 1$ or even $\gamma = 1$, while standard VI has rate $O(1)$ for $\gamma \ge 1 - 1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of 4, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.

1 Introduction

Value Iteration (VI) is foundational to the theory and practice of modern dynamic programming (DP) and reinforcement learning (RL). It is well known that when a discount factor $\gamma < 1$ is used, (exact) VI is a contractive iteration in the $\|\cdot\|_\infty$-norm and therefore converges. The progress of VI is measured by the Bellman error in practice (as the distance to the fixed point is not computable), and much prior work has been dedicated to analyzing the rates of convergence of VI and its variants.

Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. The classical $O(\gamma^k)$-rate of VI is inadequate as many practical setups use $\gamma \approx 1$ or $\gamma = 1$ for the discount factor. (Not to mention that VI may not converge when $\gamma = 1$.) Moreover, most prior works on accelerating VI focused on the Bellman consistency operator (policy evaluation) as its linearity allows eigenvalue analyses, but the Bellman optimality operator (control) is the more relevant object in modern RL.

Contribution. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an anchoring mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits an $O(1/k)$-rate for $\gamma \approx 1$ or even $\gamma = 1$, while standard VI has rate $O(1)$ for $\gamma \ge 1 - 1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of 4, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.

1.1 Notations and preliminaries

We quickly review basic definitions and concepts of Markov decision processes (MDP) and reinforcement learning (RL). For further details, refer to standard references such as [69, 84, 81].

Markov Decision Process. Let $\mathcal{M}(X)$ be the space of probability distributions over $X$.
Write $(S, A, P, r, \gamma)$ to denote the MDP with state space $S$, action space $A$, transition probability $P\colon S \times A \to \mathcal{M}(S)$, reward $r\colon S \times A \to \mathbb{R}$, and discount factor $\gamma \in (0, 1]$. Denote $\pi\colon S \to \mathcal{M}(A)$ for a policy,
$$V^\pi(s) = \mathbb{E}^\pi\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\,\Big|\, s_0 = s\Big] \quad\text{and}\quad Q^\pi(s, a) = \mathbb{E}^\pi\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\,\Big|\, s_0 = s,\ a_0 = a\Big]$$
for $V$- and $Q$-value functions, where $\mathbb{E}^\pi$ denotes the expected value over all trajectories $(s_0, a_0, s_1, a_1, \dots)$ induced by $P$ and $\pi$. We say $V^\star$ and $Q^\star$ are optimal $V$- and $Q$-value functions if $V^\star = \sup_\pi V^\pi$ and $Q^\star = \sup_\pi Q^\pi$. We say $\pi^\star_V$ and $\pi^\star_Q$ are optimal policies if $\pi^\star_V = \operatorname{argmax}_\pi V^\pi$ and $\pi^\star_Q = \operatorname{argmax}_\pi Q^\pi$. (If the argmax is not unique, break ties arbitrarily.)

Value Iteration. Let $F(X)$ denote the space of bounded measurable real-valued functions over $X$. With the given MDP $(S, A, P, r, \gamma)$, for $V \in F(S)$ and $Q \in F(S \times A)$, define the Bellman consistency operators $T^\pi$ as
$$T^\pi V(s) = \mathbb{E}_{a \sim \pi(\cdot\,|\,s),\, s' \sim P(\cdot\,|\,s,a)}\big[r(s, a) + \gamma V(s')\big], \qquad T^\pi Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\,|\,s,a),\, a' \sim \pi(\cdot\,|\,s')}\big[Q(s', a')\big]$$
for all $s \in S$, $a \in A$, and the Bellman optimality operators $T^\star$ as
$$T^\star V(s) = \sup_{a \in A}\Big\{ r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\,|\,s,a)}\big[V(s')\big]\Big\}, \qquad T^\star Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\,|\,s,a)}\Big[\sup_{a' \in A} Q(s', a')\Big]$$
for all $s \in S$, $a \in A$. For notational conciseness, we write $T^\pi V = r^\pi + \gamma P_\pi V$ and $T^\pi Q = r + \gamma P_\pi Q$, where $r^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot\,|\,s)}[r(s, a)]$ is the reward induced by policy $\pi$, and $P_\pi$ acting on $V$ and $Q$, defined through
$$P_\pi(s \to s') = \mathrm{Prob}\big(s \to s' \,\big|\, a \sim \pi(\cdot\,|\,s),\ s' \sim P(\cdot\,|\,s,a)\big), \qquad P_\pi\big((s, a) \to (s', a')\big) = \mathrm{Prob}\big((s, a) \to (s', a') \,\big|\, s' \sim P(\cdot\,|\,s,a),\ a' \sim \pi(\cdot\,|\,s')\big),$$
are the transition probabilities induced by policy $\pi$. We define VI for the Bellman consistency and optimality operators as
$$V^{k+1} = T^\pi V^k, \qquad Q^{k+1} = T^\pi Q^k, \qquad V^{k+1} = T^\star V^k, \qquad Q^{k+1} = T^\star Q^k$$
for $k = 0, 1, \dots$, where $V^0, Q^0$ are initial points. VI for control, after executing $K$ iterations, returns the near-optimal policy $\pi_K$ as a greedy policy satisfying $T^{\pi_K} V^K = T^\star V^K$ or $T^{\pi_K} Q^K = T^\star Q^K$. For $\gamma < 1$, both Bellman consistency and optimality operators are contractions, and, by Banach's fixed-point theorem [5], the VIs converge to the unique fixed points $V^\pi$, $Q^\pi$, $V^\star$, and $Q^\star$ with $O(\gamma^k)$-rate. For notational unity, we use the symbol $U$ when both $V$ and $Q$ can be used. Since
$$\|TU^k - U^k\|_\infty \le \|TU^k - U^\star\|_\infty + \|U^k - U^\star\|_\infty \le (1 + \gamma)\|U^k - U^\star\|_\infty,$$
VI exhibits the rate on the Bellman error
$$\|TU^k - U^k\|_\infty \le (1 + \gamma)\gamma^k \|U^0 - U^\star\|_\infty \qquad \text{for } k = 0, 1, \dots, \qquad (1)$$
where $T$ is the Bellman consistency or optimality operator, $U^0$ is a starting point, and $U^\star$ is the fixed point of $T$. We say $V \le V'$ or $Q \le Q'$ if $V(s) \le V'(s)$ or $Q(s, a) \le Q'(s, a)$ for all $s \in S$ and $a \in A$, respectively.

Fixed-point iterations. Given an operator $T$, we say $x^\star$ is a fixed point if $Tx^\star = x^\star$. Since Banach [5], the standard fixed-point iteration $x^{k+1} = Tx^k$ for $k = 0, 1, \dots$ has been commonly used to find fixed points. Note that VI for policy evaluation and control are fixed-point iterations with the Bellman consistency and optimality operators. In this work, we also consider the Halpern iteration
$$x^{k+1} = \beta_{k+1} x^0 + (1 - \beta_{k+1}) T x^k \qquad \text{for } k = 0, 1, \dots,$$
where $x^0$ is an initial point and $\{\beta_k\}_{k \in \mathbb{N}} \subseteq (0, 1)$.
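To make the preceding definitions concrete, the following minimal sketch runs standard VI with the Bellman optimality operator on a small randomly generated tabular MDP and tracks the Bellman error against the bound (1). The toy MDP, the sizes, and names such as `bellman_optimality` are illustrative assumptions of ours, not constructions from the paper.

```python
# Minimal sketch (our illustration): standard VI on a toy tabular MDP,
# tracking the Bellman error ||T* V^k - V^k||_inf against rate (1).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 4, 0.9

P = rng.random((n_actions, n_states, n_states))   # P[a, s, s'] transition probabilities
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))              # r[s, a] rewards

def bellman_optimality(V):
    # (T* V)(s) = max_a { r(s, a) + gamma * E_{s' ~ P(.|s,a)}[V(s')] }
    return (r + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)

# Reference fixed point V*, obtained by iterating T* to numerical convergence.
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = bellman_optimality(V_star)

V0 = np.zeros(n_states)
dist0 = np.abs(V0 - V_star).max()                  # ||V^0 - V*||_inf

V = V0
for k in range(20):
    bellman_err = np.abs(bellman_optimality(V) - V).max()
    bound = (1 + gamma) * gamma**k * dist0         # right-hand side of (1)
    print(f"k={k:2d}  ||T*V^k - V^k||_inf = {bellman_err:.3e}  <=  {bound:.3e}")
    V = bellman_optimality(V)                      # V^{k+1} = T* V^k
```

The same harness applies verbatim to the Bellman consistency operator $T^\pi V = r^\pi + \gamma P_\pi V$ of a fixed policy.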
1.2 Prior works

Value Iteration. Value iteration (VI) was first introduced in the DP literature [8] for finding the optimal value function, and its variant approximate VI [11, 30, 56, 32, 19, 90, 81] considers approximate evaluations of the Bellman optimality operator. In RL, VI and approximate VI have served as the basis of RL algorithms such as fitted value iteration [29, 57, 52, 87, 50, 36] and temporal difference learning [80, 89, 41, 94, 54]. There is a line of research that emulates VI by learning a model of the MDP dynamics [85, 83, 62] and applying a modified Bellman operator [7, 33]. Asynchronous VI, another variation of VI that updates the coordinates of the value function in an asynchronous manner, has also been studied in both the RL and DP literature [11, 9, 88, 100].

Fixed-point iterations. The Banach fixed-point theorem [5] establishes the convergence of the standard fixed-point iteration with a contractive operator. The Halpern iteration [39] converges for nonexpansive operators on Hilbert spaces [96] and uniformly smooth Banach spaces [70, 97]. (To clarify, the $\|\cdot\|_\infty$-norm in $\mathbb{R}^n$ is not uniformly smooth.) The fixed-point residual $\|Tx^k - x^k\|$ is a commonly used error measure for fixed-point problems. In general normed spaces, the Halpern iteration was shown to exhibit an $O(1/\log(k))$-rate for (nonlinear) nonexpansive operators [48] and an $O(1/k)$-rate for linear nonexpansive operators [17] on the fixed-point residual. In Hilbert spaces, [72] first established an $O(1/k)$-rate for the Halpern iteration, and the constant was later improved by [49, 43]. For contractive operators, [65] proved exact optimality of the Halpern iteration through an exactly matching complexity lower bound.

Acceleration. Since Nesterov's seminal work [61], there has been a large body of research on acceleration in convex minimization. Gradient descent [15] can be accelerated to efficiently reduce the function value and the squared gradient magnitude for smooth convex minimization problems [61, 44, 45, 46, 102, 21, 60] and smooth strongly convex minimization problems [59, 91, 64, 86, 73]. Motivated by Nesterov acceleration, inertial fixed-point iterations [51, 22, 75, 70, 42] have also been suggested to accelerate fixed-point iterations. Anderson acceleration [2], another acceleration scheme for fixed-point iterations, has recently been studied with interest [6, 74, 93, 101]. In DP and RL, prioritized sweeping [55] is a well-known method that changes the order of updates to accelerate convergence, and several variants [68, 53, 95, 3, 18] have been proposed. Speedy Q-learning [4] modifies the update rule of Q-learning and uses aggressive learning rates for acceleration. Recently, there has been a line of research that applies acceleration techniques from other areas to VI: [34, 79, 28, 67, 27, 76] use Anderson acceleration of fixed-point iterations, [92, 37, 38, 12, 1] use Nesterov acceleration of convex optimization, and [31] uses ideas inspired by PID controllers in control theory. Among those works, [37, 38, 1] applied Nesterov acceleration to obtain theoretically accelerated convergence rates, but those analyses require certain reversibility conditions or restrictions on the eigenvalues of the transition probability induced by the policy.

The anchor acceleration, a new acceleration mechanism distinct from Nesterov's, has lately gained attention in convex optimization and fixed-point theory. The anchoring mechanism, which retracts iterates towards the initial point, has been used to accelerate algorithms for minimax optimization and fixed-point problems [71, 47, 98, 65, 43, 20, 99, 78], and we focus on it in this paper.

Complexity lower bound. With the information-based complexity analysis [58], complexity lower bounds on first-order methods for convex minimization problems have been thoroughly studied [59, 23, 25, 13, 14, 24]. If a complexity lower bound matches an algorithm's convergence rate, it establishes optimality of the algorithm [58, 44, 73, 86, 26, 65]. In fixed-point problems, [16] established an $\Omega(1/k^{1-2/q})$ lower bound on the distance to solution for the Halpern iteration with a nonexpansive operator in $q$-uniformly smooth Banach spaces.
In [17], an $\Omega(1/k)$ lower bound on the fixed-point residual for the general Mann iteration with a nonexpansive linear operator, which includes the standard fixed-point iteration and the Halpern iteration, in the $\ell^\infty$-space was provided. In Hilbert spaces, [65] showed an exact complexity lower bound on the fixed-point residual for deterministic fixed-point iterations with $\gamma$-contractive and nonexpansive operators. Finally, [37] provided a lower bound on the distance to the optimal value function for fixed-point iterations satisfying a span condition with the Bellman consistency and optimality operators; we discuss this lower bound in Section 4.

2 Anchored Value Iteration

Let $T$ be a $\gamma$-contractive (in the $\|\cdot\|_\infty$-norm) Bellman consistency or optimality operator. The Anchored Value Iteration (Anc-VI) is
$$U^k = \beta_k U^0 + (1 - \beta_k) T U^{k-1} \qquad \text{(Anc-VI)}$$
for $k = 1, 2, \dots$, where $\beta_k = 1/\big(\sum_{i=0}^{k} \gamma^{-2i}\big)$ and $U^0$ is an initial point. In this section, we present accelerated convergence rates of Anc-VI for both the Bellman consistency and optimality operators for both $V$- and $Q$-value iterations. For the control setup, where the Bellman optimality operator is used, Anc-VI returns the near-optimal policy $\pi_K$ as a greedy policy satisfying $T^{\pi_K} U^K = T^\star U^K$ after executing $K$ iterations.

Notably, Anc-VI obtains the next iterate as a convex combination between the output of $T$ and the starting point $U^0$. We call the $\beta_k U^0$ term the anchor term since, loosely speaking, it serves to pull the iterates toward the starting point $U^0$. The strength of the anchor mechanism diminishes as the iteration progresses since $\beta_k$ is a decreasing sequence; a minimal implementation sketch is given below.

The anchor mechanism was introduced [39, 72, 49, 65, 17, 48] for general nonexpansive operators and $\ell^2$-nonexpansive and contractive operators. The optimal method for $\ell^2$-nonexpansive and contractive operators in [65] shares the same coefficients with Anc-VI, and the convergence results for general nonexpansive operators in [17, 48] are applicable to Anc-VI with nonexpansive Bellman optimality and consistency operators. While our anchor mechanism does bear a formal resemblance to those of prior works, our convergence rates and point convergence are neither a direct application nor a direct adaptation of the prior convergence analyses. The prior analyses for $\ell^2$-nonexpansive and contractive operators do not apply to Bellman operators, and the prior analyses for general nonexpansive operators have slower rates and do not provide point convergence while our Theorem 3 does. Our analyses specifically utilize the structure of Bellman operators to obtain the faster rates and point convergence.

The accelerated rate of Anc-VI for the Bellman optimality operator is more technically challenging and is, in our view, the stronger contribution. However, we start by presenting the result for the Bellman consistency operator because it is commonly studied in the prior RL theory literature on accelerating value iteration [37, 38, 1, 31] and because the analysis in the Bellman consistency setup will serve as a good conceptual stepping stone towards the analysis in the Bellman optimality setup.
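The following is a minimal sketch of the (Anc-VI) update itself. The update rule and the anchoring coefficients $\beta_k = 1/\sum_{i=0}^{k}\gamma^{-2i}$ are as defined above; the generic-operator interface, the toy policy-evaluation instance, and all names are our own illustrative assumptions rather than code from the paper.

```python
# Sketch of the (Anc-VI) update U^k = beta_k U^0 + (1 - beta_k) T(U^{k-1})
# with beta_k = 1 / sum_{i=0}^k gamma^(-2i). Harness and names are illustrative.
import numpy as np

def anc_vi(T, U0, gamma, num_iters):
    """Run Anc-VI for a gamma-contractive operator T; return last iterate and Bellman errors."""
    U = U0.copy()
    errors = []
    for k in range(1, num_iters + 1):
        beta_k = 1.0 / sum(gamma ** (-2 * i) for i in range(k + 1))
        U = beta_k * U0 + (1.0 - beta_k) * T(U)
        errors.append(np.abs(T(U) - U).max())      # ||T U^k - U^k||_inf
    return U, errors

# Toy demo with a Bellman consistency operator T^pi V = r^pi + gamma * P_pi V
# (random policy-induced chain; an assumption for illustration only).
rng = np.random.default_rng(1)
n, gamma = 30, 0.99
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

def T_pi(V):
    return r_pi + gamma * (P_pi @ V)

U_final, errs = anc_vi(T_pi, np.zeros(n), gamma, 50)
print("Bellman error of Anc-VI after 50 iterations:", errs[-1])
```

Standard VI corresponds to dropping the anchor term (taking $\beta_k = 0$); the acceleration claims of Theorems 1 and 2 below are worst-case guarantees on the Bellman error.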
2.1 Accelerated rate for Bellman consistency operator

First, for general state-action spaces, we present the accelerated convergence rate of Anc-VI for the Bellman consistency operator.

Theorem 1. Let $0 < \gamma < 1$ be the discount factor and $\pi$ be a policy. Let $T^\pi$ be the Bellman consistency operator for $V$ or $Q$. Then, Anc-VI exhibits the rate
$$\|T^\pi U^k - U^k\|_\infty \;\le\; \frac{(\gamma^{-1} - \gamma)\big(1 + 2\gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}}\,\|U^0 - U^\pi\|_\infty \;=\; \left(\frac{2}{k+1} + \frac{k-1}{k+1}\,\epsilon + O(\epsilon^2)\right)\|U^0 - U^\pi\|_\infty$$
for $k = 0, 1, \dots$, where $\epsilon = 1 - \gamma$ and the big-$O$ notation considers the limit $\epsilon \to 0$. If, furthermore, $U^0 \le T^\pi U^0$ or $U^0 \ge T^\pi U^0$, then Anc-VI exhibits the rate
$$\|T^\pi U^k - U^k\|_\infty \;\le\; \frac{(\gamma^{-1} - \gamma)\big(1 + \gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}}\,\|U^0 - U^\pi\|_\infty \;=\; \left(\frac{1}{k+1} + \frac{k}{k+1}\,\epsilon + O(\epsilon^2)\right)\|U^0 - U^\pi\|_\infty$$
for $k = 0, 1, \dots$.

When $1/2 \le \gamma < 1$, both rates of Theorem 1 are strictly faster than the standard rate (1) of VI, since
$$\frac{(\gamma^{-1} - \gamma)\big(1 + 2\gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}} = \frac{\gamma^k (1 - \gamma^2)\big(1 + 2\gamma - \gamma^{k+1}\big)}{1 - \gamma^{2k+2}} < \gamma^k(1 + \gamma).$$
The second rate of Theorem 1, which has the additional requirement, is faster than the standard rate (1) of VI for all $0 < \gamma < 1$. Interestingly, in the $\gamma \approx 1$ regime, Anc-VI achieves an $O(1/k)$-rate while VI has an $O(1)$-rate. We briefly note that the conditions $U^0 \le TU^0$ and $U^0 \ge TU^0$ have been used in analyses of variants of VI [69, Theorem 6.3.11], [77, p.3].

In the following, we briefly outline the proof of Theorem 1 while deferring the full description to Appendix B. In the outline, we highlight a particular step, labeled $(\ast)$, that crucially relies on the linearity of the Bellman consistency operator. In the analysis for the Bellman optimality operator of Theorem 2, resolving the step $(\ast)$ despite the nonlinearity is the key technical challenge.

Proof outline of Theorem 1. Recall that we can write the Bellman consistency operator as $T^\pi V = r^\pi + \gamma P_\pi V$ and $T^\pi Q = r + \gamma P_\pi Q$. Since $T^\pi$ is a linear operator,¹ we get
$$\begin{aligned}
T^\pi U^k - U^k &= T^\pi U^k - (1 - \beta_k) T^\pi U^{k-1} - \beta_k T^\pi U^\pi - \beta_k (U^0 - U^\pi) \\
&\overset{(\ast)}{=} \gamma P_\pi\big(U^k - (1 - \beta_k) U^{k-1} - \beta_k U^\pi\big) - \beta_k (U^0 - U^\pi) \\
&= \gamma P_\pi\big(\beta_k (U^0 - U^\pi) + (1 - \beta_k)(T^\pi U^{k-1} - U^{k-1})\big) - \beta_k (U^0 - U^\pi) \\
&= \sum_{i=1}^{k}\Big[\big(\beta_i - \beta_{i-1}(1 - \beta_i)\big)\Big(\textstyle\prod_{j=i+1}^{k}(1 - \beta_j)\Big)(\gamma P_\pi)^{k-i+1}(U^0 - U^\pi)\Big] - \beta_k (U^0 - U^\pi) + \Big(\textstyle\prod_{j=1}^{k}(1 - \beta_j)\Big)(\gamma P_\pi)^{k+1}(U^0 - U^\pi),
\end{aligned}$$
where the first equality follows from the definition of Anc-VI and the property of the fixed point, while the last equality follows from induction. Taking the $\|\cdot\|_\infty$-norm of both sides, we conclude
$$\|T^\pi U^k - U^k\|_\infty \le \frac{(\gamma^{-1} - \gamma)\big(1 + 2\gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}}\,\|U^0 - U^\pi\|_\infty.$$

¹Arguably, $T^\pi$ is affine, not linear, but we follow the convention of [69] and say $T^\pi$ is linear.

2.2 Accelerated rate for Bellman optimality operator

We now present the accelerated convergence rate of Anc-VI for the Bellman optimality operator. Our analysis uses what we call the Bellman anti-optimality operator, defined as
$$\hat{T}^\star V(s) = \inf_{a \in A}\Big\{ r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\,|\,s,a)}\big[V(s')\big]\Big\}, \qquad \hat{T}^\star Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\,|\,s,a)}\Big[\inf_{a' \in A} Q(s', a')\Big]$$
for all $s \in S$ and $a \in A$. (The $\sup$ is replaced with an $\inf$.) When $0 < \gamma < 1$, the Bellman anti-optimality operator is $\gamma$-contractive and has a unique fixed point $\hat{U}^\star$ by the exact same arguments that establish $\gamma$-contractiveness of the standard Bellman optimality operator.

Theorem 2. Let $0 < \gamma < 1$ be the discount factor. Let $T^\star$ and $\hat{T}^\star$ respectively be the Bellman optimality and anti-optimality operators for $V$ or $Q$. Let $U^\star$ and $\hat{U}^\star$ respectively be the fixed points of $T^\star$ and $\hat{T}^\star$. Then, Anc-VI exhibits the rate
$$\|T^\star U^k - U^k\|_\infty \;\le\; \frac{(\gamma^{-1} - \gamma)\big(1 + 2\gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}}\,\max\big\{\|U^0 - U^\star\|_\infty,\ \|U^0 - \hat{U}^\star\|_\infty\big\}$$
for $k = 0, 1, \dots$. If, furthermore, $U^0 \le T^\star U^0$ or $U^0 \ge T^\star U^0$, then Anc-VI exhibits the rate
$$\|T^\star U^k - U^k\|_\infty \;\le\; \begin{cases} \dfrac{(\gamma^{-1} - \gamma)\big(1 + \gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}}\,\|U^0 - U^\star\|_\infty & \text{if } U^0 \le T^\star U^0 \\[8pt] \dfrac{(\gamma^{-1} - \gamma)\big(1 + \gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}}\,\|U^0 - \hat{U}^\star\|_\infty & \text{if } U^0 \ge T^\star U^0 \end{cases}$$
for $k = 0, 1, \dots$.

Anc-VI with the Bellman optimality operator exhibits the same accelerated convergence rate as Anc-VI with the Bellman consistency operator. As in Theorem 1, the rate of Theorem 2 also becomes $O(1/k)$ when $\gamma \approx 1$, while VI has an $O(1)$-rate.
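As a quick numerical sanity check, the snippet below evaluates the rate factor appearing in Theorems 1 and 2 near $\gamma = 1$ and compares it against the first-order expansion $2/(k+1) + \tfrac{k-1}{k+1}\epsilon$ stated in Theorem 1 and against the factor $(1+\gamma)\gamma^k$ of rate (1). The script and the chosen values are our own illustration.

```python
# Evaluate the accelerated rate factor of Theorems 1-2,
#   R(gamma, k) = (1/gamma - gamma) * (1 + 2*gamma - gamma**(k+1))
#                 / (gamma**(-(k+1)) - gamma**(k+1)),
# near gamma = 1, against its expansion 2/(k+1) + ((k-1)/(k+1))*eps and the
# standard VI factor (1 + gamma) * gamma**k from rate (1). Illustrative only.

def anc_rate(gamma, k):
    return ((1/gamma - gamma) * (1 + 2*gamma - gamma**(k + 1))
            / (gamma**(-(k + 1)) - gamma**(k + 1)))

def vi_rate(gamma, k):
    return (1 + gamma) * gamma**k

eps = 1e-5
gamma = 1 - eps
for k in (10, 100, 1000):
    taylor = 2/(k + 1) + (k - 1)/(k + 1) * eps
    print(f"k={k:5d}  anc={anc_rate(gamma, k):.6e}  "
          f"taylor={taylor:.6e}  vi={vi_rate(gamma, k):.6e}")
```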
Proof outline of Theorem 2. The key technical challenge of the proof comes from the fact that the Bellman optimality operator is nonlinear. Similar to the Bellman consistency operator case, we have
$$\begin{aligned}
T^\star U^k - U^k &= T^\star U^k - (1 - \beta_k) T^\star U^{k-1} - \beta_k T^\star U^\star - \beta_k (U^0 - U^\star) \\
&\overset{(\ast)}{\le} \gamma P_{\pi_k}\big(U^k - (1 - \beta_k) U^{k-1} - \beta_k U^\star\big) - \beta_k (U^0 - U^\star) \\
&= \gamma P_{\pi_k}\big(\beta_k (U^0 - U^\star) + (1 - \beta_k)(T^\star U^{k-1} - U^{k-1})\big) - \beta_k (U^0 - U^\star) \\
&\le \sum_{i=1}^{k}\Big[\big(\beta_i - \beta_{i-1}(1 - \beta_i)\big)\Big(\textstyle\prod_{j=i+1}^{k}(1 - \beta_j)\Big)\Big(\textstyle\prod_{l=k}^{i}\gamma P_{\pi_l}\Big)(U^0 - U^\star)\Big] - \beta_k (U^0 - U^\star) + \Big(\textstyle\prod_{j=1}^{k}(1 - \beta_j)\Big)\Big(\textstyle\prod_{l=k}^{0}\gamma P_{\pi_l}\Big)(U^0 - U^\star),
\end{aligned}$$
where $\pi_k$ is the greedy policy satisfying $T^{\pi_k} U^k = T^\star U^k$, we define $\prod_{l=k}^{i}\gamma P_{\pi_l} = \gamma P_{\pi_k}\gamma P_{\pi_{k-1}}\cdots\gamma P_{\pi_i}$, and the last inequality follows by induction and the monotonicity of the Bellman optimality operator. The key step $(\ast)$ uses the greedy policies $\{\pi_l\}_{l=0,1,\dots,k}$, which are well defined when the action space is finite. When the action space is infinite, greedy policies may not exist, so we use the Hahn–Banach extension theorem to overcome this technicality. The full argument is provided in Appendix B. To lower bound $T^\star U^k - U^k$, we use a similar line of reasoning with the Bellman anti-optimality operator. Combining the upper and lower bounds of $T^\star U^k - U^k$, we conclude the accelerated rate of Theorem 2.

For $\gamma < 1$, the rates of Theorems 1 and 2 can be translated to a bound on the distance to solution: since $\|U^k - U^\star\|_\infty \le \frac{1}{1-\gamma}\|TU^k - U^k\|_\infty$, we obtain
$$\|U^k - U^\star\|_\infty \;\le\; \frac{\gamma^k (1 + \gamma)\big(1 + 2\gamma - \gamma^{k+1}\big)}{1 - \gamma^{2k+2}}\,\max\big\{\|U^0 - U^\star\|_\infty,\ \|U^0 - \hat{U}^\star\|_\infty\big\} \qquad \text{for } k = 1, 2, \dots.$$
This $O(\gamma^k)$ rate is worse than the rate of (classical) VI by a constant factor. Therefore, Anc-VI is better than VI in terms of the Bellman error, but it is not better than VI in terms of distance to solution.

3 Convergence when $\gamma = 1$

Undiscounted MDPs are not commonly studied in the DP and RL theory literature due to the following difficulties: the Bellman consistency and optimality operators may not have fixed points, VI is a nonexpansive (not contractive) fixed-point iteration and may not converge to a fixed point even if one exists, and the interpretation of a fixed point as the (optimal) value function becomes unclear when the fixed point is not unique. However, many modern deep RL setups actually do not use discounting,² and this empirical practice makes the theoretical analysis with $\gamma = 1$ relevant. In this section, we show that Anc-VI converges to fixed points of the Bellman consistency and optimality operators of undiscounted MDPs. While a full treatment of undiscounted MDPs is beyond the scope of this paper, we show that fixed points, if one exists, can be found, and we therefore argue that the inability to find fixed points should not be considered an obstacle in studying the $\gamma = 1$ setup.

²As a specific example, the classical policy gradient theorem [82] calls for the use of $\nabla J(\theta) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \pi_\theta(a_t \,|\, s_t)\, Q^{\phi}_\gamma(s_t, a_t)\big]$, but many modern deep policy gradient methods use $\gamma = 1$ in the first instance of $\gamma$ (so $\gamma^t = 1$) while using $\gamma < 1$ in $Q^{\phi}_\gamma(s_t, a_t)$ [63].

We first state our convergence result for finite state-action spaces.

Theorem 3. Let $\gamma = 1$. Let $T\colon \mathbb{R}^n \to \mathbb{R}^n$ be the nonexpansive Bellman consistency or optimality operator for $V$ or $Q$. Assume a fixed point exists (not necessarily unique). If $U^0 \le TU^0$, then Anc-VI exhibits the rate
$$\|TU^k - U^k\|_\infty \le \frac{1}{k+1}\,\|U^0 - U^\star\|_\infty \qquad \text{for } k = 0, 1, \dots$$
for any fixed point $U^\star$ satisfying $U^0 \le U^\star$. Furthermore, $U^k \to U^\star$ for some fixed point $U^\star$.

If rewards are nonnegative, then the condition $U^0 \le TU^0$ is satisfied with $U^0 = 0$. So, under this mild condition, Anc-VI with $\gamma = 1$ converges with $O(1/k)$-rate on the Bellman error. To clarify, the convergence $U^k \to U^\star$ has no rate, i.e., $\|U^k - U^\star\|_\infty = o(1)$, while $\|TU^k - U^k\|_\infty = O(1/k)$. In contrast, standard VI does not guarantee convergence in this setup. We also point out that the convergence of the Bellman error does not immediately imply point convergence, i.e., $TU^k - U^k \to 0$ does not immediately imply $U^k \to U^\star$, when $\gamma = 1$. Rather, we show (i) $U^k$ is a bounded sequence, (ii) any convergent subsequence $U^{k_j}$ converges to a fixed point $U^\star$, and (iii) $U^k$ is elementwise monotonically nondecreasing and therefore has a single limit.

Next, we state our convergence result for general state-action spaces.

Theorem 4. Let $\gamma = 1$. Let the state and action spaces be general (possibly infinite) sets. Let $T$ be the nonexpansive Bellman consistency or optimality operator for $V$ or $Q$, and assume $T$ is well defined.³ Assume a fixed point exists (not necessarily unique). If $U^0 \le TU^0$, then Anc-VI exhibits the rate
$$\|TU^k - U^k\|_\infty \le \frac{1}{k+1}\,\|U^0 - U^\star\|_\infty \qquad \text{for } k = 0, 1, \dots$$
for any fixed point $U^\star$ satisfying $U^0 \le U^\star$. Furthermore, $U^k \to U^\star$ pointwise monotonically for some fixed point $U^\star$.

³Well-definedness of $T$ requires a $\sigma$-algebra on the state and action spaces, expectations with respect to the transition probability and the policy to be well defined, boundedness and measurability of the output of the Bellman operators, etc.

The convergence $U^k \to U^\star$ pointwise in infinite state-action spaces is, in our view, a non-trivial contribution. When the state-action space is finite, pointwise convergence directly implies convergence in $\|\cdot\|_\infty$, and in this sense, Theorem 4 is a generalization of Theorem 3. However, when the state-action space is infinite, pointwise convergence does not necessarily imply uniform convergence, i.e., $U^k \to U^\star$ pointwise does not necessarily imply $U^k \to U^\star$ in $\|\cdot\|_\infty$.

4 Complexity lower bound

We now present a complexity lower bound establishing optimality of Anc-VI.

Theorem 5. Let $k \ge 0$, $n \ge k + 2$, $0 < \gamma \le 1$, and $U^0 \in \mathbb{R}^n$. Then there exists an MDP with $|S| = n$ and $|A| = 1$ (which implies the Bellman consistency and optimality operators for $V$ and $Q$ all coincide as $T\colon \mathbb{R}^n \to \mathbb{R}^n$) such that $T$ has a fixed point $U^\star$ satisfying $U^0 \le U^\star$ and
$$\|TU^k - U^k\|_\infty \ge \frac{\gamma^k}{\sum_{i=0}^{k} \gamma^i}\,\|U^0 - U^\star\|_\infty$$
for any iterates $\{U^i\}_{i=0}^{k}$ satisfying
$$U^i \in U^0 + \operatorname{span}\{TU^0 - U^0,\, TU^1 - U^1,\, \dots,\, TU^{i-1} - U^{i-1}\} \qquad \text{for } i = 1, \dots, k.$$

Proof outline of Theorem 5. Without loss of generality, assume $n = k + 2$ and $U^0 = 0$. Consider the MDP $(S, A, P, r, \gamma)$ such that $S = \{s_1, \dots, s_{k+2}\}$, $A = \{a_1\}$,
$$P(s_i \,|\, s_j, a_1) = \mathbf{1}_{\{i = j = 1 \ \text{or}\ j = i + 1\}}, \qquad r(s_i, a_1) = \mathbf{1}_{\{i = 2\}}.$$
Then, $TU = \gamma P_\pi U + [0, 1, 0, \dots, 0]^\intercal$, $U^\star = [0, 1, \gamma, \dots, \gamma^k]^\intercal$, and $\|U^0 - U^\star\|_\infty = 1$. Under the span condition, we can show that $U^k_{k+2} = 0$. Then, we get
$$TU^k - U^k = \big(0,\ 1 - U^k_2,\ \gamma U^k_2 - U^k_3,\ \dots,\ \gamma U^k_k - U^k_{k+1},\ \gamma U^k_{k+1}\big),$$
and this implies
$$1 = (TU^k - U^k)_2 + \gamma^{-1}(TU^k - U^k)_3 + \cdots + \gamma^{-k}(TU^k - U^k)_{k+2}.$$
Taking the absolute value on both sides,
$$\big(1 + \gamma^{-1} + \cdots + \gamma^{-k}\big)\max_{1 \le i \le k+2}\big\{\big|(TU^k - U^k)_i\big|\big\} \ge 1.$$
Therefore, we conclude $\|TU^k - U^k\|_\infty \ge \frac{\gamma^k}{\sum_{i=0}^{k}\gamma^i}\,\|U^0 - U^\star\|_\infty$.

Note that the case $\gamma = 1$ is included in Theorem 5. When $\gamma = 1$, the lower bound of Theorem 5 exactly matches the upper bound of Theorem 3. Since
$$\frac{\gamma^k}{\sum_{i=0}^{k}\gamma^i} \;\le\; \frac{(\gamma^{-1} - \gamma)\big(1 + \gamma - \gamma^{k+1}\big)}{\gamma^{-(k+1)} - \gamma^{k+1}} \;\le\; \frac{4\gamma^k}{\sum_{i=0}^{k}\gamma^i}$$
for all $0 < \gamma < 1$, the lower bound establishes optimality of the second rates of Theorems 1 and 2 up to a constant factor of 4.

Theorem 5 improves upon the prior state-of-the-art complexity lower bound established in the proof of [37, Theorem 3] by a factor of $1 - \gamma^{k+1}$. (In [37, Theorem 3], a lower bound on the distance to the optimal value function is provided. Their result has an implicit dependence on the initial distance to the optimal value function $\|U^0 - U^\star\|_\infty$, so we make the dependence explicit, and we translate their result to a lower bound on the Bellman error. Once this is done, the difference between our lower bound of Theorem 5 and that of [37, Theorem 3] is a factor of $1 - \gamma^{k+1}$.
The worst-case MDP of [37, Theorem 3] and our worst-case MDP primarily differ in the rewards, while the states and the transition probabilities are almost the same.) The so-called span condition of Theorem 5 is arguably very natural and is satisfied by standard VI and Anc-VI. The span condition is commonly used in the construction of complexity lower bounds on first-order optimization methods [59, 23, 25, 13, 14, 65] and has been used in the prior state-of-the-art lower bound for standard VI [37, Theorem 3]. However, designing an algorithm that breaks the lower bound of Theorem 5 by violating the span condition remains a possibility. In optimization theory, there is precedence of lower bounds being broken by violating seemingly natural and minute conditions [40, 35, 98]. 5 Approximate Anchored Value Iteration In this section, we show that the anchoring mechanism is robust against evaluation errors of the Bellman operator, just as much as the standard approximate VI. Let 0 < γ < 1 and let T be the Bellman optimality operator. The Approximate Anchored Value Iteration (Apx-Anc-VI) is U k ϵ = T U k 1 + ϵk 1 U k = βk U 0 + (1 βk)U k ϵ (Apx-Anc-VI) for k = 1, 2, . . . , where βk = 1/(Pk i=0 γ 2i), U 0 is an initial point, and the {ϵk} k=0 is the error sequence modeling approximate evaluations of T . Of course, the classical Approximate Value Iteration (Apx-VI) is U k = T U k 1 + ϵk 1 (Apx-VI) for k = 1, 2, . . . , where U 0 is an initial point. Fact 1 (Classical result, [11, p.333]). Let 0 < γ < 1 be the discount factor. Let T be the Bellman optimality for V or Q. Let U be the fixed point of T . Then Apx-VI exhibits the rate T U k U k (1 + γ)γk U 0 U + (1 + γ) 1 γk 1 γ max 0 i k 1 ϵi for k = 1, 2, . . . . Theorem 6. Let 0 < γ < 1 be the discount factor. Let T and ˆT respectively be the Bellman optimality and anti-optimality operators for V or Q. Let U and ˆU respectively be the fixed points of T and ˆT . Then Apx-Anc-VI exhibits the rate γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 max n U 0 U , U 0 ˆU + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 ϵi for k = 1, 2, . . . . If, furthermore, U 0 T U 0, then (Apx-Anc-VI) exhibits the rate γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 ˆU + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 for k = 1, 2, . . . . The dependence on max ϵi of Apx-Anc-VI is no worse than that of Apx-VI. In this sense, Apx-Anc-VI is robust against evaluation errors of the Bellman operator, just as much as the standard Apx-VI. Finally, we note that a similar analysis can be done for Apx-Anc-VI with the Bellman consistency operator. 6 Gauss Seidel Anchored Value Iteration In this section, we show that the anchoring mechanism can be combined with Gauss Seidel-type updates in finite state-action spaces. Let 0 < γ < 1 and let T : Rn Rn be the Bellman optimality operator. Define T GS : Rn Rn as T GS = T n T 2 T 1 , where T j : Rn Rn is defined as T j (U) = (U1, . . . , Uj 1, (T (U))j , Uj+1, . . . , Un) for j = 1, . . . , n. Fact 2. [Classical result, [69, Theorem 6.3.4]] T GS is a γ-contractive operator and has the same fixed point as T . The Gauss Seidel Anchored Value Iteration (GS-Anc-VI) is U k = βk U 0 + (1 βk)T GSU k 1 (GS-Anc-VI) for k = 1, 2, . . . , where βk = 1/(Pk i=0 γ 2i) and U 0 is an initial point. Theorem 7. Let the state and action spaces be finite sets. Let 0 < γ < 1 be the discount factor. Let T and ˆT respectively be the Bellman optimality and anti-optimality operators for V or Q. Let U and ˆU respectively be the fixed points of T and ˆT . 
Then GS-Anc-VI exhibits the rate T GSU k U k γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 max n U 0 U , U 0 ˆU for k = 0, 1, . . . . If, furthermore, U 0 T GSU 0 or U 0 T GSU 0, then GS-Anc-VI exhibits the rate T GSU k U k γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 U if U 0 T GSU 0 T GSU k U k γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 ˆU if U 0 T GSU 0 for k = 0, 1, . . . . We point out that GS-Anc-VI cannot be directly extended to infinite action spaces since Hahn Banach extension theorem is not applicable in the Gauss Seidel setup. Furthermore, we note that a similar analysis can be carried out for GS-Anc-VI with the Bellman consistency operator. 7 Conclusion We show that the classical value iteration (VI) is, in fact, suboptimal and that the anchoring mechanism accelerates VI to be optimal in the sense that the accelerated rate matches a complexity lower bound up to a constant factor of 4. We also show that the accelerated iteration provably converges to a fixed point even when γ = 1, if a fixed point exists. Being able to provide a substantive improvement upon the classical VI is, in our view, a surprising contribution. One direction of future work is to study the empirical effectiveness of Anc-VI. Another direction is to analyze Anc-VI in a model-free setting and, more broadly, to investigate the effectiveness of the anchor mechanism in more practical RL methods. Our results lead us to believe that many of the classical foundations of dynamic programming and reinforcement learning may be improved with a careful examination based on an optimization complexity theory perspective. The theory of optimal optimization algorithms has recently enjoyed significant developments [44, 43, 45, 98, 66], the anchoring mechanism being one such example [49, 65], and the classical DP and RL theory may benefit from a similar line of investigation on iteration complexity. Acknowledgments and Disclosure of Funding This work was supported by the the Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT) [NO.2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] and the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02). We thank Jisun Park for providing valuable feedback. [1] M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for nonsymmetric affine fixed point problems and application to markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199 232, 2022. [2] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the Association for Computing Machinery, 12(4):547 560, 1965. [3] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997. [4] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011. [5] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133 181, 1922. [6] M. Barré, A. Taylor, and A. d Aspremont. Convergence of a constrained vector extrapolation scheme. SIAM Journal on Mathematics of Data Science, 4(3):979 1002, 2022. [7] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. Association for the Advancement of Artificial Intelligence, 2016. [8] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679 684, 1957. [9] D. 
Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015. [10] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. 4th edition, 2012. [11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1995. [12] W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021. [13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1 2):71 120, 2020. [14] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1 2):315 355, 2021. [15] A.-L. Cauchy. Méthode générale pour la résolution des systemes d équations simultanées. Comptes rendus de l Académie des Sciences, 25:536 538, 1847. [16] V. Colao and G. Marino. On the rate of convergence of Halpern iterations. Journal of Nonlinear and Convex Analysis, 22(12):2639 2646, 2021. [17] J. P. Contreras and R. Cominetti. Optimal error bounds for non-expansive fixed-point iterations in normed spaces. Mathematical Programming, 199(1 2):343 374, 2022. [18] P. Dai, D. S. Weld, J. Goldsmith, et al. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181 209, 2011. [19] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105:589 608, 2000. [20] J. Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. Conference on Learning Theory, 2020. [21] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668 1697, 2022. [22] Q. Dong, H. Yuan, Y. Cho, and T. M. Rassias. Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings. Optimization Letters, 12(1):87 102, 2018. [23] Y. Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1 16, 2017. [24] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent. International Conference on Machine Learning, 2020. [25] Y. Drori and A. Taylor. On the oracle complexity of smooth strongly convex minimization. Journal of Complexity, 68, 2022. [26] Y. Drori and M. Teboulle. An optimal variant of Kelley s cutting-plane method. Mathematical Programming, 160(1 2):321 351, 2016. [27] M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021. [28] M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020. [29] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503 556, 2005. [30] D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electric Power Systems, 3(1), 2005. [31] A.-m. Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021. [32] A.-m. Farahmand, C. 
Szepesvári, and R. Munos. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010. [33] M. Fellows, K. Hartikainen, and S. Whiteson. Bayesian Bellman operators. Neural Information Processing Systems, 2021. [34] M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018. [35] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. Conference on Learning Theory, 2020. [36] G. J. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995. [37] V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517 535, 2022. [38] J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-newton methods for MDPs. ar Xiv:2104.10677, 2021. [39] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957 961, 1967. [40] R. Hannah, Y. Liu, D. O Connor, and W. Yin. Breaking the span assumption yields fast finite-sum minimization. Neural Information Processing Systems, 2018. [41] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence, 2018. [42] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):383 405, 2019. [43] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1 2):57 87, 2021. [44] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1 2):81 107, 2016. [45] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192 219, 2021. [46] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. Neural Information Processing Systems, 2021. [47] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021. [48] L. Leustean. Rates of asymptotic regularity for Halpern iterations of nonexpansive mappings. Journal of Universal Computer Science, 13(11):1680 1691, 2007. [49] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2):405 418, 2021. [50] M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg. Value iteration in continuous actions, states and time. International Conference on Machine Learning, 2021. [51] P.-E. Maingé. Convergence theorems for inertial KM-type algorithms. Journal of Computational and Applied Mathematics, 219(1):223 236, 2008. [52] A. massoud Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. American Control Conference, 2009. [53] H. B. Mc Mahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005. [54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. 
Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529 533, 2015. [55] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103 130, 1993. [56] R. Munos. Error bounds for approximate value iteration. Association for the Advancement of Artificial Intelligence, 2005. [57] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815 857, 2008. [58] A. S. Nemirovski. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153 175, 1992. [59] Y. Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018. [60] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773 810, 2021. [61] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2). Doklady Akademii Nauk SSSR, 269(3):543 547, 1983. [62] S. Niu, S. Chen, H. Guo, C. Targonski, M. Smith, and J. Kovaˇcevi c. Generalized value iteration networks: Life beyond lattices. Association for the Advancement of Artificial Intelligence, 2018. [63] C. Nota and P. Thomas. Is the policy gradient a gradient? International Conference on Autonomous Agents and Multiagent Systems, 2020. [64] C. Park, J. Park, and E. K. Ryu. Factor- 2 acceleration of accelerated gradient methods. Applied Mathematics & Optimization, 2023. [65] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022. [66] J. Park and E. K. Ryu. Accelerated infeasibility detection of constrained optimization and fixed-point iterations. International Conference on Machine Learning, 2023. [67] M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. ar Xiv:2211.14998, 2022. [68] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437 454, 1993. [69] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994. [70] S. Reich. Strong convergence theorems for resolvents of accretive operators in Banach spaces. Journal of Mathematical Analysis and Applications, 75(1):287 292, 1980. [71] E. K. Ryu, K. Yuan, and W. Yin. Ode analysis of stochastic gradient methods with optimism and anchoring for minimax problems. ar Xiv:1905.10899, 2019. [72] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640 660, 2017. [73] A. Salim, L. Condat, D. Kovalev, and P. Richtárik. An optimal algorithm for strongly convex minimization under affine constraints. International Conference on Artificial Intelligence and Statistics, 2022. [74] D. Scieur, A. d Aspremont, and F. Bach. Regularized nonlinear acceleration. Mathematical Programming, 179(1 2):47 83, 2020. [75] Y. Shehu. Convergence rate analysis of inertial Krasnoselskii Mann type iteration with applications. Numerical Functional Analysis and Optimization, 39(10):1077 1091, 2018. [76] W. Shi, S. Song, H. Wu, Y.-C. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. 
Neural Information Processing Systems, 2019. [77] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for Markov decision processes. Operations Research, 58(1):193 202, 2010. [78] J. J. Suh, J. Park, and E. K. Ryu. Continuous-time analysis of anchor acceleration. Neural Information Processing Systems, 2023. [79] K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021. [80] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9 44, 1988. [81] R. S. Sutton and A. G. Barto. Reinforcement Learning: An introduction. MIT press, 2nd edition, 2018. [82] R. S. Sutton, D. Mc Allester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 1999. [83] Q. Sykora, M. Ren, and R. Urtasun. Multi-agent routing value iteration network. International Conference on Machine Learning, 2020. [84] C. Szepesvári. Algorithms for Reinforcement Learning. Springer, 1st edition, 2010. [85] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. Neural Information Processing Systems, 2016. [86] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1-2):557 594, 2023. [87] S. Tosatto, M. Pirotta, C. d Eramo, and M. Restelli. Boosted fitted Q-iteration. International Conference on Machine Learning, 2017. [88] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185 202, 1994. [89] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence, 2016. [90] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234 244, 2006. [91] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49 54, 2018. [92] N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020. [93] H. F. Walker and P. Ni. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715 1735, 2011. [94] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279 292, 1992. [95] D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6(25):851 881, 2005. [96] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Archiv der Mathematik, 58(5):486 491, 1992. [97] H.-K. Xu. Iterative algorithms for nonlinear operators. Journal of the London Mathematical Society, 66(1):240 256, 2002. [98] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with O(1/k2) rate on squared gradient norm. International Conference on Machine Learning, 2021. [99] T. Yoon and E. K. Ryu. Accelerated minimax algorithms flock together. ar Xiv:2205.11093, 2022. [100] Y. Zeng, F. Feng, and W. Yin. Async QVI: Asynchronous-parallel Q-value iteration for discounted Markov decision processes with near-optimal sample complexity. 
International Conference on Artificial Intelligence and Statistics, 2020. [101] J. Zhang, B. O Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170 3197, 2020. [102] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. International Conference on Artificial Intelligence and Statistics, 2022. A Preliminaries For notational unity, we use the symbol U when both V and Q can be used. Lemma 1. [10, Lemma 1.1.1] Let 0 < γ 1. If U U, then T πU T π U, T U T U. Lemma 2. Let 0 < γ 1. For any policy π, Pπ is a nonexpansive linear operator such that if U U, PπU Pπ U. Proof. If r(s, a) = 0 for all s S and a A, T π = γPπ. Then by Lemma 1 and γ-contraction of T π, we have the desired result. Lemma 3. Let 0 < γ < 1. Let T and ˆT respectively be the Bellman optimality and anti-optimality operators. Let U and ˆU respectively be the fixed points of T and ˆT . Then ˆU U . Proof. By definition, ˆU = ˆT ˆU T ˆU . Thus, ˆU limm (T )m ˆU = U . B Omitted proofs in Section 2 First, we prove the following lemma by induction. Lemma 4. Let 0 < γ 1, and if γ = 1, assume a fixed point U π exists. For the iterates {U k}k=0,1,... of Anc-VI, T πU k U k = h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π) + Πk j=1(1 βj) (γPπ)k+1 (U 0 U π) where Πk j=k+1(1 βj) = 1 and β0 = 1. Proof. If k = 0, we have T πU 0 U 0 = T πU 0 U π (U 0 U π) = T πU 0 T πU π (U 0 U π) = γPπ U 0 U π (U 0 U π) If k = m, since T π is a linear operator, T πU m U m = T πU m (1 βm)T πU m 1 βm U 0 = T πU m (1 βm)T πU m 1 βm U π βm(U 0 U π) = T πU m (1 βm)T πU m 1 βm T πU π βm(U 0 U π) = γPπ(U m (1 βm)U m 1 βm U π) βm(U 0 U π) = γPπ(βm(U 0 U π) + (1 βm)(T πU m 1 U m 1)) βm(U 0 U π) = (1 βm)γPπ m 1 X h (βi βi 1(1 βi)) Πm 1 j=i+1(1 βj) (γPπ)m 1 i+1 (U 0 U π) i (1 βm)γPπβm 1(U 0 U π) + (1 βm)γPπ Πm 1 j=1 (1 βj) (γPπ)m (U 0 U π) + βmγPπ(U 0 U π) βm(U 0 U π) h (βi βi 1(1 βi)) Πm j=i+1(1 βj) (γPπ)m i+1 (U 0 U π) i βm 1(1 βm)γPπ(U 0 U π) + βmγPπ(U 0 U π) βm(U 0 U π) + Πm j=1(1 βj) (γPπ)m+1 (U 0 U π) h (βi βi 1(1 βi)) Πm j=i+1(1 βj) (γPπ)m i+1 (U 0 U π) i βm(U 0 U π) + Πm j=1(1 βj) (γPπ)m+1 (U 0 U π) Now, we prove the first rate of Theorem 1. Proof of first rate in Theorem 1. Taking -norm both sides of equality in Lemma 4, we get i=1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) + βk U 0 U π + Πk i=1(1 βi) (γPπ)k+1 (U 0 U π) i=1 γk i+1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) + βk + γk+1Πk j=1(1 βj) i=1 γk+i 1 1 γ2 2 1 γ(2k+2) + γ2k 1 γ2 1 γ2k+2 + γk+1 1 γ2 γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 U 0 U π , where the first inequality comes from triangular inequality, second inequality is from Lemma 2, and equality come from calculations. For the second rate of Theorem 1, we introduce following lemma. Lemma 5. Let 0 < γ < 1. Let T be Bellman consistency or optimality operator. For the iterates {U k}k=0,1,... of Anc-VI, if U 0 TU 0, then Uk 1 Uk TUk 1 TUk U for 1 k. Also, if U 0 TU 0, then Uk 1 Uk TUk 1 TUk U for 1 k. Proof. First, let U 0 TU 0. If k = 1, U 0 β1U 0 + (1 β1)TU 0 = U1 TU 0 by assumption. Since U 0 U 1, TU 0 TU 1 by monotonicity of Bellman consistency and optimality operators. By induction, U k = βk U 0 + (1 βk)TU k 1 TU k 1, and since βk βk 1, βk U 0 + (1 βk)TU k 1 βk 1U 0 + (1 βk 1)TU k 1 βk 1U 0 + (1 βk 1)TU k 2 Also, U k 1 U k implies TU k 1 TU k by monotonicity of Bellman consistency and optimality operators, and U k TU k implies that U k limm (T)m U k = U for all k = 0, 1, . . . . 
Now, suppose U 0 TU 0. If k = 1, U 0 β1U 0 + (1 β1)TU 0 = U1 TU 0 by assumption. Since U 0 U 1, TU 0 TU 1 by monotonicity of Bellman consistency and optimality operators. By induction, U k = βk U 0 + (1 βk)TU k 1 TU k 1, and since βk βk 1, βk U 0 + (1 βk)TU k 1 βk 1U 0 + (1 βk 1)TU k 1 βk 1U 0 + (1 βk 1)TU k 2 Also, U k 1 U k implies TU k 1 TU k by monotonicity of Bellman consistency and optimality operators, and Uk TUk implies that U k limm (T)m U k = U for all k = 0, 1, . . . . Now, we prove following key lemmas. Lemma 6. Let 0 < γ 1, and assume a fixed point U π exists if γ = 1. For the iterates {U k}k=0,1,... of Anc-VI, if U 0 U π, h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π), where Πk j=k+1(1 βj) = 1 and β0 = 1. Lemma 7. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of Anc-VI, if U 0 T πU 0, h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π), where Πk j=k+1(1 βj) = 1 and β0 = 1. Proof of Lemma 6. If U 0 U π, we get T πU k U k = h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π) + Πk j=1(1 βj) (γPπ)k+1 (U 0 U π) h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π), by Lemma 4 and the fact that Πk j=1(1 βj) (γPπ)k+1 (U 0 U π) 0. Proof of Lemma 7. If U 0 TU 0, U 0 U π 0 by Lemma 5. Hence, by Lemma 4, we have h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π), since 0 Πk j=1(1 βj) (γPπ)k+1 (U 0 U π). Now, we prove the second rates of Theorem 1. Proof of second rates in Theorem 1. Let 0 < γ < 1. By Lemma 5, if U 0 T πU 0, then U 0 U π. Hence, 0 T πU k U k h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π), by Lemma 6. Taking -norm both sides, we have γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 U π . Otherwise, if U 0 TU 0, U k TU k by Lemma 5. Since 0 T πU k U k h (βi βi 1(1 βi)) Πk j=i+1(1 βj) (γPπ)k i+1 (U 0 U π) i βk(U 0 U π), by Lemma 7, taking -norm both sides, we obtain same rate as before. Lastly, Taylor series expansion for both rates at γ = 1 is γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 = 2 k + 1 k 1 k + 1(γ 1) + O((γ 1)2), γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 = 1 k + 1 k k + 1(γ 1) + O((γ 1)2). For the analyses of Anc-VI for Bellman optimality operator, we first prove following two lemmas. Lemma 8. Let 0 < γ 1. If γ = 1, assume a fixed point U exists. Then, if 0 α 1 and U (1 α) U αU U, there exist nonexpansive linear operator PH such that T U (1 α)T U αT U γPH U. Lemma 9. Let 0 < γ < 1. If 0 α 1 and U U (1 α) U α ˆU , then there exist nonexpansive linear operator ˆPH such that γ ˆPH( U) T U αT U (1 α) ˆT ˆU . Proof of Lemma 8. First, let U = V, U = V , U = V , U = V , and V (1 α) V αV V . If action space is finite, T V (1 α)T V αT V T πV (1 α)T π V αT πV = γPπ V (1 α) V αV where π is the greedy policy satisfying T πV = T V , first inequality is from T π V T V and T πV T V , and second inequality comes from Lemma 1. Thus, we can conclude PH = Pπ. Otherwise, if action space is infinite, define P(c V ) = c sups S V (s) for c R and previously given V . Let M be linear space spanned by V with -norm. Then, P is linear functional on M and P op 1 since |c sups S V (s)| c V 1. Due to Hahn Banach extension Theorem, there exist linear functional Ph : F(S) R with Ph( V ) = sups S V (s) and Ph op 1. Furthermore, we can define PH : F(S) F(S) such that PHV (s) = Ph(V ) for all s S. Then, since PH(V ) = |Ph(V )| Ph op 1 for V 1, we have PH 1. Therefore, PH is nonexpansive linear operator in -norm. 
Then, T V (s) (1 α)T V (s) αT V (s) r(s, a) + γEs P ( | s,a) [V (s )] sup a A (1 α)r(s, a) + (1 α)γEs P ( | s,a) h V (s ) i αr(s, a) + αγEs P ( | s,a) [V (s )] r(s, a) + γEs P ( | s,a) [V (s )] (1 α)r(s, a) (1 α)γEs P ( | s,a) h V (s ) i αr(s, a) + αγEs P ( | s,a) [V (s )] Es P ( | s,a) h V (s ) (1 α) V (s ) αV (s ) i γ sup s S {V (s ) (1 α) V (s ) αV (s )} γ sup s S V (s ). for all s S. Therefore, we have T V (1 α)T V αT V γPH( V ). Similarly, let U = Q, U = Q, U = Q , U = Q, and Q (1 α) Q αQ Q. If action space is finite, T Q (1 α)T Q αT Q γPπ Q (1 α) Q αQ where π is the greedy policy satisfying T πQ = T Q, first inequality is from T π Q T Q and T πQ T Q , and second inequality comes from Lemma 1. Then, we can conclude PH = Pπ. Otherwise, if action space is infinite, define P(c Q) = c sup(s ,a ) S A Q(s , a ) for c R and previously given Q. Let M be linear space spanned by Q with -norm. Then, P is linear functional on M and P op 1. Due to Hahn Banach extension Theorem, there exist linear functional Ph : F(S A) R with Ph( Q) = sup(s ,a ) S A Q(s , a ) and Ph op 1. Furthermore, we can define PH : F(S A) F(S A) such that PHQ(s, a) = Ph(Q) for all (s, a) S A and PH 1. Therefore, PH is nonexpansive linear operator in -norm. Then, T Q(s, a) (1 α)T Q(s, a) αT Q (s, a) = r(s, a) + γEs P ( | s,a) sup a A Q(s , a ) (1 α)r(s, a) (1 α)γEs P ( | s,a) sup a A Q(s , a ) αr(s, a) αγEs P ( | s,a) sup a A Q (s , a ) γEs P ( | s,a) n Q(s , a ) (1 α) Q(s , a ) o γEs P ( | s,a) sup a A αQ(s , a ) γEs P ( | s,a) n Q(s , a ) (1 α) Q(s , a ) αQ (s , a ) o γ sup (s ,a ) S A n Q(s , a ) (1 α) Q(s , a ) αQ (s , a ) o , γ sup (s ,a ) S A Q(s , a ) for all (s, a) S A. Therefore, we have T Q (1 α)T Q αT Q γPH( Q). Proof of Lemma 9. Note that ˆT is Bellman anti-optimality operators for V or Q, and ˆU is the fixed point of ˆT . First, let U = V, U = V , ˆU = ˆV , U = V , and V V (1 α) V α ˆV . Then, T V (s) (1 α)T V (s) α ˆT ˆV (s) r(s, a) + γEs P ( | s,a) [V (s )] sup a A (1 α)r(s, a) + (1 α)γEs P ( | s,a) h V (s ) i αr(s, a) + αγEs P ( | s,a) h ˆV (s ) i r(s, a) + γEs P ( | s,a) [V (s )] (1 α)r(s, a) (1 α)γEs P ( | s,a) h V (s ) i αr(s, a) + αγEs P ( | s,a) h ˆV (s ) i Es P ( | s,a) h V (s ) (1 α) V (s ) α ˆV (s ) i . Then, if action space is finite, T V (1 α)T V αT V γP ˆπ V (1 α) V α ˆV where ˆπ is the policy satisfying ˆπ( | s) = argmina A Es P ( | s,a) h V (s ) (1 α) V (s ) α ˆV (s ) i and second inequality comes from Lemma 1. Thus, we can conclude PH = Pπ. Otherwise, if action space is infinite, define ˆP(c V ) = c infs S V (s) for c R and previously given V . Let M be linear space spanned by V with -norm. Then, ˆP is linear functional on M and ˆP op 1 since |c infs S V (s)| c V 1. Due to Hahn Banach extension Theorem, there exist linear functional ˆPh : F(S) R with ˆPh( V ) = infs S V (s) and ˆPh op 1. Furthermore, we can define ˆPH : F(S) F(S) such that ˆPHV (s) = ˆPh(V ) for all s S. Then ˆPH 1 since ˆPH(V ) = | ˆPh(V )| ˆPh op 1 for V 1. . Thus, ˆPH is nonexpansive linear operator in -norm. Then, we have T V (s) (1 α)T V (s) α ˆT ˆV (s) γ inf a A Es P ( | s,a) h V (s ) (1 α) V (s ) α ˆV (s ) i γ inf s S{V (s ) (1 α) V (s ) α ˆV (s )} γ inf s S{ V (s )} for all s S. Therefore, we have γ ˆPH( V ) T V (s) (1 α)T V (s) α ˆT ˆV (s). Similarly, let U = Q, U = Q, ˆU = ˆQ , U = Q, and Q Q (1 α) Q α ˆQ . 
Then, T Q(s, a) αT Q(s, a) (1 α) ˆT ˆQ (s, a) = r(s, a) + γEs P ( | s,a) sup a A Q(s , a ) (1 α)r(s, a) (1 α)γEs P ( | s,a) sup a A Q(s , a ) αr(s, a) αγEs P ( | s,a) inf a A ˆQ (s , a ) γEs P ( | s,a) n Q(s , a ) (1 α) Q(s , a ) o γEs P ( | s,a) inf a A α ˆQ(s , a ) γEs P ( | s,a) n Q(s , a ) (1 α) Q(s , a ) α ˆQ (s , a ) o . Hence, if action space is finite, T Q (1 α)T Q αT Q γP ˆπ Q (1 α) Q αQ , where ˆπ is the policy satisfying ˆπ( | s) = argmina A Es P ( | s,a) h Q(s ) (1 α) Q(s ) αQ (s ) i and second inequality comes from Lemma 1. Then, we can conclude PH = P ˆπ. Otherwise, if action space is infinite, define ˆP(c Q) = c inf(s ,a ) S A Q(s , a ) for c Rn and previously given Q. Let M be linear space spanned by Q with -norm. Then, P is linear functional on M with ˆP op 1. Due to Hahn Banach extension Theorem, there exist linear functional ˆPh : F(S A) R with ˆPh( Q) = inf(s ,a ) S A Q(s , a ) and ˆPh op 1. Furthermore, we can define ˆPH : F(S A) F(S A) such that PHQ(s, a) = ˆPh(Q) for all (s, a) S A and ˆPH 1. Thus ˆPH is nonexpansive linear operator in -norm. Then, we have T Q(s, a) αT Q(s, a) (1 α) ˆT ˆQ (s, a) γEs P ( | s,a) n Q(s , a ) (1 α) Q(s , a ) α ˆQ (s , a ) o γ inf (s ,a ) S A n Q(s , a ) (1 α) Q(s , a ) α ˆQ (s , a ) o γ inf (s ,a ) S A Q(s , a ), for all (s, a) S A. Therefore, we have γ ˆPH( Q) T Q (1 α)T Q α ˆT ˆQ . Now, we present our key lemmas for the first rate of Theorem 2. Lemma 10. Let 0 < γ 1. If γ = 1, assume a fixed point U exists. For the iterates {U k}k=0,1,... of Anc-VI, there exist nonexpansive linear operators {Pl}l=0,1,...,k such that (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk(U 0 U ) + Πk j=1(1 βj) Π0 l=kγPl (U 0 U ) where Πk j=k+1(1 βj) = 1 and β0 = 1. Lemma 11. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of Anc-VI, there exist nonexpansive linear operators { ˆPl}l=0,1,...,k such that h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ) + Πk j=1(1 βj) Π0 l=kγ ˆPl (U 0 ˆU ), where Πk j=k+1(1 βj) = 1 and β0 = 1. We prove previous lemmas by induction. Proof of Lemma 10. If k = 0, T U 0 U 0 = T U 0 U (U 0 U ) = T U 0 T U (U 0 U ) γP0(U 0 U ) (U 0 U ). where inequality comes from first inequality in Lemma 8 with α = 1, U = U 0, U = U 0 U . By induction, U k (1 βk)U k 1 βk U = βk U 0 U + (1 βk)(T U k 1 U k 1) (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γPl (U 0 U ) (1 βk)βk 1(U 0 U ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + βk(U 0 U ), and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk U 0 = T U k (1 βk)T U k 1 βk U βk(U 0 U ) = T U k (1 βk)T U k 1 βk T U βk(U 0 U ) (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γPl (U 0 U ) (1 βk)βk 1(U 0 U ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + βk(U 0 U ) βk(U 0 U ) (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk 1(1 βk)γPk(U 0 U ) + βkγPk U 0 U βk(U 0 U ) + Πk j=1(1 βj) Π0 l=kγPl (U 0 U ) (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk(U 0 U ) + Πk j=1(1 βj) Π0 l=kγPl (U 0 U ). where inequality comes from first inequality in Lemma 8 with α = βk, U = U k, U = U k 1, and previously defined U. Proof of Lemma 11. Note that ˆT is Bellman anti-optimality operators for V or Q, and ˆU is the fixed point of ˆT . If k = 0, T U 0 U 0 = T U 0 ˆU (U 0 ˆU ) = T U 0 ˆT ˆU (U 0 ˆU ) γ ˆP0(U 0 ˆU ) (U 0 ˆU ). where inequality comes from second inequality in Lemma 9 with α = 1, U = U 0, U = U 0 ˆU . 
By induction, U k (1 βk)U k 1 βk ˆU = βk(U 0 ˆU ) + (1 βk)(T U k 1 U k 1) h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i (1 βk)βk 1(U 0 ˆU ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γ ˆPl (U 0 ˆU ) + βk(U 0 ˆU ), and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk U 0 = T U k (1 βk)T U k 1 βk ˆU βk(U 0 ˆU ) = T U k (1 βk)T U k 1 βk ˆT ˆU βk(U 0 ˆU ) γ ˆPk (1 βk) h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i (1 βk)βk 1(U 0 ˆU ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γ ˆPl (U 0 ˆU ) + βk(U 0 ˆU ) βk(U 0 ˆU ) h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk 1(1 βk)γ ˆPk(U 0 ˆU ) + βkγ ˆPk U 0 ˆU βk(U 0 ˆU ) + Πk j=1(1 βj) Π0 l=kγ ˆPl (U 0 ˆU ) h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ) + Πk j=1(1 βj) Π0 l=kγ ˆPl (U 0 ˆU ). where inequality comes from second inequality in Lemma 9 with α = βk, U = U k, U = U k 1, and previously defined U. Now, we prove the first rate of Theorem 2. Proof of first rate in Theorem 2. Since B1 A B2 implies A sup{ B1 , B2 } for A, B F(X), if we take right side first inequality of Lemma 10, we have i=1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) + βk U 0 U π + Πk j=1(1 βj) Π0 l=kγPl (U 0 U ) i=1 γk i+1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) + βk + γk+1Πk j=1(1 βj) γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 U 0 U , where the first inequality comes from triangular inequality, second inequality is from nonexpansiveness of Pl, and last equality comes from calculations. If we take right side of second inequality of Lemma 10, similarly, we have i=1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) + βk U 0 U π + Πk j=1(1 βj) Π0 l=kγ ˆPl (U 0 ˆU ) i=1 γk i+1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) + βk + γk+1Πk j=1(1 βj) γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 where the first inequality comes from triangular inequality, second inequality is from from nonexpansiveness of ˆPl, and last equality comes from calculations. Therefore, we conclude γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 max n U 0 U , U 0 ˆU Next, for the second rate in Theorem 2, we prove following lemmas by induction. Lemma 12. Let 0 < γ 1. If γ = 1, assume a fixed point U exists. For the iterates {U k}k=0,1,... of Anc-VI, if T U 0 U , there exist nonexpansive linear operators {Pl}l=0,1,...,k such that (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk(U 0 U ) where Πk j=k+1(1 βj) = 1 and β0 = 1. Lemma 13. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of Anc-VI, if U 0 T U 0, there exist nonexpansive linear operators { ˆPl}l=0,1,...,k such that h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ), where Πk j=k+1(1 βj) = 1 and β0 = 1. Proof of Lemma 12. If k = 0, T U 0 U 0 = T U 0 U (U 0 U ) where the second inequality is from the condition. By induction, U k (1 βk)U k 1 βk U (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γPl (U 0 U ) (1 βk)βk 1(U 0 U ) + βk(U 0 U ), and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk T U βk(U 0 U ) (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γPl (U 0 U ) (1 βk)βk 1(U 0 U ) + βk(U 0 U ) βk(U 0 U ) (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk(U 0 U ), where inequality comes from first inequality in Lemma 8 with α = βk, U = U k, U = U k 1, and previously defined U. Proof of Lemma 13. If k = 0, T U 0 U 0 = T U 0 ˆU (U 0 ˆU ) where the second inequality is from the fact that U 0 T U 0 implies T U 0 U by Lemma 5 and U ˆU by Lemma 3. 
By induction, U k (1 βk)U k 1 βk ˆU h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i (1 βk)βk 1(U 0 ˆU ) + βk(U 0 ˆU ), and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk ˆT ˆU βk(U 0 ˆU ) γ ˆPk (1 βk) h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i (1 βk)βk 1(U 0 ˆU ) + βk(U 0 ˆU ) βk(U 0 ˆU ) h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ), where inequality comes from second inequality in Lemma 9 with α = βk, U = U k, U = U k 1, and previously defined U. Now, we prove the second rates of Theorem 2. Proof of second rates in Theorem 2. Let 0 < γ < 1. Then, if U 0 T U 0, then T U 0 U and U k T U k by Lemma 5. Hence, taking -norm both sides of first inequality in Lemma 12, we have T U k U k γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 U . Otherwise, if U 0 TU 0, U k TU k by Lemma 5. taking -norm both sides of second inequality in Lemma 13, we have γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 C Omitted proofs in Section 3 First, we present the following lemma. Lemma 14. Let γ = 1. Assume a fixed point U exists. For the iterates {U k}k=0,1,... of Anc-VI, U k U U 0 U . Proof. If k = 0, it is obvious. By induction, U k U = βk U 0 + (1 βk)TU k 1 U = (1 βk)(TU k 1 U ) + βk U 0 U (1 βk) TU k 1 U + βk U 0 U (1 βk) U k 1 U + βk U 0 U = U 0 U where the second inequality comes form nonexpansiveness of T. Now, we present the proof of Theorem 3. Proof of Theorem 3. First, if U 0 TU 0, with same argument in proof of Lemma 5, we can show that U k 1 U k TU k 1 TU k for k = 1, 2, . . . . Since fixed point U exists by assumption, Lemma 4 and 10 hold. Note that γ = 1 implies βk = 1 k+1 and if we take -norm both sides for those inequalities in lemmas, by simple calculation, we have TU k U k 2 k + 1 for any fixed point U (since 0 T U k U k, we can get upper bound of T U k U k from Lemma 10). Suppose that there exist {kj}j=0,1,... such that U kj converges to some U . Then, limj (T I)U kj = (T I) U = 0 since T I is continuous. This implies that U is a fixed point. By Lemma 14 and previous argument, U k is increasing and bounded sequence in Rn. Thus, U k has single limit point, some fixed point U . Furthermore, the fact that U 0 TU 0 U implies that Lemma 6 and 12 hold. Therefore, we have TU k U k 1 k + 1 Next, we prove the Theorem 4. Proof of Theorem 4. By same argument in the proof of Theorem 3, if U 0 TU 0, we can show that U k 1 U k TU k 1 TU k for k = 1, 2, . . . ., and TU k U k 2 k + 1 for any fixed point U . Since U k is increasing and bounded by Lemma 14 and previous argument, U k converges pointwise to some U in general action-state space. We now show that TU k also converges pointwise to T U . First, let T be Bellman consistency operator and U = V, U = V π. By monontone convergence theorem, lim k T πV k(s) = lim k Ea π( | s) Es P ( | s,a) r(s, a) + γV k(s ) = Ea π( | s) lim k Es P ( | s,a) r(s, a) + γV k(s ) = Ea π( | s) Es P ( | s,a) r(s, a) + γ lim k V k(s ) = T π V π(s) for any fixed s S. With same argument, case U = Q also holds. If T is Bellman optimality operator, we use following lemma. Lemma 15. Let W, W k F(X) for k = 0, 1, . . . . If W k(x) W k+1(x) for all x X, and {W k}k=0,1,..., converge pointwise to W, then limk {supx W k(x)} = supx W(x). Proof. W k(x) W(x) implies that supx W k(x) supx W(x). If supx W(x) = a, there exist x which satisfying a W(x) < ϵ 2, and by definition of W, there exist W k such that a W k(x) < ϵ for any ϵ > 0. 
If U = V and U = V , by previous lemma and monotone convergence theorem, we have lim k T V k(s) = lim k sup a Es P ( | s,a) r(s, a) + γV k(s ) lim k Es P ( | s,a) r(s, a) + γV k(s ) Es P ( | s,a) r(s, a) + γ lim k V k(s ) for any fixed s S. With similar argument, case U = Q also holds. Since TU k T U and U k U pointwisely, TU k U k converges pointwise to T U U = 0. Thus, U is indeed fixed point of T. Furthermore, the fact that U 0 TU 0 U implies that Lemma 6 and 12 hold. Therefore, we have TU k U k 1 k + 1 D Omitted proofs in Section 4 We present the proof of Theorem 5. Proof of Theorem 5. First, we prove the case U 0 = 0 for n k+2. Consider the MDP (S, A, P, r, γ) such that S = {s1, . . . , sn}, A = {a1}, P(si | sj, a1) = 1{i=j=1, j=i+1}, r(si, a1) = 1{i=2}. Then, T = γPπU + [0, 1, 0, . . . , 0] , U = [0, 1, γ, . . . , γn 2] , and U 0 U = 1. Under the span condition, we can show that U k l = 0 for k + 2 l n by following lemma. Lemma 16. Let T : Rn Rn be defined as before. Then, under span condition, U i 1 = 0 for 0 i k, and U i j = 0 for 0 i k and i + 2 j n. Proof. Case k = 0 is obvious. By induction, U l 1 = 0 for 0 l i 1. Then TU l 1 = 0 for 0 l i 1. This implies that TU l U l 1 = 0 for 0 l i 1. Hence U i 1 = 0. Again, by induction, U l j = 0 for 0 l i 1, l+2 j n. Then TU l j = 0 for 0 l i 1, l + 3 j n and this implies that TU l U l j = 0 for 0 l i 1, l + 3 j n. Therefore, U i j = 0 for i + 2 j n. Then, we get TU k U k = 0, 1 U k 3 , . . . , γ U k k+1 , γ U k k+1 , 0, . . . , 0 | {z } n k 2 and this implies TU k U k 2 + γ 1 TU k U k 3 + + γ k TU k U k Taking the absolute value on both sides, (1 + + γ k) max 1 i n {|TU k U k|i} 1. Therefore, we conclude TU k U k γk Pk i=0 γi U 0 U . Now, we show that for any initial point U 0 Rn, there exists an MDP which exhibits same lower bound with the case U 0 = 0. Denote by MDP(0) and T0 the worst-case MDP and Bellman consistency or opitmality operator constructed for U 0 = 0. Define an MDP(U 0) (S, A, P, r, γ) for U 0 = 0 as S = {s1, . . . , sn}, A = {a1}, P(si | sj, a1) = 1{i=j=1, j=i+1}, r(si, a1) = U 0 PπU 0 i + 1{i=2}. Then, Bellman consistency or optimality operator T satisfies TU = T0(U U 0) + U 0. Let U be fixed point of T0. Then, if U = U +U 0, U is fixed point of T. Furthermore, if {U i}k i=0 satisfies span condition U i U 0 + span{TU 0 U 0, TU 1 U 1, . . . , TU i 1 U i 1}, i = 1, . . . , k, U i = U i U 0 is a sequence satisfying U i U 0 |{z} =0 +span{T0 U 0 U 0, T0 U 1 U 1, . . . , T0 U i 1 U i 1}, i = 1, . . . , k, which is the same span condition in Theorem 5 with respect to T0. This is because TU i U i = T0(U i U 0) (U i U 0) = T U i U i for i = 0, . . . , k. Thus, { U i}k i=0 is a sequence starting from 0 and satisfy the span condition for T0. This implies that TU k U k = T U k U k γk Pk i=0 γi = γk Pk i=0 γi U 0 U . Hence, MDP(U 0) is indeed our desired worst-case instance. Lastly, the fact that U 0 U = U 0 U = (0, 1, γ, . . . , γn 2) implies U 0 U . E Omitted proofs in Section 5 First, we prove following key lemma. Lemma 17. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of Anc-VI, there exist nonexpansive linear operators {Pl}l=0,1,...,k and { ˆPl}l=0,1,...,k such that (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk(U 0 U ) + Πk j=1(1 βj)Π0 l=kγPl(U 0 U ) i=1 Πk j=i(1 βj)Πi+1 l=kγPl I γPi ϵi 1, h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ) + Πk j=1(1 βj)Π0 l=kγ ˆPl(U 0 ˆU ) i=1 Πk j=i(1 βj)Πi+1 l=kγ ˆPl I γ ˆPi ϵi 1, for 1 k, where Πk j=k+1(1 βj) = 1, Πk+1 l=k γPl = Πk+1 l=k γ ˆPl = I, and β0 = 1. 
Proof of Lemma 17. First, we prove the first inequality in Lemma 17 by induction. U 1 (1 β1)U 0 β1U = (1 β1)ϵ0 + β1(U 0 U ) + (1 β1)(T U 0 U 0) (1 β1)ϵ0 + (1 β1)γP0(U 0 U ) + (2β1 1)(U 0 U ), where inequality comes from Lemma 8 with α = 1, U = U 0, U = U 0 U , and let U be the entire right hand side of inequality. Then, we have T U 1 U 1 = T U 1 (1 β1)T U 0 β1U β1(U 0 U ) (1 β1)ϵ0 γP1((1 β1)ϵ0 + (1 β1)γP0(U 0 U ) + (2β1 1)(U 0 U )) β1(U 0 U ) = (1 β1)γP1γP0(U 0 U ) + γP1(2β1 1)(U 0 U ) β1(U 0 U ) (I γP1)(1 β1)ϵ0. where inequality comes from Lemma 8 with α = β1, U = U 1, U = U 0, and previously defined U. By induction, U k (1 βk)U k 1 βk U = βk U 0 U + (1 βk)(T U k 1 U k 1) + (1 βk)ϵk 1 (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γPl (U 0 U ) (1 βk)βk 1(U 0 U ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + βk(U 0 U ) (1 βk) i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γPl I γPi ϵi 1 + (1 βk)ϵk 1, and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk U 0 (1 βk)ϵk 1 = T U k (1 βk)T U k 1 βk T U βk(U 0 U ) (1 βk)ϵk 1 (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γPl (U 0 U ) (1 βk)βk 1(U 0 U ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + βk(U 0 U ) (1 βk) i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γPl I γPi ϵi 1 + (1 βk)ϵk 1 βk(U 0 U ) (1 βk)ϵk 1 (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγPl (U 0 U ) βk(U 0 U ) + Πk j=1(1 βj)Π0 l=kγPl(U 0 U ) i=1 Πk j=i(1 βj)Πi+1 l=kγPl I γPi ϵi 1, where inequality comes from Lemma 8 with α = βk, U = U k, U = U k 1, and previously defined U. Now, we prove second inequality in Lemma 17 by induction. U 1 (1 β1)U 0 β1 ˆU = (1 β1)ϵ0 + β1(U 0 ˆU ) + (1 β1)(T U 0 U 0) (1 β1)ϵ0 + (1 β1)γ ˆP0(U 0 ˆU ) + (2β1 1)(U 0 ˆU ), where inequality comes from Lemma 9 with α = 1, U = U 0, U = U 0 ˆU , and let U be the entire right hand side of inequality. Then, we have T U 1 U 1 = T U 1 (1 β1)T U 0 β1 ˆU β1(U 0 ˆU ) (1 β1)ϵ0 γ ˆP1((1 β1)ϵ0 + (1 β1)γ ˆP0(U 0 ˆU ) + (2β1 1)(U 0 ˆU )) β1(U 0 ˆU ) = (1 β1)γ ˆP1γ ˆP0(U 0 ˆU ) + γ ˆP1(2β1 1)(U 0 ˆU ) β1(U 0 ˆU ) (I γ ˆP1)(1 β1)ϵ0. where inequality comes from Lemma 9 with α = β1, U = U 1, U = U 0, and previously defined U. By induction, U k (1 βk)U k 1 βk ˆU = βk U 0 ˆU + (1 βk)(T U k 1 U k 1) + (1 βk)ϵk 1 h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i (1 βk)βk 1(U 0 ˆU ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γ ˆPl (U 0 ˆU ) + βk(U 0 ˆU ) (1 βk) i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γ ˆPl I γ ˆPi ϵi 1 + (1 βk)ϵk 1, and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk U 0 (1 βk)ϵk 1 = T U k (1 βk)T U k 1 βk T ˆU βk(U 0 ˆU ) (1 βk)ϵk 1 γ ˆPk (1 βk) h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i (1 βk)βk 1(U 0 ˆU ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γ ˆPl (U 0 ˆU ) + βk(U 0 ˆU ) (1 βk) i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γ ˆPl I γ ˆPi ϵi 1 + (1 βk)ϵk 1 βk(U 0 ˆU ) (1 βk)ϵk 1 h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ) + Πk j=1(1 βj)Π0 l=kγ ˆPl(U 0 ˆU ) i=1 Πk j=i(1 βj)Πi+1 l=kγ ˆPl I γ ˆPi ϵi 1, where inequality comes from Lemma 9 with α = βk, U = U k, U = U k 1, and previously defined U Now, we prove the first rate in Theorem 6. Proof of first rate in Theorem 6. 
Since B1 A B2 implies A sup{ B1 , B2 } for A, B F(X), if we take right side of first inequality in Lemma 17, we have γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 U 0 U + (1 + γ) Πk j=i(1 βj) γk i ϵi 1 γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 U 0 U + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 If we apply second inequality of Lemma 17 and take -norm right side, we have γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 U 0 ˆU + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 Therefore, we get γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 max n U 0 U , U 0 ˆU + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 Now, for the second rate in Theorem 6, we present following key lemma. Lemma 18. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of Anc-VI, if U 0 T U 0, there exist nonexpansive linear operators {Pl}l=0,1,...,k and { ˆPl}l=0,1,...,k such that T U k U k Πk j=1(1 βj)Π0 l=kγPl(U 0 U ) i=1 Πk j=i(1 βj)Πi+1 l=kγPl I γPi ϵi 1 βk(U 0 U ), h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ) i=1 Πk j=i(1 βj)Πi+1 l=kγ ˆPl I γ ˆPi ϵi 1, for 1 k, where Πk j=k+1(1 βj) = 1, Πk+1 l=k γPl = Πk+1 l=k γ ˆPl = I, and β0 = 1. Proof of Lemma 18. If U 0 T U 0, U 0 limm (T )m U 0 = U by Lemma 1. By Lemma 3, this also implies U 0 ˆU . First, we prove first inequality in Lemma 18 by induction. If k = 1, U 1 (1 β1)U 0 β1U = (1 β1)ϵ0 + β1(U 0 U ) + (1 β1)(T U 0 U 0) (1 β1)ϵ0 + (1 β1)γP0(U 0 U ) + (2β1 1)(U 0 U ) (1 β1)ϵ0 + (1 β1)γP0(U 0 U ), where the second inequality is from the (2β1 1)(U 0 U ) 0, and first inequality comes from Lemma 8 with α = 1, U = U 0, U = U 0 U , and let U be the entire right hand side of inequality. Then, we have T U 1 U 1 = T U 1 (1 β1)T U 0 β1U β1(U 0 U ) (1 β1)ϵ0 γP1((1 β1)ϵ0 + (1 β1)γP0(U 0 U )) β1(U 0 U ) (1 β1)ϵ0 = (1 β1)γP1γP0(U 0 U ) β1(U 0 U ) (I γP1)(1 β1)ϵ0. where inequality comes from Lemma 8 with α = β1, U = U 1, U = U 0, and previously defined U. By induction, U k (1 βk)U k 1 βk U = βk U 0 U + (1 βk)(T U k 1 U k 1) + (1 βk)ϵk 1 βk(U 0 U ) (1 βk)βk 1(U 0 U ) + (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + (1 βk)ϵk 1 (1 βk) i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γPl I γPi ϵi 1 (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + (1 βk)ϵk 1 i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γPl I γPi ϵi 1, where the second inequality is from βk (1 βk)βk 1 0 and let U be the entire right hand side of inequality. Then, we have = T U k (1 βk)T U k 1 βk T U βk(U 0 U ) (1 βk)ϵk 1 γPk (1 βk) Πk 1 j=1(1 βj) Π0 l=k 1γPl (U 0 U ) + (1 βk)ϵk 1 i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γPl I γPi ϵi 1 βk(U 0 U ) (1 βk)ϵk 1 = Πk j=1(1 βj)Π0 l=kγPl(U 0 U ) i=1 Πk j=i(1 βj)Πi+1 l=kγPl I γPi ϵi 1 βk(U 0 U ), where the first inequality comes from Lemma 8 with α = βk, U = U k, U = U k 1, and previously defined U. For the second inequality in Lemma 18, if k = 1, U 1 (1 β1)U 0 β1 ˆU = (1 β1)ϵ0 + β1(U 0 ˆU ) + (1 β1)(T U 0 U 0) = (1 β1)ϵ0 + β1(U 0 ˆU ) + (1 β1)(T U 0 ˆU (U 0 ˆU )) (1 β1)ϵ0 + β1(U 0 ˆU ) (1 β1)(U 0 ˆU ) where the second inequality is from U 0 T U 0 ˆU , and let U be the entire right hand side of inequality. Then, we have T U 1 U 1 = T U 1 (1 β1)T U 0 β1U β1(U 0 ˆU ) (1 β1)ϵ0 γP1((1 β1)ϵ0 + β1(U 0 ˆU ) (1 β1)(U 0 ˆU )) β1(U 0 ˆU ) (1 β1)ϵ0 = (2β1 1)γP1(U 0 ˆU ) β1(U 0 ˆU ) (I γP1)(1 β1)ϵ0. where inequality comes from Lemma 9 with α = β1, U = U 1, U = U 0, and previously defined U. By induction, U k (1 βk)U k 1 βk ˆU h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i + (βk (1 βk)βk 1)(U 0 ˆU ) + (1 βk)ϵk 1 i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γ ˆPl I γ ˆPi ϵi 1, and let U be the entire right hand side of inequality. 
Then, we have = T U k (1 βk)T U k 1 βk ˆT ˆU βk(U 0 ˆU ) (1 βk)ϵk 1 γ ˆPk (1 βk) h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1γ ˆPl (U 0 ˆU ) i + (βk (1 βk)βk 1)(U 0 ˆU ) (1 βk) i=1 Πk 1 j=i (1 βj)Πi+1 l=k 1γ ˆPl I γ ˆPi ϵi 1 + (1 βk)ϵk 1 (1 βk)ϵk 1 βk(U 0 ˆU ) h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=kγ ˆPl (U 0 ˆU ) i βk(U 0 ˆU ) i=1 Πk j=i(1 βj)Πi+1 l=kγ ˆPl I γ ˆPi ϵi 1, where inequality comes from Lemma 9 with α = βk, U = U k, U = U k 1, and previously defined U. Now, we prove the second rate in Theorem 6. Proof of second rate in Theorem 6. If we take right side of first inequality in Lemma 18, we have γ 1 γ γ (γk+1) 1 γk+1 U 0 U + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 If we apply second inequality of Lemma 18 and take -norm right side, we have γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 ˆU + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 Therefore, we get γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 ˆU + 1 + γ 1 + γk+1 1 γk 1 γ max 0 i k 1 since ˆU U U 0 implies that γ 1 γ γ (γk+1) 1 γk+1 U 0 U γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 F Omitted proofs in Section 6 For the analyses, we first define ˆT GS : Rn Rn as ˆT GS = ˆT n ˆT 2 ˆT 1 , where ˆT j : Rn Rn is defined as ˆT j (U) = (U1, . . . , Uj 1, ˆT (U) j , Uj+1, . . . , Un) for j = 1, . . . , n, where ˆT is Bellman anti-optimality operator. Fact 3. [Classical result, [10, Proposition 1.3.2]] ˆT GS is a γ-contractive operator and has the same fixed point as ˆT . Now, we introduce the following lemmas. Lemma 19. Let 0 < γ < 1. If 0 α 1, then there exist γ-contractive nonnegative matrix PGS such that T GSU (1 α)T GS U αT GSU PGS(U (1 α) U αU ). Lemma 20. Let 0 < γ < 1. If 0 α 1, then there exist γ-contractive nonnegative matrix ˆPGS such that ˆPGS(U (1 α) U α ˆU ) T GSU (1 α)T GS U α ˆT GS ˆU . Proof of Lemma 19. First let U = V, U = V , U = V . For 1 i n, we have T i V (si) (1 α)T i V (si) αT i V (si) T πi i V (si) (1 α)T πi i V (si) αT πi i V (si) = γPπi V (1 α) V αV (si), where πi is the greedy policy satisfying T πi V = T V and first inequality is from T πi V T V and T πi V T V . Then, define matrix Pi as Pi(V ) = (V1, . . . , Vi 1, (γPπi(V ))i , Vi+1, . . . , Vn) for i = 1, . . . , n. Note that Pi is nonnegative matrix since Pπi is nonnegative matrix. Then, we have T i V (1 α)T i V αT i V Pi(V (1 α) V αV ). By induction, there exist a sequence of matrices {Pi}i=1,...,n satisfying T GSV (1 α)T GS V αT GSV Pn P1(V (1 α) V αV ) since T i V = V for all i. Denote PGS as Pn P1. Then, PGS is γ-contractive nonnegative matrix since n X j=1 (PGS)ij = j=1 (Pi P1)ij j=1 (Pi)ij = γ for 1 i n, where first equality is from definition of Pl for i + 1 l n, inequality comes from definition of Pl for 1 l i 1, and last equality is induced by definition of Pi. Therefore, this implies that PGS γ. If U = Q, with similar argument of case U = V , let πi be the greedy policy, define matrix Pi as Pi(Q) = (Q1, . . . , Qi 1, (γPπi(Q))i , Qi+1, . . . , Qn), and denote PGS as Pn P1. Then, PGS is γ-contractive nonnegative matrix satisfying T GSQ (1 α)T GS Q αT GSQ PGS(Q (1 α) Q αQ ). Proof of Lemma 20. First let U = V, U = V , ˆU = ˆV . For 1 i n, we have T i V (si) (1 α)T i V (si) α ˆT i ˆV (si) r(si, a) + γEs P ( | si,a) [V (s )] sup a A (1 α)r(si, a) + (1 α)γEs P ( | si,a) h V (s ) i αr(si, a) + αγEs P ( | si,a) h ˆV (s ) i Es P ( | si,a) h V (s ) (1 α) V (s ) α ˆV (s ) i . Let ˆπi( | s) = argmina A Es P ( | s,a) h V (s ) (1 α) V (s ) α ˆV (s ) i and define matrix ˆPi as ˆPi(V ) = (V1, . . . , Vi 1, γP ˆπi(V ) i , Vi+1, . . . , Vn) for i = 1, . . . , n. 
Note that ˆPi is nonnegative matrix since P ˆπi is nonnegative matrix. Then, we have ˆPi(V (1 α) V α ˆV ) T i V (1 α)T i V αT i ˆV . By induction, there exist a sequence of matrices { ˆPi}i=1,...,n satisfying ˆPn ˆP1(V (1 α) V α ˆV ) T GSV (1 α)T GS V α ˆT GS ˆV , and denote ˆPGS as ˆPn ˆP1. With same argument in proof of Lemma 19, ˆPGS is γ-contractive nonnegative matrix. If U = Q, with similar argument, let ˆπi( | s) = argmina A{Q(s, a) (1 α) Q(s, a) α ˆQ (s, a)} and define matrix ˆPi as Pi(Q) = (U1, . . . , Qi 1, γP ˆπi(Q) i , Qi+1, . . . , Qn). Denote ˆPGS as ˆPn ˆP1. Then, with same argument in proof of Lemma 19, ˆPGS is γ-contractive nonnegative matrix satisfying ˆPGS(Q (1 α) Q α ˆQ ) T GSQ (1 α)T GS Q α ˆT GS ˆQ . Next, we prove following key lemma. Lemma 21. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of (GS-Anc-VI), there exist γ-contractive nonnegative matrices {Pl GS}l=0,1,...,k and { ˆPl GS}l=0,1,...,k such that T GSU k U k (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k Pl GS (U 0 U ) βk(U 0 U ) + Πk j=1(1 βj) Π0 l=k Pl GS (U 0 U ), T GSU k U k h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k ˆPl GS (U 0 ˆU ) i βk(U 0 ˆU ) + Πk j=1(1 βj) Π0 l=k ˆPl GS (U 0 ˆU ), where Πk j=k+1(1 βj) = 1 and β0 = 1. Proof of Lemma 21. First, we prove first inequality in Lemma 21 by induction. T GSU 0 U 0 = T GSU 0 U (U 0 U ) = T GSU 0 T GSU (U 0 U ) P0 GS(U 0 U ) (U 0 U ). where inequality comes from Lemma 19 with α = 1, U = U 0. By induction, T GSU k U k = T GSU k (1 βk)T GSU k 1 βk T GSU βk(U 0 U ) Pk GS(U k (1 βk)U k 1 βk U ) βk(U 0 U ) = Pk GS(βk(U 0 U ) + (1 βk)(T GSU k 1 U k 1)) βk(U 0 U ) (1 βk)Pk GS (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1Pl GS (U 0 U ) βk 1(U 0 U ) + Πk 1 j=1(1 βj) Π0 l=k 1Pl GS (U 0 U ) + βk Pk GS(U 0 U ) βk(U 0 U ) (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k Pl GS (U 0 U ) βk(U 0 U ) + Πk j=1(1 βj) Π0 l=k Pl GS (U 0 U ) where the first inequality comes from Lemma 19 with α = βk, U = U k, U = U k 1, and second inequality comes from nonnegativeness of Pk GS. First, we prove second inequality in Lemma 21 by induction. T GSU 0 U 0 = T GSU 0 ˆU (U 0 ˆU ) = T GSU 0 ˆT GS ˆU (U 0 ˆU ) ˆP0 GS(U 0 ˆU ) (U 0 ˆU ), where inequality comes from Lemma 20 with α = 1, U = U 0. By induction, T GSU k U k = T GSU k (1 βk)T GSU k 1 βk ˆT GS ˆU βk(U 0 ˆU ) ˆPk GS(U k (1 βk)U k 1 βk ˆU ) βk(U 0 ˆU ) = ˆPk GS(βk(U 0 ˆU ) + (1 βk)(T GSU k 1 U k 1)) βk(U 0 ˆU ) (1 βk) ˆPk GS h (βi βi 1(1 βi)) Πk 1 j=i+1(1 βj) Πi l=k 1 ˆPl GS (U 0 ˆU ) i βk 1(U 0 ˆU ) + Πk 1 j=1(1 βj) Π0 l=k 1 ˆPl GS (U 0 ˆU ) + βk ˆPk GS(U 0 ˆU ) βk(U 0 ˆU ) h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k ˆPl GS (U 0 ˆU ) i βk(U 0 ˆU ) + Πk j=1(1 βj) Π0 l=k ˆPl GS (U 0 ˆU ) where the first inequality comes from Lemma 20 with α = βk, U = U k, U = U k 1, and nonnegativeness of ˆPk GS. Now, we prove the first rate in Theorem 7. Proof of first rate in Theorem 7. Since B1 A B2 implies A sup{ B1 , B2 } for A, B F(X), if we take right side of first inequality in Lemma 21, we have i=1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) Πi l=k Pl GS (U 0 U ) + βk U 0 U π + Πk j=1(1 βj) Π0 l=k Pl GS (U 0 U ) i=1 γk i+1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) + βk + γk+1Πk j=1(1 βj) γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 U 0 U , where the first inequality comes from triangular inequality, second inequality is from γ-contraction of Pl GS, and last equality comes from calculations. 
If we take right side of second inequality in Lemma 21, we have k X i=1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) Πi l=k ˆPl GS (U 0 ˆU ) + βk U 0 U π + Πk j=1(1 βj) Π0 l=k ˆPl GS (U 0 ˆU ) i=1 γk i+1 |βi βi 1(1 βi)| Πk j=i+1(1 βj) + βk + γk+1Πk j=1(1 βj) γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 where the first inequality comes from triangular inequality, second inequality is from from γcontraction of ˆPl GS, and last equality comes from calculations. Therefore, we conclude T GSU k U k γ 1 γ 1 + 2γ γk+1 (γk+1) 1 γk+1 max n U 0 U , U 0 ˆU For the second rates of Theorem 7, we introduce following lemma. Lemma 22. Let 0 < γ < 1. For the iterates {U k}k=0,1,... of (GS-Anc-VI), if U 0 T GSU 0, then U k 1 U k T GSU k 1 T GSU k U for 1 k. Also, if U 0 T GSU 0, then U k 1 U k T GSU k 1 T GSU k U for 1 k. Proof. By Fact 3, limm T GSU = U . By definition, if U U, T i U T i U for any 1 i n and this implies that if U U, then T GSU T GS U. Hence, with same argument in proof of Lemma 5, we can obtain desired results. Now, we prove the second rates in Theorem 7. Proof of second rates in Theorem 7. If U 0 T GSU 0, then U 0 U 0 and U k T GSU k by Lemma 22. Hence, by Lemma 21, we get 0 T GSU k U k (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k Pl GS (U 0 U ) βk(U 0 U π) + Πk j=1(1 βj) Π0 l=k Pl GS (U 0 U ) (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k Pl GS (U 0 U ) βk(U 0 U ), where the second inequality follows from Πk j=1(1 βj) Πi l=k Pl GS (U 0 U ) 0. Taking -norm both sides, we have T GSU k U k γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 U 0 U . Otherwise, if U 0 T GSU 0, U k T GSU k and U 0 U ˆU by Lemma 22 and 3. Thus, by Lemma 21, we get 0 T GSU k U k h (βi βi 1(1 βi)) Πk j=i+1(1 βj) Πi l=k ˆPl GS (U 0 ˆU ) i βk(U 0 ˆU ), where the second inequality follows from 0 Πk j=1(1 βj) Π0 l=k ˆPl GS (U 0 ˆU ). Taking -norm both sides, we have T GSU k U k γ 1 γ 1 + γ γk+1 (γk+1) 1 γk+1 G Broader Impacts Our work focuses on the theoretical aspects of reinforcement learning. There are no negative social impacts that we anticipate from our theoretical results. H Limitations Our analysis concerns value iteration. While value iteration is of theoretical interest, the analysis of value iteration is not sufficient to understand modern deep reinforcement learning practices.