# Deflated Dynamics Value Iteration

Published in Transactions on Machine Learning Research (05/2025)

Jongmin Lee (dlwhd2000@snu.ac.kr), Seoul National University
Amin Rakhsha (aminr@cs.toronto.edu), Department of Computer Science, University of Toronto; Vector Institute
Ernest K. Ryu (eryu@math.ucla.edu), University of California, Los Angeles
Amir-massoud Farahmand (amir-massoud.farahmand@polymtl.ca), Polytechnique Montréal; Mila - Quebec AI Institute; University of Toronto

Reviewed on OpenReview: https://openreview.net/forum?id=IbQTE24aZw

The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of many reinforcement learning (RL) algorithms as well. As the error convergence rate of VI as a function of iteration $k$ is $O(\gamma^k)$, it is slow when the discount factor $\gamma$ is close to 1. To accelerate the computation of the value function, we propose Deflated Dynamics Value Iteration (DDVI). DDVI uses matrix splitting and matrix deflation techniques to effectively remove (deflate) the top $s$ dominant eigen-structure of the transition matrix $P^\pi$. We prove that this leads to a $O(\gamma^k|\lambda_{s+1}|^k)$ convergence rate, where $\lambda_{s+1}$ is the $(s+1)$-th largest eigenvalue of the dynamics matrix. We also extend DDVI to the RL setting and present the Deflated Dynamics Temporal Difference (DDTD) algorithm. We empirically show the effectiveness of the proposed algorithms.

1 Introduction

Computing the value function $V^\pi$ for a policy $\pi$ or the optimal value function $V^*$ is an integral step of many planning and reinforcement learning (RL) algorithms. Value Iteration (VI) is a fundamental dynamic programming algorithm for computing the value functions, and its approximate and sample-based variants, such as Temporal Difference Learning (Sutton, 1988), Fitted Value Iteration (Ernst et al., 2005; Munos & Szepesvári, 2008), and Deep Q-Networks (Mnih et al., 2015), are the workhorses of modern RL algorithms (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018; Szepesvári, 2010; Meyn, 2022). The VI algorithm, however, can be slow for problems with a long effective planning horizon, when the agent has to look far into the future in order to make good decisions. Within the discounted Markov Decision Process formalism, the discount factor $\gamma$ determines the effective planning horizon, with $\gamma$ closer to 1 corresponding to a longer planning horizon. The error convergence rate of conventional VI as a function of the iteration $k$ is $O(\gamma^k)$, which is slow when $\gamma$ is close to 1.

Recently, there has been a growing body of research that explores the application of acceleration techniques from other areas of applied mathematics to planning and RL: Geist & Scherrer (2018); Sun et al. (2021); Ermis & Yang (2020); Park et al. (2022); Ermis et al. (2021); Shi et al. (2019) apply Anderson acceleration of fixed-point iterations, Lee & Ryu (2023) apply Anchor acceleration of minimax optimization, Vieillard et al. (2020); Goyal & Grand-Clément (2022); Grand-Clément (2021); Bowen et al. (2021); Akian et al. (2022) apply Nesterov acceleration of convex optimization, and Farahmand & Ghavamzadeh (2021); Bedaywi et al. (2024) borrow ideas from control theory/engineering, specifically PID controllers, to accelerate planning and RL algorithms.
We introduce a novel approach to accelerate VI based on modifying the eigenvalues of the transition dynamics, which are closely related to the convergence rate of VI. To see this connection, consider the policy evaluation problem, where the goal is to find the value function $V^\pi$ of a given policy $\pi$. VI starts from an arbitrary $V^0$ and iteratively sets $V^{k+1} = r^\pi + \gamma P^\pi V^k$, where $r^\pi$ and $P^\pi$ are the reward vector and the transition matrix of policy $\pi$, respectively. If $V^0 = 0$, the error vector/function at iteration $k$ is $V^\pi - V^k = \sum_{i=k}^{\infty}(\gamma P^\pi)^i r^\pi$. Let us take a closer look at this difference. For simplicity, assume that $P^\pi$ is a diagonalizable matrix, so we can write $P^\pi = U D U^{-1}$ with $U$ consisting of the (right) eigenvectors of $P^\pi$ and $D$ being the diagonal matrix $\mathrm{diag}(\lambda_1,\dots,\lambda_n)$ with $1 = |\lambda_1| \ge \dots \ge |\lambda_n|$. Since $(P^\pi)^i = U D^i U^{-1}$, after some manipulations, we see that

$$V^\pi - V^k = U\,\mathrm{diag}\!\left(\frac{(\gamma\lambda_1)^k}{1-\gamma\lambda_1},\ \frac{(\gamma\lambda_2)^k}{1-\gamma\lambda_2},\ \dots,\ \frac{(\gamma\lambda_n)^k}{1-\gamma\lambda_n}\right)U^{-1} r^\pi.$$

The diagonal terms are of the form $(\gamma\lambda_i)^k$, so they all converge to zero. The dominant term is $(\gamma\lambda_1)^k$, which corresponds to the largest eigenvalue. As the largest eigenvalue $\lambda_1$ of the stochastic matrix $P^\pi$ is 1, this leads to the dominant behaviour of $O(\gamma^k)$. This is the same rate that we also get from the cruder norm-based contraction mapping analysis. The second dominant term behaves as $O((\gamma|\lambda_2|)^k)$, and so on. If we could somehow remove from $P^\pi$ the subspace corresponding to the top $s$ eigen-structure with eigenvalues $\lambda_1,\dots,\lambda_s$, the dominant behaviour of the new procedure would be $O((\gamma|\lambda_{s+1}|)^k)$, which can be much faster than the $O(\gamma^k)$ of conventional VI. Although this may seem too good to be feasible, it is exactly what the proposed Deflated Dynamics Value Iteration (DDVI) algorithm achieves. DDVI works even when $P^\pi$ is not diagonalizable.

DDVI is based on two main ideas. The first idea is the deflation technique, well-studied in linear algebra (see Section 4.2 of Saad 2011), which allows us to remove large eigenvalues of $P^\pi$. This is done by subtracting a matrix $E$ from $P^\pi$. This gives us a deflated dynamics $P^\pi - E$ that does not have eigenvalues $\lambda_1,\dots,\lambda_s$. The second idea is based on matrix splitting (see Section 11.2 of Golub & Van Loan 2013), which allows us to use the modified dynamics $P^\pi - E$ to define a VI-like iterative procedure that still converges to the same solution $V^\pi$ to which conventional VI converges.

Deflation has been applied to various algorithms, such as the conjugate gradient algorithm (Saad et al., 2000), principal component analysis (Mackey, 2008), the generalized minimal residual method (Morgan, 1995), and nonlinear fixed-point iteration (Shroff & Keller, 1993), to improve convergence. While some prior work in accelerated planning can be considered as special cases of deflation (Bertsekas, 1995; White, 1963), to the best of our knowledge, this is the first application of the deflation technique that eliminates multiple eigenvalues in the context of RL. On the other hand, multiple planning algorithms have been introduced based on the general idea of matrix splitting (Hastings, 1968; Kushner & Kleinman, 1968; 1971; Reetz, 1973; Porteus, 1975; Rakhsha et al., 2022), though with a different splitting than this work.
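To get a concrete sense of the potential speedup (with illustrative numbers of our own choosing, not taken from the paper's experiments): if $\gamma = 0.99$ and $|\lambda_2| = 0.5$, then after $k = 100$ iterations the conventional VI factor is $\gamma^{100} \approx 0.37$, whereas deflating only the top eigenvalue already yields $(\gamma|\lambda_2|)^{100} \approx 3\times 10^{-31}$.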
After a review of relevant background in Section 2, we present DDVI and prove its convergence rate for the Policy Evaluation (PE) problem in Section 3. Next, in Section 4, we discuss the practical computation of the deflation matrix. In Section 5, we explain how DDVI can be extended to its sample-based variant and introduce the Deflated Dynamics Temporal Difference (DDTD) algorithm. Finally, in Section 6, we empirically evaluate the proposed methods and show their practical feasibility.

2 Background

We first briefly review basic definitions and concepts of Markov Decision Processes (MDP) and Reinforcement Learning (RL) (Bertsekas & Tsitsiklis, 1996; Sutton & Barto, 2018; Szepesvári, 2010; Meyn, 2022). We then describe the Power Iteration and the QR Iteration, which can be used to compute the eigenvalues of a matrix.

For a measurable set $\Omega$, we denote by $\mathcal{M}(\Omega)$ the space of probability distributions over $\Omega$. We use $\mathcal{F}(\Omega)$ to denote the space of bounded measurable real-valued functions over $\Omega$. For a matrix $A$, we use $\mathrm{spec}(A)$ to denote its spectrum (the set of eigenvalues), $\rho(A)$ to denote its spectral radius (the maximum of the absolute values of its eigenvalues), and $\|\cdot\|_\infty$ to denote its $\ell_\infty$ norm.

Markov Decision Process. The discounted MDP is defined by the tuple $(\mathcal{X}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}\colon \mathcal{X}\times\mathcal{A}\to\mathcal{M}(\mathcal{X})$ is the transition probability kernel, $\mathcal{R}\colon \mathcal{X}\times\mathcal{A}\to\mathcal{M}(\mathbb{R})$ is the reward kernel, and $\gamma\in[0,1)$ is the discount factor. In this work, we assume that the MDP has a finite number of states. We use $r\colon\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ to denote the expected reward at a given state-action pair. A policy is denoted by $\pi\colon\mathcal{X}\to\mathcal{M}(\mathcal{A})$. We define the reward function of a policy $\pi$ as $r^\pi(x) = \mathbb{E}_{a\sim\pi(\cdot|x)}[r(x,a)]$. The transition kernel of following policy $\pi$ is denoted by $P^\pi\colon\mathcal{X}\to\mathcal{M}(\mathcal{X})$ and is $P^\pi(x'\mid x) = \sum_{a\in\mathcal{A}}\pi(a\mid x)\mathcal{P}(x'\mid x,a)$ (or an integral over actions, if the action space is continuous). The value function of a policy $\pi$ is $V^\pi(x) = \mathbb{E}^\pi[\sum_{t=0}^\infty\gamma^t r(x_t,a_t)\mid x_0 = x]$, where $\mathbb{E}^\pi$ denotes the expected value over all trajectories $(x_0, a_0, x_1, a_1, \dots)$ induced by $\mathcal{P}$ and $\pi$. We say that $V^*$ is the optimal value function if $V^* = \sup_\pi V^\pi$, and that $\pi^*$ is an optimal policy if $\pi^*\in\operatorname*{Arg\,max}_\pi V^\pi$. The Bellman operator $T^\pi$ for policy $\pi$ is defined as the mapping that takes $V\in\mathcal{F}(\mathcal{X})$ and returns a new function whose value at state $x$ is $(T^\pi V)(x) = \mathbb{E}_{a\sim\pi(\cdot|x),\,x'\sim\mathcal{P}(\cdot|x,a)}[r(x,a)+\gamma V(x')]$. The Bellman optimality operator $T^*$ is defined as $(T^*V)(x) = \sup_{a\in\mathcal{A}}\{r(x,a)+\gamma\mathbb{E}_{x'\sim\mathcal{P}(\cdot|x,a)}[V(x')]\}$. The value functions $V^\pi$ and $V^*$ are the fixed points of the Bellman operators, that is, $V^\pi = T^\pi V^\pi$ and $V^* = T^*V^*$.

Value Iteration. The Value Iteration algorithm is one of the main methods in dynamic programming and planning for computing the value function $V^\pi$ of a policy (the Policy Evaluation (PE) problem) or the optimal value function $V^*$ (the Control problem). It is iteratively defined as

$$V^{k+1} = \begin{cases} T^\pi V^k & \text{(Policy Evaluation)} \\ T^* V^k & \text{(Control)}, \end{cases}$$

where $V^0$ is the initial function. For discounted MDPs with $\gamma < 1$, the Bellman operators are contractions, so by the Banach fixed-point theorem (Banach, 1922; Hunter & Nachtergaele, 2001; Hillen, 2023), VI converges to the unique fixed points, $V^\pi$ or $V^*$, with the convergence rate of $O(\gamma^k)$.

Let us now recall some concepts and methods from (numerical) linear algebra; see Golub & Van Loan (2013) for more detail. For any $A\in\mathbb{R}^{n\times n}$ (not necessarily symmetric nor diagonalizable) and for any matrix norm $\|\cdot\|$, Gelfand's formula states that the spectral radius $\rho(A)$ satisfies $\rho(A) = \lim_{k\to\infty}\|A^k\|^{1/k}$. Hence, $\|A^k\| = O((\rho(A)+\epsilon)^k)$ for any $\epsilon > 0$, which we denote by $\|A^k\| = O(\rho(A)^k)$. Furthermore, if $A$ is diagonalizable, then $\|A^k\| = O(\rho(A)^k)$ holds in the literal sense, without the $\epsilon$.

Power Iteration. Powers of $A$ can be used to compute the eigenvalues of $A$.
The Power Iteration starts with an initial vector $b^0\in\mathbb{R}^n$ and for $k = 0, 1, 2, \dots$ computes

$$b^{k+1} = \frac{A b^k}{\|A b^k\|}.$$

If $\lambda_1$ is the eigenvalue such that $|\lambda_1| = \rho(A)$ and the other eigenvalues $\lambda_2,\dots,\lambda_n$ have strictly smaller magnitude, then $(b^k)^* A b^k \to \lambda_1$ for almost all starting points $b^0$ (Section 7.3.1 of Golub & Van Loan 2013).

QR iteration. The QR Iteration (or Orthogonal Iteration) algorithm (Golub & Van Loan, 2013, Sections 7.3 and 8.2) is a generalization of the Power Iteration for finding multiple eigenvalues. For any $A\in\mathbb{C}^{n\times n}$, let $U^0\in\mathbb{C}^{n\times s}$ have orthonormal columns and perform

$$Z^{k+1} = A U^k, \qquad U^{k+1} R^{k+1} = Z^{k+1} \quad \text{(QR factorization)}$$

for $k = 0, 1, \dots$. If $|\lambda_s| > |\lambda_{s+1}|$, the columns of $U^k$ converge to an orthonormal basis of the dominant $s$-dimensional invariant subspace associated with the eigenvalues $\lambda_1,\dots,\lambda_s$, and the diagonal entries of $(U^k)^* A U^k$ converge to $\lambda_1,\dots,\lambda_s$, for almost all starting $U^0$ (Golub & Van Loan, 2013, Theorem 7.3.1). Each QR iteration needs $O(n^2 s)$ flops for its matrix multiplication, which is the dominant part of the computation, and $O(ns^2)$ flops for the QR factorization.
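As a concrete reference for the procedure above, here is a minimal NumPy sketch of the QR (orthogonal) iteration. The function name and iteration count are our own choices, and for simplicity it assumes the top-$s$ eigenvalues are real (as is the case for $\lambda_1 = 1$ of a stochastic matrix); complex conjugate pairs would require the complex or real-Schur variants.

```python
import numpy as np

def qr_iteration(A, s, num_iters=100, seed=0):
    """Orthogonal (QR) iteration: approximates the top-s eigenvalues of A and
    an orthonormal basis of the dominant s-dimensional invariant subspace."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((A.shape[0], s)))  # U^0 with orthonormal columns
    for _ in range(num_iters):
        Z = A @ U                   # Z^{k+1} = A U^k
        U, _ = np.linalg.qr(Z)      # U^{k+1} R^{k+1} = Z^{k+1}
    lam = np.diag(U.T @ A @ U)      # diagonal of (U^k)^* A U^k -> lambda_1, ..., lambda_s
    return lam, U
```

Each loop iteration costs $O(n^2 s)$ flops for `A @ U` and $O(ns^2)$ flops for the QR factorization, matching the counts above.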
3 Deflated Dynamics Value Iteration

We present Deflated Dynamics Value Iteration (DDVI), after introducing its two key ingredients, matrix deflation and matrix splitting.

3.1 Matrix Deflation

Recall from the introduction that we would like to remove from $P^\pi$ the subspace corresponding to the top $s$ eigen-structure with eigenvalues $\lambda_1,\dots,\lambda_s$. We use the so-called deflation technique to displace, and in fact remove, some of the eigenvalues of the matrix $P^\pi$ without changing the rest. This is done by subtracting a matrix $E$ of a certain form from $P^\pi$ (see Section 4.2 of Saad 2011 for a discussion of the deflation technique). Let $P^\pi$ be an $n\times n$ matrix with eigenvalues $\lambda_1,\dots,\lambda_n$, sorted in decreasing order of magnitude with ties broken arbitrarily. Let $u^*$ denote the conjugate transpose of $u\in\mathbb{C}^n$. We describe three ways of deflating this matrix, and some properties of each approach, as the following Facts.

Fact 1 (Hotelling's deflation, Meirovitch 1980, Section 5.6). Assume $s \le n$ linearly independent eigenvectors corresponding to $\{\lambda_i\}_{i=1}^s$ exist. Write $\{u_i\}_{i=1}^s$ and $\{v_i\}_{i=1}^s$ to denote the top $s$ right and left eigenvectors, scaled to satisfy $u_i^* v_i = 1$ and $u_i^* v_j = 0$ for all $1\le i\ne j\le s$. If $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$, then $\rho(P^\pi - E_s) = |\lambda_{s+1}|$.

Hotelling's deflation makes $\rho(P^\pi - E_s)$ small by eliminating the top $s$ eigenvalues of $P^\pi$, but requires both right and left eigenvectors of $P^\pi$. Wielandt's deflation, in contrast, requires only right eigenvectors.

Fact 2 (Wielandt's deflation, Soto & Rojo 2006, Theorem 5). Assume $s \le n$ linearly independent eigenvectors corresponding to $\{\lambda_i\}_{i=1}^s$ exist. Write $\{u_i\}_{i=1}^s$ to denote the top $s$ linearly independent right eigenvectors. Assume that the vectors $\{v_i\}_{i=1}^s$, which are not necessarily the left eigenvectors, satisfy $u_i^* v_i = 1$ and $u_i^* v_j = 0$ for all $1\le i\ne j\le s$. If $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$, then $\rho(P^\pi - E_s) = |\lambda_{s+1}|$.

Wielandt's deflation, however, still requires right eigenvectors, which are sometimes numerically unstable to compute. The Schur deflation, in contrast, only requires Schur vectors, which are stable to compute (Golub & Van Loan, 2013, Section 7.3). For any $P^\pi\in\mathbb{R}^{n\times n}$, the Schur decomposition has the form $P^\pi = URU^*$, where $R$ is an upper triangular matrix with $R_{ii} = \lambda_i$ for $i = 1,\dots,n$, and $U$ is a unitary matrix. We write $u_i$ to denote the $i$-th column of $U$ and call it the $i$-th Schur vector, for $i = 1,\dots,n$. Conveniently, the QR iteration of Section 2 computes the top $s$ eigenvalues and Schur vectors.

Fact 3 (Schur deflation, Saad 2011, Proposition 4.2). Let $s \le n$. Write $\{u_i\}_{i=1}^s$ to denote the top $s$ Schur vectors. If $E_s = \sum_{i=1}^s\lambda_i u_i u_i^*$, then $\rho(P^\pi - E_s) = |\lambda_{s+1}|$.

If an $E_s$ satisfies the conditions of Facts 1, 2, or 3, we say it is a rank-$s$ deflation matrix. Later, we will need that for $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$, the identity

$$(I - \alpha\gamma E_s)^{-1} = I + \sum_{i=1}^{s}\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}u_i v_i^* \qquad (1)$$

holds. Indeed, the conditions of Facts 1, 2, or 3 do imply equation 1.
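The following NumPy sketch (our own construction, assuming a diagonalizable $P^\pi$ with distinct eigenvalue magnitudes) builds a Hotelling-type $E_s$ as in Fact 1 and checks numerically that $\rho(P^\pi - E_s) = |\lambda_{s+1}|$. The left eigenvectors are taken from the rows of $U^{-1}$, which automatically satisfy the biorthogonality scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 30, 3
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)                # a random row-stochastic P^pi

lam, U = np.linalg.eig(P)                        # right eigenvectors as columns of U
order = np.argsort(-np.abs(lam))                 # sort by decreasing magnitude
lam, U = lam[order], U[:, order]
V = np.linalg.inv(U).conj().T                    # columns are left eigenvectors; v_i^* u_j = delta_ij

E_s = (U[:, :s] * lam[:s]) @ V[:, :s].conj().T   # E_s = sum_{i<=s} lambda_i u_i v_i^*
rho = np.max(np.abs(np.linalg.eigvals(P - E_s)))
print(np.isclose(rho, np.abs(lam[s])))           # rho(P - E_s) == |lambda_{s+1}|
```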
3.2 Matrix Splitting: Successive Over-Relaxation

We now describe the second key ingredient of DDVI: matrix splitting. Recall that the policy evaluation problem is the problem of finding the value function $V^\pi$ such that $T^\pi V^\pi = V^\pi$. This is in fact a system of linear equations:

$$(I - \gamma P^\pi)V^\pi = r^\pi. \qquad (2)$$

DDVI's approach to solving this equation is based on the Successive Over-Relaxation (SOR) algorithm, an iterative approach to solving linear systems. To introduce SOR in its general form, consider a matrix $A\in\mathbb{R}^{n\times n}$ and the linear system $Az = b$. SOR starts by splitting $A$ in the form $A = B + C + D$. Let $\alpha \ne 0$. Then, $z$ solves the linear system $Az = b$ if and only if $(D + \alpha B)z = \alpha b - (\alpha C + (\alpha-1)D)z$, which, in turn, holds if and only if $z = (D+\alpha B)^{-1}(\alpha b - (\alpha C + (\alpha-1)D)z)$, provided that $D + \alpha B$ is invertible. SOR attempts to find a solution through the fixed-point iteration

$$z^{k+1} = (D+\alpha B)^{-1}\big(\alpha b - (\alpha C + (\alpha-1)D)z^k\big).$$

Note that classical SOR uses a lower triangular $B$, upper triangular $C$, and diagonal $D$. Here, we generalize the standard derivation of SOR to any splitting $A = B + C + D$.

In the case of the policy evaluation problem of equation 2, we have $A = I - \gamma P^\pi$ and $b = r^\pi$. For any $E\in\mathbb{R}^{n\times n}$, we can consider the splitting $I - \gamma P^\pi = -\gamma E - \gamma(P^\pi - E) + I$, where we chose $B = -\gamma E$, $C = -\gamma(P^\pi - E)$, and $D = I$. With these choices, the SOR iteration for PE is

$$V^{k+1} = (I - \alpha\gamma E)^{-1}\big(\alpha r^\pi + ((1-\alpha)I + \alpha\gamma(P^\pi - E))V^k\big). \qquad (3)$$

Notice that $V^\pi$ is the fixed point of this iterative procedure: $V^\pi = (I-\alpha\gamma E)^{-1}(\alpha r^\pi + ((1-\alpha)I + \alpha\gamma(P^\pi - E))V^\pi)$. Whether these iterations converge to the fixed point, however, depends on the choice of $\alpha$ and $E$. As an example, for $\alpha = 1$ and $E = 0$, we recover the original VI, which is convergent with the convergence rate of $O(\gamma^k)$. Next, we propose particular choices that lead not only to convergence but to acceleration.

3.3 Deflated Dynamics Value Iteration

We are now ready to introduce DDVI. Let $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$ be a rank-$s$ deflation matrix of $P^\pi$ satisfying one of the conditions of Facts 1, 2, or 3. The SOR iteration for PE with deflation matrix $E_s$ (cf. equation 3) is

$$V^{k+1} = (I - \alpha\gamma E_s)^{-1}\big(\alpha r^\pi + ((1-\alpha)I + \alpha\gamma(P^\pi - E_s))V^k\big). \qquad (4)$$

By benefitting from equation 1, we can write it as

$$W^{k+1} = (1-\alpha)V^k + \alpha r^\pi + \alpha\gamma\Big(P^\pi - \sum_{i=1}^s\lambda_i u_i v_i^*\Big)V^k, \qquad V^{k+1} = \Big(I + \sum_{i=1}^s\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}u_i v_i^*\Big)W^{k+1}. \qquad (5)$$

We call this method Deflated Dynamics Value Iteration (DDVI). Theorem 3.1 and Corollary 3.2 describe the rate of convergence of DDVI for PE.

Theorem 3.1. Let $\pi$ be a policy and let $\lambda_1,\dots,\lambda_n$ be the eigenvalues of $P^\pi$ sorted in decreasing order of magnitude with ties broken arbitrarily. Let $s \le n$. Let $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$ be a rank-$s$ deflation matrix of $P^\pi$ satisfying the conditions of Facts 1, 2, or 3. For $0 < \alpha \le 1$, DDVI (equation 5) exhibits the rate¹ $\|V^k - V^\pi\| = O(|\lambda|^k\|V^0 - V^\pi\|)$ as $k\to\infty$, where

$$\lambda = \max_{1\le i\le s,\ s+1\le j\le n}\left\{\left|\frac{1-\alpha}{1-\alpha\gamma\lambda_i}\right|,\ |1-\alpha+\alpha\gamma\lambda_j|\right\}.$$

¹The precise meaning of the $O$ notation is $\|V^k - V^\pi\| = O((|\lambda|+\epsilon)^k\|V^0 - V^\pi\|)$ as $k\to\infty$ for any $\epsilon > 0$.

When $\alpha = 1$, we can simplify DDVI (equation 5) as follows.

Corollary 3.2. In the setting of Theorem 3.1, if $\alpha = 1$, then DDVI (equation 5) simplifies to

$$W^{k+1} = \gamma(P^\pi - E_s)V^k + r^\pi, \qquad V^{k+1} = \Big(I + \sum_{i=1}^s\frac{\gamma\lambda_i}{1-\gamma\lambda_i}u_i v_i^*\Big)W^{k+1},$$

and the rate simplifies to $\|V^k - V^\pi\| = O(|\gamma\lambda_{s+1}|^k\|V^0 - V^\pi\|)$ as $k\to\infty$.

Let us compare this convergence rate with the original VI's. The rate for VI is $O(\gamma^k)$, which is slow when $\gamma\approx 1$. By choosing $E_s$ to deflate the top $s$ eigenvalues of $P^\pi$, DDVI has the rate of $O(|\gamma\lambda_{s+1}|^k)$, which is exponentially faster whenever $|\lambda_{s+1}|$ is smaller than 1. The exact behaviour depends on the spectrum of the Markov chain (whether the eigenvalues are close to 1 or far from it) and the number of eigenvalues we decided to deflate by $E_s$. In Section 4, we discuss a practical implementation of DDVI based on the power iteration and the QR iteration and use it in the experiments of Section 6. We find that smaller values of $\alpha$ lead to more stable iterations when using approximate deflation matrices. We also show how we can use DDVI in the RL setting in Section 5.

This version of DDVI (equation 5) applies to the policy evaluation (PE) setup. The challenge in extending DDVI to the Control setup is that $P^\pi$ changes throughout the iterations, and so should the deflation matrix $E_s$. However, when $s = 1$, the deflation matrix $E_1$ can be kept constant, and we utilize this fact in Section 4.1.
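To make equation 5 concrete, here is a minimal NumPy sketch of DDVI for PE (our own code, not the authors' implementation). The inputs `lam`, `U`, `V` are assumed to hold the top-$s$ eigenvalues and vectors of a deflation matrix satisfying Fact 1 or 2, e.g., computed as in the snippet of Section 3.1.

```python
import numpy as np

def ddvi_pe(P, r, gamma, lam, U, V, alpha=1.0, K=500):
    """DDVI for policy evaluation, equation 5."""
    E = (U * lam) @ V.conj().T                        # E_s = sum_i lam_i u_i v_i^*
    # equation 1: (I - a g E_s)^{-1} = I + sum_i a g lam_i / (1 - a g lam_i) u_i v_i^*
    corr = (U * (alpha * gamma * lam / (1 - alpha * gamma * lam))) @ V.conj().T
    Vk = np.zeros(P.shape[0], dtype=complex)
    for _ in range(K):
        W = (1 - alpha) * Vk + alpha * r + alpha * gamma * (P - E) @ Vk
        Vk = W + corr @ W                             # V^{k+1} = (I - a g E_s)^{-1} W^{k+1}
    return Vk.real
```

With empty `lam`, `U`, `V` (so $E = 0$) and $\alpha = 1$, this reduces to plain VI, which is a handy correctness check.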
4 Computing Deflation Matrix E

The application of DDVI requires practical means of computing the deflation matrix $E_s$. In this section, we provide three approaches as examples.

4.1 Rank-1 DDVI for PE and Control

Recall that for any stochastic $n\times n$ matrix, the vector $\mathbf{1} = [1,\dots,1]^\top\in\mathbb{R}^n$ is a right eigenvector corresponding to eigenvalue 1. This allows us to easily obtain a rank-1 deflation matrix $E_1$ for $P^\pi$ for any policy $\pi$. Let $E_1 = \mathbf{1}v^\top$, with $v\in\mathbb{R}^n$ a vector with non-negative entries satisfying $v^\top\mathbf{1} = 1$. This rank-1 Wielandt deflation matrix can be used for DDVI (PE) as in Section 3, but we can also use it for the Control version of DDVI. We define rank-1 DDVI for Control as

$$W^{k+1} = \max_\pi\{r^\pi + \gamma(P^\pi - E_1)W^k\} = \max_\pi\{r^\pi + \gamma P^\pi W^k\} - \gamma(v^\top W^k)\mathbf{1}. \qquad (6)$$

Here, we benefitted from the fact that $E_1$ is not a function of $\pi$ in order to take it out of the $\max_\pi$. The value function $V^k$ can be computed as

$$V^k = \Big(I + \frac{\gamma}{1-\gamma}\mathbf{1}v^\top\Big)W^k. \qquad (7)$$

Note that the iteration over $W^k$ does not depend on $V^k$, so we do not need to compute $V^k$ at each iteration, only when we need it. Compared to equation 5 of DDVI (PE), we set $\alpha = 1$ for simplicity. We have the following guarantee for rank-1 DDVI for Control.

Theorem 4.1. The DDVI for Control algorithm (equation 6 and equation 7) exhibits the rate

$$\|V^k - V^*\|_\infty \le \frac{2}{1-\gamma}\gamma^k\|V^0 - V^*\|_\infty, \qquad k = 0, 1, \dots.$$

Furthermore, if there exists a unique optimal policy $\pi^*$, then $\|V^k - V^*\| = O(|\gamma\lambda_2|^k\|V^0 - V^*\|)$ as $k\to\infty$, where $\lambda_2$ is the second eigenvalue of $P^{\pi^*}$ (the $O$ notation is as defined in Theorem 3.1).

Discussion. Although rank-1 DDVI for Control does accelerate the convergence of $V^k$ to $V^*$, a subtle point to note is that the greedy policy $\pi_k$ obtained from $V^k$ is not affected by the rank-1 deflation. Briefly speaking, this is because adding a constant multiple of $\mathbf{1}$ to $V^k$ has no effect on the $\arg\max$ computation used in the greedy policy. Indeed, the maximizer $\pi_{k+1}$ in equation 6 is the same as $\arg\max_\pi\{r^\pi + \gamma P^\pi V^k\}$ produced by (non-deflated) value iteration, when $W^0 = V^0$. Having the term $\gamma(v^\top W^k)\mathbf{1}$ in the update $W^{k+1} = r^{\pi_{k+1}} + \gamma(P^{\pi_{k+1}} - \mathbf{1}v^\top)W^k$ adds the same constant to all states, so it does not change the maximizer defining the next policy.
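A minimal sketch of rank-1 DDVI for Control (equations 6 and 7); the array shapes and the uniform choice of $v$ are our own assumptions, not prescribed by the paper.

```python
import numpy as np

def rank1_ddvi_control(P, r, gamma, K=1000):
    """Rank-1 DDVI for Control, eqs. (6)-(7), with E_1 = 1 v^T and v uniform.
    P: (A, n, n) per-action transition matrices; r: (n, A) expected rewards."""
    n = P.shape[1]
    v = np.full(n, 1.0 / n)                            # nonnegative, v^T 1 = 1
    W = np.zeros(n)
    for _ in range(K):
        Q = r + gamma * np.einsum('axy,y->xa', P, W)   # r(x,a) + g * sum_y P(y|x,a) W(y)
        W = Q.max(axis=1) - gamma * (v @ W)            # eq. (6): subtract g (v^T W) 1
    return W + (gamma / (1 - gamma)) * (v @ W)         # eq. (7): V = (I + g/(1-g) 1 v^T) W
```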
4.2 DDVI with Automatic Power Iteration

According to Fact 2, we can construct the deflation matrix $E_s$ for $P^\pi$ if we have its top $s$ right eigenvectors. The top eigenvector $u_1 = \mathbf{1}$ is known, which, as we saw in Section 4.1, gives $E_1$ and rank-1 DDVI. We show that we can start with $s = 1$ and run rank-1 DDVI and, using the calculations already done in the DDVI updates, estimate the next top eigenvectors and increase the deflation rank.

Assume that the top $s+1$ eigenvalues of $P^\pi$ have distinct magnitudes, i.e., $1 = \lambda_1 > |\lambda_2| > \dots > |\lambda_{s+1}|$. Let $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$ be a rank-$s$ deflation matrix of $P^\pi$, as in Fact 2. Consider the DDVI algorithm in equation 5 with $\alpha = 1$, as in Corollary 3.2. If $V^0 = 0$, since $(P^\pi - E_s)(I-\gamma E_s)^{-1} = P^\pi - E_s$ (verified as equation 10 in Appendix B), we have

$$W^{k+1} = \sum_{i=0}^{k}\gamma^i(P^\pi - E_s)^i r^\pi.$$

Therefore, $\frac{1}{\gamma^k}(W^{k+1} - W^k) = (P^\pi - E_s)^k r^\pi$ is the $k$-th iterate of a power iteration for the matrix $(P^\pi - E_s)$ starting from the initial vector $r^\pi$. As discussed in Fact 2, the top eigenvalue of $(P^\pi - E_s)$ is $\lambda_{s+1}$. For large $k$, we expect

$$\frac{W^{k+1}-W^k}{\|W^{k+1}-W^k\|} \approx w,$$

where $w$ is the top eigenvector of $(P^\pi - E_s)$. With $w$, we can recover the $(s+1)$-th right eigenvector of $P^\pi$ through the formula (Bru et al., 2012, Proposition 5)

$$u_{s+1} = w - \sum_{i=1}^s\frac{\lambda_i v_i^* w}{\lambda_i - \lambda_{s+1}}u_i.$$

Leveraging this observation, DDVI with Automatic Power Iteration (Auto PI) computes an approximate rank-$s$ deflation matrix $E_s$ while performing DDVI: Start with a rank-1 deflation matrix, using the first right eigenvector $u_1 = \mathbf{1}$, and carry out DDVI iterations. If a certain error criterion is satisfied, use $W^{k+1} - W^k$ to approximate the second right eigenvector $u_2$. Then, update the deflation matrix to be rank 2, and gradually increase the deflation rank. We formalize this approach in Algorithm 1.

Algorithm 1 DDVI with Auto PI
1: Initialize $C$, $\epsilon$ (for example, $C = 10$, $\epsilon = 10^{-4}$)
2: function DDVI($s$, $V$, $K$, $\{\lambda_i\}_{i=1}^s$, $\{u_i\}_{i=1}^s$)
3:   $V^0 = V$, $c = 0$, $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$ as in Fact 2
4:   for $k = 0, \dots, K-1$ do
5:     if $c \ge C$ and $w^{k+1}$ satisfies the error criterion with tolerance $\epsilon$ then
6:       $\lambda_{s+1} = (w^k)^* w^{k+1}/(\gamma\|w^k\|_2^2)$, $\quad u_{s+1} = w^{k+1} - \sum_{i=1}^s\frac{\lambda_i v_i^* w^{k+1}}{\lambda_i - \lambda_{s+1}}u_i$
7:       return DDVI($s+1$, $V^k$, $K-c$, $\{\lambda_i\}_{i=1}^{s+1}$, $\{u_i\}_{i=1}^{s+1}$)
8:     else
9:       $W^{k+1} = \gamma(P^\pi - E_s)V^k + r^\pi$, $\quad V^{k+1} = (I-\gamma E_s)^{-1}W^{k+1}$
10:      $w^{k+1} = W^{k+1} - W^k$, $\quad w^k = W^k - W^{k-1}$
11:      $c = c + 1$
12:   return $V^K$
13: Initialize $V^0$ and $u_1 = \mathbf{1}$, $\lambda_1 = 1$
14: DDVI($1$, $V^0$, $K$, $\lambda_1$, $u_1$)

4.3 Rank-s DDVI with the QR Iteration

Recall that the QR iteration approximates the top $s$ Schur vectors. Algorithm 2 uses the QR Iteration to construct the rank-$s$ Schur deflation matrix and performs DDVI. Compared to standard VI, rank-$s$ DDVI requires additional computation for the QR iteration.

Algorithm 2 Rank-s DDVI with QR Iteration
1: Initialize $\alpha$, $s$, and $V^0$
2: $\{\lambda_i\}_{i=1}^s, \{u_i\}_{i=1}^s = \mathrm{QRIteration}(P^\pi, s)$
3: for $k = 0, \dots, K-1$ do
4:   $W^{k+1} = \alpha\gamma\big(P^\pi - \sum_{i=1}^s\lambda_i u_i u_i^*\big)V^k + (1-\alpha)V^k + \alpha r^\pi$
5:   $V^{k+1} = \big(I + \sum_{i=1}^s\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}u_i u_i^*\big)W^{k+1}$

The QR iteration of Algorithm 2 can be carried out in an automated manner similar to Auto PI. We refer to this as Auto QR and formally describe the algorithm in Appendix C.

5 Deflated Dynamics Temporal Difference Learning

Many practical RL algorithms, such as Temporal Difference (TD) Learning (Sutton, 1988; Tsitsiklis & Van Roy, 1997), Q-Learning (Watkins, 1989), Fitted Value Iteration (Gordon, 1995), and DQN (Mnih et al., 2015), can be viewed as sample-based variants of VI. Consequently, the slow convergence in the case of $\gamma\approx 1$ has also been observed for these algorithms, by Szepesvári (1997); Even-Dar & Mansour (2003); Wainwright (2019) for TD Learning and by Munos & Szepesvári (2008); Farahmand et al. (2010); Chen & Jiang (2019); Fan et al. (2020) for Fitted VI and DQN. Here we introduce Deflated Dynamics Temporal Difference (DDTD) Learning as a sample-based variant of DDVI.

To start, recall that tabular temporal difference (TD) learning performs the updates

$$V^{k+1}(X_k) = V^k(X_k) + \eta_k(X_k)\big[r^\pi(X_k) + \gamma V^k(X_k') - V^k(X_k)\big],$$

where $\eta_k(X_k)$ is a state-dependent stepsize, $(X_k, X_k')$ are random samples from the environment such that $X_k'$ is the subsequent state following $X_k$, and $r^\pi(X_k)$ is the expected immediate reward obtained by following policy $\pi$ from state $X_k$. In practice, we may choose a state-independent $\eta_k(X_k) = \eta_k$, which would still result in convergence to the true value function if each state is visited often enough.

TD is a sample-based variant of VI for PE. Two key ingredients of TD learning are the random coordinate updates and the temporal difference error $r^\pi(X) + \gamma V(X') - V(X)$, whose conditional expectation is equal to $(T^\pi V - V)(X)$. Applying the same ingredients to DDVI, we obtain DDTD. The improved convergence rate of DDVI compared to VI suggests that DDTD may also exhibit improved convergence rates. Specifically, we define DDTD as

$$W^{k+1}(X_k) = W^k(X_k) + \eta_k(X_k)\Big[(1-\alpha)V^k(X_k) + \alpha\big(r^\pi(X_k) + \gamma V^k(X_k') - \gamma(E_s V^k)(X_k)\big) - W^k(X_k)\Big],$$
$$V^{k+1} = \Big(I + \sum_{i=1}^s\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}u_i v_i^*\Big)W^{k+1}, \qquad (8)$$

where $E_s = \sum_{i=1}^s\lambda_i u_i v_i^*$ is a rank-$s$ deflation matrix of $P^\pi$ and $\{X_k, X_k'\}_{k=0,1,\dots}$ are i.i.d. random variables such that $X_k'\sim P^\pi(\cdot\mid X_k)$ and $X_k\sim\mathrm{Unif}(\mathcal{X})$. The $W^{k+1}$-update notation means $W^{k+1}(x) = W^k(x)$ for all $x\ne X_k$. DDTD is the sample-based variant of DDVI performing asynchronous updates. The following result provides almost sure convergence of DDTD.

Theorem 5.1. Let $\eta_k(x) = \big(\sum_{i=0}^k \mathbf{1}_{\{X_i = x\}}\big)^{-1}$ when $X_k = x$. For $\alpha = 1$, DDTD (equation 8) converges to $V^\pi$ almost surely.

The following result describes the asymptotic convergence rate of DDTD.

Theorem 5.2. Let $\lambda_1,\dots,\lambda_n$ be the eigenvalues of $P^\pi$ sorted in decreasing order of magnitude with ties broken arbitrarily. Let $\eta_k(x) = C/(k+1)$ with the constant $C > \frac{n}{2\lambda_{\mathrm{DDTD}}}$, where $\lambda_{\mathrm{DDTD}} = \min_{\lambda\in\{\lambda_{s+1},\dots,\lambda_n\}}\mathrm{Re}(1-\gamma\lambda)$. For $\alpha = 1$, DDTD (equation 8) exhibits the rate $\mathbb{E}[\|V^k - V^\pi\|^2] = O(k^{-1})$.

We note that in Theorem 5.2, as the rank of DDTD increases, $\lambda_{\mathrm{DDTD}}$ also increases, and this implies that higher-rank DDTD has a larger range of convergent step sizes compared to plain TD learning.

5.1 Implementation of DDTD

To implement DDTD practically, we take a hybrid of model-free and model-based approaches for obtaining $E_s$. At each iteration, the agent uses the new samples to update an approximate model $\hat P^\pi$ of the true dynamics. This updated approximate model is used to compute $E_s$. We update $E_s$ only every $K$ iterations, since the change in $\hat P^\pi$, and consequently in the resulting $E_s$, at each iteration is small. Whenever a new $E_s$ is computed, we set $W$ to be $(I - \alpha\gamma E_s)V$. This step ensures that $V$ smoothly converges to $V^\pi$ by updating all coordinates of $W$ at once based on the new $E_s$. Also, we use the random sampled reward $R_k$ in place of the expected reward $r^\pi(X_k)$. We formally state the algorithm as Algorithm 3.

Algorithm 3 Rank-s DDTD with QR iteration
1: Initialize $\alpha$, $s$, $W$, $V$, $E_s$, $\hat P^\pi$, and $K$
2: for $k = 1, 2, \dots$ do
3:   Choose $X_k$ uniformly at random.
4:   Sample $(\pi(X_k), R_k, X_k')$ from the environment
5:   Update $\hat P^\pi$ with $(X_k, \pi(X_k), R_k, X_k')$
6:   if $k \bmod K = 0$ then
7:     $\{\lambda_i\}_{i=1}^s, \{u_i\}_{i=1}^s = \mathrm{QRIteration}(\hat P^\pi, s)$
8:     $E_s \leftarrow \sum_{i=1}^s\lambda_i u_i u_i^*$
9:     $W \leftarrow (I - \alpha\gamma E_s)V$
10:  $W(X_k) \leftarrow W(X_k) + \eta_k(X_k)\big[\alpha R_k + \alpha\gamma V(X_k') - \alpha\gamma(E_s V)(X_k) + (1-\alpha)V(X_k) - W(X_k)\big]$
11:  $V \leftarrow \big(I + \sum_{i=1}^s\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}u_i u_i^*\big)W$

We note that DDTD simultaneously uses samples to learn the model and to update the value function, making the DDTD framework more effective than model-free learning and more computationally efficient than a purely model-based approach.
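As a complement to Algorithm 3, here is a minimal tabular sketch of the DDTD update (our own code, for $\alpha = 1$ and a fixed, known deflation matrix rather than one re-estimated from $\hat P^\pi$; `sample_next` is an assumed environment interface that draws $X'\sim P^\pi(\cdot\mid x)$).

```python
import numpy as np

def ddtd(sample_next, r, gamma, lam, U, Vl, n, steps=200_000, seed=0):
    """Tabular DDTD, equation 8, with alpha = 1 and fixed E_s = sum_i lam_i u_i v_i^*.
    Stepsize eta_k(x) = 1 / (number of visits to x), as in Theorem 5.1."""
    rng = np.random.default_rng(seed)
    E = (U * lam) @ Vl.conj().T
    corr = (U * (gamma * lam / (1 - gamma * lam))) @ Vl.conj().T  # (I - g E_s)^{-1} - I
    W = np.zeros(n, dtype=complex)
    visits = np.zeros(n)
    for _ in range(steps):
        x = rng.integers(n)                         # X_k ~ Unif(X)
        visits[x] += 1
        V = W + corr @ W                            # V^k = (I - g E_s)^{-1} W^k
        td = r[x] + gamma * V[sample_next(x)] - gamma * (E[x] @ V) - W[x]
        W[x] += td / visits[x]                      # asynchronous coordinate update
    return (W + corr @ W).real
```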
Figure 1: Comparison of DDVI with different ranks, Auto PI, and Auto QR against VI in (left) Maze and (right) Chain Walk. Both panels plot the normalized error $\|V^k - V^\pi\|_1$ against the iteration $k$. The plots for DDVI do not start at iteration 0 because the costs of computing $E_s$ through QR iterations are incorporated as rightward shifts. The rate of DDVI with Auto PI and Auto QR changes when $E_s$ is updated.

6 Experiments

For our experiments, we use the following environments: Maze with $5\times 5$ states and 4 actions, Cliffwalk with $3\times 7$ states and 4 actions, Chain Walk with 50 states and 2 actions, and random Garnet MDPs (Bhatnagar et al., 2009) with 200 states. The discount factor is set to $\gamma = 0.99$ for the comparison of DDVI with different ranks and for the DDTD experiments, and $\gamma = 0.995$ in the other experiments. Appendix D provides full definitions of the environments and the policies used for PE. All experiments were carried out on local CPUs. We report the normalized error of $V^k$, defined as $\|V^k - V^\pi\|_1/\|V^\pi\|_1$.²

²The source code for the experiments can be found at https://github.com/adaptive-agents-lab/ddvi.

DDVI with Auto PI, Auto QR, and different fixed ranks. In Figure 1, we compare rank-$s$ DDVI for multiple values of $s$ and DDVI with Auto PI and Auto QR against VI in Chain Walk and Maze. In our plots, we incorporate the cost of the QR iteration for a fair comparison. We take the cost of the QR iterations to be $msC$, where $m$ is the number of QR iterations we applied, $s$ is the rank of DDVI, and $C$ is the cost of VI per iteration, and we accordingly plot rank-$s$ DDVI starting from iteration $ms$ in Figure 1. In this experiment, the QR iteration (Section 2) is applied 100 times to calculate $E_s$ with Schur vectors. In almost all cases, DDVI exhibits a significantly faster convergence rate than VI. Aligned with the theory, we observe that higher ranks achieve better convergence rates. Also, as DDVI with Auto PI and Auto QR progresses and updates $E_s$, its convergence rate improves.

DDVI and other baselines. We perform an extensive comparison of DDVI against prior accelerated VI methods: Safe Accelerated VI (S-AVI) (Goyal & Grand-Clément, 2022), Anderson VI (Geist & Scherrer, 2018), PID VI (Farahmand & Ghavamzadeh, 2021), and Anchored VI (Lee & Ryu, 2023). In this experiment, we use the Implicitly Restarted Arnoldi Method (Lehoucq et al., 1998) from the SciPy package to calculate the eigenvalues and eigenvectors for DDVI. Figure 2 (top-left) shows the convergence behaviour of the algorithms by iteration count in 20 randomly generated Garnet environments, where our algorithms outperform all the baselines. This comparison might not be fair, as the amount of computation needed for each iteration of the algorithms is not the same. Therefore, in Figure 2 (top-right), we compare the algorithms by wall-clock time. It can be seen that rank-2 DDVI initially has to spend time calculating $E_s$ before it starts to update the value function, but after the slow start, it has the fastest rate.
The fast rate can compensate for the initial time if high accuracy is needed. DDVI with Auto QR and rank-1 DDVI show fast convergence from the beginning.

We further investigate how the algorithms scale with the size of the MDP and the discount factor. Figure 2 (bottom-left) shows the runtime to reach a normalized error of $10^{-4}$ as a function of the number of states.

Figure 2: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (top-left) and wall-clock time (top-right). Runtimes to reach a normalized error of $10^{-4}$ are shown against the number of states (bottom-left) and the horizon $1/(1-\gamma)$ (bottom-right). Plots are averages over 20 randomly generated Garnet MDPs, with shaded areas showing the standard error.

DDVI with Auto QR and rank-1 DDVI show the best scaling. Note that even rank-2 DDVI, for which the calculation of $E_s$ is non-trivial, also scales competitively. Figure 2 (bottom-right) shows the scaling behaviour with the horizon of the MDP, which is $1/(1-\gamma)$. We measure the runtime to reach a normalized error of $10^{-4}$ for the horizon ranging from 100 to 1000, which corresponds to $\gamma$ ranging from 0.99 to 0.999. Remarkably, the DDVI algorithms have a low, roughly constant runtime even for long-horizon tasks with $\gamma\approx 1$. This is aligned with our theoretical result, as the rate $\gamma|\lambda_{s+1}|$ remains small as $\gamma$ approaches 1.

Rank-s DDTD with QR iteration. We compare DDTD with TD Learning and with Dyna (Sutton, 1990; Peng & Williams, 1993), which simultaneously uses samples to learn the model and to update the value function. For the sake of making the setting more realistic, we consider the case where the approximate model $\hat P$ cannot exactly learn the true dynamics. Figure 3 shows that DDTD with a large enough rank can outperform TD. Note that, unlike Dyna, DDTD does not suffer from model error, which shows that DDTD only uses the model for acceleration and is not a purely model-based algorithm.

Figure 3: Comparison of DDTD with QR iteration, Dyna, and TD learning in (left) Chain Walk and (right) Maze. Both panels plot the normalized error $\|V^t - V^\pi\|_1$ against the number of environment samples $t$.

7 Conclusion

In this work, we proposed a framework for accelerating VI through matrix deflation and matrix splitting. We theoretically analyzed the proposed methods, DDVI and DDTD, and presented experimental results showing speedups in various setups. The positive experimental results demonstrate that matrix deflation is a promising technique that may be applicable to a broader range of RL algorithms. One direction of future work is to extend the theoretical analysis of DDVI for Control to a general rank-$s$ deflation matrix. Theoretically analyzing other RL methods combined with matrix deflation is another interesting direction.
Another direction is incorporating the DDVI framework with a function approximator, such as a deep neural network, to handle large-scale or continuous state and action spaces.

Acknowledgments

We thank the anonymous reviewers, the action editor, as well as the members of the Adaptive Agents (Adage) Lab, in particular Tyler Kastner, for their constructive feedback. JL and EKR were supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (No. RS-2024-00421203). AMF acknowledges the funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Discovery Grant program (2021-03701). Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

References

M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for non-symmetric affine fixed point problems and application to Markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199-232, 2022.

D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997.

M. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011.

P. Bacon. Temporal Representation Learning. PhD thesis, McGill University, 2018.

P. Bacon and D. Precup. A matrix splitting perspective on planning with options. arXiv:1612.00916, 2016.

S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133-181, 1922.

M. Bedaywi, A. Rakhsha, and A-m Farahmand. PID accelerated temporal difference algorithms. Reinforcement Learning Journal, 5:2071-2095, 2024.

D. P. Bertsekas. Generic rank-one corrections for value iteration in Markovian decision problems. Operations Research Letters, 17(3):111-119, 1995.

D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. Conference on Learning Theory, 2018.

S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471-2482, 2009.

V. S. Borkar. Asynchronous stochastic approximations. SIAM Journal on Control and Optimization, 36(3):840-851, 1998.

V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447-469, 2000.

W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021.

R. Bru, R. Canto, R. L. Soto, and A. M. Urbano. A Brauer's theorem and related results. Central European Journal of Mathematics, 10:312-321, 2012.

J. Chen and N. Jiang. Information-theoretic considerations in batch reinforcement learning. International Conference on Machine Learning, 2019.

S. Chen, A. Devraj, A. Busic, and S. Meyn. Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. International Conference on Artificial Intelligence and Statistics, 2020.

Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam. A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Operations Research, 2023.
P. Dai, D. S. Weld, and J. Goldsmith. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181-209, 2011.

G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. Association for the Advancement of Artificial Intelligence, 2018.

A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance. IEEE Transactions on Automatic Control, 67(11):5948-5963, 2021.

M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020.

M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021.

D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.

E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 2003.

J. Fan, Z. Wang, Y. Xie, and Z. Yang. A theoretical analysis of deep Q-learning. Learning for Dynamics and Control, 2020.

A-m Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021.

A-m Farahmand, R. Munos, and Cs. Szepesvári. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010.

M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018.

G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 4th edition, 2013.

G. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.

V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517-535, 2022.

J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-Newton methods for MDPs. arXiv:2104.10677, 2021.

A. Gupta, R. Jain, and P. W. Glynn. An empirical algorithm for relative value iteration for average-cost MDPs. IEEE Conference on Decision and Control, 2015.

N. A. J. Hastings. Some notes on dynamic programming and replacement. Journal of the Operational Research Society, 19:453-464, 1968.

T. Hillen. Elements of Applied Functional Analysis. AMS Open Notes, 2023.

J. K. Hunter and B. Nachtergaele. Applied Analysis. World Scientific Publishing Company, 2001.

T. Jaakkola, M. Jordan, and S. Singh. Convergence of stochastic iterative dynamic programming algorithms. Neural Information Processing Systems, 1993.

H. Kushner and A. Kleinman. Numerical methods for the solution of the degenerate nonlinear elliptic equations arising in optimal stochastic control theory. IEEE Transactions on Automatic Control, 1968.

H. Kushner and A. Kleinman. Accelerated procedures for the solution of discrete Markov control problems. IEEE Transactions on Automatic Control, 1971.

M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 2003.

C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? International Conference on Artificial Intelligence and Statistics, 2018.
J. Lee and E. K. Ryu. Accelerating value iteration with anchoring. Neural Information Processing Systems, 2023.

R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large-scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Society for Industrial and Applied Mathematics, 1998.

L. Mackey. Deflation methods for sparse PCA. Neural Information Processing Systems, 2008.

H. B. McMahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005.

L. Meirovitch. Computational Methods in Structural Dynamics. Springer Science & Business Media, 1980.

S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press, 2022.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 1993.

R. B. Morgan. A restarted GMRES method augmented with eigenvectors. SIAM Journal on Matrix Analysis and Applications, 16(4):1154-1171, 1995.

R. Munos and Cs. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.

M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. arXiv:2211.14998, 2022.

J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437-454, 1993.

E. L. Porteus. Bounds and transformations for discounted finite Markov decision chains. Operations Research, 23(4):761-784, 1975.

A. Rakhsha, A. Wang, M. Ghavamzadeh, and A-m Farahmand. Operator splitting value iteration. Neural Information Processing Systems, 2022.

D. Reetz. Solution of a Markovian decision problem by successive overrelaxation. Zeitschrift für Operations Research, 17:29-32, 1973.

Y. Saad. Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM, 2011.

Y. Saad, M. Yeung, J. Erhel, and F. Guyomarc'h. A deflated version of the conjugate gradient algorithm. SIAM Journal on Scientific Computing, 21(5):1909-1926, 2000.

H. Sharma, M. Jafarnia-Jahromi, and R. Jain. Approximate relative value learning for average-reward continuous state MDPs. Uncertainty in Artificial Intelligence, 2020.

W. Shi, S. Song, H. Wu, Y. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. Neural Information Processing Systems, 2019.

G. M. Shroff and H. B. Keller. Stabilization of unstable procedures: the recursive projection method. SIAM Journal on Numerical Analysis, 30(4):1099-1120, 1993.

A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. Naval Research Logistics, 70(5):423-442, 2023.

R. L. Soto and O. Rojo. Applications of a Brauer theorem in the nonnegative inverse eigenvalue problem. Linear Algebra and its Applications, 416(2-3):844-856, 2006.

K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. International Conference on Machine Learning, 1990.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

Cs. Szepesvári. The asymptotic convergence-rate of Q-learning. Neural Information Processing Systems, 1997.

Cs. Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.

N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.

M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for Q-learning. arXiv:1905.06265, 2019.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, 1989.

D. J. White. Dynamic programming, Markov chains, and the method of successive approximations. Journal of Mathematical Analysis and Applications, 6(3):373-376, 1963.

D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 2005.

A Prior Works

Acceleration in RL. Prioritized sweeping and its several variants (Moore & Atkeson, 1993; Peng & Williams, 1993; McMahan & Gordon, 2005; Wingate et al., 2005; Andre et al., 1997; Dai et al., 2011) specify the order of asynchronous value function updates, which may lead to accelerated convergence to the true value function. Speedy Q-learning (Azar et al., 2011) changes the update rule of Q-learning and employs aggressive learning rates to accelerate convergence. Sidford et al. (2023) accelerate the computation of $P^\pi V^k$ by sampling it with variance reduction techniques. Recently, there has been a growing body of research that explores the application of acceleration techniques from other areas of applied mathematics to planning and RL: Geist & Scherrer (2018); Sun et al. (2021); Ermis & Yang (2020); Park et al. (2022); Ermis et al. (2021); Shi et al. (2019) apply Anderson acceleration of fixed-point iterations, Lee & Ryu (2023) apply Anchor acceleration of minimax optimization, Vieillard et al. (2020); Goyal & Grand-Clément (2022); Grand-Clément (2021); Bowen et al. (2021); Akian et al. (2022) apply Nesterov acceleration of convex optimization, and Farahmand & Ghavamzadeh (2021) apply ideas inspired by PID controllers in control theory. Specifically, regarding the variants of VI used in our experiments, S-AVI (Goyal & Grand-Clément, 2022) exhibits an $O(\gamma_s^k)$ rate, where $\gamma_s$ satisfies $\gamma \le \gamma_s \le 1$, PID VI (Farahmand & Ghavamzadeh, 2021) exhibits a guaranteed $O\big(\big(\tfrac{\sqrt{1+\gamma}-\sqrt{1-\gamma}}{\sqrt{1+\gamma}+\sqrt{1-\gamma}}\big)^k\big)$ rate under reversible MDPs (and possibly faster, depending on the MDP), and Anchored VI (Lee & Ryu, 2023) exhibits an $O\big(\tfrac{(\gamma^{-1}-\gamma)(1+2\gamma-\gamma^{k+1})}{\gamma^{-(k+1)}-\gamma^{k+1}}\big)$ rate in terms of the Bellman error.

Matrix deflation. Deflation techniques were first developed for eigenvalue computation (Meirovitch, 1980; Saad, 2011; Golub & Van Loan, 2013).
Matrix deflation, which eliminates the top eigenvalues of a given matrix while leaving the rest of the eigenvalues untouched, has been applied to various algorithms, such as the conjugate gradient algorithm (Saad et al., 2000), principal component analysis (Mackey, 2008), the generalized minimal residual method (Morgan, 1995), and nonlinear fixed-point iteration (Shroff & Keller, 1993), to improve convergence. Some prior work in RL has explored subtracting constants from the iterates of VI, resulting in methods that resemble our rank-1 DDVI. For discounted MDPs, Devraj & Meyn (2021) propose relative Q-learning, which can roughly be seen as the Q-learning version of rank-1 DDVI for Control, and Bertsekas (1995) proposes an extrapolation method for VI, which can be seen as rank-1 DDVI with a Schur deflation matrix for PE. White (1963) introduced relative value iteration in the context of undiscounted MDPs with average reward, and several variants were proposed (Gupta et al., 2015; Sharma et al., 2020).

Matrix splitting of value iteration. Matrix splitting has been studied in the RL literature to obtain acceleration. Hastings (1968) and Kushner & Kleinman (1968) first suggested the Gauss-Seidel iteration for computing the value function. Kushner & Kleinman (1971) and Reetz (1973) applied Successive Over-Relaxation to VI and generalized the Jacobi and Gauss-Seidel iterations. Porteus (1975) proposed several transformations of the MDP that can be seen as a Gauss-Seidel variant of VI. Rakhsha et al. (2022) used matrix splitting with approximate dynamics. Bacon & Precup (2016); Bacon (2018) analyzed planning with options through the matrix splitting perspective.

Convergence analysis of TD learning. Jaakkola et al. (1993) first proved the convergence of TD learning using the stochastic approximation (SA) technique. Borkar & Meyn (2000); Borkar (1998) suggested an ODE-based framework to provide asymptotic convergence of SA, including TD learning with asynchronous updates. Lakshminarayanan & Szepesvari (2018) study linear SA under i.i.d. noise with respect to the mean squared error, and Chen et al. (2020) study the asymptotic convergence rate of linear SA under Markovian noise. A finite-time analysis of TD learning was first provided by Dalal et al. (2018) and extended to the Markovian noise setting by Bhandari et al. (2018). Leveraging Lyapunov theory, Chen et al. (2023) establish a finite-time analysis of Markovian SA with an explicit bound.

B Proof of Theoretical Results

We prove the theoretical results in this appendix.

B.1 Proof of Theorem 3.1

By the definition of DDVI (equation 4) and the fixed-point property of $V^\pi$, which satisfies $V^\pi = (I-\alpha\gamma E_s)^{-1}(\alpha r^\pi + ((1-\alpha)I + \alpha\gamma(P^\pi - E_s))V^\pi)$ as mentioned in Section 3.2, we have

$$\begin{aligned}
V^k - V^\pi &= (I-\alpha\gamma E_s)^{-1}\big[(1-\alpha)I + \alpha\gamma(P^\pi - E_s)\big](V^{k-1} - V^\pi)\\
&= (I-\alpha\gamma E_s)^{-1}\big[(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big]\big((1-\alpha)I + \alpha\gamma(P^\pi - E_s)\big)(V^{k-2} - V^\pi)\\
&= (I-\alpha\gamma E_s)^{-1}\big[(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big]^{k-1}\big((1-\alpha)I + \alpha\gamma(P^\pi - E_s)\big)(V^0 - V^\pi),
\end{aligned}$$

where $E_s$ is a rank-$s$ deflation matrix of $P^\pi$ satisfying the conditions of Facts 1, 2, or 3. This implies that

$$\|V^k - V^\pi\| \le C\,\Big\|\big((1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big)^{k-1}\Big\|\,\|V^0 - V^\pi\| \qquad (9)$$

for some constant $C\in\mathbb{R}$.

First, suppose $E_s$ satisfies Fact 1 or 2. Let $U_s = [u_1,\dots,u_s]$ and $V_s = [v_1,\dots,v_s]$, and define $D_{s,f(\lambda_i)} = \mathrm{diag}(f(\lambda_1),\dots,f(\lambda_s))$ for a function $f$, so that $U_s, V_s\in\mathbb{C}^{n\times s}$, $D_{s,f(\lambda_i)}\in\mathbb{C}^{s\times s}$, and $E_s = U_s D_{s,\lambda_i} V_s^*$. By the Jordan decomposition, $P^\pi = UJU^{-1}$, where $J$ is a Jordan matrix with $J_{ii} = \lambda_i$ for $i = 1,\dots,n$.
Let $J_s$ be the $s\times s$ submatrix of $J$ satisfying $(J_s)_{ij} = J_{ij}$ for $1\le i,j\le s$. Then, by the conditions of Theorem 3.1, $J_s = \mathrm{diag}(\lambda_1,\dots,\lambda_s) = D_{s,\lambda_i}$, and the $i$-th column of $U$ is $u_i$ for $1\le i\le s$. By a direct calculation, we get $P^\pi U_s = U_s J_s = U_s D_{s,\lambda_i}$ and

$$(I - \alpha\gamma E_s)^{-1} = I + U_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}} V_s^*.$$

Moreover,

$$\big(P^\pi - U_s D_{s,\lambda_i}V_s^*\big)\Big(U_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}V_s^*\Big) = U_s D_{s,\lambda_i}D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}V_s^* - U_s D_{s,\lambda_i}D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}V_s^* = 0,$$

so that

$$(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1} = P^\pi - E_s. \qquad (10)$$

Then, the operator on the right-hand side (RHS) of equation 9 can be written as

$$\begin{aligned}
&(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\\
&\quad= (1-\alpha)\Big(I + U_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}V_s^*\Big) + \alpha\gamma\big(UJU^{-1} - U_s D_{s,\lambda_i}V_s^*\big)\\
&\quad= (1-\alpha)I + \alpha\gamma UJU^{-1} + U_s D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}}V_s^*\\
&\quad= U\Big((1-\alpha)I + \alpha\gamma J + e_{1:s}D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}}V_s^*U\Big)U^{-1},
\end{aligned}$$

where $e_i\in\mathbb{R}^n$ is the $i$-th unit vector and $e_{1:s} = [e_1,\dots,e_s]$, and $(1-\alpha)I + \alpha\gamma J + e_{1:s}D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}}V_s^*U$ is an upper triangular matrix with diagonal entries

$$\frac{1-\alpha}{1-\alpha\gamma\lambda_1},\ \dots,\ \frac{1-\alpha}{1-\alpha\gamma\lambda_s},\ (1-\alpha)+\alpha\gamma\lambda_{s+1},\ \dots,\ (1-\alpha)+\alpha\gamma\lambda_n.$$

Now suppose $E_s = \sum_{i=1}^s\lambda_i u_i u_i^*$ satisfies Fact 3. Similarly, let $U_s = [u_1,\dots,u_s]$, so that $E_s = U_s D_{s,\lambda_i}U_s^*$. By the Schur decomposition, $P^\pi = URU^*$, where $R$ is an upper triangular matrix with $R_{ii} = \lambda_i$ for $i = 1,\dots,n$, and $U$ is a unitary matrix. By a direct calculation, we have $P^\pi U_s = U_s R_s$, where $R_s$ is the $s\times s$ submatrix of $R$ such that $(R_s)_{ij} = R_{ij}$ for $1\le i,j\le s$, and $(I-\alpha\gamma E_s)^{-1} = I + U_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^*$. Then,

$$\big(P^\pi - U_s D_{s,\lambda_i}U_s^*\big)\Big(U_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^*\Big) = U_s R_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^* - U_s D_{s,\lambda_i}D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^* = U_s(R_s - D_{s,\lambda_i})D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^*,$$

so that

$$(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1} = (P^\pi - U_s D_{s,\lambda_i}U_s^*) + U_s(R_s - D_{s,\lambda_i})D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^*.$$

Therefore,

$$\begin{aligned}
&(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\\
&\quad= (1-\alpha)I + (1-\alpha)U_s D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^* + \alpha\gamma URU^* - \alpha\gamma U_s D_{s,\lambda_i}U_s^* + \alpha\gamma U_s(R_s - D_{s,\lambda_i})D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}U_s^*\\
&\quad= (1-\alpha)I + \alpha\gamma URU^* + U_s\Big(D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} + \alpha\gamma(R_s - D_{s,\lambda_i})D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}\Big)U_s^*\\
&\quad= (1-\alpha)I + \alpha\gamma URU^* + U_s R_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}}U_s^*\\
&\quad= U\Big((1-\alpha)I + \alpha\gamma R + e_{1:s}R_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}}e_{1:s}^*\Big)U^*,
\end{aligned}$$

where $R_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} := D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} + \alpha\gamma(R_s - D_{s,\lambda_i})D_{s,\frac{\alpha\gamma\lambda_i}{1-\alpha\gamma\lambda_i}}$ is an $s\times s$ upper triangular matrix with diagonal entries $\frac{(\lambda_1\gamma-1)\gamma\alpha^2\lambda_1}{1-\gamma\alpha\lambda_1},\dots,\frac{(\lambda_s\gamma-1)\gamma\alpha^2\lambda_s}{1-\gamma\alpha\lambda_s}$. By a direct calculation, $(1-\alpha)I + \alpha\gamma R + e_{1:s}R_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}}e_{1:s}^*$ is an $n\times n$ upper triangular matrix with diagonal entries

$$\frac{1-\alpha}{1-\alpha\gamma\lambda_1},\ \dots,\ \frac{1-\alpha}{1-\alpha\gamma\lambda_s},\ (1-\alpha)+\alpha\gamma\lambda_{s+1},\ \dots,\ (1-\alpha)+\alpha\gamma\lambda_n.$$

Therefore, by spectral analysis, we conclude that $\|V^k - V^\pi\| = O(|\lambda|^k\|V^0 - V^\pi\|)$ as $k\to\infty$, where

$$\lambda = \max_{1\le i\le s,\ s+1\le j\le n}\left\{\left|\frac{1-\alpha}{1-\alpha\gamma\lambda_i}\right|,\ |1-\alpha+\alpha\gamma\lambda_j|\right\}.$$

B.2 Proof of Corollary 3.2

Setting $\alpha = 1$ in Theorem 3.1, we conclude Corollary 3.2.
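The spectrum computed in this proof is easy to check numerically. The following sketch (our own, using a random diagonalizable stochastic matrix and a Fact 1 deflation matrix, not the authors' code) compares the eigenvalue magnitudes of the DDVI iteration matrix from equation 9 against the predicted diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, g, a = 20, 2, 0.99, 0.7                        # n states, rank s, gamma, alpha
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
lam, U = np.linalg.eig(P)
order = np.argsort(-np.abs(lam)); lam, U = lam[order], U[:, order]
Vl = np.linalg.inv(U).conj().T                       # left eigenvectors (Fact 1 scaling)
E = (U[:, :s] * lam[:s]) @ Vl[:, :s].conj().T        # E_s

Minv = np.linalg.inv(np.eye(n) - a * g * E)
T = (1 - a) * Minv + a * g * (P - E) @ Minv          # DDVI iteration matrix from eq. (9)
pred = np.concatenate([(1 - a) / (1 - a * g * lam[:s]),   # (1-a)/(1 - a g lam_i), i <= s
                       1 - a + a * g * lam[s:]])          # 1 - a + a g lam_j, j > s
print(np.allclose(np.sort(np.abs(np.linalg.eigvals(T))), np.sort(np.abs(pred))))
```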
B.3 Proof of Theorem 4.1

By the definition of DDVI for Control (equation 6 and equation 7),

$$V^k - V^* = \Big(I + \frac{\gamma}{1-\gamma}E_1\Big)\big(W^k - (I-\gamma E_1)V^*\big),$$

where $E_1 = \mathbf{1}v^\top$. Let $W^* = (I-\gamma E_1)V^*$. Note that

$$\max_\pi\{r^\pi + \gamma(P^\pi - E_1)W^*\} = \max_\pi\{r^\pi + \gamma(P^\pi - E_1)V^*\} = \max_\pi\{r^\pi + \gamma P^\pi V^*\} - \gamma E_1 V^* = W^*,$$

and hence

$$W^k - W^* = \gamma(P^{\pi_g} - E_1)W^{k-1} + r^{\pi_g} - \gamma(P^{\pi^*} - E_1)W^* - r^{\pi^*},$$

where $\pi_g = \arg\max_\pi\{r^\pi + \gamma(P^\pi - E_1)W^{k-1}\}$ and $\pi^*$ is an optimal policy. Then, by the definition of the greedy policy, we have

$$\gamma(P^{\pi_g}-E_1)W^{k-1} + r^{\pi_g} - \gamma(P^{\pi^*}-E_1)W^* - r^{\pi^*} \le \gamma(P^{\pi_g}-E_1)(W^{k-1}-W^*),$$
$$\gamma(P^{\pi_g}-E_1)W^{k-1} + r^{\pi_g} - \gamma(P^{\pi^*}-E_1)W^* - r^{\pi^*} \ge \gamma(P^{\pi^*}-E_1)(W^{k-1}-W^*).$$

This implies that

$$W^k - W^* \le \gamma(P^{\pi_g}-E_1)(W^{k-1}-W^*), \qquad W^k - W^* \ge \gamma(P^{\pi^*}-E_1)(W^{k-1}-W^*),$$

and therefore

$$\gamma P^{\pi^*}(W^{k-1}-W^*) \le W^k - W^* + \gamma E_1(W^{k-1}-W^*) \le \gamma P^{\pi_g}(W^{k-1}-W^*).$$

Then, there exist $0\le t_i\le 1$ such that

$$t_i\big[\gamma P^{\pi^*}(W^{k-1}-W^*)\big]_i + (1-t_i)\big[\gamma P^{\pi_g}(W^{k-1}-W^*)\big]_i = \big[W^k - W^* + \gamma E_1(W^{k-1}-W^*)\big]_i$$

for $1\le i\le n$. Define $\pi_k(a\mid i) = t_i\pi^*(a\mid i) + (1-t_i)\pi_g(a\mid i)$ for all $a\in\mathcal{A}$ and $1\le i\le n$. Then $\pi_k$ satisfies

$$W^k - W^* = \gamma(P^{\pi_k}-E_1)(W^{k-1}-W^*).$$

Thus, we get

$$W^k - W^* = \gamma^k\prod_{i=1}^k(P^{\pi_i}-E_1)(W^0-W^*) = \gamma^k\prod_{i=1}^k(P^{\pi_i}-E_1)(I-\gamma E_1)(V^0-V^*) = \gamma^k\prod_{i=1}^k(P^{\pi_i}-E_1)(V^0-V^*) = \gamma^k(P^{\pi_k}-E_1)\prod_{i=1}^{k-1}P^{\pi_i}(V^0-V^*),$$

where the third and last equalities follow from $(P^{\pi_i}-E_1)E_1 = 0$. This implies that

$$V^k - V^* = \gamma^k\Big(I + \frac{\gamma}{1-\gamma}E_1\Big)(P^{\pi_k}-E_1)\prod_{i=1}^{k-1}P^{\pi_i}(V^0-V^*) = \gamma^k\Big(P^{\pi_k} + \frac{\gamma}{1-\gamma}E_1 P^{\pi_k} - \frac{1}{1-\gamma}\mathbf{1}v^\top\Big)\prod_{i=1}^{k-1}P^{\pi_i}(V^0-V^*).$$

Then, we have

$$\|V^k - V^*\|_\infty \le \frac{2}{1-\gamma}\gamma^k\|V^0 - V^*\|_\infty,$$

since $E_1 P^{\pi_k}$ is a stochastic matrix with $\|E_1 P^{\pi_k}\|_\infty = 1$.

Suppose there exists a unique optimal policy $\pi^*$. Then, if (non-deflated) VI generates $V^k$ for $k = 0, 1, \dots$, there exists $K$ such that $\arg\max_\pi T^\pi V^k = \pi^*$ for $k > K$. Since DDVI for Control generates the same greedy policies as VI for Control, DDVI for Control also produces the greedy policy $\pi^*$ after $k > K$ iterations. Therefore, we get

$$V^k - V^* = \gamma^k\Big(I + \frac{\gamma}{1-\gamma}\mathbf{1}v^\top\Big)(P^{\pi^*}-E_1)^{k-K}\prod_{i=1}^{K}(P^{\pi_i}-E_1)(V^0-V^*),$$

so that

$$\|V^k - V^*\| \le C\gamma^k\big\|(P^{\pi^*}-E_1)^{k-K}\big\|\,\|V^0-V^*\| = O(|\gamma\lambda_2|^k\|V^0-V^*\|)$$

for $k > K$ and some constant $C\in\mathbb{R}$.

B.4 Proof of Theorem 5.1

Consider the following stochastic approximation algorithm

$$Y^{k+1} = Y^k + \eta_k f(Y^k, \zeta_k) \qquad (11)$$

for $k = 0, 1, \dots$, where $Y^k\in\mathbb{R}^n$, $\eta_k\in\mathbb{R}_+$, $f(\cdot, z)\colon\mathbb{R}^n\to\mathbb{R}^n$ is uniformly Lipschitz, and $\{\zeta_k\}_{k=0,1,\dots}$ are i.i.d. random variables. Let $F(y) = \mathbb{E}[f(y,\zeta_k)]$, $F_\infty(y) = \lim_{r\to\infty}F(ry)/r$, $M^{k+1} = f(Y^k,\zeta_k) - F(Y^k)$, and $\mathcal{F}_k = \sigma(Y^i, M^i,\ 1\le i\le k)$. The following result from Borkar & Meyn (2000) shows the convergence of the $Y^k$ sequence.

Proposition B.1 (Borkar & Meyn, 2000, Theorems 2.2 and 2.5). If (i) $\eta_k = (k+1)^{-1}$, (ii) $\{M^k, \mathcal{F}_k\}_{k=0,1,\dots}$ is a martingale difference sequence, (iii) $\mathbb{E}[\|M^{k+1}\|^2\mid\mathcal{F}_k] \le K(1+\|Y^k\|^2)$ for some constant $K$, (iv) $\dot y(t) = F_\infty(y(t))$ has the origin as an asymptotically stable equilibrium, and (v) $\dot y(t) = F(y(t))$ has a unique globally asymptotically stable equilibrium $y^\star$, then $\{Y^k\}_{k=0,1,\dots}$ of equation 11 converges to $y^\star$ almost surely. Furthermore, the iterates $\{Y^k\}_{k=0,1,\dots}$ of

$$Y^{k+1}(i_k) = Y^k(i_k) + \eta_{\nu_{i_k,k}} f(Y^k,\zeta_k)(i_k) \qquad (12)$$

for $k = 0, 1, \dots$, where $i_k$ are random variables taking values in $\{1, 2, \dots, n\}$ and $\nu_{i,k} = \sum_{t=0}^k\mathbf{1}_{\{i_t = i\}}$, satisfying $\liminf_k \nu_{i,k}/k > c$ for some constant $c > 0$, also converge to $y^\star$ almost surely. In equation 12, $Y^k(i_k)$ and $f(Y^k,\zeta_k)(i_k)$ are the $i_k$-th coordinates of $Y^k$ and $f(Y^k,\zeta_k)$, respectively. Equation 12 can be interpreted as an asynchronous version of equation 11. We note that Proposition B.1 is a simplified version, with stronger conditions, of Theorems 2.2 and 2.5 of Borkar & Meyn (2000).

Recall that DDTD (equation 8) with $\alpha = 1$ is

$$W^{k+1}(X_k) = W^k(X_k) + \eta_k(X_k)\big[r^\pi(X_k) + \gamma V^k(X_k') - \gamma(E_s V^k)(X_k) - W^k(X_k)\big], \qquad V^{k+1} = (I-\gamma E_s)^{-1}W^{k+1}. \qquad (13)$$

We will first show that $W^k\to(I-\gamma E_s)V^\pi$, and this directly implies that $V^k\to V^\pi$. For applying Proposition B.1, consider $W^{k+1} = W^k + \eta_k f(W^k, Z_k)$, where $\eta_k = (k+1)^{-1}$, $Z_k\in\mathcal{X}^n$ such that $Z_k(x)\sim P^\pi(\cdot\mid x)$ with $\{Z_k\}_{k=0,1,\dots}$ i.i.d., and

$$f(W, Z)(x) = r^\pi(x) + \gamma\big((I-\gamma E_s)^{-1}W\big)(Z(x)) - \gamma\big(E_s(I-\gamma E_s)^{-1}W\big)(x) - W(x).$$

Then, $F(W) = \mathbb{E}[f(W, Z_k)] = \big[\gamma(P^\pi - E_s)(I-\gamma E_s)^{-1} - I\big]W + r^\pi$, $M^{k+1} = f(W^k, Z_k) - F(W^k)$, and $\mathcal{F}_k = \sigma(W^i, M^i,\ 1\le i\le k)$.

We now check the conditions. First, since

$$f(W, z)(x) - f(W', z)(x) = \gamma\big((I-\gamma E_s)^{-1}(W - W')\big)(z(x)) - \big((I + \gamma E_s(I-\gamma E_s)^{-1})(W - W')\big)(x)$$

implies $\|f(W,z) - f(W',z)\|_\infty \le \big(\gamma\|(I-\gamma E_s)^{-1}\|_\infty + \|I + \gamma E_s(I-\gamma E_s)^{-1}\|_\infty\big)\|W - W'\|_\infty$, $f(\cdot, z)$ is uniformly Lipschitz. Next, we have

$$\mathbb{E}[M^{k+1}(x)\mid\mathcal{F}_k] = \mathbb{E}\big[\gamma\big((I-\gamma E_s)^{-1}W^k\big)(Z_k(x))\mid\mathcal{F}_k\big] - \gamma\big(P^\pi(I-\gamma E_s)^{-1}W^k\big)(x) = 0$$

for all $x\in\mathcal{X}$ (the $E_s$ terms are $\mathcal{F}_k$-measurable and cancel). This implies that $\{M^k\}$ is a martingale difference sequence. Also,

$$\mathbb{E}[\|M^{k+1}\|^2\mid\mathcal{F}_k] \le \big(\gamma\|(I-\gamma E_s)^{-1}\|_\infty + \|\gamma P^\pi(I-\gamma E_s)^{-1}\|_\infty\big)^2\|W^k\|^2 \le K(1+\|W^k\|^2)$$

for the constant $K = \big(\gamma\|(I-\gamma E_s)^{-1}\|_\infty + \|\gamma P^\pi(I-\gamma E_s)^{-1}\|_\infty\big)^2$. By definition, $F_\infty(y) = \lim_{r\to\infty}F(ry)/r = \big[\gamma(P^\pi - E_s)(I-\gamma E_s)^{-1} - I\big]y$. In Section B.1, we showed that $\mathrm{spec}(F_\infty + I) = \{\gamma\lambda_{s+1},\dots,\gamma\lambda_n, 0\}$. Hence, $\mathrm{spec}(F_\infty) = \{\gamma\lambda_{s+1}-1,\dots,\gamma\lambda_n - 1, -1\}$. Since $|\lambda_i|\le 1$ for all $s+1\le i\le n$, the real parts of all eigenvalues of $F_\infty$ are negative. This implies that $\dot y(t) = F_\infty(y(t))$ has the origin as an asymptotically stable equilibrium and $\dot y(t) = F(y(t))$ has the unique globally asymptotically stable equilibrium $W^\pi = \big[I - \gamma(P^\pi - E_s)(I-\gamma E_s)^{-1}\big]^{-1}r^\pi$. Lastly, since $X_k\sim\mathrm{Unif}(\mathcal{X})$ and $\{X_k\}_{k=0,1,\dots}$ are i.i.d. random variables, $\nu_{i,k}/k$ converges to $\frac{1}{n}$ almost surely by the law of large numbers, and $\eta_{\nu_{X_k,k}} = \big(\sum_{i=0}^k\mathbf{1}_{\{X_i = X_k\}}\big)^{-1}$ matches the stepsize of equation 13. Therefore, by Proposition B.1, $\{W^k\}_{k=0,1,\dots}$ of equation 13 converges to $W^\pi$ almost surely, and this implies that $V^k\to(I-\gamma E_s)^{-1}W^\pi = (I-\gamma P^\pi)^{-1}r^\pi = V^\pi$ almost surely.

B.5 Proof of Theorem 5.2

Consider the following linear stochastic approximation algorithm

$$Y^{k+1} = Y^k + \eta_k\big(A(\zeta_k)Y^k + b(\zeta_k)\big) \qquad (14)$$

for $k = 0, 1, \dots$, where $Y^k\in\mathbb{R}^n$, $\{\zeta_k\}_{k=0,1,\dots}$ are random variables, $b(\zeta_k)\in\mathbb{R}^n$, $A(\zeta_k)\in\mathbb{R}^{n\times n}$, and the entries of $A(\zeta_k)$ and $b(\zeta_k)$ are bounded. The following result from Chen et al. (2020) provides a rate of convergence of $Y^k$ to the solution $Y^\star = -A^{-1}b$, where $A$ and $b$ are the expectations of the matrices $A(\zeta_k)$ and vectors $b(\zeta_k)$, respectively.

Proposition B.2 (Chen et al., 2020, Theorem 2.5 and Corollary 2.7). If (i) $\eta_k = g(k+1)^{-1}$ for $g > 0$, (ii) $\{\zeta_k\}_{k=0,1,\dots}$ is an ergodic (aperiodic and irreducible) Markov process with a unique invariant measure $\mu$, (iii) $\mathbb{E}_\mu[A(\zeta_k)] = A$ and $\mathbb{E}_\mu[b(\zeta_k)] = b$, (iv) $A$ is Hurwitz, and (v) $\mathrm{Re}(\lambda) < -1/2$ for all $\lambda\in\mathrm{spec}(gA)$, then $\mathbb{E}[\|Y^k - Y^\star\|^2] = O(k^{-1})$, where $Y^\star = -A^{-1}b$. We note that Proposition B.2 is also a simplified version, with stronger conditions, of Theorem 2.5 and Corollary 2.7 of Chen et al. (2020).
B.4 Proof of Theorem 5.1

Consider the following stochastic approximation algorithm
$$Y^{k+1} = Y^k + \eta_k f(Y^k, \zeta_k) \qquad (11)$$
for $k = 0, 1, \ldots$, where $Y^k\in\mathbb{R}^n$, $\eta_k\in\mathbb{R}_+$, $f(\cdot, z)\colon\mathbb{R}^n\to\mathbb{R}^n$ is uniformly Lipschitz, and $\{\zeta_k\}_{k=0,1,\ldots}$ are i.i.d. random variables. Let $F(y) = \mathbb{E}[f(y, \zeta_k)]$, $F_\infty(y) = \lim_{r\to\infty}F(ry)/r$, $M^{k+1} = f(Y^k, \zeta_k) - F(Y^k)$, and $\mathcal{F}_k = \sigma(Y^i, M^i,\ 1\le i\le k)$. The following result from Borkar & Meyn (2000) shows the convergence of the $Y^k$ sequence.

Proposition B.1. (Borkar & Meyn, 2000, Theorems 2.2 and 2.5) If (i) $\eta_k = (k+1)^{-1}$, (ii) $\{M^k, \mathcal{F}_k\}_{k=0,1,\ldots}$ is a martingale difference sequence, (iii) $\mathbb{E}[\|M^{k+1}\|^2\mid\mathcal{F}_k] \le K(1+\|Y^k\|^2)$ for some constant $K$, (iv) $\dot y(t) = F_\infty(y(t))$ has the origin as an asymptotically stable equilibrium, and (v) $\dot y(t) = F(y(t))$ has a unique globally asymptotically stable equilibrium $y^\star$, then $\{Y^k\}_{k=0,1,\ldots}$ of equation 11 converges to $y^\star$ almost surely. Furthermore, the iterates $\{Y^k\}_{k=0,1,\ldots}$ of
$$Y^{k+1}(i_k) = Y^k(i_k) + \eta_{\nu_{i_k,k}} f(Y^k, \zeta_k)(i_k) \qquad (12)$$
for $k = 0, 1, \ldots$, where the $i_k$ are random variables taking values in $\{1, 2, \ldots, n\}$ and $\nu_{i,k} = \sum_{t=0}^{k}\mathbf{1}\{i_t = i\}$ satisfies $\liminf_k \nu_{i,k}/k > c$ for some constant $c > 0$, also converge to $y^\star$ almost surely.

In equation 12, $Y^k(i_k)$ and $f(Y^k, \zeta_k)(i_k)$ are the $i_k$-th coordinates of $Y^k$ and $f(Y^k, \zeta_k)$, respectively, so equation 12 can be interpreted as an asynchronous version of equation 11. We note that Proposition B.1 is a simplified version, with stronger conditions, of Theorems 2.2 and 2.5 of Borkar & Meyn (2000).

Recall that DDTD (equation 8) with $\alpha = 1$ is
$$W^{k+1}(X_k) = W^k(X_k) + \eta_k(X_k)\big(r^\pi(X_k) + \gamma V^k(X_k') - \gamma(E_s V^k)(X_k) - W^k(X_k)\big), \qquad V^{k+1} = (I-\gamma E_s)^{-1}W^{k+1}. \qquad (13)$$
We will first show that $W^k \to (I-\gamma E_s)V^\pi$, and this directly implies that $V^k \to V^\pi$. To apply Proposition B.1, consider $W^{k+1} = W^k + \eta_k f(W^k, Z_k)$, where $\eta_k = (k+1)^{-1}$, $Z_k\in\mathbb{R}^n$ has coordinates $Z_k(x)\sim P^\pi(\cdot\mid x)$ with $\{Z_k\}_{k=0,1,\ldots}$ i.i.d. random variables, and
$$f(W, Z)(x) = r^\pi(x) + \gamma\big((I-\gamma E_s)^{-1}W\big)(Z(x)) - \gamma\big(E_s(I-\gamma E_s)^{-1}W\big)(x) - W(x).$$
Then, $F(W) = \mathbb{E}[f(W, Z_k)] = \big[\gamma(P^\pi - E_s)(I-\gamma E_s)^{-1} - I\big]W + r^\pi$, $M^{k+1} = f(W^k, Z_k) - F(W^k)$, and $\mathcal{F}_k = \sigma(W^i, M^i,\ 1\le i\le k)$.

We now check the conditions. First, since
$$f(W, z)(x) - f(W', z)(x) = \gamma\big((I-\gamma E_s)^{-1}(W - W')\big)(z(x)) - \big((I + \gamma E_s(I-\gamma E_s)^{-1})(W - W')\big)(x)$$
implies $\|f(W, z) - f(W', z)\| \le \big(\gamma\|(I-\gamma E_s)^{-1}\| + \|I + \gamma E_s(I-\gamma E_s)^{-1}\|\big)\|W - W'\|$, $f(\cdot, z)$ is uniformly Lipschitz. We have
$$\mathbb{E}[M^{k+1}(x)\mid\mathcal{F}_k] = \mathbb{E}\big[\gamma\big((I-\gamma E_s)^{-1}W^k\big)(Z_k(x)) - \gamma\big(E_s(I-\gamma E_s)^{-1}W^k\big)(x)\mid\mathcal{F}_k\big] - \gamma\big((P^\pi - E_s)(I-\gamma E_s)^{-1}W^k\big)(x) = \gamma\big(P^\pi(I-\gamma E_s)^{-1}W^k\big)(x) - \gamma\big(P^\pi(I-\gamma E_s)^{-1}W^k\big)(x) = 0$$
for all $x\in\mathcal{X}$. This implies that $\{M^k\}$ is a martingale difference sequence. Also,
$$\mathbb{E}[\|M^{k+1}\|^2\mid\mathcal{F}_k] \le \big(\gamma\|(I-\gamma E_s)^{-1}\| + \gamma\|P^\pi(I-\gamma E_s)^{-1}\|\big)^2\|W^k\|^2 \le K(1 + \|W^k\|^2)$$
for the constant $K = \big(\gamma\|(I-\gamma E_s)^{-1}\| + \gamma\|P^\pi(I-\gamma E_s)^{-1}\|\big)^2$. By definition, $F_\infty(y) = \lim_{r\to\infty}F(ry)/r = \big[\gamma(P^\pi - E_s)(I-\gamma E_s)^{-1} - I\big]y$. In Section B.1, we showed that $\mathrm{spec}(F_\infty + I) = \{\gamma\lambda_{s+1}, \ldots, \gamma\lambda_n, 0\}$. Hence, $\mathrm{spec}(F_\infty) = \{\gamma\lambda_{s+1} - 1, \ldots, \gamma\lambda_n - 1, -1\}$. Since $|\lambda_i|\le 1$ for all $s+1\le i\le n$, the real parts of all eigenvalues of $F_\infty$ are negative. This implies that $\dot y(t) = F_\infty(y(t))$ has the origin as an asymptotically stable equilibrium, and $\dot y(t) = F(y(t))$ has a unique globally asymptotically stable equilibrium $W^\pi = \big[I - \gamma(P^\pi - E_s)(I-\gamma E_s)^{-1}\big]^{-1}r^\pi$. Lastly, since $X_k\sim\mathrm{Unif}(\mathcal{X})$ and $\{X_k\}_{k=0,1,\ldots}$ are i.i.d. random variables, $\nu_{i,k}/k$ converges to $\frac{1}{n}$ almost surely by the law of large numbers, and $\eta_{\nu_{X_k,k}} = \big(\sum_{i=0}^{k}\mathbf{1}\{X_i = X_k\}\big)^{-1}$ is exactly the stepsize used in equation 13. Therefore, by Proposition B.1, $\{W^k\}_{k=0,1,\ldots}$ of equation 13 converges to $W^\pi$ almost surely, and this implies that $V^k \to (I-\gamma E_s)^{-1}W^\pi = (I-\gamma P^\pi)^{-1}r^\pi = V^\pi$ almost surely.

B.5 Proof of Theorem 5.2

Consider the following linear stochastic approximation algorithm
$$Y^{k+1} = Y^k + \eta_k\big(A(\zeta_k)Y^k + b(\zeta_k)\big) \qquad (14)$$
for $k = 0, 1, \ldots$, where $Y^k\in\mathbb{R}^n$, $\{\zeta_k\}_{k=0,1,\ldots}$ are random variables, $b(\zeta_k)\in\mathbb{R}^n$, and $A(\zeta_k)\in\mathbb{R}^{n\times n}$, with $|A(\zeta_k)_{ij}|, |b(\zeta_k)_i| < \infty$ for $1\le i, j\le n$. The following result from Chen et al. (2020) provides a rate of convergence of $Y^k$ to the solution $Y^\star = -\bar A^{-1}\bar b$, where $\bar A$ and $\bar b$ are the expectations of the matrices $A(\zeta_k)$ and vectors $b(\zeta_k)$, respectively.

Proposition B.2. (Chen et al., 2020, Theorem 2.5 and Corollary 2.7) If (i) $\eta_k = g(k+1)^{-1}$ for $g > 0$, (ii) $\{\zeta_k\}_{k=0,1,\ldots}$ is an ergodic (aperiodic and irreducible) Markov process with a unique invariant measure $\mu$, (iii) $\mathbb{E}_\mu[A(\zeta_k)] = \bar A$ and $\mathbb{E}_\mu[b(\zeta_k)] = \bar b$, (iv) $\bar A$ is Hurwitz, and (v) $\mathrm{Re}(\lambda) < -1/2$ for all $\lambda\in\mathrm{spec}(g\bar A)$, then $\mathbb{E}[\|Y^k - Y^\star\|^2] = O(k^{-1})$, where $Y^\star = -\bar A^{-1}\bar b$.

We note that Proposition B.2 is also a simplified version, with stronger conditions, of Theorem 2.5 and Corollary 2.7 of Chen et al. (2020).

Define $\psi(X_k)\in\mathbb{R}^n$ as $\psi(X_k)(x) = \mathbf{1}\{X_k = x\}$. Then, DDTD (equation 8) with $\alpha = 1$ is equivalent to
$$W^{k+1} = W^k + \eta_k(X_k)\big(A(X_k, X_k')W^k + b(X_k, X_k')\big),$$
where
$$A(X, X') = \psi(X)\big(\gamma\,\psi(X')^\top(I-\gamma E_s)^{-1} - \gamma\,\psi(X)^\top E_s(I-\gamma E_s)^{-1} - \psi(X)^\top\big) \quad\text{and}\quad b(X, X') = \psi(X)\,r^\pi(X).$$
Hence, the entries of $A(X, X')$ and $b(X, X')$ are bounded. Since $\{(X_k, X_k')\}_{k=0,1,\ldots}$ are i.i.d. random variables, $\{(X_k, X_k')\}_{k=0,1,\ldots}$ is a Markov process. Let $\mu$ be the distribution of $(X_k, X_k')$ and $P(x_1, x_1'\mid x_2, x_2')$ be the transition kernel of $\{(X_k, X_k')\}_{k=0,1,\ldots}$. Then, $P(x_1, x_1'\mid x_2, x_2') = \mu(x_1, x_1') = \frac{1}{n}P^\pi(x_1'\mid x_1)$. Since $(\mu')^\top P(\cdot,\cdot\mid\cdot,\cdot) = \mu$ for any distribution $\mu'$, $\mu$ is the unique invariant measure; hence, $\{(X_k, X_k')\}_{k=0,1,\ldots}$ is irreducible and aperiodic. We have
$$\mathbb{E}_\mu[A(X_k, X_k')] = \mathbb{E}_\mu\big[\psi(X_k)\big(\gamma\,\psi(X_k')^\top(I-\gamma E_s)^{-1} - \gamma\,\psi(X_k)^\top E_s(I-\gamma E_s)^{-1} - \psi(X_k)^\top\big)\big] = \frac{1}{n}\big(\gamma(P^\pi - E_s)(I-\gamma E_s)^{-1} - I\big),$$
using $\mathbb{E}_\mu[\psi(X_k)\psi(X_k')^\top] = \frac{1}{n}P^\pi$ and $\mathbb{E}_\mu[\psi(X_k)\psi(X_k)^\top] = \frac{1}{n}I$, and $\mathbb{E}_\mu[b(X_k, X_k')] = \frac{1}{n}r^\pi$. Then,
$$Y^\star = -\bar A^{-1}\bar b = \big[I - \gamma(P^\pi - E_s)(I-\gamma E_s)^{-1}\big]^{-1}r^\pi = W^\pi.$$
As we discussed in Section B.4, $\mathrm{spec}(n\bar A) = \{\gamma\lambda_{s+1} - 1, \ldots, \gamma\lambda_n - 1, -1\}$, so $\bar A$ is Hurwitz. Let $\lambda_{\mathrm{DDTD}} = \min_{\lambda\in\{\lambda_{s+1},\ldots,\lambda_n\}}\mathrm{Re}(1-\gamma\lambda)$. Then, if $g > \frac{n}{2\lambda_{\mathrm{DDTD}}}$, we have $\mathrm{Re}(\lambda) < -1/2$ for all $\lambda\in\mathrm{spec}(g\bar A)$. Therefore, by Proposition B.2, $\mathbb{E}[\|W^k - W^\pi\|^2] = O(k^{-1})$, and this implies that
$$\mathbb{E}[\|V^k - V^\pi\|^2] \le \|(I-\gamma E_s)^{-1}\|^2\,\mathbb{E}[\|W^k - W^\pi\|^2] = O(k^{-1}).$$
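For concreteness, the following sketch implements the tabular DDTD update of equation 13 under the sampling model of Theorems 5.1 and 5.2: states $X_k\sim\mathrm{Unif}(\mathcal{X})$, next states $X_k'\sim P^\pi(\cdot\mid X_k)$ from a generative model, and per-state stepsizes equal to inverse visit counts as in the asynchronous scheme of Proposition B.1. The sampler and variable names are ours; this is an illustrative sketch, not the experimental code.

```python
import numpy as np

def ddtd(P, r, gamma, Es, num_steps, rng):
    """Tabular DDTD (equation 13, alpha = 1) with uniform state sampling."""
    n = P.shape[0]
    Minv = np.linalg.inv(np.eye(n) - gamma * Es)   # (I - gamma E_s)^{-1}
    W = np.zeros(n)
    counts = np.zeros(n)                   # per-state visit counts for local stepsizes
    for _ in range(num_steps):
        x = rng.integers(n)                # X_k ~ Unif(X)
        x_next = rng.choice(n, p=P[x])     # X'_k ~ P^pi(. | x)
        counts[x] += 1
        eta = 1.0 / counts[x]              # eta_k(X_k), inverse visit count
        V = Minv @ W                       # V^k = (I - gamma E_s)^{-1} W^k
        td = r[x] + gamma * V[x_next] - gamma * (Es[x] @ V) - W[x]
        W[x] += eta * td
    return Minv @ W                        # V^k -> V^pi almost surely (Theorem 5.1)

# Example usage (P a row-stochastic n x n matrix, r a reward vector):
# V = ddtd(P, r, gamma=0.99, Es=np.zeros((n, n)), num_steps=50_000,
#          rng=np.random.default_rng(0))
```

With $E_s = 0$ this reduces to tabular TD(0); a rank-$s$ $E_s$ leaves the fixed point unchanged while improving the effective contraction factor, as quantified in Section B.6 below.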
B.6 Non-asymptotic convergence analysis of DDTD

Consider the following stochastic approximation algorithm
$$Y^{k+1} = Y^k + \eta_k(X_k)\big(f(Y^k, \zeta_k) - Y^k + Z_k\big) \qquad (15)$$
for $k = 0, 1, \ldots$, where $Y^k\in\mathbb{R}^n$, $\eta_k(X_k)\in\mathbb{R}_+$, and $\{\zeta_k\}_{k=0,1,\ldots}$ is a Markov chain with unique invariant distribution $\mu$ and transition matrix $P$. Let $F(y) = \mathbb{E}[f(y, \zeta_k)]$ and $\mathcal{F}_k = \sigma(Y^i, M^i, Z^i, \eta_i,\ 1\le i\le k)$. Define the mixing time as $t_\delta = \min\{k\ge 0 : \max_y \|P^k(y, \cdot) - \mu(\cdot)\|_{TV} < \delta\}$, where TV stands for the total variation distance. Let $Y^\star$ be the fixed point of $F$. Then the following proposition holds.

Proposition B.3. (Chen et al., 2023, Theorem 1) Let (i) $\{\zeta_k\}_{k=0,1,\ldots}$ be an aperiodic and irreducible Markov chain, (ii) $F$ be a $\beta$-contraction with respect to some norm $\|\cdot\|_c$, (iii) $f(\cdot, z)\colon\mathbb{R}^n\to\mathbb{R}^n$ be $A_1$-uniformly Lipschitz with respect to $\|\cdot\|_c$ with $\|f(0, y)\|_c \le B_1$ for any $y$, (iv) $\{M^k, \mathcal{F}_k\}_{k=0,1,\ldots}$ be a martingale difference sequence, and (v) $\mathbb{E}[\|Z^{k+1}\|\mid\mathcal{F}_k] \le A_2 + B_2\|Y^k\|$ for some $A_2, B_2 > 0$. Then, if $\eta_k = \eta$,
$$\mathbb{E}\big[\|Y^k - Y^\star\|_c^2\big] \le c_1 d_1(1 - d_2\eta)^{k - t_\eta} + d_3 \quad\text{for all } k\ge K,$$
and if $\eta_k = \frac{1}{d_2(k+h)}$, we have
$$\mathbb{E}\big[\|Y^k - Y^\star\|_c^2\big] \le c_1 d_1\,\frac{K+h}{k+h} + 8\eta_k^2 d_3 c_2\, t_k\log(k+h) \quad\text{for all } k\ge K,$$
where $\{\eta_k(X_k)\}_{k=0,1,\ldots}$ is a nonincreasing sequence satisfying $\sum_{i=k-t_k}^{k-1}\eta_i \le \min\big\{\frac{d_2}{d_3 A^2}, \frac{1}{4A}\big\}$ for all $k\ge t_k$, $K = \min\{k\ge 0 : k\ge t_k\}$, $A = A_1 + A_2 + 1$, $B = B_1 + B_2$, $c_1 = (\|Y^0 - Y^\star\|_c + \|Y^0\|_c + A/B)^2$, $c_2 = (A\|Y^\star\|_c + B)^2$, $l_{cs}\|\cdot\|_s \le \|\cdot\|_c \le u_{cs}\|\cdot\|_s$ for a chosen norm $\|\cdot\|_s$ such that $\frac{1}{2}\|\cdot\|_s^2$ is an $L$-smooth function with respect to $\|\cdot\|_s$, $d_1 = \frac{1+\theta u_{cs}^2}{1+\theta l_{cs}^2}$, $d_2 = 1 - \beta d_1^{1/2}$, and $d_3 = \frac{114L(1+\theta u_{cs}^2)}{\theta l_{cs}^2}$, with $\theta$ chosen such that $\beta^2 \le (d_1)^{-1}$.

DDTD in equation 8 with $\alpha = 1$ is equivalent to $W^{k+1} = W^k + \eta_k(X_k)\big(f(W^k, X_k, X_k') - W^k + Z_k\big)$, where $Z_k = 0$ and
$$f(W, X, X')(x) = \mathbf{1}\{X = x\}\Big(r^\pi(x) + \gamma\big((I-\gamma E_s)^{-1}W\big)(X') - \gamma\big(E_s(I-\gamma E_s)^{-1}W\big)(x) - W(x)\Big) + W(x).$$
As we showed in Section B.5, $\{(X_k, X_k')\}_{k=0,1,\ldots}$ is an irreducible and aperiodic Markov chain with invariant measure $\mu$; furthermore, its mixing time is 1. We also have
$$F(W)(x) = \mathbb{E}_{\mu}\big[f(W, X, X')(x)\big] = \frac{1}{n}\Big(r^\pi(x) + \gamma\big((P^\pi - E_s)(I-\gamma E_s)^{-1}W\big)(x)\Big) + \Big(1 - \frac{1}{n}\Big)W(x). \qquad (16)$$
Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $P^\pi$ sorted in decreasing order of magnitude with ties broken arbitrarily, and let $A_{\mathrm{DDTD}} = \gamma(P^\pi - E_s)(I-\gamma E_s)^{-1}$. Since the spectral radius is the infimum of induced matrix norms, for any $\epsilon > 0$ there exists a norm $\|\cdot\|_\epsilon$ such that
$$\Big\|\tfrac{1}{n}A_{\mathrm{DDTD}} + \big(1-\tfrac{1}{n}\big)I\Big\|_\epsilon \le \rho\Big(\tfrac{1}{n}A_{\mathrm{DDTD}} + \big(1-\tfrac{1}{n}\big)I\Big) + \epsilon, \quad\text{where}\quad \rho\Big(\tfrac{1}{n}A_{\mathrm{DDTD}} + \big(1-\tfrac{1}{n}\big)I\Big) = \max_{\lambda\in\{\lambda_{s+1},\ldots,\lambda_n\}}\Big|1 - \frac{1-\gamma\lambda}{n}\Big|.$$
The upper bound $\|f(0, X, X')\|_\epsilon = \|\mathbf{1}\{X = \cdot\}\,r^\pi(X)\|_\epsilon \le \|r^\pi\|_\epsilon =: B_\epsilon$ and the identity
$$f(W, X, X')(x) - f(W', X, X')(x) = \mathbf{1}\{X = x\}\Big(\gamma\big((I-\gamma E_s)^{-1}(W - W')\big)(X') - \gamma\big(E_s(I-\gamma E_s)^{-1}(W - W')\big)(x)\Big)$$
imply that
$$\|f(W, X, X') - f(W', X, X')\|_\epsilon \le \big(\gamma\|(I-\gamma E_s)^{-1}\|_\epsilon + \gamma\|E_s(I-\gamma E_s)^{-1}\|_\epsilon\big)\|W - W'\|_\epsilon =: A_\epsilon\|W - W'\|_\epsilon.$$
As shown in equation 16, we have
$$F(W) = \mathbb{E}_{\mu}[f(W, X, X')] = \frac{1}{n}\big(r^\pi + \gamma(P^\pi - E_s)(I-\gamma E_s)^{-1}W\big) + \Big(1-\frac{1}{n}\Big)W.$$
This shows that
$$\|F(W_1) - F(W_2)\|_\epsilon = \Big\|\Big(\tfrac{1}{n}A_{\mathrm{DDTD}} + \big(1-\tfrac{1}{n}\big)I\Big)(W_1 - W_2)\Big\|_\epsilon \le \Big(\rho\Big(\tfrac{1}{n}A_{\mathrm{DDTD}} + \big(1-\tfrac{1}{n}\big)I\Big) + \epsilon\Big)\|W_1 - W_2\|_\epsilon,$$
and $F$ has fixed point $W^\pi$, since $\frac{1}{n}A_{\mathrm{DDTD}} + (1-\frac{1}{n})I$ and $A_{\mathrm{DDTD}}$ share the same fixed point. Thus, by Proposition B.3 and $\mathbb{E}[\|V^k - V^\pi\|_\epsilon^2] \le \|(I-\gamma E_s)^{-1}\|_\epsilon^2\,\mathbb{E}[\|W^k - W^\pi\|_\epsilon^2]$, we have the following two non-asymptotic convergence results for DDTD, one for a constant stepsize and the other for a diminishing stepsize.

Theorem B.4 (DDTD with constant stepsize). Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $P^\pi$ sorted in decreasing order of magnitude with ties broken arbitrarily, and let $\eta_k = \eta$. For any $\epsilon > 0$, there exist $a_\epsilon, b_\epsilon, c_\epsilon, d_\epsilon > 0$ and a norm $\|\cdot\|_\epsilon$ such that for $\alpha = 1$ and $0 < \eta < d_\epsilon$, DDTD exhibits the rate
$$\mathbb{E}\big[\|V^k - V^\pi\|_\epsilon^2\big] \le a_\epsilon(1 - \eta + \rho\, b_\epsilon\eta)^{k-1} + c_\epsilon\eta \quad\text{for all } k\ge 1,$$
where $\rho = 1 - \frac{1}{n}\big(1 - \gamma\max\{|\lambda_{s+1}|, \ldots, |\lambda_n|\}\big) + \epsilon$.

We note that $a_\epsilon, b_\epsilon, c_\epsilon, d_\epsilon$ of Theorem B.4 can be obtained by plugging $A_1 = A_\epsilon$, $A_2 = 0$, $B_1 = B_\epsilon$, $B_2 = 0$ into Proposition B.3.

Theorem B.5 (DDTD with diminishing stepsize). Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $P^\pi$ sorted in decreasing order of magnitude with ties broken arbitrarily. For any $\epsilon > 0$, there exist $a_\epsilon, b_\epsilon, c_\epsilon, d_\epsilon > 0$ and a norm $\|\cdot\|_\epsilon$ such that for $\alpha = 1$ and $\eta_k(X_k) = \frac{1}{(k + b_\epsilon)(1 - c_\epsilon\rho)}$, DDTD exhibits the rate
$$\mathbb{E}\big[\|V^k - V^\pi\|_\epsilon^2\big] \le \frac{a_\epsilon}{k + b_\epsilon} + \Big(\frac{1}{1 - c_\epsilon\rho}\Big)^2\frac{d_\epsilon}{k + b_\epsilon} \quad\text{for all } k\ge 1,$$
where $\rho = 1 - \frac{1}{n}\big(1 - \gamma\max\{|\lambda_{s+1}|, \ldots, |\lambda_n|\}\big) + \epsilon$.

Again, we note that $a_\epsilon, b_\epsilon, c_\epsilon, d_\epsilon$ of Theorem B.5 can be obtained by plugging $A_1 = A_\epsilon$, $A_2 = 0$, $B_1 = B_\epsilon$, $B_2 = 0$ into Proposition B.3.
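The contraction factor $\rho$ of Theorems B.4 and B.5 (in the $\epsilon\to 0$ limit, and as reconstructed above) depends on the MDP only through the spectrum of $P^\pi$. The following small sketch (ours, purely illustrative) computes it for a random chain to show how it shrinks as the deflation rank $s$ grows:

```python
import numpy as np

def ddtd_rho(P, gamma, s):
    """rho = 1 - (1 - gamma * max_{i > s} |lambda_i|) / n, the epsilon -> 0
    limit of the contraction factor appearing in Theorems B.4 and B.5."""
    n = P.shape[0]
    lam = np.linalg.eigvals(P)
    lam = lam[np.argsort(-np.abs(lam))]      # sort by decreasing magnitude
    return 1 - (1 - gamma * np.max(np.abs(lam[s:]))) / n

rng = np.random.default_rng(0)
P = rng.random((30, 30))
P /= P.sum(axis=1, keepdims=True)
for s in range(4):
    print(s, ddtd_rho(P, gamma=0.99, s=s))   # rho decreases (weakly) with s
```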
C DDVI with QR Iteration, Auto PI and Auto QR

C.1 DDVI with Auto PI and Auto QR

We first introduce DDVI with Auto PI for $0 < \alpha \le 1$. If $W^0 = 0$, we have
$$W^k = \sum_{i=0}^{k-1}\big((1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big)^i\,\alpha r^\pi$$
by equation 4, so
$$W^{k+1} - W^k = \big((1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big)^k\,\alpha r^\pi$$
are the iterates of a power iteration with respect to $(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}$. In Section B.1, we show that
$$(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1} = (1-\alpha)I + \alpha\gamma P^\pi + U_s D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} V_s^\top.$$
For $\alpha\approx 1$, we expect the spectral radius of this matrix to be $|1-\alpha+\gamma\alpha\lambda_{s+1}|$, and $(1-\alpha)I + \alpha\gamma P^\pi + U_s D_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} V_s^\top$ and $P^\pi + U_s D_{s,\frac{(\lambda_i\gamma-1)\alpha\lambda_i}{1-\gamma\alpha\lambda_i}} V_s^\top$ have the same top eigenvector. With the same argument as in Section 4.2, we can then recover the $(s+1)$-th eigenvector of $P^\pi$ by Bru et al. (2012, Proposition 5). Leveraging this observation, we formalize this approach in Algorithm 4.

Algorithm 4 DDVI with Auto PI (Detailed)
1: Initialize $C$, $\epsilon$
2: function DDVI$(s, V, K, \{\lambda_i\}_{i=1}^s, \{u_i\}_{i=1}^s, \alpha_s)$
3:   $V^0 = V$, $c = 0$
4:   $E_s = \sum_{i=1}^s \lambda_i u_i v_i^\top$ as in Fact 2
5:   for $k = 0, \ldots, K-1$ do
6:     if $c \ge C$ and $w^{k+1}$ … then
7:       $\lambda'_{s+1} = (w^k)^\top w^{k+1}/\|w^k\|_2^2$
8:       $\lambda_{s+1} = (\lambda'_{s+1} - 1 + \alpha_s)/(\alpha_s\gamma)$
9:       $u_{s+1} = w^{k+1} - \sum_{i=1}^s \frac{\alpha_s\lambda_i(1-\gamma\lambda_i)}{(1-\alpha_s\gamma\lambda_i)(\lambda_i - \lambda_{s+1})}\,(v_i^\top w^{k+1})\,u_i$
10:      return DDVI$(s+1, V^k, K-c, \{\lambda_i\}_{i=1}^{s+1}, \{u_i\}_{i=1}^{s+1}, \alpha_{s+1})$
11:    else
12:      $W^{k+1} = (1-\alpha_s)V^k + \alpha_s\gamma(P^\pi - E_s)V^k + \alpha_s r^\pi$
13:      $V^{k+1} = (I - \alpha_s\gamma E_s)^{-1}W^{k+1}$
14:      $c = c+1$
15:      $w^{k+1} = W^{k+1} - W^k$
16:      $w^k = W^k - W^{k-1}$
17:  return $V^K$
18: Initialize $V^0$ and $u_1 = \mathbf{1}$, $\lambda_1 = 1$
19: DDVI$(1, V^0, K, \lambda_1, u_1, 1)$
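The estimation step at the heart of Auto PI (lines 7 and 8 of Algorithm 4) is an ordinary power-iteration eigenvalue estimate applied to the residuals $w^k = W^k - W^{k-1}$, followed by inverting the affine spectral map $\lambda \mapsto 1 - \alpha + \alpha\gamma\lambda$. A minimal sketch of this step (our variable names):

```python
import numpy as np

def estimate_next_eigenvalue(w_prev, w_curr, alpha, gamma):
    """Rayleigh-quotient estimate of the dominant eigenvalue of the DDVI
    iteration matrix from successive residuals w^k = W^k - W^{k-1}, mapped
    back to an eigenvalue of P^pi (Algorithm 4, lines 7-8)."""
    mu = w_prev @ w_curr / (w_prev @ w_prev)   # top eigenvalue of the iteration matrix
    return (mu - 1 + alpha) / (alpha * gamma)  # invert lambda -> 1 - alpha + alpha*gamma*lambda
```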
Now, we introduce DDVI with Auto QR for $0 < \alpha \le 1$. Assume the top $s+1$ eigenvalues of $P^\pi$ are distinct, i.e., $1 = \lambda_1 > |\lambda_2| > \cdots > |\lambda_{s+1}|$. Let $E_s = \sum_{i=1}^s \lambda_i u_i u_i^*$ be a rank-$s$ deflation matrix of $P^\pi$ as in Fact 3. If $W^0 = 0$, we again have
$$W^k = \sum_{i=0}^{k-1}\big((1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big)^i\,\alpha r^\pi,$$
so
$$W^{k+1} - W^k = \big((1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}\big)^k\,\alpha r^\pi$$
are the iterates of a power iteration with respect to $(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1}$. In Section B.1, we show that
$$(1-\alpha)(I-\alpha\gamma E_s)^{-1} + \alpha\gamma(P^\pi - E_s)(I-\alpha\gamma E_s)^{-1} = (1-\alpha)I + \alpha\gamma P^\pi + U_s R_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} U_s^*.$$
For $\alpha\approx 1$, we expect the spectral radius of this matrix to be $|1-\alpha+\gamma\alpha\lambda_{s+1}|$, and $(1-\alpha)I + \alpha\gamma P^\pi + U_s R_{s,\frac{(\lambda_i\gamma-1)\gamma\alpha^2\lambda_i}{1-\gamma\alpha\lambda_i}} U_s^*$ and $P^\pi + U_s R_{s,\frac{(\lambda_i\gamma-1)\alpha\lambda_i}{1-\gamma\alpha\lambda_i}} U_s^*$ have the same top eigenvector. With the same argument as in Section 4.2, for large $k$, we expect
$$\frac{W^{k+1} - W^k}{\|W^{k+1} - W^k\|} \approx w,$$
where $w$ is the top eigenvector of $P^\pi + U_s R_{s,\frac{(\lambda_i\gamma-1)\alpha\lambda_i}{1-\gamma\alpha\lambda_i}} U_s^*$. Thus, by orthonormalizing $w$ against $u_1, \ldots, u_s$, we can obtain the $(s+1)$-th Schur vector of $P^\pi$ (Saad, 2011, Section 4.2.4). Leveraging this observation, we propose DDVI with Automatic QR Iteration (Auto QR) and formalize it in Algorithm 5.

Algorithm 5 DDVI with Auto QR
1: Initialize $C$, $\epsilon$
2: function DDVI$(s, V, K, \{\lambda_i\}_{i=1}^s, \{u_i\}_{i=1}^s, \alpha_s)$
3:   $V^0 = V$, $c = 0$
4:   $E_s = \sum_{i=1}^s \lambda_i u_i u_i^*$ as in Fact 3
5:   for $k = 0, \ldots, K-1$ do
6:     if $c \ge C$ and $w^{k+1}$ … then
7:       $\lambda'_{s+1} = (w^k)^\top w^{k+1}/\|w^k\|_2^2$
8:       $\lambda_{s+1} = (\lambda'_{s+1} - 1 + \alpha_s)/(\alpha_s\gamma)$
9:       $u'_{s+1} = w^{k+1} - \sum_{i=1}^s (u_i^* w^{k+1})\,u_i$
10:      $u_{s+1} = \frac{1}{\sqrt{(u'_{s+1})^* u'_{s+1}}}\,u'_{s+1}$
11:      return DDVI$(s+1, V^k, K-c, \{\lambda_i\}_{i=1}^{s+1}, \{u_i\}_{i=1}^{s+1}, \alpha_s)$
12:    else
13:      $W^{k+1} = (1-\alpha_s)V^k + \alpha_s\gamma(P^\pi - E_s)V^k + \alpha_s r^\pi$
14:      $V^{k+1} = (I - \alpha_s\gamma E_s)^{-1}W^{k+1}$
15:      $c = c+1$
16:      $w^{k+1} = W^{k+1} - W^k$
17:      $w^k = W^k - W^{k-1}$
18:  return $V^K$
19: Initialize $V^0$ and $u_1 = \mathbf{1}$, $\lambda_1 = 1$
20: DDVI$(1, V^0, K, \lambda_1, u_1, 1)$

D Environments

We introduce the four environments and the policies used in our experiments.

Cliffwalk  The Cliffwalk is a $3\times 7$ grid world. The top-left corner is the initial state, the top-right corner is the goal terminal state with a reward of 10, and the other states in the first row are terminal states with a reward of $-10$. There is a penalty of 1 in all other states. The agent has four actions: UP(0), RIGHT(1), DOWN(2), and LEFT(3). Each action has a 90% chance to successfully move in the selected direction; with probability 10%, one of the other three directions is selected at random and the agent moves in that direction. If the agent attempts to move out of the boundary, it stays in place. We use the optimal policy for the policy evaluation task in our experiments. The environment is shown in Figure 4.

Figure 4: Cliffwalk environment. The dark shaded region is the goal state with a positive reward. Gray shaded regions are terminal states with a negative reward. Arrows show the optimal policy.

Figure 5: Maze environment. The dark shaded region is the goal state with a positive reward. Arrows show the optimal policy.

Figure 6: Chain Walk environment. The dark shaded region is the goal state with a positive reward, and the gray shaded region is a state with a negative reward. Arrows show the optimal policy.

Maze  The Maze is a $5\times 5$ grid world with 16 walls. The top-left corner is the initial state, and the bottom-left corner is the goal state with a reward of 10. Similar to the Cliffwalk, there is a penalty of 1 in all other states, and the agent has four actions: UP(0), RIGHT(1), DOWN(2), and LEFT(3). Each action has a 90% chance to successfully move in the selected direction; with probability 10%, one of the other three directions is selected at random and the agent moves in that direction. If the agent attempts to move out of the boundary or hits a wall, it stays in place. We use the following policies for the policy evaluation task in our DDVI and DDTD experiments, respectively:
(2, 2, 3, 0, 3, 0, 2, 1, 3, 2, 2, 2, 3, 3, 1, 0, 3, 0, 3, 3, 2, 2, 1, 1, 0),
(2, 2, 3, 0, 3, 0, 2, 1, 3, 2, 2, 2, 3, 3, 1, 0, 3, 0, 3, 3, 3, 3, 1, 1, 0).
Here, the action for each state is listed row-wise. The environment is shown in Figure 5.

Chain Walk  We use the Chain Walk environment, as described by Farahmand & Ghavamzadeh (2021), which is similar to the formulation by Lagoudakis & Parr (2003). It is a circular chain with 50 states, where states 1 and 50 are connected. The reward is state-dependent: state 40 is the goal state with reward 1, state 11 gives reward $-1$, and all other states give reward 0. The agent has two actions: RIGHT(0) and LEFT(1). With each action, the agent has a 70% chance to successfully move in the selected direction, a 10% chance to stay still, and a 20% chance to move in the opposite direction. We use the following policy in our experiments:
(0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1).
The environment is shown in Figure 6.
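To make these dynamics concrete, the following sketch builds the Chain Walk transition matrix $P^\pi$ for a fixed policy from the stated probabilities (our construction, with states indexed from 0; the experimental code may differ in details):

```python
import numpy as np

def chain_walk_P(policy, n=50):
    """Transition matrix P^pi for circular Chain Walk: each action moves in its
    direction w.p. 0.7, stays w.p. 0.1, and moves the opposite way w.p. 0.2.
    Actions: RIGHT(0), LEFT(1); policy[x] is the action taken in state x."""
    P = np.zeros((n, n))
    for x in range(n):
        step = +1 if policy[x] == 0 else -1   # intended direction
        P[x, (x + step) % n] += 0.7           # successful move
        P[x, x] += 0.1                        # stay still
        P[x, (x - step) % n] += 0.2           # move in the opposite direction
    return P

policy = np.zeros(50, dtype=int)              # e.g., always RIGHT (illustrative)
P = chain_walk_P(policy)
assert np.allclose(P.sum(axis=1), 1.0)        # rows are probability distributions
```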
Figure 7: Comparison of rank-1 DDVI for Control and VI in (a, left) Chain Walk and (b, right) Garnet. Normalized errors are shown against the iteration number.

Garnet  We use the Garnet environment as described by Farahmand & Ghavamzadeh (2021) and Rakhsha et al. (2022), which is based on Bhatnagar et al. (2009). Garnet is parameterized by the tuple $(|\mathcal{X}|, |\mathcal{A}|, b_p, b_r)$. Here, $|\mathcal{X}|$ and $|\mathcal{A}|$ are the numbers of states and actions, and $b_p$ is the branching factor of the environment, which is the number of possible next states for each state-action pair. For each state-action pair, we randomly select $b_p$ next states without replacement, and the transition distribution over them is generated randomly. We then select $b_r$ states without replacement, and for each selected state $x$, we assign a state-dependent reward uniformly sampled from $(0, 1)$.

E Rank-1 DDVI for Control Experiments

In this experiment, we consider Chain Walk and Garnet. We use Garnet with 100 states, 8 actions, a branching factor of 6, and 10 non-zero rewards throughout the state space. We use normalized errors $\|V^k - V^\star\|_1/\|V^\star\|_1$, $\gamma = 0.995$, and $E_1 = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, where $n$ is the number of states. For Garnet, we plot the average over 100 Garnet instances and denote one standard error with the shaded area. Figures 7(a) and 7(b) show that rank-1 DDVI for Control does indeed provide an acceleration.

F Additional DDVI Experiments and Details

In all experiments, we set DDVI's $\alpha = 0.99$. We perform further experiments of DDVI with different ranks for discount factors $\gamma = 0.95, 0.995$ (Figures 8 and 9); in these experiments, the QR iteration is run for 100 iterations.

Figure 8: Comparison of DDVI with different ranks, Auto PI, and Auto QR against VI with discount factor $\gamma = 0.95$ in (left) Maze and (right) Chain Walk. The plots for DDVI do not start at iteration 0 because the costs of computing $E_s$ through QR iterations are incorporated as rightward shifts. The rate of DDVI with Auto PI and Auto QR changes when $E_s$ is updated.

Figure 9: Comparison of DDVI with different ranks, Auto PI, and Auto QR against VI with discount factor $\gamma = 0.995$ in (left) Maze and (right) Chain Walk. The plots for DDVI do not start at iteration 0 because the costs of computing $E_s$ through QR iterations are incorporated as rightward shifts. The rate of DDVI with Auto PI and Auto QR changes when $E_s$ is updated.

We perform further comparisons of convergence in Figures 10, 11, 12, and 13. Figure 10 is run on the Garnet environment with 50 states and a branching factor of 40, which matches the setting in Goyal & Grand-Clément (2022). In these experiments, the QR iteration is run for 600 iterations. For PID VI, we set $\eta = 0.05$ and $\epsilon = 10^{-10}$. In Anderson VI, we have $m = 5$.
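A minimal sketch of the Garnet construction described above; where the text leaves a choice open (e.g., how the $b_p$ transition probabilities are drawn), we make an illustrative assumption such as Dirichlet sampling:

```python
import numpy as np

def make_garnet(n_states, n_actions, bp, br, rng):
    """Garnet(|X|, |A|, bp, br): each (x, a) has bp possible next states with a
    randomly generated transition distribution; br states carry a Uniform(0, 1)
    state-dependent reward."""
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for x in range(n_states):
            nxt = rng.choice(n_states, size=bp, replace=False)  # bp next states
            P[a, x, nxt] = rng.dirichlet(np.ones(bp))           # random distribution (assumed)
    r = np.zeros(n_states)
    rewarded = rng.choice(n_states, size=br, replace=False)     # br rewarded states
    r[rewarded] = rng.random(br)                                # rewards in (0, 1)
    return P, r

P, r = make_garnet(50, 4, bp=3, br=5, rng=np.random.default_rng(0))
```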
In Figure 2, we use 20 Garnet MDPs with branching factor $b_p = 2$ and $b_r = 0.1|\mathcal{X}|$.

G DDTD Experiments

A key part of our experimental setup is the model $\hat P$. To show the robustness of DDTD to model error, and also its advantage over Dyna, we consider the scenario in which $\hat P$ has some non-diminishing error. We achieve this with the same technique as Rakhsha et al. (2022). Assume that at each iteration $t$, the empirical distribution of the next state from $(x, a)$ is $P_{\mathrm{MLE}}(\cdot\mid x, a)$, which is updated with every sample. For some hyperparameter $\theta\in[0, 1]$, we set the approximate dynamics $\hat P(\cdot\mid x, a)$ as
$$\hat P(\cdot\mid x, a) = (1-\theta)\,P_{\mathrm{MLE}}(\cdot\mid x, a) + \theta\,U\big(\{x' : P_{\mathrm{MLE}}(x'\mid x, a) > 0\}\big), \qquad (17)$$
where $U(S)$ for $S\subseteq\mathcal{X}$ is the uniform distribution over $S$ (a code sketch of this smoothing is given below). The hyperparameter $\theta$ controls the amount of error introduced in $\hat P$. If $\theta = 0$, we have $\hat P = P_{\mathrm{MLE}}$, which becomes arbitrarily accurate with more samples. With larger $\theta$, $\hat P(\cdot\mid x, a)$ is smoothed more towards the uniform distribution, which leads to a larger model error. In Dyna, we keep the empirical average of past rewards for each $(x, a)$ in $\hat r\colon\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ and perform planning with $\hat P, \hat r$ at each iteration to calculate the value function, that is, $V = (I - \gamma\hat P^\pi)^{-1}\hat r^\pi$.

In Figure 3, we have shown the result for $\theta = 0.3$. In Figures 14 and 15, we show the results for $\theta = 0.1, 0.3, 0.5$ in both the Maze and Chain Walk environments. As we see in Figures 14 and 15, DDTD shows faster convergence than conventional TD. In Maze, this is only achieved with the higher-rank versions of DDTD, but in Chain Walk, even rank-1 DDTD is able to significantly accelerate learning. Also note that, unlike Dyna, DDTD converges to the true value function despite the model error. We observe that the impact of model error on DDTD is very mild; in some cases, such as rank-3 DDTD in Maze between $\theta = 0.3$ and $\theta = 0.5$, we observe slightly slower convergence with higher model error. The hyperparameters of TD Learning and DDTD are given in Tables 1, 2, and 3.

Figure 10: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right). Plots are the average of 20 randomly generated Garnet MDPs with 50 states and a branching factor of 40, with shaded areas showing the standard error.

Figure 11: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right) in Maze.

Figure 12: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right) in Chain Walk.

Figure 13: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right) in Cliffwalk.
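A minimal sketch of the smoothing in equation 17 (vectorized over the support; the implementation details are our own choices):

```python
import numpy as np

def smoothed_model(P_mle, theta):
    """Equation 17: mix the MLE next-state distribution with the uniform
    distribution over its support, introducing non-diminishing model error."""
    support = P_mle > 0
    uniform = support / support.sum(axis=-1, keepdims=True)  # U({x' : P_MLE > 0})
    return (1 - theta) * P_mle + theta * uniform

# Example: a single next-state distribution with support {0, 2}.
p = np.array([0.9, 0.0, 0.1])
print(smoothed_model(p, theta=0.3))   # [0.78, 0.0, 0.22]
```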
Table 1: Hyperparameters for the Maze environment for γ = 0.99. Cells with multiple values give the value of the hyperparameter for θ = 0.1, θ = 0.3, and θ = 0.5, respectively.

                     rank-1 DDTD     rank-2 DDTD     rank-3 DDTD        rank-4 DDTD        TD
η (learning rate)    0.3, 0.3, 0.3   0.3, 0.3, 0.3   0.07, 0.07, 0.07   0.07, 0.07, 0.07   0.3
α                    0.8, 0.8, 0.8   0.7, 0.8, 0.8   0.9, 0.9, 0.9      0.9, 0.9, 0.9      -
K                    10              10              10                 10                 -

Table 2: Hyperparameters for the Maze environment for γ = 0.95. Cells with multiple values give the value of the hyperparameter for θ = 0.1, θ = 0.3, and θ = 0.5, respectively.

                     rank-1 DDTD     rank-2 DDTD     rank-3 DDTD     rank-4 DDTD     TD
η (learning rate)    0.5, 0.5, 0.5   0.5, 0.5, 0.5   0.4, 0.4, 0.4   0.4, 0.4, 0.4   0.5
α                    0.8, 0.8, 0.8   0.8, 0.8, 0.8   0.8, 0.8, 0.8   0.8, 0.8, 0.8   -
K                    10              10              10              10              -

Table 3: Hyperparameters for the Chainwalk environment for both γ = 0.99 and γ = 0.95. Cells with multiple values give the value of the hyperparameter for θ = 0.1, θ = 0.3, and θ = 0.5, respectively.

                     rank-1 DDTD     rank-2 DDTD     rank-3 DDTD     rank-4 DDTD     TD
η (learning rate)    1               1               1               1               1
α                    0.9, 0.9, 0.9   0.9, 0.8, 0.8   0.9, 0.8, 0.8   0.9, 0.8, 0.8   -
K                    10              10              10              10              -

Figure 14: Comparison of DDTD with TD Learning and Dyna in Maze with γ = 0.99 (top) and γ = 0.95 (bottom). Left: low model error with θ = 0.1. Middle: medium model error with θ = 0.3. Right: high model error with θ = 0.5. Each curve is the average of 20 runs. The shaded area shows one standard error.

Figure 15: Comparison of DDTD with TD Learning and Dyna in Chainwalk with γ = 0.99 (top) and γ = 0.95 (bottom). Left: low model error with θ = 0.1. Middle: medium model error with θ = 0.3. Right: high model error with θ = 0.5. Each curve is the average of 20 runs. The shaded area shows one standard error.