# Rank-One Modified Value Iteration

Arman S. Kolarijani¹, Tolga Ok¹, Peyman Mohajerin Esfahani¹,², Mohamad Amin Sharifi Kolarijani¹

¹Delft University of Technology, The Netherlands. ²University of Toronto, Canada. Correspondence to: Arman S. Kolarijani.

**Abstract.** In this paper, we provide a novel algorithm for solving planning and learning problems of Markov decision processes. The proposed algorithm follows a policy-iteration-type update by using a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees for the convergence of the proposed algorithm to the optimal (action-)value function with the same rate and computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Through our extensive numerical simulations, however, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions for both planning and learning problems.

## 1. Introduction

Value iteration (VI) and policy iteration (PI) algorithms lie at the heart of most, if not all, algorithms for optimal control of Markov decision processes (MDPs), in both the planning problem (i.e., with access to the true model of the MDP) and the reinforcement learning problem (i.e., with access to samples of the MDP) (Sutton & Barto, 2018; Bertsekas, 2023). Their widespread application stems from their simple implementation and straightforward combination with function approximation schemes such as neural networks. Both VI and PI are iterative algorithms that ultimately find the fixed point of the Bellman (optimality) operator $T$: for $\gamma$-discounted, finite state-action MDPs, the
*Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).*

value function $v_k$ at iteration $k$ is given by

$$v_{k+1} = v_k + G_k \big(T(v_k) - v_k\big), \quad k = 0, 1, \ldots, \tag{1}$$

where $G_k = I$ in the VI algorithm and $G_k = (I - \gamma P_k)^{-1}$ in the PI algorithm, $I$ is the identity matrix, and $P_k$ is the state transition probability matrix of the MDP under the greedy policy with respect to $v_k$. Both algorithms are guaranteed to converge to the optimal value function and control policy; see, e.g., (Puterman, 2014, Thms. 6.3.3 and 6.4.2). When it comes to computational complexity, one observes a trade-off between the two algorithms: VI has a lower per-iteration complexity compared to PI, while PI converges in a smaller number of iterations compared to VI. The faster convergence of PI is partially explained by its second-order nature, which leads to a local quadratic convergence rate (Puterman & Brumelle, 1979; Bertsekas, 2022; Gargiani et al., 2022), compared to the linear convergence rate of VI (Puterman, 2014, Thm. 6.3.3). This trade-off between the two algorithms has motivated much research to improve the convergence rate of VI and/or the per-iteration complexity of PI. One of the first improvements is the Relaxed VI algorithm (Kushner & Kleinman, 1971; Porteus & Totten, 1978), which allows for a greater step size compared to the standard VI algorithm. A large body of research in this area originates from the correspondence between VI and the gradient descent method and between PI and the Newton method (Grand-Clément, 2021; Kolarijani & Mohajerin Esfahani, 2023). Here, approaches such as accelerated gradient descent and quasi-Newton methods from the optimization literature are adapted to develop modified versions of VI and PI.
For instance, the combination of the VI algorithm with Nesterov acceleration (Nesterov, 1983) and Anderson acceleration (Anderson, 1965) has been explored in (Goyal & Grand-Clément, 2022) and (Zhang et al., 2020), respectively, for solving the planning problem. More recently, Halpern's anchoring acceleration scheme (Halpern, 1967) has been used to introduce the Anchored VI algorithm (Lee & Ryu, 2024), which in particular exhibits an $O(1/k)$ rate for large values of the discount factor and even for $\gamma = 1$. In the case of the learning problem, Speedy Q-Learning (Ghavamzadeh et al., 2011), Momentum Q-Learning (Weng et al., 2021), and Nesterov Stochastic Approximation (Devraj et al., 2019) are among the algorithms that use the idea of momentum to achieve a better rate of convergence compared to standard Q-learning (QL). The Zap Q-Learning algorithm (Devraj & Meyn, 2017) can be thought of as a second-order learning algorithm, inspired by the stochastic Newton-Raphson (SNR) algorithm (Ruppert, 1985). The Quasi-Policy Iteration/Learning algorithms (Kolarijani & Mohajerin Esfahani, 2023) are, on the other hand, an example of using the idea of quasi-Newton methods for developing algorithms for optimal control of MDPs. Another class of modified VI algorithms is the Generalized Second-Order VI algorithm (Kamanchi et al., 2022), which applies the Newton method to a smoothed version of the Bellman operator. Tools and techniques from linear algebra have also been exploited to modify the VI algorithm, particularly for policy evaluation. The Operator Splitting VI algorithm (Rakhsha et al., 2022) is an example that exploits the matrix splitting method for solving the linear equation corresponding to policy evaluation for a given cheap-to-access model of the underlying MDP.
Recently, in (Lee et al., 2024), the authors combined the matrix splitting method with matrix deflation techniques for removing the dominant eigenstructure of the transition probability matrix to speed up policy evaluation.

**Contribution.** In this paper, we propose a novel algorithm that modifies the VI algorithm by incorporating a computationally efficient PI-type update rule. To this end, we consider the update rule (1) with a matrix gain of the form $G_k = (I - \gamma \tilde{P}_k)^{-1}$, where $\tilde{P}_k$ is a rank-one approximation of $P_k$. To be precise, we consider the approximation $\tilde{P}_k = \mathbf{1} d_k^\top$, where $d_k = P_k^\top d_k$ is a stationary distribution of $P_k$ and $\mathbf{1}$ is the all-one vector. The proposed algorithm then uses the power method for approximating $d_k$ iteratively, using the true matrix $P_k$ in the planning problem and its sampled version in the learning problem. In particular, (1) we propose the Rank-one VI (R1-VI) Algorithm 1 as the modified VI algorithm for solving the planning problem and prove its convergence to the optimal value function (Theorem 3.3); (2) we propose the Rank-one QL (R1-QL) Algorithm 2 as the modified QL algorithm for solving the learning problem and prove its convergence to the optimal Q-function (Theorem 4.2); (3) we compare the proposed R1-VI and R1-QL algorithms with state-of-the-art algorithms for solving planning and learning problems of MDPs and show the empirically faster convergence of the proposed algorithms compared to the ones with the same per-iteration computational complexity (i.e., the first-order algorithms and their accelerated versions).

**Paper organization.** In Section 2, we provide the necessary background and the problem definition, along with the standard VI and QL algorithms for solving the planning and learning problems of MDPs. The proposed R1-VI algorithm for solving the planning problem and its analysis are discussed in Section 3. Section 4 presents the R1-QL algorithm for solving the learning problem and its analysis.
In Section 5, we provide the results of our extensive numerical simulations and compare the proposed algorithms with a range of existing algorithms for solving the optimal control problem of MDPs. Finally, some limitations of the proposed algorithms and future research directions are discussed in Section 6. All the technical proofs are provided in Appendix A.

**Notations.** The set of real numbers is denoted by $\mathbb{R}$. For a vector $v \in \mathbb{R}^n$, we use $v(i)$ and $[v](i)$ to denote its $i$-th element. Similarly, $M(i, j)$ and $[M](i, j)$ denote the element in the $i$-th row and $j$-th column of the matrix $M \in \mathbb{R}^{m \times n}$. We use $\langle v, u \rangle = \sum_{i=1}^{n} v(i)\, u(i)$ to denote the inner product of two vectors $v, u \in \mathbb{R}^n$. $\|v\|_1 = \sum_{i=1}^{n} |v(i)|$, $\|v\|_2 = \sqrt{\langle v, v \rangle}$, and $\|v\|_\infty = \max_{i=1,\ldots,n} |v(i)|$ denote the 1-norm, 2-norm, and $\infty$-norm of the vector $v \in \mathbb{R}^n$, respectively. We use $\rho(M)$ to denote the spectral radius (i.e., the largest eigenvalue in absolute value) of a square matrix $M \in \mathbb{R}^{n \times n}$. Given a set $X$, $\Delta(X)$ denotes the set of probability distributions on $X$. Let $x \sim P$ be a random variable with distribution $P \in \Delta(X)$. We use $\hat{x} \sim P$ to denote a sample of the random variable $x$ drawn from the sample space $X$ according to the distribution $P$. We use $\mathbf{1}$, $\mathbf{0}$, and $I$ to denote the all-one vector, the all-zero vector, and the identity matrix, respectively, with their dimensions clear from the context.

## 2. Optimal control of MDPs

Consider a finite MDP $(S, A, P, c, \gamma)$. Here, $S := \{1, 2, \ldots, n\}$ and $A := \{1, 2, \ldots, m\}$ are the state and action spaces, respectively. The transition kernel $P : S \times A \to \Delta(S)$ is the conditional probability $P(s^+ \mid s, a)$ of the transition to state $s^+$ given the current state-action pair $(s, a)$. The function $c \in \mathbb{R}^{|S \times A|} = \mathbb{R}^{nm}$, bounded from below, represents the stage cost $c(s, a)$ of taking the control action $a$ while the system is in state $s$. And $\gamma \in (0, 1)$ is the discount factor, which can be seen as a trade-off parameter between short- and long-term costs. A control policy $\pi : S \to A$ is a mapping from states to actions. Fix a policy $\pi$.
For the corresponding Markov chain under the policy $\pi$, we define: (i) the state transition probability matrix $P^\pi \in \mathbb{R}^{|S| \times |S|} = \mathbb{R}^{n \times n}$, where $P^\pi(s, s^+) = P\big(s^+ \mid s, \pi(s)\big)$ for $s, s^+ \in S$; (ii) the state-action transition probability matrix $\mathcal{P}^\pi \in \mathbb{R}^{|S \times A| \times |S \times A|} = \mathbb{R}^{(nm) \times (nm)}$, where $\mathcal{P}^\pi\big((s, a), (s^+, a^+)\big) = P(s^+ \mid s, a)$ if $a^+ = \pi(s^+)$ and $= 0$ otherwise, for $(s, a), (s^+, a^+) \in S \times A$; and (iii) the stage cost $c^\pi \in \mathbb{R}^{|S|} = \mathbb{R}^n$, where $c^\pi(s) = c\big(s, \pi(s)\big)$ for $s \in S$. Under policy $\pi$, the value function $v^\pi \in \mathbb{R}^{|S|} = \mathbb{R}^n$ is the expected discounted cost endured by following policy $\pi$ over an infinite-horizon trajectory, that is, for each $s \in S$,

$$v^\pi(s) := \mathbb{E}_{s_{t+1} \sim P^\pi(s_t, \cdot)} \left[ \sum_{t=0}^{\infty} \gamma^t c^\pi(s_t) \,\middle|\, s_0 = s \right].$$

The action-value function $q^\pi \in \mathbb{R}^{|S \times A|} = \mathbb{R}^{nm}$ (or the so-called Q-function) under policy $\pi$ is defined as follows: for each $(s, a) \in S \times A$,

$$q^\pi(s, a) := c(s, a) + \gamma\, \mathbb{E}_{s^+ \sim P(\cdot \mid s, a)}\big[v^\pi(s^+)\big].$$

These functions can be shown to satisfy the fixed-point equations (Puterman, 2014, Thm. 6.1.1)

$$v^\pi = c^\pi + \gamma P^\pi v^\pi, \qquad q^\pi = c + \gamma \mathcal{P}^\pi q^\pi. \tag{2}$$

The problem of interest is to control the MDP in a manner that minimizes the expected, discounted, infinite-horizon cost. To do so, one aims to find the optimal policy $\pi^\star$ such that for any policy $\pi$, $v^\star(s) := v^{\pi^\star}(s) \le v^\pi(s)$ for all $s \in S$, or, equivalently, $q^\star(s, a) := q^{\pi^\star}(s, a) \le q^\pi(s, a)$ for all $(s, a) \in S \times A$. The optimal value function can be characterized as the solution to the fixed-point equation $v^\star = T(v^\star)$, where $T : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ is the so-called Bellman (optimality) operator defined as follows: for each $s \in S$,

$$[T(v)](s) := \min_{a \in A} \Big\{ c(s, a) + \gamma\, \mathbb{E}_{s^+ \sim P(\cdot \mid s, a)}\big[v(s^+)\big] \Big\}. \tag{3}$$

The operator $T$ is a $\gamma$-contraction in the $\infty$-norm (i.e., $\|T(v) - T(w)\|_\infty \le \gamma \|v - w\|_\infty$ for all $v, w \in \mathbb{R}^n$) (Puterman, 2014, Prop. 6.2.4). This contraction property is essentially the basis for the VI algorithm, introduced in (4):

```
initialize v_0 ∈ R^n                                 (4)
for k = 0, 1, ...
    v_{k+1}(s) = [T(v_k)](s),  s ∈ S
endfor
```

From the Banach fixed-point theorem (see, e.g., (Puterman, 2014, Thm. 6.2.3)), the VI algorithm converges to $v^\star$ with a linear rate $\gamma$.
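For concreteness, the VI loop (4) fits in a few lines of NumPy. This is a minimal sketch under our own conventions (the kernel stored as a tensor `P[s, a, s']`, stopping on the sup-norm Bellman residual), not code from the paper:

```python
import numpy as np

def value_iteration(P, c, gamma, tol=1e-10, max_iter=10_000):
    """Plain VI (4): iterate v <- T(v) until the Bellman residual is small.
    P: (n, m, n) transition kernel with P[s, a, s2] = P(s2 | s, a);
    c: (n, m) stage cost."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Tv = (c + gamma * P @ v).min(axis=1)   # Bellman optimality operator (3)
        if np.max(np.abs(Tv - v)) < tol:       # sup-norm Bellman residual
            return Tv
        v = Tv
    return v
```

By the contraction property, the residual shrinks by a factor $\gamma$ per sweep, so reaching tolerance `tol` takes on the order of $\log(1/\texttt{tol})/\log(1/\gamma)$ iterations.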
Correspondingly, one can derive the fixed-point characterization $q^\star = T(q^\star)$ of the optimal Q-function, where

$$[T(q)](s, a) := c(s, a) + \gamma\, \mathbb{E}_{s^+ \sim P(\cdot \mid s, a)}\Big[\min_{a^+ \in A} q(s^+, a^+)\Big],$$

for each $(s, a) \in S \times A$ (Szepesvári, 2009, Fact 3). This characterization is particularly useful when one only has access to samples $\hat{s}^+ \sim P(\cdot \mid s, a)$ of the next state $s^+$ for each state-action pair $(s, a) \in S \times A$ (and not to the true transition probability kernel of the MDP). In particular, let us define the empirical Bellman operator $\hat{T} : \mathbb{R}^{|S \times A|} \times S \to \mathbb{R}^{|S \times A|}$ as follows: for each $(s, a, s^+) \in S \times A \times S$,

$$[\hat{T}(q, s^+)](s, a) := c(s, a) + \gamma \min_{a^+ \in A} q(s^+, a^+).$$

A classic algorithm for the learning problem is then the (synchronous) QL algorithm (Watkins & Dayan, 1992; Kearns & Singh, 1998), given in (5):

```
initialize q_0 ∈ R^(nm)                              (5)
for k = 0, 1, ...
    for (s, a) ∈ S × A
        sample ŝ+_k ~ P(. | s, a)
        δ_k = q_k(s, a) − [T̂(q_k, ŝ+_k)](s, a)
        q_{k+1}(s, a) = q_k(s, a) − λ_k δ_k
    endfor
endfor
```

Above, $\lambda_k \ge 0$ are the step-sizes. The QL algorithm is also guaranteed to converge to $q^\star$ almost surely, given that the step-sizes satisfy the Robbins-Monro conditions (i.e., $\sum_{k=0}^{\infty} \lambda_k = \infty$ and $\sum_{k=0}^{\infty} \lambda_k^2 < \infty$) (Tsitsiklis, 1994; Jaakkola et al., 1993). In particular, with a polynomial step-size $\lambda_k = 1/(1+k)^\omega$ with $\omega \in (1/2, 1)$, QL outputs an $\epsilon$-accurate Q-function with high probability after $O\big(\tau^{4/\omega} \epsilon^{-2/\omega} + \tau^{1/(1-\omega)}\big)$ iterations of synchronous sampling with $\tau = \frac{1}{1-\gamma}$ (Even-Dar & Mansour, 2003), while with a re-scaled linear step-size $\lambda_k = 1/(1 + \tau k)$, QL has been shown to require $O(\tau^5 \epsilon^{-2})$ iterations of synchronous sampling for the same performance (Wainwright, 2019).

## 3. Rank-one value iteration (R1-VI)

Another well-known algorithm for solving the planning problem in MDPs is the PI algorithm. In order to provide this algorithm in a compact form, let us first introduce the notion of a greedy policy. Given a value function $v \in \mathbb{R}^n$, the greedy policy with respect to $v$, denoted by $\pi_v : S \to A$, is

$$\pi_v(s) \in \operatorname*{argmin}_{a \in A} \Big\{ c(s, a) + \gamma\, \mathbb{E}_{s^+ \sim P(\cdot \mid s, a)}\big[v(s^+)\big] \Big\}, \quad s \in S. \tag{6}$$
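The greedy map (6) and the policy evaluation equation (2) are exactly the two ingredients PI alternates between. A hedged NumPy sketch (the tensor layout `P[s, a, s']` and the function names are our own choices, not the paper's):

```python
import numpy as np

def greedy_policy(v, P, c, gamma):
    """Greedy policy (6): per state, the action minimizing c(s,a) + γ E[v(s+)].
    P: (n, m, n) kernel, c: (n, m) cost."""
    return (c + gamma * P @ v).argmin(axis=1)

def policy_evaluation(pi, P, c, gamma):
    """Policy evaluation via the fixed-point equation (2): v = c_pi + γ P_pi v."""
    n = P.shape[0]
    P_pi = P[np.arange(n), pi]   # state transition matrix under pi
    c_pi = c[np.arange(n), pi]   # stage cost under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)
```

Solving this $n \times n$ linear system is the $O(n^3)$ step that Section 3 replaces with a rank-one approximation.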
The PI algorithm is then summarized in (7):

```
initialize π_0 : S → A                               (7)
for k = 0, 1, ...
    v_k = v^{π_k}          [policy evaluation, eq. (2)]
    π_{k+1} = π_{v_k}      [policy improvement, eq. (6)]
endfor
```

The algorithm to be proposed in this section is based on an alternative representation of the iterations of the PI algorithm; see also (Puterman, 2014, Prop. 6.5.1).

**Lemma 3.1** (Policy iteration). *Each iteration of the PI algorithm (7) equivalently reads as*

$$v_{k+1} = v_k + (I - \gamma P_k)^{-1}\big(T(v_k) - v_k\big), \tag{8}$$

*where $P_k := P^{\pi_{v_k}}$ is the state transition probability matrix of the MDP under the greedy policy $\pi_{v_k}$.*

The PI algorithm outputs the optimal policy in a finite number of iterations (Puterman, 2014, Thm. 6.4.2). Moreover, the algorithm has a local quadratic rate of convergence when initiated in a small enough neighborhood around the optimal solution (Bertsekas, 2022; Gargiani et al., 2022). The faster convergence of PI compared to VI comes, however, with a higher per-iteration computational complexity: the per-iteration complexities of VI and PI are $O(n^2 m)$ and $O(n^2 m + n^3)$, respectively. The extra $O(n^3)$ complexity is due to the policy evaluation step, i.e., solving a linear system of equations; see also the matrix inversion in the characterization (8). To address this issue, we propose to use a low-rank approximation of $P_k$ instead. Such an approach allows us to approximate $(I - \gamma P_k)^{-1}$ with a reduced computational cost by using the Woodbury formula (Hager, 1989). To be precise, we propose the rank-one VI (R1-VI) algorithm

$$v_{k+1} = v_k + (I - \gamma \tilde{P}_k)^{-1}\big(T(v_k) - v_k\big), \tag{9}$$

where

$$\tilde{P}_k = \mathbf{1} d_k^\top \tag{10}$$

is a rank-one approximation of the true transition probability matrix $P_k$ at iteration $k$, with $d_k := d^{\pi_k} \in \Delta(S)$ being a stationary distribution of the greedy policy $\pi_k = \pi_{v_k}$, i.e., a solution of $d_k^\top P_k = d_k^\top$. Under certain conditions, (10) is indeed the best rank-one approximation of $P_k$:

**Lemma 3.2** (Rank-1 approximation).
*Assume that the transition probability matrix $P_k$ is ergodic (i.e., irreducible and aperiodic). Then,*

$$\mathbf{1} d_k^\top = \operatorname*{argmin}_{P' \in \mathbb{R}^{n \times n}} \; \rho(P' - P_k) \quad \text{s.t.} \quad P' \ge 0, \;\; P' \mathbf{1} = \mathbf{1}, \;\; \operatorname{rank}(P') = 1, \tag{11}$$

*where $d_k$ is the unique stationary distribution of $P_k$.*

That is, $\tilde{P}_k = \mathbf{1} d_k^\top$ is the best rank-one approximation of $P_k$ in terms of the spectral radius. Using the Woodbury formula, we then have

$$(I - \gamma \tilde{P}_k)^{-1} = (I - \gamma \mathbf{1} d_k^\top)^{-1} = I + \frac{\gamma}{1 - \gamma} \mathbf{1} d_k^\top,$$

and hence the R1-VI update (9) reads as

$$v_{k+1} = T(v_k) + \frac{\gamma}{1 - \gamma} \big\langle d_k, T(v_k) - v_k \big\rangle \mathbf{1}. \tag{12}$$

**Algorithm 1** Rank-One Value Iteration (R1-VI)

```
Input:  transition kernel P : S × A → Δ(S); cost function c ∈ R^(|S×A|);
        discount factor γ ∈ (0, 1)
Output: optimal value function v*
 1: initialize v_0 ∈ R^(|S|), d_{-1} ∈ Δ(S)
 2: for k = 0, 1, 2, ... do
 3:   P_k = 0 ∈ R^(|S|×|S|)
 4:   for s ∈ S do
 5:     a_k ∈ argmin_{a∈A} { c(s, a) + γ E_{s+}[v_k(s+)] },
        [T(v_k)](s) = min_{a∈A} { c(s, a) + γ E_{s+}[v_k(s+)] }
 6:     P_k(s, s+) = P(s+ | s, a_k),  s+ ∈ S
 7:   end for
 8:   f = P_k^T d_{k-1};  d_k = f / ||f||_1
 9:   v_{k+1} = T(v_k) + (γ/(1−γ)) ⟨d_k, T(v_k) − v_k⟩ 1
10: end for
```

Next to be addressed is the computation of the vector $d_k$. Considering the fact that $d_k$ is a left eigenvector of $P_k$ corresponding to the eigenvalue 1, we can use the power method (Golub & Van Loan, 2013, Sec. 7.3) to compute it as follows:

$$f = P_k^\top d_k^{(i)}, \qquad d_k^{(i+1)} = \frac{f}{\|f\|_1}, \qquad i = 0, 1, \ldots, I - 1, \tag{13}$$

with some initialization $d_k^{(0)} \in \Delta(S)$ and $I \in \{1, 2, \ldots\}$. We then use $d_k = d_k^{(I)}$ in the update rule (12). We note that the normalization is only to avoid the accumulation of numerical errors; to see this, note that for the row-stochastic matrix $P_k$, we have $P_k^\top d \in \Delta(S)$ for any $d \in \Delta(S)$. Under the assumptions of Lemma 3.2, the preceding iteration converges linearly to the unique stationary distribution with a rate equal to the second largest eigenvalue modulus of $P_k$ (Gallager, 2011, Thm. 3.4.1). The complete description of the proposed R1-VI algorithm is provided in Algorithm 1. We note that Algorithm 1 includes a single iteration (i.e., $I = 1$) of the power method in (13), initialized by $d_k^{(0)} = d_{k-1}$.
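Thanks to the closed form (12), one R1-VI sweep, including the single warm-started power iteration of line 8 of Algorithm 1, is only a few lines of NumPy. The sketch below assumes our own tensor layout `P[s, a, s']` and is illustrative rather than the authors' implementation:

```python
import numpy as np

def r1_vi_step(v, d_prev, P, c, gamma):
    """One R1-VI sweep: Bellman backup, one warm-started power-method step for
    the stationary distribution of the greedy chain, rank-one correction (12).
    P: (n, m, n) kernel, c: (n, m) cost, d_prev: previous distribution estimate."""
    n = len(v)
    q = c + gamma * P @ v                       # Q-values of v, shape (n, m)
    Tv, pi = q.min(axis=1), q.argmin(axis=1)    # T(v) and the greedy policy
    P_k = P[np.arange(n), pi]                   # transition matrix under greedy policy
    f = P_k.T @ d_prev                          # single power iteration (13)
    d = f / np.abs(f).sum()                     # 1-norm normalization (round-off guard)
    return Tv + gamma / (1 - gamma) * (d @ (Tv - v)), d
```

Note that the rank-one correction only adds a multiple of $\mathbf{1}$ on top of the standard Bellman backup, which is why the per-iteration cost stays at $O(n^2 m)$.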
The reason for this choice is that the greedy policy $\pi_{v_k}$, and hence the corresponding transition matrix $P_k$, usually stays the same over multiple iterations $k$ of the algorithm in the value space. This means that the algorithm effectively performs multiple iterations of the power method. Despite using this approximation $d_k$ of the stationary distribution of $P_k$ with a single iteration of the power method, the proposed algorithm can be shown to converge.

**Theorem 3.3** (Convergence of R1-VI). *The iterates $v_k$ of the R1-VI Algorithm 1 converge to the optimal value function $v^\star = T(v^\star)$ with at least the same rate as VI, i.e., with linear rate $\gamma$.*

Let us also note that the per-iteration complexity of the proposed R1-VI Algorithm 1 is $O(n^2 m)$, i.e., the same as that of VI. We finish this section with the following remark.

**Remark 3.4** (Generalization to modified policy iteration). Recall the generic value update rule

$$v_{k+1} = v_k + G_k \big(T(v_k) - v_k\big), \quad k = 0, 1, \ldots, \tag{14}$$

with the gain matrix $G_k = I$ in the VI algorithm and $G_k = (I - \gamma P_k)^{-1}$ in the PI algorithm. Also, note that $(I - \gamma P_k)^{-1} = \sum_{\ell=0}^{\infty} \gamma^\ell P_k^\ell$ since $\rho(\gamma P_k) = \gamma < 1$ (Puterman, 2014, Cor. C.4). Inserting the truncated sum $\sum_{\ell=0}^{L} \gamma^\ell P_k^\ell$, with $L \in \{0, 1, \ldots\}$, in the update rule (14), we derive the Modified PI (MPI) algorithm, which converges linearly with rate $\gamma$ for any choice of $L$ (Puterman, 2014, Thm. 6.5.5). (Observe that $L = 0$ and $L = \infty$ correspond to the standard VI and PI algorithms, respectively.) The proposed rank-one modification can in general be combined with the MPI algorithm. Indeed, we have

$$(I - \gamma P_k)^{-1} = \sum_{\ell=0}^{\infty} \gamma^\ell P_k^\ell = \sum_{\ell=0}^{L-1} \gamma^\ell P_k^\ell + \gamma^L P_k^L \sum_{\ell=0}^{\infty} \gamma^\ell P_k^\ell = \sum_{\ell=0}^{L-1} \gamma^\ell P_k^\ell + \gamma^L P_k^L (I - \gamma P_k)^{-1}.$$

Then, by using the approximation $\tilde{P}_k = \mathbf{1} d_k^\top$ in the matrix inversion on the right-hand side of the equation above (together with the fact that $P_k^L \mathbf{1} = \mathbf{1}$), we derive the gain matrix of the rank-one MPI (R1-MPI) algorithm to be

$$G_k = \sum_{\ell=0}^{L-1} \gamma^\ell P_k^\ell + \gamma^L P_k^L (I - \gamma \tilde{P}_k)^{-1} = \sum_{\ell=0}^{L} \gamma^\ell P_k^\ell + \frac{\gamma^{L+1}}{1 - \gamma} \mathbf{1} d_k^\top.$$

Observe that R1-VI is now a special case of R1-MPI with $L = 0$.
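As a quick sanity check on the algebra of Remark 3.4, the following sketch (the helper name is our own) builds the simplified R1-MPI gain; for $L = 0$ it reduces to the R1-VI gain $I + \frac{\gamma}{1-\gamma}\mathbf{1} d^\top$:

```python
import numpy as np

def r1_mpi_gain(P, d, gamma, L):
    """Simplified R1-MPI gain of Remark 3.4: the first L+1 powers of the
    truncated geometric series plus a rank-one tail; L = 0 gives the R1-VI gain.
    P: (n, n) row-stochastic matrix, d: probability vector."""
    n = P.shape[0]
    G = sum(gamma**l * np.linalg.matrix_power(P, l) for l in range(L + 1))
    return G + gamma**(L + 1) / (1 - gamma) * np.outer(np.ones(n), d)
```

The simplification uses only the row-stochasticity of $P_k$ ($P_k^L \mathbf{1} = \mathbf{1}$), so the two forms of the gain agree for any probability vector $d$, stationary or not.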
## 4. Rank-one Q-learning (R1-QL)

In this section, we focus on the learning problem, in which we have access to a generative model that provides us with samples of the MDP (as opposed to access to the true transition probability kernel of the MDP in the planning problem). To start, let us provide the PI update rule for the Q-function. The proof is similar to the proof of Lemma 3.1 and omitted.

**Lemma 4.1** (Policy iteration for Q-function). *Each iteration of the PI algorithm for the Q-function is given by*

$$q_{k+1} = q_k + (I - \gamma \mathcal{P}_k)^{-1}\big(T(q_k) - q_k\big), \tag{15}$$

*where $\mathcal{P}_k := \mathcal{P}^{\pi_{q_k}}$ is the state-action transition probability matrix of the MDP under the greedy policy $\pi_{q_k}(s) \in \operatorname{argmin}_{a \in A} q_k(s, a)$ for $s \in S$.*

The idea is again to use the rank-one approximation $\tilde{\mathcal{P}}_k = \mathbf{1} d_k^\top$ of the matrix $\mathcal{P}_k$ in the update rule (15), where $d_k$ is now the stationary distribution of $\mathcal{P}_k$. This leads to the update rule

$$q_{k+1} = T(q_k) + \frac{\gamma}{1 - \gamma} \big\langle d_k, T(q_k) - q_k \big\rangle \mathbf{1}$$

at each iteration of the planning problem. Then, for the learning problem, considering the synchronous update of all state-action pairs $(s, a) \in S \times A$ at each iteration $k$, we arrive at the rank-one Q-learning (R1-QL) update rule

$$q_{k+1} = (1 - \lambda_k) q_k + \lambda_k \hat{T}_k(q_k) + \alpha_k \mathbf{1}, \qquad \alpha_k = \frac{\gamma \lambda_k}{1 - \gamma} \big\langle \hat{d}_k, \hat{T}_k(q_k) - q_k \big\rangle, \tag{16}$$

where $\lambda_k \ge 0$ are properly chosen step-sizes satisfying the Robbins-Monro conditions (e.g., $\lambda_k = 1/(k+1)$), and $\hat{T}_k$ is the empirical Bellman operator evaluated at iteration $k$, i.e., $[\hat{T}_k(q_k)](s, a) := [\hat{T}(q_k, \hat{s}^+_k)](s, a)$ with $\hat{s}^+_k \sim P(\cdot \mid s, a)$ for each $(s, a) \in S \times A$; the subscript $k$ in $\hat{T}_k$ denotes the dependence on the next-state sample $\hat{s}^+_k$ generated at iteration $k$. What remains to be addressed is computing the estimate $\hat{d}_k$ of the stationary distribution in (16) using the samples. At each iteration $k$, define the sparse matrix $F_k \in \mathbb{R}^{|S \times A| \times |S \times A|} = \mathbb{R}^{(nm) \times (nm)}$ with exactly one nonzero entry equal to 1 in each row $(s, a) \in S \times A$, corresponding to the column $(s^+, a^+)$, where

$$s^+ = \hat{s}^+_k \sim P(\cdot \mid s, a), \qquad a^+ = \hat{a}^+_k \in \operatorname*{argmin}_{a^+ \in A} q_k(\hat{s}^+_k, a^+).$$
Observe that the matrix $F_k$ is a sampled version of the state-action transition probability matrix $\mathcal{P}_k$. Using this sample, we can form the stochastic approximation

$$\hat{P}_k = (1 - \lambda_k) \hat{P}_{k-1} + \lambda_k F_k$$

of the state-action transition probability matrix. We note that the same approximation is used in the Zap Q-learning algorithm (Devraj & Meyn, 2017). With this approximation in hand, we can again use the power method for finding the stationary distribution. In particular, with a single iteration of the power method initialized by the previous stationary-distribution estimate $\hat{d}_{k-1}$, we have

$$\hat{d}_k = \hat{P}_k^\top \hat{d}_{k-1} = (1 - \lambda_k) \hat{P}_{k-1}^\top \hat{d}_{k-1} + \lambda_k F_k^\top \hat{d}_{k-1}.$$

**Algorithm 2** Rank-One Q-Learning (R1-QL)

```
Input:  samples from transition kernel P : S × A → Δ(S); cost function
        c ∈ R^(|S×A|); discount factor γ ∈ (0, 1)
Output: optimal Q-function q*
 1: initialize q_0 ∈ R^(|S×A|), d̂_{-1} ∈ Δ(S × A)
 2: for k = 0, 1, 2, ... do
 3:   λ_k = 1/(k + 1)
 4:   f = 0 ∈ R^(|S×A|)
 5:   for (s, a) ∈ S × A do
 6:     sample ŝ+_k ~ P(. | s, a)
 7:     â+_k ∈ argmin_{a+∈A} q_k(ŝ+_k, a+),
        [T̂_k(q_k)](s, a) = c(s, a) + γ min_{a+∈A} q_k(ŝ+_k, a+)
 8:     f(ŝ+_k, â+_k) = f(ŝ+_k, â+_k) + d̂_{k-1}(s, a)
 9:   end for
10:   d̂_k = (1 − λ_k) d̂_{k-1} + λ_k f;  d̂_k = d̂_k / ||d̂_k||_1
11:   α_k = (γ λ_k / (1 − γ)) ⟨d̂_k, T̂_k(q_k) − q_k⟩
12:   q_{k+1} = (1 − λ_k) q_k + λ_k T̂_k(q_k) + α_k 1
13: end for
```

Now, using the approximation $\hat{d}_{k-1} \approx \hat{P}_{k-1}^\top \hat{d}_{k-1}$ (i.e., assuming $\hat{d}_{k-1}$ is a stationary distribution of $\hat{P}_{k-1}$, which does not hold exactly since $\hat{d}_{k-1}$ is only an approximation of a stationary distribution of $\hat{P}_{k-1}$), we derive

$$\hat{d}_k = (1 - \lambda_k) \hat{d}_{k-1} + \lambda_k F_k^\top \hat{d}_{k-1}.$$
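Since each row of $F_k$ is one-hot, the product $F_k^\top \hat{d}_{k-1}$ reduces to scatter-adds, so the update above never needs $F_k$ or $\hat{P}_k$ in memory. A sketch under assumed shapes ($q$ and $\hat{d}$ stored as `(n, m)` arrays, `samples[s, a]` the sampled next state; names are ours):

```python
import numpy as np

def update_d_hat(d_prev, q, samples, lam):
    """Sampled power-method step d <- (1 - lam) d + lam F^T d, computed by
    scatter-adds without ever forming F. q: (n, m) Q-function; samples[s, a]
    is the sampled next state for the pair (s, a); d_prev: (n, m) estimate."""
    n, m = q.shape
    f = np.zeros((n, m))
    for s in range(n):
        for a in range(m):
            s_next = samples[s, a]
            a_next = q[s_next].argmin()         # greedy action at sampled s+
            f[s_next, a_next] += d_prev[s, a]   # accumulate (F^T d)(s+, a+)
    d = (1 - lam) * d_prev + lam * f
    return d / np.abs(d).sum()                  # guard against round-off drift
```

The normalization mirrors line 10 of Algorithm 2 and is a no-op in exact arithmetic, since $f$ inherits the unit mass of $\hat{d}_{k-1}$.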
We note that the vector $f = F_k^\top \hat{d}_{k-1}$ can be computed using the following pseudo-code:

```
initialize f = 0 ∈ R^(|S×A|)
for (s, a) ∈ S × A
    sample ŝ+_k ~ P(. | s, a);  â+_k ∈ argmin_{a+∈A} q_k(ŝ+_k, a+)
    f(ŝ+_k, â+_k) = f(ŝ+_k, â+_k) + d̂_{k-1}(s, a)
endfor
```

Observe that the approximation $\hat{d}_{k-1} \approx \hat{P}_{k-1}^\top \hat{d}_{k-1}$ significantly reduces the memory and time complexity of the algorithm, since we do not need to keep track of the estimates $\hat{P}_k$ of the state-action transition probability matrix and perform full matrix-vector multiplications for updating the estimates $\hat{d}_k$ of the stationary distribution. The complete description of the proposed R1-QL algorithm is provided in Algorithm 2. We again note that the normalization is only introduced to avoid the accumulation of numerical errors. The following result discusses the convergence of the proposed algorithm.

**Theorem 4.2** (Convergence of R1-QL). *The iterates $q_k$ of the R1-QL Algorithm 2 converge to the optimal Q-function $q^\star = T(q^\star)$ almost surely, with at least the same rate as QL.*

Finally, we note that the per-iteration time complexity of the R1-QL Algorithm 2 is the same as that of the synchronous QL algorithm, i.e., $O(nm^2)$.

## 5. Numerical simulations

In this section, we compare the performance of several planning and learning algorithms with our proposed methods. The experiments are conducted on Garnet (Archibald et al., 1995) and Graph MDPs (Devraj & Meyn, 2017), focusing on the Bellman errors $\|T(v_k) - v_k\|$ and $\|T(q_k) - q_k\|$ and the value errors $\|v_k - v^\star\|$ and $\|q_k - q^\star\|$. In all of our experiments, we run policy iteration (PI) until it converges, to calculate the optimal values $v^\star$ and $q^\star$ for reference. Garnet and Graph MDPs are particularly compelling for our empirical analysis, as we can run PI to compute the optimal values, enabling us to measure value errors throughout the iterations. The Garnet MDPs have a state size of $n = 200$, an action size of $m = 5$, and randomly generated transition probabilities and costs with a branching factor of 10.
For our numerical experiments, we consider 25 randomly generated instances of Garnet MDPs and report the three quantiles of the errors. For the Graph MDPs, we use the same configuration as described in (Devraj & Meyn, 2017), providing a complementary benchmark to validate the effectiveness of our proposed method. In what follows, we report the results of our simulations for the planning and learning problems.

**Planning algorithms.** We compare several value iteration (VI) algorithms with the same per-iteration time complexity as our proposed R1-VI Algorithm 1. We also include PI for reference. We mainly focus on comparing R1-VI with accelerated VI methods, namely, Nesterov-VI (Goyal & Grand-Clément, 2022) and Anderson-VI (Geist & Scherrer, 2018). In order to keep the time complexity the same, we use Anderson-VI with the memory parameter equal to one, leading to a rank-one approximation of the Hessian matrix. The update rules of the accelerated VI algorithms are provided in Appendix B.1. In Figure 1, we report the median number of iterations required to reach a certain error threshold for each algorithm across both MDPs for four different values of the discount factor $\gamma$. Our results indicate that the convergence performance of R1-VI is comparable with PI while maintaining the same per-iteration time complexity as VI. Additionally, R1-VI significantly outperforms the VI algorithm and its accelerated versions, particularly when the discount factor is close to 1. This observation can be partially explained by the fact that the proposed R1-VI algorithm forms an approximation of the inverse of the "Hessian", i.e., $(I - \gamma P_k)^{-1}$, by incorporating its largest eigenvalue $\frac{1}{1 - \gamma}$. A more detailed comparison of the algorithms is provided in Appendix B.2.
*Figure 1: Planning algorithms. The median number of iterations required for each algorithm (PI, R1-VI, Anderson-VI, Nesterov-VI, VI) to reach a fixed error threshold across four discount factors $\gamma \in \{0.9, 0.95, 0.99, 0.999\}$ for the two MDPs; panels: Graph MDP value error < 1e-02, Graph MDP Bellman error < 1e-04, Garnet MDP value error < 1e-02, Garnet MDP Bellman error < 1e-04.*

*Figure 2: Learning algorithms. The median error values (value and Bellman errors) achieved by each learning algorithm (Zap QL, R1-QL, Speedy QL, QL) over the course of 5000 iterations across four discount factors $\gamma \in \{0.9, 0.95, 0.99, 0.999\}$ for the two MDPs.*

**Learning algorithms.** In the learning experiments, we report the Bellman and value errors of the algorithms trained in a synchronous fashion, following the methodology outlined in (Ghavamzadeh et al., 2011), ensuring a consistent evaluation. In synchronous learning, at each iteration $k$, a sample $\hat{s}^+ \sim P(\cdot \mid s, a)$ of the next state is generated for each state-action pair $(s, a) \in S \times A$, and the action-value function $q_k$ is updated at all state-action pairs; see the update rule in the QL algorithm (5) and the R1-QL Algorithm 2. All the learning algorithms use the same samples generated through the training. Besides the proposed R1-QL Algorithm 2, we report the performance of Speedy QL (Ghavamzadeh et al., 2011), Zap QL (Devraj & Meyn, 2017), and the standard QL (5) on Garnet and Graph MDPs for several discount factors; see Appendix B.1 for the update rules of Speedy QL and Zap QL. We run each algorithm using the same step-size schedule $(\lambda_k)_{k=0}^{\infty}$, namely, the linearly decaying $\lambda_k = 1/(1+k)$, to ensure a fair comparison. Figure 2 shows the final error values after running each algorithm for 5000 iterations. R1-QL achieves comparable or lower error values across both MDPs.
In contrast to the other algorithms, R1-QL consistently maintains a similar level of error values across various discount factors, particularly at higher discount factors, a characteristic attributed to policy-iteration-type algorithms. It is worth mentioning that since both Zap QL and R1-QL estimate $(I - \gamma \mathcal{P}_k)^{-1}$ using the samples, they behave more robustly against the increase in the discount factor $\gamma$; see Figure 2. Regarding per-iteration complexity, R1-QL has the same time and memory complexity as QL and Speedy QL. In contrast, Zap QL, due to the inherent full-rank matrix inversion, incurs higher time and memory complexity. Although Zap QL can be efficiently implemented with lower complexity, it typically exhibits a higher computational cost in practice. A comprehensive analysis of the error trajectories observed during training is provided in Appendix B.2.

## 6. Limitations and future research directions

We finish the paper by discussing some limitations of the proposed rank-one modification of the VI and QL algorithms, along with some future research directions. Let us start by noting that the theoretical results provided in Theorems 3.3 and 4.2 guarantee the convergence of the proposed algorithms with the same rate as the standard VI and QL algorithms. However, our numerical experiments with Garnet and Graph MDPs in Section 5 show that the proposed algorithms have a faster convergence rate compared to standard VI and QL and their accelerated versions. This gap can be explained by the fact that our proof technique does not exploit that the vectors $d_k$ and $\hat{d}_k$ used in the update rule are specifically constructed to approximate the stationary distribution of the Markov chain induced by the greedy policy (see Appendices A.3 and A.4 for details). That is, the convergence of the R1-VI and R1-QL algorithms is guaranteed for any choice of $d_k \in \Delta(S)$ and $\hat{d}_k \in \Delta(S \times A)$.
In fact, when the stationary distribution concentrates on a single state, as happens when there is an absorbing state with zero reward, the second term in the R1-VI update rule (12) vanishes. Appendix B.3 provides an empirical analysis in Gridworld (Sutton & Barto, 2018), which includes an absorbing state and thus violates the assumption of Lemma 3.2. Moreover, the provided proof of convergence shows that the greedy policies generated with respect to the iterates of R1-VI and R1-QL are the same as those for the standard VI and QL algorithms, respectively (see Lemmas A.2 and A.4). In other words, the proposed algorithms do not affect the speed of convergence to the optimal policy compared to VI and QL. Nevertheless, at least in the case of R1-VI, the faster convergence in the value space leads to a faster termination of the algorithm for a given performance bound for the greedy policy. In this regard, let us also note that the mismatch between convergence in value space and policy space also arises in other accelerated VI/QL algorithms; see Appendix B.4. Second, the proposed algorithms heavily depend on the structure of the transition probability matrices $P_k$ and $\mathcal{P}_k$ and their rank-one approximation using the corresponding stationary distributions. This dependence particularly hinders the application of the proposed algorithms to generic function approximation setups in solving the optimal control problem of MDPs with continuous state-action spaces. We note that a similar issue for the Zap Q-learning algorithm (Devraj & Meyn, 2017) has been successfully addressed in (Chen et al., 2020). Third, we note that the proposed R1-QL Algorithm 2 is a synchronous algorithm that updates all state-action pairs $(s, a) \in S \times A$ of the Q-function $q_k$ at each iteration $k$. This algorithm can be modified in a standard fashion for the asynchronous case. However, the provided convergence analysis cannot be extended to the corresponding asynchronous algorithm in a straightforward manner.
Moreover, the straightforward asynchronous implementation of R1-QL leads to an $O(nm)$ per-iteration complexity for updating a single component $(s,a) \in \mathcal{S} \times \mathcal{A}$ at each iteration $k$, which is higher than the $O(m)$ per-iteration complexity of the standard asynchronous QL algorithm. We note that the Zap Q-learning algorithm (Devraj & Meyn, 2017) also suffers from this issue. Addressing these issues requires a more involved analysis and modification of the proposed algorithm in the asynchronous case, which we leave for future research. Finally, the basic idea of the proposed algorithms can also be used for developing rank-one modified versions of existing algorithms for the average-cost setting. For example, consider the PI algorithm that uses the relative VI algorithm in the policy evaluation step for unichains (Puterman, 2014, Sec. 8.6.1). This algorithm can be characterized via the following update rule in the value space: For a fixed $s \in \mathcal{S}$,

$$v_{k+1} = v_k + \big[(I - P_k)(I - e_s e_s^\top) + \mathbf{1} e_s^\top\big]^{-1}\big(T(v_k) - v_k\big), \qquad v_{k+1}(s) = 0,$$

where $e_s$ is the $s$-th unit vector and $T$ is now the undiscounted Bellman operator. Now, observe that

$$G_k^{-1} = (I - P_k)(I - e_s e_s^\top) + \mathbf{1} e_s^\top = I - P_k + (p_k - e_s + \mathbf{1})\, e_s^\top \approx I - \mathbf{1} d_k^\top + (p_k - e_s + \mathbf{1})\, e_s^\top,$$

where $p_k = P_k(\cdot, s)$ is the $s$-th column of $P_k$ and we used the approximation $P_k \approx \mathbf{1} d_k^\top$ in the last step. The matrix inversion can then be handled efficiently using the Woodbury formula. However, the convergence of this algorithm and any possible improvement in the convergence rate when $d_k$ is approximated via the power method require further investigation.

Acknowledgements

The authors would like to thank the reviewers for their useful comments. This work was partially supported by the Horizon Europe Pathfinder Open project RELIEVE-101099481 and by the European Research Council (ERC) project TRUST-949796.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning.
There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Anderson, D. G. Iterative Procedures for Nonlinear Integral Equations. Journal of the ACM (JACM), 12(4):547–560, 1965.

Archibald, T., McKinnon, K., and Thomas, L. On the Generation of Markov Decision Processes. Journal of the Operational Research Society, 46(3):354–361, 1995.

Bertsekas, D. Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific, 2022.

Bertsekas, D. A Course in Reinforcement Learning. Athena Scientific, 2023.

Chen, S., Devraj, A. M., Lu, F., Bušić, A., and Meyn, S. Zap Q-Learning with Nonlinear Function Approximation. In Advances in Neural Information Processing Systems, volume 33, pp. 16879–16890, 2020.

Devraj, A. M. and Meyn, S. Zap Q-Learning. In Advances in Neural Information Processing Systems, volume 30, 2017.

Devraj, A. M., Bušić, A., and Meyn, S. On Matrix Momentum Stochastic Approximation and Applications to Q-Learning. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 749–756, 2019.

Even-Dar, E. and Mansour, Y. Learning Rates for Q-Learning. Journal of Machine Learning Research, 5(1):1–25, 2003.

Gallager, R. G. Discrete Stochastic Processes. 2011.

Gargiani, M., Zanelli, A., Liao-McPherson, D., Summers, T., and Lygeros, J. Dynamic Programming Through the Lens of Semismooth Newton-Type Methods. IEEE Control Systems Letters, 6:2996–3001, 2022.

Geist, M. and Scherrer, B. Anderson Acceleration for Reinforcement Learning. arXiv preprint arXiv:1809.09501, 2018.

Ghavamzadeh, M., Kappen, H., Azar, M., and Munos, R. Speedy Q-Learning. In Advances in Neural Information Processing Systems, volume 24, 2011.

Golub, G. H. and Van Loan, C. F. Matrix Computations. JHU Press, 2013.

Goyal, V. and Grand-Clément, J. A First-Order Approach to Accelerated Value Iteration. Operations Research, 71(2):517–535, 2022.
Grand-Clément, J. From Convex Optimization to MDPs: A Review of First-Order, Second-Order and Quasi-Newton Methods for MDPs. arXiv preprint arXiv:2104.10677, 2021.

Hager, W. W. Updating the Inverse of a Matrix. SIAM Review, 31(2):221–239, 1989.

Halpern, B. Fixed Points of Nonexpanding Maps. Bulletin of the American Mathematical Society, 73(6):957–961, 1967.

Jaakkola, T., Jordan, M., and Singh, S. Convergence of Stochastic Iterative Dynamic Programming Algorithms. In Advances in Neural Information Processing Systems, volume 6, 1993.

Kamanchi, C., Diddigi, R. B., and Bhatnagar, S. Generalized Second-Order Value Iteration in Markov Decision Processes. IEEE Transactions on Automatic Control, 67(8):4241–4247, 2022.

Kearns, M. and Singh, S. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms. In Advances in Neural Information Processing Systems, volume 11, 1998.

Kolarijani, M. A. S. and Mohajerin Esfahani, P. From Optimization to Control: Quasi Policy Iteration. arXiv preprint arXiv:2311.11166, 2023.

Kushner, H. and Kleinman, A. Accelerated Procedures for the Solution of Discrete Markov Control Problems. IEEE Transactions on Automatic Control, 16(2):147–152, 1971.

Lee, J. and Ryu, E. Accelerating Value Iteration with Anchoring. In Advances in Neural Information Processing Systems, volume 36, 2024.

Lee, J., Rakhsha, A., Ryu, E. K., and Farahmand, A.-M. Deflated Dynamics Value Iteration. arXiv preprint arXiv:2407.10454, 2024.

Nesterov, Y. E. A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k²). Doklady Akademii Nauk SSSR, 269(3):543–547, 1983.

Porteus, E. L. and Totten, J. C. Accelerated Computation of the Expected Discounted Return in a Markov Chain. Operations Research, 26(2):350–358, 1978.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Puterman, M. L. and Brumelle, S. L. On the Convergence of Policy Iteration in Stationary Dynamic Programming.
Mathematics of Operations Research, 4(1):60–69, 1979.

Rakhsha, A., Wang, A., Ghavamzadeh, M., and Farahmand, A.-M. Operator Splitting Value Iteration. In Advances in Neural Information Processing Systems, volume 35, pp. 38373–38385, 2022.

Ruppert, D. A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure. The Annals of Statistics, 13(1):236–245, 1985.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Szepesvári, C. Algorithms for Reinforcement Learning. 2009.

Tsitsiklis, J. N. Asynchronous Stochastic Approximation and Q-Learning. Machine Learning, 16:185–202, 1994.

Wainwright, M. J. Stochastic Approximation with Cone-Contractive Operators: Sharp ℓ∞-Bounds for Q-Learning. arXiv preprint arXiv:1905.06265, 2019.

Watkins, C. J. and Dayan, P. Q-Learning. Machine Learning, 8(3):279–292, 1992.

Weng, B., Xiong, H., Zhao, L., Liang, Y., and Zhang, W. Finite-Time Theory for Momentum Q-Learning. In Uncertainty in Artificial Intelligence (UAI), pp. 665–674, 2021.

Zhang, J., O'Donoghue, B., and Boyd, S. Globally Convergent Type-I Anderson Acceleration for Nonsmooth Fixed-Point Iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.

A. Technical Proofs

A.1. Proof of Lemma 3.1

We begin by providing two basic results on MDPs. First, recall that given a policy $\pi$, the value $v^\pi$ of $\pi$ solves the (Bellman consistency) equation (Puterman, 2014, Thm. 6.1.1)

$$v^\pi(s) = c^\pi(s) + \gamma\, \mathbb{E}_{s^+ \sim P^\pi(s,\cdot)}\big[v^\pi(s^+)\big], \quad s \in \mathcal{S}.$$

Hence, we have

$$v^\pi = (I - \gamma P^\pi)^{-1} c^\pi. \tag{17}$$

Moreover, the definitions of the Bellman operator in (3) and of the greedy policy in (6) imply that

$$T(v) = c^{\pi_v} + \gamma P^{\pi_v} v, \quad \forall v \in \mathbb{R}^n. \tag{18}$$

Recall $P_k := P^{\pi_{v_k}}$. Using these two results, we have at each iteration of the PI algorithm (7),

$$v_{k+1} = v^{\pi_{k+1}} = v^{\pi_{v_k}} \overset{(17)}{=} (I - \gamma P_k)^{-1} c^{\pi_{v_k}} \overset{(18)}{=} (I - \gamma P_k)^{-1} \big(T(v_k) - \gamma P_k v_k\big) = (I - \gamma P_k)^{-1} \big(T(v_k) - v_k + (I - \gamma P_k) v_k\big) = v_k + (I - \gamma P_k)^{-1} \big(T(v_k) - v_k\big).$$

This concludes the proof.

A.2.
Proof of Lemma 3.2

Let $(\rho_i)_{i=1}^n$ be the eigenvalues of $P_k$ such that $|\rho_1| \ge |\rho_2| \ge \dots \ge |\rho_n|$. Since $P_k$ is a row-stochastic matrix, we have $\rho_1 = 1$ (Gallager, 2011, Thm. 3.4.1). Moreover, the assumption that $P_k$ is irreducible and aperiodic implies that $|\rho_i| < 1$ for all $i \neq 1$ (Gallager, 2011, Thm. 3.4.1); that is, $\rho_1 = 1$ is the unique eigenvalue of $P_k$ on the unit circle in the complex plane and all other eigenvalues lie inside the unit disc. The unique (up to scaling) right and left eigenvectors corresponding to $\rho_1 = 1$ are the all-one vector $\mathbf{1}$ and the stationary distribution $d_k$, respectively. From these results it follows that $\widetilde{P}_k = \mathbf{1} d_k^\top$ is the unique solution of (11).

A.3. Proof of Theorem 3.3

For each iteration $k \ge 0$ of the R1-VI Algorithm 1, we have

$$v_{k+1} = T(v_k) + \alpha_k \mathbf{1}, \qquad \alpha_k = \frac{\gamma}{1-\gamma}\, \langle d_k,\, T(v_k) - v_k \rangle, \tag{19}$$

where $d_k \in \Delta(\mathcal{S})$ is an approximation of the stationary distribution of the MDP under the greedy policy with respect to $v_k$. We start by analyzing the effect of a constant shift in the argument of the Bellman operator.

Lemma A.1. For all $\alpha \in \mathbb{R}$ and $v \in \mathbb{R}^n$, we have $T(v + \alpha\mathbf{1}) = T(v) + \gamma\alpha\mathbf{1}$.

Proof. For each $s \in \mathcal{S}$, we have

$$[T(v + \alpha\mathbf{1})](s) = \min_{a \in \mathcal{A}} \Big\{ c(s,a) + \gamma \sum_{s^+ \in \mathcal{S}} P(s^+|s,a)\, [v + \alpha\mathbf{1}](s^+) \Big\} = \gamma\alpha + \min_{a \in \mathcal{A}} \Big\{ c(s,a) + \gamma \sum_{s^+ \in \mathcal{S}} P(s^+|s,a)\, v(s^+) \Big\} = \gamma\alpha + [T(v)](s).$$

We next use the preceding result to provide an alternative characterization of the iterates of R1-VI in (19).

Lemma A.2. For each $k \ge 0$, the iterates in (19) are equivalently given by

$$v_{k+1} = T^{(k+1)}(v_0) + \beta_{k+1}\mathbf{1}, \qquad \beta_{k+1} = \gamma\beta_k + \alpha_k, \quad \beta_0 = 0.$$

Proof. (Proof by induction.) Consider $k = 0$ and observe that $v_1 = T(v_0) + \beta_1\mathbf{1}$, with $\beta_1 = \gamma\beta_0 + \alpha_0 = \alpha_0$. Next, for some $k \ge 1$ and $\beta_k \in \mathbb{R}$, assume $v_k = T^{(k)}(v_0) + \beta_k\mathbf{1}$. Then,

$$v_{k+1} = T(v_k) + \alpha_k\mathbf{1} = T\big(T^{(k)}(v_0) + \beta_k\mathbf{1}\big) + \alpha_k\mathbf{1} = T^{(k+1)}(v_0) + \gamma\beta_k\mathbf{1} + \alpha_k\mathbf{1},$$

where the last equality follows from Lemma A.1. Therefore, $v_{k+1} = T^{(k+1)}(v_0) + (\gamma\beta_k + \alpha_k)\mathbf{1} = T^{(k+1)}(v_0) + \beta_{k+1}\mathbf{1}$, which concludes the proof.
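As a sanity check, the shift property of Lemma A.1 can be verified numerically on a small random MDP. The sketch below (with hypothetical names) builds a random cost/transition model and checks that $T(v + \alpha\mathbf{1}) = T(v) + \gamma\alpha\mathbf{1}$ holds to machine precision:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 6, 3, 0.9        # states, actions, discount factor

# Random MDP: costs c[s, a] and transition kernel P[a][s, s+].
c = rng.random((n, m))
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)   # make each P[a] row-stochastic

def bellman(v):
    """Bellman optimality operator: T(v)(s) = min_a { c(s,a) + gamma * E[v(s+)] }."""
    q = c + gamma * np.einsum('ast,t->sa', P, v)
    return q.min(axis=1)

v = rng.standard_normal(n)
alpha = 1.7

lhs = bellman(v + alpha * np.ones(n))
rhs = bellman(v) + gamma * alpha * np.ones(n)
shift_error = np.max(np.abs(lhs - rhs))   # should be ~0 (Lemma A.1)
```

The shift passes through the minimum unchanged because it adds the same constant $\gamma\alpha$ to every action's value, which is exactly the mechanism exploited in Lemma A.2.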
We now employ the preceding lemmas to provide an alternative characterization of the constant shifts $\alpha_k$ in the R1-VI updates in (19).

Lemma A.3. For each $k \ge 0$, one has

$$\alpha_k = \frac{\gamma}{1-\gamma} \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle - \gamma\beta_k.$$

Proof. From Lemma A.2, we have $v_k = T^{(k)}(v_0) + \beta_k\mathbf{1}$ for $k = 0, 1, \ldots$ (Notice that for $k = 0$, the preceding equation simply reads $v_0 = v_0$, since $T^{(0)}$ is the identity operator and $\beta_0 = 0$.) Then, from Lemma A.1, it follows that

$$T(v_k) = T\big(T^{(k)}(v_0) + \beta_k\mathbf{1}\big) = T^{(k+1)}(v_0) + \gamma\beta_k\mathbf{1}.$$

Thus,

$$T(v_k) - v_k = T^{(k+1)}(v_0) - T^{(k)}(v_0) - (1-\gamma)\beta_k\mathbf{1},$$

and hence

$$\langle d_k,\, T(v_k) - v_k \rangle = \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle - (1-\gamma)\beta_k \langle d_k, \mathbf{1}\rangle = \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle - (1-\gamma)\beta_k,$$

where we used $d_k \in \Delta(\mathcal{S})$ (i.e., $\langle d_k, \mathbf{1}\rangle = 1$) in the second equality above. Therefore, for $\alpha_k$ in (19), we have

$$\alpha_k = \frac{\gamma}{1-\gamma}\langle d_k,\, T(v_k) - v_k\rangle = \frac{\gamma}{1-\gamma} \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle - \gamma\beta_k.$$

This completes the proof.

Now, observe that plugging $\alpha_k$ from Lemma A.3 into the update rule of Lemma A.2 leads to

$$v_{k+1} = T^{(k+1)}(v_0) + \frac{\gamma}{1-\gamma} \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle \mathbf{1}, \quad k = 0, 1, \ldots. \tag{20}$$

Then, using the fact that $v^* = T(v^*)$ and that the Bellman operator is a $\gamma$-contraction in the $\infty$-norm, we have

$$\begin{aligned}
\|v_{k+1} - v^*\|_\infty &= \Big\| T^{(k+1)}(v_0) - T(v^*) + \frac{\gamma}{1-\gamma} \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle \mathbf{1} \Big\|_\infty \\
&\le \big\| T^{(k+1)}(v_0) - T(v^*) \big\|_\infty + \frac{\gamma}{1-\gamma} \Big| \big\langle d_k,\, T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\rangle \Big| \\
&\le \gamma^{k+1} \|v_0 - v^*\|_\infty + \frac{\gamma}{1-\gamma} \big\| T^{(k+1)}(v_0) - T^{(k)}(v_0) \big\|_\infty \\
&\le \gamma^{k+1} \|v_0 - v^*\|_\infty + \frac{\gamma^{k+1}}{1-\gamma} \|T(v_0) - v_0\|_\infty \\
&= \gamma^{k+1} \Big( \|v_0 - v^*\|_\infty + \frac{1}{1-\gamma} \|T(v_0) - v_0\|_\infty \Big).
\end{aligned}$$

That is, $v_k \to v^*$ as $k \to \infty$ linearly with rate $\gamma$.

A.4. Proof of Theorem 4.2

Let us begin by recalling the definition of the empirical Bellman operator

$$\big[\hat T_k(q)\big](s,a) := c(s,a) + \gamma \min_{a^+ \in \mathcal{A}} q(\hat s^+_k, a^+), \tag{21}$$

where $\hat s^+_k \sim P(\cdot|s,a)$, that is to say, $\hat s^+_k$ is sampled according to the law $P(\cdot|s,a)$ at iteration $k$. From this definition, it immediately follows that

$$\hat T_k(q + \alpha\mathbf{1}) = \hat T_k(q) + \gamma\alpha\mathbf{1}, \quad \forall \alpha \in \mathbb{R},\ q \in \mathbb{R}^{|\mathcal{S}\times\mathcal{A}|}. \tag{22}$$

Also, recall that each iteration $k \ge 0$ of the R1-QL Algorithm 2 reads as

$$q_{k+1} = (1-\lambda_k)q_k + \lambda_k \hat T_k(q_k) + \alpha_k\mathbf{1}, \qquad \alpha_k = \lambda_k \frac{\gamma}{1-\gamma} \big\langle \hat d_k,\, \hat T_k(q_k) - q_k \big\rangle, \tag{23}$$

where $\hat d_k \in \Delta(\mathcal{S}\times\mathcal{A})$ is an estimation of the stationary distribution of the state-action transition probability matrix of the MDP under the greedy policy with respect to $q_k$.
Let us also consider the standard QL iterates

$$q^{\rm QL}_{k+1} = (1-\lambda_k)\, q^{\rm QL}_k + \lambda_k \hat T_k(q^{\rm QL}_k), \quad k = 0, 1, \ldots,$$

with the same initialization $q^{\rm QL}_0 = q_0$ and the same empirical Bellman operators $\hat T_k(\cdot)$ for all $k$ as in the R1-QL algorithm (23). The first result concerns an alternative characterization of the iterates in (23).

Lemma A.4. For each $k \ge 0$, the iterates of the R1-QL algorithm (23) equivalently read as

$$q_{k+1} = q^{\rm QL}_{k+1} + \beta_{k+1}\mathbf{1}, \qquad \beta_{k+1} = (1-\lambda_k)\beta_k + \gamma\lambda_k\beta_k + \alpha_k, \quad \beta_0 = 0. \tag{24}$$

Proof. (Proof by induction.) For $k = 0$, since $q^{\rm QL}_0 = q_0$, we can write

$$q_1 = (1-\lambda_0)q_0 + \lambda_0 \hat T_0(q_0) + \alpha_0\mathbf{1} = (1-\lambda_0)q^{\rm QL}_0 + \lambda_0 \hat T_0(q^{\rm QL}_0) + \alpha_0\mathbf{1} = q^{\rm QL}_1 + \beta_1\mathbf{1},$$

where $\beta_1 = \alpha_0$. Assume next $q_k = q^{\rm QL}_k + \beta_k\mathbf{1}$ for some $k \ge 0$. Then, it follows that

$$\begin{aligned}
q_{k+1} &= (1-\lambda_k)q_k + \lambda_k \hat T_k(q_k) + \alpha_k\mathbf{1} \\
&= (1-\lambda_k)(q^{\rm QL}_k + \beta_k\mathbf{1}) + \lambda_k \hat T_k(q^{\rm QL}_k + \beta_k\mathbf{1}) + \alpha_k\mathbf{1} \\
&\overset{(22)}{=} (1-\lambda_k)(q^{\rm QL}_k + \beta_k\mathbf{1}) + \lambda_k \big(\hat T_k(q^{\rm QL}_k) + \gamma\beta_k\mathbf{1}\big) + \alpha_k\mathbf{1} \\
&= (1-\lambda_k)q^{\rm QL}_k + \lambda_k \hat T_k(q^{\rm QL}_k) + \big((1-\lambda_k)\beta_k + \gamma\lambda_k\beta_k + \alpha_k\big)\mathbf{1} \\
&= q^{\rm QL}_{k+1} + \beta_{k+1}\mathbf{1}.
\end{aligned}$$

This concludes the proof.

We next provide a useful characterization of the constant shifts $\alpha_k$ in the R1-QL update rule (23).

Lemma A.5. For each $k \ge 0$, one has

$$\alpha_k = \lambda_k \frac{\gamma}{1-\gamma} \big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - q^{\rm QL}_k \big\rangle - \gamma\lambda_k\beta_k.$$

Proof. From Lemma A.4 and since $q_0 = q^{\rm QL}_0$ and $\beta_0 = 0$, we have $q_k = q^{\rm QL}_k + \beta_k\mathbf{1}$ for $k = 0, 1, \ldots$. Hence, we can use (22) to write

$$\hat T_k(q_k) = \hat T_k(q^{\rm QL}_k + \beta_k\mathbf{1}) = \hat T_k(q^{\rm QL}_k) + \gamma\beta_k\mathbf{1}.$$

As a result,

$$\hat T_k(q_k) - q_k = \hat T_k(q^{\rm QL}_k) + \gamma\beta_k\mathbf{1} - q^{\rm QL}_k - \beta_k\mathbf{1} = \hat T_k(q^{\rm QL}_k) - q^{\rm QL}_k - (1-\gamma)\beta_k\mathbf{1},$$

and

$$\big\langle \hat d_k,\, \hat T_k(q_k) - q_k \big\rangle = \big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - q^{\rm QL}_k \big\rangle - (1-\gamma)\beta_k \langle \hat d_k, \mathbf{1}\rangle = \big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - q^{\rm QL}_k \big\rangle - (1-\gamma)\beta_k,$$

where we use the identity $\langle \hat d_k, \mathbf{1}\rangle = 1$ in the second equality since $\hat d_k \in \Delta(\mathcal{S}\times\mathcal{A})$. Recalling the definition of $\alpha_k$ in (23), one thus has

$$\alpha_k = \lambda_k \frac{\gamma}{1-\gamma} \big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - q^{\rm QL}_k \big\rangle - \gamma\lambda_k\beta_k.$$

Plugging the expression for $\alpha_k$ from Lemma A.5 into the update rule of Lemma A.4, we derive the iteration

$$\beta_{k+1} = (1-\lambda_k)\beta_k + \lambda_k \frac{\gamma}{1-\gamma} \big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - q^{\rm QL}_k \big\rangle, \quad k = 0, 1, \ldots, \tag{25}$$

initialized by $\beta_0 = 0$.
Next, using the convergence of QL, we can show the convergence of $\beta_k$:

Lemma A.6. The iterates $\beta_k$ in (25) converge to zero almost surely.

Proof. Define $\beta_{1,0} = \beta_{2,0} = 0$ and consider the iterations

$$\beta_{1,k+1} = (1-\lambda_k)\beta_{1,k} + \lambda_k \frac{\gamma}{1-\gamma} \big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - T(q^{\rm QL}_k) \big\rangle, \tag{26}$$

$$\beta_{2,k+1} = (1-\lambda_k)\beta_{2,k} + \lambda_k \frac{\gamma}{1-\gamma} \big\langle \hat d_k,\, T(q^{\rm QL}_k) - q^{\rm QL}_k \big\rangle, \tag{27}$$

for $k = 0, 1, \ldots$, so that $\beta_k = \beta_{1,k} + \beta_{2,k}$ for all $k \ge 0$. In what follows, we use the fact that the QL iterates $q^{\rm QL}_k$ are bounded and converge to $q^* = T(q^*)$ almost surely (Tsitsiklis, 1994, Thm. 4). First, observe that the iteration (26) converges to zero almost surely using (Tsitsiklis, 1994, Lem. 1) and the fact that (see also (Tsitsiklis, 1994, Sec. 7))

$$\mathbb{E}\Big[\big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - T(q^{\rm QL}_k) \big\rangle\Big] = \Big\langle \hat d_k,\, \mathbb{E}_{\hat s^+_k}\big[\hat T_k(q^{\rm QL}_k) - T(q^{\rm QL}_k)\big] \Big\rangle = 0,$$

$$\mathbb{E}\Big[\big\langle \hat d_k,\, \hat T_k(q^{\rm QL}_k) - T(q^{\rm QL}_k) \big\rangle^2\Big] \le \mathbb{E}_{\hat s^+_k}\Big[\big\| \hat T_k(q^{\rm QL}_k) - T(q^{\rm QL}_k) \big\|_\infty^2\Big] < \infty.$$

The iteration (27) also converges to zero almost surely since, with $\lambda_k = 1/(k+1)$, the iterate $\beta_{2,k+1}$ is a scaled, weighted average of Cesàro means of the sequences $\big[T(q^{\rm QL}_\ell) - q^{\rm QL}_\ell\big](s,a)$, which converge to zero almost surely for each $(s,a) \in \mathcal{S}\times\mathcal{A}$ (recall that $q^{\rm QL}_k \to q^* = T(q^*)$ almost surely).

Finally, recall the characterization $q_k = q^{\rm QL}_k + \beta_k\mathbf{1}$ in Lemma A.4 and observe that $q^{\rm QL}_k \to q^*$ and $\beta_k \to 0$ almost surely. Therefore, $q_k \to q^*$ almost surely.

B. On numerical experiments

B.1. Algorithms

Below, we provide the update rules of the algorithms we employed in our numerical experiments. We adapt the update rules provided by (Kolarijani & Mohajerin Esfahani, 2023) in our implementations of the planning algorithms.

Nesterov VI algorithm (Goyal & Grand-Clément, 2022):

$$z_k = v_k + \frac{1-\sqrt{1-\gamma^2}}{\gamma}\,(v_k - v_{k-1}), \qquad v_{k+1} = z_k + \frac{1}{1+\gamma}\big(T(z_k) - z_k\big).$$

Anderson VI algorithm (Geist & Scherrer, 2018): (The following update rule is for Anderson acceleration with memory equal to 1, which corresponds to a rank-one approximation of the Hessian.)
$$z_k = v_k - v_{k-1}, \qquad z'_k = T(v_k) - T(v_{k-1}),$$

$$\delta_k = \begin{cases} 0, & z_k'^\top (z_k - z'_k) = 0, \\[4pt] \dfrac{z_k'^\top \big(v_k - T(v_k)\big)}{z_k'^\top (z_k - z'_k)}, & \text{otherwise}, \end{cases}$$

$$v_{k+1} = (1-\delta_k)\, T(v_k) + \delta_k\, T(v_{k-1}).$$

Speedy QL algorithm (Ghavamzadeh et al., 2011): (The following update rule is the synchronous implementation of Speedy QL.) For each $(s,a) \in \mathcal{S}\times\mathcal{A}$, sample $\hat s^+ \sim P(\cdot|s,a)$ and set

$$z_k(s,a) = c(s,a) + \gamma \min_{a^+ \in \mathcal{A}} q_k(\hat s^+, a^+), \qquad z'_k(s,a) = c(s,a) + \gamma \min_{a^+ \in \mathcal{A}} q_{k-1}(\hat s^+, a^+),$$

$$q_{k+1} = q_k + \frac{1}{1+k}\,(z'_k - q_k) + \frac{k}{1+k}\,(z_k - z'_k).$$

Zap QL algorithm (Devraj & Meyn, 2017): (The following update rule is the synchronous implementation of Zap QL without eligibility traces. The matrix $F_k \in \mathbb{R}^{|\mathcal{S}\times\mathcal{A}| \times |\mathcal{S}\times\mathcal{A}|}$ below denotes the sampled transition matrix at iteration $k$.) Initialize $F_k = 0 \in \mathbb{R}^{|\mathcal{S}\times\mathcal{A}| \times |\mathcal{S}\times\mathcal{A}|}$; for each $(s,a) \in \mathcal{S}\times\mathcal{A}$, sample $\hat s^+ \sim P(\cdot|s,a)$ and set

$$\hat a^+ = \operatorname*{argmin}_{a^+ \in \mathcal{A}} q_k(\hat s^+, a^+), \qquad \big[\hat T_k(q_k)\big](s,a) = c(s,a) + \gamma \min_{a^+ \in \mathcal{A}} q_k(\hat s^+, a^+), \qquad F_k\big[(s,a), (\hat s^+, \hat a^+)\big] = 1;$$

then update

$$\hat P_k = \hat P_{k-1} + \frac{1}{2+k}\,(F_k - \hat P_{k-1}), \qquad q_{k+1} = q_k + \frac{1}{1+k}\,(I - \gamma \hat P_k)^{-1}\big(\hat T_k(q_k) - q_k\big).$$

B.2. Extended numerical analysis

In this appendix, we provide the Bellman and value errors observed throughout the iterations of the planning and learning algorithms. We run each planning algorithm until the error thresholds in Table 1 are achieved.

Table 1: Error thresholds for planning algorithms.

MDP      Error      γ = 0.9    γ = 0.95   γ = 0.99   γ = 0.999
Garnet   value      10^-5      10^-4      10^-4      10^-2
Garnet   Bellman    10^-5      10^-5      10^-5      10^-4
Graph    value      10^-5      10^-4      10^-3      10^-2
Graph    Bellman    10^-5      10^-5      10^-5      10^-4

We consider four different values of the discount factor γ across 25 realizations of the Garnet MDP. Note that the planning algorithms are deterministic in nature; hence, the only source of variation in the errors is the random realization of the Garnet MDPs. Figure 3 shows the range of error values observed within the span of the iterations of the planning algorithms. The solid curve represents the median error values, while the shaded region around the curve indicates the errors between the first and the third quartiles.
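The two error metrics reported throughout this section can be computed as in the following minimal sketch (with hypothetical names): the value error $\|v_k - v^*\|_\infty$ against a reference solution obtained by running VI to numerical convergence, and the Bellman error $\|T(v_k) - v_k\|_\infty$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 20, 4, 0.9       # states, actions, discount factor

# Random MDP instance: costs c[s, a], transition kernel P[a][s, s+].
c = rng.random((n, m))
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)   # row-stochastic transitions

def bellman(v):
    """Bellman optimality operator T for this MDP."""
    return (c + gamma * np.einsum('ast,t->sa', P, v)).min(axis=1)

# Reference solution: iterate VI far past the reported thresholds.
v_star = np.zeros(n)
for _ in range(2000):
    v_star = bellman(v_star)

# Error trajectories for plain VI, as reported in the figures.
v = np.zeros(n)
value_err, bellman_err = [], []
for _ in range(50):
    value_err.append(np.max(np.abs(v - v_star)))        # value error
    bellman_err.append(np.max(np.abs(bellman(v) - v)))  # Bellman error
    v = bellman(v)
```

Since $T$ is a $\gamma$-contraction, the value error of plain VI decays at least geometrically with rate $\gamma$, which is the baseline against which the accelerated methods are compared.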
We follow the same style of presentation in the other figures in this section. We observe in Figure 3 that Anderson VI has the highest error variance. Furthermore, the error curves for both Anderson VI and Nesterov VI are not monotonically decreasing, which is particularly visible for Anderson VI. This is because the iterations of these algorithms are not necessarily contractions with a guaranteed reduction in the Bellman/value error.

Figure 3: Comparison of the planning algorithms in Garnet MDP with various γ values. (Panels show value and Bellman errors versus iterations for PI, R1-VI, Anderson-VI, Nesterov-VI, and VI.)

Figure 4: Comparison of the planning algorithms in Graph MDP with various γ values.

Nevertheless, in our numerical experiments, both algorithms seem to be convergent. Figure 4 shows the error curves of the planning algorithms for the Graph MDP. Here, the non-monotonic behavior of Anderson VI is more apparent, as the error values initially increase with an oscillating behavior. Similar to the Garnet MDPs, R1-VI consistently provides lower errors throughout the iterations. Figures 5 and 6 show the error curves for the learning algorithms in the Garnet and Graph MDPs, respectively.
In these experiments, we run each learning algorithm with 5 different seeds to marginalize the randomness in the sampling process. We observe that the difference between Zap QL, Speedy QL, and R1-QL is not noticeable at lower values of the discount factor (i.e., γ ≤ 0.95). However, as the discount factor increases, particularly at γ = 0.999, the gap between the error values increases. At higher values of the discount factor, R1-QL consistently yields lower error values, while QL struggles to minimize the errors due to the linearly decaying step size $\lambda_k$. Furthermore, we observe that Zap QL displays higher error variance, which may be due to the inversion of the estimated Hessian $(I - \gamma \hat P)$. In contrast, R1-QL exhibits considerably lower variance, despite implicitly performing a similar inversion.

Figure 5: Comparison of the learning algorithms in Garnet MDP with various γ values. (Panels show value and Bellman errors versus iterations for Zap QL, R1-QL, Speedy QL, and QL.)

Figure 6: Comparison of the learning algorithms in Graph MDP with various γ values.
We argue that the lower error variance observed with R1-QL is due to the low-rank approximation of the transition probability matrix via estimation of the corresponding stationary distribution.

B.3. Reducible MDPs

To illustrate R1-VI's behavior in a reducible MDP, we replicate the comparison from Section 5 within the Gridworld environment of (Sutton & Barto, 2018), which inherently contains an absorbing state. We consider two Gridworld variants:
1. Absorbing Gridworld, in which the absorbing state yields a positive reward.
2. Terminal Gridworld, in which the absorbing state grants a zero reward.
Here, reducibility refers to the Markov chain induced by the optimal policy. Note, however, that even in the absence of an absorbing state (a state from which all actions lead back to itself), a non-optimal policy may still produce a reducible chain in Gridworld.

Figure 7: Comparison of planning algorithms on two reducible Gridworld MDP instances, one with a zero-reward absorbing state and one with a positive-reward absorbing state, across various γ values.

In Absorbing Gridworld (left side of Figure 7), R1-VI yields the lowest value and Bellman errors, apart from PI, across all γ values. However, in Terminal Gridworld (right side of Figure 7), R1-VI performs slightly worse than the other accelerated algorithms. This is explained by the fact that under the optimal policy, the stationary distribution concentrates on the absorbing state, which provides zero reward and hence zero Bellman error when the values are initialized to zero. Consequently, the second term in the R1-VI update rule (9) vanishes.
In contrast, in Absorbing Gridworld, the Bellman error at the absorbing state is not immediately zero; hence, the second term in R1-VI contributes to improving convergence.

B.4. Policy performance

Throughout Section 5, we compared both the planning and learning algorithms, including our proposed R1-VI and R1-QL methods, using the value and Bellman error metrics. However, rapid convergence in value does not necessarily translate into equally rapid convergence in policy space, which is the ultimate criterion of policy optimization. In this section, we present a comparative analysis based on the policy evaluation metric in the Graph and Garnet MDPs. Figure 8 shows that the planning algorithms yield exactly the same policy evaluation, except for PI and Anderson-VI. Anderson-VI initially struggles to find the optimal policy due to instabilities in the value space (shown in Figures 3 and 4), but eventually converges to the optimal policy. In both MDPs, policy convergence occurs in fewer than five steps, except for Anderson-VI, whereas convergence in the value space requires several orders of magnitude more steps. The convergence of policies among the learning algorithms is more varied than that of the planning algorithms. Figure 9 shows that, for the Graph MDP, all algorithms except Zap-QL achieve almost identical performance, converging within five steps for every value of γ. In the Garnet MDP, convergence requires more steps, and improvements over iterations are slower. In both Figures 8 and 9, the proposed R1-VI and R1-QL algorithms match the policy performance of VI and QL, respectively, across all iterations. Moreover, VI in the planning setting and QL in the learning setting produce among the highest policy performance observed across both MDPs, despite exhibiting the slowest convergence in the value space.
Figure 8: Comparison of the value of the greedy policy produced by various planning algorithms (R1-VI, VI, PI, Anderson-VI, and Nesterov-VI) on Garnet and Graph MDPs over a range of discount factors γ. Note that R1-VI, VI, and Nesterov-VI yield essentially overlapping results.

Figure 9: Comparison of the value of the greedy policy produced by the R1-QL and QL learning algorithms on Garnet and Graph MDPs as a function of the discount factor γ.
Results for R1-QL and QL coincide. The y-axis shows the median policy-evaluation score of the corresponding greedy policies across 5 different seeds.