# Operator Splitting Value Iteration

Amin Rakhsha$^{1,2}$, Andrew Wang$^{1,2}$, Mohammad Ghavamzadeh$^{3}$, Amir-massoud Farahmand$^{2,1}$
$^{1}$Department of Computer Science, University of Toronto; $^{2}$Vector Institute; $^{3}$Google Research

We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in the presence of model approximation error.

## 1 Introduction

Consider a planning problem for a discounted MDP with dynamics $P$. Suppose that we have access to an approximate model $\hat{P} \approx P$ as well. For example, $P$ might be a high-fidelity, but slow, simulator, and $\hat{P}$ a lower-fidelity, but fast, simulator. Or in the context of model-based reinforcement learning (MBRL), $P$ is the unknown dynamics of a real-world system, from which we can only acquire expensive samples, and $\hat{P}$ is a learned model, from which samples can be cheaply acquired. Can we use this approximate model $\hat{P}$ to accelerate the computation of the value function of a policy $\pi$ (Policy Evaluation (PE) problem) or the optimal value function (Control problem)?

The Value Iteration (VI) algorithm and its approximate variant are fundamental algorithms in Dynamic Programming that can find the (approximate) value of a policy or the optimal value function. They are also the backbone of many reinforcement learning (RL) algorithms such as Temporal Difference Learning [Sutton, 1988], Fitted Value Iteration [Gordon, 1995, Ernst et al., 2005, Munos and Szepesvári, 2008], and Deep Q-Network [Mnih et al., 2015]. Value Iteration, however, can be slow when the discount factor $\gamma$ is close to 1, as its convergence rate is $O(\gamma^k)$. Moreover, even though we could run VI with $\hat{P}$ instead of $P$ to avoid expensive queries to $P$, the obtained value function would converge to a solution different from the value function of the original MDP.

This paper proposes an algorithm called Operator Splitting Value Iteration (OS-VI) that benefits from an approximate model $\hat{P}$ to potentially accelerate the convergence of the value function sequence to the value function with respect to (w.r.t.) the true model $P$ (Section 3). The algorithm handles both the PE (Section 3.1) and Control (Section 3.2) problems. The acceleration is not uniform though, and depends on how close $\hat{P}$ is to $P$ (Section 4).

A key inspiration behind OS-VI is the (matrix) splitting approach in numerical linear algebra, which is used to iteratively solve large linear systems of equations [Varga, 2000, Saad, 2003, Golub and Van Loan, 2013]. With a proper choice of splitting, one may change the convergence rate of linear systems solvers. We show that the conventional VI for PE can be seen as a particular choice of splitting. This observation suggests that one may choose other forms of splitting in order to change the convergence rate. It turns out that we can choose a splitting that benefits from having access to $\hat{P}$ (Section 2). The new splitting leads to OS-VI for PE.
For the Control problem, the connection between solving a linear system of equations and VI is not as straightforward, as the former is linear while the latter is not, but we can still take inspiration from the splitting approach to design OS-VI for Control. The key step of such an algorithm is a new policy improvement step.

The form of the OS-VI algorithm opens up a connection to MBRL, where the approximate model $\hat{P}$ is learned using data. This leads to the OS-Dyna algorithm, inspired by a generic Dyna architecture [Sutton, 1990]. OS-Dyna is a hybrid of model-free and model-based algorithms. It uses the learned model in its inner planning loop, like Dyna, but uses samples from the true model $P$ in order to correct the effect of errors in the model. Existing MBRL algorithms would converge to an incorrect solution if the approximate model $\hat{P}$ does not converge to the true model $P$, which is the case whenever model approximation error exists. On the other hand, OS-Dyna can still converge to the correct value function even when $\hat{P}$ does not converge to $P$. As far as we know, this is the first model-based RL algorithm with such a property.

## 2 From value iteration to splitting-based linear system of equations solvers and back

We briefly describe the VI algorithm and the splitting methods for solving linear systems of equations, and explain their connections. We consider a discounted Markov Decision Process (MDP) $(\mathcal{X}, \mathcal{A}, \mathcal{R}, P, \gamma)$ [Bertsekas and Tsitsiklis, 1996, Szepesvári, 2010, Sutton and Barto, 2019]. We defer formal definitions to the supplementary material. We only mention that for a policy $\pi$, we denote by $P^\pi$ its transition kernel, by $r^\pi : \mathcal{X} \to \mathbb{R}$ the expected value of its reward distribution, and by $V^\pi = V^\pi(\mathcal{R}, P)$ its state-value function. We also denote the optimal state-value function by $V^* = V^*(\mathcal{R}, P)$ and the optimal policy by $\pi^* = \pi^*(\mathcal{R}, P)$. The Bellman operator $T^\pi : B(\mathcal{X}) \to B(\mathcal{X})$ for policy $\pi$ and the Bellman optimality operator $T^* : B(\mathcal{X}) \to B(\mathcal{X})$ are$^1$

$$(T^\pi V)(x) \triangleq r^\pi(x) + \gamma \int P^\pi(\mathrm{d}y|x) V(y); \qquad (T^* V)(x) \triangleq \max_{a \in \mathcal{A}} \Big\{ r(x,a) + \gamma \int P(\mathrm{d}y|x,a) V(y) \Big\}.$$

$^1$ For countable state and action spaces, the integrals are replaced by summations. We present OS-VI and its theoretical analysis for general state/action spaces, but limit our experiments to finite state/action problems.

These operators can be written more compactly as $T^\pi : V \mapsto r^\pi + \gamma P^\pi V$ and $T^* : V \mapsto \max_\pi \{ r^\pi + \gamma P^\pi V \}$. The greedy policy at state $x \in \mathcal{X}$ is

$$\pi_g(x; V) \leftarrow \operatorname*{argmax}_{a \in \mathcal{A}} \Big\{ r(x,a) + \gamma \int P(\mathrm{d}y|x,a) V(y) \Big\}, \qquad (2.1)$$

or more compactly, $\pi_g(V) \leftarrow \operatorname{argmax}_\pi T^\pi V$. We have $T^* V = T^{\pi_g(V)} V$, that is, the effect of the Bellman optimality operator $T^*$ applied to a value function $V$ is the same as applying the Bellman operator of the greedy policy w.r.t. $V$ to $V$.

### 2.1 Value Iteration

The value function $V^\pi$ and the optimal value function $V^*$ are the fixed points of the operators $T^\pi$ and $T^*$, respectively, and satisfy the Bellman equation. For the PE problem, this means that

$$V^\pi = r^\pi + \gamma P^\pi V^\pi \iff (I - \gamma P^\pi) V^\pi = r^\pi. \qquad (2.2)$$

There are several ways to compute the value function of a policy $\pi$ or the optimal value function $V^*$, including iterative methods such as the VI and Policy Iteration (PI) algorithms, solving a linear system of equations (for PE), or linear programming (for Control). We focus on the VI algorithm in this work. VI repeatedly applies the Bellman operator to the most recent approximation of the value function: Given an initial value function $V_0$, it generates a sequence $(V_k)_{k \geq 0}$ as follows:

$$V_k \leftarrow T^\pi V_{k-1} \ \ \text{(Policy Evaluation)}, \qquad V_k \leftarrow T^* V_{k-1} \ \ \text{(Control)}. \qquad (2.3)$$
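To make (2.3) concrete, here is a minimal tabular sketch of VI (our own illustration, not the authors' code). It assumes, purely for illustration, that the dynamics are given as a NumPy array `P` of shape `(A, X, X)` and the rewards as `r` of shape `(X, A)`:

```python
import numpy as np

def value_iteration(P, r, gamma, num_iters=1000, policy=None):
    """Tabular VI (2.3). P: (A, X, X) transition tensor, r: (X, A) rewards.
    If `policy` (array of actions, shape (X,)) is given, this is PE; otherwise Control."""
    X = r.shape[0]
    V = np.zeros(X)
    for _ in range(num_iters):
        # Q(x, a) = r(x, a) + gamma * sum_y P(y | x, a) V(y)
        Q = r + gamma * np.einsum('axy,y->xa', P, V)
        V = Q[np.arange(X), policy] if policy is not None else Q.max(axis=1)
    return V
```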
VI for Control can be written in an equivalent form: At iteration $k$, we first compute the greedy policy $\pi_k \leftarrow \pi_g(V_{k-1})$ (policy improvement step), and then $V_k \leftarrow T^{\pi_k} V_{k-1}$. Therefore, the policy improvement step is obtained through finding a policy that is greedy w.r.t. the last value function $V_{k-1}$, that is, the best policy if we only look one step ahead. This form will be conducive to our later developments. As the Bellman operator is a $\gamma$-contraction w.r.t. the supremum norm, the convergence rate of $V_k$ to $V^\pi$ (or $V^*$) is $O(\gamma^k)$. This rate can be slow when $\gamma$ is close to 1.

### 2.2 Matrix splitting for solving linear systems of equations

The VI for PE can be seen as a (matrix) splitting-based iterative method to solve the linear system of equations (2.2). Consider the linear system $Az = b$, with $A \in \mathbb{R}^{d \times d}$ and $z, b \in \mathbb{R}^d$. Suppose that $A$ is decomposed as $A = M - N$ for some choice of $M, N \in \mathbb{R}^{d \times d}$ (more generally, $A$, $M$, and $N$ can be linear operators). Therefore, $z$ satisfies $Mz = Nz + b$. The splitting-based iterative approach defines the new approximation $z_k$ given the current $z_{k-1}$ by solving $M z_k = N z_{k-1} + b$, or equivalently,

$$z_k = M^{-1}(N z_{k-1} + b). \qquad (2.4)$$

To analyze the convergence of this iterative method, consider the error $e_k \triangleq z_k - z$. As $z = M^{-1}(Nz + b)$, we have $e_k = M^{-1}(N z_{k-1} + b) - M^{-1}(Nz + b) = M^{-1} N (z_{k-1} - z)$, so the dynamics of the error is

$$e_k = M^{-1} N e_{k-1} = (M^{-1} N)^2 e_{k-2} = \cdots = (M^{-1} N)^k e_0. \qquad (2.5)$$

Let $G \triangleq M^{-1} N$. The norm of the sequence $(e_k)_{k \geq 1}$ can be upper bounded as

$$\|e_k\| = \|G^k e_0\| \leq \|G^k\| \, \|e_0\| \leq \|G\|^k \, \|e_0\|. \qquad (2.6)$$

The errors are (norm-)convergent if $\|G\| = \|M^{-1}N\| < 1$, for some choice of norm. More generally, the necessary and sufficient condition for convergence is that the spectral radius $\rho(G)$, the maximum absolute value of the eigenvalues of $G$, is smaller than one; see, e.g., Theorem 4.1 of Saad [2003] or Theorem 11.2.1 of Golub and Van Loan [2013].$^2$ The convergence is faster if the spectral radius (or norm) is closer to zero.

$^2$ For any matrix norm, we have $\rho(G) \leq \|G\|$, so the condition on the norm is sufficient, but not necessary. Our analysis will be norm-based.

The success of this iterative procedure depends on how we choose $M$ and $N$ such that the norm (or spectral radius) is as small as possible. We also want to choose an $M$ such that solving $M z_k = N z_{k-1} + b$ is not very expensive. For example, if $M$ is the identity matrix $I$, we get that $N = I - A$, and the iteration becomes $z_k = (I - A) z_{k-1} + b$. This iteration is convergent if $\rho(I - A) < 1$. Other commonly used choices lead to the Jacobi and Gauss–Seidel methods, which are described in the supplementary material.

We are now ready to make the connection between splitting-based iterative methods and VI for PE. If we choose $A = I - \gamma P^\pi$, we see that the equation $A V^\pi = r^\pi$ is indeed the Bellman equation for policy $\pi$ (2.2). The VI for PE, which is $V_k = \gamma P^\pi V_{k-1} + r^\pi = (I - A) V_{k-1} + r^\pi$, corresponds to the choice of $M = I$ and $N = \gamma P^\pi$. This brings up the question of whether it is possible to split $A$ differently so that the resulting VI-like procedure has better convergence properties. We next suggest a particular choice.
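The generic iteration (2.4) is easy to sketch for finite-dimensional systems. The following is our own illustrative snippet (not from the paper); it assumes dense NumPy matrices and uses a direct solve of $M z_k = N z_{k-1} + b$ at each step:

```python
import numpy as np

def splitting_solve(A, b, M, num_iters=100, z0=None):
    """Iteratively solve A z = b with the splitting A = M - N, i.e. iteration (2.4):
    z_k = M^{-1}(N z_{k-1} + b). Convergence requires rho(M^{-1} N) < 1."""
    N = M - A
    z = np.zeros_like(b) if z0 is None else z0.copy()
    for _ in range(num_iters):
        z = np.linalg.solve(M, N @ z + b)  # solve M z_k = N z_{k-1} + b
    return z

# With M = I this reduces to z_k = (I - A) z_{k-1} + b, which for A = I - gamma * P_pi
# and b = r_pi is exactly VI for policy evaluation.
```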
## 3 Operator splitting value iteration algorithm

We introduce the Operator Splitting Value Iteration (OS-VI) algorithm. We start from the PE problem and develop the Control version based on that. We also present a visualization of how OS-VI works.

### 3.1 OS-VI for policy evaluation

Given a policy $\pi$, true model $P$, and approximate model $\hat{P}$, we split $I - \gamma P^\pi$ into $M^\pi$ and $N^\pi$ as

$$M^\pi = I - \gamma \hat{P}^\pi, \qquad N^\pi = \gamma (P^\pi - \hat{P}^\pi).$$

Following the recipe of (2.4), the OS-VI algorithm for PE would be

$$V_k \leftarrow (I - \gamma \hat{P}^\pi)^{-1} \Big[ r^\pi + \gamma (P^\pi - \hat{P}^\pi) V_{k-1} \Big], \qquad (3.1)$$

starting from an initial $V_0$.$^3$

$^3$ Although splitting is originally studied mostly in the context of linear algebra and matrices, we are applying the idea more generally. We are not assuming that the state space $\mathcal{X}$ is finite, and we allow it to be more general, such as a subset of $\mathbb{R}^d$. Consequently, $M^\pi$, $N^\pi$, $P^\pi$, etc. are operators rather than matrices.

To gain more intuition and prepare for further developments, we define a few notations. We define the Varga operator $S^\pi : B(\mathcal{X}) \to B(\mathcal{X})$, named after Richard S. Varga (1928–2022), who made significant contributions to matrix analysis, as the mapping between the space of all bounded functions over $\mathcal{X}$ and the same space given by

$$S^\pi : V \mapsto (I - \gamma \hat{P}^\pi)^{-1} \Big[ r^\pi + \gamma (P^\pi - \hat{P}^\pi) V \Big].$$

Observe that (3.1) can be compactly written as

$$V_k \leftarrow S^\pi V_{k-1}. \qquad (3.2)$$

It is not difficult to see that $S^\pi V^\pi = V^\pi$, i.e., the value function of a policy $\pi$ is a fixed point of the Varga operator $S^\pi$. This and other properties of the Varga operator are shown in the supplementary material. Given any value function $V$, define an auxiliary reward function $r_V : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ as

$$r_V(x,a) \triangleq r(x,a) + \gamma \int \Big( P(\mathrm{d}y|x,a) - \hat{P}(\mathrm{d}y|x,a) \Big) V(y). \qquad (3.3)$$

Similar to $r^\pi$, we define the notation $r^\pi_V : \mathcal{X} \to \mathbb{R}$ as $r^\pi_V(x) = r_V(x, \pi(x))$ for a deterministic policy $\pi$ (and similarly for a stochastic policy). With this notation, the effect of applying $S^\pi$ to $V$ is $S^\pi V = (I - \gamma \hat{P}^\pi)^{-1} r^\pi_V$. This is the value function of following $\pi$ in an MDP with dynamics $\hat{P}$ and reward $r_V$. Therefore, at each iteration of OS-VI (PE), we evaluate the policy $\pi$ according to the approximate dynamics and a reward function that consists of the original reward $r$ and the correction term $\gamma (P - \hat{P}) V_{k-1}$.

The computation of this value function is a standard PE problem with the approximate model. For instance, we may use another VI (PE) with dynamics $\hat{P}$ to solve it: We initialize $U_0 \leftarrow V$, and then for $i \geq 1$, we set $U_i \leftarrow r^\pi_V + \gamma \hat{P}^\pi U_{i-1}$. This converges to $S^\pi V = (I - \gamma \hat{P}^\pi)^{-1} r^\pi_V$ with the usual rate of $O(\gamma^i)$. Note that aside from the computation of $r^\pi_V$, which requires querying $P$ in order to compute the $P^\pi V$ term, this iterative process only uses the approximate model $\hat{P}$, which is assumed to be cheap to access.

What is the benefit of this OS-VI procedure? If the approximate model $\hat{P}$ is close to the true dynamics $P$, this leads to a faster convergence of $V_k$ to $V^\pi$, as shall be quantified soon. The faster convergence is in terms of the number of queries to $P$, which is assumed to be expensive. To see this, consider the hypothetical case that $\hat{P}$ is exactly the same as $P$, for example, if the cheap simulator happens to perfectly match reality. Then, $S^\pi V = (I - \gamma P^\pi)^{-1}(r^\pi + 0 \cdot V) = V^\pi$, and the value function for the original MDP is obtained in one iteration of OS-VI. Of course, we often can only hope for $\hat{P} \approx P$. In Section 4, we study the impact of error in $\hat{P}$ on the convergence rate of OS-VI in more detail, and show that the convergence of OS-VI can be much faster than classic algorithms even if $\hat{P}$ is only a close approximation of $P$.
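A minimal tabular sketch of OS-VI (PE), written by us for illustration under the assumption that $P^\pi$, $\hat{P}^\pi$, and $r^\pi$ are available as NumPy arrays and that the auxiliary MDP is solved exactly by a linear solve (rather than the inner VI described above):

```python
import numpy as np

def os_vi_pe(P_pi, P_hat_pi, r_pi, gamma, num_iters=10):
    """Tabular OS-VI for policy evaluation (3.1)/(3.2).
    P_pi, P_hat_pi: (X, X) transition matrices of policy pi under the true and
    approximate models; r_pi: (X,) expected rewards. Each iteration applies the
    Varga operator S^pi, i.e. evaluates pi in the auxiliary MDP with dynamics P_hat."""
    X = len(r_pi)
    I = np.eye(X)
    V = np.zeros(X)
    for _ in range(num_iters):
        # Auxiliary reward r_V (3.3): original reward plus the correction gamma*(P - P_hat)V.
        # This is the only place the (expensive) true model P is queried.
        r_aux = r_pi + gamma * (P_pi - P_hat_pi) @ V
        # S^pi V = (I - gamma*P_hat_pi)^{-1} r_aux; in practice this solve could itself
        # be carried out by an inner VI that only uses the cheap model P_hat.
        V = np.linalg.solve(I - gamma * P_hat_pi, r_aux)
    return V
```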
### 3.2 OS-VI for control

The VI for Control can be seen as an iterative procedure that computes the greedy policy $\pi_k \leftarrow \pi_g(V_{k-1}) = \operatorname{argmax}_\pi T^\pi V_{k-1}$ in its policy improvement step, and then uses one step of the Bellman operator w.r.t. the obtained policy $\pi_k$ to compute the new estimate of the value function, $V_k \leftarrow T^{\pi_k} V_{k-1}$, as described after (2.3). The OS-VI algorithm for Control follows a similar structure, with the differences that 1) the improved policy is the optimizer of the Varga operator, not the Bellman operator, and 2) the new value function is obtained by applying the Varga operator of the newly obtained policy. To be concrete, given a value function $V$, define the S-improved policy

$$\pi_V(V) \leftarrow \operatorname*{argmax}_\pi S^\pi V \ \big[ = (I - \gamma \hat{P}^\pi)^{-1} r^\pi_V \big]. \qquad (3.4)$$

This policy is the optimal policy for the auxiliary MDP $(\mathcal{X}, \mathcal{A}, r_V, \hat{P}, \gamma)$. We also define the Varga optimality operator $S^* : B(\mathcal{X}) \to B(\mathcal{X})$ as

$$S^* : V \mapsto \max_\pi S^\pi V.$$

The function $S^* V$ is equal to $S^{\pi_V(V)} V$, i.e., the Varga operator of the S-improved policy w.r.t. $V$ applied to the value function $V$ (compare it with $T^* V = T^{\pi_g(V)} V$). The OS-VI (Control) is then simply

$$V_k \leftarrow S^* V_{k-1}, \qquad (3.5)$$

which in its expanded form consists of the following two steps:

$$\pi_k \leftarrow \pi_V(V_{k-1}), \quad \text{(policy improvement)} \qquad (3.6)$$

$$V_k \leftarrow S^{\pi_k} V_{k-1} \ \big[ = (I - \gamma \hat{P}^{\pi_k})^{-1} \big( r^{\pi_k} + \gamma (P^{\pi_k} - \hat{P}^{\pi_k}) V_{k-1} \big) \big]. \quad \text{(partial policy evaluation)} \qquad (3.7)$$

Comparing the S-improved policy (3.4) used in the policy improvement step (3.6) of OS-VI with the conventional greedy policy (2.1) is insightful. The greedy policy is $\operatorname{argmax}_\pi T^\pi V$. Expanding $T^\pi V$, we see that the greedy policy is the maximizer of $r^\pi + \gamma P^\pi V$. The function $r^\pi + \gamma P^\pi V$ is a single-step bootstrapped estimate of the value of $V^\pi$, and its maximizer, the greedy policy, is in general different from the optimal policy, which maximizes $V^\pi$. On the other hand, the S-improved policy $\pi_V(V)$ solves a full MDP with the approximate model $\hat{P}$ and a reward function that has both the original reward $r$ and the correction term $\gamma (P - \hat{P}) V$. In the special case that $\hat{P} = P$, the correction term is zero, and $\pi_V(V)$ would be the optimal policy $\pi^*$ for the original MDP. As often $\hat{P} \neq P$, the value function of the policy $\pi_V(V)$ is not exactly the optimal value. The partial policy evaluation step (3.7) updates the value function to a value that is closer to the optimal value function.

Remark. The use of matrix splitting-based ideas, either explicitly or implicitly, in the context of dynamic programming is not completely novel to this work. Kushner and Kleinman [1971] is one of the earliest papers that mention the Jacobi and Gauss–Seidel procedures for computing the value function. Porteus [1975] proposes several transformations of the reward and probability transition matrix with the goal of improving the computational cost of solving the transformed MDP. One of the transformations, called the pre-inverse transform, has some similarities with the operator splitting of this work. The end result, however, is different. Bacon and Precup [2016] offer a matrix splitting perspective on planning with options. The connection between multi-step models and matrix splitting is developed in Chapter 4 of Bacon [2018]. Refer to the supplementary material for more discussion.
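For illustration, a tabular sketch of OS-VI (Control), (3.5)–(3.7), written by us under the assumptions that the models are NumPy arrays and that the auxiliary MDP is solved approximately by an inner VI with the cheap model (instead of an exact solve):

```python
import numpy as np

def os_vi_control(P, P_hat, r, gamma, num_iters=10, inner_iters=1000):
    """Tabular OS-VI for Control. P, P_hat: (A, X, X) true and approximate transition
    tensors; r: (X, A) rewards. Each outer iteration solves the auxiliary MDP
    (X, A, r_V, P_hat, gamma) for its optimal policy (3.6) and value (3.7)."""
    X, A = r.shape
    V = np.zeros(X)
    for _ in range(num_iters):
        # Auxiliary reward (3.3): the only step that queries the expensive true model P.
        r_aux = r + gamma * np.einsum('axy,y->xa', P - P_hat, V)
        # Solve the auxiliary MDP with the cheap model P_hat by plain VI.
        U = V.copy()
        for _ in range(inner_iters):
            U = (r_aux + gamma * np.einsum('axy,y->xa', P_hat, U)).max(axis=1)
        pi = (r_aux + gamma * np.einsum('axy,y->xa', P_hat, U)).argmax(axis=1)  # pi_k (3.6)
        V = U  # V_k, approximately S^* V_{k-1} (3.7)
    return V, pi
```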
### 3.3 Visualizing how OS-VI works

To present some intuition on how OS-VI works, we visualize the value function trajectories of several algorithms, including OS-VI, on a 2-state MDP in Figure 1. We consider policy evaluation for the dynamics $P^\pi = \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix}$ with the reward $r^\pi = \begin{bmatrix} 1 \\ 0.5 \end{bmatrix}$ and $\gamma = 0.9$. We consider two approximate models: a relatively accurate $\hat{P}^\pi_{\text{accurate}} = \begin{bmatrix} 0.85 & 0.15 \\ 0.05 & 0.95 \end{bmatrix}$, and an inaccurate $\hat{P}^\pi_{\text{inaccurate}} = \begin{bmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{bmatrix}$.

In addition to OS-VI (PE), the first algorithm is the conventional VI (PE), which repeatedly applies the Bellman operator according to the true model $P^\pi$ to the most recent approximation of the value function. We use $T^\pi_P$ to refer to its Bellman operator and to label the corresponding trajectory in the value space. This algorithm converges to the true value function $V^\pi$. The second algorithm is VI (PE) that uses $\hat{P}^\pi$ as the model. This procedure is the basis of the Dyna architecture. We use $T^\pi_{\hat{P}}$ to refer to its Bellman operator and to label the corresponding trajectory in the value space. Due to the error of $T^\pi_{\hat{P}}$ compared to $T^\pi_P$, the algorithm converges to a wrong value function $\hat{V}^\pi$, as both figures show. We observe that even when the model is relatively accurate, as in Figure 1a, the converged value function is quite wrong. This illustrates one limitation of conventional model-based RL algorithms, where an inaccurate model may lead to a significantly inaccurate estimate of the value function.

Figure 1: The value function trajectories of VI (PE) with the true model ($T^\pi_P$), VI (PE) with the approximate model ($T^\pi_{\hat{P}}$), and OS-VI (PE) ($S^\pi$; and $T^\pi_k$ for its inner loop) for a 2-state problem. (a) More accurate model $\hat{P}_{\text{accurate}}$. (b) Less accurate model $\hat{P}_{\text{inaccurate}}$.

The OS-VI algorithm repeatedly applies the Varga operator $S^\pi$ to the most recent approximation of the value function. As discussed earlier, each computation of $S^\pi V_{k-1}$ corresponds to solving an auxiliary MDP $(\mathcal{X}, \mathcal{A}, r_{V_{k-1}}, \hat{P}, \gamma)$. We denote the Bellman operator of this auxiliary MDP by $T^\pi_k$. The figures show the trajectory generated by the iterative application of $S^\pi$ on the most recent value function as well as the trajectory for solving each auxiliary MDP, indicated by $T^\pi_k$. We observe that the OS-VI algorithm converges to the correct value function despite using an incorrect model. When the model is more accurate, very few iterations of OS-VI get a value close to $V^\pi$ (two iterations in Figure 1a); when the model is less accurate, a few more iterations are needed. Compared to VI, at least in these examples, the total number of iterations of OS-VI is significantly smaller.

When the initial value function is $V_0 = 0$, the result of the first iteration of OS-VI is the same value function computed by the VI with the wrong model $\hat{P}^\pi$. This is because $V_1 \leftarrow S^\pi V_0 = (I - \gamma \hat{P}^\pi)^{-1} r^\pi_{V_0}$ and $r^\pi_{V_0} = r^\pi$, so $V_1 = (I - \gamma \hat{P}^\pi)^{-1} r^\pi$, the same solution as the value function obtained using the approximate model $\hat{P}$. In these figures, this shows itself in the overlap of the red arrows followed by $T^\pi_{\hat{P}}$ and the first segment of the orange arrows, which are generated by the repeated application of $T^\pi_1$. In further iterations of OS-VI, the auxiliary MDPs change and the path followed by $T^\pi_k$ ($k \geq 2$) deviates from the solution of the VI with the wrong model.
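The 2-state example is small enough to check numerically. The snippet below is our own reproduction (not the paper's code), using the transition matrices and $\gamma$ stated above and the reward vector as printed in the text; it compares the OS-VI iterates against the exact $V^\pi$ and against the biased fixed point of VI with the wrong model:

```python
import numpy as np

gamma = 0.9
P_pi  = np.array([[0.9, 0.1], [0.1, 0.9]])
P_hat = np.array([[0.85, 0.15], [0.05, 0.95]])   # the "more accurate" model
r_pi  = np.array([1.0, 0.5])                     # reward vector as printed above
I = np.eye(2)

V_true  = np.linalg.solve(I - gamma * P_pi,  r_pi)   # V^pi
V_model = np.linalg.solve(I - gamma * P_hat, r_pi)   # fixed point of VI with the wrong model

V = np.zeros(2)
for k in range(1, 5):
    r_aux = r_pi + gamma * (P_pi - P_hat) @ V         # auxiliary reward r_V
    V = np.linalg.solve(I - gamma * P_hat, r_aux)     # one application of S^pi
    print(k, np.max(np.abs(V - V_true)))              # error essentially vanishes by k = 2,
                                                      # matching the two iterations seen in Fig. 1a
print("bias of VI with the model:", np.max(np.abs(V_model - V_true)))
```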
## 4 Convergence analysis of operator splitting value iteration

In this section, we present the convergence analysis of OS-VI. Our results show that OS-VI has an $O(\gamma'^{\,k})$ convergence rate for an effective discount factor $\gamma'$ that depends on the error between $\hat{P}$ and $P$. For small enough error, $\gamma' < \gamma$ and OS-VI has a faster convergence rate compared to the classic VI, Policy Iteration (PI), and Modified Policy Iteration (MPI), which all have $O(\gamma^k)$ behaviour. We provide results for both the $L_\infty$ and $L_p$ norms.

### 4.1 Convergence of OS-VI for policy evaluation

We study the convergence behaviour of OS-VI (PE) in the presence of error in the value updates. Specifically, we consider that at each iteration $k$, the update (3.2) has an error, i.e.,

$$V_k = S^\pi V_{k-1} + \epsilon^{\text{value}}_k. \qquad (4.1)$$

The error $\epsilon^{\text{value}}_k$ encompasses various sources of errors that might occur in solving the auxiliary MDP $(\mathcal{X}, \mathcal{A}, r_{V_{k-1}}, \hat{P}, \gamma)$. One source is the function approximation error due to using a function approximator to represent $V_k$, which is often required in large state spaces. Another is the estimation (i.e., statistical) error caused by having a finite number of samples, instead of direct access to $P$, in the RL setting. Refer to Munos and Szepesvári [2008], Antos et al. [2008], Farahmand et al. [2016], Chen and Jiang [2019], Fan et al. [2019] for discussions of function approximation and estimation errors in the RL context. In this work, we do not analyze how the number of samples, the function approximator, etc. affect the errors $\epsilon^{\text{value}}_k$. We offer error propagation results similar to Munos [2007] (approximate VI), Munos [2003] (approximate PI), and Scherrer et al. [2015] (approximate modified PI) for approximate OS-VI.

To study the convergence behaviour of OS-VI (PE), let $G^\pi = (I - \gamma \hat{P}^\pi)^{-1} \gamma (P^\pi - \hat{P}^\pi)$. We use the fact that $S^\pi V^\pi = V^\pi$ and write

$$V^\pi - V_k = S^\pi V^\pi - S^\pi V_{k-1} - \epsilon^{\text{value}}_k = G^\pi (V^\pi - V_{k-1}) - \epsilon^{\text{value}}_k \;\Longrightarrow\; \|V^\pi - V_k\|_\infty \leq \|G^\pi\|_\infty \|V^\pi - V_{k-1}\|_\infty + \|\epsilon^{\text{value}}_k\|_\infty. \qquad (4.2)$$

Now, we have that

$$\|G^\pi\|_\infty = \big\| (I - \gamma \hat{P}^\pi)^{-1} \gamma (P^\pi - \hat{P}^\pi) \big\|_\infty \leq \frac{\gamma}{1-\gamma} \big\| P^\pi - \hat{P}^\pi \big\|_\infty, \qquad (4.3)$$

where we used the fact that for any square matrix $F$ with a matrix norm $\|F\|_p < 1$, it holds that $\|(I - F)^{-1}\|_p \leq \frac{1}{1 - \|F\|_p}$ (see Lemma 2.3.3 of Golub and Van Loan 2013), and that the supremum norm of the stochastic matrix $\hat{P}^\pi$ is 1. Assuming that $\|\epsilon^{\text{value}}_k\|_\infty \leq \epsilon^{\text{value}}$ for all $k \geq 1$ and defining the effective discount factor $\gamma' \triangleq \frac{\gamma}{1-\gamma} \|P^\pi - \hat{P}^\pi\|_\infty$, the upper bounds (4.2) and (4.3) lead to

$$\|V^\pi - V_k\|_\infty \leq \gamma'^{\,k} \|V^\pi - V_0\|_\infty + \frac{1 - \gamma'^{\,k}}{1 - \gamma'} \epsilon^{\text{value}}.$$

A few remarks are in order. First, whenever $\gamma' < \gamma$, this is guaranteed to be faster than the convergence rate of the conventional VI, which is $O(\gamma^k)$. This happens if $\|P^\pi - \hat{P}^\pi\|_\infty < 1 - \gamma$. If the model is very accurate, we obtain a much faster rate than VI's. Since each iteration $k$ corresponds to a query to the true model $P$, a faster rate entails that the algorithm requires fewer queries to the expensive model to reach the same level of accuracy. Second, although the model error $\|P^\pi - \hat{P}^\pi\|_\infty$ is a reasonable choice to measure the distances between distributions (it is the maximum of the Total Variation distance between $P^\pi(\cdot|x)$ and $\hat{P}^\pi(\cdot|x)$ over $x$, which itself can be upper bounded by their KL divergence; see the supplementary material), it is somewhat conservative as it takes the supremum over the state space. Likewise, requiring $\|\epsilon^{\text{value}}_k\|_\infty$ to be small is also conservative, as approximating $S^\pi V_{k-1}$ using a function approximator given samples (RL setting) often leads to an $L_p$-norm type of guarantee.
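Before moving to the weighted-norm analysis, here is a small numeric illustration of the sup-norm rate; the numbers ($\gamma = 0.99$ and a model error of $0.005$) are our own illustrative choices, not taken from the paper:

$$\gamma' = \frac{\gamma}{1-\gamma}\,\big\|P^\pi - \hat{P}^\pi\big\|_\infty = \frac{0.99}{0.01}\times 0.005 = 0.495, \qquad \text{so} \qquad \gamma'^{\,k} = 0.495^{k} \ll 0.99^{k} = \gamma^{k}.$$

For instance, reaching a contraction factor of about $10^{-3}$ on the initial error takes roughly $10$ OS-VI iterations ($0.495^{10} \approx 9 \times 10^{-4}$) versus roughly $687$ VI iterations ($0.99^{687} \approx 10^{-3}$), i.e., far fewer queries to the expensive model $P$.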
We now provide a different analysis to address these issues. To present the $L_p$-norm result, we need to define some notation. First, we define the conditional discounted future-state distribution of policy $\pi$ under $\hat{P}$ as the following probability distribution: Given a measurable set $B$, we have

$$\hat{\eta}^\pi(B|x) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \, \mathbb{P}\big( X_t \in B \,\big|\, X_0 = x, \pi, \hat{P} \big),$$

where the chain $(X_t)_{t \geq 0}$ starts from state $x$ and evolves by following policy $\pi$ under the transitions $\hat{P}$. For an arbitrary distribution $\rho$ over the state space, we define the discounted future-state distribution concentration coefficient of the approximate model as

$$\hat{C}^\pi(\rho)^2 \triangleq \int \rho(\mathrm{d}x) \left\| \frac{\mathrm{d}\hat{\eta}^\pi(\cdot|x)}{\mathrm{d}\rho} \right\|_\infty.$$

Here $\frac{\mathrm{d}\hat{\eta}^\pi(\cdot|x)}{\mathrm{d}\rho}$ is the Radon–Nikodym derivative of the distribution $\hat{\eta}^\pi(\cdot|x)$ w.r.t. the distribution $\rho$. It is assumed that for any $x \in \mathcal{X}$, $\hat{\eta}^\pi(\cdot|x) \ll \rho$, i.e., $\hat{\eta}^\pi(\cdot|x)$ is absolutely continuous w.r.t. $\rho$ (otherwise, the coefficient would be set to infinity). This coefficient measures how concentrated the distribution $\hat{\eta}^\pi(\cdot|x)$ is compared to $\rho$, weighted according to the state distribution $\rho$. Similar concentrability coefficients, but not exactly this one, have appeared in error propagation results [Kakade and Langford, 2002, Munos, 2003, 2007, Farahmand et al., 2010, Scherrer et al., 2015]. Finally, we define the weighted $\chi^2$-divergence of $\hat{P}^\pi$ and $P^\pi$ as

$$\chi^2_\rho\big(P^\pi \,\|\, \hat{P}^\pi\big) \triangleq \int \rho(\mathrm{d}x)\, \chi^2\big( P^\pi(\cdot|x) \,\|\, \hat{P}^\pi(\cdot|x) \big) = \int \rho(\mathrm{d}x) \int \frac{\big( \hat{P}^\pi(\mathrm{d}y|x) - P^\pi(\mathrm{d}y|x) \big)^2}{\hat{P}^\pi(\mathrm{d}y|x)}.$$

This notion of model error is less strict in that it does not require an accurate approximation of $P$ in all states. Usually only a subset of the state space is important or even reachable in a problem. The above model error can focus on specific areas of the state space through the choice of the distribution $\rho$. We are now ready to present the main theorem for the approximate OS-VI (PE).

**Theorem 1.** Consider the approximate OS-VI algorithm for PE (4.1). Let $\|\cdot\|$ be either the supremum norm ($\|\cdot\| = \|\cdot\|_\infty$) or $\|\cdot\|_{1,\rho}$ for $\rho$ being an arbitrary distribution over the state space ($\|\cdot\| = \|\cdot\|_{1,\rho}$). Assume that $\|\epsilon^{\text{value}}_k\| \leq \epsilon^{\text{value}}$ for all $k \geq 1$. Furthermore, define the effective discount factor as

$$\gamma' \triangleq \frac{\gamma}{1-\gamma} \cdot \begin{cases} \big\| P^\pi - \hat{P}^\pi \big\|_\infty & (\|\cdot\| = \|\cdot\|_\infty), \\ \sqrt{\hat{C}^\pi(\rho)\, \chi^2_\rho\big(P^\pi \,\|\, \hat{P}^\pi\big)} & (\|\cdot\| = \|\cdot\|_{1,\rho}). \end{cases}$$

For any $k \geq 0$, we have

$$\|V^\pi - V_k\| \leq \gamma'^{\,k} \|V^\pi - V_0\| + \frac{1 - \gamma'^{\,k}}{1 - \gamma'} \epsilon^{\text{value}}.$$

### 4.2 Convergence of OS-VI for control

We turn to analyzing OS-VI for Control. We consider two types of errors: The first is the error between the computed value function and the true optimal value function of the auxiliary MDP, i.e., $V_k - S^* V_{k-1}$. The second is the suboptimality of the obtained policy compared to the optimal policy of the auxiliary MDP, i.e., $S^{\pi_k} V_{k-1} - S^* V_{k-1}$. Concretely, we have

$$V_k = S^* V_{k-1} + \epsilon^{\text{value}}_k, \qquad (4.5)$$

$$S^{\pi_k} V_{k-1} = S^* V_{k-1} + \epsilon^{\text{policy}}_k. \qquad (4.6)$$

We have the following result for the approximate OS-VI (Control).

**Theorem 2.** Consider the approximate OS-VI algorithm for Control (4.5)–(4.6). Let $\|\cdot\|$ be either the supremum norm ($\|\cdot\| = \|\cdot\|_\infty$) or $\|\cdot\|_{1,\rho}$ for $\rho$ being an arbitrary distribution over the state space ($\|\cdot\| = \|\cdot\|_{1,\rho}$). For any $k \geq 1$, let $\Pi_k = \{\pi^*, \pi_k\} \cup \{\pi_V(V_{i-1}) : 1 \leq i < k\}$. Assume that $\|\epsilon^{\text{value}}_k\| \leq \epsilon^{\text{value}}$ for all $k \geq 1$. Furthermore, define the effective discount factor as

$$\gamma' \triangleq \frac{\gamma}{1-\gamma} \cdot \max_{\pi \in \Pi_k} \begin{cases} \big\| P^\pi - \hat{P}^\pi \big\|_\infty & (\|\cdot\| = \|\cdot\|_\infty), \\ \sqrt{\hat{C}^\pi(\rho)\, \chi^2_\rho\big(P^\pi \,\|\, \hat{P}^\pi\big)} & (\|\cdot\| = \|\cdot\|_{1,\rho}). \end{cases}$$

We then have

$$\big\| V^{\pi_k} - V^* \big\| \leq \frac{2\gamma'^{\,k}}{1-\gamma} \|V_0 - V^*\| + \frac{2\gamma' (1 - \gamma'^{\,k-1})}{(1-\gamma')^2} \epsilon^{\text{value}} + \frac{1}{1-\gamma} \big\| \epsilon^{\text{policy}}_k \big\|.$$

We can compare this result with the convergence result of VI. For VI with the supremum norm, following the proof of Equation (2.2) by Munos [2007], we can show that

$$\|V^* - V^{\pi_k}\|_\infty \leq \frac{2\gamma^k}{1-\gamma} \|V^* - V_0\|_\infty + \frac{2\gamma(1-\gamma^{k-1})\,\epsilon^{\text{value}}}{(1-\gamma)^2},$$

with $\|V_i - T^* V_{i-1}\|_\infty \leq \epsilon^{\text{value}}$ for all $i < k$ (a similar result for the $L_p$-norm also holds; see Theorem 5.2 in Munos 2007). For approximate VI, the initial error $\|V^* - V_0\|$ decays at the rate of $O(\gamma^k)$. This should be compared with the $O(\gamma'^{\,k})$ rate of OS-VI. The effect of the per-iteration error $\epsilon^{\text{value}}$ is also similar: approximate VI has a $(1-\gamma)^{-2}$ dependence while approximate OS-VI has $(1-\gamma')^{-2}$. What is remarkable is that, as opposed to $\gamma$, which is a fixed parameter of the problem and can be close to 1, $\gamma'$ can be made arbitrarily close to zero when the approximate model $\hat{P}$ becomes more accurate. The additional information given by $\hat{P}$ allows us to get a much faster rate than VI. Of course, this requires the model to be accurate. An inaccurate model might be detrimental to the convergence rate, and may even lead to divergence. Similar conclusions can be made in comparing OS-VI with Policy Iteration and Modified Policy Iteration, as discussed in the supplementary material.

## 5 Operator splitting Dyna

In the RL setting, we only have access to samples from $P$. MBRL algorithms, such as variants of the Dyna architecture, learn $\hat{P}$ from those samples and use it to find the value function or policy.
The learned model $\hat{P}$ is generally different from $P$ due to the finiteness of the samples as well as the possibility of model approximation error: the true dynamics $P$ may not be representable with the function approximator used to represent $\hat{P}$. This is another way to say that the world may be too big to be represented by our models. An MBRL algorithm that uses $\hat{P}$ in lieu of $P$ does not find the true value of the true MDP. Based on OS-VI, we propose OS-Dyna, a hybrid model-based and model-free RL algorithm that takes advantage of both the true environment and the model in its updates and can converge to the true value function despite using an inaccurate $\hat{P}$.

Algorithm 1: OS-Dyna
1: Initialize $V_0$, $\bar{r} = 0$, and the approximate model $\hat{P}$.
2: for $t = 1, 2, \dots$ do
3: &nbsp;&nbsp; Sample $(X_t, A_t, R_t, X'_t)$ from the environment.
4: &nbsp;&nbsp; Update the model $\hat{P}$ with $(X_t, A_t, R_t, X'_t)$.
5: &nbsp;&nbsp; $\bar{r}(X_t, A_t) \leftarrow \bar{r}(X_t, A_t) + \alpha_t \big( R_t + \gamma V_{t-1}(X'_t) - \gamma \mathbb{E}_{X' \sim \hat{P}(\cdot|X_t,A_t)}[V_{t-1}(X')] - \bar{r}(X_t, A_t) \big)$.
6: &nbsp;&nbsp; $V_t \leftarrow V^\pi(\bar{r}, \hat{P})$ (for PE), or $V_t \leftarrow V^*(\bar{r}, \hat{P})$, $\pi_t \leftarrow \pi^*(\bar{r}, \hat{P})$ (for Control).
7: end for

Learning $\hat{P}$ in OS-Dyna is similar to other MBRL algorithms [Moerland et al., 2022]: one can use various model learning approaches, either based on the maximum likelihood estimate or on a decision-aware model learning approach, to learn the model. Given a learned $\hat{P}$, we can compute $V_k$ from the auxiliary reward function $\bar{r}_k \triangleq r_{V_{k-1}}$ by solving the PE or the Control problem in the auxiliary MDP $(\mathcal{X}, \mathcal{A}, \bar{r}_k, \hat{P}, \gamma)$, as discussed in Section 3. As $V_k$ is a function of $\bar{r}_k$, we focus on how $\bar{r}_k$ should be estimated. The update rule of OS-VI entails that for every $(x, a)$, we have

$$\bar{r}_k(x, a) = r(x, a) + \gamma \big( P(\cdot|x, a) - \hat{P}(\cdot|x, a) \big) V^\pi(\bar{r}_{k-1}, \hat{P}), \quad \text{(Policy Evaluation)} \qquad (5.1)$$

$$\bar{r}_k(x, a) = r(x, a) + \gamma \big( P(\cdot|x, a) - \hat{P}(\cdot|x, a) \big) V^*(\bar{r}_{k-1}, \hat{P}). \quad \text{(Control)} \qquad (5.2)$$

We update our estimate of $\bar{r}$ using samples, as shall be discussed soon, and then the value function is updated to $V^\pi(\bar{r}, \hat{P})$ (PE) or $V^*(\bar{r}, \hat{P})$ (Control) with the most recent estimate of $\bar{r}$. The challenge is that the above update rules need access to the distribution $P(\cdot|x, a)$ for every $(x, a)$, while we only have samples from $P$ at some $(x, a)$ pairs. Fortunately, this challenge has been tackled in developing sample-based algorithms based on the classic VI:

$$\forall (x, a): \quad Q_k(x, a) = r(x, a) + \gamma P(\cdot|x, a) V_{k-1}, \qquad (5.3)$$

where $V_{k-1} = Q_{k-1}(x, \pi(x))$ in PE and $V_{k-1} = \max_a Q_{k-1}(x, a)$ in Control. There are multiple approaches to developing sample-based algorithms based on (5.3), such as Fitted Value Iteration and Stochastic Approximation (SA) [Borkar, 2008]. In this paper we use SA to develop OS-Dyna, but we point out that other algorithms and techniques can also be applied to develop other versions of OS-Dyna. The key step in SA is to use samples to form an unbiased estimate of the intended update value. For a step in the true environment leading to a $(X_t, A_t, R_t, X'_t)$ tuple, we can form the estimate

$$Y_t = R_t + \gamma V(X'_t) - \gamma \mathbb{E}_{X' \sim \hat{P}(\cdot|X_t,A_t)}[V(X')],$$

where the expectation can also be estimated by samples from $\hat{P}(\cdot|X_t, A_t)$. This estimate $Y_t$ of the update rule can then be used to update $\bar{r}$. As an example, for a finite state-action problem, the update rule is

$$\bar{r}(X_t, A_t) \leftarrow \bar{r}(X_t, A_t) + \alpha_t \big( Y_t - \bar{r}(X_t, A_t) \big), \qquad (5.4)$$

where $\alpha_t$ is the learning rate. The final procedure of OS-Dyna is presented in Algorithm 1.
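For illustration, a sketch of the core reward-correction update (line 5 of Algorithm 1, i.e. update (5.4)) for a finite MDP; this is our own illustrative code, which assumes the learned model is stored as a tabular array `P_hat` and that the expectation over $\hat{P}$ is computed exactly rather than sampled:

```python
import numpy as np

def os_dyna_update(r_bar, V_prev, P_hat, transition, gamma, alpha):
    """One stochastic-approximation update of the auxiliary reward estimate r_bar.
    r_bar: (X, A) current estimate; V_prev: (X,) latest value function;
    P_hat: (A, X, X) learned model; transition: (x, a, reward, x_next)."""
    x, a, reward, x_next = transition
    # Unbiased target Y_t = R_t + gamma*V(X'_t) - gamma*E_{X'~P_hat(.|x,a)}[V(X')]
    target = reward + gamma * V_prev[x_next] - gamma * P_hat[a, x] @ V_prev
    r_bar[x, a] += alpha * (target - r_bar[x, a])
    return r_bar
```

The planning step (line 6) then solves the auxiliary MDP $(\mathcal{X}, \mathcal{A}, \bar{r}, \hat{P}, \gamma)$ with any standard method, e.g. a tabular VI run with `r_bar` and `P_hat` as in the earlier sketches.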
## 6 Experiments

We evaluate both OS-VI and OS-Dyna in a finite MDP and compare them with existing methods. Here we present the results for the Control problem on a modified cliffwalk environment in a 6×6 grid with 4 actions (UP, DOWN, LEFT, RIGHT). We postpone studying the PE problem, the results for other environments, and other relevant details to the supplementary material.

Our convergence analysis shows that the convergence rates of our algorithms depend on the accuracy of $\hat{P}$. To test OS-VI and OS-Dyna with models of different accuracies, we introduce the smoothed model $\hat{P}$ of the transitions $P$ with smoothing parameter $\lambda$ as

$$\hat{P}(\cdot|x, a; P, \lambda) = (1 - \lambda) P(\cdot|x, a) + \lambda\, \mathcal{U}\big( \{ x' : P(x'|x, a) > 0 \} \big), \qquad (6.1)$$

where $\mathcal{U}(A)$ for some set $A$ is the uniform distribution over $A$. Here, $\lambda$ allows adjusting the amount of error introduced in $\hat{P}$ w.r.t. $P$. If $\lambda = 0$, then $\hat{P} = P$ is the accurate model, and if $\lambda = 1$, then $\hat{P}$ is uniform over the next states that are possible under $P$.

Figure 2: (Left) Normalized error $\|V^{\pi_k} - V^*\|_1 / \|V^*\|_1$ versus the number of iterations $k$ for OS-VI, VI, and the optimal policy of the model $\hat{P}$, with $\lambda \in \{0, 0.1, 0.5\}$. (Right) Comparison of OS-Dyna with Dyna and Q-Learning in the RL setting versus the number of environment samples $t$, with $\lambda \in \{0, 0.1, 0.5\}$.

The left plot in Figure 2 shows the convergence of OS-VI compared to VI and to the solutions the model itself would lead to. The plot shows the normalized error of $V^{\pi_k}$ w.r.t. $V^*$, i.e., $\|V^{\pi_k} - V^*\|_1 / \|V^*\|_1$. It can be seen that OS-VI has faster convergence with more accurate models and introduces acceleration compared to VI across different model errors. Note that the convergence of OS-VI has been achieved despite the error in the model. The dashed lines show how a fully model-based algorithm, which only uses $\hat{P}$, would obtain a suboptimal solution.

We also compare OS-Dyna with Dyna and Q-Learning in the RL setting. At each iteration $t$, the algorithms are given a sample $(X_t, A_t, R_t, X'_t)$ where $X_t, A_t$ are selected uniformly at random. For OS-Dyna and Dyna we use the smoothed Maximum-Likelihood Estimation (MLE) model. If $P_{\text{MLE}}$ is the current MLE estimate of the environment transitions, OS-Dyna and Dyna use $\hat{P}(P_{\text{MLE}}, \lambda)$ defined in (6.1) as their models. The learning rate is a constant $\alpha$ for iterations $t \leq N$ and then decays in the form of $\alpha_t = \alpha / (t - N)$ afterwards. We have fine-tuned the learning rate schedule for each algorithm separately for the best results. The right plot in Figure 2 shows the results for the RL setting. We evaluate the expected return of the policy at iteration $t$ in the initial state of the environment, i.e., $V^{\pi_t}(0)$. Again, OS-Dyna has converged to the optimal policy much faster than Q-Learning. Unlike OS-Dyna, Dyna has failed to find the optimal policy in the presence of model error. The results show that OS-Dyna can effectively converge faster than Q-Learning without introducing bias to the final solution due to model error.
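A sketch of the smoothed model (6.1) for a tabular environment; this is our own illustrative code and assumes the true dynamics are given as an `(A, X, X)` NumPy array:

```python
import numpy as np

def smoothed_model(P, lam):
    """Smoothed model (6.1): mixes P with a uniform distribution over the
    next states that P itself can reach. P: (A, X, X), lam in [0, 1]."""
    support = (P > 0).astype(float)                          # reachable next states
    uniform = support / support.sum(axis=-1, keepdims=True)  # U{x' : P(x'|x,a) > 0}
    return (1 - lam) * P + lam * uniform
```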
## 7 Conclusion

This paper introduced the Operator Splitting Value Iteration (OS-VI) algorithm, which can benefit from an approximate model $\hat{P} \approx P$ to accelerate the convergence of the approximate value to the true value function in terms of the number of queries to the true model $P$. With a small model error, its convergence rate is exponentially faster than that of well-known dynamic programming algorithms such as Value Iteration and Policy Iteration. We also proposed OS-Dyna as a hybrid model-based/model-free algorithm that can bring in the benefits of a model-based RL algorithm without converging to a biased solution, as Dyna or any other purely model-based RL algorithm does.

This paper opens up several future directions. Empirically studying the algorithms on problems with large state spaces, for which a function approximator such as a DNN is required, is an obvious one. This is postponed to future work, as our aim was to build the mathematical foundation and conduct experiments without worrying about challenges such as the optimization of a DNN. There are other algorithmic and theoretical directions to be pursued. One is exploring the space of splittings of $I - \gamma P^\pi$. Another is whether we can design Operator Splitting variants of other DP algorithms such as Policy Iteration and Modified Policy Iteration, and study their convergence behaviour.

## Acknowledgments and Disclosure of Funding

We would like to thank the anonymous reviewers for their comments that helped us improve the clarity of the paper, as well as other members of the Adaptive Agents Lab who provided feedback on a draft of this paper. AMF acknowledges the funding from the Canada CIFAR AI Chairs program, as well as the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Discovery Grant program. AR was partially supported by Borealis AI through the Borealis AI Global Fellowship Award. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

## References

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89–129, 2008.

Pierre-Luc Bacon. Temporal Representation Learning. PhD thesis, McGill University, 2018.

Pierre-Luc Bacon and Doina Precup. A matrix splitting perspective on planning with options. In Continual Learning and Deep Networks Workshop at NIPS, 2016.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research (JMLR), 6:503–556, 2005.

Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. A theoretical analysis of deep Q-learning. arXiv:1901.00137v3, 2019.

Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. Error propagation for approximate policy and value iteration. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems (NeurIPS-23), pages 568–576, 2010.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research (JMLR), 17(139):1–66, 2016.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 4th edition, 2013.

Geoffrey Gordon. Stable function approximation in dynamic programming. In International Conference on Machine Learning (ICML), 1995.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 267–274, 2002.

Harold J. Kushner and Allan J. Kleinman. Accelerated procedures for the solution of discrete Markov control problems. IEEE Transactions on Automatic Control, 16(2):147–152, April 1971.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey. arXiv:2006.16712v4, 2022.

Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 560–567, 2003.

Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM Journal on Control and Optimization, pages 541–561, 2007.

Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research (JMLR), 9:815–857, 2008.

Evan L. Porteus. Bounds and transformations for discounted finite Markov decision chains. Operations Research, 23(4):761–784, 1975.

Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics (SIAM), 2nd edition, 2003.

Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research (JMLR), 16(49):1629–1676, 2015.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning (ICML), 1990.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2019.

Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

Richard S. Varga. Matrix Iterative Analysis. Springer-Verlag, 2nd edition, 2000.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] At the end of Section 4 we show that the convergence and speed-up need an accurate enough model.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] There are no immediate potential negative societal impacts.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The details are in the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The details are in the supplementary material.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A] The experiments are simple and can be run on a personal computer.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]