# Regularized Q-Learning

Han-Dong Lim, Electrical Engineering, KAIST, limaries30@kaist.ac.kr
Donghwan Lee, Electrical Engineering, KAIST, donghwan@kaist.ac.kr

Q-learning is a widely used algorithm in the reinforcement learning (RL) community. Under the lookup-table setting, its convergence is well established. However, its behavior is known to be unstable under linear function approximation. This paper develops a new Q-learning algorithm, called RegQ, that converges when linear function approximation is used. We prove that simply adding an appropriate regularization term ensures convergence of the algorithm. Its stability is established using a recent analysis tool based on switching system models. Moreover, we experimentally show that RegQ converges in environments where Q-learning with linear function approximation was known to diverge. An error bound on the solution to which the algorithm converges is also given.

1 Introduction

Recently, RL has shown great success in various fields. For instance, Mnih et al. [2015] achieved human-level performance in several video games in the Atari benchmark [Bellemare et al., 2013]. Since then, research on deep RL algorithms has made significant progress [Lan et al.]. Although great success has been achieved in practice, there is still a gap between theory and this practical success. Especially when off-policy learning, function approximation, and bootstrapping are used together, the algorithm may show unstable behavior. This phenomenon is called the deadly triad [Sutton and Barto, 2018]. Famous counterexamples are given in Baird [1995], Tsitsiklis and Van Roy [1997].

For policy evaluation, especially for the temporal-difference (TD) learning algorithm, several algorithms have been proposed to resolve the deadly triad issue. Bradtke and Barto [1996] use the least-squares method to compute a solution of TD-learning, but it suffers from $O(h^2)$ time complexity, where $h$ is the number of features. Maei [2011], Sutton et al. [2009] developed gradient-descent-based methods which minimize the mean squared projected Bellman error. Ghiassian et al. [2020] added a regularization term to the TD Correction (TDC) algorithm, which uses a single time-scale step-size. Lee et al. [2022] introduced several variants of the gradient TD (GTD) algorithm under control-theoretic frameworks. Sutton et al. [2016] re-weight some states to match the on-policy distribution to stabilize off-policy TD-learning. Bharadwaj Diddigi et al. [2020] use $\ell_2$ regularization to propose a new convergent off-policy TD-learning algorithm. Mahadevan et al. [2014] studied regularization of off-policy TD-learning through the lens of primal-dual methods.

First presented by Watkins and Dayan [1992], Q-learning also suffers from divergence issues under the deadly triad. While there are convergence results under the look-up table setting [Watkins and Dayan, 1992, Jaakkola et al., 1994, Borkar and Meyn, 2000, Lee and He, 2020], even with simple linear function approximation, convergence is only guaranteed under strong assumptions [Melo et al., 2008, Lee and He, 2020, Yang and Wang, 2019].

The main goal of this paper is to propose a practical Q-learning algorithm, called regularized Q-learning (RegQ), that guarantees convergence under linear function approximation. We prove its convergence using the ordinary differential equation (O.D.E.) analysis framework in Borkar and Meyn [2000] together with the switching system approach developed in Lee and He [2020].
As in Lee and He [2020], we construct upper and lower comparison systems and prove their global asymptotic stability based on switching system theory. Compared to the standard Q-learning in Watkins and Dayan [1992], the only difference lies in an additional $\ell_2$ regularization term, which makes the algorithm relatively simple. Moreover, compared to the previous works in Carvalho et al. [2020], Maei et al. [2010], our algorithm is single time-scale and, hence, shows faster convergence rates experimentally. Our algorithm directly uses bootstrapping rather than circumventing the issue in the deadly triad. Therefore, it could give new insight into training reinforcement learning algorithms with function approximation without using the so-called target network technique introduced in Mnih et al. [2015]. The main contributions of this paper are summarized as follows:

1. A new single time-scale Q-learning algorithm with linear function approximation is proposed.
2. We provide a theoretical analysis of the solution of the projected Bellman equation when a regularization term is included.
3. We prove the convergence of the proposed algorithm based on the O.D.E. approach together with the switching system model in Lee and He [2020].
4. We experimentally show that our algorithm converges faster than the two time-scale Q-learning algorithms in Carvalho et al. [2020], Maei et al. [2010].

Related works: Several works [Melo et al., 2008, Lee and He, 2020, Yang and Wang, 2019] have relied on strong assumptions to guarantee convergence of Q-learning under linear function approximation. Melo et al. [2008] adopt an assumption on the relation between the behavior policy and the target policy to guarantee convergence, which is not practical in general. Lee and He [2020] make a similar assumption to that of Melo et al. [2008] to ensure convergence with the so-called switching system approach. Yang and Wang [2019] considered a transition matrix that can be represented by the feature values, which restricts the class of Markov chains. Motivated by the empirical success of deep Q-learning in Mnih et al. [2015], recent works in Zhang et al. [2021], Carvalho et al. [2020], Agarwal et al., Chen et al. [2023] use the target network to circumvent the bootstrapping issue and guarantee convergence. Carvalho et al. [2020] designed a two time-scale learning method motivated by the target network method. Zhang et al. [2021] use $\ell_2$ regularization with the target network, but a projection step is involved, which makes it difficult to implement in practice; moreover, it also relies on a two time-scale learning method. Chen et al. [2023] used a target network and a truncation method to address the divergence issue. Agarwal et al. additionally use the so-called experience replay technique with the target network; furthermore, optimality is only guaranteed under a specific type of Markov chain. Even though the target network update can guarantee stability, it often leads to a slow convergence rate [Kim et al., 2019]. Maei et al. [2010] suggested the so-called Greedy-GQ (gradient Q-learning) algorithm, but due to non-convexity of the objective function, it can converge to a local optimum. Lu et al. [2021] used a linear programming approach [Manne, 1960] to design a convergent Q-learning algorithm for deterministic control systems. Devraj and Meyn [2017] proposed a Q-learning algorithm that minimizes the asymptotic variance.
However, it requires the assumption that the number of policy changes is finite, and involves a matrix inversion at each iteration. Meyn [2023] introduced an optimistic training scheme with a modified Gibbs policy for Q-learning with linear function approximation, which guarantees existence of a solution of the projected Bellman equation, but not convergence. Geist et al. [2019], Xi et al. [2024] considered regularization on the policy, which addresses a different scenario from the regularization in our work.

$\ell_2$ regularization has been actively explored in the RL literature. Farahmand et al. [2016] proposed a regularized policy iteration algorithm that addresses a regularized policy evaluation problem followed by a policy improvement step, and derived a performance error bound. Zhang et al. [2021] studied a regularized projected Bellman equation and proved that, inside a certain ball, the solution of the regularized projected Bellman equation exists and is unique. Manek and Kolter [2022] studied fixed points of off-policy TD-learning with regularization, showing that the bias of the solution caused by the regularization can be large under certain scenarios. Nonetheless, the regularization method has been widely used in practice [Farebrother et al., 2018, Piché et al., 2021].

2 Preliminaries and Notations

2.1 Markov Decision Process

We consider an infinite-horizon Markov Decision Process (MDP), which consists of a tuple $M = (S, A, P, r, \gamma)$, where the state space $S$ and action space $A$ are finite sets, $P$ denotes the transition probability, $r : S \times A \times S \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. Given a stochastic policy $\pi : S \to \mathcal{P}(A)$, where $\mathcal{P}(A)$ is the set of probability distributions over $A$, the agent at the current state $s_k$ selects an action $a_k \sim \pi(\cdot|s_k)$; then the agent's state changes to the next state $s_{k+1} \sim P(\cdot|s_k, a_k)$, and it receives the reward $r_{k+1} := r(s_k, a_k, s_{k+1})$. A deterministic policy is a special stochastic policy, which can be defined simply as a mapping $\pi : S \to A$. The objective of the MDP is to find a deterministic optimal policy, denoted by $\pi^*$, such that the cumulative discounted reward over the infinite time horizon is maximized, i.e., $\pi^* := \arg\max_{\pi} \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_k \,\middle|\, \pi\right]$, where $(s_0, a_0, s_1, a_1, \ldots)$ is a state-action trajectory generated by the Markov chain under policy $\pi$, and $\mathbb{E}[\cdot|\pi]$ is the expectation conditioned on the policy $\pi$. The Q-function under policy $\pi$ is defined as $Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_k \,\middle|\, s_0 = s, a_0 = a, \pi\right]$ for all $(s, a) \in S \times A$, and the optimal Q-function is defined as $Q^*(s, a) = Q^{\pi^*}(s, a)$ for all $(s, a) \in S \times A$. Once $Q^*$ is known, an optimal policy can be retrieved by the greedy action, i.e., $\pi^*(s) = \arg\max_{a \in A} Q^*(s, a)$. Throughout, we assume that the Markov chain is time-homogeneous so that the MDP is well posed, which is standard in the literature. It is known that the optimal Q-function satisfies the so-called Bellman equation

$$Q^*(s, a) = \mathbb{E}\left[r_{k+1} + \gamma \max_{a_{k+1} \in A} Q^*(s_{k+1}, a_{k+1}) \,\middle|\, (s_k, a_k) = (s, a)\right] =: T Q^*(s, a), \quad (1)$$

where $T$ is called the Bellman operator.

2.2 Notations

In this paper, we will use an O.D.E. model [Borkar and Meyn, 2000] of Q-learning to analyze its convergence. To this end, it is useful to introduce some notations that simplify the overall expressions. Throughout the paper, $e_a$ and $e_s$ denote the $a$-th and $s$-th canonical basis vectors in $\mathbb{R}^{|A|}$ and $\mathbb{R}^{|S|}$, respectively, and $\otimes$ stands for the Kronecker product. Let us introduce the following notations:

$$P := \begin{bmatrix} P_1 \\ \vdots \\ P_{|A|} \end{bmatrix} \in \mathbb{R}^{|S||A| \times |S|}, \quad R := \begin{bmatrix} R_1 \\ \vdots \\ R_{|A|} \end{bmatrix} \in \mathbb{R}^{|S||A|}, \quad Q := \begin{bmatrix} Q_1 \\ \vdots \\ Q_{|A|} \end{bmatrix} \in \mathbb{R}^{|S||A|},$$
$$D_a := \mathrm{diag}\big(d(1, a), \ldots, d(|S|, a)\big) \in \mathbb{R}^{|S| \times |S|}, \quad D := \mathrm{diag}\big(D_1, \ldots, D_{|A|}\big) \in \mathbb{R}^{|S||A| \times |S||A|},$$
where $P_a \in \mathbb{R}^{|S| \times |S|}$, $a \in A$, is the state transition matrix whose $i$-th row and $j$-th column component denotes the probability of transitioning to state $j$ when action $a$ is taken at state $i$, $P^{\pi} \in \mathbb{R}^{|S||A| \times |S||A|}$ represents the state-action transition matrix under policy $\pi$, i.e., $(e_s \otimes e_a)^{\top} P^{\pi} (e_{s'} \otimes e_{a'}) = \mathbb{P}[s_{k+1} = s', a_{k+1} = a' \mid s_k = s, a_k = a, \pi]$, $Q_a = Q(\cdot, a) \in \mathbb{R}^{|S|}$ for $a \in A$, and $R_a(s) := \mathbb{E}[r(s, a, s') \mid s, a]$ for $s \in S$. Moreover, $d(\cdot, \cdot)$ is the state-action visit distribution from which the i.i.d. random variables $\{(s_k, a_k)\}_{k=0}^{\infty}$ are sampled, i.e., $d(s, a) = \mathbb{P}[s_k = s, a_k = a]$ for all $(s, a) \in S \times A$. With a slight abuse of notation, $d$ will also be used to denote the vector $d \in \mathbb{R}^{|S||A|}$ such that $d^{\top}(e_s \otimes e_a) = d(s, a)$ for all $(s, a) \in S \times A$.

In this paper, we represent a policy in matrix form in order to formulate a switching system model. In particular, for a given policy $\pi$, define the matrix $\Pi_{\pi} \in \mathbb{R}^{|S| \times |S||A|}$:

$$\Pi_{\pi} := \begin{bmatrix} (e_{\pi(1)} \otimes e_1)^{\top} \\ (e_{\pi(2)} \otimes e_2)^{\top} \\ \vdots \\ (e_{\pi(|S|)} \otimes e_{|S|})^{\top} \end{bmatrix}.$$

Then, we can prove that for any deterministic policy $\pi$, we have $\Pi_{\pi} Q = [Q(1, \pi(1)) \; Q(2, \pi(2)) \; \cdots \; Q(|S|, \pi(|S|))]^{\top}$. For simplicity, let $\Pi_Q := \Pi_{\pi}$ when $\pi(s) = \arg\max_{a \in A} Q(s, a)$. Moreover, we can prove that for any deterministic policy $\pi$, $P^{\pi} = P \Pi_{\pi} \in \mathbb{R}^{|S||A| \times |S||A|}$, where $P^{\pi}$ is the state-action transition probability matrix. Using the notations introduced above, the Bellman equation in (1) can be compactly written as $Q^* = \gamma P \Pi_{Q^*} Q^* + R =: T Q^*$, where $\pi_{Q^*}$ is the greedy policy defined as $\pi_{Q^*}(s) = \arg\max_{a \in A} Q^*(s, a)$.

2.3 Q-learning with linear function approximation

Q-learning is a widely used model-free algorithm to find $Q^*$, whose update is given as

$$Q_{k+1}(s_k, a_k) \leftarrow Q_k(s_k, a_k) + \alpha_k \delta_k, \quad (2)$$

where $\delta_k = r_{k+1} + \gamma \max_{a \in A} Q_k(s_{k+1}, a) - Q_k(s_k, a_k)$ is called the TD error. Each update uses an i.i.d. sample $(s_k, a_k, r_{k+1}, s_{k+1})$, where $(s_k, a_k)$ is sampled from a state-action distribution $d(\cdot, \cdot)$. Here, we assume that the step-size is chosen to satisfy the so-called Robbins-Monro condition [Robbins and Monro, 1951]: $\alpha_k > 0$, $\sum_{k=0}^{\infty} \alpha_k = \infty$, $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$.

When the state and action spaces are too large, the memory and computational complexities usually become intractable. In such a case, function approximation is commonly used to approximate the Q-function [Mnih et al., 2015, Hessel et al., 2018]. Linear function approximation is one of the simplest function approximation approaches. In particular, we use the feature matrix $X \in \mathbb{R}^{|S||A| \times h}$ and parameter vector $\theta \in \mathbb{R}^h$ to approximate the Q-function, i.e., $Q \approx X\theta$, where the feature matrix is expressed as $X := [x(1, 1) \; \cdots \; x(1, |A|) \; \cdots \; x(|S|, |A|)]^{\top} \in \mathbb{R}^{|S||A| \times h}$. Here, $x(\cdot, \cdot) \in \mathbb{R}^h$ is called the feature vector, and $h$ is a positive integer with $h \ll |S||A|$. The corresponding greedy policy becomes $\pi_{X\theta}(s) = \arg\max_{a \in A} x(s, a)^{\top}\theta$. Note that the number of policies characterized by the greedy policy is finite. This is because the policy is invariant under constant multiplications, and there exists a finite number of sectors on which the policy is invariant. Next, we summarize some standard assumptions adopted throughout this paper.

Assumption 2.1. The state-action visit distribution is positive, i.e., $d(s, a) > 0$ for all $(s, a) \in S \times A$.

Assumption 2.2. The feature matrix $X$ has full column rank and is a non-negative matrix. Moreover, the columns of $X$ are orthogonal.

Assumption 2.3 (Boundedness of the feature matrix and reward). There exist constants $X_{\max} > 0$ and $R_{\max} > 0$ such that $\max(\|X\|_{\infty}, \|X^{\top}\|_{\infty}) < X_{\max}$ and $\|R\|_{\infty} < R_{\max}$.

We note that, except for the orthogonality of the feature matrix in Assumption 2.2, the above assumptions are commonly adopted in the literature, e.g., Carvalho et al. [2020], Lee and He [2020].
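To make the setup in Section 2.3 concrete, here is a minimal Python sketch of the semi-gradient update (2) under linear function approximation. The function and variable names and the feature-map interface are illustrative assumptions, not anything specified in the paper.

```python
import numpy as np

def q_learning_lfa_step(theta, x, s, a, r, s_next, alpha, gamma, n_actions):
    """One semi-gradient Q-learning update with linear features, cf. update (2).

    theta : (h,) parameter vector; Q(s, a) is approximated by x(s, a)^T theta
    x     : feature map, x(s, a) -> (h,) numpy array
    """
    q_sa = x(s, a) @ theta                                        # current estimate Q_theta(s, a)
    q_next = max(x(s_next, b) @ theta for b in range(n_actions))  # max_a' Q_theta(s', a')
    delta = r + gamma * q_next - q_sa                             # TD error delta_k
    return theta + alpha * delta * x(s, a)                        # update (2) in parameter space

# Example: with (hypothetical) one-hot features over the |S||A| state-action pairs,
#   x = lambda s, a: np.eye(nS * nA)[s * nA + a]
# the update reduces to tabular Q-learning.
```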
Moreover, under Assumption 2.1, $D$ is a nonsingular matrix with strictly positive diagonal elements.

Lemma 2.4 (Gosavi [2006]). Under Assumption 2.3, $Q^*$ is bounded, i.e., $\|Q^*\|_{\infty} \le \frac{R_{\max}}{1 - \gamma}$.

The proof of Lemma 2.4 comes from the fact that, under the discounted infinite-horizon setting, $Q^*$ can be expressed as an infinite sum of a geometric sequence.

2.4 Switching System

In this paper, we consider a particular system, called the switched linear system [Liberzon, 2003],

$$\frac{d}{dt} x_t = A_{\sigma_t} x_t, \quad x_0 = z \in \mathbb{R}^n, \quad t \in \mathbb{R}_+, \quad (3)$$

where $x_t \in \mathbb{R}^n$ is the state, $\mathcal{M} := \{1, 2, \ldots, M\}$ is called the set of modes, $\sigma_t \in \mathcal{M}$ is called the switching signal, and $\{A_{\sigma}, \sigma \in \mathcal{M}\}$ are called the subsystem matrices. The switching signal can be either arbitrary or controlled by the user under a certain switching policy. Stability and stabilization of (3) have been widely studied for decades. Still, finding a practical and effective condition for them is known to be a challenging open problem. Contrary to linear time-invariant systems, even if each subsystem matrix $A_{\sigma}$ is Hurwitz, the overall switching system may not be stable in general. This tells us that tools from linear system theory cannot be directly applied to conclude stability of the switching system. Another approach is to use Lyapunov theory [Khalil, 2002]. From standard results in control system theory, finding a common Lyapunov function ensures stability of the switching system. If the switching system consists of matrices with strictly negatively row dominating diagonals, defined in Definition A.5 in the Appendix, or negative-definite matrices, we can always find a common (piecewise) quadratic Lyapunov function to ensure its stability. We use this fact to prove the convergence of the proposed algorithm.

Lemma 2.5. Consider the switched system in (3). Suppose one of the following two conditions holds: 1) each $A_{\sigma}$ for $\sigma \in \mathcal{M}$ has a strictly negatively row dominating diagonal, i.e., $[A_{\sigma}]_{ii} + \sum_{j \in \{1,2,\ldots,n\} \setminus \{i\}} |[A_{\sigma}]_{ij}| < 0$ for all $1 \le i \le n$; 2) $A_{\sigma} + A_{\sigma}^{\top} \prec 0$ for all $\sigma \in \mathcal{M}$. Then, the origin of (3) is asymptotically stable.

The proof is given in Appendix A.4.

3 Projected Bellman equation

In this section, we introduce the notion of the projected Bellman equation with a regularization term and establish connections between it and the proposed algorithm. Moreover, we briefly discuss the existence and uniqueness of the solution of the projected Bellman equation, and provide an example to illustrate these points.

3.1 Projected Bellman equation (PBE)

When using linear function approximation, since the true action value may not lie in the subspace spanned by the feature vectors, a solution of the Bellman equation may not exist in general. To resolve this issue, a standard approach is to consider the projected Bellman equation (PBE) defined as

$$X\theta^* = \Gamma T X\theta^*, \quad (4)$$

where $\Gamma := X(X^{\top} D X)^{-1} X^{\top} D$ is the weighted Euclidean projection, with respect to the state-action visit distribution, onto the subspace spanned by the feature vectors, and $T X\theta = \gamma P \Pi_{X\theta} X\theta + R$. In this case, there are more chances for a solution satisfying the PBE to exist. Still, there may exist cases where the PBE does not admit a solution. To proceed, letting $A_{\pi_{X\theta}} := X^{\top} D X - \gamma X^{\top} D P \Pi_{X\theta} X$ and $b := X^{\top} D R$, we can rewrite (4) equivalently as

$$X\theta^* = X(X^{\top} D X)^{-1} X^{\top} D (\gamma P \Pi_{X\theta^*} X\theta^* + R) \iff A_{\pi_{X\theta^*}} \theta^* = b. \quad (5)$$

Furthermore, we use the simplified notation $C := X^{\top} D X$. A potential deterministic algorithm to solve the above equation is

$$\theta_{k+1} = \theta_k + \alpha_k \big( b - A_{\pi_{X\theta_k}} \theta_k \big). \quad (6)$$
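For concreteness, the following sketch, under illustrative assumptions (a small random MDP, random non-negative features, a fixed step-size, and an $a$-major state-action stacking with index $a|S| + s$), assembles $b$, $A_{\pi_{X\theta}}$, and the greedy-policy matrix $\Pi_{X\theta}$ and runs the iteration (6). As discussed next, this iteration need not converge, which is precisely what motivates the regularization introduced below.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, h, gamma = 4, 2, 3, 0.9

P_a = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P_a[a, s, :] = P(. | s, a)
R = rng.uniform(size=nS * nA)                     # stacked reward vector
d = rng.dirichlet(np.ones(nS * nA))               # state-action distribution
X = np.abs(rng.normal(size=(nS * nA, h)))         # non-negative feature matrix
D = np.diag(d)
P = np.vstack([P_a[a] for a in range(nA)])        # stacked P in R^{|S||A| x |S|}

def Pi_greedy(theta):
    """Greedy-policy matrix Pi_{X theta} in R^{|S| x |S||A|}."""
    Q = (X @ theta).reshape(nA, nS)               # Q[a, s] under the a-major stacking
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, np.argmax(Q[:, s]) * nS + s] = 1.0
    return Pi

b = X.T @ D @ R
theta = np.zeros(h)
for _ in range(5000):                             # deterministic iteration (6); may diverge
    A_pi = X.T @ D @ X - gamma * X.T @ D @ P @ Pi_greedy(theta) @ X
    theta = theta + 0.1 * (b - A_pi @ theta)
```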
The iteration in (6) repeatedly solves a linear or nonlinear equation and is a widely used scheme called Richardson iteration [Kelley, 1995]. If it converges, i.e., $\theta_k \to \theta^*$ as $k \to \infty$, then it is clear that $\theta^*$ solves (5). In this paper, the proposed algorithm is a stochastic algorithm that solves the modified equation

$$b - (A_{\pi_{X\theta^*_{\eta}}} + \eta I)\theta^*_{\eta} = 0, \quad (7)$$

where $I$ is the $h \times h$ identity matrix and $\eta \ge 0$ is a weight on the regularization term. We could use $\eta C$ instead of $\eta I$ as the regularization term, but $\eta C$ is known to solve an MDP with a modified discount factor [Chen et al., 2023]. Similar to (6), the corresponding deterministic algorithm is

$$\theta_{k+1} = \theta_k + \alpha_k \big( b - (A_{\pi_{X\theta_k}} + \eta I)\theta_k \big). \quad (8)$$

If it converges, i.e., $\theta_k \to \theta^*_{\eta}$ as $k \to \infty$, then it is clear that $\theta^*_{\eta}$ solves (7).

3.2 Regularized projected Bellman equation

The equation (7) can be written as the regularized projected Bellman equation (RPBE)

$$X\theta^*_{\eta} = \Gamma_{\eta} T X\theta^*_{\eta}, \quad (9)$$
$$\Gamma_{\eta} := X(X^{\top} D X + \eta I)^{-1} X^{\top} D. \quad (10)$$

Figure 1: Illustrative explanation of the regularized projection. (a) Regularized projection, where $\mathcal{C}(X)$ denotes the range space of $X$. (b) Regularized projection: one-dimensional case. (c) Boundedness of the projection.

Figure 1c illustrates that, as $\eta$ increases, $\gamma \Gamma_{\eta}$ can potentially move outside of the unit ball satisfying $\|x\| \le 1$; this phase is indicated by the term blowing-up phase. The quantity $\gamma \Gamma_{\eta}$ may indeed blow up initially as $\eta$ grows. However, since $\lim_{\eta \to \infty} \gamma \Gamma_{\eta} = 0$, we know that $\gamma \Gamma_{\eta}$ will eventually converge to the origin and move inside the unit ball. This behavior is indicated by the shrinking phase in the figure. The proof of the equivalence between (7) and (9) is given in Lemma A.12 in Appendix Section A.3. The matrix $\Gamma_{\eta}$ can be viewed as a modified projection operator, which will be called the regularized projection. It can be interpreted as a projection with a regularization term:

$$\Gamma_{\eta} x = X \arg\min_{\theta \in \mathbb{R}^h} \left( \frac{1}{2} \|x - X\theta\|_D^2 + \frac{\eta}{2} \|\theta\|_2^2 \right).$$

The concept is illustrated in Figure 1a. Before moving forward, some natural questions arise: How do $\theta^*$ and $\theta^*_{\eta}$ differ? Furthermore, which conditions determine the existence and uniqueness of the solutions of (4) and (9)? Partial answers are given in the sequel. First, let us assume that the solutions of (4) and (9), $\theta^*$ and $\theta^*_{\eta}$, respectively, exist and are unique. To understand the difference between $\theta^*$ and $\theta^*_{\eta}$, an important property of $\Gamma_{\eta}$ is introduced:

Lemma 3.1. (a) The projection $\Gamma_{\eta}$ satisfies the following properties: $\lim_{\eta \to \infty} \Gamma_{\eta} = 0$ and $\lim_{\eta \to 0} \Gamma_{\eta} = \Gamma$. (b) We have $\|\Gamma_{\eta}\|_{\infty} \le \|X^{\top} D\|_2 \|X\|_2 \|(X^{\top} D X)^{-1}\|_2 \sqrt{|S||A|}$ for all $\eta \ge 0$.

The proof is given in Appendix A.5. From the above result, one can observe that as $\eta \to \infty$, the projection is attracted to the origin, as illustrated in Figure 1b. Moreover, as $\eta \to 0$, we expect that $\theta^*_{\eta} \to \theta^*$. Furthermore, one can observe that the bound in item (b) of Lemma 3.1 cannot be controlled by simply scaling the feature function, and therefore it depends more on the inherent structure of the feature matrix $X$. The concept is illustrated in Figure 1c. We will provide a more in-depth discussion of the error bound on $\theta^* - \theta^*_{\eta}$ in Section 3.3.

Now, we discuss the existence and uniqueness of the solutions. Considering the possible non-existence of the solution of (4) [De Farias and Van Roy, 2000], (9) may also fail to have a solution. However, for the RPBE in (9), we can prove that under mild conditions its solution exists and is unique. We provide an example where the solution does not exist for (4) but does exist for (9) in Appendix A.14. Let us first state a general condition under which the solution of (9) exists and is unique:

$$\gamma \|\Gamma_{\eta}\|_{\infty} < 1. \quad (11)$$
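A quick numerical sketch of the regularized projection (10) and condition (11); the feature matrix and distribution below are randomly generated illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, h, gamma = 4, 2, 3, 0.9

X = np.abs(rng.normal(size=(nS * nA, h)))      # non-negative, full-column-rank features
d = rng.dirichlet(np.ones(nS * nA))
D = np.diag(d)

def Gamma_eta(eta):
    """Regularized projection (10): X (X^T D X + eta I)^{-1} X^T D."""
    return X @ np.linalg.solve(X.T @ D @ X + eta * np.eye(h), X.T @ D)

def inf_norm(M):
    return np.abs(M).sum(axis=1).max()          # induced infinity norm

for eta in [0.0, 0.1, 1.0, 10.0, 1000.0]:
    # tends to 0 for large eta (Lemma 3.1(a)); condition (11) asks this to be < 1
    print(eta, gamma * inf_norm(Gamma_eta(eta)))
```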
Lemma 3.2. Suppose that (11) holds. Then the solution of the RPBE in (9) exists and is unique.

The proof is given in Appendix A.6 and uses the Banach fixed-point theorem [Agarwal et al., 2018]. From Lemma 3.2, we see that the condition $\gamma \|\Gamma_{\eta}\|_{\infty} < 1$ is important to guarantee the existence and uniqueness of the solution. We will clarify under which situations the condition (11) can be met, and provide related discussions in Lemmas 3.3, 3.4, and 3.5, where each lemma illustrates a different scenario in which (11) is met. In particular, Lemma 3.3 shows that, with simple feature scaling, $\eta$ can easily be chosen such that (11) holds. Furthermore, Lemma 3.4 considers the case when $\eta$ is in a small neighborhood of zero, and Lemma 3.5 considers the case when (11) should hold for all $\eta \ge 0$. We note that Zhang et al. [2021] also studied the solution of a regularized projected Bellman equation. Nonetheless, the result of Zhang et al. [2021] only ensures a unique solution within a certain ball, whereas we consider the whole $\mathbb{R}^h$ space.

Lemma 3.3. For $\eta > \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} + \|X^{\top} D X\|_{\infty}$, we have $\gamma \|\Gamma_{\eta}\|_{\infty} < 1$.

The proof is given in Lemma A.11 in the Appendix. By Lemma 3.3, we can satisfy the condition in (11) by scaling the values of the feature matrix $X$. For example, if $\max(\|X\|_{\infty}, \|X^{\top}\|_{\infty}) < 1$, it is enough to choose $\eta > 2$ to meet the condition in Lemma 3.3. It is worth noting that scaling the values of the feature matrix is a commonly employed technique both in the theoretical literature and in practice.

Lemma 3.4. Suppose $\gamma \|\Gamma\|_{\infty} < 1$, so that the solution of the PBE in (4) exists and is unique. Then, the condition

$$0 \le \eta < \frac{(1 - \gamma \|\Gamma\|_{\infty}) \|(X^{\top} D X)^{-1}\|_{\infty}^{-1}}{\gamma \|(X^{\top} D X)^{-1}\|_{\infty} \|X\|_{\infty} \|X^{\top} D\|_{\infty} + (1 - \gamma \|\Gamma\|_{\infty})}$$

implies $\gamma \|\Gamma_{\eta}\|_{\infty} < 1$.

The proof is in Appendix A.7. We note that the condition $\gamma \|\Gamma\|_{\infty} < 1$ in Lemma 3.4 is used to guarantee the existence and uniqueness of the solution of the PBE in (4), as provided in Melo et al. [2008]. Therefore, without any special conditions, we can guarantee existence and uniqueness of the solution in a neighborhood of $\eta = 0$.

Lemma 3.5. Suppose the feature matrix satisfies $X^{\top} D X = a I$ for a positive real number $a$ such that $a |S||A| \ge 1$. Assume that $\|X\|_2 \le 1$ and $D = \frac{1}{|S||A|} I$. Then, (11) holds for all $\eta > 0$.

The proof is given in Appendix A.8. A simple example where the above statement holds is obtained by letting $X = I$. This is only a conceptual example, and many other examples exist.

3.3 Error analysis

As promised in the previous section, we provide a discussion on the behavior and quality of $\theta^*_{\eta}$, i.e., an error bound analysis of $\theta^* - \theta^*_{\eta}$ depending on $\eta$. Even though Manek and Kolter [2022] provided a specific example where the bias can be large in the policy evaluation case, throughout the analysis we show that the error can be small under particular scenarios. Let us first examine the cases $\eta \to 0$ and $\eta \to \infty$. As discussed in Section 3.2, we can consider $\eta \to 0$ if we can guarantee the existence of $\theta^*_{\eta}$ and $\theta^*$ when $\eta$ is near the origin, for example under the conditions of Lemmas 3.4 and 3.5. As $\eta \to 0$, (4) and (9) coincide, implying that $\theta^*_{\eta} \to \theta^*$. Furthermore, as $\eta$ gets larger, by Lemma 3.3 we can always guarantee existence and uniqueness of $\theta^*_{\eta}$ beyond a certain threshold. From the discussion of Lemma 3.1, we then expect $\theta^*_{\eta} \to 0$, which is stated in the following lemma, whose proof is given in Appendix A.9:

Lemma 3.6. We have $\lim_{\eta \to \infty} \theta^*_{\eta} = 0$.

Note that even if a solution satisfying (7) exists, $X\theta^*_{\eta}$ may be different from $Q^*$.
However, we can derive a bound on the error $X\theta^*_{\eta} - Q^*$ using simple algebraic inequalities and the contraction property of the Bellman operator. We present the error bound in the following lemma:

Lemma 3.7. Suppose (11) holds. Then, we have $\|X\theta^*_{\eta} - Q^*\|_{\infty} \le \frac{1}{1 - \gamma \|\Gamma_{\eta}\|_{\infty}} \|\Gamma_{\eta} Q^* - Q^*\|_{\infty}$.

The proof is given in Appendix A.10. We discuss the error bound in the following cases:

1) $\eta \to 0$: Consider the case when $\theta^*_{\eta}$ and $\theta^*$ exist and are unique, for example when the condition in Lemma 3.4 is satisfied. Since $\Gamma_{\eta} \to \Gamma$ by Lemma 3.1, we exactly recover the error bound for the fixed point of the original projected Bellman equation ($\eta = 0$) in (4), which is $\frac{\|\Gamma Q^* - Q^*\|_{\infty}}{1 - \gamma \|\Gamma\|_{\infty}}$, as provided in Melo et al. [2008]. Thus, our bound in Lemma 3.7 is tight as $\eta \to 0$.

2) $\eta \to \infty$: By Lemma 3.3, $\theta^*_{\eta}$ always exists once $\eta$ is larger than a certain value. Noting that $\Gamma_{\eta} \to 0$, the bound becomes $\|X\theta^*_{\eta} - Q^*\|_{\infty} \le \|Q^*\|_{\infty}$. Considering that $\theta^*_{\eta} \to 0$ as $\eta \to \infty$ by Lemma 3.6, we indeed have $\|X \cdot 0 - Q^*\|_{\infty} = \|Q^*\|_{\infty}$. Thus, our bound in Lemma 3.7 is tight as $\eta \to \infty$.

3) The error bound is close to zero: An upper bound on Lemma 3.7 can be obtained by simple algebraic manipulation:

$$\|X\theta^*_{\eta} - Q^*\|_{\infty} \le \frac{\|\Gamma_{\eta} Q^* - Q^*\|_{\infty}}{1 - \gamma \|\Gamma_{\eta}\|_{\infty}} \le \frac{1}{1 - \gamma \|\Gamma_{\eta}\|_{\infty}} \Big( \underbrace{\|\Gamma_{\eta} Q^* - \Gamma Q^*\|_{\infty}}_{(T1)} + \underbrace{\|\Gamma Q^* - Q^*\|_{\infty}}_{(T2)} \Big). \quad (12)$$

Suppose that the features are well designed so that (T2) in (12) is small. For example, if $Q^*$ is in the range space of $X$, then the error term (T2) vanishes. Moreover, we can make (T1) arbitrarily small as follows: as $\eta \to 0$, we have $\|\Gamma_{\eta} - \Gamma\|_{\infty} \to 0$ while $1 - \gamma \|\Gamma_{\eta}\|_{\infty} > 0$. This makes (T1) in (12) sufficiently small. In the end, we have $\|X\theta^*_{\eta} - Q^*\|_{\infty} \le \epsilon$ for any $\epsilon > 0$.

4) When the PBE does not admit a fixed point around $\eta = 0$: In this case, we must choose $\eta$ greater than a certain number, and (T1) cannot vanish entirely, while (T2) can be arbitrarily close to zero when $Q^*$ is close to the range space of $X$. The error in (T1) cannot be overcome, because it can be seen as a fundamental error caused by regularizing the PBE. However, (T1) can still be small in many cases where $\|\Gamma - \Gamma_{\eta}\|_{\infty}$ is small.

4 Algorithm

In this section, we introduce our main algorithm, called RegQ, and elaborate on the condition on the regularization term that makes the algorithm convergent. The proposed algorithm is motivated by TD-learning. In particular, for on-policy TD-learning, one can establish convergence using properties of the stationary distribution. On the other hand, in the off-policy case, the mismatch between the sampling distribution and the stationary distribution can cause divergence [Sutton et al., 2016]. To address this problem, Bharadwaj Diddigi et al. [2020] add a regularization term to TD-learning in order to make it convergent. Since Q-learning can be interpreted as an off-policy TD-learning, we add a regularization term to the Q-learning update, motivated by Bharadwaj Diddigi et al. [2020]. This modification leads to the proposed RegQ algorithm:

$$\theta_{k+1} = \theta_k + \alpha_k \big( x(s_k, a_k)\delta_k - \eta \theta_k \big). \quad (13)$$

The pseudo-code is given in Appendix A.16. Note that (13) can be viewed as a gradient descent step applied to the TD-loss $L(\theta) := \frac{1}{2}\big(y_k - Q_{\theta}(s_k, a_k)\big)^2 + \frac{\eta}{2}\|\theta\|_2^2$, where $y_k = r_{k+1} + \gamma \max_{a \in A} Q_{\theta_k}(s_{k+1}, a)$ is the TD-target treated as a constant and $Q_{\theta_k} = X\theta_k$. Furthermore, letting $\eta = 0$, the update reduces to the standard Q-learning with linear function approximation in (2). The proposed RegQ differs from Bharadwaj Diddigi et al. [2020] in that the regularization term is applied to Q-learning instead of TD-learning.
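A minimal sketch of one RegQ update (13), assuming a generic feature interface; the argument names are illustrative.

```python
import numpy as np

def regq_step(theta, x_sa, x_next_all, r, alpha, gamma, eta):
    """One RegQ update, cf. (13): theta <- theta + alpha * (x(s,a) * delta - eta * theta).

    x_sa       : (h,) feature vector x(s_k, a_k)
    x_next_all : (|A|, h) feature vectors x(s_{k+1}, a) for all actions a
    """
    y = r + gamma * np.max(x_next_all @ theta)   # TD-target y_k (held fixed)
    delta = y - x_sa @ theta                     # TD error delta_k
    return theta + alpha * (delta * x_sa - eta * theta)
```

Setting `eta = 0` recovers the standard linear Q-learning update (2); the `- eta * theta` term is the only difference.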
Rewriting the stochastic update in a deterministic form plus noise, it can be written as

$$\theta_{k+1} = \theta_k + \alpha_k \big( b - (A_{\pi_{X\theta_k}} + \eta I)\theta_k + m_{k+1} \big), \quad (14)$$

where $m_{k+1} = \delta_k x(s_k, a_k) - \eta\theta_k - \big( b - (A_{\pi_{X\theta_k}} + \eta I)\theta_k \big)$ is a martingale difference sequence. Without $m_{k+1}$, (14) reduces to the deterministic version in (8). In our convergence analysis, we will apply the O.D.E. approach; in this case, $A_{\pi_{X\theta_k}} + \eta I$ determines the stability of the corresponding O.D.E. model, and hence the convergence of (13). Note that (14) can be interpreted as the switching system defined in (3) with stochastic noise. As mentioned earlier, proving the stability of a general switching system is challenging. However, here we can find a common Lyapunov function to prove asymptotic stability. In particular, we can make $-(A_{\pi_{X\theta_k}} + \eta I)$ have a strictly negatively row dominating diagonal or be negative definite under the following condition:

$$\eta > \min\Bigg\{ \underbrace{\gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} + \|X^{\top} D X\|_{\infty}}_{(S1)},\ \underbrace{\lambda_{\max}(C) \max_{\pi \in \Theta} \max_{(s,a) \in S \times A} \left( \frac{\gamma d^{\top} P^{\pi}(e_a \otimes e_s)}{2 d(s,a)} - \frac{2 - \gamma}{2} \right)}_{(S2)} \Bigg\}. \quad (15)$$

The conditions $\eta > (S1)$ and $\eta > (S2)$, which make $-(A_{\pi_{X\theta_k}} + \eta I)$ have a strictly negatively row dominating diagonal or be negative definite, respectively, do not necessarily imply each other; this is discussed in Appendix A.15. Now, we can use a Lyapunov argument to establish stability of the overall system. Building on this fact, in the next section we prove that under the stochastic update (13), $\theta_k \to \theta^*_{\eta}$ as $k \to \infty$ with probability one, where $\theta^*_{\eta}$ satisfies the RPBE in (7). If $\eta = 0$ satisfies (15), we can guarantee convergence to an optimal policy without errors.
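As a sanity check, the following sketch evaluates the two thresholds (S1) and (S2) in (15) for a small randomly generated MDP, following the expressions above and in Lemma A.8; the MDP, the features, and the $a$-major state-action stacking convention are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
nS, nA, h, gamma = 3, 2, 2, 0.9

P_a = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P_a[a, s, :] = P(. | s, a)
d = rng.dirichlet(np.ones(nS * nA))               # stacked as index = a*nS + s
X = np.abs(rng.normal(size=(nS * nA, h)))
D = np.diag(d)
C = X.T @ D @ X
P = np.vstack([P_a[a] for a in range(nA)])

def inf_norm(M):
    return np.abs(M).sum(axis=1).max()

# (S1): gamma * ||X^T D||_inf * ||X||_inf + ||X^T D X||_inf
S1 = gamma * inf_norm(X.T @ D) * inf_norm(X) + inf_norm(C)

# (S2): lambda_max(C) * max over deterministic policies and (s, a) of
#       gamma * d^T P^pi e_{(s,a)} / (2 d(s, a)) - (2 - gamma) / 2
best = -np.inf
for pi in product(range(nA), repeat=nS):          # enumerate all deterministic policies
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, pi[s] * nS + s] = 1.0
    P_pi = P @ Pi                                 # state-action transition matrix P^pi
    for i in range(nS * nA):
        best = max(best, gamma * (d @ P_pi[:, i]) / (2 * d[i]) - (2 - gamma) / 2)
S2 = np.linalg.eigvalsh(C).max() * best

print("eta should exceed min(S1, S2) =", min(S1, S2))
```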
5 Convergence Analysis

Recently, Lee and He [2020] suggested a switching system framework to prove the stability of Q-learning in the linear function approximation case. However, its assumption on the behavior policy and feature matrix seems too stringent to check in practice. Here, we develop a more practical Q-learning algorithm by adding an appropriately preconditioned regularization term. We prove the convergence of the proposed Q-learning with the regularization term in (13) following lines similar to Lee and He [2020]. Our proof mainly relies on the Borkar-Meyn theorem. Therefore, we first discuss the O.D.E. corresponding to the proposed update (13), which is

$$\dot{\theta}_t = -(X^{\top} D X + \eta I)\theta_t + \gamma X^{\top} D P \Pi_{X\theta_t} X\theta_t + X^{\top} D R =: f(\theta_t). \quad (16)$$

Then, using a change of coordinates, the above O.D.E. can be rewritten as

$$\frac{d}{dt}(\theta_t - \theta^*_{\eta}) = (-A_{\pi_{X\theta_t}} - \eta I)(\theta_t - \theta^*_{\eta}) + \gamma X^{\top} D P (\Pi_{X\theta_t} - \Pi_{X\theta^*_{\eta}}) X\theta^*_{\eta}, \quad (17)$$

where $\theta^*_{\eta}$ satisfies (7). Here, we assume that an equilibrium point exists; we later prove that if an equilibrium exists, then it is unique. To apply the Borkar-Meyn theorem in Lemma A.1, we discuss the asymptotic stability of the O.D.E. in (17). Note that (17) includes an affine term, i.e., it cannot be expressed as a matrix times the vector $\theta_t - \theta^*_{\eta}$. It is in general harder to establish asymptotic stability of a switched linear system with an affine term than of the switched linear system (3). To circumvent this difficulty, Lee and He [2020] proposed upper and lower comparison systems, which bound the original system from above and below. Then, the stability of the original system can be established by proving the stability of the upper and lower systems, which are easier to analyze. Following similar lines, to check global asymptotic stability of the original system, we also introduce upper and lower comparison systems and prove their global asymptotic stability. Since the upper and lower comparison systems can be viewed as a switched linear system and a linear system, respectively, their global asymptotic stability is easier to prove. We stress that although the switching system approach of Lee and He [2020] is adopted in this paper, the detailed proof is entirely different and nontrivial. In particular, the upper and lower comparison systems are given as follows:

$$\dot{\theta}^u_t = \big( -X^{\top} D X - \eta I + \gamma X^{\top} D P \Pi_{X\theta^u_t} X \big)\theta^u_t, \qquad \dot{\theta}^l_t = \big( -X^{\top} D X - \eta I + \gamma X^{\top} D P \Pi_{X\theta^*_{\eta}} X \big)\theta^l_t,$$

where $\theta^u_t$ and $\theta^l_t$ denote the states of the upper and lower systems, respectively. We defer the detailed construction of each system to Appendix A.12. The stability of the overall system can be proved by establishing the stability of the upper and lower comparison systems.

Theorem 5.1. Suppose $\eta$ satisfies (15) and Assumptions 2.1, 2.2, and 2.3 hold. Moreover, assume that a solution of the RPBE in (7) exists. Then, it is also unique, and the origin is the unique globally asymptotically stable equilibrium point of (17).

The detailed proof is given in Appendix A.12. Building on the previous results, we now use Borkar and Meyn's theorem in Lemma A.1 to establish the convergence of RegQ. The full proof of the following theorem is given in Appendix A.13.

Theorem 5.2. Suppose $\eta$ satisfies (15) and Assumptions 2.1, 2.2, and 2.3 hold. Assume that a solution of the RPBE in (7) exists. Then, $\theta^*_{\eta}$ is unique, and under the stochastic update (13), $\theta_k \to \theta^*_{\eta}$ as $k \to \infty$ with probability one, where $\theta^*_{\eta}$ satisfies (7).

We note that if $\eta$ is larger than the term (S1) in (15), then $\theta^*_{\eta}$ exists and is unique by Lemma 3.3.

6 Experiments

In this section, we briefly present experimental results on well-known environments from Tsitsiklis and Van Roy [1996], Baird [1995], where Q-learning with linear function approximation diverges. As shown in Figure 2b, our algorithm converges faster than the other algorithms. Further details on the experiments are deferred to Appendix B. In Appendix B.6, we also compare performance on the Mountain Car environment [Sutton and Barto, 2018], where Q-learning performs well. In Appendix B.5, we show experimental results under various step-sizes and $\eta$. Moreover, trajectories of the upper and lower systems illustrating the theoretical results are given in Appendix B.7.

7 Conclusion

In this paper, we presented a new convergent Q-learning algorithm with linear function approximation (RegQ), which is simple to implement. We provided a theoretical analysis of the proposed RegQ and demonstrated its performance on several experiments where the original Q-learning with linear function approximation diverges.

Figure 2: Experiment results. (a) Results in the $\theta \to 2\theta$ example. (b) Results in Baird's seven-star counterexample.

Developing a new Q-learning algorithm with linear function approximation without bias would be an interesting future research topic. Moreover, considering the great success of deep learning, it would be interesting to develop deep reinforcement learning algorithms with an appropriately preconditioned regularization term instead of using the target network.

8 Acknowledgements

The work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korea government under Grant 2022-0-00469, and the BK21 FOUR program of the Ministry of Education (Republic of Korea).
References

Naman Agarwal, Syomantak Chaudhuri, Prateek Jain, Dheeraj Mysore Nagaraj, and Praneeth Netrapalli. Online target Q-learning with reverse experience replay: Efficiently finding the optimal policy for linear MDPs. In International Conference on Learning Representations.

Praveen Agarwal, Mohamed Jleli, and Bessem Samet. Fixed Point Theory in Metric Spaces: Recent Advances and Applications. 2018.

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30-37. Elsevier, 1995.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, and Shalabh Bhatnagar. A convergent off-policy temporal difference algorithm. In ECAI 2020, pages 1103-1110. IOS Press, 2020.

Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447-469, 2000.

Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33-57, 1996.

Diogo Carvalho, Francisco S Melo, and Pedro Santos. A new convergent variant of Q-learning with linear function approximation. Advances in Neural Information Processing Systems, 33:19412-19421, 2020.

Zaiwei Chen, John-Paul Clarke, and Siva Theja Maguluri. Target network and truncation overcome the deadly triad in Q-learning. SIAM Journal on Mathematics of Data Science, 5(4):1078-1101, 2023.

Daniela Pucci De Farias and Benjamin Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3):589-608, 2000.

Adithya M Devraj and Sean Meyn. Zap Q-learning. Advances in Neural Information Processing Systems, 30, 2017.

Amir-massoud Farahmand, Mohammad Ghavamzadeh, Csaba Szepesvári, and Shie Mannor. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17(139):1-66, 2016.

Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160-2169. PMLR, 2019.

Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, and Martha White. Gradient temporal-difference learning with regularized corrections. In International Conference on Machine Learning, pages 3524-3534. PMLR, 2020.

Abhijit Gosavi. Boundedness of iterates in Q-learning. Systems & Control Letters, 55(4):347-349, 2006.

William W Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221-239, 1989.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Morris W Hirsch and Hal Smith. Monotone dynamical systems. In Handbook of Differential Equations: Ordinary Differential Equations, volume 2, pages 239-357. Elsevier, 2006.

RA Horn and CR Johnson. Matrix Analysis, second edition, 2013.

Tommi Jaakkola, Michael I Jordan, and Satinder P Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994.

Carl T Kelley. Iterative Methods for Linear and Nonlinear Equations. SIAM, 1995.

Hassan K Khalil. Nonlinear Systems, 3rd ed.
Prentice-Hall, Upper Saddle River, NJ, 2002. URL https://cds.cern.ch/record/1173048.

Seungchan Kim, Kavosh Asadi, Michael Littman, and George Konidaris. DeepMellow: Removing the need for a target network in deep Q-learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019.

Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin Q-learning: Controlling the estimation bias of Q-learning. In International Conference on Learning Representations.

Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. Advances in Neural Information Processing Systems, 33:15556-15567, 2020.

Donghwan Lee, Han-Dong Lim, Jihoon Park, and Okyong Choi. New versions of gradient temporal difference learning. IEEE Transactions on Automatic Control, 68(8):5006-5013, 2022.

Daniel Liberzon. Switching in Systems and Control. Springer Science & Business Media, 2003.

Fan Lu, Prashant G Mehta, Sean P Meyn, and Gergely Neu. Convex Q-learning. In 2021 American Control Conference (ACC), pages 4749-4756. IEEE, 2021.

Hamid Reza Maei. Gradient temporal-difference learning algorithms. 2011.

Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S Sutton. Toward off-policy learning control with function approximation. In ICML, 2010.

Sridhar Mahadevan, Bo Liu, Philip Thomas, Will Dabney, Steve Giguere, Nicholas Jacek, Ian Gemp, and Ji Liu. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv preprint arXiv:1405.6757, 2014.

Gaurav Manek and J Zico Kolter. The pitfalls of regularization in off-policy TD learning. Advances in Neural Information Processing Systems, 35:35621-35631, 2022.

Alan S Manne. Linear programming and sequential decisions. Management Science, 6(3):259-267, 1960.

Francisco S Melo, Sean P Meyn, and M Isabel Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664-671, 2008.

Sean Meyn. Stability of Q-learning through design and optimism. arXiv preprint arXiv:2307.02632, 2023.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Alexander P Molchanov and Ye S Pyatnitskiy. Criteria of asymptotic stability of differential and difference inclusions encountered in control theory. Systems & Control Letters, 13(1):59-64, 1989.

Alexandre Piché, Joseph Marino, Gian Maria Marconi, Christopher Pal, and Mohammad Emtiyaz Khan. Beyond target networks: Improving deep Q-learning with functional regularization. 2021.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993-1000, 2009.

Richard S Sutton, A Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning.
The Journal of Machine Learning Research, 17(1):2603-2631, 2016.

John N Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1):59-94, 1996.

John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

Jiachen Xi, Alfredo Garcia, and Petar Momcilovic. Regularized Q-learning with linear function approximation. arXiv preprint arXiv:2401.15196, 2024.

Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995-7004. PMLR, 2019.

Shangtong Zhang, Hengshuai Yao, and Shimon Whiteson. Breaking the deadly triad with a target network. In International Conference on Machine Learning, pages 12621-12631. PMLR, 2021.

A.1 O.D.E. analysis

The dynamical system framework has been widely used to prove convergence of reinforcement learning algorithms, e.g., Sutton et al. [2009], Maei et al. [2010], Borkar and Meyn [2000], Lee and He [2020]. In particular, Borkar and Meyn [2000] is one of the most widely used techniques to prove stability of stochastic approximation using O.D.E. analysis. Consider the following stochastic algorithm with a nonlinear mapping $f : \mathbb{R}^n \to \mathbb{R}^n$:

$$\theta_{k+1} = \theta_k + \alpha_k \big( f(\theta_k) + m_{k+1} \big), \quad (18)$$

where $m_{k+1} \in \mathbb{R}^n$ is a noise vector. For completeness, the results in Borkar and Meyn [2000] are briefly reviewed in the sequel. Under Assumption A.2 given in Appendix A.2, we now introduce the Borkar and Meyn theorem below.

Lemma A.1 (Borkar and Meyn theorem). Suppose that Assumption A.2 in Appendix A.2 holds, and consider the stochastic algorithm in (18). Then, for any initial $\theta_0 \in \mathbb{R}^n$, $\sup_{k \ge 0} \|\theta_k\| < \infty$ with probability one. In addition, $\theta_k \to \theta_e$ as $k \to \infty$ with probability one, where $\theta_e$ satisfies $f(\theta_e) = 0$.

The main idea of the Borkar and Meyn theorem is as follows: the iterates of a stochastic recursive algorithm follow the solution of its corresponding O.D.E. in the limit when the step-size satisfies the Robbins-Monro condition. Hence, by proving the asymptotic stability of the O.D.E., we can deduce the convergence of the original algorithm. In this paper, we will use an O.D.E. model of Q-learning, which is expressed as a special nonlinear system called a switching system.

A.2 Assumption for the Borkar and Meyn Theorem

Assumption A.2.
1. The mapping $f : \mathbb{R}^n \to \mathbb{R}^n$ is globally Lipschitz continuous, and there exists a function $f_{\infty} : \mathbb{R}^n \to \mathbb{R}^n$ such that
$$\lim_{c \to \infty} \frac{f(cx)}{c} = f_{\infty}(x), \quad x \in \mathbb{R}^n. \quad (19)$$
2. The origin in $\mathbb{R}^n$ is an asymptotically stable equilibrium for the O.D.E. $\dot{x}_t = f_{\infty}(x_t)$.
3. There exists a unique globally asymptotically stable equilibrium $\theta_e \in \mathbb{R}^n$ for the O.D.E. $\dot{x}_t = f(x_t)$, i.e., $x_t \to \theta_e$ as $t \to \infty$.
4. The sequence $\{m_k, \mathcal{G}_k\}_{k \ge 1}$, where $\mathcal{G}_k$ is the sigma-algebra generated by $\{(\theta_i, m_i), i \le k\}$, is a martingale difference sequence. In addition, there exists a constant $C_0 < \infty$ such that for any initial $\theta_0 \in \mathbb{R}^n$, we have $\mathbb{E}[\|m_{k+1}\|^2 \mid \mathcal{G}_k] \le C_0(1 + \|\theta_k\|^2)$ for all $k \ge 0$.
5. The step-sizes satisfy the Robbins-Monro condition [Robbins and Monro, 1951]: $\alpha_k > 0$, $\sum_{k=0}^{\infty} \alpha_k = \infty$, $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$.

A.3 Auxiliary lemmas

Lemma A.3 (Woodbury matrix identity [Hager, 1989]). For $A, B \in \mathbb{R}^{n \times n}$, suppose $A$ and $I + A^{-1}B$ are invertible. Then $A + B$ is invertible and we have $(A + B)^{-1} = A^{-1} - A^{-1} B (I + A^{-1} B)^{-1} A^{-1}$.
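A quick numerical sanity check of the identity in Lemma A.3; the matrices are randomly generated illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = np.eye(n) + rng.normal(scale=0.1, size=(n, n))   # well-conditioned A
B = rng.normal(scale=0.1, size=(n, n))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.eye(n) + Ainv @ B) @ Ainv
assert np.allclose(lhs, rhs)                         # (A + B)^{-1} identity of Lemma A.3
```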
Lemma A.4 (Gelfand's formula, Corollary 5.6.14 in Horn and Johnson [2013]). For any matrix norm $\|\cdot\|$ and $A \in \mathbb{R}^{n \times n}$, we have $\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}$, where $\rho(\cdot)$ denotes the spectral radius of a given matrix.

Definition A.5 (Theorem 3 in Molchanov and Pyatnitskiy [1989]). A matrix $A \in \mathbb{R}^{n \times n}$ is said to have a strictly negatively row dominating diagonal if $[A]_{ii} + \sum_{j \in \{1,2,\ldots,n\} \setminus \{i\}} |[A]_{ij}| < 0$ for all $1 \le i \le n$.

Lemma A.6 (Theorem 3 in Molchanov and Pyatnitskiy [1989]). Consider the switched system in (3), where $\mathcal{M} := \{1, 2, \ldots, M\}$ is the set of switching modes. Suppose there exist a number $m \ge n$, a full-row-rank matrix $L \in \mathbb{R}^{n \times m}$, and a set of matrices $\{L_{\sigma} \in \mathbb{R}^{m \times m}\}_{\sigma \in \mathcal{M}}$ such that: 1) each $L_{\sigma}$ for $\sigma \in \mathcal{M}$ has a strictly negatively row dominating diagonal, i.e., $[L_{\sigma}]_{ii} + \sum_{j \in \{1,2,\ldots,m\} \setminus \{i\}} |[L_{\sigma}]_{ij}| < 0$ for all $1 \le i \le m$; 2) $A_{\sigma}^{\top} L = L L_{\sigma}^{\top}$ for all $\sigma \in \mathcal{M}$. Then, the origin of (3) is asymptotically stable.

Lemma A.7 (Gerschgorin circle theorem [Horn and Johnson, 2013]). Let $A \in \mathbb{R}^{n \times n}$ with $i$-th row and $j$-th column element $a_{ij}$, and let $R_i(A) = \sum_{j \in \{1,2,\ldots,n\} \setminus \{i\}} |a_{ij}|$. Consider the Gerschgorin circles $\{z \in \mathbb{C} : |z - a_{ii}| \le R_i(A)\}$, $i = 1, \ldots, n$. The eigenvalues of $A$ lie in the union of the Gerschgorin discs $G(A) = \bigcup_{i=1}^{n} \{z \in \mathbb{C} : |z - a_{ii}| \le R_i(A)\}$.

Now, we state a lemma to guarantee positive definiteness of $A_{\pi_{X\theta}} + \eta I$. Instead of working with $A_{\pi_{X\theta}} + \eta I$ directly, we prove positive definiteness of $A_{\pi_{X\theta}} + \frac{\eta}{\lambda_{\max}(C)} C$, which implies it. We follow lines similar to Bharadwaj Diddigi et al. [2020].

Lemma A.8. Let $M^{\pi_{X\theta}} := D\left( \left(1 + \frac{\eta}{\lambda_{\max}(C)}\right) I - \gamma P^{\pi_{X\theta}} \right)$. Under the condition

$$\eta > \lambda_{\max}(C) \max_{\pi \in \Theta} \max_{(s,a) \in S \times A} \left( \frac{\gamma d^{\top} P^{\pi_{X\theta}} (e_a \otimes e_s)}{2 d(s,a)} - \frac{2 - \gamma}{2} \right),$$

where $\Theta$ is the set of all deterministic policies and $\otimes$ is the Kronecker product, $M^{\pi_{X\theta}}$ is positive definite in the sense that $M^{\pi_{X\theta}} + (M^{\pi_{X\theta}})^{\top} \succ 0$.

Proof. For simplicity of notation, we denote $d_i = d(s, a)$ and $e_i = e_a \otimes e_s$ for the index $i \in \{1, 2, \ldots, |S||A|\}$ corresponding to $(s, a) \in S \times A$, and write $m_{ij} = [M^{\pi_{X\theta}}]_{ij}$. Then one gets $m_{ii} = d_i \big(1 + \frac{\eta}{\lambda_{\max}(C)} - \gamma e_i^{\top} P^{\pi_{X\theta}} e_i\big)$ and $m_{ij} = -d_i \gamma e_i^{\top} P^{\pi_{X\theta}} e_j$ for $i \ne j$. Except for the diagonal element, the row and column sums of absolute values become
$$\sum_{j \in S_i} |m_{ij}| = \gamma d_i (1 - e_i^{\top} P^{\pi_{X\theta}} e_i), \qquad \sum_{j \in S_i} |m_{ji}| = \gamma d^{\top} P^{\pi_{X\theta}} e_i - \gamma d_i e_i^{\top} P^{\pi_{X\theta}} e_i,$$
where $S_i = \{1, 2, \ldots, |S||A|\} \setminus \{i\}$. We need to show that $M^{\pi_{X\theta}} + (M^{\pi_{X\theta}})^{\top}$ is positive definite. To this end, we use Lemma A.7 to bound any eigenvalue $\lambda$ of $M^{\pi_{X\theta}} + (M^{\pi_{X\theta}})^{\top}$ from below by
$$\lambda \ge 2 m_{ii} - \sum_{j \in S_i} |m_{ij}| - \sum_{j \in S_i} |m_{ji}|$$
for some $i$. Considering this lower bound, we have
$$2 m_{ii} - \sum_{j \in S_i} |m_{ij}| - \sum_{j \in S_i} |m_{ji}| = 2 d_i \Big(1 + \frac{\eta}{\lambda_{\max}(C)} - \gamma e_i^{\top} P^{\pi_{X\theta}} e_i\Big) - \gamma d_i (1 - e_i^{\top} P^{\pi_{X\theta}} e_i) - \big(\gamma d^{\top} P^{\pi_{X\theta}} e_i - \gamma d_i e_i^{\top} P^{\pi_{X\theta}} e_i\big) = \frac{2 d_i \eta}{\lambda_{\max}(C)} + (2 - \gamma) d_i - \gamma d^{\top} P^{\pi_{X\theta}} e_i.$$
Hence, for $\lambda > 0$, it suffices that $\eta > \lambda_{\max}(C)\Big(\frac{\gamma d^{\top} P^{\pi_{X\theta}} e_i}{2 d_i} - \frac{2 - \gamma}{2}\Big)$. Taking $\eta > \lambda_{\max}(C) \max_{\pi \in \Theta} \max_{i \in \{1,2,\ldots,|S||A|\}} \Big(\frac{\gamma d^{\top} P^{\pi} e_i}{2 d_i} - \frac{2 - \gamma}{2}\Big)$, we can make $M^{\pi_{X\theta}}$ always positive definite. This completes the proof.

We next introduce a lemma to bound the norm of a matrix inverse:

Lemma A.9 (Page 351 in Horn and Johnson [2013]). If $M \in \mathbb{R}^{n \times n}$ satisfies $\|M\| < 1$ for some matrix norm $\|\cdot\|$, then $I - M$ is non-singular and $\|(I - M)^{-1}\| \le \frac{1}{1 - \|M\|}$.

Lemma A.10. Suppose that $\|X^{\top} D X\|_{\infty} < \eta$. Then, we have $\|(X^{\top} D X + \eta I)^{-1}\|_{\infty} \le \frac{1}{\eta - \|X^{\top} D X\|_{\infty}}$.

Proof. We have
$$\|(X^{\top} D X + \eta I)^{-1}\|_{\infty} = \frac{1}{\eta} \left\| \Big( \frac{X^{\top} D X}{\eta} + I \Big)^{-1} \right\|_{\infty} \le \frac{1}{\eta} \cdot \frac{1}{1 - \frac{\|X^{\top} D X\|_{\infty}}{\eta}} = \frac{1}{\eta - \|X^{\top} D X\|_{\infty}}.$$
The inequality follows from Lemma A.9. This completes the proof.

Lemma A.11. For $\eta > \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} + \|X^{\top} D X\|_{\infty}$, we have $\gamma \|\Gamma_{\eta}\|_{\infty} < 1$.

Proof. From the definition of $\Gamma_{\eta}$ in (10), we have
$$\gamma \|\Gamma_{\eta}\|_{\infty} = \gamma \|X (X^{\top} D X + \eta I)^{-1} X^{\top} D\|_{\infty} \le \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} \frac{1}{\eta - \|X^{\top} D X\|_{\infty}} < 1.$$
The first inequality follows from Lemma A.10. The last inequality follows from the condition $\eta > \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} + \|X^{\top} D X\|_{\infty}$. This completes the proof.

Lemma A.12. The equation (7) can be written as $X\theta^*_{\eta} = \Gamma_{\eta} T X\theta^*_{\eta}$.

Proof. Let us expand the terms in (7):
$$X^{\top} D R = (\eta I + X^{\top} D X)\theta^*_{\eta} - \gamma X^{\top} D P \Pi_{X\theta^*_{\eta}} X\theta^*_{\eta}$$
$$\iff X^{\top} D (R + \gamma P \Pi_{X\theta^*_{\eta}} X\theta^*_{\eta}) = (\eta I + X^{\top} D X)\theta^*_{\eta}$$
$$\iff X (\eta I + X^{\top} D X)^{-1} X^{\top} D (R + \gamma P \Pi_{X\theta^*_{\eta}} X\theta^*_{\eta}) = X\theta^*_{\eta}.$$
The last equivalence uses the fact that $X$ has full column rank. This completes the proof.

Lemma A.13. For any $\theta \in \mathbb{R}^h$, if $\eta > \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} + \|X^{\top} D X\|_{\infty}$, then $-A_{\pi_{X\theta}} - \eta I$ has a strictly negatively row dominating diagonal.

Proof. For $1 \le i \le h$, we have
$$[-A_{\pi_{X\theta}} - \eta I]_{ii} + \sum_{j \in \{1,2,\ldots,h\} \setminus \{i\}} |[A_{\pi_{X\theta}} + \eta I]_{ij}| \le -\eta + \sum_{j=1}^{h} |[A_{\pi_{X\theta}}]_{ij}| \le -\eta + \|A_{\pi_{X\theta}}\|_{\infty} \le -\eta + \|X^{\top} D X\|_{\infty} + \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} < 0.$$
The second-to-last inequality follows from the fact that $\|P \Pi_{X\theta}\|_{\infty} \le 1$.

Lemma A.14 (Continuity of $\theta^*_{\eta}$ with respect to $\eta$). Let $\eta_0$ be a non-negative real constant, and suppose $\gamma \|\Gamma_{\eta_0}\|_{\infty} < 1$. Then, $\theta^*_{\eta}$ is continuous at $\eta_0$.

Proof. Note that $\Gamma_{\eta}$ is a continuous function of $\eta$, and we have $\Gamma_{\eta_0 + \eta} = \Gamma_{\eta_0} + O(\eta)$, where $O(\cdot)$ stands for the big-O notation. Therefore,
$$\|X\theta^*_{\eta_0+\eta} - X\theta^*_{\eta_0}\|_{\infty} = \|\Gamma_{\eta_0+\eta} T X\theta^*_{\eta_0+\eta} - \Gamma_{\eta_0} T X\theta^*_{\eta_0}\|_{\infty} \le \|\Gamma_{\eta_0}(T X\theta^*_{\eta_0+\eta} - T X\theta^*_{\eta_0})\|_{\infty} + O(\eta) \le \gamma \|\Gamma_{\eta_0}\|_{\infty} \|X(\theta^*_{\eta_0+\eta} - \theta^*_{\eta_0})\|_{\infty} + O(\eta).$$
The first equality follows from the definitions of $\theta^*_{\eta_0+\eta}$ and $\theta^*_{\eta_0}$, the second inequality follows from the triangle inequality, and the last inequality follows from the contraction property of the Bellman operator. Therefore, we have $\|\theta^*_{\eta_0+\eta} - \theta^*_{\eta_0}\|_{\infty} \le C \|X\theta^*_{\eta_0+\eta} - X\theta^*_{\eta_0}\|_{\infty} \le O(\eta)$, where the first inequality holds because $X$ has full column rank and $C$ is a universal constant. This completes the proof.

A.4 Proof of Lemma 2.5

Proof. The first item follows from Lemma A.6. The second item follows from the fact that if all the subsystem matrices are negative definite, then the system admits $V(x) = \|x\|_2^2$ as a common Lyapunov function. This completes the proof.

A.5 Proof of Lemma 3.1

Proof. Let us prove the first item. As $\eta \to 0$, we have $(X^{\top} D X + \eta I)^{-1} \to (X^{\top} D X)^{-1}$, and therefore $\Gamma_{\eta} \to \Gamma$ as $\eta \to 0$. Moreover, by Lemma A.10, for sufficiently large $\eta$ the matrix $(X^{\top} D X + \eta I)$ is invertible and
$$\|\Gamma_{\eta}\|_{\infty} = \|X (X^{\top} D X + \eta I)^{-1} X^{\top} D\|_{\infty} \le \|X^{\top} D\|_{\infty} \|X\|_{\infty} \frac{1}{\eta - \|X^{\top} D X\|_{\infty}},$$
where the inequality follows from Lemma A.10. As $\eta \to \infty$, we get $\Gamma_{\eta} \to 0$. This completes the proof of the first item.

Now, we prove the second item. First, using the matrix identity in Lemma A.3 [Hager, 1989], we have
$$(X^{\top} D X + \eta I)^{-1} = (X^{\top} D X)^{-1} - \big(X^{\top} D X + \eta^{-1} (X^{\top} D X)(X^{\top} D X)\big)^{-1} \preceq (X^{\top} D X)^{-1},$$
where the inequality comes from the fact that $\big(X^{\top} D X + \eta^{-1} (X^{\top} D X)(X^{\top} D X)\big)^{-1}$ is positive semidefinite. Then, we have
$$\|\Gamma_{\eta}\|_{\infty} = \|X (X^{\top} D X + \eta I)^{-1} X^{\top} D\|_{\infty} \le \sqrt{|S||A|}\, \|X (X^{\top} D X + \eta I)^{-1} X^{\top} D\|_2 \le \sqrt{|S||A|}\, \|X^{\top} D\|_2 \|X\|_2 \|(X^{\top} D X + \eta I)^{-1}\|_2.$$
Next, since the spectral norm is monotone on the positive semidefinite order, i.e., $A \preceq B$ implies $\|A\|_2 \le \|B\|_2$ for symmetric positive semidefinite $A$ and $B$, one gets
$$\|\Gamma_{\eta}\|_{\infty} \le \|X^{\top} D\|_2 \|X\|_2 \|(X^{\top} D X + \eta I)^{-1}\|_2 \sqrt{|S||A|} \le \|X^{\top} D\|_2 \|X\|_2 \|(X^{\top} D X)^{-1}\|_2 \sqrt{|S||A|},$$
which is the desired conclusion.

A.6 Proof of Lemma 3.2

Proof. To show the existence and uniqueness of the solution of (9), we use the Banach fixed-point theorem. Note that it is enough to show the existence and uniqueness of the solution of the following equation:
$$y = \Gamma_{\eta}(R + \gamma P \Pi_y y), \quad y \in \mathbb{R}^{|S||A|}. \quad (20)$$
This is because a solution $y$ satisfying the above equation is in the image of $X$, and we can find a unique $\theta$ such that $X\theta = y$ because $X$ has full column rank. Denote $F(y) := \Gamma_{\eta}(R + \gamma P \Pi_y y)$. For any $y_1, y_2$,
$$\|F(y_1) - F(y_2)\|_{\infty} = \|\Gamma_{\eta}\, \gamma P (\Pi_{y_1} y_1 - \Pi_{y_2} y_2)\|_{\infty} \le \gamma \|\Gamma_{\eta}\|_{\infty} \|P\|_{\infty} \|\Pi_{y_1} y_1 - \Pi_{y_2} y_2\|_{\infty} \le \gamma \|\Gamma_{\eta}\|_{\infty} \|y_1 - y_2\|_{\infty} < \|y_1 - y_2\|_{\infty}.$$
The second inequality follows from $\|P\|_{\infty} = 1$ and the non-expansiveness of the max-operator, and the last inequality follows from (11). Now, we can apply the Banach fixed-point theorem to conclude existence and uniqueness of the solution of (20). This completes the proof.
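To illustrate the fixed-point argument above, the following sketch iterates the map $y \mapsto \Gamma_{\eta}(R + \gamma P \Pi_y y)$ from (20) on a small randomly generated MDP; the problem data and the choice $\eta = 5$ are illustrative assumptions, and the iteration converges whenever condition (11) holds.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, h, gamma, eta = 3, 2, 2, 0.9, 5.0

P_a = rng.dirichlet(np.ones(nS), size=(nA, nS))
R = rng.uniform(size=nS * nA)                   # stacked as index = a*nS + s
d = rng.dirichlet(np.ones(nS * nA))
X = np.abs(rng.normal(size=(nS * nA, h)))
D = np.diag(d)
P = np.vstack([P_a[a] for a in range(nA)])
Gamma_eta = X @ np.linalg.solve(X.T @ D @ X + eta * np.eye(h), X.T @ D)

def Pi_greedy(y):
    Pi = np.zeros((nS, nS * nA))
    Q = y.reshape(nA, nS)
    for s in range(nS):
        Pi[s, np.argmax(Q[:, s]) * nS + s] = 1.0
    return Pi

y = np.zeros(nS * nA)
for _ in range(2000):                           # fixed-point iteration on (20)
    y = Gamma_eta @ (R + gamma * P @ Pi_greedy(y) @ y)
# If gamma * ||Gamma_eta||_inf < 1, the map is a contraction and y -> X theta*_eta.
```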
A.7 Proof of Lemma 3.4

For the proof, suppose that $\gamma \|\Gamma\|_{\infty} < 1$. If the condition
$$0 \le \eta < \frac{(1 - \gamma \|\Gamma\|_{\infty}) \|(X^{\top} D X)^{-1}\|_{\infty}^{-1}}{\gamma \|(X^{\top} D X)^{-1}\|_{\infty} \|X\|_{\infty} \|X^{\top} D\|_{\infty} + (1 - \gamma \|\Gamma\|_{\infty})}$$
holds, then $\eta \|(X^{\top} D X)^{-1}\|_{\infty} < 1$, since $\frac{1 - \gamma \|\Gamma\|_{\infty}}{\gamma \|(X^{\top} D X)^{-1}\|_{\infty} \|X\|_{\infty} \|X^{\top} D\|_{\infty} + (1 - \gamma \|\Gamma\|_{\infty})} < 1$. Then, using Gelfand's formula in Lemma A.4, we can easily prove that the spectral radius of $\eta (X^{\top} D X)^{-1}$ is less than one. Next, note that for any two square matrices $A$ and $B$, $(A + B)^{-1} = \sum_{i=0}^{\infty} (-A^{-1} B)^i A^{-1}$ if the spectral radius of $A^{-1} B$ is less than one. Using this fact, one has
$$\gamma \|\Gamma_{\eta}\|_{\infty} = \gamma \left\| X \sum_{i=0}^{\infty} \big(-\eta (X^{\top} D X)^{-1}\big)^i (X^{\top} D X)^{-1} X^{\top} D \right\|_{\infty} \le \gamma \|X (X^{\top} D X)^{-1} X^{\top} D\|_{\infty} + \gamma \|(X^{\top} D X)^{-1}\|_{\infty} \sum_{i=1}^{\infty} \eta^i \|X (X^{\top} D X)^{-i} X^{\top} D\|_{\infty}$$
$$\le \gamma \|\Gamma\|_{\infty} + \frac{\gamma \eta \|(X^{\top} D X)^{-1}\|_{\infty}^2 \|X\|_{\infty} \|X^{\top} D\|_{\infty}}{1 - \eta \|(X^{\top} D X)^{-1}\|_{\infty}}.$$
Therefore, $\gamma \|\Gamma_{\eta}\|_{\infty} < 1$ holds if $\gamma \|\Gamma\|_{\infty} + \frac{\gamma \eta \|(X^{\top} D X)^{-1}\|_{\infty}^2 \|X\|_{\infty} \|X^{\top} D\|_{\infty}}{1 - \eta \|(X^{\top} D X)^{-1}\|_{\infty}} < 1$. Rearranging terms, one gets the desired conclusion.

A.8 Proof of Lemma 3.5

From the definition of $\Gamma_{\eta}$ in (10), we have
$$\gamma \|\Gamma_{\eta}\|_{\infty} \le \gamma \|X\|_{\infty} \|(X^{\top} D X + \eta I)^{-1}\|_{\infty} \|X^{\top} D\|_{\infty} \le \gamma \frac{1}{|S||A|} \cdot \frac{1}{a + \eta} < 1.$$
The second inequality follows from the assumptions $X^{\top} D X = a I$ and $\|X\|_2 \le 1$, and the last inequality follows from the condition $a |S||A| \ge 1$.

A.9 Proof of Lemma 3.6

Proof. Since we consider the case $\eta \to \infty$, assume that $\eta > \|X^{\top} D X\|_{\infty} + \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty}$. From (7), we have
$$\|\theta^*_{\eta}\|_{\infty} = \|(X^{\top} D X + \eta I)^{-1} (X^{\top} D R + \gamma X^{\top} D P \Pi_{X\theta^*_{\eta}} X\theta^*_{\eta})\|_{\infty} \le \frac{\|X^{\top} D R\|_{\infty} + \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty} \|\theta^*_{\eta}\|_{\infty}}{\eta - \|X^{\top} D X\|_{\infty}},$$
where the inequality follows from Lemma A.10. Therefore, considering that $\eta > \|X^{\top} D X\|_{\infty} + \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty}$, we have
$$\frac{\eta - \|X^{\top} D X\|_{\infty} - \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty}}{\eta - \|X^{\top} D X\|_{\infty}} \|\theta^*_{\eta}\|_{\infty} \le \frac{\|X^{\top} D R\|_{\infty}}{\eta - \|X^{\top} D X\|_{\infty}},$$
which leads to
$$\|\theta^*_{\eta}\|_{\infty} \le \frac{\|X^{\top} D R\|_{\infty}}{\eta - \|X^{\top} D X\|_{\infty} - \gamma \|X^{\top} D\|_{\infty} \|X\|_{\infty}}.$$
As $\eta \to \infty$, the right-hand side goes to zero, i.e., $\theta^*_{\eta} \to 0$.

A.10 Proof of Lemma 3.7

Proof. The bias of the solution can be bounded using simple algebraic inequalities:
$$\|X\theta^*_{\eta} - Q^*\|_{\infty} \le \|\Gamma_{\eta} T(X\theta^*_{\eta}) - \Gamma_{\eta} Q^*\|_{\infty} + \|\Gamma_{\eta} Q^* - Q^*\|_{\infty} \le \|\Gamma_{\eta}\|_{\infty} \|T(X\theta^*_{\eta}) - Q^*\|_{\infty} + \|\Gamma_{\eta} Q^* - Q^*\|_{\infty} = \|\Gamma_{\eta}\|_{\infty} \|T(X\theta^*_{\eta}) - T(Q^*)\|_{\infty} + \|\Gamma_{\eta} Q^* - Q^*\|_{\infty} \le \gamma \|\Gamma_{\eta}\|_{\infty} \|X\theta^*_{\eta} - Q^*\|_{\infty} + \|\Gamma_{\eta} Q^* - Q^*\|_{\infty}.$$
The first inequality follows from the triangle inequality and $X\theta^*_{\eta} = \Gamma_{\eta} T(X\theta^*_{\eta})$, the equality follows from the fact that $Q^*$ solves the optimal Bellman equation, and the last inequality follows from the contraction property of the Bellman operator. Noting that $\gamma \|\Gamma_{\eta}\|_{\infty} < 1$, we have $\|X\theta^*_{\eta} - Q^*\|_{\infty} \le \frac{1}{1 - \gamma \|\Gamma_{\eta}\|_{\infty}} \|\Gamma_{\eta} Q^* - Q^*\|_{\infty}$. This finishes the proof.

A.11 Proofs to check Assumption A.2 for Theorem 5.2

In this section, we provide the omitted proofs that verify Assumption A.2 in Appendix Section A.2 so as to apply the Borkar and Meyn theorem in Lemma A.1. First of all, Lipschitzness of $f(\theta)$ ensures a unique solution of the O.D.E.

Lemma A.15 (Lipschitzness). Let
$$f(\theta) = -(X^{\top} D X + \eta I)\theta + \gamma X^{\top} D P \Pi_{X\theta} X\theta + X^{\top} D R. \quad (21)$$
Then, $f(\theta)$ is globally Lipschitz continuous.

Proof. Lipschitzness of $f(\theta)$ can be proven as follows:
$$\|f(\theta) - f(\theta')\|_{\infty} \le \|(X^{\top} D X + \eta I)(\theta - \theta')\|_{\infty} + \gamma \|X^{\top} D P (\Pi_{X\theta} X\theta - \Pi_{X\theta'} X\theta')\|_{\infty} \le \|X^{\top} D X + \eta I\|_{\infty} \|\theta - \theta'\|_{\infty} + \gamma \|X^{\top} D P\|_{\infty} \|\Pi_{X\theta} X\theta - \Pi_{X\theta'} X\theta'\|_{\infty} \le \big( \|X^{\top} D X + \eta I\|_{\infty} + \gamma \|X^{\top} D P\|_{\infty} \|X\|_{\infty} \big) \|\theta - \theta'\|_{\infty}.$$
The last inequality follows from the non-expansiveness of the max-operator. Therefore, $f(\theta)$ is Lipschitz continuous with respect to $\|\cdot\|_{\infty}$.

Next, the existence of the limiting O.D.E. of (16) can be proved using the fact that the greedy policy is invariant under constant multiplication when linear function approximation is used.

Lemma A.16 (Existence of the limiting O.D.E. and stability). Let
$$f(\theta) = (-X^{\top} D X - \eta I)\theta + \gamma X^{\top} D P \Pi_{X\theta} X\theta + X^{\top} D R. \quad (22)$$
If $\eta$ satisfies (15), then the limiting O.D.E. of (22) exists and its origin is asymptotically stable.

Proof. The existence of the limiting O.D.E. follows from the homogeneity of the greedy policy, $\Pi_{X(c\theta)} = \Pi_{X\theta}$ for $c > 0$:
$$f(c\theta) = -(X^{\top} D X + \eta I)(c\theta) + \gamma X^{\top} D P \Pi_{X(c\theta)} X(c\theta) + X^{\top} D R, \qquad \lim_{c \to \infty} \frac{f(c\theta)}{c} = \big( -X^{\top} D X - \eta I + \gamma X^{\top} D P \Pi_{X\theta} X \big)\theta =: f_{\infty}(\theta).$$
This can be seen as a switching system and, from Lemma A.8 and Lemma A.13, we can apply Lemma 2.5.
Therefore, the origin is asymptotically stable.

Lastly, we check the conditions for the martingale difference sequence.

Lemma A.17 (Martingale difference sequence $m_k$ and square integrability). We have $\mathbb{E}[m_{k+1} \mid \mathcal{G}_k] = 0$ and $\mathbb{E}[\|m_{k+1}\|_2^2 \mid \mathcal{G}_k] \le C_0(1 + \|\theta_k\|_2^2)$, where $C_0 := \max(12 X_{\max}^2 R_{\max}^2,\ 12\gamma X_{\max}^4 + 4\eta^2)$.

Proof. To show that $\{m_k, k \in \mathbb{N}\}$ is a martingale difference sequence with respect to the sigma-algebra $\mathcal{G}_k$, we first show that the expectation of $m_{k+1}$ conditioned on $\mathcal{G}_k$ is zero: $\mathbb{E}[m_{k+1} \mid \mathcal{G}_k] = 0$. This follows from the definitions of $b$, $C$, and $A_{\pi_{X\theta}}$. The boundedness $\mathbb{E}[\|m_k\|^2] < \infty$ for $k \in \mathbb{N}$ also follows from simple algebraic inequalities. Therefore, $\{m_k, k \in \mathbb{N}\}$ is a martingale difference sequence. Now, we show that $\mathbb{E}[\|m_{k+1}\|_2^2 \mid \mathcal{G}_k] \le C_0(\|\theta_k\|_2^2 + 1)$. Using simple algebraic inequalities, we have
$$\mathbb{E}[\|m_{k+1}\|_2^2 \mid \mathcal{G}_k] = \mathbb{E}\big[\|\delta_k x(s_k, a_k) - \eta\theta_k - \mathbb{E}_{\mu}[\delta_k x(s_k, a_k) - \eta\theta_k]\|_2^2 \mid \mathcal{G}_k\big] \le \mathbb{E}[\|\delta_k x(s_k, a_k) - \eta\theta_k\|_2^2 \mid \mathcal{G}_k] + \|\mathbb{E}_{\mu}[\delta_k x(s_k, a_k) - \eta\theta_k]\|_2^2 \le 2\,\mathbb{E}[\|\delta_k x(s_k, a_k) - \eta\theta_k\|_2^2 \mid \mathcal{G}_k]$$
$$\le 4\,\mathbb{E}[\|\delta_k x(s_k, a_k)\|_2^2 \mid \mathcal{G}_k] + 4\eta^2 \|\theta_k\|_2^2 \le 12 X_{\max}^2\, \mathbb{E}\big[|r_{k+1}|^2 + |\gamma \max_a x(s_{k+1}, a)^{\top}\theta_k|^2 + |x(s_k, a_k)^{\top}\theta_k|^2 \mid \mathcal{G}_k\big] + 4\eta^2 \|\theta_k\|_2^2 \le 12 X_{\max}^2 R_{\max}^2 + 12\gamma X_{\max}^4 \|\theta_k\|_2^2 + 4\eta^2 \|\theta_k\|_2^2 \le C_0(1 + \|\theta_k\|_2^2),$$
where $C_0 := \max(12 X_{\max}^2 R_{\max}^2,\ 12\gamma X_{\max}^4 + 4\eta^2)$. The fourth inequality follows from the fact that $\|a + b + c\|_2^2 \le 3\|a\|_2^2 + 3\|b\|_2^2 + 3\|c\|_2^2$. Together with the above inequality, the square integrability of $m_k$ for $k \in \mathbb{N}$ follows from the recursive update of $\theta_k$ in (14). This completes the proof.

A.12 Proof of Theorem 5.1

Before moving on to the proof of Theorem 5.1, in order to prove stability using the upper and lower systems, we need to introduce some notions such as the quasi-monotone function and the vector comparison principle. We first introduce the notion of a quasi-monotone increasing function, which is a prerequisite for the comparison principle for multidimensional vector systems.

Definition A.18 (Quasi-monotone function). Consider a vector-valued function $f : \mathbb{R}^n \to \mathbb{R}^n$ with $f := [f_1\ f_2\ \cdots\ f_n]^{\top}$, where $f_i : \mathbb{R}^n \to \mathbb{R}$ for $i \in \{1, 2, \ldots, n\}$. $f$ is said to be quasi-monotone increasing if $f_i(x) \le f_i(y)$ holds for all $i \in \{1, 2, \ldots, n\}$ and all $x, y \in \mathbb{R}^n$ such that $x_i = y_i$ and $x_j \le y_j$ for all $j \ne i$.

Based on the notion of a quasi-monotone function, we introduce the vector comparison principle.

Lemma A.19 (Vector Comparison Principle, Hirsch and Smith [2006]). Suppose that $f$ and $\underline{f}$ are globally Lipschitz continuous. Let $x_t$ be a solution of the system $\frac{d}{dt}x_t = f(x_t)$, $x_0 \in \mathbb{R}^n$, $t \ge 0$. Assume that $f$ is quasi-monotone increasing, and let $v_t$ be a solution of the system $\frac{d}{dt}v_t = \underline{f}(v_t)$, $v_0 < x_0$, $t \ge 0$, where $\underline{f}(v) \le f(v)$ holds for any $v \in \mathbb{R}^n$. Then, $v_t \le x_t$ for all $t \ge 0$.

The vector comparison lemma can be used to bound the state trajectory of the original system by those of the upper and lower systems. Then, proving global asymptotic stability of the upper and lower systems leads to global asymptotic stability of the original system. We now give the proof of Theorem 5.1.

Proof. First, we construct the upper comparison part. Noting that
$$\gamma X^{\top} D P \Pi_{X\theta^*_{\eta}} X\theta^*_{\eta} \ge \gamma X^{\top} D P \Pi_{X\theta} X\theta^*_{\eta} \quad (23)$$
and
$$\gamma X^{\top} D P \Pi_{X(\theta - \theta^*_{\eta})} X(\theta - \theta^*_{\eta}) \ge \gamma X^{\top} D P \Pi_{X\theta} X(\theta - \theta^*_{\eta}), \quad (24)$$
we define $f(y)$ and $\overline{f}(y)$ as follows:
$$\overline{f}(y) = \big( -X^{\top} D X - \eta I + \gamma X^{\top} D P \Pi_{Xy} X \big) y, \qquad f(y) = \big( -X^{\top} D X - \eta I + \gamma X^{\top} D P \Pi_{X(y + \theta^*_{\eta})} X \big) y + \gamma X^{\top} D P \big( \Pi_{X(y + \theta^*_{\eta})} - \Pi_{X\theta^*_{\eta}} \big) X\theta^*_{\eta}.$$
Using (23) and (24), we have $\overline{f}(y) \ge f(y)$. Here, $f$ is the vector field of the original system in (17), and $\overline{f}$ becomes that of the upper system; $\overline{f}$ defines a switched linear system. Now, consider the O.D.E. systems $\frac{d}{dt}\theta^u_t = \overline{f}(\theta^u_t)$, $\theta^u_0 > \theta_0$, and $\frac{d}{dt}\theta_t = f(\theta_t)$. Next, we prove the quasi-monotone increasing property of $\overline{f}$.
We now give the proof of Theorem 5.1.

Proof. First, we construct the upper comparison system. Noting that
$$\gamma X^\top DP\Pi_{X\theta_\eta}X\theta_\eta \ge \gamma X^\top DP\Pi_{X\theta}X\theta_\eta \quad (23)$$
and
$$\gamma X^\top DP\Pi_{X(\theta-\theta_\eta)}X(\theta-\theta_\eta) \ge \gamma X^\top DP\Pi_{X\theta}X(\theta-\theta_\eta), \quad (24)$$
we define $\bar f(y)$ and $f(y)$ as follows:
$$\bar f(y) = \big(-X^\top DX - \eta I + \gamma X^\top DP\Pi_{Xy}X\big)y,$$
$$f(y) = \big(-X^\top DX - \eta I + \gamma X^\top DP\Pi_{X(y+\theta_\eta)}X\big)y + \gamma X^\top DP\big(\Pi_{X(y+\theta_\eta)} - \Pi_{X\theta_\eta}\big)X\theta_\eta.$$
Using (23) and (24), we have $f(y) \le \bar f(y)$. Here, $f$ is the right-hand side of the O.D.E. of the original system in (17), and $\bar f$ becomes the right-hand side of the O.D.E. of the upper comparison system; $\bar f$ defines a switched linear system. Now, consider the O.D.E. systems
$$\frac{d}{dt}\theta^u_t = \bar f(\theta^u_t),\quad \theta^u_0 > \theta_0, \qquad \frac{d}{dt}\theta_t = f(\theta_t).$$
Next, we prove the quasi-monotone increasing property of $\bar f$. For any $y\in\mathbb{R}^h$, consider a non-negative vector $p\in\mathbb{R}^h$ whose $i$-th element is zero. Then, for any $1\le i\le h$, we have
$$\begin{aligned}
e_i^\top \bar f(y+p) &= e_i^\top\big(-X^\top DX - \eta I + \gamma X^\top DP\Pi_{X(y+p)}X\big)(y+p) \\
&= -e_i^\top(X^\top DX + \eta I)y - \eta e_i^\top p + \gamma e_i^\top X^\top DP\Pi_{X(y+p)}X(y+p) \\
&\ge -e_i^\top(X^\top DX + \eta I)y + \gamma e_i^\top X^\top DP\Pi_{Xy}Xy = e_i^\top \bar f(y),
\end{aligned}$$
where the inequality comes from $e_i^\top X^\top DXp = 0$ due to Assumption 2.2 and $e_i^\top p = 0$ since the $i$-th element of $p$ is zero. Therefore, by Lemma A.19, we can conclude that $\theta_t \le \theta^u_t$. The condition (S1) in (15) ensures that the switching matrices have a strictly negatively row dominating diagonal. Therefore, from Lemma A.13 and Lemma 2.5, the global asymptotic stability of the origin follows. Likewise, the condition (S2) in (15) ensures that the matrices are all negative-definite, implying that the switching system shares $V(\theta) = \|\theta\|_2^2$ as a common Lyapunov function. Therefore, we can conclude that the upper comparison system is globally asymptotically stable.

For the lower comparison part, noting that $\gamma X^\top DP\Pi_{X\theta}X\theta \ge \gamma X^\top DP\Pi_{X\theta_\eta}X\theta$, we can define $f(y)$ and $\underline{f}(y)$ such that $\underline{f}(y) \le f(y)$ as follows:
$$f(y) = -X^\top DXy - \eta y + \gamma X^\top DP\Pi_{Xy}Xy + X^\top DR,$$
$$\underline{f}(y) = -X^\top DXy - \eta y + \gamma X^\top DP\Pi_{X\theta_\eta}Xy + X^\top DR.$$
The corresponding O.D.E. systems become
$$\frac{d}{dt}\theta_t = f(\theta_t), \qquad \frac{d}{dt}\theta^l_t = \underline{f}(\theta^l_t),\quad \theta^l_0 < \theta_0. \quad (25)$$
Proving the quasi-monotonicity of $f$ is similar to the previous step. Consider a non-negative vector $p\in\mathbb{R}^h$ whose $i$-th element is zero. Then, we have
$$\begin{aligned}
e_i^\top f(y+p) &= e_i^\top\big(-(X^\top DX + \eta I)(y+p) + \gamma X^\top DP\Pi_{X(y+p)}X(y+p) + X^\top DR\big) \\
&= e_i^\top\big(-(X^\top DX + \eta I)y + \gamma X^\top DP\Pi_{X(y+p)}X(y+p) + X^\top DR\big) \\
&\ge e_i^\top\big(-(X^\top DX + \eta I)y + \gamma X^\top DP\Pi_{Xy}Xy + X^\top DR\big) = e_i^\top f(y).
\end{aligned}$$
The second equality holds since $X^\top DX$ is a diagonal matrix and $p_i = 0$. Therefore, by Lemma A.19, we can conclude that $\theta^l_t \le \theta_t$. The lower comparison part is a linear system without an affine term. Hence, following similar lines as in the proof of stability of the upper comparison system, we can conclude that (25) is globally asymptotically stable.

To prove uniqueness of the equilibrium point, assume that there exist two different equilibrium points $\theta^e_1$ and $\theta^e_2$. Global asymptotic stability implies that, regardless of the initial state, $\theta_t \to \theta^e_1$ and $\theta_t \to \theta^e_2$. However, this is a contradiction if $\theta^e_1 \ne \theta^e_2$. Therefore, the equilibrium point is unique.

A.13 Proof of Theorem 5.2

Proof. To apply Lemma A.1, let us check Assumption A.2.
1. The first and second statements of Assumption A.2 follow from Lemma A.16.
2. The third statement of Assumption A.2 follows from Theorem 5.1.
3. The fourth statement of Assumption A.2 follows from Lemma A.17.
Since we assumed a Robbins-Monro step-size, we can now apply Lemma A.1 to complete the proof.
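As a complement to the analysis, and to the example in Appendix A.14 below, here is a small sketch, under our own illustrative assumptions about the problem data, of computing the regularized solution $\theta_\eta$ of an equation of the form (7) by the fixed-point iteration $\theta \leftarrow (X^\top DX + \eta I)^{-1}(X^\top DR + \gamma X^\top DP\Pi_{X\theta}X\theta)$. For sufficiently large η this map is a contraction in the spirit of Lemma 3.5; the sketch simply iterates and reports the residual.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, h = 3, 2, 2
gamma, eta = 0.99, 10.0                       # illustrative values only

X = rng.uniform(0.0, 1.0, size=(S * A, h))    # row (s, a) = s * A + a
P = rng.dirichlet(np.ones(S), size=S * A)     # P[(s, a), s'] = P(s' | s, a)
R = rng.uniform(0.0, 1.0, size=S * A)         # expected reward per (s, a)
d = np.full(S * A, 1.0 / (S * A))             # uniform state-action distribution
D = np.diag(d)

def greedy_selector(theta):
    """Pi_{X theta}: for each state, pick the action maximizing x(s, a)^T theta."""
    q = (X @ theta).reshape(S, A)
    Pi = np.zeros((S, S * A))
    for s in range(S):
        Pi[s, s * A + int(np.argmax(q[s]))] = 1.0
    return Pi

# Fixed-point iteration for the regularized projected Bellman equation:
#   theta = (X^T D X + eta I)^(-1) (X^T D R + gamma X^T D P Pi_{X theta} X theta)
M = np.linalg.inv(X.T @ D @ X + eta * np.eye(h))
theta = np.zeros(h)
for _ in range(200):
    theta_next = M @ (X.T @ D @ R + gamma * X.T @ D @ P @ greedy_selector(theta) @ X @ theta)
    if np.linalg.norm(theta_next - theta) < 1e-12:
        break
    theta = theta_next

residual = theta - M @ (X.T @ D @ R + gamma * X.T @ D @ P @ greedy_selector(theta) @ X @ theta)
print("theta_eta ~", theta, " residual norm:", np.linalg.norm(residual))
```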
A.14 Example for non-existence of a solution of the PBE

Let us define an MDP whose state transition diagram is given in Figure 3. The cardinalities of the state space and action space are $|S| = 3$ and $|A| = 2$, respectively. The corresponding feature matrix, state transition matrix, and rewards are specified columnwise (the order of the elements of each column follows the order of the corresponding definitions), with $\gamma = 0.99$ and $d(s,a) = 1/6$ for every $s\in S$, $a\in A$.

Figure 3: State transition diagram.

Note that for this Markov decision process, taking action $a = 1$ and action $a = 2$ at state $s = 2$ yields the same transition probabilities and reward, and similarly for state $s = 3$. In this MDP, there are effectively only two distinct deterministic policies, denoted by $\pi_1$ and $\pi_2$, which select action $a = 1$ and action $a = 2$ at state $s = 1$, respectively, i.e., $\pi_1(1) = 1$ and $\pi_2(1) = 2$. The actions at states $s = 2$ and $s = 3$ do not affect the overall results. The motivation of this MDP is as follows. Substitute $\pi_{X\theta}$ in (5) with $\pi_1$ and $\pi_2$. Then the corresponding solutions become $\theta^{e_1} := [\theta^{e_1}_1\ \theta^{e_1}_2]^\top$ and $\theta^{e_2} := [\theta^{e_2}_1\ \theta^{e_2}_2]^\top$. If $\pi_1$ is the policy corresponding to the solution of (5), it means that action $a = 1$ is greedily selected at state $s = 1$. Therefore, $Q^{\pi_1}(1,1) > Q^{\pi_1}(1,2)$ should be satisfied. However, since $Q^{\pi_1}(1,1) = x(1,1)^\top\theta^{e_1} \approx -0.85$ is smaller than $Q^{\pi_1}(1,2) = x(1,2)^\top\theta^{e_1} \approx -0.72$, this is a contradiction. The same logic applies to the case of $\pi_2$. Therefore, neither of them becomes a solution of (5). On the other hand, considering (7) with $\eta = 4$, which satisfies (15), the solution for each policy becomes $\theta^{e_1}_1 \approx 0.069$, $\theta^{e_1}_2 \approx 0.032$ and $\theta^{e_2}_1 \approx 0.069$, $\theta^{e_2}_2 \approx 0.035$, respectively. For $\pi_1$ and $\pi_2$, we have $x(1,1)^\top\theta^{e_1} < x(1,2)^\top\theta^{e_1}$ and $x(1,1)^\top\theta^{e_2} < x(1,2)^\top\theta^{e_2}$, respectively. Hence, $\theta^{e_2}$ satisfies (7) and becomes the unique solution.

A.15 Discussion on (15)

In this section, we provide further discussion on (15). The two conditions (S1) and (S2) are meant to make a matrix have a strictly negatively row dominating diagonal or be negative definite, respectively. We first provide cases showing that a matrix with a strictly negatively row dominating diagonal is not necessarily negative definite, and vice versa. We consider MDPs with $\gamma = 0.99$.

A matrix with a strictly negatively row dominating diagonal that is not negative definite: Consider a two-state MDP with only a single action for each state, so that the policy matrix is $\Pi = I$, and a suitable choice of the feature matrix $X$ and transition matrix $P$. Then, we have
$$M := -X^\top DX + \gamma X^\top DP\Pi X \approx \begin{bmatrix} -78 & 77 \\ 24 & -24.2 \end{bmatrix}.$$
This matrix has a strictly negatively row dominating diagonal, but $M + M^\top$ is not negative definite.

A negative-definite matrix without a strictly negatively row dominating diagonal: Consider another two-state MDP with a single action for each state. Then, we have
$$M := -X^\top DX + \gamma X^\top DP\Pi X \approx \begin{bmatrix} -0.25 & 2.25 \\ -2.25 & -20 \end{bmatrix}.$$
Here $M + M^\top$ is negative definite, but $M$ does not have a strictly negatively row dominating diagonal.

Now, we provide an example where the conditions (S1) and (S2) in (15) do not imply each other, i.e., there are cases such that (S1) holds but not (S2), or (S2) holds but not (S1). As for the condition in (15), consider a two-state MDP with a single action for each state, $\Pi = I$, and two different transition matrices $P_1$ and $P_2$. First, considering $P_1$, the threshold values of $\eta$ for (S1) and (S2) are approximately 7.9 and 196, respectively. Meanwhile, considering $P_2$ as the transition matrix, they are approximately 7.9 and 2, respectively. Therefore, (S1) and (S2) do not necessarily imply each other.

A.16 Pseudo-code

Algorithm 1 Regularized Q-learning
1: Initialize $\theta_0 \in \mathbb{R}^h$.
2: Set the step-size $(\alpha_k)_{k=0}^{\infty}$ and the behavior policy $\mu$.
3: for iteration $k = 0, 1, \dots$ do
4:   Sample $s_k \sim d^\mu$ and $a_k \sim \mu$.
5:   Sample $s'_k \sim P(s_k, a_k, \cdot)$ and $r_{k+1} = r(s_k, a_k, s'_k)$.
6:   Update $\theta_k$ using (13).
7: end for

B Experiment

B.1 Experiments

In this section, we present experimental results for well-known environments from Tsitsiklis and Van Roy [1996] and Baird [1995], in which Q-learning with linear function approximation diverges. In Appendix B.6, we also compare performance on the Mountain Car environment [Sutton and Barto, 2018], where Q-learning performs well. In Appendix B.5, we show experimental results under various step-sizes and values of η. We also show trajectories of the O.D.E.s of the upper and lower comparison systems to illustrate the theoretical results.
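All of the experiments below run Algorithm 1. For concreteness, here is a minimal sketch of a single update step; we assume that the update (13) takes the regularized temporal-difference form $\theta_{k+1} = \theta_k + \alpha_k(\delta_k x(s_k,a_k) - \eta\theta_k)$, which is consistent with the O.D.E. in (21), and all interface names and numerical values are illustrative.

```python
import numpy as np

def reg_q_step(theta, alpha, eta, gamma, x_sa, x_next_all, r):
    """One Reg Q update, assuming (13) has the regularized TD form below.

    theta      : current parameter vector (shape (h,))
    x_sa       : feature vector x(s_k, a_k) of the sampled state-action pair
    x_next_all : matrix whose rows are x(s'_k, a) for every action a
    r          : sampled reward r_{k+1}
    """
    # Q-learning TD error with linear function approximation.
    delta = r + gamma * np.max(x_next_all @ theta) - x_sa @ theta
    # Regularized update: standard Q-learning direction plus an l2 pull toward zero.
    return theta + alpha * (delta * x_sa - eta * theta)

# Illustrative usage with random features (h = 4, two actions at the next state).
rng = np.random.default_rng(0)
theta = np.zeros(4)
for k in range(1000):
    x_sa = rng.uniform(-1, 1, size=4)
    x_next_all = rng.uniform(-1, 1, size=(2, 4))
    r = rng.normal()
    theta = reg_q_step(theta, alpha=0.05, eta=2.0, gamma=0.99,
                       x_sa=x_sa, x_next_all=x_next_all, r=r)
print(theta)
```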
B.2 θ → 2θ, Tsitsiklis and Van Roy [1996]

Even when there are only two states, Q-learning with linear function approximation can diverge [Tsitsiklis and Van Roy, 1996]. As depicted in Figure 4a in Appendix B.4, the transition from state one (with value estimate θ) to the absorbing state two (with value estimate 2θ) is deterministic, and the reward is zero at every time step. Therefore, the episode length is fixed to two. The learning rates for Greedy GQ (GGQ) and Coupled Q-Learning (CQL), which both use two learning rates, are set to 0.05 and 0.25, as in Carvalho et al. [2020], Maei et al. [2010]. Since CQL requires normalized feature values, we scaled the feature values by 1/2 as in Carvalho et al. [2020], and initialized the weights to one. We also implemented Q-learning with a target network [Zhang et al., 2021] (Qtarget), which likewise has two learning rates, without the projection step for practical reasons. We set its learning rates to 0.25 and 0.05, respectively, and the weight η to two. For Reg Q, we set the learning rate to 0.25 and the weight η to two. Results are averaged over five runs. In Figure 2a, we can see that Reg Q achieves the fastest convergence.

B.3 Baird Seven-Star Counterexample, Baird [1995]

Baird [1995] considers an overparameterized example in which Q-learning with linear function approximation diverges. The overall state transition is depicted in Figure 4b in Appendix B.4. There are seven states and two actions for each state, the solid and the dashed action. The number of features is h = 15. Each episode is initialized at a state chosen uniformly at random. The solid action leads to the seventh state, while the dashed action transitions uniformly at random to one of the states other than the seventh state. At the seventh state, the episode ends with probability 1/100. The behavior policy selects the dashed action with probability 5/6 and the solid action with probability 1/6. Since CQL in Carvalho et al. [2020] converges under normalized feature values, we scaled the feature matrix by 1/5. The weights are initialized to one, except θ7 = 2. The learning rates and the weight η are set the same as in the previous experiment. As shown in Figure 2b, our Reg Q shows the fastest convergence among the convergent algorithms.

B.4 Diagrams for θ → 2θ and the Baird Seven-Star Counterexample

The state transition diagrams of the θ → 2θ and Baird seven-star examples are depicted in Figure 4a and Figure 4b, respectively.

Figure 4: Counterexamples where Q-learning with linear function approximation diverges ((a) θ → 2θ; (b) Baird seven-star counterexample).

B.5 Experiments with varying hyperparameters

Figure 5: Learning curves under different learning rates and regularization coefficients ((a) θ → 2θ, learning rate 0.01; (b) θ → 2θ, learning rate 0.05; (c) Baird, learning rate 0.01; (d) Baird, learning rate 0.05).

In Figure 5, we ran experiments with η ∈ {2^-2, 2^-1, 1, 2} and learning rates 0.01 and 0.05. Overall, we can see that convergence gets faster as η increases.

B.6 Mountain Car [Sutton and Barto, 2018] experiment

Mountain Car is an environment whose state consists of position and velocity, both continuous. The actions are discrete: accelerate to the left, stay neutral, and accelerate to the right. The goal is to reach the top of the mountain as quickly as possible, since the agent receives a reward of −1 at every time step. We use tile-coding [Sutton and Barto, 2018] to discretize the states. We experimented with various numbers of tiles; with an appropriate η, Reg Q achieves performance comparable to that of Q-learning. We ran 1000 episodes for training, and the episode reward was averaged over 100 runs at test time. From Table 1, with an appropriate η, Reg Q performs comparably to Q-learning.

Table 1: Result of episode reward, step size = 0.1. The columns correspond to η, and the rows correspond to the number of tiles.

Tiles     η = 0             η = 0.01          η = 0.05          η = 0.1
2 × 2     199.993 ± 0.005   200.0 ± 0.0       199.28 ± 0.074    199.993 ± 0.005
4 × 4     196.631 ± 0.179   189.903 ± 0.225   194.178 ± 0.166   196.631 ± 0.179
8 × 8     185.673 ± 0.305   163.08 ± 0.248    185.103 ± 0.219   185.673 ± 0.305
16 × 16   166.893 ± 0.33    158.152 ± 0.251   167.934 ± 0.238   166.893 ± 0.33
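The Mountain Car experiment above relies on tile-coding; the following is a generic grid tile coder in the spirit of Sutton and Barto [2018], not necessarily the exact implementation used for Table 1, and with hypothetical parameter choices, included only to make the feature construction concrete.

```python
import numpy as np

def tile_code(state, low, high, num_tilings=4, tiles_per_dim=8):
    """Generic grid tile coder (illustrative, not the paper's exact implementation).

    Returns a binary feature vector with one active tile per tiling, where each
    tiling is a uniform grid over [low, high] shifted by a fraction of a cell.
    """
    state, low, high = map(np.asarray, (state, low, high))
    dim = state.size
    features = np.zeros(num_tilings * tiles_per_dim ** dim)
    cell = (high - low) / tiles_per_dim                      # cell width per dimension
    for t in range(num_tilings):
        offset = (t / num_tilings) * cell                    # shift each tiling slightly
        idx = np.floor((state - low + offset) / cell).astype(int)
        idx = np.clip(idx, 0, tiles_per_dim - 1)             # keep indices inside the grid
        flat = np.ravel_multi_index(tuple(idx), (tiles_per_dim,) * dim)
        features[t * tiles_per_dim ** dim + flat] = 1.0
    return features

# Example: Mountain Car state (position, velocity) with the standard state bounds.
phi = tile_code(state=[-0.5, 0.0], low=[-1.2, -0.07], high=[0.6, 0.07],
                num_tilings=1, tiles_per_dim=8)
print(phi.shape, phi.sum())   # one active tile per tiling
```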
B.7 O.D.E. experiment

Let us consider an MDP with $|S| = 2$, $|A| = 2$, and the following parameters:
$$X = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 1 & 0 \\ 0 & 2 \end{bmatrix}, \qquad D = \begin{bmatrix} 1/4 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 1/4 & 0 \\ 0 & 0 & 0 & 1/4 \end{bmatrix}, \qquad P = \begin{bmatrix} 0.5 & 0.5 \\ 1 & 0 \\ 0.5 & 0.5 \\ 0.25 & 0.75 \end{bmatrix}, \qquad \gamma = 0.99.$$
For this MDP, we illustrate the trajectories of the upper and lower comparison systems. Each state-action pair is sampled uniformly at random, and the reward is one at every time step. The regularization weight $\eta = 2.25$ is chosen to satisfy the conditions of Theorem 5.1. From Figure 6, we can see that the trajectory of the original system is bounded by the trajectories of the lower and upper comparison systems.

Figure 6: O.D.E. results ((a) $\theta_1 - \theta^e_1$; (b) $\theta_2 - \theta^e_2$).

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: We have rigorously proved the convergence of the proposed Q-learning algorithm and provided a thorough analysis of the quality of its solution. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We have clearly stated the assumptions used throughout the paper. Moreover, as a limitation, we discussed that the solution the algorithm converges to is biased, and provided a thorough analysis of its error bound. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting.
Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We have provided a thorough analysis in the Appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: The experimental environments are fairly simple and well known in the community, making the results easy to reproduce. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have attached the code in the supplementary files. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We have stated the choice of step-size, which is the only hyper-parameter. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. 
Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We have plotted the error bar. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: NA Justification: Our experiments can simply run on normal computer because we do not require any heavy computation including using GPU. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The authors have checked Neur IPS Code of Ethics. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: The paper is a theoretical paper. Guidelines: The answer NA means that there is no societal impact of the work performed. 
If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: This is a theoretical paper. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification:We did not use existing assets. Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. 
For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: There are no such risks. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.