# Backstepping Temporal Difference Learning

Published as a conference paper at ICLR 2023

Han-Dong Lim
Department of Electrical Engineering, KAIST, Daejeon, 34141, South Korea
limaries30@kaist.ac.kr

Donghwan Lee
Department of Electrical Engineering, KAIST, Daejeon, 34141, South Korea
donghwan@kaist.ac.kr

ABSTRACT

Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from the divergence issue when the off-policy scheme is used together with linear function approximation. To overcome the divergent behavior, several off-policy TD-learning algorithms, including gradient-TD learning (GTD) and TD-learning with correction (TDC), have been developed. In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective, and propose a new convergent algorithm. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Finally, convergence of the proposed algorithm is experimentally verified in environments where the standard TD-learning is known to be unstable.

1 INTRODUCTION

Since Mnih et al. (2015) demonstrated that deep reinforcement learning (RL) outperforms humans in several video games (Atari 2600 games), significant advances have been made in RL theory and algorithms. For instance, Van Hasselt et al. (2016); Lan et al. (2020); Chen et al. (2021) proposed variants of the so-called deep Q-network (Mnih et al., 2015) that achieve higher scores in Atari games than the original deep Q-network. An improved deep RL agent was developed in Badia et al. (2020) that performs better than the average human across 57 Atari games. Beyond video games, Schrittwieser et al. (2020) have also shown that an RL agent can self-learn chess, Go, and Shogi. Furthermore, RL has shown great success in real-world applications, e.g., robotics (Kober et al., 2013), healthcare (Gottesman et al., 2019), and recommendation systems (Chen et al., 2019).

Despite the practical success of deep RL, there is still a gap between theory and practice. One of the notorious phenomena is the deadly triad (Sutton & Barto, 2018), the divergence of an algorithm when function approximation, off-policy learning, and bootstrapping are used together. One of the most fundamental algorithms, the so-called temporal-difference (TD) learning (Sutton, 1988), is known to diverge under the deadly triad, and several works have tried to fix this issue for decades. In particular, the seminal works Sutton et al. (2008; 2009) introduced the so-called GTD, gradient-TD2 (GTD2), and TDC, which are off-policy and have been proved to be convergent with linear function approximation. More recently, Ghiassian et al. (2020) suggested a regularized version of TDC, called TD-learning with regularized correction (TDRC), and showed its favorable features under off-policy settings. Moreover, Lee et al. (2021) developed several variants of GTD based on a primal-dual formulation. On the other hand, backstepping control (Khalil, 2015) is a popular method for designing stable controllers for nonlinear systems with special structures. The design technique offers a wide range of stable controllers, and has proven robust under various settings.
It has been used in various fields including quadrotor helicopters (Madani & Benallegue, 2006), mobile robots (Fierro & Lewis, 1997), and ship control (Fossen & Strand, 1999). Using the backstepping control technique, in this paper we develop a new convergent off-policy TD-learning algorithm which is single time-scale. In particular, the goal of this paper is to introduce a new unifying framework to design off-policy TD-learning algorithms under linear function approximation. The main contributions are summarized as follows:

- We propose a systematic way to generate off-policy TD-learning algorithms, including GTD2 and TDC, from a control-theoretic perspective.
- Using our framework, we derive a new TD-learning algorithm, which we call backstepping TD (BTD).
- We experimentally verify its convergence and performance under various settings, including those where off-policy TD-learning is known to be unstable.

In particular, most of the previous works on off-policy TD-learning algorithms (e.g., GTD2 and TDC) are derived based on optimization perspectives, starting with an objective function. Then, the convergence is proved by establishing stability of the corresponding O.D.E. models. In this paper, we follow the reversed steps, and reveal that an off-policy TD-learning algorithm (called backstepping TD) can be derived based on control-theoretic motivations. In particular, we develop stable O.D.E. models first using the backstepping technique, and then recover the corresponding off-policy TD-learning algorithms. The new analysis reveals connections between off-policy TD-learning and notions in control theory, and provides additional insights on off-policy TD-learning with simple concepts in control theory. The sound theoretical foundation established in this paper can potentially motivate further analysis and the development of new algorithms.

Finally, we briefly summarize TD-learning algorithms that guarantee convergence under linear function approximation. GTD (Sutton et al., 2008), GTD2, and TDC (Sutton et al., 2009) have been developed to approximate the gradient of the mean squared projected Bellman error. Later, GTD and GTD2 were discovered to solve a minimax optimization problem (Macua et al., 2014; Liu et al., 2020). Such a saddle-point viewpoint of GTD has led to many interesting results, including Du et al. (2017); Dai et al. (2018); Lee et al. (2021). TDRC (Ghiassian et al., 2020) adds a term similar to a regularization term to one side of the parameter update, and tries to balance between the performance of TD and the stability of TDC. TDC++ (Ghiassian et al., 2020) also adds an additional regularization term on both sides of the parameter update. Even though TDRC shows good performance, it requires an additional parameter condition to ensure convergence, whereas TDC++ does not.

2 PRELIMINARIES

2.1 NONLINEAR SYSTEM THEORY

Nonlinear system theory will play an important role throughout this paper. Here, we briefly review the basics of nonlinear systems. Let us consider the continuous-time nonlinear system
$$\dot{x}_t = f(x_t, u_t), \quad x_0 \in \mathbb{R}^n, \qquad (1)$$
where $x_0 \in \mathbb{R}^n$ is the initial state, $t \in \mathbb{R}$, $t \ge 0$ is the time, $x_t \in \mathbb{R}^n$ is the state, $u_t \in \mathbb{R}^n$ is the control input, and $f: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear mapping. An important concept in dealing with nonlinear systems is the equilibrium point.
Considering the state-feedback law $u_t = \mu(x_t)$, the system can be written as $\dot{x}_t = f(x_t, u_t) = f(x_t, \mu(x_t)) =: \bar{f}(x_t)$, and a point $x = x_e$ in the state space is said to be an equilibrium point of (1) if it has the property that whenever the state of the system starts at $x_e$, it will remain at $x_e$ (Khalil, 2015). For $\dot{x}_t = \bar{f}(x_t)$, the equilibrium points are the real roots of the equation $\bar{f}(x) = 0$. The equilibrium point $x_e$ is said to be globally asymptotically stable if, for any initial state $x_0 \in \mathbb{R}^n$, $x_t \to x_e$ as $t \to \infty$.

An important control design problem is to construct a state-feedback law $u_t = \mu(x_t)$ such that the origin becomes the globally asymptotically stable equilibrium point of (1). To design a state-feedback law that meets such a goal, the control Lyapunov function plays a central role, which is defined in the following definition.

Definition 2.1 (Control Lyapunov function (Sontag, 2013)). A positive definite function $V: \mathbb{R}^n \to \mathbb{R}$ is called a control Lyapunov function (CLF) if for all $x \neq 0$, there exists a corresponding control input $u \in \mathbb{R}^m$ that satisfies the inequality $\nabla_x V(x)^\top f(x, u) < 0$.

Once such a CLF is found, it guarantees that there exists a control law that stabilizes the system. Moreover, the corresponding state-feedback control law can be extracted from the CLF, e.g., $\mu(x) = \arg\min_u \nabla_x V(x)^\top f(x, u)$, provided that the minimum exists and is unique. The concept of the control Lyapunov function will be used in the derivations of our main results. For the autonomous system $\dot{x}_t = f(x_t)$ and a Lyapunov function $V: \mathbb{R}^n \to \mathbb{R}$, the Lie derivative is defined as $L_f V(x) := \nabla_x V(x)^\top f(x)$, so that $\dot{V}(x_t) = L_f V(x_t)$ along the solution.

2.2 STOCHASTIC APPROXIMATION AND O.D.E. APPROACH

Including Q-learning (Watkins & Dayan, 1992) and TD-learning (Sutton, 1988), reinforcement learning algorithms can be considered as stochastic approximation (Robbins & Monro, 1951) described by
$$x_{k+1} = x_k + \alpha_k(f(x_k) + \epsilon_k), \qquad (2)$$
where $f: \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear mapping, $\alpha_k > 0$ is the step-size, and $\epsilon_k$ is an i.i.d. noise. The Borkar and Meyn theorem (Borkar & Meyn, 2000) is a well-known method to bridge the asymptotic convergence of stochastic approximation and the stability of its corresponding O.D.E. model, which can be expressed as
$$\dot{x}_t = f(x_t), \quad x_0 \in \mathbb{R}^n, \qquad (3)$$
where $x_0 \in \mathbb{R}^n$ is the initial state, and $t \in \mathbb{R}$, $t \ge 0$ is the time. The Borkar and Meyn theorem (Borkar & Meyn, 2000) states that under the conditions in Assumption 7.1 in the Appendix, global asymptotic stability of the O.D.E. (3) leads to asymptotic convergence of the stochastic approximation update (2), which is formally stated in the following lemma.

Lemma 2.1 (Borkar and Meyn theorem (Borkar & Meyn, 2000)). Suppose that Assumption 7.1 in the Appendix holds, and consider the stochastic approximation in (2). Then, for any initial $x_0 \in \mathbb{R}^n$, $\sup_{k \ge 0} \|x_k\| < \infty$ with probability one. In addition, $x_k \to x_e$ as $k \to \infty$ with probability one, where $x_e$ is the unique equilibrium point of the O.D.E. in (3).

The main idea of the Borkar and Meyn theorem is as follows: the iterates of a stochastic recursive algorithm follow the solution of its corresponding O.D.E. in the limit when the step-size satisfies the so-called Robbins–Monro condition (Robbins & Monro, 1951) in (33) in the Appendix. Therefore, by proving asymptotic stability of the O.D.E., we can induce convergence of the original algorithm. In this paper, we will use an O.D.E. model of TD-learning, which is expressed as a linear time-invariant system.
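To make the connection concrete, the following is a minimal sketch (not taken from the paper; the two-dimensional mapping, the noise level, and the $1/(k+1)$ step-size schedule are our own illustrative choices) of a stochastic approximation iteration of the form (2) whose O.D.E. model (3) has a globally asymptotically stable origin:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stable linear O.D.E.  x_dot = f(x) = -M x  with M having eigenvalues in the
# open right half-plane, so the origin is globally asymptotically stable.
M = np.array([[2.0, 0.5],
              [0.0, 1.0]])

def f(x):
    return -M @ x

x = np.array([5.0, -3.0])                       # initial iterate x_0
for k in range(20000):
    alpha_k = 1.0 / (k + 1)                     # Robbins-Monro: sum = inf, sum of squares < inf
    noise = rng.normal(scale=0.1, size=2)       # i.i.d. zero-mean noise epsilon_k
    x = x + alpha_k * (f(x) + noise)            # stochastic approximation update (2)

print(x)  # close to the equilibrium x_e = 0
```

Because the step-size sequence satisfies the Robbins–Monro condition and the O.D.E. $\dot{x}_t = -Mx_t$ has a globally asymptotically stable origin, Lemma 2.1 predicts that the iterates converge to the equilibrium $x_e = 0$ with probability one.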
2.3 BACKSTEPPING CONTROL

This section provides the concept of backstepping control (Kokotovic, 1992; Khalil, 2015), which will be the main tool in this paper to derive TD-learning algorithms. The backstepping technique is a popular tool for generating a CLF (control Lyapunov function) for nonlinear systems with specific structures. In particular, let us start with the following general nonlinear system:
$$\dot{y}_t = f(y_t) + g(y_t)x_t, \qquad (4)$$
$$\dot{x}_t = u_t,$$
where $y_t \in \mathbb{R}^m$, $x_t \in \mathbb{R}^m$ are the states, $u_t \in \mathbb{R}^m$ is the input, and $f: \mathbb{R}^m \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}$ are continuous functions. The first system is a nonlinear system with a particular affine structure, and the second system is simply an integrator. It can be seen as a cascade interconnection of two systems, where the second system's state is injected into the input of the first system. The backstepping control technique gives us a systematic way to generate a CLF for such particular nonlinear systems provided that the first system admits a CLF independently. To this end, we suppose that the first system admits a CLF. Through the backstepping approach, designing a stable control law for the above system can be summarized in the following steps:

Step 1. Consider $x_t$ in (4) as a virtual input $\bar{x}(y_t)$ (state-feedback controller), and consider the following system: $\dot{y}_t = f(y_t) + g(y_t)\bar{x}(y_t)$. Design $\bar{x}(y_t)$ such that the above system admits a CLF $V$, i.e., it admits a positive definite and radially unbounded function $V$ such that its time derivative is negative definite, i.e., $\dot{V}(y_t) < 0$ for all $y_t \neq 0$.

Step 2. Denote the error between the virtual state-feedback controller $\bar{x}(y_t)$ and the state variable $x_t$ as $z_t := x_t - \bar{x}(y_t)$. Now, rewrite the original O.D.E. in (4) with the new variables $(y_t, z_t)$:
$$\dot{y}_t = f(y_t) + g(y_t)\bar{x}(y_t) + g(y_t)z_t, \qquad \dot{z}_t = u_t - \dot{\bar{x}}(y_t).$$

Step 3. Design the control input $u_t$ such that the above system is stable. One popular choice is to consider the CLF $V_c(y_t, z_t) := V(y_t) + \|z_t\|^2/2$, where $V(y_t)$ is defined in Step 1. Then, choose $u_t$ such that the time derivative of $V_c(y_t, z_t)$ is negative definite. A simple example of designing a stabilizing control law by the backstepping technique is given in Appendix Section 7.3.

2.4 MARKOV DECISION PROCESS

In this paper, we consider a Markov decision process (MDP) characterized by the tuple $(\mathcal{S}, \mathcal{A}, P, \gamma, r)$, where $\mathcal{S} := \{1, 2, \ldots, |\mathcal{S}|\}$ stands for the finite state space, $|\mathcal{S}|$ denotes the size of $\mathcal{S}$, $\mathcal{A} := \{1, 2, \ldots, |\mathcal{A}|\}$ denotes the finite action space, $|\mathcal{A}|$ is the size of $\mathcal{A}$, $\gamma \in (0, 1)$ is the discount factor, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ denotes the Markov transition kernel, and $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function. In particular, if an agent at state $s \in \mathcal{S}$ takes action $a \in \mathcal{A}$, then the current state transitions to the next state $s' \in \mathcal{S}$ with probability $P(s, a, s')$, and the agent receives reward $r(s, a, s')$. Each element of the state-to-state transition matrix under policy $\pi$, denoted by $P^\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$, is
$$[P^\pi]_{ij} := \sum_{a \in \mathcal{A}} \pi(a|i)P(i, a, j), \quad 1 \le i, j \le |\mathcal{S}|,$$
where $[P^\pi]_{ij}$ corresponds to the $i$-th row and $j$-th column element of the matrix $P^\pi$. Moreover, the stationary state distribution induced by the policy $\mu$ is denoted as $d^\mu: \mathcal{S} \to [0, 1]$, i.e., $(d^\mu)^\top P^\mu = (d^\mu)^\top$. With the above setup, we define the following matrix notations:
$$D^\mu := \begin{bmatrix} d^\mu(1) & & \\ & \ddots & \\ & & d^\mu(|\mathcal{S}|) \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}, \qquad R^\pi := \begin{bmatrix} \mathbb{E}_{a \sim \pi}[r(s, a, s') \mid s = 1] \\ \mathbb{E}_{a \sim \pi}[r(s, a, s') \mid s = 2] \\ \vdots \\ \mathbb{E}_{a \sim \pi}[r(s, a, s') \mid s = |\mathcal{S}|] \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}|},$$
where $D^\mu$ is a diagonal matrix of the state distribution induced by the behavior policy $\mu$, and each element of $R^\pi$ is the expected reward under policy $\pi$ at the corresponding state.
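As a purely illustrative sketch (the three-state MDP, the two policies, and all numerical values below are hypothetical and not from the paper), the matrices $P^\pi$, $D^\mu$, and $R^\pi$ defined above can be assembled from the transition kernel and reward function as follows:

```python
import numpy as np

nS, nA = 3, 2
# Hypothetical transition kernel P[s, a, s'] and reward r[s, a, s'].
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
              [[0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
              [[0.5, 0.0, 0.5], [0.3, 0.3, 0.4]]])
r = np.ones((nS, nA, nS))
pi = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])   # target policy pi(a|s)
mu = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])   # behavior policy mu(a|s)

# [P^pi]_{ij} = sum_a pi(a|i) P(i, a, j), and similarly for the behavior policy.
P_pi = np.einsum('ia,iaj->ij', pi, P)
P_mu = np.einsum('ia,iaj->ij', mu, P)

# d^mu: stationary distribution of P^mu, i.e., the left eigenvector for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_mu.T)
d_mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_mu = d_mu / d_mu.sum()
D_mu = np.diag(d_mu)

# R^pi(s) = E_{a ~ pi, s' ~ P}[ r(s, a, s') ]
R_pi = np.einsum('ia,iaj,iaj->i', pi, P, r)

print(P_pi, D_mu, R_pi, sep='\n')
```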
The policy evaluation problem aims to approximate the value function at state $s \in \mathcal{S}$,
$$v^\pi(s) := \mathbb{E}\left[\left.\sum_{k=0}^{\infty} \gamma^k r(S_k, A_k, S_{k+1}) \right| S_0 = s\right],$$
where the trajectory is generated under the policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. In this paper, we consider linear function approximation to approximate the value function $v^\pi(s)$. In particular, we parameterize the value function $v^\pi(s)$ with $\phi(s)^\top\theta$, where $\phi: \mathcal{S} \to \mathbb{R}^n$ is a pre-selected feature vector with $\phi(s) := [\phi_1(s) \ \cdots \ \phi_n(s)]^\top$, $\phi_1, \ldots, \phi_n: \mathcal{S} \to \mathbb{R}$ are feature functions, and $\theta \in \mathbb{R}^n$ is the learning parameter. The goal of the policy evaluation problem is then to approximate the value function $v^\pi(s)$ using this linear parameterization, i.e., $\phi(s)^\top\theta \approx v^\pi(s)$. Moreover, using the matrix notation $\Phi := [\phi(1), \phi(2), \ldots, \phi(|\mathcal{S}|)]^\top \in \mathbb{R}^{|\mathcal{S}| \times n}$, called the feature matrix, the linear parameterization can be written in the vector form $\Phi\theta$. We also assume that $\Phi$ is a full column rank matrix throughout the paper, which is a standard assumption (Sutton et al., 2008; 2009; Ghiassian et al., 2020; Lee et al., 2021).

2.5 TEMPORAL DIFFERENCE LEARNING

This section provides a brief background on TD-learning (Sutton, 1988). Suppose that we have access to stochastic samples of the state $s_k$ from the stationary state distribution induced by the behavior policy $\mu$, i.e., $s_k \sim d^\mu(\cdot)$, and that the action is chosen under the behavior policy $\mu$, i.e., $a_k \sim \mu(\cdot|s_k)$. Then, we observe the next state $s_k'$ following $s_k' \sim P(s_k, a_k, \cdot)$, and receive the reward $r_k := r(s_k, a_k, s_k')$. Using the simplified notations for the feature vectors, $\phi_k := \phi(s_k)$ and $\phi_k' := \phi(s_k')$, the TD-learning update at time step $k$ with linear function approximation can be expressed as
$$\theta_{k+1} = \theta_k + \alpha_k\rho_k\delta_k(\theta_k)\phi_k,$$
where $\alpha_k > 0$ is the step-size, $\delta_k(\theta_k) := r_k + \gamma\phi_k'^\top\theta_k - \phi_k^\top\theta_k$ is called the temporal difference or temporal difference error (TD-error), and $\rho_k := \rho(s_k, a_k) = \frac{\pi(a_k|s_k)}{\mu(a_k|s_k)}$ is called the importance sampling ratio (Precup et al., 2001). The importance sampling ratio re-weights the TD-error to handle the mismatch between the behavior policy $\mu$ and the target policy $\pi$. It is known that TD-learning with linear function approximation and the off-policy learning scheme does not guarantee convergence in general. The above stochastic approximation aims to find a fixed point of the following projected Bellman equation, which is, after some manipulations, expressed as
$$\Phi^\top D^\mu\Phi\theta - \gamma\Phi^\top D^\mu P^\pi\Phi\theta = \Phi^\top D^\mu R^\pi. \qquad (5)$$
To simplify the expressions, let us introduce one more piece of notation:
$$A := \mathbb{E}_{s \sim d^\mu(s),\, s' \sim P^\pi(s'|s)}\big[\phi(s)(\phi(s) - \gamma\phi(s'))^\top\big] = \Phi^\top D^\mu\Phi - \gamma\Phi^\top D^\mu P^\pi\Phi \in \mathbb{R}^{n \times n},$$
$$b := \mathbb{E}_{s \sim d^\mu(s),\, a \sim \pi(a|s),\, s' \sim P(s'|s,a)}\big[r(s, a, s')\phi(s)\big] = \Phi^\top D^\mu R^\pi \in \mathbb{R}^{n \times 1}.$$
Although an arbitrary state distribution could be used, for simplicity we assume the stationary distribution of $\mu$. Now, we can rewrite (5) compactly as
$$A\theta = b. \qquad (6)$$
The corresponding O.D.E. for TD-learning can be written as $\dot\theta_t = -A\theta_t + b$, $\theta_0 \in \mathbb{R}^n$. Using the coordinate transform $x_k := \theta_k - \theta^*$, where $\theta^*$ is the solution of (6), we get the O.D.E. $\dot{x}_t = -Ax_t$, $x_0 \in \mathbb{R}^n$, whose origin is a globally asymptotically stable equilibrium point if $\rho(s, a) = \frac{\pi(a|s)}{\mu(a|s)} = 1$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Throughout the paper, we will use the vector $x_k := \theta_k - \theta^*$ to represent the coordinate transform of $\theta_k$ to the origin, and will use $\theta_t$ and $x_t$ to denote the corresponding continuous-time counterparts of $\theta_k$ and $x_k$, respectively.
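For concreteness, a minimal sketch of the off-policy TD-learning update above with linear function approximation is given below; the function signature, the `sample` interface, and the constant step-size are our own assumptions rather than the paper's experimental setup:

```python
import numpy as np

def td0_off_policy(sample, Phi, pi, mu, gamma=0.9, alpha=0.01, num_steps=50000, seed=0):
    """Off-policy linear TD(0): theta <- theta + alpha * rho * delta * phi(s).

    `sample(rng)` is assumed to draw a transition (s, a, s_next, r) with
    s ~ d^mu and a ~ mu(.|s); Phi is the |S| x n feature matrix.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(Phi.shape[1])
    for _ in range(num_steps):
        s, a, s_next, r = sample(rng)
        phi, phi_next = Phi[s], Phi[s_next]
        rho = pi[s, a] / mu[s, a]                             # importance sampling ratio
        delta = r + gamma * phi_next @ theta - phi @ theta    # TD-error
        theta = theta + alpha * rho * delta * phi
    return theta
```

Its target is the TD fixed point of (6), $\theta^* = A^{-1}b$, but, as noted above, when $\rho \not\equiv 1$ the iteration is not guaranteed to converge, which motivates the algorithms discussed next.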
2.6 GRADIENT TEMPORAL DIFFERENCE LEARNING

To fix the instability issue of off-policy TD-learning under linear function approximation, Sutton et al. (2008) and Sutton et al. (2009) introduced various stable off-policy TD-learning algorithms, called GTD (gradient TD-learning), GTD2, and TDC (temporal difference correction). The idea behind these algorithms is to minimize the mean-square projected Bellman error (MSPBE)
$$\min_{\theta \in \mathbb{R}^n} \frac{1}{2}\big\| \Phi^\top D^\mu(R^\pi + \gamma P^\pi\Phi\theta - \Phi\theta) \big\|^2_{(\Phi^\top D^\mu\Phi)^{-1}},$$
where $\|x\|_D^2 := x^\top Dx$, and the global minimizer of the MSPBE corresponds to the solution of (6). The core idea of the algorithms is to introduce an additional variable $\lambda_k \in \mathbb{R}^n$ to approximate the stochastic gradient descent method for the MSPBE as an objective function. In particular, the GTD2 update can be written as
$$\lambda_{k+1} = \lambda_k + \alpha_k(-\phi_k^\top\lambda_k + \rho_k\delta_k(\theta_k))\phi_k, \qquad \theta_{k+1} = \theta_k + \alpha_k(\phi_k^\top\lambda_k\phi_k - \rho_k\gamma\phi_k^\top\lambda_k\phi_k').$$
We use $\lambda_t$ to denote the continuous-time counterpart of $\lambda_k$. Since the fixed point for $\lambda_k$ is zero, it does not require a coordinate transformation. It is a single time-scale algorithm because it uses a single step-size $\alpha_k$. The corresponding O.D.E. is expressed as
$$\dot\lambda_t = -C\lambda_t - Ax_t, \qquad \dot{x}_t = A^\top\lambda_t,$$
where $C := \mathbb{E}_{s \sim d^\mu(s)}[\phi(s)\phi(s)^\top] = \Phi^\top D^\mu\Phi \in \mathbb{R}^{n \times n}$. Similarly, the TDC update can be written as
$$\lambda_{k+1} = \lambda_k + \alpha_k(-\phi_k^\top\lambda_k + \rho_k\delta_k(\theta_k))\phi_k, \qquad (7)$$
$$\theta_{k+1} = \theta_k + \beta_k(-\rho_k\gamma\phi_k^\top\lambda_k\phi_k' + \rho_k\delta_k(\theta_k)\phi_k), \qquad (8)$$
where the step-sizes $\alpha_k$ and $\beta_k$ satisfy $\alpha_k/\beta_k \to 0$ as $k \to \infty$ and the Robbins and Monro step-size condition (Robbins & Monro, 1951) in (33) in the Appendix. It is a two time-scale algorithm because it uses two step-sizes, $\alpha_k$ and $\beta_k$.

3 DESIGNING TD-LEARNING THROUGH BACKSTEPPING

We briefly explain the motivation for our algorithmic development. The Borkar and Meyn theorem (Borkar & Meyn, 2000) in Lemma 2.1 is a typical tool to prove convergence of Q-learning (Borkar & Meyn, 2000; Lee & He, 2019) and TD-learning (Sutton et al., 2009; Lee et al., 2021). Most of the previous works on off-policy TD-learning algorithms (e.g., GTD2 and TDC) first start with an objective function, and then derive GTD algorithms based on optimization perspectives. Then, the convergence is proved using the corresponding O.D.E. models and stability theory of linear time-invariant systems. A natural question is: can we derive off-policy TD-learning algorithms following the reversed steps? In other words, can we develop a stable O.D.E. model first using tools in control theory, and then recover the corresponding off-policy TD-learning algorithms? In this paper, we reveal that a class of off-policy TD-learning algorithms can be derived based on purely control-theoretic motivations following such a reversed process. By doing so, this work provides additional insights on off-policy TD-learning algorithms and gives a sound theoretical foundation for further developments of new algorithms.

Designing stabilizing control laws for continuous-time nonlinear systems has been successful over the past decades (Khalil, 2015). One such technique, so-called backstepping, is a popular controller design method in the nonlinear control literature (Khalil, 2015). With the help of the backstepping method (Khalil, 2015), we design stabilizing control laws for continuous-time systems, and then the corresponding off-policy TD-learning algorithms are derived and shown to be convergent via the Borkar and Meyn theorem (Borkar & Meyn, 2000) in Lemma 2.1. The brief procedure is explained in the following steps:

Step 1) Choose an appropriate continuous-time dynamic model such that (a) we can recover the TD fixed point in (6) via its equilibrium point; (b) the corresponding stochastic approximation algorithm can be implemented only through transitions of the MDP and accessible data;

Step 2) Using the backstepping method, design a control input to stabilize the dynamic model chosen in Step 1).
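Before instantiating these two steps, it is convenient to have the GTD2 and TDC updates of Section 2.6 in sample-based form, since they reappear as special cases of our framework. The sketch below is illustrative only: the function signatures and step-size handling are our own assumptions, and only the update formulas follow the GTD2 update and (7)–(8) above.

```python
import numpy as np

def gtd2_step(theta, lam, phi, phi_next, r, rho, gamma, alpha):
    """One GTD2 update: single time-scale, a single step-size alpha."""
    delta = r + gamma * phi_next @ theta - phi @ theta
    lam_new = lam + alpha * (-(phi @ lam) + rho * delta) * phi
    theta_new = theta + alpha * ((phi @ lam) * phi - rho * gamma * (phi @ lam) * phi_next)
    return theta_new, lam_new

def tdc_step(theta, lam, phi, phi_next, r, rho, gamma, alpha, beta):
    """One TDC update following (7)-(8): two step-sizes alpha and beta."""
    delta = r + gamma * phi_next @ theta - phi @ theta
    lam_new = lam + alpha * (-(phi @ lam) + rho * delta) * phi
    theta_new = theta + beta * (-rho * gamma * (phi @ lam) * phi_next + rho * delta * phi)
    return theta_new, lam_new
```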
3.1 BACKSTEPPING TD

Now, we introduce a new off-policy TD-learning algorithm, which we call Backstepping TD (BTD). First, we will develop a stabilizing control law for the following continuous-time system:
$$\dot\lambda_t = (-C + \eta A)\lambda_t - Ax_t, \qquad (9)$$
$$\dot{x}_t = u_t. \qquad (10)$$
The idea stems from finding a control system to which we can easily apply the backstepping technique. In detail, the backstepping technique can be applied to two interconnected systems where one subsystem, namely (4), can be stabilized with $x_t$ in (4) as a control input. Therefore, our first aim is to find such a system. To this end, we can try a natural choice of O.D.E. to solve the TD problem, i.e., $\dot\lambda_t = -A\lambda_t$, which is however unstable in the off-policy case. Therefore, we can develop a modified O.D.E. $\dot\lambda_t = (-C + \eta A)\lambda_t - Ax_t$, where $x_t$ is the control input, the negative definite matrix $-C$ is introduced to stabilize the system, and $\eta \in \mathbb{R}$ is introduced to provide additional degrees of freedom in the design. Now, the constructed system can be stabilized through the state-feedback controller $x_t = \eta\lambda_t$ and admits the simple control Lyapunov function $V(\lambda) = \|\lambda\|^2/2$. Moreover, $A$ should be included in the right-hand side in order to implement the corresponding algorithm without knowing the solution, because $x_k = \theta_k - \theta^*$ and $\theta^*$ should be removed using $A\theta^* = b$ in the final step. Simply setting $x_t = \eta\lambda_t$ would cancel out $\eta A$ on the right-hand side, and the O.D.E. would become $\dot\lambda_t = -C\lambda_t$. Therefore, as mentioned before, we apply the backstepping technique by adding an additional dynamic controller. As the next step, the backstepping technique is applied, and one needs to observe what would be the final form of the control system. In summary, if we construct $f(\lambda_t)$ from a combination of $A$ and $-C$ (not necessarily $-C$; it may be $-I$), it can be a reasonable candidate to apply the backstepping technique. Cancelling $\eta A$ with the virtual input only leaves $-C$, which guarantees stability from its negative definiteness. Therefore, (9) and (10) are a reasonable candidate for the dynamics to which we can apply the backstepping technique. In particular, our aim is to design an appropriate control input $u_t$ for the above system such that the origin is the unique asymptotically stable equilibrium point, i.e., $(\lambda_t, x_t) \to 0$ as $t \to \infty$ for any $(\lambda_0, x_0) \in \mathbb{R}^n \times \mathbb{R}^n$. The overall procedure is depicted in Figure 1 in the Appendix, and we show how to choose the control input $u_t$ in the following lemma.

Lemma 3.1. Consider the O.D.E. in (9) and (10). If we choose the control input $u_t := (A^\top + \eta^2 A - \eta C)\lambda_t - \eta Ax_t$, then the above O.D.E. has a globally asymptotically stable origin, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$ for any $(\lambda_0, x_0) \in \mathbb{R}^n \times \mathbb{R}^n$.

Proof sketch. The proof follows the steps given in the backstepping scheme in Section 3. First, substituting $x_t$ in (9) with a virtual controller $\bar{x}(\lambda_t)$, we will design a control law $\bar{x}(\lambda_t)$ that stabilizes the following new virtual system:
$$\dot\lambda_t = (-C + \eta A)\lambda_t - A\bar{x}(\lambda_t). \qquad (11)$$
One natural choice of the virtual controller is $\bar{x}(\lambda_t) = \eta\lambda_t$. Plugging it into (11) leads to $\dot\lambda_t = -C\lambda_t$, and we can verify the global asymptotic stability of the above system with the following Lyapunov function:
$$V(\lambda_t) := \frac{\|\lambda_t\|^2}{2}. \qquad (12)$$
We now consider the original O.D.E. in (9) and (10). Applying simple algebraic manipulations yields $\dot\lambda_t = -C\lambda_t - A(x_t - \eta\lambda_t)$, $\dot{x}_t = u_t$. The error between $x_t$ and the virtual controller $\bar{x}(\lambda_t)$ can be expressed as the new variable $z_t$, which is $z_t := x_t - \bar{x}(\lambda_t) = x_t - \eta\lambda_t$. Rewriting the O.D.E. in (9) and (10) with the $(\lambda_t, z_t)$ coordinates, we have
$$\dot\lambda_t = -C\lambda_t - Az_t, \qquad (13)$$
$$\dot{z}_t = u_t + \eta C\lambda_t + \eta Az_t.$$
To prove the global asymptotic stability of the above system, consider the function $V_c(\lambda_t, z_t) := V(\lambda_t) + \frac{1}{2}\|z_t\|^2$, where $V(\lambda_t)$ is defined in (12). By taking $u_t$ as $u_t = A^\top\lambda_t - \eta C\lambda_t - \eta Az_t$, we can apply LaSalle's invariance principle in Lemma 7.1. The full proof is in Appendix Section 7.4.1.

Using the relation $z_t := x_t - \eta\lambda_t$, the control input in the original coordinates $(\lambda_t, x_t)$ can be written as $u_t := A^\top\lambda_t - \eta C\lambda_t - \eta Az_t = (A^\top + \eta^2 A - \eta C)\lambda_t - \eta Ax_t$. Plugging this input into the original open-loop system in (9) and (10), the closed-loop system in the original coordinates $(\lambda_t, x_t)$ can be written as
$$\dot\lambda_t = (-C + \eta A)\lambda_t - Ax_t, \qquad (14)$$
$$\dot{x}_t = (A^\top + \eta^2 A - \eta C)\lambda_t - \eta Ax_t, \qquad (15)$$
whose origin is also globally asymptotically stable according to Lemma 3.1. Recovering back from $x_t$ to $\theta_t$, we have
$$\frac{d}{dt}\begin{bmatrix}\lambda_t \\ \theta_t\end{bmatrix} = \begin{bmatrix}-C + \eta A & -A \\ A^\top + \eta^2 A - \eta C & -\eta A\end{bmatrix}\begin{bmatrix}\lambda_t \\ \theta_t\end{bmatrix} + \begin{bmatrix}b \\ \eta b\end{bmatrix}.$$
The corresponding stochastic approximation of the O.D.E. in Theorem 3.1 becomes
$$\lambda_{k+1} = \lambda_k + \alpha_k\big(((-1 + \eta)\phi_k^\top - \eta\gamma\rho_k\phi_k'^\top)\lambda_k + \rho_k\delta_k(\theta_k)\big)\phi_k, \qquad (16)$$
$$\theta_{k+1} = \theta_k + \alpha_k\big(((-\eta + \eta^2)\phi_k^\top - \eta^2\gamma\rho_k\phi_k'^\top)\lambda_k\phi_k + \eta\rho_k\delta_k(\theta_k)\phi_k + (\phi_k^\top\lambda_k\phi_k - \rho_k\gamma\phi_k^\top\lambda_k\phi_k')\big). \qquad (17)$$
The equilibrium point of the above O.D.E. is $(0, \theta^*)$. Hence, we only need to transform the coordinate of $\theta_t$ to $x_t = \theta_t - \theta^*$, which results in the O.D.E. in (14) and (15). With the above result, we are now ready to prove convergence of Algorithm 1. The proof simply follows from the Borkar and Meyn theorem in Lemma 2.1, of which the details can be found in Sutton et al. (2009).

Theorem 3.1. Under the step-size condition (33), with Algorithm 1 in the Appendix, $\theta_k \to \theta^*$ as $k \to \infty$ with probability one, where $\theta^*$ is the fixed point of (6).

Proof. The proof is done by checking Assumption 7.1 in the Appendix.

Remark 3.1. Theorem 3.1 does not require any condition on $\eta$. Therefore, we can set $\eta = 0$, which results in GTD2 developed in Sutton et al. (2009).

3.2 RECOVERING SINGLE TIME-SCALE TDC

In this section, we derive a single time-scale version of TDC (Sutton et al., 2009) through the backstepping design in the previous section. TDC was originally developed as a two time-scale algorithm in Sutton et al. (2009). Even though the two time-scale method provides a theoretical guarantee for a larger class of algorithms, the single time-scale scheme provides more simplicity in practice, and shows faster convergence empirically. Subsequently, Maei (2011) provided a single time-scale version of TDC by multiplying a large enough constant $\kappa > 0$ to the faster time-scale part (7), which leads to
$$\lambda_{k+1} = \lambda_k + \beta_k\kappa(-\phi_k^\top\lambda_k + \rho_k\delta_k(\theta_k))\phi_k, \qquad (18)$$
$$\theta_{k+1} = \theta_k + \beta_k(-\rho_k\gamma\phi_k^\top\lambda_k\phi_k' + \rho_k\delta_k(\theta_k)\phi_k), \qquad (19)$$
where the constant $\kappa$ satisfies
$$\kappa > \lambda_{\max}\big(C^{-1}(A + A^\top)/2\big). \qquad (20)$$
Here, we derive another version of single time-scale TDC by multiplying a constant to the slower time-scale part in (8), which results in
$$\lambda_{k+1} = \lambda_k + \alpha_k(-\phi_k^\top\lambda_k + \rho_k\delta_k(\theta_k))\phi_k, \qquad (21)$$
$$\theta_{k+1} = \theta_k + \alpha_k\beta(\phi_k^\top\lambda_k\phi_k - \rho_k\gamma\phi_k^\top\lambda_k\phi_k' + \rho_k\delta_k(\theta_k)\phi_k), \qquad (22)$$
where $\beta$ satisfies
$$0 < \beta < -\frac{\lambda_{\min}(C)}{\lambda_{\min}(A)} \ \text{ if } \ \lambda_{\min}(A) < 0, \quad \text{else} \ \ \beta > 0. \qquad (23)$$
We can derive the above algorithm following similar steps as in Section 3.1. Let us first consider the following dynamic model:
$$\dot\lambda_t = -C\lambda_t - Ax_t, \qquad (24)$$
$$\dot{x}_t = u_t. \qquad (25)$$
Using the backstepping technique, we can prove that the above system admits the origin as a globally asymptotically stable equilibrium point with the control input $u_t := \beta((A^\top - C)\lambda_t - Ax_t)$, which is shown in the following lemma:

Lemma 3.2. Consider the O.D.E. in (24) and (25). Suppose that we choose the control input $u_t := \beta((A^\top - C)\lambda_t - Ax_t)$, and that $\beta$ satisfies condition (23). Then, the above O.D.E. has a globally asymptotically stable origin, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$.

The proof of Lemma 3.2 is given in Appendix Section 7.4.2.
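For concreteness, the two single-time-scale TDC variants above can be written as the following sample-based sketch. It is illustrative only: the function signatures are our own assumptions, the update formulas follow (18)–(19) and (21)–(22) as written above, and the constants `kappa` and `beta` must satisfy (20) and (23), respectively.

```python
import numpy as np

def tdc_fast_step(theta, lam, phi, phi_next, r, rho, gamma, beta_k, kappa):
    """Single-time-scale TDC of Maei (2011), (18)-(19): constant kappa on the lambda update."""
    delta = r + gamma * phi_next @ theta - phi @ theta
    lam_new = lam + beta_k * kappa * (-(phi @ lam) + rho * delta) * phi
    theta_new = theta + beta_k * (-rho * gamma * (phi @ lam) * phi_next + rho * delta * phi)
    return theta_new, lam_new

def tdc_slow_step(theta, lam, phi, phi_next, r, rho, gamma, alpha_k, beta):
    """TDC-slow, (21)-(22): constant beta on the theta (slower time-scale) update."""
    delta = r + gamma * phi_next @ theta - phi @ theta
    lam_new = lam + alpha_k * (-(phi @ lam) + rho * delta) * phi
    theta_new = theta + alpha_k * beta * ((phi @ lam) * phi
                                          - rho * gamma * (phi @ lam) * phi_next
                                          + rho * delta * phi)
    return theta_new, lam_new
```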
By the Borkar and Meyn theorem in Lemma 2.1, we can readily prove the convergence of Algorithm 2 in the Appendix, which uses the stochastic recursive updates (21) and (22).

Theorem 3.2. Consider Algorithm 2 in the Appendix. Under the step-size condition (33), and if $\beta$ satisfies (23), $\theta_k \to \theta^*$ as $k \to \infty$ with probability one, where $\theta^*$ is the fixed point of (6).

We will call Algorithm 4 TDC-slow, and the single-time-scale version of TDC suggested by Maei (2011) TDC-fast. Other than multiplying a constant to reflect the two time-scale property, we can make TDC into a single time-scale algorithm in another way, which we call single time-scale TDC2, while the original version in Maei (2011) will be called the single time-scale TDC. The derivation is given in Appendix Section 7.5. The performance of these versions of TDC is evaluated in Appendix Section 7.9.1. Even though none of the algorithms outperforms the others in every case, TDC-slow and TDC2 show better performance in general.

3.3 GENERALIZING TDC++

This section provides versions of TDC++ (Ghiassian et al., 2020), which is a variant of TDC. With an additional regularization term $\beta\lambda_k$ on both updates of TDC in (7) and (8), the update is written as follows:
$$\lambda_{k+1} = \lambda_k + \alpha_k\kappa\big((-\phi_k^\top\lambda_k + \rho_k\delta_k(\theta_k))\phi_k - \beta\lambda_k\big), \qquad (26)$$
$$\theta_{k+1} = \theta_k + \alpha_k(-\rho_k\gamma\phi_k^\top\lambda_k\phi_k' - \beta\lambda_k + \rho_k\delta_k(\theta_k)\phi_k), \qquad (27)$$
where $\kappa > 0$ satisfies (20) and $\beta > 0$ is a new parameter. Note that TDC++ can be simply viewed as a variant of TDC obtained by adding the term $\beta\lambda_k$ to the update, which can be seen as a regularization term. Therefore, letting $\beta = 0$ yields the original TDC. In this paper, we prove that our controller design leads to the following update:
$$\lambda_{k+1} = \lambda_k + \alpha_k\kappa\big((-\phi_k^\top\lambda_k + \rho_k\delta_k(\theta_k))\phi_k - \beta\lambda_k\big), \qquad (28)$$
$$\theta_{k+1} = \theta_k + \alpha_k\kappa\zeta(-\rho_k\gamma\phi_k^\top\lambda_k\phi_k' - \beta\lambda_k + \rho_k\delta_k(\theta_k)\phi_k), \qquad (29)$$
where $\zeta$ and $\beta$ are new parameters, and when $\zeta = 1/\kappa$ it becomes TDC++. The difference from the original TDC++ can be seen in their corresponding O.D.E. forms. The corresponding O.D.E. for (26) and (27) (original TDC++) can be expressed as
$$\frac{d}{dt}\begin{bmatrix}\lambda_t \\ x_t\end{bmatrix} = \begin{bmatrix}-\kappa(C + \beta I) & -\kappa A \\ A^\top - C - \beta I & -A\end{bmatrix}\begin{bmatrix}\lambda_t \\ x_t\end{bmatrix}.$$
Meanwhile, the O.D.E. corresponding to (28) and (29) (new TDC++) becomes
$$\frac{d}{dt}\begin{bmatrix}\lambda_t \\ x_t\end{bmatrix} = \begin{bmatrix}-\kappa(C + \beta I) & -\kappa A \\ \kappa\zeta(A^\top - (C + \beta I)) & -\kappa\zeta A\end{bmatrix}\begin{bmatrix}\lambda_t \\ x_t\end{bmatrix}.$$
We experiment under different values of $\kappa$ and $\zeta$ to examine the behavior of the new TDC++. The result shows that, in general, a smaller $\zeta$ leads to better performance. The results are given in Appendix Section 7.9.

Lemma 3.3. Consider the following O.D.E.:
$$\dot\lambda_t = -\kappa(C + \beta I)\lambda_t - \kappa Ax_t, \qquad (30)$$
$$\dot{x}_t = u_t. \qquad (31)$$
Suppose that we choose the control input $u_t := \kappa\zeta(A^\top - (C + \beta I))\lambda_t - \kappa\zeta Ax_t$. Assume $\kappa > 0$ and $\zeta > 0$, and that $\beta$ satisfies the following condition: $\beta + \lambda_{\min}(A) > -\lambda_{\min}(C)$. Then, the above O.D.E. has a globally asymptotically stable origin, i.e., $(\lambda_t, x_t) \to (0, 0)$ as $t \to \infty$.

The proof is given in Appendix Section 7.4.3. With Lemma 2.1, we can prove the convergence of the stochastic update with (28) and (29), whose pseudo-code is given in Algorithm 5 in the Appendix.

Theorem 3.3. Consider Algorithm 5 in the Appendix. Under the step-size condition (33), and if $\kappa$ satisfies (20), then $\theta_k \to \theta^*$ as $k \to \infty$ with probability one, where $\theta^*$ is the TD fixed point in (6).

Remark 3.2. We can replace the regularization term with nonlinear terms satisfying certain conditions. The details are given in Appendix Section 7.6.

4 EXPERIMENTS

We verify the performance and convergence of the proposed BTD on standard benchmarks for evaluating off-policy TD-learning algorithms, including the Baird environment (Baird, 1995), Random Walk (Sutton et al., 2009) with different features, and the Boyan chain (Boyan, 2002). The details about the environments are given in Appendix Section 7.7.
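The BTD update evaluated in these experiments is the stochastic recursion (16)–(17); a minimal sketch is given below (the function signature, sampling interface, and constant step-size are our own assumptions; Algorithm 1 in the Appendix is the reference pseudo-code):

```python
import numpy as np

def btd_step(theta, lam, phi, phi_next, r, rho, gamma, alpha, eta):
    """One BTD update following (16)-(17); eta is the design parameter."""
    delta = r + gamma * phi_next @ theta - phi @ theta            # TD-error
    # (16): lambda update
    lam_new = lam + alpha * (((-1.0 + eta) * (phi @ lam)
                              - eta * gamma * rho * (phi_next @ lam)) + rho * delta) * phi
    # (17): theta update
    theta_new = theta + alpha * (((-eta + eta ** 2) * (phi @ lam)
                                  - eta ** 2 * gamma * rho * (phi_next @ lam)) * phi
                                 + eta * rho * delta * phi
                                 + ((phi @ lam) * phi - rho * gamma * (phi @ lam) * phi_next))
    return theta_new, lam_new
```

Setting `eta = 0` removes the additional terms and recovers the GTD2 update, consistent with Remark 3.1.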
From the experiments, we see how BTD behaves under different coefficients $\eta \in \{-0.5, -0.25, 0, 0.25, 0.5\}$. We measure the Root Mean-Squared Projected Bellman Error (RMSPBE) as the performance metric, and all results are averaged over 100 runs. From Table 1, the result with $\eta = 0.5$ shows the best performance except on Baird, where $\eta = 0$, corresponding to GTD2, performs best. There are two aspects to the role of $\eta$. First of all, it can be thought of as a parameter that can mitigate the effect of instability coming from the matrix $A$ in (9). For example, a smaller $\eta$ can stabilize the system. However, as a trade-off, if $\eta$ is too small, then the update rate might be too small as well. As a result, the overall convergence can be slower. Furthermore, $\eta$ also controls the effect of $C$ in (13) in the BTD update rules, where $C$ corresponds to $(-\eta + \eta^2)\phi_k^\top\lambda_k\phi_k$ in (17). Note that the role of $\eta$ in the final BTD update rule in (17) shows different perspectives compared to that in (9). In particular, $\eta = 1/2$ maximizes the effect of $C$ in (17). From Table 1, it leads to reasonably good performance in most domains. Another natural choice is to multiply $\eta$ to $C$ instead of $A$. However, in such a case, we need to introduce another constraint $\eta > 0$, whereas in the current BTD, convergence is guaranteed for all $\eta \in \mathbb{R}$. Finally, we note that simply multiplying $C$ by a large positive constant does not lead to good results in general. This is because in this case, it may increase the variance and destabilize the algorithm. Overall results are given in Appendix Section 7.8.

Table 1: Backstepping TD, step-size $\alpha = 0.01$

| Environment | $\eta = -0.5$ | $\eta = -0.25$ | $\eta = 0$ | $\eta = 0.25$ | $\eta = 0.5$ |
| --- | --- | --- | --- | --- | --- |
| Boyan | 1.51 ± 0.66 | 1.481 ± 0.656 | 1.452 ± 0.647 | 1.428 ± 0.64 | 1.408 ± 0.635 |
| Dependent | 0.11 ± 0.19 | 0.097 ± 0.163 | 0.086 ± 0.142 | 0.079 ± 0.128 | 0.076 ± 0.122 |
| Inverted | 0.21 ± 0.25 | 0.173 ± 0.218 | 0.151 ± 0.193 | 0.139 ± 0.177 | 0.136 ± 0.172 |
| Tabular | 0.17 ± 0.28 | 0.147 ± 0.238 | 0.133 ± 0.208 | 0.124 ± 0.191 | 0.122 ± 0.188 |
| Baird | 0.1 ± 0.64 | 0.09 ± 0.629 | 0.085 ± 0.625 | 0.087 ± 0.628 | 0.092 ± 0.637 |

5 CONCLUSION

In this work, we have proposed a new framework to design off-policy TD-learning algorithms from a control-theoretic view. A future research direction would be extending the framework to the nonlinear function approximation setting.

6 ACKNOWLEDGEMENTS

This work was supported by the National Research Foundation under Grant NRF-2021R1F1A1061613, the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00469), and the BK21 FOUR program of the Ministry of Education (Republic of Korea). (Corresponding author: Donghwan Lee.)

REFERENCES

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the Atari human benchmark. In International Conference on Machine Learning, pp. 507–517. PMLR, 2020.

Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Elsevier, 1995.

Vivek S Borkar and Sean P Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

Justin A Boyan. Technical update: Least-squares temporal difference learning, 2002.

Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. Generative adversarial user model for reinforcement learning based recommendation system. In International Conference on Machine Learning, pp. 1052–1061. PMLR, 2019.
Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized ensembled double Q-learning: Learning fast without a model. arXiv preprint arXiv:2101.05982, 2021.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pp. 1125–1134. PMLR, 2018.

Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, pp. 1049–1058. PMLR, 2017.

Rafael Fierro and Frank L Lewis. Control of a nonholonomic mobile robot: Backstepping kinematics into dynamics. Journal of Robotic Systems, 14(3):149–163, 1997.

Thor I Fossen and Jan P Strand. Tutorial on nonlinear backstepping: Applications to ship control, 1999.

Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, and Martha White. Gradient temporal-difference learning with regularized corrections. In International Conference on Machine Learning, pp. 3524–3534. PMLR, 2020.

Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019.

Hassan K Khalil. Nonlinear Control, volume 406. Pearson, New York, 2015.

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

Petar V Kokotovic. The joy of feedback: nonlinear and adaptive. IEEE Control Systems Magazine, 12(3):7–17, 1992.

Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin Q-learning: Controlling the estimation bias of Q-learning. arXiv preprint arXiv:2002.06487, 2020.

Donghwan Lee and Niao He. A unified switching system perspective and ODE analysis of Q-learning algorithms. arXiv preprint arXiv:1912.02270, 2019.

Donghwan Lee, Han-Dong Lim, Jihoon Park, and Okyong Choi. Versions of gradient temporal difference learning. arXiv preprint arXiv:2109.04033, 2021.

Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample analysis of proximal gradient TD algorithms. arXiv preprint arXiv:2006.14364, 2020.

Sergio Valcarcel Macua, Jianshu Chen, Santiago Zazo, and Ali H Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, 2014.

Tarek Madani and Abdelaziz Benallegue. Backstepping control for a quadrotor helicopter. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3255–3260. IEEE, 2006.

Hamid Reza Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

A Rupam Mahmood, Huizhen Yu, Martha White, and Richard S Sutton. Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569, 2015.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424, 2001.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.

Eduardo D Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems, volume 6. Springer Science & Business Media, 2013.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S Sutton, Csaba Szepesvári, and Hamid Reza Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems, 21(21):1609–1616, 2008.

Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 993–1000, 2009.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.