# Zap Q-learning with Nonlinear Function Approximation

Shuhang Chen (University of Florida, shuhangchen@ufl.edu), Adithya M. Devraj (Stanford University, adevraj@stanford.edu), Fan Lu (University of Florida, fan.lu@ufl.edu), Ana Bušić (INRIA - École Normale Supérieure, PSL Research University, ana.busic@inria.fr), Sean P. Meyn (University of Florida, meyn@ece.ufl.edu)

**Abstract.** Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting, and optimal stopping. This paper introduces a new framework for the analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to the choice of function approximation architecture.

## 1 Introduction

A primary goal of reinforcement learning (RL) is the creation of algorithms that are convergent, converge at the fastest possible rate, and result in a policy for control that has near-optimal performance. This paper focuses on algorithm design to ensure stability of the algorithm, consistency, and techniques to obtain qualitative insight on the rate of convergence. One framework for algorithm design is the theory of stochastic approximation (SA). The main contribution of this work is a new class of Q-learning algorithms that are convergent even for nonlinear function approximation architectures, such as neural networks.

Consider a Markov decision process (MDP) model with state-input sequence $\{(X_n, U_n) : n\ge 0\}$, and let $\{Q^\theta(x,u) : \theta\in\mathbb{R}^d\}$ denote a family of approximations of the Q-function; the vector $\theta\in\mathbb{R}^d$ might correspond to weights in a neural network. One popular formulation of Q-learning is defined by the recursion

$$\theta_{n+1} = \theta_n + \alpha_{n+1} D_{n+1}\,\zeta_n \qquad (1)$$

in which $\{\alpha_{n+1}\}$ is the non-negative step-size sequence, $\{D_{n+1}\}$ is the scalar sequence of temporal differences [recalled in eq. (19a)], and $\{\zeta_n\}$ are the eligibility vectors; a typical choice is

$$\zeta_n = \nabla_\theta Q^{\theta_n}(X_n, U_n) \qquad (2)$$

The Q-learning algorithm of Watkins can be expressed as (1), with a linear function approximation $Q^\theta = \theta^{\!\top}\psi$, and the basis functions $\psi(x,u)$ being indicator functions of each state-input pair.

The theory of Q-learning with function approximation has not caught up with the famous success stories in applications. Consistency of the Q-learning algorithm in the tabular setting was established in the seminal work of Watkins and Dayan [50]. Counter-examples soon followed: the recursion (1) may fail to converge, even in the linear function approximation setting [3, 31]. Moreover, even when convergence holds, Q-learning can be extremely slow [45, 18, 16].

The so-called ODE method of SA theory is typically regarded as a method of analysis for stochastic recursions. We take the opposite view, regarding an ODE as a first step in algorithm design.
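Before proceeding, the following minimal sketch shows how one iteration of the recursion (1) with the eligibility vector (2) might look in code. It is an illustration only, not the algorithm studied in this paper; the callables `q_value` and `grad_q` (and all argument names) are placeholders for a user-supplied differentiable approximation architecture.

```python
import numpy as np

def q_learning_step(theta, alpha, x, u, r, x_next, actions, q_value, grad_q, gamma=0.99):
    """One iteration of the basic recursion (1) with eligibility vector (2).

    Assumed interface: q_value(theta, x, u) -> scalar Q^theta(x, u);
    grad_q(theta, x, u) -> its gradient in R^d (e.g., a neural network backward pass)."""
    # Temporal difference D_{n+1} = r + gamma * max_u' Q^theta(x', u') - Q^theta(x, u)
    td = r + gamma * max(q_value(theta, x_next, a) for a in actions) - q_value(theta, x, u)
    zeta = grad_q(theta, x, u)          # eligibility vector (2)
    return theta + alpha * td * zeta    # recursion (1)
```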
This design perspective is motivated in part by the recent work [42, 39] concerning the value of high-resolution approximations of ODEs for applications in optimization, and the enormous insight gained from a careful inspection of candidate ODEs.

The Q-learning algorithms considered in the present work are designed to solve a root-finding problem of the form $\bar f(\theta^*) := \mathrm{E}[\zeta_n D_{n+1}]\big|_{\theta_n\equiv\theta^*} = 0$. For ODE design, we let $w_t\in\mathbb{R}^d$ denote the state of the ODE at time $t$, and seek a vector field $\mathcal{V}\colon\mathbb{R}^{d+1}\to\mathbb{R}^d$ to define the evolution:

$$\frac{d}{dt}w_t = \mathcal{V}(w_t, t)$$

The vector field is designed so that $w_t\to\theta^*$ from each initial condition, and so that the ODE solutions can be efficiently approximated using a discrete-time algorithm driven by observations. One approach is to apply gradient descent to solve the non-convex optimization problem

$$\min_\theta\ \tfrac12\,\bar f(\theta)^{\!\top} M \bar f(\theta)\,,\quad\text{with } M>0 \qquad (3)$$

which results in the ODE with time-homogeneous vector field

$$\frac{d}{dt}w_t = -[\partial_\theta \bar f(w_t)]^{\!\top} M \bar f(w_t) \qquad (4)$$

The GQ-learning algorithm of [31] can be regarded as a direct discrete-time translation of this ODE, using $M = \mathrm{E}[\zeta_n\zeta_n^{\!\top}]^{-1}$. This approach is discussed in Nesterov's monograph [34, Section 4.4.1] for general root-finding problems, where he warns that it can lead to numerical instability: "...if our system of equations is linear, then such a transformation squares the condition number of the problem". He goes on to warn that it can lead to a squaring of the number of iterations required to obtain the desired error bound.

The main results of the present paper are related to the Newton-Raphson flow, defined by another time-homogeneous vector field $\mathcal{V}(w) = -[\partial_\theta\bar f(w)]^{-1}\bar f(w)$:

$$\frac{d}{dt}w_t = -G_t\,\bar f(w_t)\,,\quad\text{with } G_t = [\partial_\theta\bar f(w_t)]^{-1} \qquad (5)$$

A change of variables leads to the linear dynamics,

$$\frac{d}{dt}\bar f(w_t) = -\bar f(w_t)\,,\quad\text{with solution } \bar f(w_t) = \bar f(w_0)\,e^{-t}\,,\ t\ge 0 \qquad (6)$$

Thus, provided solutions to (5) are bounded, the algorithm is consistent in the sense that the limit points of $\{w_t\}$ lie in the set of roots $\Theta^* := \{\theta : \bar f(\theta) = 0\}$. In most applications it is not possible to determine a priori whether the matrix $\partial_\theta\bar f(\theta)$ is full rank, which motivates a regularized Newton-Raphson flow:

$$\frac{d}{dt}w_t = -\big[\varepsilon I + A(w_t)^{\!\top}A(w_t)\big]^{-1}A(w_t)^{\!\top}\bar f(w_t)\,,\qquad A(w_t) = \partial_\theta\bar f(w_t) \qquad (7)$$

It is shown in Prop. A.6 that (7) is stable, provided $V = \|\bar f\|^2$ is a coercive function on $\mathbb{R}^d$; $V$ serves as a Lyapunov function for (7), giving

$$\lim_{t\to\infty}\bar f(w_t) = 0 \qquad (8)$$

Hence the limit points of solutions lie in the set $\Theta^*$.

Details of the algorithms and the contributions of this paper require additional background. Consider the $d$-dimensional SA recursion of Robbins and Monro [38, 9]:

$$\theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n, \Phi_{n+1}) \qquad (9)$$

in which $\Phi$ is an irreducible Markov chain on a finite state space $\mathsf{Z}$, $\{\alpha_n\}$ is a non-negative gain sequence, and $f\colon\mathbb{R}^d\times\mathsf{Z}\to\mathbb{R}^d$. It is assumed that $\Phi$ has a unique invariant probability mass function (pmf) on $\mathsf{Z}$. The algorithm is designed to approximate roots of the function $\bar f(\theta) = \mathrm{E}[f(\theta,\Phi_{n+1})]$ (with expectation in steady state). Under mild conditions, the SA recursion shares the same limit points as the ODE $\frac{d}{dt}w_t = \bar f(w_t)$ [30, 9, 6]. More recently it has been established that boundedness of the stochastic recursion follows from a stability condition for the ODE [10, 9, 37] (prior to this work, stability of the stochastic recursion required separate arguments [46]).

One approach to obtain a rate of convergence in SA is through the linearization:

$$\widetilde\theta_{n+1} = \widetilde\theta_n + \alpha_{n+1}\big[A^*\widetilde\theta_n + \Delta_{n+1}\big]\,,\qquad \widetilde\theta_0 = \theta_0 - \theta^* \qquad (10)$$

where $A^* = \partial_\theta\bar f(\theta^*)$ is called the linearization matrix. The sequence $\Delta_{n+1} := f(\theta^*, \Phi_{n+1})$ is assumed to admit a Central Limit Theorem (CLT) in the usual sense, with asymptotic covariance

$$\Sigma_\Delta = \sum_{k=-\infty}^{\infty}\mathrm{E}\big[\Delta_{n+k}\Delta_n^{\!\top}\big] \qquad (11)$$

where the expectations are in steady state.
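For intuition on the role of the linearization (10), the following toy simulation is included. It is an illustration only, with i.i.d. Gaussian noise standing in for $\{\Delta_n\}$ (so $\Sigma_\Delta = I$) and an arbitrary Hurwitz matrix standing in for $A^*$; all names are placeholders. The scaled empirical covariance $n\,\mathrm{E}[\widetilde\theta_n\widetilde\theta_n^{\!\top}]$ approximates the limit $\Sigma_\theta$ characterized next.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
A_star = np.array([[-1.0, 0.3],     # toy linearization matrix; eigenvalues satisfy Re(lambda) < -1/2
                   [ 0.0, -2.0]])

n_iter, n_runs = 10_000, 200
final = np.zeros((n_runs, d))
for k in range(n_runs):
    tilde = rng.standard_normal(d)                      # tilde_theta_0
    for n in range(1, n_iter + 1):
        Delta = rng.standard_normal(d)                  # i.i.d. noise, Sigma_Delta = I
        tilde += (1.0 / n) * (A_star @ tilde + Delta)   # linear recursion (10), alpha_n = 1/n
    final[k] = tilde

# n * Cov(tilde_theta_n) approximates the asymptotic covariance Sigma_theta discussed below.
print(n_iter * np.cov(final.T))
```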
The approximation $\widetilde\theta_n \approx \theta_n - \theta^*$ holds under additional stability assumptions on the stochastic recursion (9), which in particular leads to a CLT for the scaled error $\sqrt{n}\,(\theta_n-\theta^*)$ [30, 9, 6]. The asymptotic covariance $\Sigma_\theta$ in the CLT has a simple form, subject to the eigenvalue test:

$$\mathrm{Re}(\lambda) < -\tfrac12\quad\text{for each eigenvalue }\lambda\text{ of }A^* \qquad (12)$$

Under this assumption, $\Sigma_\theta$ is the unique solution of the Lyapunov equation

$$\big[\tfrac12 I + A^*\big]\Sigma_\theta + \Sigma_\theta\big[\tfrac12 I + A^*\big]^{\!\top} + \Sigma_\Delta = 0 \qquad (13)$$

For a fixed but arbitrary initial condition $(\Phi_0, \widetilde\theta_0)$, denote $\Sigma_n = \mathrm{E}[\widetilde\theta_n\widetilde\theta_n^{\!\top}]$. The following bounds were obtained in [12] for the linear recursion (10):

(i) If (12) holds, then $\Sigma_n = n^{-1}\Sigma_\theta + O(n^{-1-\delta})$ for some $\delta>0$.

(ii) If there exists an eigenvalue $\lambda$ of $A^*$ with $\varrho := -\mathrm{Re}(\lambda) < \tfrac12$, and associated eigenvector $v$ satisfying $\Sigma_\Delta v \neq 0$, then the convergence rate of $\Sigma_n$ to zero is no faster than $n^{-2\varrho}$.

Even though the recursion for Watkins' Q-learning is of the form (1), with $D_{n+1}$ a nonlinear function of $\theta_n$, techniques of [45] can be used to show that the estimates obtained using the nonlinear recursion couple with the estimates of a linear recursion of the form (10). The slow convergence of Watkins' algorithm can then be explained by the fact that we are in case (ii) for the linearized recursion whenever the discount factor satisfies $\gamma > \tfrac12$: for a standard step-size rule, the maximal eigenvalue of $A^*$ in Watkins' Q-learning is $\lambda = -(1-\gamma)$, and the condition $\Sigma_\Delta v \neq 0$ holds under very mild conditions on the MDP [16]. It follows that the mean-square error converges to zero at rate $n^{-2(1-\gamma)}$. For GQ-learning, it is shown in Appendix A.3 that the maximal eigenvalue is greater than $-(1-\gamma)^2$, which is consistent with Nesterov's warning.

The slow convergence can be remedied by scaling the step-size by a constant $g>1$ (sufficiently large so that the matrix $\tfrac12 I + gA^*$ is Hurwitz). For tabular Q-learning any value satisfying $g > 1/[2(1-\gamma)]$ will suffice, while for GQ-learning the scaling must be increased beyond $1/[2(1-\gamma)^2]$. Unfortunately, this approach may lead to very high variance.

**Contributions**

(i) A generalization of the Zap SA algorithm of [16] is proposed.

Zap SA Algorithm: Initialize $\theta_0\in\mathbb{R}^d$, $\widehat A_0\in\mathbb{R}^{d\times d}$, $\varepsilon>0$. Update for $n\ge 0$:

$$\widehat A_{n+1} = \widehat A_n + \beta_{n+1}\big[A_{n+1}(\theta_n) - \widehat A_n\big]\,,\qquad A_{n+1}(\theta) := \partial_\theta f(\theta, \Phi_{n+1}) \qquad (14a)$$

$$\theta_{n+1} = \theta_n + \alpha_{n+1} G_{n+1} f(\theta_n, \Phi_{n+1})\,,\qquad G_{n+1} := -\big[\varepsilon I + \widehat A_{n+1}^{\!\top}\widehat A_{n+1}\big]^{-1}\widehat A_{n+1}^{\!\top} \qquad (14b)$$

The algorithm is designed so that it approximates the ODE (7), which requires $\alpha_n = o(\beta_n)$.

(ii) A special case of this new class of SA algorithms leads to a significant generalization of Zap Q-learning, for which convergence theory is obtained even in a nonlinear function approximation setting. The reliability in neural network function approximation architectures is tested through simulations.

(iii) The main technical contribution of this paper is an extension of SA theory to Zap Q-learning, and as a byproduct also GQ-learning, by exploiting approximate convexity/concavity of the functions $f$ and $\bar f$ defined implicitly in (14).

Contribution (iii) resolves a significant challenge for both Zap Q-learning and GQ-learning: the "approximation" in stochastic approximation. Standard theory does not apply because $A(\theta) := \partial_\theta\bar f(\theta)$ is not continuous. An ODE approximation for GQ-learning is obtained in [31] through the assumption that the noise $\{\Delta_n\}$ defined below (10) is a martingale-difference sequence. Assumption (Q3) of [16] is introduced to obtain an ODE approximation without this restrictive assumption on the noise. However, this implicit assumption cannot be tested a priori.

**Literature review** The observation that many RL algorithms can be cast as SA first appeared in [46, 22].
Soon after, SA theory was applied to obtain stability theory for TD-learning with linear function approximation under minimal assumptions [48]; the authors discussed challenges for nonlinear approximation architectures. In the case of Q-learning, ODE approximations are nonlinear and not understood outside of a few special cases (notably tabular, and optimal stopping with linear function approximation). There are many counterexamples showing that conditions on the function class are required in general, even in a linear function approximation setting [4] (also see [47, 43, 19]). There has also been progress for general linear function approximation: sufficient conditions for convergence of the basic Q-learning algorithm (1) were obtained in [32], with finite-$n$ bounds appearing recently in [13], and stability of GQ-learning was established in [31] subject to assumptions slightly stronger than (A1)-(A3) in the present paper. In particular, it is assumed in [31, Assumption L3] that $\partial_\theta\bar f(\theta)$ is everywhere nonsingular. In [23], the authors obtained regret bounds for Q-learning in an episodic setting, under a linear MDP (linear dynamics and linear rewards) assumption, stronger than the assumptions imposed here. Stability theory for off-policy TD-learning faces similar challenges as Q-learning. A consistent algorithm is introduced in [44] for linear function approximation, using the same ideas as in [31]; this theory is extended to nonlinear function approximation in [8].

To the best of our knowledge, the ODE (5) was introduced in the economics literature, which led to the comprehensive analysis by Smale [41] for smooth $\bar f$. The term Newton-Raphson flow for (5) was introduced in the deterministic control literature [40, 49]. The Zap SA algorithm was introduced at the same time, and based on the same ODE [15]. The motivation of [15] was centered entirely on optimizing the asymptotic covariance of stochastic approximation, and in particular Q-learning with a tabular basis; see [30, 6] for the history of convergence rate theory in SA, and [29, 28] for application to actor-critic methods. While the motivation here is stability, results in Section A.2 strongly suggest that the asymptotic covariance is approximately optimal for the regularized Zap Q-learning algorithm introduced here; a tightness argument is required to complete the proof.

The analysis in this paper can be cast in the general framework of stochastic approximation based on differential inclusions (see [9, Chapter 5] and its references). This general framework guided the research reported here. New in this paper is the proof of convergence of Zap Q-learning via an ODE approximation, made possible by the special structure of the recursion.

## 2 Zap Q-learning with Nonlinear Function Approximation

**Preliminaries** We restrict to a discounted-reward optimal control problem, with finite state space $\mathsf{X}$, finite input space $\mathsf{U}$, reward function $r\colon\mathsf{X}\times\mathsf{U}\to\mathbb{R}$, and discount factor $\gamma\in(0,1)$. The Q-function is defined as the maximum over all possible input sequences $\{U_n : n\ge 1\}$ of the total discounted reward:

$$Q^*(x,u) := \max\ \sum_{n=0}^{\infty}\gamma^n\,\mathrm{E}\big[r(X_n,U_n)\mid X_0=x,\ U_0=u\big]\,,\qquad x\in\mathsf{X},\ u\in\mathsf{U} \qquad (15)$$

Extensions to other criteria are straightforward (e.g., average cost or weighted shortest path). Let $P_u$ denote the state transition matrix when input $u\in\mathsf{U}$ is applied. It is known that the Q-function is the unique solution to the Bellman equation [7]:

$$Q^*(x,u) = r(x,u) + \gamma\sum_{x'\in\mathsf{X}} P_u(x,x')\,\underline{Q}^*(x') \qquad (16)$$

where $\underline{Q}(x) := \max_{u\in\mathsf{U}} Q(x,u)$ for any function $Q\colon\mathsf{X}\times\mathsf{U}\to\mathbb{R}$.
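As a point of reference for what (16) characterizes, here is a minimal sketch of solving the Bellman equation exactly by successive approximation for a small tabular model; the arrays `P` and `r` are assumptions about how the transition matrices $P_u$ and the reward table $r(x,u)$ are stored.

```python
import numpy as np

def q_value_iteration(P, r, gamma, tol=1e-10):
    """Fixed-point iteration for the Bellman equation (16).

    P[u] is the |X| x |X| transition matrix for input u; r is the |X| x |U| reward table."""
    n_x, n_u = r.shape
    Q = np.zeros((n_x, n_u))
    while True:
        Q_under = Q.max(axis=1)                  # underline-Q(x) = max_u Q(x, u)
        Q_new = r + gamma * np.stack([P[u] @ Q_under for u in range(n_u)], axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

For the nonlinear parameterizations considered next, (16) can only be solved approximately, which is the role of the root-finding problem (18).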
Consider a (possibly nonlinear) parameterized family of candidate approximations $\{Q^\theta : \theta\in\mathbb{R}^d\}$, wherein $Q^\theta\colon\mathsf{X}\times\mathsf{U}\to\mathbb{R}$ for each $\theta$, and the associated family of policies $\phi^\theta(x)\in\arg\max_u Q^\theta(x,u)$, $x\in\mathsf{X}$. To avoid ambiguities when the maximizer is not unique, we enumerate all stationary policies as $\{\phi^{(i)} : 1\le i\le \ell_\phi\}$, and specify $\phi^\theta := \phi^{(\kappa(\theta))}$, where

$$\kappa(\theta) := \min\big\{i : \phi^{(i)}(x)\in\arg\max_u Q^\theta(x,u)\,,\ \text{for all } x\in\mathsf{X}\big\} \qquad (17)$$

The recursion (1) is designed to compute an approximate solution of (16), defined as the solution to the root-finding problem:

$$\bar f(\theta^*) = 0\,,\quad\text{with } \bar f(\theta) := \mathrm{E}\Big[\zeta_n\big(r(X_n,U_n) + \gamma\,\underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n,U_n)\big)\Big] \qquad (18)$$

where the expectation is in steady state. It is convenient to denote $\Phi_{n+1} := (X_{n+1}, X_n, U_{n+1}, U_n)$, with state space $\mathsf{Z} := \mathsf{X}^2\times\mathsf{U}^2$. It is assumed throughout the paper that $\zeta_n = \zeta(\theta_n, \Phi_n)$ for a function $\zeta\colon\mathbb{R}^d\times\mathsf{Z}\to\mathbb{R}^d$.

**Algorithm** The Zap SA algorithm (14) to solve (18) is obtained on specifying

$$D(\theta_n,\Phi_{n+1}) := r(X_n,U_n) + \gamma\,\underline{Q}^{\theta_n}(X_{n+1}) - Q^{\theta_n}(X_n,U_n) \qquad (19a)$$
$$f(\theta_n,\Phi_{n+1}) := D(\theta_n,\Phi_{n+1})\,\zeta_n \qquad (19b)$$

At points of differentiability, the derivative of $\bar f$ has a simple form:

$$A(\theta) := \partial_\theta\bar f(\theta) = \mathrm{E}\Big[\zeta_n\big(\gamma\,\partial_\theta Q^{\theta}(X_{n+1},\phi^\theta(X_{n+1})) - \partial_\theta Q^{\theta}(X_n,U_n)\big) + D(\theta,\Phi_{n+1})\,\partial_\theta\zeta_n\Big] \qquad (20)$$

The Zap SA algorithm for Q-learning is exactly as described in (14), with $f$ defined in (19b) and $A_{n+1}(\theta)$ defined to be the term inside the expectation (20):

$$A_{n+1} = \zeta_n\big[\gamma\,\partial_\theta Q^{\theta_n}(X_{n+1},\phi^{\theta_n}(X_{n+1})) - \partial_\theta Q^{\theta_n}(X_n,U_n)\big] + D(\theta_n,\Phi_{n+1})\,\partial_\theta\zeta_n \qquad (21a)$$
$$\widehat A_{n+1} = \widehat A_n + \beta_{n+1}\big[A_{n+1} - \widehat A_n\big] \qquad (21b)$$
$$G_{n+1} = -\big[\varepsilon I + \widehat A_{n+1}^{\!\top}\widehat A_{n+1}\big]^{-1}\widehat A_{n+1}^{\!\top} \qquad (21c)$$
$$\theta_{n+1} = \theta_n + \alpha_{n+1}G_{n+1}f(\theta_n,\Phi_{n+1}) \qquad (21d)$$

Recall that $\phi^{\theta_n}$ is uniquely determined by (17) in the definition of $A_{n+1}$. The step-size sequences $\{\alpha_n\}$ and $\{\beta_n\}$ satisfy standard requirements for two-time-scale SA algorithms [9]: $\beta_n/\alpha_n\to\infty$ as $n\to\infty$. For concreteness, in analysis we fix

$$\alpha_n = 1/(n+n_0)\,,\quad \beta_n = \alpha_n^{\rho}\,,\quad n\ge 1\,,\ \text{with } n_0\ge 1\,,\ \rho\in(0.5, 1) \qquad (22)$$

The theory in this paper is focused on decreasing step-sizes mainly because the theory of SA is more mature in this context. For constant step-sizes, $\alpha_n\equiv\alpha$ and $\beta_n\equiv\beta$, with $\beta = k\alpha$ for fixed $k\ge 1$, convergence of the algorithm can be studied by viewing the joint process $(\theta_n, \widehat A_n, \Phi_n)$ as a time-homogeneous Markov chain. Based on [10, Theorem 2.3] and [9, Chapter 9], it is conjectured that there exists $\bar\alpha>0$ such that

$$\lim_{n\to\infty}\mathrm{E}\big[\|\theta_n-\theta^*\|^2\big] = O(\alpha)\,,\qquad \alpha\in(0,\bar\alpha] \qquad (23)$$

Unfortunately, the mixing time of the Markov chain will increase with decreasing $\alpha$.

**Convergence Analysis** Given that $f$ in (19b) is non-smooth in $\theta$, the analysis is cast in the theory of generalized subgradients of non-smooth functions. Consider first the temporal difference term $D\colon\mathbb{R}^d\times\mathsf{Z}\to\mathbb{R}$. For each $z\in\mathsf{Z}$, the set of generalized subgradients of $D(\,\cdot\,,z)$ at $\theta_0$ is a convex set of row vectors, denoted by $\partial_\theta D(\theta_0, z)$ [14, Chapter 10]. A vector $\vartheta\in\partial_\theta D(\theta_0,z)$ has the defining property

$$\vartheta v \le \lim_{s\downarrow 0}\frac{D(\theta_0+sv, z) - D(\theta_0, z)}{s}\,,\qquad v\in\mathbb{R}^d \qquad (24)$$

The limit exists because $D(\,\cdot\,,z)$ is the pointwise maximum of smooth functions [14, Theorem 10.22]. The generalized subgradient of $f(\,\cdot\,,z)$ exists under additional assumptions. Recall that $\zeta_n = \zeta(\theta_n,\Phi_n)$ for each $n$. It is assumed henceforth that $\zeta$ is differentiable in $\theta$. In the presentation here we impose the additional assumption that the vector-valued function $\zeta$ has non-negative entries. We then obtain a version of the chain rule:

$$\partial_\theta f(\theta_0, z) = \big\{\zeta(\theta_0,z)\vartheta + D(\theta_0,z)\,\partial_\theta\zeta(\theta_0,z) : \vartheta\in\partial_\theta D(\theta_0,z)\big\} \qquad (25)$$

We obtain in Lemma A.13 a similar representation for the set of generalized subgradients of $\bar f$:

$$\mathcal{A}(\theta) = \Big\{A\in\mathbb{R}^{d\times d} : Av \le \lim_{s\downarrow 0}\frac{\bar f(\theta+sv) - \bar f(\theta)}{s}\,,\ v\in\mathbb{R}^d\Big\} \qquad (26)$$

Non-negativity is relaxed in the supplementary material, based on a signed decomposition of $\zeta$.

Assumptions:

(A1) The joint process $(X, U)$ is an irreducible Markov chain with unique invariant pmf $\varpi$.
(A2) $Q^\theta$ and $\zeta$ are Lipschitz continuous and twice continuously differentiable in $\theta$; $f(\,\cdot\,,z)$ is Lipschitz continuous for each $z\in\mathsf{Z}$; $\|\bar f\|$ is coercive; $A^{\!\top}\bar f(\theta)\neq 0$ for $\theta\notin\Theta^*$ and $A\in\mathcal{A}(\theta)$.

(A3) The set $\Theta^*$ is a singleton, so that there is a unique $\theta^*\in\mathbb{R}^d$ satisfying $\bar f(\theta^*) = 0$.

Assumption (A1) rules out ε-greedy policies and other parameter-dependent exploration choices. It is likely that the theory can be extended using the general theory in [25] and [9, Sections 6.2 and 6.3]. The second and third assumptions are first applied to the ODE (7): the coercivity assumption in (A2) implies boundedness of solutions, and this together with (A3) implies global asymptotic stability of (7).

It is also assumed throughout the paper that $\{\theta_n\}$ is bounded. This assumption is made to simplify the overall analysis and make the treatment of non-smooth $\bar f$ accessible. In Section A.1 of the supplementary material, we provide additional sufficient conditions (which are easily satisfied for linear function approximation) to verify the boundedness assumption. We believe that (A1)-(A3) listed above suffice to establish boundedness of $\{\theta_n\}$ via an extension of the Borkar-Meyn theorem introduced in [10].

The following summarizes the main results of this paper:

Theorem 2.1. Let $\{\theta_n\}$ be the parameter sequence obtained from the Zap Q-learning algorithm (21), with some fixed $\varepsilon>0$. If this sequence is bounded, then

(i) If Assumptions (A1)-(A2) hold, then $\lim_{n\to\infty}\bar f(\theta_n) = 0$ a.s.

(ii) If Assumptions (A1)-(A3) hold, then $\lim_{n\to\infty}\theta_n = \theta^*$ a.s.

**Convergence Rate** Establishing a CLT for the scaled error sequence $\{\sqrt{n}\,(\theta_n-\theta^*)\}$ requires a tightness bound [9, Chapter 8, Lemma 5] and the following:

(A4) $f(\,\cdot\,,z)$ is smooth in a neighborhood of $\theta^*$ for each $z\in\mathsf{Z}$, and $A^* = \partial_\theta\bar f(\theta^*)$ is non-singular.

Tightness is used to justify an approximation of the algorithm with its linearization (10). The proof of tightness is left to future work. In Section A.2 of the supplementary material we consider the linearization, and show that the asymptotic covariance satisfies $\Sigma_\varepsilon = \Sigma^* + \varepsilon^2\Sigma^{(2)} + O(\varepsilon^3)$, where $\Sigma_\varepsilon$ is the asymptotic covariance obtained for Zap Q-learning (21), $\Sigma^*$ is the optimal covariance, and $\Sigma^{(2)}$ is identified in Prop. A.3.

**Overview of Proof of Thm. 2.1** [complete proofs are found in the supplementary material]

The first step is analysis of the ODE (7) that $\{\theta_n\}$ aims to approximate. It is shown in Prop. A.6 that the ODE (7) admits at least one solution $\{w_t : t\ge 0\}$ from each initial condition; this is non-trivial, since the right-hand side is discontinuous. We then conclude under (A2) that $\lim_{t\to\infty}\bar f(w_t) = 0$ for any solution. If in addition (A3) holds, then the ODE is globally asymptotically stable. These conclusions are obtained through a uniform approximation based on a family of smooth vector fields.

Discontinuity of the ODE and the stochastic recursion presents a greater challenge when we turn to establishing solidarity between the ODE and its stochastic counterpart. The main idea in this part of the analysis is most easily described for a special case: a linear parameterization $Q^\theta = \theta^{\!\top}\psi$, and non-negativity of $\zeta_n$ so that the chain rule (25) holds. We then obtain subgradients of the components of $f$, which implies the following component-wise bounds:

$$f(\theta_n+v,\Phi_{n+1}) = \zeta_n\max_u\big\{r(X_n,U_n) + [\gamma\psi(X_{n+1},u) - \psi(X_n,U_n)]^{\!\top}(\theta_n+v)\big\} \ \ge\ f(\theta_n,\Phi_{n+1}) + A_{n+1}v\,,\qquad v\in\mathbb{R}^d \qquad (27)$$

The update equation for $\widehat A_{n+1}$ in (21b) is used to obtain the averaged version of (27):

$$\bar f(\theta_n+v) \ge \bar f(\theta_n) + \widehat A_{n+1}v + o(1)\,,\qquad v\in\mathbb{R}^d\,,\ \|v\|\le 1 \qquad (28)$$

where $o(1)\to 0$ as $n\to\infty$, uniformly in $v$. This implies that $\widehat A_{n+1}$ is close to the set of subgradients $\mathcal{A}(\theta_n)$, in the sense made precise in Prop. A.18.
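The component-wise bound (27) is easy to verify numerically for the linear special case. The sketch below uses randomly generated non-negative features $\psi$ as a stand-in for a basis (all names are placeholders); it simply checks that each sampled $A_{n+1}$ acts as a subgradient matrix of the function $f(\,\cdot\,,\Phi_{n+1})$, which is convex in $\theta$ component-wise.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, gamma = 4, 3, 0.9

psi_xu = rng.random(d)                    # psi(X_n, U_n); non-negative entries, so zeta_n >= 0
psi_next = rng.random((n_actions, d))     # psi(X_{n+1}, u) for each candidate input u
r = rng.random()                          # r(X_n, U_n)

def f_sample(theta):
    """f(theta, Phi_{n+1}) = zeta_n * D(theta, Phi_{n+1}) for the linear case Q^theta = psi' theta."""
    td = r + gamma * np.max(psi_next @ theta) - psi_xu @ theta
    return psi_xu * td

def A_sample(theta):
    """A_{n+1} from (21a); the D * (d zeta / d theta) term vanishes since zeta is theta-free here."""
    u_star = int(np.argmax(psi_next @ theta))                   # greedy input phi^theta(X_{n+1})
    return np.outer(psi_xu, gamma * psi_next[u_star] - psi_xu)

theta = rng.standard_normal(d)
for _ in range(100):
    v = rng.standard_normal(d)
    assert np.all(f_sample(theta + v) >= f_sample(theta) + A_sample(theta) @ v - 1e-12)  # bound (27)
```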
The arguments are considerably more complex when $Q^\theta$ is nonlinear, and when the positivity assumption on $\zeta_n$ is relaxed. In particular, without positivity, neither $f$ nor $\bar f$ admits generalized subgradients. Fortunately, the techniques developed for the special case can be adapted to demonstrate that $\widehat A_{n+1}v$ approximates the directional derivative $\bar f'(\theta_n; v)$ for each $v$ and all large $n$. Once all of these technical results are established, we obtain solidarity between the Zap algorithm (21) and the ODE (7) in the sense that $\lim_{n\to\infty}\bar f(\theta_n) = \lim_{t\to\infty}\bar f(w_t) = 0$, along with the other conclusions of Thm. 2.1.

## 3 Numerical Results

The Zap Q-learning algorithm was tested on three examples from OpenAI Gym: Mountain Car, Acrobot, and Cartpole [1]. The approximation of $Q^*$ was obtained based on a neural network, so that the parameter $\theta\in\mathbb{R}^d$ represents the weights of the neural network. Rather than achieving the best score for specific tasks, the objective of the experiments surveyed in this section was to investigate the stability and consistency of the Zap Q-learning algorithm across different domains and varying neural network sizes. Common to each experiment: a fully connected feedforward neural network, using the Leaky ReLU activation function.

The goal in each of the three examples is to collect as many rewards as possible before the state reaches a terminal set denoted $\mathsf{S}\subset\mathsf{X}$. To avoid infinite values we introduce a deterministic upper bound $T\le\infty$, and consider the bounded horizon $\tau = \min(T, \tau_{\mathsf{S}})$ with $\tau_{\mathsf{S}} = \min\{n\ge 1 : X_n\in\mathsf{S}\}$. The Q-function is denoted

$$Q^\star(x,u) := \max\ \mathrm{E}\Big[\sum_{n=0}^{\tau-1} r(X_n,U_n)\ \Big|\ X_0=x,\ U_0=u\Big]$$

which for $T=\infty$ solves the Bellman equation

$$Q^\star(x,u) = r(x,u) + \mathrm{E}\big[\mathbb{I}\{X_1\notin\mathsf{S}\}\,\underline{Q}^\star(X_1)\mid X_0=x,\ U_0=u\big] \qquad (29)$$

Following the roadmap outlined in Section 2, we seek an approximate solution to (29) based on a root-finding problem analogous to (18): find $\theta^*$ such that

$$\bar f(\theta^*) = 0\,,\quad\text{with } \bar f(\theta) := \mathrm{E}\Big[\zeta_n\big(r(X_n,U_n) + \mathbb{I}\{X_{n+1}\notin\mathsf{S}\}\,\underline{Q}^\theta(X_{n+1}) - Q^\theta(X_n,U_n)\big)\Big] \qquad (30)$$

where the distribution of $X_0$ is given. The Zap-Q algorithm (21) is easily adapted to the modified definition of $f$ in (30).

[Figure 1: Average reward $\mathcal{R}(\phi^{\theta_n})$ versus training episodes for the three examples (Mountain Car, Acrobot, Cartpole), shown by percentile (median, 25-75, and 10-25 ranges) for a range of neural network hidden-layer sizes.]

The performance of the greedy policy $\phi^\theta$ induced by $Q^\theta$ is denoted

$$\mathcal{R}(\phi^\theta) := \mathrm{E}\Big[\sum_{n=0}^{\tau-1} r(X_n,\phi^\theta(X_n))\Big] \qquad (31)$$

Specifics of the meta-parameters and the details of how (31) was estimated are contained in Section A.6. Two minor modifications of the algorithm were used in these experiments to reduce complexity; a code sketch combining the Zap-Q update (21) with these modifications follows this list.

1. Periodic gain update. An integer $N_d>1$ was fixed, and the gain $G_n$ appearing in (21d) was updated only at integer multiples of $N_d$. In particular, the matrix inversion step was performed only at these iterations. Letting $N$ denote the total number of iterations, the overall complexity of running this algorithm for $N$ iterations is in the worst case $O(Nd^3/N_d + Nd^2)$ (see Section A.6 of the supplementary material for further discussion of complexity). We observed that $N_d = 50$ worked well for all experiments, and the performance was unchanged from $N_d = 1$.

2. Periodic eligibility update. The definition of the eligibility vector in (2) was modified to break the $\theta_n$-dependency: $\zeta_n := \nabla_\theta Q^{\theta_{i(n)}}(X_n,U_n)$, with $i(n) := (\lfloor n/N+1\rfloor - 1)N$ and $N = 2000$. We then ignored the term $D(\theta_n,\Phi_{n+1})\,\partial_\theta\zeta_n$ when computing $A_{n+1}$ based on (21a).
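The sketch below combines the Zap-Q recursion (21) with the two modifications above (periodic gain update, frozen eligibility, and the dropped $D\,\partial_\theta\zeta_n$ term), written for the episodic objective (30), in which the discount is replaced by the terminal indicator. The interface callables `sample`, `q`, and `grad_q` are assumptions about how the environment and the differentiable approximator are exposed; this is an illustration of the update under those assumptions, not the exact implementation used for Figure 1.

```python
import numpy as np

def zap_q_update(n, theta, theta_zeta, A_hat, G, sample, q, grad_q, actions,
                 eps=1.0, n0=1000, rho=0.85, Nd=50):
    """One Zap-Q iteration, following (21) adapted to the episodic objective (30), with the
    modifications above: the gain G is refreshed only every Nd iterations, the eligibility uses
    a frozen parameter theta_zeta (updated by the caller every N iterations), and the
    D * d(zeta)/d(theta) term in (21a) is dropped.

    Assumed interface (placeholders, not specified in the paper):
      sample() -> (x, u, r, x_next, done); q(theta, x, u) -> Q^theta(x, u);
      grad_q(theta, x, u) -> gradient of Q^theta(x, u), an array in R^d."""
    x, u, r, x_next, done = sample()
    alpha = 1.0 / (n + n0)                      # step-sizes (22)
    beta = alpha ** rho

    zeta = grad_q(theta_zeta, x, u)             # frozen eligibility vector (modification 2)
    grad_cur = grad_q(theta, x, u)
    if done:                                    # X_{n+1} in the terminal set S
        td = r - q(theta, x, u)
        grad_next = np.zeros_like(grad_cur)
    else:
        u_star = max(actions, key=lambda a: q(theta, x_next, a))   # greedy input at X_{n+1}
        td = r + q(theta, x_next, u_star) - q(theta, x, u)
        grad_next = grad_q(theta, x_next, u_star)

    A_n = np.outer(zeta, grad_next - grad_cur)  # (21a) without the D * d(zeta)/d(theta) term
    A_hat = A_hat + beta * (A_n - A_hat)        # (21b)
    if n % Nd == 0:                             # periodic gain update for (21c)
        d = theta.size
        G = -np.linalg.solve(eps * np.eye(d) + A_hat.T @ A_hat, A_hat.T)
    theta = theta + alpha * G @ (td * zeta)     # (21d), with f = D * zeta
    return theta, A_hat, G
```

The caller maintains $(\theta_n, \widehat A_n, G_n)$ and the frozen eligibility parameter across iterations; $G$ must be initialized, for example by forcing a gain update on the first call.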
For comparison, we also performed experiments in which the term $D(\theta_n,\Phi_{n+1})\,\partial_\theta\zeta_n$ was included (increasing complexity considerably, since second derivatives are required [11]), and saw no improvement in performance.

Experiments were performed with both decreasing step-sizes (defined in (22)) and constant step-sizes. We found that constant step-size implementations were more reliable for the Mountain Car and Acrobot examples, while the diminishing step-size gave better results for Cartpole. Figure 1 shows results obtained for these choices. The size of the neural networks indicated in the figure refers only to hidden layers. To obtain the quantiles shown, each experiment was repeated 50 times, with parameters randomly initialized by the Kaiming uniform method [20, 36]. The first column shows results from the smallest network for which we obtained reliable results for the particular example.

## 4 Conclusions

Zap Q-learning is provably consistent with nonlinear function approximation, under very general conditions. Theoretical questions remain, such as extension to more general exploration strategies, and convergence properties for more general step-size rules. There are also architectural questions. For example, the definition of the eligibility vector (2) is not sacred. Better overall performance, and simpler conditions for convergence, may be achieved through alternatives (there are obvious choices for deterministic control systems rather than MDPs).

Better algorithm design requires a better theory for function approximation in Q-learning. The parameter estimates in both Zap Q-learning and the basic algorithm (1) are solutions to the root-finding problem (18). How do we bound the Bellman error, or the absolute error $|Q^{\theta^*} - Q^*|$? The goal is to create a theory as complete as that of TD-learning with linear function approximation, for which vector space concepts bring crisp answers [48, Theorem 1].

The matrix-inversion step in the algorithm may be a barrier to application of Zap-Q in some problems. A simple approach to reduce complexity is described in the numerical results, and we expect to obtain much more efficient implementations, perhaps by applying a distributed implementation [2] and adapting techniques from stochastic optimization. We are currently exploring the application of the techniques in this paper to analyze Deep Q-learning [33], and the application of Zap-SA to actor-critic methods.

Acknowledgments: Financial support from ARO grant W911NF1810334 is gratefully acknowledged, with additional support from EPCN 1935389 and CPS 1646229, and from French National Research Agency grant ANR-16-CE05-0008.

## Broader Impact

This paper has focused on formulating Q-learning algorithms for which reliability is guaranteed by design. It is hoped that it will inspire the creation of a larger toolkit for ODE algorithm design, and methods to more efficiently translate the ODE to obtain better algorithms for reinforcement learning, especially in other contexts such as actor-critic methods. Given the importance of stochastic approximation in so many other fields, such as optimization, it is hoped that the impact will extend far beyond reinforcement learning.

## References

[1] OpenAI Gym website. github.com/openai/gym/wiki.

[2] R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Second order optimization made practical. arXiv preprint arXiv:2002.09018, 2020.

[3] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis and S. Russell, editors, Machine Learning Proceedings 1995, pages 30–37. Morgan Kaufmann, San Francisco, CA, 1995.
[4] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.

[5] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson.

[6] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations. Springer, 2012.

[7] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Cambridge, Mass., 1996.

[8] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

[9] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.

[10] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. (See also IEEE CDC, 1998.)

[11] W. L. Buntine and A. S. Weigend. Computing second derivatives in feed-forward networks: a review. IEEE Transactions on Neural Networks, 5(3):480–488, 1994.

[12] S. Chen, A. M. Devraj, A. Bušić, and S. Meyn. Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. arXiv e-prints, and to appear AISTATS, arXiv:2002.02584, Feb. 2020.

[13] Z. Chen, S. Zhang, T. T. Doan, S. T. Maguluri, and J.-P. Clarke. Performance of Q-learning with linear function approximation: Stability and finite-time analysis. arXiv: Optimization and Control, 2019.

[14] F. Clarke. Functional Analysis, Calculus of Variations and Optimal Control, volume 264. Springer Science & Business Media, 2013.

[15] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv e-prints, July 2017.

[16] A. M. Devraj and S. P. Meyn. Zap Q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

[17] L. C. Evans. Weak Convergence Methods for Nonlinear Partial Differential Equations. Number 74 in CBMS Regional Conference Series. American Mathematical Soc., 3rd edition, 1990.

[18] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003.

[19] G. J. Gordon. Reinforcement learning with function approximation converges to a region. In Proc. of the 13th International Conference on Neural Information Processing Systems, pages 996–1002, Cambridge, MA, USA, 2000. MIT Press.

[20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[21] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

[22] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.

[23] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.

[24] T. Kailath. Linear Systems, volume 156. Prentice-Hall, Englewood Cliffs, NJ, 1980.

[25] P. Karmakar and S. Bhatnagar. Dynamics of stochastic approximation with iterate-dependent Markov noise under verifiable conditions in compact state space with the stability of iterates not ensured. arXiv e-prints, arXiv:1601.02217, Jan. 2016.
[26] H. K. Khalil. Nonlinear Systems. Prentice-Hall, Upper Saddle River, NJ, 3rd edition, 2002.

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[28] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004.

[29] V. V. G. Konda. Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology.

[30] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, 1997.

[31] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, ICML '10, pages 719–726, USA, 2010. Omnipress.

[32] F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 664–671, New York, NY, USA, 2008. ACM.

[33] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[34] Y. Nesterov. Lectures on Convex Optimization. Springer Optimization and Its Applications. Springer International Publishing, 2018.

[35] I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepesvári, S. Singh, et al. Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568, 2019.

[36] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. OpenReview.net, openreview.net/forum?id=BJJsrmfCZ, 2017.

[37] A. Ramaswamy and S. Bhatnagar. A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Mathematics of Operations Research, 42(3):648–661, 2017.

[38] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[39] B. Shi, S. S. Du, W. Su, and M. I. Jordan. Acceleration via symplectic discretization of high-resolution differential equations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5744–5752. Curran Associates, Inc., 2019.

[40] S. Shivam, I. Buckley, Y. Wardi, C. Seatzu, and M. Egerstedt. Tracking control by the Newton-Raphson flow: Applications to autonomous vehicles. CoRR, abs/1811.08033, 2018.

[41] S. Smale. A convergent process of price adjustment and global Newton methods. Journal of Mathematical Economics, 3(2):107–120, July 1976.

[42] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

[43] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS '95, pages 1038–1044, Cambridge, MA, USA, 1995. MIT Press.
[44] R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS '08, pages 1609–1616, Red Hook, NY, USA, 2008. Curran Associates Inc.

[45] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th International Conference on Neural Information Processing Systems, NIPS '97, pages 1064–1070, Cambridge, MA, USA, 1997. MIT Press.

[46] J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.

[47] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.

[48] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.

[49] Y. Wardi, C. Seatzu, M. Egerstedt, and I. Buckley. Performance regulation and tracking via lookahead simulation: Preliminary results and validation. In Proc. of the IEEE Conference on Decision and Control, pages 6462–6468, 2017.

[50] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.