Stabilizing Dynamical Systems via Policy Gradient Methods

Juan C. Perdomo, University of California, Berkeley
Jack Umenberger, MIT
Max Simchowitz, MIT

Correspondence to jcperdomo@berkeley.edu
35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract

Stabilizing an unknown control system is one of the most fundamental problems in control systems engineering. In this paper, we provide a simple, model-free algorithm for stabilizing fully observed dynamical systems. While model-free methods have become increasingly popular in practice due to their simplicity and flexibility, stabilization via direct policy search has received surprisingly little attention. Our algorithm proceeds by solving a series of discounted LQR problems, where the discount factor is gradually increased. We prove that this method efficiently recovers a stabilizing controller for linear systems, and for smooth, nonlinear systems within a neighborhood of their equilibria. Our approach overcomes a significant limitation of prior work, namely the need for a pre-given stabilizing control policy. We empirically evaluate the effectiveness of our approach on common control benchmarks.

1 Introduction

Stabilizing an unknown control system is one of the most fundamental problems in control systems engineering. A wide variety of tasks, from maintaining a dynamical system around a desired equilibrium point to tracking a reference signal (e.g., a pilot's input to a plane), can be recast in terms of stability. More generally, synthesizing an initial stabilizing controller is often a necessary first step towards solving more complex tasks, such as adaptive or robust control design [Sontag, 1999, 2009].

In this work, we consider the problem of finding a stabilizing controller for an unknown dynamical system via direct policy search methods. We introduce a simple procedure based on policy gradients which provably stabilizes a dynamical system around an equilibrium point. Our algorithm only requires access to a simulator which can return rollouts of the system under different control policies, and can efficiently stabilize both linear and smooth, nonlinear systems.

Relative to model-based approaches, model-free procedures, such as policy gradients, have two key advantages: they are conceptually simple to implement, and they are easily adaptable; that is, the same method can be applied in a wide variety of domains without much regard to the intricacies of the underlying dynamics. Due to their simplicity and flexibility, direct policy search methods have become increasingly popular amongst practitioners, especially in settings with complex, nonlinear dynamics which may be challenging to model. In particular, they have served as the main workhorse for recent breakthroughs in reinforcement learning and control [Silver et al., 2016, Mnih et al., 2015, Andrychowicz et al., 2020].

Despite their popularity amongst practitioners, model-free approaches for continuous control have only recently started to receive attention from the theory community [Fazel et al., 2018, Kakade et al., 2020, Tu and Recht, 2019]. While these analyses have begun to map out the computational and statistical tradeoffs that emerge in choosing between model-based and model-free approaches, they all share a common assumption: that the unknown dynamical system in question is stable, or that an initial stabilizing controller is known.
As such, they do not address the perhaps more basic question: how do we arrive at a stabilizing controller in the first place?

1.1 Contributions

We establish a reduction from stabilizing an unknown dynamical system to solving a series of discounted, infinite-horizon LQR problems via policy gradients, for which no knowledge of an initial stabilizing controller is needed. Our approach, which we call discount annealing, gradually increases the discount factor and yields a control policy which is near optimal for the undiscounted LQR objective. To the best of our knowledge, our algorithm is the first model-free procedure shown to provably stabilize unknown dynamical systems, thereby solving an open problem from Fazel et al. [2018].

We begin by studying linear, time-invariant dynamical systems with full state observation, and assume access to inexact cost and gradient evaluations of the discounted, infinite-horizon LQR cost of a state-feedback controller $K$. Previous analyses (e.g., [Fazel et al., 2018]) establish how such evaluations can be implemented with access to (finitely many, finite-horizon) trajectories sampled from a simulator. We show that our method recovers a controller $K$ which is optimal for the undiscounted LQR problem, up to optimization and simulator error, in a bounded number of iterations. The stability of the resulting $K$ is guaranteed by known stability margin results for LQR. In short, we prove the following guarantee:

Theorem 1 (informal). For linear systems, discount annealing returns a stabilizing state-feedback controller which is also near-optimal for the LQR problem. It uses at most polynomially many $\varepsilon$-inexact gradient and cost evaluations, where the tolerance $\varepsilon$ also depends polynomially on the relevant problem parameters.

Since both the number of queries and the error tolerance are polynomial, discount annealing can be efficiently implemented using at most polynomially many samples from a simulator. Furthermore, our results extend to smooth, nonlinear dynamical systems. Given access to a simulator that can return damped system rollouts, we show that our algorithm finds a controller that attains near-optimal LQR cost for the Jacobian linearization of the nonlinear dynamics at the equilibrium. We then show that this controller stabilizes the nonlinear system within a neighborhood of its equilibrium.

Theorem 2 (informal). Discount annealing returns a state-feedback controller which is exponentially stabilizing for smooth, nonlinear systems within a neighborhood of their equilibrium, again using only polynomially many samples drawn from a simulator.

In each case, the algorithm returns a near-optimal solution $\widehat{K}$ to the relevant LQR problem (or a local approximation thereof). Hence, the stability properties of $\widehat{K}$ are, in theory, no better than those of the optimal LQR controller $K_\star$. Importantly, the latter may have worse stability guarantees than the optimal solution of a corresponding robust control objective (e.g., $\mathcal{H}_\infty$ synthesis). Nevertheless, we focus on the LQR subroutine in the interest of simplicity and clarity, and in order to leverage prior analyses of model-free methods for LQR. Extending our procedure to robust-control objectives is an exciting direction for future work.
Lastly, while our theoretical analysis only guarantees that the resulting controller will be stabilizing within a small neighborhood of the equilibrium, our simulations on nonlinear systems, such as the nonlinear cart-pole, illustrate that discount annealing produces controllers that are competitive with established robust control procedures, such as $\mathcal{H}_\infty$ synthesis, without requiring any knowledge of the underlying dynamics.

1.2 Related work

Given its central importance to the field, stabilization of unknown and uncertain dynamical systems has received extensive attention within the controls literature. We review some of the relevant literature and point the reader towards classical texts for a more comprehensive treatment [Sontag, 2009, Sastry and Bodson, 2011, Zhou et al., 1996, Callier and Desoer, 2012, Zhou and Doyle, 1998].

Model-based approaches. Model-based methods construct approximate system models in order to synthesize stabilizing control policies. Traditional analyses consider stabilization of both linear and nonlinear dynamical systems in the asymptotic limit of sufficient data [Sastry and Bodson, 2011, Sastry, 2013]. More recent, non-asymptotic studies have focused almost entirely on linear systems, where the controller is generated using data from multiple independent trajectories [Fiechter, 1997, Dean et al., 2019, Faradonbeh et al., 2018a, 2019]. Assuming the model is known, stabilizing policies may also be synthesized via convex optimization [Prajna et al., 2004], by combining a dual Lyapunov theorem [Rantzer, 2001] with sum-of-squares programming [Parrilo, 2003]. Relative to these analyses, our focus is on strengthening the theoretical foundations of model-free procedures and establishing rigorous guarantees that policy gradient methods can also be used to generate stabilizing controllers.

Online control. Online control studies the problem of adaptively fine-tuning the performance of an already-stabilizing control policy on a single trajectory [Dean et al., 2018, Faradonbeh et al., 2018b, Cohen et al., 2019, Mania et al., 2019, Simchowitz and Foster, 2020, Hazan et al., 2020, Simchowitz et al., 2020, Kakade et al., 2020]. Though early papers in this direction consider systems without pre-given stabilizing controllers [Abbasi-Yadkori and Szepesvári, 2011], their guarantees degrade exponentially in the system dimension (a penalty ultimately shown to be unavoidable by Chen and Hazan [2021]). Rather than fine-tuning an already stabilizing controller, we focus on the more basic problem of finding a controller which is stabilizing in the first place, and allow for the use of multiple independent trajectories.

Model-free approaches. Model-free approaches eschew trying to approximate the underlying dynamics and instead directly search over the space of control policies. The landmark paper of Fazel et al. [2018] proves that, despite the non-convexity of the problem, direct policy search on the infinite-horizon LQR objective efficiently converges to the globally optimal policy, assuming the search is initialized at an already stabilizing controller. Fazel et al. [2018] pose the synthesis of this initial stabilizing controller via policy gradients as an open problem, one that we solve in this work. Following this result, there have been a large number of works studying policy gradient procedures in continuous control; see for example Feng and Lavaei [2020], Malik et al. [2019], Mohammadi et al. [2020, 2021], Zhang et al. [2021], just to name a few.
Relative to our analysis, these papers consider questions of policy fine-tuning, derivative-free methods, and robust (or distributed) control, which are important yet somewhat orthogonal to the stabilization question considered herein. The recent analysis by Lamperski [2020] is perhaps the most closely related piece of prior work. It proposes a model-free, off-policy algorithm for computing a stabilizing controller for deterministic LQR systems. Much like discount annealing, that algorithm also works by alternating between policy optimization (in their case, a closed-form policy improvement step based on the Riccati update) and increasing a damping factor. However, whereas we provide precise finite-time convergence guarantees to a stabilizing controller for both linear and nonlinear systems, the guarantees in Lamperski [2020] are entirely asymptotic and restricted to linear systems. Furthermore, we pay special attention to quantifying the various error tolerances in the gradient and cost queries to ensure that the algorithm can be efficiently implemented with finite samples.

1.3 Background on stability of dynamical systems

Before introducing our results, we first review some basic concepts and definitions regarding stability of dynamical systems. In this paper, we study discrete-time, noiseless, time-invariant dynamical systems with states $x_t \in \mathbb{R}^{d_x}$ and control inputs $u_t \in \mathbb{R}^{d_u}$. In particular, given an initial state $x_0$, the dynamics evolve according to $x_{t+1} = G(x_t, u_t)$, where $G : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}^{d_x}$ is a state transition map. An equilibrium point of a dynamical system is a state $x_\star \in \mathbb{R}^{d_x}$ such that $G(x_\star, 0) = x_\star$. As per convention, we assume that the origin $x_\star = 0$ is the desired equilibrium point around which we wish to stabilize the system. This paper restricts its attention to static state-feedback policies of the form $u_t = K x_t$ for a fixed matrix $K \in \mathbb{R}^{d_u \times d_x}$. Abusing notation slightly, we conflate the matrix $K$ with its induced policy. Our aim is to find a policy $K$ which is exponentially stabilizing around the equilibrium point.

Time-invariant linear systems, where $G(x, u) = Ax + Bu$, are stabilizable if and only if there exists a $K$ such that $A + BK$ is a stable matrix [Callier and Desoer, 2012]; that is, if $\rho(A + BK) < 1$, where $\rho(X)$ denotes the spectral radius, or largest eigenvalue magnitude, of a matrix $X$. For general nonlinear systems, our goal is to find controllers which satisfy the following general, quantitative definition of exponential stability (e.g., Chapter 5.2 in Sastry [2013]). Throughout, $\|\cdot\|$ denotes the Euclidean norm.

Definition 1.1. A controller $K$ is $(m, \alpha)$-exponentially stable for dynamics $G$, for constants $m, \alpha > 0$, if, when inputs are chosen according to $u_t = K x_t$, the sequence of states $x_{t+1} = G(x_t, u_t)$ satisfies
$$\|x_t\| \le m \exp(-\alpha t)\, \|x_0\|. \tag{1.1}$$
Likewise, $K$ is $(m, \alpha)$-exponentially stable on radius $r > 0$ if (1.1) holds for all $x_0$ such that $\|x_0\| \le r$.

For linear systems, a controller $K$ is stabilizing if and only if it is stable over the entire state space; for nonlinear systems, however, the restriction to stabilization over a particular radius is in general needed. Our approach for stabilizing nonlinear systems relies on analyzing their Jacobian linearization about the origin equilibrium. Given a continuously differentiable transition operator $G$, the local dynamics can be approximated by the Jacobian linearization $(A_{\mathrm{jac}}, B_{\mathrm{jac}})$ of $G$ about the zero equilibrium; that is,
$$A_{\mathrm{jac}} := \nabla_x G(x, u)\,\big|_{(x,u)=(0,0)}, \qquad B_{\mathrm{jac}} := \nabla_u G(x, u)\,\big|_{(x,u)=(0,0)}. \tag{1.2}$$
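To make Eq. (1.2) and the spectral-radius criterion concrete, the following sketch computes a Jacobian linearization by automatic differentiation and checks whether a candidate gain satisfies $\rho(A + BK) < 1$. The discrete-time pendulum dynamics, step size, and gain below are illustrative assumptions and not part of the paper.

```python
# The pendulum model, step size, and candidate gain below are illustrative
# assumptions, not taken from the paper.
import numpy as np
import torch

dt = 0.05  # assumed Euler step size


def G(x, u):
    """Assumed nonlinear transition map: a damped pendulum about its upright equilibrium."""
    theta, omega = x[0], x[1]
    return torch.stack([
        theta + dt * omega,
        omega + dt * (torch.sin(theta) - 0.1 * omega + u[0]),
    ])


# Jacobian linearization (A_jac, B_jac) of G at (x, u) = (0, 0), as in Eq. (1.2).
A_jac, B_jac = torch.autograd.functional.jacobian(G, (torch.zeros(2), torch.zeros(1)))
A, B = A_jac.numpy(), B_jac.numpy()


def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))


K = np.array([[-2.0, -1.5]])                         # arbitrary candidate state-feedback gain
print("rho(A)      =", spectral_radius(A))           # > 1: the open-loop equilibrium is unstable
print("rho(A + BK) =", spectral_radius(A + B @ K))   # < 1 certifies that K stabilizes the linearization
```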
In particular, for $\|x\|$ and $\|u\|$ sufficiently small, $G(x, u) = A_{\mathrm{jac}} x + B_{\mathrm{jac}} u + f_{\mathrm{nl}}(x, u)$, where $f_{\mathrm{nl}}(x, u)$ is a nonlinear remainder from the Taylor expansion of $G$. To ensure that stabilization via state feedback is feasible, we assume throughout our presentation that the linearized dynamics $(A_{\mathrm{jac}}, B_{\mathrm{jac}})$ are stabilizable.

2 Stabilizing Linear Dynamical Systems

We now present our main results establishing how our algorithm, discount annealing, provably stabilizes linear dynamical systems via a reduction to direct policy search methods. We begin with the following preliminaries on the Linear Quadratic Regulator (LQR).

Definition 2.1 (LQR Objective). For a given starting state $x$, we define the LQR objective $J_{\mathrm{lin}}$ with discount factor $\gamma \in (0, 1]$, dynamics matrices $(A, B)$, and state-feedback controller $K$ as
$$J_{\mathrm{lin}}(K \mid x, \gamma, A, B) := \sum_{t=0}^{\infty} \gamma^t \left( x_t^\top Q x_t + u_t^\top R u_t \right) \quad \text{s.t.} \quad u_t = K x_t, \;\; x_{t+1} = A x_t + B u_t, \;\; x_0 = x.$$
Here, $x_t \in \mathbb{R}^{d_x}$, $u_t \in \mathbb{R}^{d_u}$, and $Q, R$ are positive definite matrices. Slightly overloading notation, we define
$$J_{\mathrm{lin}}(K \mid \gamma, A, B) := \mathbb{E}_{x_0 \sim \sqrt{d_x}\,\mathcal{S}^{d_x - 1}} \big[ J_{\mathrm{lin}}(K \mid x_0, \gamma, A, B) \big]$$
to be the same objective, but where the initial state is drawn from the uniform distribution over the sphere in $\mathbb{R}^{d_x}$ of radius $\sqrt{d_x}$. (This scaling is chosen so that the initial state distribution has identity covariance, and yields cost equivalent to $x_0 \sim \mathcal{N}(0, I)$.)

To simplify our presentation, we adopt the shorthand $J_{\mathrm{lin}}(K \mid \gamma) := J_{\mathrm{lin}}(K \mid \gamma, A, B)$ in cases where the system dynamics $(A, B)$ are understood from context. Furthermore, we assume that $(A, B)$ is stabilizable and that $\lambda_{\min}(Q), \lambda_{\min}(R) \ge 1$. It is a well-known fact that $K_{\star,\gamma} := \arg\min_K J_{\mathrm{lin}}(K \mid \gamma, A, B)$ achieves the minimum LQR cost over all possible control laws.

We begin our analysis with the observation that the discounted LQR problem is equivalent to the undiscounted LQR problem with damped dynamics matrices. (This lemma is folklore within the controls community; see, e.g., Lamperski [2020].)

Lemma 2.1. For all controllers $K$ such that $J_{\mathrm{lin}}(K \mid \gamma, A, B) < \infty$,
$$J_{\mathrm{lin}}(K \mid \gamma, A, B) = J_{\mathrm{lin}}(K \mid 1, \sqrt{\gamma}\, A, \sqrt{\gamma}\, B).$$

From this equivalence, it follows from basic facts about LQR that a controller $K$ satisfies $J_{\mathrm{lin}}(K \mid \gamma, A, B) < \infty$ if and only if $\sqrt{\gamma}(A + BK)$ is stable. Consequently, for $\gamma < \rho(A)^{-2}$, the zero controller is stabilizing for the damped dynamics, and one can solve the discounted LQR problem via direct policy search initialized at $K = 0$ [Fazel et al., 2018].

Discount Annealing
Initialize: objective $J(\cdot \mid \cdot)$, $\gamma_0 \in (0, \rho(A)^{-2})$, $K_0 \leftarrow 0$, and $Q \succeq I$, $R \succeq I$.
For $t = 0, 1, \dots$:
1. If $\gamma_t = 1$, run policy gradients once more as in Step 2, break, and return the resulting $K'$.
2. Using policy gradients (see Eq. (2.1)) initialized at $K_t$, find $K'$ such that
   $$J_{\mathrm{lin}}(K' \mid \gamma_t) - \min_K J_{\mathrm{lin}}(K \mid \gamma_t) \le d_x. \tag{2.2}$$
3. Update the initial controller $K_{t+1} \leftarrow K'$.
4. Using binary or random search, find a discount factor $\gamma' \in [\gamma_t, 1]$ such that
   $$2.5\, J(K_{t+1} \mid \gamma_t) \le J(K_{t+1} \mid \gamma') \le 8\, J(K_{t+1} \mid \gamma_t). \tag{2.3}$$
5. Update the discount factor $\gamma_{t+1} \leftarrow \gamma'$.

Figure 1: Discount annealing algorithm. The procedure is identical for both linear and nonlinear systems. For linear systems we initialize $J = J_{\mathrm{lin}}(\cdot \mid \gamma_0)$, and for nonlinear systems $J = J_{\mathrm{nl}}(\cdot \mid \gamma_0, r_\star)$, where $r_\star$ is chosen as in Theorem 2. See Theorem 1, Theorem 2, and Appendix C for details regarding policy gradients and binary (or random) search. The constants above are chosen for convenience; any constants $c_1, c_2$ such that $1 < c_1 < c_2$ suffice.
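For intuition, the outer loop of Figure 1 can be written down directly. The sketch below instantiates it for a small, known linear system; purely to keep the sketch short, Step 2 is carried out with an exact discounted Riccati solve rather than the paper's model-free policy gradient inner loop, and costs are evaluated through the Lyapunov equation rather than through a simulator. The example system and constants are assumptions.

```python
# A compact sketch of the discount annealing loop of Figure 1 for a *known*
# linear system (A, B).  Step 2 (policy gradients) is replaced here by an exact
# discounted Riccati solve, and costs come from the Lyapunov equation (2.4).
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.1, 0.3], [0.0, 1.2]])   # unstable open loop (assumed example)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)


def cost(K, gamma):
    """J_lin(K | gamma) = tr(P_{K,gamma}); +inf if sqrt(gamma)(A + BK) is unstable."""
    Acl = np.sqrt(gamma) * (A + B @ K)
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    return np.trace(P)


def best_K(gamma):
    """Stand-in for Step 2: optimal controller of the damped system via a Riccati solve."""
    Ad, Bd = np.sqrt(gamma) * A, np.sqrt(gamma) * B
    P = solve_discrete_are(Ad, Bd, Q, R)
    return -np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)


gamma = 0.9 / np.max(np.abs(np.linalg.eigvals(A))) ** 2   # gamma_0 < rho(A)^{-2}
K = np.zeros((1, 2))
while True:
    K = best_K(gamma)                     # Steps 2-3
    if gamma >= 1.0:
        break
    c = cost(K, gamma)
    lo, hi = gamma, 1.0                   # Step 4: binary search for gamma'
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if cost(K, mid) > 8 * c:
            hi = mid
        else:
            lo = mid
    gamma = 1.0 if cost(K, 1.0) <= 8 * c else lo   # Step 5

print("rho(A + BK) =", np.max(np.abs(np.linalg.eigvals(A + B @ K))))  # < 1: stabilizing
```

In the model-free setting of the paper, `best_K` would be replaced by the policy gradient inner loop of Eq. (2.1) and `cost` by simulator-based evaluations, as discussed next.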
At this point, one may wonder whether the solution to this highly discounted problem yields a controller which stabilizes the undiscounted system. If this were true, running policy gradients (defined in Eq. (2.1)) to convergence on a single discounted LQR problem would suffice to find a stabilizing controller:
$$K_{t+1} = K_t - \eta\, \nabla_K J_{\mathrm{lin}}(K_t \mid \gamma), \qquad K_0 = 0, \;\; \eta > 0. \tag{2.1}$$
Unfortunately, the following proposition shows that this is not the case.

Proposition 2.2 (Impossibility of Reward Shaping). Fix $A = \mathrm{diag}(0, 2)$. For any positive definite cost matrices $Q, R$ and discount factor $\gamma$ such that $\sqrt{\gamma} A$ is stable, there exists a matrix $B$ such that $(A, B)$ is controllable (and thus stabilizable), yet the optimal controller $K_{\star,\gamma} := \arg\min_K J_{\mathrm{lin}}(K \mid \gamma, A, B)$ for the discounted problem is such that $A + B K_{\star,\gamma}$ is unstable.

We now describe the discount annealing procedure for linear systems (Figure 1), which provably recovers a stabilizing controller $K$. For simplicity, we present the algorithm assuming access to noisy, bounded cost and gradient evaluations which satisfy the following definition. Employing standard arguments from [Fazel et al., 2018, Flaxman et al., 2005], we illustrate in Appendix C how these evaluations can be efficiently implemented using polynomially many samples drawn from a simulator.

Definition 2.2 (Gradient and Cost Queries). Given an error parameter $\varepsilon > 0$ and a function $J : \mathbb{R}^d \to \mathbb{R}$, $\varepsilon$-$\mathrm{Grad}(J, z)$ returns a vector $e$ such that $\|e - \nabla J(z)\|_F \le \varepsilon$. Similarly, $\varepsilon$-$\mathrm{Eval}(J, z, c)$ returns a scalar $v$ such that $|v - \min\{J(z), c\}| \le \varepsilon$.

The procedure leverages the equivalence (Lemma 2.1) between discounted costs and damped dynamics for LQR, and the consequence that the zero controller is stabilizing if we choose $\gamma_0$ sufficiently small. Hence, for this discount factor, we may apply policy gradients initialized at the zero controller in order to recover a controller $K_1$ which is near-optimal for the $\gamma_0$-discounted objective. Our key insight is that, due to known stability margins for LQR controllers, $K_1$ is stabilizing for the $\gamma_1$-discounted dynamics for some discount factor $\gamma_1 > (1 + c)\gamma_0$, where $c$ is a small constant that has a uniform lower bound. Therefore, $K_1$ has finite cost on the $\gamma_1$-discounted problem, so that we may again use policy gradients initialized at $K_1$ to compute a near-optimal controller $K_2$ for this larger discount factor. By iterating, we have that $\gamma_t \ge (1 + c)^t \gamma_0$ and can increase the discount factor up to 1, yielding a near-optimal stabilizing controller for the undiscounted LQR objective.

The rate at which we can increase the discount factors $\gamma_t$ depends on certain properties of the (unknown) dynamical system. Therefore, we opt for binary search to compute the desired $\gamma'$ in the absence of system knowledge.
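One plausible way the oracles of Definition 2.2 could be approximated from simulator rollouts alone, in the spirit of the smoothing arguments of Flaxman et al. [2005] cited above, is sketched below. The linear simulator hidden inside `rollout_cost`, the horizon, the smoothing radius, and the sample counts are illustrative assumptions; the actual construction in Appendix C may differ.

```python
# Assumed sketch of epsilon-Eval and epsilon-Grad built from black-box rollouts.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.1, 0.3], [0.0, 1.2]])      # hidden inside the simulator
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
d_x = 2


def rollout_cost(K, gamma, x0, H=300):
    """Finite-horizon discounted cost of u_t = K x_t from one simulator rollout."""
    x, total = x0, 0.0
    for t in range(H):
        u = K @ x
        total += gamma ** t * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u                    # black-box simulator step
    return total


def eval_oracle(K, gamma, clip, n=50):
    """Approximate epsilon-Eval: Monte Carlo estimate of min{J_lin(K | gamma), clip}."""
    costs = []
    for _ in range(n):
        x0 = rng.standard_normal(d_x)
        x0 *= np.sqrt(d_x) / np.linalg.norm(x0)      # x0 ~ sqrt(d_x) * sphere
        costs.append(min(rollout_cost(K, gamma, x0), clip))  # clipping tames unstable rollouts
    return float(np.mean(costs))


def grad_oracle(K, gamma, clip, sigma=0.05, n=20):
    """Approximate epsilon-Grad via two-point Gaussian smoothing (cost queries only)."""
    g = np.zeros_like(K)
    for _ in range(n):
        U = rng.standard_normal(K.shape)
        delta = eval_oracle(K + sigma * U, gamma, clip) - eval_oracle(K - sigma * U, gamma, clip)
        g += delta / (2 * sigma) * U
    return g / n      # with a large clipping level, this approximates grad J_lin(K | gamma)
```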
Discount annealing then yields the following guarantee, which we state in terms of properties of the matrix $P_\star$, the optimal value function for the undiscounted LQR problem, which satisfies $\min_K J_{\mathrm{lin}}(K \mid 1) = \mathrm{tr}[P_\star]$ (see Appendix A for further details).

Theorem 1 (Linear Systems). Let $M_{\mathrm{lin}} := \max\{16\,\mathrm{tr}[P_\star],\, J_{\mathrm{lin}}(K_0 \mid \gamma_0)\}$. The following statements are true regarding the discount annealing algorithm when run on linear dynamical systems:
a) Discount annealing returns a controller $\widehat{K}$ which is $(\sqrt{2\,\mathrm{tr}[P_\star]},\, (4\,\mathrm{tr}[P_\star])^{-1})$-exponentially stable.
b) If $\gamma_0 < 1$, the algorithm is guaranteed to halt whenever $t$ is greater than $64\,\mathrm{tr}[P_\star]^4 \log(1/\gamma_0)$.
Furthermore, at each iteration $t$:
c) Policy gradients, as defined in Eq. (2.1), achieves the guarantee in Eq. (2.2) using only $\mathrm{poly}(M_{\mathrm{lin}}, \|A\|_{\mathrm{op}}, \|B\|_{\mathrm{op}})$ many queries to $\varepsilon$-$\mathrm{Grad}$ for $J_{\mathrm{lin}}(\cdot \mid \gamma)$, as long as $\varepsilon$ is less than $\mathrm{poly}(M_{\mathrm{lin}}^{-1}, \|A\|_{\mathrm{op}}^{-1}, \|B\|_{\mathrm{op}}^{-1})$.
d) The noisy binary search algorithm (see Figure 2) returns a discount factor $\gamma'$ satisfying Eq. (2.3) using at most $4 \log(\mathrm{tr}[P_\star]) + 10$ many queries to $\varepsilon$-$\mathrm{Eval}$ for $J_{\mathrm{lin}}(\cdot \mid \gamma)$, with $\varepsilon = 0.1\, d_x$.

We remark that since $\varepsilon$ need only be polynomially small in the relevant problem parameters, each call to $\varepsilon$-$\mathrm{Grad}$ and $\varepsilon$-$\mathrm{Eval}$ can be carried out using only polynomially many samples from a simulator which returns finite-horizon system trajectories under various control policies. We make this claim formal in Appendix C.

Proof. We prove part b) of the theorem and defer the proofs of the remaining parts to Appendix A. Define $P_{K,\gamma}$ to be the solution of the discrete-time Lyapunov equation; that is, for $\sqrt{\gamma}(A + BK)$ stable, $P_{K,\gamma}$ solves
$$P_{K,\gamma} = Q + K^\top R K + \gamma (A + BK)^\top P_{K,\gamma} (A + BK). \tag{2.4}$$
Using this notation, $P_\star = P_{K_\star, 1}$ is the solution of the above Lyapunov equation with $\gamma = 1$. The key step of the proof is Proposition A.4, which uses Lyapunov theory to verify the following: given the current discount factor $\gamma_t$, the idealized discount factor $\gamma'_{t+1}$ defined by
$$\gamma'_{t+1} := \Big( \frac{1}{8 \|P_{K_{t+1},\gamma_t}\|_{\mathrm{op}}^4} + 1 \Big)^2 \gamma_t$$
satisfies
$$J_{\mathrm{lin}}(K_{t+1} \mid \gamma'_{t+1}) = \mathrm{tr}\big[P_{K_{t+1},\gamma'_{t+1}}\big] \le 2\, \mathrm{tr}\big[P_{K_{t+1},\gamma_t}\big] = 2\, J_{\mathrm{lin}}(K_{t+1} \mid \gamma_t).$$
Since the control cost is non-decreasing in $\gamma$, the binary search update in Step 4 ensures that the actual $\gamma_{t+1}$ also satisfies
$$\gamma_{t+1} \ge \Big( \frac{1}{8 \|P_{K_{t+1},\gamma_t}\|_{\mathrm{op}}^4} + 1 \Big)^2 \gamma_t \ge \Big( \frac{1}{128\, \mathrm{tr}[P_\star]^4} + 1 \Big)^2 \gamma_t.$$
The following calculation, which uses $d_x \le \mathrm{tr}[P_\star]$ (valid since $\lambda_{\min}(Q) \ge 1$), justifies the second inequality above:
$$\|P_{K_{t+1},\gamma_t}\|_{\mathrm{op}}^4 \le \mathrm{tr}\big[P_{K_{t+1},\gamma_t}\big]^4 = J_{\mathrm{lin}}(K_{t+1} \mid \gamma_t)^4 \le \big( \min_K J_{\mathrm{lin}}(K \mid \gamma_t) + d_x \big)^4 \le 16\, \mathrm{tr}[P_\star]^4.$$
Therefore, $\gamma_t \ge \big( \tfrac{1}{128\, \mathrm{tr}[P_\star]^4} + 1 \big)^{2t} \gamma_0$. The precise bound follows from taking logs of both sides and using the numerical inequality $\log(1 + x) \ge x/2$ to simplify the denominator.
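A quick numerical check of the identity $J_{\mathrm{lin}}(K \mid \gamma) = \mathrm{tr}[P_{K,\gamma}]$ used in the proof, which is the damping equivalence of Lemma 2.1 expressed at the level of Lyapunov equations. The $2 \times 2$ system, controller, horizon, and sample count are assumed purely for illustration.

```python
# Compare tr(P_{K,gamma}) from Eq. (2.4) against a truncated Monte Carlo estimate
# of the discounted cost on the undamped dynamics.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.1, 0.3], [0.0, 1.2]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-1.0, -1.9]])
gamma = 0.7

# P_{K,gamma} from the Lyapunov equation (2.4), written for the sqrt(gamma)-damped closed loop.
Acl = A + B @ K
P = solve_discrete_lyapunov(np.sqrt(gamma) * Acl.T, Q + K.T @ R @ K)

# Discounted cost with x0 ~ sqrt(d_x) * sphere, estimated by truncated summation.
rng = np.random.default_rng(0)
costs = []
for _ in range(1000):
    x = rng.standard_normal(2)
    x *= np.sqrt(2) / np.linalg.norm(x)
    total = 0.0
    for t in range(200):
        u = K @ x
        total += gamma ** t * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u                  # undamped dynamics, explicit discounting
    costs.append(total)

print(np.trace(P), np.mean(costs))         # should agree up to Monte Carlo and truncation error
```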
3 Stabilizing Nonlinear Dynamical Systems

We now extend the guarantees of the discount annealing algorithm to smooth, nonlinear systems. Whereas our study of linear systems explicitly leveraged the equivalence of discounted costs and damped dynamics, our analysis for nonlinear systems requires access to system rollouts under damped dynamics, since the previous equivalence between discounting and damping breaks down in nonlinear settings. More specifically, in this section we assume access to a simulator which, given a controller $K$, returns trajectories generated according to $x_{t+1} = \sqrt{\gamma}\, G_{\mathrm{nl}}(x_t, K x_t)$ for any damping factor $\gamma \in (0, 1]$, where $G_{\mathrm{nl}}$ is the transition operator of the nonlinear system. While such trajectories may be infeasible to generate on a physical system, we believe they are reasonable to consider when the dynamics are represented using software simulators, as is often the case in practice [Lewis et al., 2003, Peng et al., 2018].

The discount annealing algorithm for nonlinear systems is almost identical to the algorithm for linear systems. It again works by repeatedly solving a series of quadratic cost objectives on the nonlinear dynamics, as defined below, and progressively increasing the damping factor $\gamma$.

Definition 3.1 (Nonlinear Objective). For a state-feedback controller $K : \mathbb{R}^{d_x} \to \mathbb{R}^{d_u}$, damping factor $\gamma \in (0, 1]$, and an initial state $x$, we define
$$J_{\mathrm{nl}}(K \mid x, \gamma) := \sum_{t=0}^{\infty} x_t^\top Q x_t + u_t^\top R u_t \tag{3.1}$$
$$\text{s.t.} \quad u_t = K x_t, \quad x_{t+1} = \sqrt{\gamma}\, G_{\mathrm{nl}}(x_t, u_t), \quad x_0 = x. \tag{3.2}$$
Overloading notation as before, we let
$$J_{\mathrm{nl}}(K \mid \gamma, r) := \frac{d_x}{r^2}\, \mathbb{E}_{x_0 \sim r\, \mathcal{S}^{d_x - 1}} \big[ J_{\mathrm{nl}}(K \mid x_0, \gamma) \big].$$

The normalization by $d_x / r^2$ above is chosen so that the nonlinear objective coincides with the LQR objective when $G_{\mathrm{nl}}$ is in fact linear. Relative to the linear case, the only algorithmic difference for nonlinear systems is that we introduce an extra parameter $r$ which determines the radius of the initial state distribution. As established in Theorem 2, this parameter must be chosen small enough to ensure that discount annealing succeeds. Our analysis pertains to dynamics which satisfy the following smoothness assumption.

Assumption 1 (Local Smoothness). The transition map $G_{\mathrm{nl}}$ is continuously differentiable. Furthermore, there exist $r_{\mathrm{nl}}, \beta_{\mathrm{nl}} > 0$ such that for all $(x, u) \in \mathbb{R}^{d_x + d_u}$ with $\|x\| + \|u\| \le r_{\mathrm{nl}}$,
$$\big\| \nabla_{x,u} G_{\mathrm{nl}}(x, u) - \nabla_{x,u} G_{\mathrm{nl}}(0, 0) \big\|_{\mathrm{op}} \le \beta_{\mathrm{nl}} (\|x\| + \|u\|).$$
For simplicity, we assume $\beta_{\mathrm{nl}} \ge 1$ and $r_{\mathrm{nl}} \le 1$.

Using Assumption 1, we can apply Taylor's theorem to rewrite $G_{\mathrm{nl}}$ as its Jacobian linearization around the equilibrium point plus a nonlinear remainder term.

Lemma 3.1. If $G_{\mathrm{nl}}$ satisfies Assumption 1, then for all $x, u$ with $\|x\| + \|u\| \le r_{\mathrm{nl}}$,
$$G_{\mathrm{nl}}(x, u) = A_{\mathrm{jac}} x + B_{\mathrm{jac}} u + f_{\mathrm{nl}}(x, u), \tag{3.3}$$
where $\|f_{\mathrm{nl}}(x, u)\| \le \beta_{\mathrm{nl}}(\|x\|^2 + \|u\|^2)$, $\|\nabla f_{\mathrm{nl}}(x, u)\|_{\mathrm{op}} \le \beta_{\mathrm{nl}}(\|x\| + \|u\|)$, and $(A_{\mathrm{jac}}, B_{\mathrm{jac}})$ are the system's Jacobian linearization matrices defined in Eq. (1.2).

Rather than trying to directly understand the behavior of stabilization procedures on the nonlinear system, the key insight of our nonlinear analysis is that we can reason about the performance of a state-feedback controller on the nonlinear system via its behavior on the system's Jacobian linearization. In particular, the following lemma establishes that any controller which achieves finite discounted LQR cost on the Jacobian linearization is guaranteed to be exponentially stabilizing on the damped nonlinear system for initial states that are small enough. Throughout the remainder of this section, we define $J_{\mathrm{lin}}(\cdot \mid \gamma) := J_{\mathrm{lin}}(\cdot \mid \gamma, A_{\mathrm{jac}}, B_{\mathrm{jac}})$ as the LQR objective from Definition 2.1 with $(A, B) = (A_{\mathrm{jac}}, B_{\mathrm{jac}})$.

Lemma 3.2 (Restatement of Lemma B.2). Suppose that $C_J = J_{\mathrm{lin}}(K \mid \gamma) < \infty$. Then $K$ is $(C_J^{1/2}, (4 C_J)^{-1})$-exponentially stable on the damped system $\sqrt{\gamma}\, G_{\mathrm{nl}}$ over radius $r = r_{\mathrm{nl}} / (\beta_{\mathrm{nl}} C_J^{3/2})$.

The second main building block of our nonlinear analysis is the observation that if the dynamics are locally smooth around the equilibrium point, then by Lemma 3.1, decreasing the radius $r$ of the initial state distribution $x_0 \sim r\, \mathcal{S}^{d_x - 1}$ reduces the magnitude of the nonlinear remainder term $f_{\mathrm{nl}}$. Hence, the nonlinear system smoothly approximates its Jacobian linearization. More precisely, we establish that the differences in costs and gradients between $J_{\mathrm{nl}}(K \mid \gamma, r)$ and $J_{\mathrm{lin}}(K \mid \gamma)$ decrease linearly with the radius $r$.

Proposition 3.3. Assume $J_{\mathrm{lin}}(K \mid \gamma) < \infty$. Then, for $P_{K,\gamma}$ defined as in Eq. (2.4):
a) If $r \le \dfrac{r_{\mathrm{nl}}}{2 \beta_{\mathrm{nl}} \|P_{K,\gamma}\|_{\mathrm{op}}^2}$, then $\big| J_{\mathrm{nl}}(K \mid \gamma, r) - J_{\mathrm{lin}}(K \mid \gamma) \big| \le 8\, d_x \beta_{\mathrm{nl}} \|P_{K,\gamma}\|_{\mathrm{op}}^4\, r$.
b) If $r \le \dfrac{1}{12 \beta_{\mathrm{nl}} \|P_{K,\gamma}\|_{\mathrm{op}}^{5/2}}$, then $\big\| \nabla_K J_{\mathrm{nl}}(K \mid \gamma, r) - \nabla_K J_{\mathrm{lin}}(K \mid \gamma) \big\|_F \le 48\, d_x \beta_{\mathrm{nl}} (1 + \|B\|_{\mathrm{op}}) \|P_{K,\gamma}\|_{\mathrm{op}}^7\, r$.
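A small numerical illustration of Proposition 3.3(a): on a smooth system, the normalized cost $J_{\mathrm{nl}}(K \mid \gamma, r)$ approaches $J_{\mathrm{lin}}(K \mid \gamma)$ as the initial-state radius $r$ shrinks. The damped pendulum model, controller, horizon, and sample sizes below are assumptions made only for this sketch.

```python
# Compare J_nl(K | gamma, r) for shrinking r against J_lin(K | gamma) = tr(P_{K,gamma}).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

dt, gamma = 0.05, 0.9
A = np.array([[1.0, dt], [dt, 1.0 - 0.1 * dt]])      # Jacobian linearization of G_nl at (0, 0)
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-4.0, -3.0]])


def G_nl(x, u):
    """Assumed nonlinear pendulum step; its linearization at (0, 0) is (A, B) above."""
    theta, omega = x
    return np.array([theta + dt * omega,
                     omega + dt * (np.sin(theta) - 0.1 * omega + u[0])])


def J_nl(K, gamma, r, n=300, H=300, rng=np.random.default_rng(0)):
    d_x, total = 2, 0.0
    for _ in range(n):
        x = rng.standard_normal(d_x)
        x *= r / np.linalg.norm(x)                   # x0 ~ r * sphere
        for _ in range(H):
            u = K @ x
            total += x @ Q @ x + u @ R @ u
            x = np.sqrt(gamma) * G_nl(x, u)          # damped nonlinear dynamics
    return (d_x / r ** 2) * total / n                # normalization from Definition 3.1


Acl = np.sqrt(gamma) * (A + B @ K)
P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
print("J_lin =", np.trace(P))
for r in [1.0, 0.3, 0.1, 0.03]:
    print("r =", r, " J_nl =", J_nl(K, gamma, r))    # gap shrinks as r decreases, consistent with (a)
```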
Lastly, because the policy gradient method for linear dynamical systems is robust to inexact gradient queries, we show that for $r$ sufficiently small, running policy gradients on $J_{\mathrm{nl}}$ converges to a controller whose performance is close to that of the optimal controller for the LQR problem with dynamics matrices $(A_{\mathrm{jac}}, B_{\mathrm{jac}})$. As noted previously, we can then use Lemma 3.2 to translate the performance of the optimal LQR controller for the Jacobian linearization into an exponential stability guarantee for the nonlinear dynamics. Using these insights, we establish the following theorem regarding discount annealing for nonlinear dynamics.

Theorem 2 (Nonlinear Systems). Let $M_{\mathrm{nl}} := \max\{21\,\mathrm{tr}[P_\star],\, J_{\mathrm{lin}}(K_0 \mid \gamma_0)\}$. The following statements are true regarding the discount annealing algorithm for nonlinear dynamical systems when the initial state radius $r_\star$ is less than a fixed quantity that is $\mathrm{poly}(1/M_{\mathrm{nl}}, 1/\|A\|_{\mathrm{op}}, 1/\|B\|_{\mathrm{op}}, r_{\mathrm{nl}}/\beta_{\mathrm{nl}})$:
a) Discount annealing returns a controller $\widehat{K}$ which is $(\sqrt{2\,\mathrm{tr}[P_\star]},\, (8\,\mathrm{tr}[P_\star])^{-1})$-exponentially stable over a radius $r = r_{\mathrm{nl}} / (8 \beta_{\mathrm{nl}}\, \mathrm{tr}[P_\star]^2)$.
b) If $\gamma_0 < 1$, the algorithm is guaranteed to halt whenever $t$ is greater than $64\,\mathrm{tr}[P_\star]^4 \log(1/\gamma_0)$.
Furthermore, at each iteration $t$:
c) Policy gradients achieves the guarantee in Eq. (2.2) using only $\mathrm{poly}(M_{\mathrm{nl}}, \|A\|_{\mathrm{op}}, \|B\|_{\mathrm{op}})$ many queries to $\varepsilon$-$\mathrm{Grad}$ for $J_{\mathrm{nl}}(\cdot \mid \gamma, r_\star)$, as long as $\varepsilon$ is less than some fixed polynomial $\mathrm{poly}(M_{\mathrm{nl}}^{-1}, \|A\|_{\mathrm{op}}^{-1}, \|B\|_{\mathrm{op}}^{-1})$.
d) Let $c_0$ denote a universal constant. With probability $1 - \delta$, the noisy random search algorithm (see Figure 2) returns a discount factor $\gamma'$ satisfying Eq. (2.3) using at most $c_0\, \mathrm{tr}[P_\star]^4 \log(1/\delta)$ queries to $\varepsilon$-$\mathrm{Eval}$ for $J_{\mathrm{nl}}(\cdot \mid \gamma, r_\star)$, with $\varepsilon = 0.01\, d_x$.

We note that while our theorem only guarantees that the controller is stabilizing around a polynomially small neighborhood of the equilibrium, in experiments we find that the resulting controller successfully stabilizes the dynamics for a wide range of initial conditions. Relative to the case of linear systems, where we leveraged the monotonicity of the LQR cost in order to search for discount factors using binary search, this monotonicity breaks down for nonlinear systems, and we instead analyze a random search algorithm to simplify the analysis.

4 Experiments

In this section, we evaluate the ability of the discount annealing algorithm to stabilize a simulated nonlinear system. Specifically, we consider the familiar cart-pole, with $d_x = 4$ (positions and velocities of the cart and pole) and $d_u = 1$ (horizontal force applied to the cart). The goal is to stabilize the system with the pole in the unstable upright equilibrium position. For further details, including the precise dynamics, see Appendix D.1. The system was simulated in discrete time with a simple forward Euler discretization, i.e., $x_{t+1} = x_t + T_s \dot{x}_t$, where $\dot{x}_t$ is given by the continuous-time dynamics and $T_s = 0.05$ (20 Hz). Simulations were carried out in PyTorch [Paszke et al., 2019] and run on a single GPU.

Table 1: Final region-of-attraction radius $r_{\mathrm{roa}}$ as a function of the initial state radius $r$ used during training (discount annealing). We report the [min, max] values of $r_{\mathrm{roa}}$ over 5 independent trials. The optimal LQR policy for the linearized system achieved $r_{\mathrm{roa}} = 0.703$ when applied to the nonlinear system. We also synthesized an $\mathcal{H}_\infty$-optimal controller for the linearized dynamics, which achieved $r_{\mathrm{roa}} = 0.506$.

r        0.1             0.3             0.5             0.6             0.7
r_roa    [0.702, 0.704]  [0.711, 0.713]  [0.727, 0.734]  [0.731, 0.744]  [0.769, 0.777]

Setup. The discount annealing algorithm of Figure 1 was implemented as follows. In place of the true infinite-horizon discounted cost $J_{\mathrm{nl}}(K \mid \gamma, r)$ in Eq. (C.4), we use a finite-horizon, finite-sample Monte Carlo approximation as described in Appendix C,
$$J_{\mathrm{nl}}(K \mid \gamma) \approx \frac{1}{N} \sum_{i=1}^{N} J^{(H)}_{\mathrm{nl}}(K \mid \gamma, x^{(i)}), \qquad x^{(i)} \sim r\, \mathcal{S}^{d_x - 1}.$$
Here, $J^{(H)}_{\mathrm{nl}}(K \mid \gamma, x) = \sum_{t=0}^{H-1} x_t^\top Q x_t + u_t^\top R u_t$ is the length-$H$, finite-horizon cost of a controller $K$, in which the states evolve according to the $\gamma$-damped dynamics from Eq. (C.5) and $u_t = K x_t$. We used $N = 5000$ and $H = 1000$ in our experiments. For the cost function, we used $Q = T_s I$ and $R = T_s$. We compute unbiased approximations of the gradients using automatic differentiation on the finite-horizon objective $J^{(H)}_{\mathrm{nl}}$.
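A minimal, runnable sketch of this finite-horizon Monte Carlo objective and its automatic differentiation in PyTorch is given below. The cart-pole model used here is the standard frictionless one with assumed parameters (it need not match Appendix D.1 exactly), and $N$, $H$, and $\gamma$ are smaller than the experimental values purely for brevity.

```python
# Assumed sketch: Monte Carlo finite-horizon cost J^(H)_nl under damped, Euler-discretized
# cart-pole dynamics, differentiated with respect to K by autograd.
import torch

Ts = 0.05
d_x, d_u = 4, 1
Q, R = Ts * torch.eye(d_x), Ts * torch.eye(d_u)


def f_continuous(x, u):
    """Standard frictionless cart-pole, theta measured from the upright position (assumed params)."""
    _, vel, theta, omega = x[:, 0], x[:, 1], x[:, 2], x[:, 3]
    mc, mp, l, g = 1.0, 0.1, 0.5, 9.8
    s, c = torch.sin(theta), torch.cos(theta)
    tmp = (u[:, 0] + mp * l * omega ** 2 * s) / (mc + mp)
    alpha = (g * s - c * tmp) / (l * (4.0 / 3.0 - mp * c ** 2 / (mc + mp)))
    acc = tmp - mp * l * alpha * c / (mc + mp)
    return torch.stack([vel, acc, omega, alpha], dim=1)


def J_H(K, gamma, r, N=256, H=400):
    """Average finite-horizon cost over x0 ~ r * sphere, with sqrt(gamma)-damped Euler steps."""
    x = torch.randn(N, d_x)
    x = r * x / x.norm(dim=1, keepdim=True)
    cost = torch.zeros(())
    damp = gamma ** 0.5
    for _ in range(H):
        u = x @ K.T
        cost = cost + ((x @ Q) * x).sum() + ((u @ R) * u).sum()
        x = damp * (x + Ts * f_continuous(x, u))     # damped forward Euler step
    return cost / N


K = torch.zeros(d_u, d_x, requires_grad=True)
loss = J_H(K, gamma=0.5, r=0.1)                      # gamma small enough that K = 0 has finite cost
loss.backward()                                      # K.grad now holds the policy gradient of J^(H)
```

Backpropagating through the rollout in this way is what the setup above refers to as computing gradient approximations by automatic differentiation on $J^{(H)}_{\mathrm{nl}}$.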
Instead of using SGD updates for policy gradients, we use Adam [Kingma and Ba, 2014] with a learning rate of $\eta = 0.01/r$. Furthermore, we replace the policy gradient termination criterion in Step 2 (Eq. (2.2)) by instead halting after a fixed number ($M = 200$) of gradient descent steps. We wish to emphasize that the hyperparameters $(N, H, \eta, M)$ were not optimized for performance. In particular, for $r = 0.1$, we found that as few as $M = 40$ iterations of policy gradient and horizons as short as $H = 400$ were sufficient. Finally, we used an initial discount factor $\gamma_0 = 0.9\, \|A_{\mathrm{jac}}\|_2^{-2}$, where $A_{\mathrm{jac}}$ denotes the linearization of the (discrete-time) cart-pole about the vertical equilibrium.

Results. We now discuss the performance of the algorithm, focusing on three main properties of interest: i) the number of iterations of discount annealing required to find a stabilizing controller (that is, to increase $\gamma_t$ to 1), ii) the maximum radius $r$ of the ball of initial conditions $x_0 \sim r\, \mathcal{S}^{d_x - 1}$ for which discount annealing succeeds at stabilizing the system, and iii) the radius $r_{\mathrm{roa}}$ of the largest ball contained within the region of attraction (ROA) for the policy returned by discount annealing. Although the true ROA (the set of all initial conditions from which the closed-loop system converges asymptotically to the equilibrium point) is not necessarily shaped like a ball (the system is more sensitive to perturbations in the position and velocity of the pole than of the cart), we use the term region of attraction radius to refer to the radius of the largest ball contained in the ROA.

Concerning (i), discount annealing reliably returned a stabilizing policy in at most 9 iterations. Specifically, over 5 independent trials for each initial radius $r \in \{0.1, 0.3, 0.5, 0.7\}$ (giving 20 independent trials in total), the algorithm never required more than 9 iterations to return a stabilizing policy. Concerning (ii), discount annealing reliably stabilized the system for $r \le 0.7$. For $r \ge 0.75$, we observed trials in which the state of the damped system ($\gamma < 1$) diverged to infinity. For such a rollout, the gradient of the cost is not well defined, and policy gradient is unable to improve the policy, which prevents discount annealing from finding a stabilizing policy. Concerning (iii), in Table 1 we report the final radius $r_{\mathrm{roa}}$ of the region of attraction of the final controller returned by discount annealing, as a function of the training radius $r$. We make the following observations. Foremost, the policy returned by discount annealing extends the radius of the ROA beyond the radius used during training, i.e., $r_{\mathrm{roa}} > r$. Moreover, for each $r > 0.1$, the $r_{\mathrm{roa}}$ achieved by discount annealing is greater than the $r_{\mathrm{roa}} = 0.703$ achieved by the exact optimal LQR controller and the $r_{\mathrm{roa}} = 0.506$ achieved by the exact optimal $\mathcal{H}_\infty$ controller for the system's Jacobian linearization (see Table 1). (The $\mathcal{H}_\infty$-optimal controller mitigates the effect of worst-case additive state disturbances on the cost; cf. Appendix D.2 for details.) One may hypothesize that this is due to the fact that discount annealing directly operates on the true nonlinear dynamics, whereas the other baselines (LQR and $\mathcal{H}_\infty$ control) find the optimal controller for an idealized linearization of the dynamics.
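The region-of-attraction radii reported in Table 1 can be estimated, for any fixed controller, along the following lines: simulate the undamped closed loop from initial conditions sampled on spheres of increasing radius, and bisect on the largest radius from which every sampled trajectory converges to the origin. The thresholds, sample counts, and the `step(x, u)` interface below are assumptions; the paper's exact evaluation protocol may differ.

```python
# Assumed sketch of an ROA-radius estimate for a fixed gain K and a black-box
# closed-loop step function step(x, u) -> next state.
import numpy as np


def converges(step, K, x0, H=2000, tol=1e-3, blowup=1e3):
    """True if the closed-loop trajectory from x0 settles into a small ball around 0."""
    x = np.array(x0, dtype=float)
    for _ in range(H):
        x = step(x, K @ x)
        if np.linalg.norm(x) > blowup:
            return False
    return np.linalg.norm(x) < tol


def roa_radius(step, K, d_x, n_dirs=200, lo=0.0, hi=2.0, iters=20, rng=np.random.default_rng(0)):
    """Bisect on the largest radius r such that all sampled directions r*d converge (assumes hi fails)."""
    dirs = rng.standard_normal((n_dirs, d_x))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if all(converges(step, K, mid * d) for d in dirs):
            lo = mid                     # every sampled direction converged from radius mid
        else:
            hi = mid
    return lo                            # estimated radius of the largest ball inside the ROA


# Hypothetical usage with an Euler-discretized simulator f and a trained gain K:
# r_roa = roa_radius(lambda x, u: x + 0.05 * f(x, u), K, d_x=4)
```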
Indeed, there is evidence to support this hypothesis. In Figure 3, presented in Appendix D, we plot the error $\|K^{\mathrm{pg}}_\star(\gamma_t) - K^{\mathrm{lin}}_\star(\gamma_t)\|_F$ between the policy $K^{\mathrm{pg}}_\star(\gamma_t)$ returned by policy gradients and the optimal LQR policy $K^{\mathrm{lin}}_\star(\gamma_t)$ for the (damped) linearized system, as a function of the discount factor $\gamma_t$ used in each iteration of the discount annealing algorithm. For a small training radius, such as $r = 0.05$, $K^{\mathrm{pg}}_\star(\gamma_t) \approx K^{\mathrm{lin}}_\star(\gamma_t)$ for all $\gamma_t$. However, for larger radii (e.g., $r = 0.7$), we see that $\|K^{\mathrm{pg}}_\star(\gamma_t) - K^{\mathrm{lin}}_\star(\gamma_t)\|_F$ steadily increases as $\gamma_t$ increases. That is, as discount annealing increases the discount factor $\gamma$ and the closed-loop trajectories explore regions of the state space where the dynamics are increasingly nonlinear, $K^{\mathrm{pg}}_\star$ begins to diverge from $K^{\mathrm{lin}}_\star$. Moreover, at the conclusion of discount annealing, $K^{\mathrm{pg}}_\star(1)$ achieves a lower cost, namely $[15.2, 15.4]$ vs. $[16.5, 16.8]$ (here $[a, b]$ denotes $[\min, \max]$ over 5 trials), and a larger $r_{\mathrm{roa}}$, namely $[0.769, 0.777]$ vs. $[0.702, 0.703]$, than $K^{\mathrm{lin}}_\star(1)$, suggesting that the method has indeed adapted to the nonlinearity of the system. Similar observations regarding the behavior of controllers fine-tuned via policy gradient methods are predicted by the theoretical results of Qu et al. [2020].

5 Discussion

This work illustrates how one can provably stabilize a broad class of dynamical systems via a simple model-free procedure based on policy gradients. In line with the simplicity and flexibility that have made model-free methods so popular in practice, our algorithm works under relatively weak assumptions and with little knowledge of the underlying dynamics. Furthermore, we solve an open problem from previous work [Fazel et al., 2018] and take a step towards placing model-free methods on more solid theoretical footing.

We believe that our results raise a number of interesting questions and directions for future work. In particular, our theoretical analysis states that discount annealing returns a controller whose stability properties are similar to those of the optimal LQR controller for the system's Jacobian linearization. We were therefore quite surprised when, in experiments, the resulting controller had a significantly larger region-of-attraction radius than the exact optimal LQR and $\mathcal{H}_\infty$ controllers for the linearization of the dynamics. It is an interesting and important direction for future work to gain a better understanding of exactly when and how model-free procedures are adaptive to the nonlinearities of the system and improve upon these model-based baselines. Furthermore, for our analysis of nonlinear systems, we require access to damped system trajectories. It would be valuable to understand whether this is indeed necessary, or whether our analysis could be extended to work without access to damped trajectories.

As a final note, in this work we reduce the problem of stabilizing dynamical systems to running policy gradients on a discounted LQR objective. This choice of reducing to LQR was made in part for simplicity, to leverage previous analyses. However, it is possible that overall performance could be improved if, rather than reducing to LQR, we instead ran a model-free method that directly optimizes a robust control objective (which explicitly deals with uncertainty in the system dynamics). We believe that understanding these tradeoffs in objectives and their associated sample complexities is an interesting avenue for future inquiry.
Acknowledgments

We would like to thank Peter Bartlett for many helpful conversations and comments throughout the course of this project, and Russ Tedrake for support with the numerical experiments. JCP is supported by an NSF Graduate Research Fellowship. MS is supported by an Open Philanthropy Fellowship grant. JU is supported by the National Science Foundation, Award No. EFMA-1830901, and the Department of the Navy, Office of Naval Research, Award No. N00014-18-1-2210.

References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1-26, 2011.

OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3-20, 2020.

Frank M. Callier and Charles A. Desoer. Linear System Theory. Springer Science & Business Media, 2012.

Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114-1143. PMLR, 2021.

Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret. In International Conference on Machine Learning, pages 1300-1309. PMLR, 2019.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188-4197, 2018.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1-47, 2019.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite-time adaptive stabilization of linear systems. IEEE Transactions on Automatic Control, 64(8):3498-3505, 2018a.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. On optimality of adaptive linear-quadratic regulators. arXiv preprint arXiv:1806.10749, 2018b.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Randomized algorithms for data-driven stabilization of stochastic linear systems. In 2019 IEEE Data Science Workshop (DSW), pages 170-174. IEEE, 2019.

Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467-1476. PMLR, 2018.

Han Feng and Javad Lavaei. Escaping locally optimal decentralized control policies via damping. In 2020 American Control Conference (ACC), pages 50-57. IEEE, 2020.

Claude-Nicolas Fiechter. PAC adaptive control of linear systems. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 72-80, 1997.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '05, pages 385-394, USA, 2005. Society for Industrial and Applied Mathematics.

Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Algorithmic Learning Theory, pages 408-421. PMLR, 2020.

Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control.
In Advances in Neural Information Processing Systems, volume 33, pages 15312-15325, 2020.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Andrew Lamperski. Computing stabilizing linear controllers via policy iteration. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1902-1907. IEEE, 2020.

Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302-1338, 2000.

Frank L. Lewis, Darren M. Dawson, and Chaouki T. Abdallah. Robot Manipulator Control: Theory and Practice. CRC Press, 2003.

Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter Bartlett, and Martin Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916-2925. PMLR, 2019.

Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154-10164, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Hesameddin Mohammadi, Mahdi Soltanolkotabi, and Mihailo R. Jovanović. On the linear convergence of random search for discrete-time LQR. IEEE Control Systems Letters, 5(3):989-994, 2020.

Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, and Mihailo R. Jovanović. Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. IEEE Transactions on Automatic Control, 2021.

Pablo A. Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293-320, 2003.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3803-3810, 2018. doi: 10.1109/ICRA.2018.8460528.

Juan C. Perdomo, Max Simchowitz, Alekh Agarwal, and Peter L. Bartlett. Towards a dimension-free understanding of adaptive linear control. In Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3681-3770. PMLR, 2021.

Stephen Prajna, Pablo A. Parrilo, and Anders Rantzer. Nonlinear control synthesis by convex optimization. IEEE Transactions on Automatic Control, 49(2):310-314, 2004.

Guannan Qu, Chenkai Yu, Steven Low, and Adam Wierman. Combining model-based and model-free methods for nonlinear control: A provably convergent policy gradient approach. arXiv preprint arXiv:2006.07476, 2020.

Anders Rantzer. A dual to Lyapunov's stability theorem. Systems & Control Letters, 42(3):161-168, 2001.

Shankar Sastry. Nonlinear Systems: Analysis, Stability, and Control, volume 10. Springer Science & Business Media, 2013.

Shankar Sastry and Marc Bodson. Adaptive Control: Stability, Convergence and Robustness. Courier Corporation, 2011.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR. In International Conference on Machine Learning, pages 8937-8948. PMLR, 2020.

Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320-3436. PMLR, 2020.

Eduardo D. Sontag. Nonlinear feedback stabilization revisited. In Dynamical Systems, Control, Coding, Computer Vision, pages 223-262. Springer, 1999.

Eduardo D. Sontag. Stability and feedback stabilization. Springer New York, New York, NY, 2009. ISBN 978-0-387-30440-3. doi: 10.1007/978-0-387-30440-3_515.

Stephen Tu and Benjamin Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036-3083. PMLR, 2019.

Kaiqing Zhang, Xiangyuan Zhang, Bin Hu, and Tamer Başar. Derivative-free policy optimization for risk-sensitive and robust control design: implicit regularization and sample complexity. arXiv preprint arXiv:2101.01041, 2021.

Kemin Zhou and John Comstock Doyle. Essentials of Robust Control, volume 104. Prentice Hall, 1998.

Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and Optimal Control, volume 40. Prentice Hall, New Jersey, 1996.