# Neural Lyapunov Control for Discrete-Time Systems

Junlin Wu (Computer Science & Engineering, Washington University in St. Louis, St. Louis, MO 63130; junlin.wu@wustl.edu)
Andrew Clark (Electrical & Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130; andrewclark@wustl.edu)
Yiannis Kantaros (Electrical & Systems Engineering, Washington University in St. Louis, St. Louis, MO 63130; ioannisk@wustl.edu)
Yevgeniy Vorobeychik (Computer Science & Engineering, Washington University in St. Louis, St. Louis, MO 63130; yvorobeychik@wustl.edu)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

While ensuring stability for linear systems is well understood, it remains a major challenge for nonlinear systems. A general approach in such cases is to compute a combination of a Lyapunov function and an associated control policy. However, finding Lyapunov functions for general nonlinear systems is a challenging task. To address this challenge, several methods have been proposed that represent Lyapunov functions using neural networks. However, such approaches either focus on continuous-time systems, or on highly restricted classes of nonlinear dynamics. We propose the first approach for learning neural Lyapunov control in a broad class of discrete-time systems. Three key ingredients enable us to effectively learn provably stable control policies. The first is a novel mixed-integer linear programming approach for verifying the discrete-time Lyapunov stability conditions, leveraging the particular structure of these conditions. The second is a novel approach for computing verified sublevel sets. The third is a heuristic gradient-based method for quickly finding counterexamples to significantly speed up Lyapunov function learning. Our experiments on four standard benchmarks demonstrate that our approach significantly outperforms state-of-the-art baselines. For example, on the path tracking benchmark, we outperform recent neural Lyapunov control baselines by an order of magnitude in both running time and the size of the region of attraction, and on two of the four benchmarks (cartpole and PVTOL), ours is the first automated approach to return a provably stable controller. Our code is available at: https://github.com/jlwu002/nlc_discrete.

## 1 Introduction

Stability analysis for dynamical systems aims to show that the system state will return to an equilibrium under small perturbations. Designing stable control in nonlinear systems commonly relies on constructing Lyapunov functions that can certify the stability of equilibrium points and estimate their region of attraction. However, finding Lyapunov functions for arbitrary nonlinear systems is a challenging task that requires substantial expertise and manual effort [Khalil, 2015, Lavaei and Bridgeman, 2023]. To address this challenge, recent progress has been made in learning Lyapunov functions in continuous-time nonlinear dynamics [Abate et al., 2020, Chang et al., 2019, Zhou et al., 2022]. However, few approaches exist for doing so in discrete-time systems [Dai et al., 2021], and none for general Lipschitz-continuous dynamics. Since modern learning-based controllers often take non-negligible time for computation, the granularity of control is effectively discrete-time, and developing approaches for stabilizing such controllers is a major open challenge.
We propose a novel method to learn Lyapunov functions and stabilizing controllers, represented as neural networks (NNs) with ReLU activation functions, for discrete-time nonlinear systems. The proposed framework broadly consists of a learner, which uses a gradient-based method for updating the parameters of the Lyapunov function and policy, and a verifier, which produces counterexamples (if any exist) to the stability conditions, which are then added as training data for the learner. The use of a verifier in the learning loop is critical to enable the proposed approach to return a provably stable policy. However, no prior approaches enable sound verification of neural Lyapunov stability for general discrete-time dynamical systems. The closest is Dai et al. [2021], who assume that dynamics are represented by a neural network, an assumption that rarely holds in real systems. On the other hand, approaches for continuous-time systems [Chang et al., 2019, Zhou et al., 2022] have limited efficacy in discrete-time environments, as our experiments demonstrate.

To address this gap, we develop a novel verification tool that checks whether the candidate NN Lyapunov function satisfies the Lyapunov conditions by solving Mixed-Integer Linear Programs (MILPs). This approach, which takes advantage of the structure of discrete-time Lyapunov stability conditions, can soundly verify a broad class of dynamical system models. However, using a sound verification tool in the learning loop makes it a significant bottleneck, severely limiting scalability. We address this problem by also developing a highly effective gradient-based technique for identifying counterexamples, resorting to the full MILP-based verifier only as a last resort. The full learning process thereby iterates between learning and verification steps, and returns only when the sound verifier is able to prove that the Lyapunov function and policy satisfy the stability conditions.

The final technical challenge stems from the difficulty of verifying stability near the origin [Chang et al., 2019], typically addressed heuristically by either adding a fixed tolerance to a stability condition [Dai et al., 2021], or excluding a small area around the origin from verification [Chang et al., 2019, Zhou et al., 2022]. We address it by adapting Lyapunov stability theory to ensure convergence to a small region near the origin, thereby achieving the first (to our knowledge) sound approach for computing a stable neural controller that explicitly accounts for such approximations near the origin.

We evaluate the proposed approach in comparison to state-of-the-art baselines on four standard nonlinear control benchmarks. On the two simpler domains (inverted pendulum and path following), our approach outperforms the state-of-the-art continuous-time neural Lyapunov control approaches by at least several factors and up to an order of magnitude in both running time and the size of the region of attraction. On the two more complex domains (cartpole and PVTOL), ours is the first automated approach that returns a provably stable controller. Moreover, our ablation experiments demonstrate that both the MILP-based verifier and the heuristic counterexample generation technique we propose are critical to the success of our approach. In summary, we make the following contributions:

- A novel MILP-based approach for verifying a broad class of discrete-time controlled nonlinear systems.
- A novel approach for learning provably verified stabilizing controllers for a broad class of discrete-time nonlinear systems, which combines our MILP-based verifier with a heuristic gradient-based approximate counterexample generation technique.
- A novel formalization of approximate stability in which the controller provably converges to a small ball near the origin in finite time.
- Extensive experiments on four standard benchmarks demonstrating that, by leveraging the special structure of Lyapunov stability conditions for discrete-time systems, our approach significantly outperforms prior art.

**Related Work.** Much prior work on learning Lyapunov functions focused on continuous-time systems [Abate et al., 2020, Ravanbakhsh and Sankaranarayanan, 2019, Chang et al., 2019, Zhou et al., 2022]. Common approaches have assumed that dynamics are linear [Donti et al., 2021, Tedrake, 2009] or polynomial [Ravanbakhsh and Sankaranarayanan, 2019]. Recently, Chang et al. [2019], Rego and de Araújo [2022], and Zhou et al. [2022] proposed learning Lyapunov functions represented as neural networks while restricting policies to be linear. These were designed for continuous-time dynamics, and are not effective in discrete-time settings, as we show in the experiments. Chen et al. [2021a] learn convex Lyapunov functions for discrete-time hybrid systems. Their approach requires hybrid systems to admit a mixed-integer linear programming formulation, essentially restricting it to piecewise-affine systems, and does not learn stabilizing controllers for these. Our approach considers a much broader class of dynamical system models, and learns provably stable controllers and Lyapunov functions, allowing both to be represented as neural networks. Kolter and Manek [2019] learn stable nonlinear dynamics represented as neural networks, but do not provide stability guarantees with respect to the true underlying system. In addition, several approaches have been proposed that either provide only probabilistic stability guarantees [Berkenkamp et al., 2017, Richards et al., 2018], or do not guarantee stability [Choi et al., 2020, Han et al., 2020]. Several recent approaches propose methods for verifying stability in discrete-time dynamical systems. Chen et al. [2021b] compute an approximate region of attraction of dynamical systems with neural network dynamics, but assume that Lyapunov functions are quadratic. Dai et al. [2021] consider the problem of verifying Lyapunov conditions in discrete-time systems, as well as learning provably stable policies. However, they assume that dynamics are represented by neural networks with (leaky) ReLU activation functions. Our verification approach, in contrast, applies to arbitrary Lipschitz continuous dynamics.

We consider a discrete-time nonlinear dynamical system
$$x_{t+1} = f(x_t, u_t), \tag{1}$$
where $x_t \in X$ is the state in a domain $X$ and $u_t$ is the control input at time $t$. We assume that $f$ is Lipschitz continuous with Lipschitz constant $L_f$. This class of dynamical systems includes the vast majority of (non-hybrid) dynamical system models in prior literature. We assume that $x = 0$ is an equilibrium point for the system, that is, $f(0, u_0) = 0$ for some $u_0$. Let $\pi(x)$ denote a control policy, with $u_t = \pi(x_t)$. For example, in autonomous vehicle path tracking, $x$ can measure path tracking error and $u$ the steering angle. In the conventional setup, the goal is to learn a control policy $\pi$ such that the dynamical system in Equation (1) converges to the equilibrium $x = 0$, a condition referred to as stability.
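As a concrete illustration of the system class in Equation (1), the following sketch builds a discrete-time $f(x_t, u_t)$ by discretizing the inverted-pendulum dynamics used later in the experiments (Supplement A.1). The 0.05 s update period follows the supplement; the use of forward-Euler integration and the function name are illustrative assumptions, not necessarily what the released code does.

```python
import numpy as np

# Inverted-pendulum parameters from Supplement A.1; DT is the 0.05 s update period.
G, M_BALL, L_POLE, B_FRIC, DT = 9.81, 0.15, 0.5, 0.1, 0.05

def pendulum_step(x, u):
    """One step x_{t+1} = f(x_t, u_t); state x = (theta, theta_dot), torque u in [-6, 6] Nm."""
    theta, theta_dot = x
    u = np.clip(u, -6.0, 6.0)
    # Continuous-time model: theta_ddot = (m*g*l*sin(theta) + u - b*theta_dot) / (m*l^2)
    theta_ddot = (M_BALL * G * L_POLE * np.sin(theta) + u - B_FRIC * theta_dot) / (M_BALL * L_POLE ** 2)
    # Forward-Euler discretization (an assumption made for this sketch)
    return np.array([theta + DT * theta_dot, theta_dot + DT * theta_ddot])
```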
To this end, we can leverage the Lyapunov stability framework [Tedrake, 2009]. Specifically, the goal is to identify a Lyapunov function $V(x)$ and controller $\pi$ that satisfy the following conditions over a subset of the domain $R \subseteq X$: 1) $V(0) = 0$; 2) $V(x) > 0$ for all $x \in R \setminus \{0\}$; and 3) $V(f(x, \pi(x))) - V(x) < 0$ for all $x \in R$. These conditions imply that the system is (locally) stable in the Lyapunov sense [Bof et al., 2018, Tedrake, 2009]. In practice, due to numerical challenges in verifying stability conditions near the origin, it is common to verify slight relaxations of the Lyapunov conditions. These typically fall into two categories: 1) Lyapunov conditions are only checked for $\|x\|_p \ge \epsilon$ [Chang et al., 2019, Zhou et al., 2022] (our main baselines, and the only other methods for learning neural Lyapunov control for general nonlinear dynamics), and/or 2) a small tolerance $\delta > 0$ is added in verification, allowing small violations of the Lyapunov conditions near the origin [Dai et al., 2021]. Our goal now is to formally investigate the implications of numerical approximations of this kind. We next define a form of stability which entails finite-time convergence to $B(0, \epsilon) = \{x \mid \|x\| < \epsilon\}$.

**Definition 2.1 ($\epsilon$-stability).** We call a pair $(V, \pi)$ $\epsilon$-stable within a region $R$ when the following conditions are satisfied: (a) $V(0) = 0$; (b) there exists $\zeta > 0$ such that $V(f(x, \pi(x))) - V(x) < -\zeta$ for all $x \in R \setminus B(0, \epsilon)$; and (c) $V(x) > 0$ for all $x \in R \setminus B(0, \epsilon)$.

A key ingredient to achieving stability is to identify a region of attraction (ROA), within which we are guaranteed to converge to the origin from any starting point. In the context of $\epsilon$-stability, our goal is to converge to a small ball near the origin, rather than the origin itself; let $\epsilon$-ROA be the set of initial inputs that have this property. In order to identify an $\epsilon$-ROA, we introduce a notion of an invariant sublevel set. Specifically, we refer to a set $D(R, \rho) = \{x \in R \mid V(x) \le \rho\}$ with the property that $x \in D(R, \rho) \Rightarrow f(x, \pi(x)) \in R$ as an $R$-invariant sublevel set. In other words, $D(R, \rho)$ is a sublevel set which is additionally forward invariant with respect to $R$. We assume that $B(0, \epsilon) \subseteq D(R, \rho)$.

Next, we formally prove that $\epsilon$-stability combined with a sublevel set $D(R, \rho)$ entails convergence in three senses: 1) we reach an $\epsilon$-ball around the origin in finite time, 2) we reach it infinitely often, and 3) we converge to an arbitrarily small ball around the origin. We refer to the first of these senses as reachability. What we show is that for $\epsilon$ sufficiently small, reachability implies convergence in all three senses. For this, a key additional assumption is that $V$ and $\pi$ are Lipschitz continuous, with Lipschitz constants $L_v$ and $L_\pi$, respectively (as holds, e.g., for ReLU neural networks).

**Theorem 2.2.** Suppose $V$ and $\pi$ are $\epsilon$-stable on a compact $R$, $D(R, \rho)$ is an $R$-invariant set, and there exists $c_1$ such that $\|\pi(0) - u_0\| \le c_1\epsilon$. Then if $x_0 \in D(R, \rho) \setminus B(0, \epsilon)$: (i) there exists a finite $K$ such that $x_K \in B(0, \epsilon)$; (ii) there exists $c_2$ such that if $c_2\epsilon < \rho$ and $B(0, c_2\epsilon) \subseteq R$, then there exists a finite $K$ such that for all $k \ge K$, $x_k \in D(R, c_2\epsilon)$, and the sequence $\{x_k\}_{k \ge 0}$ reaches $B(0, \epsilon)$ infinitely often; and furthermore (iii) for any $\eta > 0$ such that $B(0, \eta) \subseteq R$, there exist $\epsilon$ and a finite $K$ such that $\|\pi(0) - u_0\| \le c_1\epsilon$ implies $\|x_k\| \le \eta$ for all $k \ge K$.

*Proof.* We prove (i) by contradiction. Suppose that $\|x_k\| > \epsilon$ for all $k \in \{0, \ldots, \lceil V(x_0)/\zeta \rceil\}$. Then, when $k = \lceil V(x_0)/\zeta \rceil$, condition (b) of $\epsilon$-stability and $R$-invariance of $D(R, \rho)$ imply that $V(x_k) < V(x_0) - \zeta k \le 0$, contradicting condition (c) of $\epsilon$-stability. To prove (ii), fix the finite $K$ from (i), and let $k \ge K$ be such that $x_k \in B(0, \epsilon)$.
By Lipschitz continuity of $V$ and the fact that $V(0) = 0$ and $V(x) \ge 0$ (conditions (a) and (c) of $\epsilon$-stability), $V(x_{k+1}) \le L_v \|x_{k+1}\|$, where $\|\cdot\|$ is the $\ell_\infty$ norm here and below. Moreover, since $x_{k+1} = f(x_k, \pi(x_k))$, $f(0, u_0) = 0$ (stability of the origin), and by Lipschitz continuity of $f$ and $\pi$, $\|x_{k+1}\| \le L_f \|(x_k, \pi(x_k) - u_0)\| \le L_f \max\{\|x_k\|, \|\pi(x_k) - u_0\|\}$. By Lipschitz continuity of $\pi$, the triangle inequality, and the condition $\|\pi(0) - u_0\| \le c_1\epsilon$, we have $\|\pi(x_k) - u_0\| \le L_\pi \|x_k\| + c_1\epsilon \le (L_\pi + c_1)\epsilon$. Let $c_2 = \max\{L_v, 1\}\, L_f \max\{1, L_\pi + c_1\}$. Then $\|x_{k+1}\| \le c_2\epsilon$ and $V(x_{k+1}) \le c_2\epsilon$. Thus, if $c_2\epsilon < \rho$, $B(0, c_2\epsilon) \subseteq R$, and $D(R, \rho)$ is $R$-invariant, then $x_{k+1} \in D(R, c_2\epsilon)$, which is $R$-invariant. Additionally, either $x_{k+1} \in B(0, \epsilon)$, and by the argument above $x_{k+2} \in D(R, c_2\epsilon)$, or $x_{k+1} \in D(R, c_2\epsilon) \setminus B(0, \epsilon)$, and by $\epsilon$-stability $x_{k+2} \in D(R, c_2\epsilon)$. Thus, by induction, we have that for all $k \ge K$, $x_k \in D(R, c_2\epsilon)$. Finally, since for all $k \ge K$, $x_k \in D(R, c_2\epsilon)$, if $x_k \notin B(0, \epsilon)$, by (i) it must reach $B(0, \epsilon)$ in finite time. Consequently, there is an infinite subsequence $\{x_{k'}\}$ of $\{x_k\}_{k \ge 0}$ such that all $x_{k'} \in B(0, \epsilon)$, that is, the sequence $\{x_k\}_{k \ge 0}$ reaches $B(0, \epsilon)$ infinitely often.

We prove part (iii) by contradiction. Fix $\eta > 0$ and define $S = \{x \in R : \|x\| > \eta\}$. Suppose that for every $\epsilon > 0$ there exists $x \in S$ such that $V(x) \le c_2\epsilon$, where $c_2$ is as in (ii). Then for any (infinite) sequence $\{\epsilon_t\}$ we have $\{x_t\}$ such that $V(x_t) \le c_2\epsilon_t$, where $x_t \in S$. Now, consider the set $\bar S = \{x \in R : \|x\| \ge \eta\}$. Since $\bar S$ is compact and $\{x_t\}$ is an infinite sequence, there is an infinite subsequence $\{x_{t_k}\}$ of $\{x_t\}$ such that $\lim_{k \to \infty} x_{t_k} = x^*$ and $x^* \in \bar S$. Since $V$ is continuous, we have $\lim_{k \to \infty} V(x_{t_k}) = V(x^*)$. Now, we choose $\{\epsilon_t\}$ such that $\lim_{t \to \infty} \epsilon_t = 0$. This means that $\lim_{k \to \infty} V(x_{t_k}) = V(x^*) \le 0$, and since $x^* \in \bar S$, this contradicts condition (c) of $\epsilon$-stability. Since by (ii) there exists a finite $K$ such that $V(x_k) \le c_2\epsilon$ for all $k \ge K$, we have $\|x_k\| \le \eta$ for all $k \ge K$. ∎

The crucial implication of Theorem 2.2 is that as long as we choose $\epsilon > 0$ sufficiently small, verifying that $V$ and $\pi$ are $\epsilon$-stable, together with identifying an $R$-invariant set $D(R, \rho)$, suffices for convergence arbitrarily close to the origin. One caveat is that we need to ensure that $\pi(0)$ is sufficiently close to $u_0$ for any $\epsilon$ we choose. In most domains, this can be easily achieved: for example, if $u_0 = 0$ (as is the case in many settings, including three of our four experimental domains), we can use a representation for $\pi$ with no bias terms, so that $\pi(0) = 0 = u_0$ by construction. In other cases, we can simply check this condition after learning. Another caveat is that we need to define $R$ so as to make these properties practically achievable. To this end, we define $R$ parametrically as $R(\gamma) = \{x \in X \mid \|x\| \le \gamma\}$. Additionally, we introduce the following useful notation. Define $R(\epsilon, \gamma) = \{x \in X \mid \epsilon \le \|x\| \le \gamma\}$. Note that $R(0, \infty) = X$, $R(\epsilon, \infty) = \{x \in X \mid \|x\| \ge \epsilon\}$, and $R(0, \gamma) = R(\gamma)$. Thus, for $\gamma$ sufficiently large compared to $\epsilon$, conditions such as $B(0, \epsilon) \subseteq D(R, \rho)$, $B(0, \eta) \subseteq R$, and $B(0, c_2\epsilon) \subseteq R$ will be easy to satisfy. Following conventional naming, we refer to $R(\epsilon, \gamma)$ as the $\epsilon$-valid region, i.e., the region on which conditions (a)-(c) of $\epsilon$-stability are required to hold, and refer to a function $V$ satisfying these conditions as an $\epsilon$-Lyapunov function. We assume that the $\epsilon$-Lyapunov function as well as the policy can be represented in a parametric function class, such as a deep neural network. Formally, we denote the parametric $\epsilon$-Lyapunov function by $V_\theta(x)$ and the policy by $\pi_\beta(x)$, where $\theta$ and $\beta$ are their respective parameters. Let $D(\gamma, \rho) \equiv D(R(\gamma), \rho)$.
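As a concrete reading of conditions (a)-(c), the sketch below spot-checks them on a finite batch of sampled states for candidate networks $V_\theta$ and $\pi_\beta$. This is only a necessary sanity check on samples, not verification; sound verification over the whole region is the subject of Section 3. The function and argument names are placeholders, and $V$, $\pi$, and $f$ are assumed to accept batched inputs.

```python
import torch

def spot_check_eps_stability(V, pi, f, x, eps=0.1, zeta=1e-3):
    """Check Definition 2.1 on sampled states x (one state per row), using the
    l-infinity norm for the ball B(0, eps), as elsewhere in the paper."""
    outside = x.abs().max(dim=1).values > eps          # ignore B(0, eps)
    xs = x[outside]
    x_next = f(xs, pi(xs))                             # one step under the policy
    cond_a = torch.allclose(V(torch.zeros(1, x.shape[1])), torch.zeros(1), atol=1e-6)
    cond_b = (V(x_next) - V(xs) < -zeta).all()         # sufficient decrease
    cond_c = (V(xs) > 0).all()                         # positivity away from the origin
    return bool(cond_a) and bool(cond_b) and bool(cond_c)
```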
Our goal is to learn $V_\theta$ and $\pi_\beta$ so as to maximize the size of the $R(\gamma)$-invariant sublevel set $D(\gamma, \rho)$ such that $V_\theta$ and $\pi_\beta$ are provably $\epsilon$-stable on $R(\gamma)$. Armed with Theorem 2.2, we refer to this set simply as the ROA below, for simplicity and for consistency with prior work, e.g., [Chang et al., 2019, Zhou et al., 2022], noting all the caveats discussed above. Define $P(\gamma)$ as the set of $(\theta, \beta)$ for which conditions (a)-(c) of $\epsilon$-stability are satisfied over the domain $R(\gamma)$. Our main challenge below is to find $(\theta, \beta) \in P(\gamma)$ for a given $\gamma$. That, in turn, entails solving the key subproblem of verifying these conditions for given $V_\theta$ and $\pi_\beta$. Next, in Section 3 we address the verification problem, and in Section 4 we describe our approach for jointly learning $V_\theta$ and $\pi_\beta$ that can be verified to satisfy the $\epsilon$-stability conditions for a given $R(\gamma)$.

## 3 Verifying Stability Conditions

Prior work on learning $\epsilon$-Lyapunov functions for continuous-time nonlinear control problems has leveraged off-the-shelf SMT solvers, such as dReal [Gao et al., 2013]. However, these solvers scale poorly in our setting (see Supplement C for details). In this section, we propose a novel approach for verifying the $\epsilon$-Lyapunov conditions for arbitrary Lipschitz continuous dynamics using mixed-integer linear programming, through obtaining piecewise-linear bounds on the dynamics. We assume $V_\theta$ and $\pi_\beta$ are $K$- and $N$-layer neural networks, respectively, with ReLU activations.

We begin with the problem of verifying condition (c) of $\epsilon$-stability, which we represent as a feasibility problem: find whether there is any point $x \in R(\epsilon, \gamma)$ such that $V(x) \le 0$. We can formulate it as the following MILP:
$$z_K \le 0 \tag{2a}$$
$$z_{l+1} = g_{\theta_l}(z_l), \quad 0 \le l \le K - 1 \tag{2b}$$
$$\epsilon \le \|x\| \le \gamma, \quad z_0 = x, \tag{2c}$$
where $l$ refers to a layer in the neural network $V_\theta$, $z_K = V_\theta(x)$, and the associated functions $g_{\theta_l}(z_l)$ are either $W_l z_l + b_l$ for a linear layer (with $\theta_l = (W_l, b_l)$) or $\max\{z_l, 0\}$ for a ReLU activation. Any feasible solution $x$ is then a counterexample, and if the problem is infeasible, the condition is satisfied. Each ReLU activation $z = g(\hat z)$ can be linearized by introducing an integer variable $a \in \{0, 1\}$ and replacing the $z = g(\hat z)$ term with the linear constraints $z \ge \hat z$, $z \ge 0$, $z \le U a$, and $z \le \hat z - L(1 - a)$, where $L$ and $U$ are constants specified so that $L \le \hat z \le U$ (we deal with identifying $L$ and $U$ below).

Next, we cast verification of condition (b) of $\epsilon$-stability, which involves the nonlinear control dynamics $f$, as the following feasibility problem:
$$\bar z_K - z_K \ge -\zeta \tag{3a}$$
$$y_{i+1} = h_{\beta_i}(y_i), \quad 0 \le i \le N - 1 \tag{3b}$$
$$\bar z_{l+1} = g_{\theta_l}(\bar z_l), \quad 0 \le l \le K - 1 \tag{3c}$$
$$\bar z_0 = f(x, y_N) \tag{3d}$$
$$y_0 = x, \quad \text{constraints (2b)-(2c)}, \tag{3e}$$
where $h_{\beta_i}(\cdot)$ are the functions computed at layers $i$ of $\pi_\beta$, $z_K = V_\theta(x)$, and $\bar z_K = V_\theta(f(x, \pi_\beta(x)))$. At this point, all of the constraints can be linearized as before, with the exception of Constraint (3d), which involves the nonlinear dynamics $f$. To deal with this, we relax the verification problem by replacing $f$ with linear lower and upper bounds. To obtain tight bounds, we divide $R(\gamma)$ into a collection of subdomains $\{R_k\}$. For each $R_k$, we obtain a linear lower bound $f_{lb}(x)$ and upper bound $f_{ub}(x)$ on $f$, and relax the problematic Constraint (3d) into $f_{lb}(x) \le \bar z_0 \le f_{ub}(x)$, which is now a pair of linear constraints. We can then solve Problem (3) for each $R_k$.
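To make the encoding concrete, here is a minimal sketch of Problem (2) for a small ReLU network $V_\theta$ on one box-shaped sub-region $R_k$, using the big-M linearization described above. Gurobi's Python API is used purely for illustration (the experiments in this paper use CPLEX), and the network weights, box bounds, and big-M value are placeholders. Problem (3) is encoded analogously, with the policy network and the linear relaxation of $f$ added as further constraints.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def encode_relu_net(model, x_vars, weights, biases, M):
    """Big-M MILP encoding of a scalar-output ReLU network; returns the output expression.
    M must upper-bound the magnitude of every pre-activation (e.g., obtained via IBP)."""
    layer = list(x_vars)
    for idx, (W, b) in enumerate(zip(weights[:-1], biases[:-1])):
        post = []
        for j in range(W.shape[0]):
            pre = gp.quicksum(W[j, k] * layer[k] for k in range(W.shape[1])) + b[j]
            z = model.addVar(lb=0.0, ub=M, name=f"z{idx}_{j}")      # post-activation
            a = model.addVar(vtype=GRB.BINARY, name=f"a{idx}_{j}")  # phase indicator
            model.addConstr(z >= pre)                 # z >= pre-activation
            model.addConstr(z <= pre + M * (1 - a))   # a = 1  ->  z = pre
            model.addConstr(z <= M * a)               # a = 0  ->  z = 0
            post.append(z)
        layer = post
    W, b = weights[-1], biases[-1]                    # final linear layer
    return gp.quicksum(W[0, k] * layer[k] for k in range(W.shape[1])) + b[0]

def check_condition_c_on_box(weights, biases, box_lo, box_hi, M=100.0):
    """Problem (2) on one sub-region: look for x in the box with V(x) <= 0."""
    m = gp.Model("lyapunov_condition_c")
    m.Params.OutputFlag = 0
    x = [m.addVar(lb=lo, ub=hi, name=f"x{i}") for i, (lo, hi) in enumerate(zip(box_lo, box_hi))]
    m.addConstr(encode_relu_net(m, x, weights, biases, M) <= 0.0)
    m.optimize()
    if m.Status == GRB.INFEASIBLE:
        return None                                   # condition (c) holds on this box
    return np.array([xi.X for xi in x])               # counterexample
```

The lower bound $\epsilon \le \|x\|$ is assumed here to be handled by the choice of boxes (the sub-regions exclude the $\epsilon$-ball); it can also be added as an explicit constraint, as done for PVTOL in Supplement A.5.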
**Computing Linear Bounds on System Dynamics.** Recall that $f : \mathbb{R}^n \mapsto \mathbb{R}^n$ is the Lipschitz-continuous system dynamics $x_{t+1} = f(x_t, u_t)$. For simplicity we omit $u_t = \pi_\beta(x)$ below. Let $\lambda$ be the $\ell_\infty$ Lipschitz constant of $f$. Suppose that we are given a region of the domain $R_k$ represented as a hyperrectangle, i.e., $R_k = \prod_i [x_{i,l}, x_{i,u}]$, where $[x_{i,l}, x_{i,u}]$ are the lower and upper bounds of $x$ along coordinate $i$. Our goal is to compute linear upper and lower bounds on $f(x)$ over this region. We bound $f_j(x) : \mathbb{R}^n \mapsto \mathbb{R}$ along each $j$-th output dimension separately. By $\lambda$-Lipschitz continuity, we can obtain $f_j(x) \le f_j(x_{1,l}, x_{2,l}, \ldots, x_{n,l}) + \lambda \sum_i (x_i - x_{i,l})$ and $f_j(x) \ge f_j(x_{1,l}, x_{2,l}, \ldots, x_{n,l}) - \lambda \sum_i (x_i - x_{i,l})$. The full derivation is provided in Supplement Section B.

Alternatively, if $f_j$ is differentiable, convex over $x_i$, and monotone over the other dimensions, we can restrict ourselves to calculating one-dimensional bounds by finding coordinates $\bar x_{-i,u}$ and $\bar x_{-i,l}$ such that $f_j(x_i, \bar x_{-i,l}) \le f_j(x_i, x_{-i}) \le f_j(x_i, \bar x_{-i,u})$ for any given $x$. Then
$$f_j(x) \le f_{j,i,ub}(x) \triangleq f_j(x_{i,l}, \bar x_{-i,u}) + \frac{f_j(x_{i,u}, \bar x_{-i,u}) - f_j(x_{i,l}, \bar x_{-i,u})}{x_{i,u} - x_{i,l}}(x_i - x_{i,l})$$
and
$$f_j(x) \ge f_{j,i,lb}(x) \triangleq f_j(x^*_i, \bar x_{-i,l}) + f'_j(x^*_i, \bar x_{-i,l})(x_i - x^*_i),$$
where we let $x^*_i = \arg\min_{x_s \in [x_{i,l}, x_{i,u}]} \int_{x_{i,l}}^{x_{i,u}} \big[f_j(x_i, \bar x_{-i,l}) - f_j(x_s, \bar x_{-i,l}) - f'_j(x_s, \bar x_{-i,l})(x_i - x_s)\big]\, dx_i$ so as to minimize the error area between $f_j(x)$ and $f_{j,i,lb}(x)$. Note that the solution for $x^*_i$ can be approximated when conducting experiments, and this approximation has no impact on the correctness of the bound. A similar result obtains for the case where $f_j(x)$ is concave over $x_i$. Furthermore, we can use these single-dimensional bounds to obtain tighter lower and upper bounds on $f_j(x)$ as follows: note that for any nonnegative $c^l$, $c^u$ with $\sum_i c^l_i = 1$ and $\sum_i c^u_i = 1$, we have $\sum_i c^l_i f_{j,i,lb}(x) \le f_j(x) \le \sum_i c^u_i f_{j,i,ub}(x)$, which means we can optimize $c^l$ and $c^u$ to minimize the bounding error. In practice, we can typically partition $R(\gamma)$ so that the stronger monotonicity and convexity or concavity assumptions hold on each $R_k$. Note that our linear bounds $f_{lb}(x)$ and $f_{ub}(x)$ introduce errors compared to the original nonlinear dynamics $f$. We can obtain tighter bounds by splitting $R_k$ further into subregions and computing tighter bounds in each of the resulting subregions, but at the cost of increased computation. To balance these considerations, we start with a relatively small collection of subdomains $\{R_k\}$, and only split a region $R_k$ if we obtain a counterexample in $R_k$ that is not an actual counterexample for the true dynamics $f$.

**Computing Bounds on ReLU Linearization Constants.** In linearizing the ReLU activations, we supposed the existence of lower and upper bounds $L$ and $U$. However, we cannot simply set them to some large negative and positive number, respectively, because $V_\theta(x)$ has no a priori fixed bounds (in particular, note that for any $\epsilon$-Lyapunov function $V$ and constant $a > 1$, $aV$ is also an $\epsilon$-Lyapunov function). Thus, arbitrarily setting $L$ and $U$ makes verification unsound. To address this issue, we use interval bound propagation (IBP) [Gowal et al., 2018] to obtain $M = \max_{1 \le i \le n}\{|U_i|, |L_i|\}$, where $U_i$ and $L_i$ are the upper and lower bounds returned by IBP for the $i$-th layer, with the inputs for the first layer being the upper and lower bounds of $f(R(\gamma))$. Setting each $L = -M$ and $U = M$ then yields sound verification.

**Computing Sublevel Sets.** The approaches above verify conditions (a)-(c) of $\epsilon$-stability on $R(\gamma)$. The final piece is to find the $R(\gamma)$-invariant sublevel set $D(\gamma, \rho)$, that is, to find $\rho$. Let $B(\gamma) \triangleq \max\{\max_{x \in R(\gamma)} \|f(x, \pi_\beta(x))\|, \gamma\}$. We find $\rho$ by solving
$$\min_{x:\ \gamma \le \|x\| \le B(\gamma)} V(x). \tag{4}$$
We can transform both Problem (4) and the computation of $B(\gamma)$ into MILPs, as for the other problems above.

**Theorem 3.1.** Suppose that $V$ and $\pi$ are $\epsilon$-stable on $R(\gamma)$, $\|\pi(0) - u_0\| \le \epsilon$, and $\gamma \ge L_f \max\{1, L_\pi + 1\}\epsilon$.
Let $V^*$ be the optimal value of the objective in Problem (4), and let $\rho = V^* - \mu$ for any $\mu > 0$. Then the set $D(\gamma, \rho) = \{x : x \in R(\gamma), V(x) \le \rho\}$ is an $R(\gamma)$-invariant sublevel set.

*Proof.* If $\|x\| > \epsilon$, then $V(f(x, \pi(x))) < V(x) \le \rho < V^*$ by $\epsilon$-stability of $V$ and $\pi$. Suppose that $\bar x = f(x, \pi(x)) \notin R(\gamma)$. Since $V(\bar x) < V^*$, $V^*$ cannot be the optimal value of (4), a contradiction. If $\|x\| \le \epsilon$, the argument is similar to Theorem 2.2 (ii). ∎

## 4 Learning ϵ-Lyapunov Function and Policy

We cast learning the $\epsilon$-Lyapunov function and policy as the following problem:
$$\min_{\theta, \beta} \sum_{i \in S} L(x_i; V_\theta, \pi_\beta), \tag{5}$$
where $L(\cdot)$ is a loss function that promotes Lyapunov learning and the set $S \subseteq R(\gamma)$ is a finite subset of points in the valid region. We assume that the loss function is differentiable, and consequently training follows a sequence of gradient update steps $(\theta', \beta') = (\theta, \beta) - \mu \sum_{i \in S} \nabla_{\theta,\beta} L(x_i; V_\theta, \pi_\beta)$.

Clearly, the choice of the set $S$ is pivotal to successfully learning $V_\theta$ and $\pi_\beta$ with provable stability properties. Prior approaches for learning Lyapunov functions represented by neural networks use one of two ways to generate $S$. The first is to generate a fixed set $S$ comprised of uniformly random samples from the domain [Chang and Gao, 2021, Richards et al., 2018]. However, this fails to learn verifiable Lyapunov functions. In the second, $S$ is not fixed, but changes in each learning iteration, starting with randomly generated samples in the domain, and then using a verification tool, such as dReal, in each step of the learning process to update $S$ [Chang et al., 2019, Zhou et al., 2022]. However, verification is often slow, and becomes a significant bottleneck in the learning process. Our key idea is to combine the benefits of these two ideas with a fast gradient-based heuristic approach for generating counterexamples. In particular, our proposed approach for training Lyapunov control functions and associated control policies (Algorithm 1; DITL) involves five parts: 1) heuristic counterexample generation (lines 10-12); 2) choosing a collection $S$ of inputs to update $\theta$ and $\beta$ in each training (gradient update) iteration (lines 9-13); 3) the design of the loss function $L$; 4) initialization of the policy $\pi_\beta$ (line 3); and 5) warm starting the training (line 4). We describe each of these next.

**Algorithm 1** DITL Lyapunov learning algorithm.
1: Input: dynamical system model $f(x, u)$ and target valid region $R(\gamma)$
2: Output: Lyapunov function $V_\theta$ and control policy $\pi_\beta$
3: $\pi_\beta$ = Initialize()
4: $V_\theta$ = PreTrainLyapunov($N_0$, $\pi_\beta$)
5: $B$ = InitializeBuffer()
6: $W \leftarrow \emptyset$
7: while True do
8: &nbsp;&nbsp; for $N$ iterations do
9: &nbsp;&nbsp;&nbsp;&nbsp; $\hat S$ = Sample($r$, $B$) // sample $r$ points from $B$
10: &nbsp;&nbsp;&nbsp;&nbsp; $T$ = Sample($q$, $R(\gamma)$) // sample $q$ points from $R(\gamma)$
11: &nbsp;&nbsp;&nbsp;&nbsp; $T'$ = PGD($T$)
12: &nbsp;&nbsp;&nbsp;&nbsp; $T''$ = FilterCounterExamples($T'$)
13: &nbsp;&nbsp;&nbsp;&nbsp; $S = \hat S \cup T'' \cup W$
14: &nbsp;&nbsp;&nbsp;&nbsp; $B = B \cup T''$
15: &nbsp;&nbsp;&nbsp;&nbsp; $(\theta, \beta) \leftarrow (\theta, \beta) - \mu \sum_{i \in S} \nabla_{\theta,\beta} L(x_i; V_\theta, \pi_\beta)$
16: &nbsp;&nbsp; end for
17: &nbsp;&nbsp; (success, $\hat W$) = Verify($V_\theta$, $\pi_\beta$)
18: &nbsp;&nbsp; if success then
19: &nbsp;&nbsp;&nbsp;&nbsp; return $V_\theta$, $\pi_\beta$
20: &nbsp;&nbsp; else
21: &nbsp;&nbsp;&nbsp;&nbsp; $W \leftarrow W \cup \hat W$
22: &nbsp;&nbsp; end if
23: end while
24: return FAILED // timeout

**Heuristic Counterexample Generation.** As discussed in Section 3, verification in general involves two optimization problems: $\min_{x \in R(\gamma)} V_\theta(x) - V_\theta(f(x, \pi_\beta(x)))$ and $\min_{x \in R(\gamma)} V_\theta(x)$. We propose a projected gradient descent (PGD) approach for approximating solutions to either of these problems. PGD proceeds as follows, using the second minimization problem as an illustration; the process is identical in the first case.
Beginning with a starting point $x_0 \in R(\gamma)$, we iteratively update $x_{k+1} = \Pi\{x_k - \alpha_k\, \mathrm{sgn}(\nabla V_\theta(x_k))\}$, where $\Pi\{\cdot\}$ projects the gradient update step onto the domain $R(\gamma)$ by clipping each dimension of $x_k$, and $\alpha_k$ is the PGD learning rate. Note that since $f$ is typically differentiable, we can take gradients directly through it when we apply PGD to the first verification problem above. We run this procedure for $K$ iterations, and return the result (which may or may not be an actual counterexample). Let $T' = \mathrm{PGD}(T)$ denote running PGD from a set $T$ of starting points for both of the optimization problems above, resulting in a corresponding set $T'$ of candidate counterexamples.

**Selecting Inputs for Gradient Updates.** A natural idea for selecting samples $S$ is to simply add counterexamples identified in each iteration of the Lyapunov learning process. However, as $S$ then grows without bound, this progressively slows down the learning process. In addition, it is important for $S$ to contain a relatively broad sample of the valid region to ensure that counterexamples we generate along the way do not cause undesired distortion of $V_\theta$ and $\pi_\beta$ in subregions not well represented in $S$. Finally, we need to retain the memory of previously generated counterexamples, as these are often particularly challenging parts of the input domain for learning. We therefore construct $S$ as follows. We create an input buffer $B$, and initialize it with a random sample of inputs from $R(\gamma)$. Let $W$ denote a set of counterexamples that we wish to remember through the entire training process, initialized to be empty. In our implementation, $W$ consists of all counterexamples generated by the sound verification tools described in Section 3. At the beginning of each training update iteration, we randomly sample a fixed-size subset $\hat S$ of the buffer $B$. Next, we take a collection $T$ of random samples from $R$ to initialize PGD, and generate $T' = \mathrm{PGD}(T)$. We then filter the counterexamples from $T'$, retaining only those with either $V_\theta(x) \le 0$ or $V_\theta(x) - V_\theta(f(x, \pi_\beta(x))) \le \epsilon$, where this $\epsilon$ is a non-negative hyperparameter (distinct from the $\epsilon$ of $\epsilon$-stability); this yields a set $T''$. We then use $S = \hat S \cup T'' \cup W$ for the gradient update in the current iteration. Finally, we update the buffer $B$ by adding $T''$ to it. This process is repeated for $N$ iterations. After $N$ iterations, we check whether we have successfully learned a stable policy and a Lyapunov control function by using the tools from Section 3 to verify $V_\theta$ and $\pi_\beta$. If verification succeeds, training is complete. Otherwise, we use the counterexamples $\hat W$ generated by the MILP solver to update $W = W \cup \hat W$, and repeat the learning process above.

**Loss Function Design.** Next, we design the loss function $L(x; V_\theta, \pi_\beta)$ for a given input $x$. A key ingredient in this loss function is a term that incentivizes the learning process to satisfy condition (b) of $\epsilon$-stability: $L_1(x; V_\theta, \pi_\beta) = \mathrm{ReLU}(V_\theta(f(x, \pi_\beta(x))) - V_\theta(x) + \eta)$, where the parameter $\eta \ge 0$ determines how aggressively we try to satisfy this condition during learning. There are several options for learning $V_\theta$ satisfying conditions (a) and (c) of $\epsilon$-stability. The simplest is to set all bias terms in the neural network to zero, which immediately satisfies $V_\theta(0) = 0$. An effective way to deal with condition (c) is to maximize a lower bound on $V_\theta(x)$. To this end, we propose to make use of the following loss term: $L_2(x; V_\theta, \pi_\beta) = \mathrm{ReLU}(-V^{LB}_\theta)$, where $V^{LB}_\theta$ is a lower bound on the Lyapunov function over the target domain $R(\gamma)$. We use α,β-CROWN [Xu et al., 2020] to obtain this lower bound.
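A minimal PyTorch sketch of the terms $L_1$ and $L_2$ defined above (the remaining terms $L_3$, $L_4$ and the weighting of the full loss are described next). How the verified lower bound $V^{LB}_\theta$ is obtained from α,β-CROWN is outside the scope of this sketch, so it is passed in as a precomputed scalar tensor; all names and default values are placeholders, not the repository's API.

```python
import torch
import torch.nn.functional as F

def lyapunov_loss_l1_l2(V, pi, f, x, eta=0.01, v_lb=None):
    """L1: hinge on the decrease condition V(f(x, pi(x))) - V(x) + eta <= 0.
    L2: hinge on a verified lower bound of V over R(gamma), e.g. from alpha-beta-CROWN,
    passed in here as a precomputed scalar tensor v_lb."""
    x_next = f(x, pi(x))                                   # one-step rollout, differentiable
    l1 = F.relu(V(x_next) - V(x) + eta).mean()
    l2 = F.relu(-v_lb) if v_lb is not None else x.new_zeros(())
    return l1, l2
```

A gradient step as in Equation (5) then sums these terms (together with the weights $c_2$, $c_3$, $c_4$ introduced below) over the batch $S$ and backpropagates through both $V_\theta$ and $\pi_\beta$.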
The downside of setting bias terms to zero is that we lose many learnable parameters, reducing the flexibility of the neural network. If we consider a general neural network, on the other hand, it is no longer the case that $V_\theta(0) = 0$ by construction. However, it is straightforward to ensure this by defining the final Lyapunov function as $\bar V_\theta(x) = V_\theta(x) - V_\theta(0)$. Now, satisfying condition (c) amounts to satisfying the condition that $V_\theta(x) \ge V_\theta(0)$, which we accomplish via the following pair of loss terms: $L_3(x; V_\theta, \pi_\beta) = \mathrm{ReLU}(V_\theta(0) - V_\theta(x) + \mu \min(\|x\|_2, \nu))$, where $\mu$ and $\nu$ are hyperparameters of the term $\mu \min(\|x\|_2, \nu)$, which enforces a margin by which $V_\theta(x)$ must exceed $V_\theta(0)$ away from the origin, and $L_4(x; V_\theta, \pi_\beta) = \|V_\theta(0)\|_2^2$. The general loss function is then a weighted sum of these loss terms: $L(x; V_\theta, \pi_\beta) = L_1(x; V_\theta, \pi_\beta) + c_2 L_2(x; V_\theta, \pi_\beta) + c_3 L_3(x; V_\theta, \pi_\beta) + c_4 L_4(x; V_\theta, \pi_\beta)$. When we set bias terms to zero, we set $c_3 = c_4 = 0$; otherwise, we set $c_2 = 0$.

**Initialization.** We consider two approaches for initializing the policy $\pi_\beta$. The first is to linearize the dynamics $f$ around the origin, and use a policy computed based on a linear quadratic regulator (LQR) to obtain a simple linear policy $\pi_\beta$. The second approach is to use deep reinforcement learning, such as PPO [Liu et al., 2019], where rewards correspond to stability (e.g., the reward is the negative $\ell_2$ distance of the state to the origin). To initialize $V_\theta$, we fix $\pi_\beta$ after its initialization, and follow our learning procedure above using solely heuristic counterexample generation to pre-train $V_\theta$. The next result now follows by construction.

**Theorem 4.1.** If Algorithm 1 returns $V_\theta, \pi_\beta$, they are guaranteed to satisfy the $\epsilon$-stability conditions.

## 5 Experiments

### 5.1 Experiment Setup

**Benchmark Domains.** Our evaluation of the proposed DITL approach uses four benchmark control domains: inverted pendulum, path tracking, cartpole, and drone planar vertical takeoff and landing (PVTOL). Details about these domains are provided in the Supplement.

**Baselines.** We compare the proposed approach to four baselines: linear quadratic regulator (LQR), sum-of-squares (SOS), neural Lyapunov control (NLC) [Chang et al., 2019], and neural Lyapunov control for unknown systems (UNL) [Zhou et al., 2022]. The first two, LQR and SOS, are the more traditional approaches for computing Lyapunov functions when the dynamics are either linear (LQR) or polynomial (SOS) [Tedrake, 2009]. LQR solutions (when the system can be stabilized) are obtained through matrix multiplication, while SOS entails solving a semidefinite program (we solve it using YALMIP with the MOSEK solver in MATLAB 2022b). The next two baselines, NLC and UNL, are recent approaches for learning Lyapunov functions in continuous-time systems (no approach for learning provably stable control using neural network representations exists for discrete-time systems). Both NLC and UNL yield provably stable control for general nonlinear dynamical systems, and are thus the most competitive baselines to date. In addition to these baselines, we consider an ablation in which PGD-based counterexample generation in our approach is entirely replaced by the sound MILP-based method during learning (we refer to it as DITL-MILP).

**Efficacy Metrics.** We compare approaches in terms of three efficacy metrics. The first is (serial) runtime (that is, if no parallelism is used), which we measure as wall clock time when only a single task is running on a machine.
For inverted pendulum and path tracking, all comparisons were performed on a machine with an AMD Ryzen 9 5900X 12-core processor running Linux Ubuntu 20.04.5 LTS. All cartpole and PVTOL experiments were run on a machine with a Xeon Gold 6150 CPU (64-bit 18-core x86) running Rocky Linux 8.6. UNL and RL training for path tracking are the only two cases that make use of GPUs, and were run on an NVIDIA GeForce RTX 3090. The second metric is the size of the valid region, measured using the ℓ2 norm for NLC and UNL, and the ℓ∞ norm (which dominates ℓ2) for LQR, SOS, and our approach. The third metric is the region of attraction (ROA). Whenever verification fails, we set the ROA to 0. Finally, we compare all methods whose results are stochastic (NLC, UNL, and ours) in terms of success rate.

**Verification Details.** For LQR, SOS, NLC, and UNL, we used dReal as the verification tool, as done by Chang et al. [2019] and Zhou et al. [2022]. For DITL verification we used CPLEX version 22.1.0.

### 5.2 Results

Figure 1: ROA plot of inverted pendulum (left) and path tracking (right). We select the best result for each method.

**Inverted Pendulum.** For the inverted pendulum domain, we initialize the control policy using the LQR solution (see the Supplement for details). We train $V_\theta$ with non-zero bias terms. We set $\epsilon = 0.1$ (< 0.007% of the valid region), and approximate the ROA using a grid of 2000 cells along each coordinate using the level set certified with MILP (4). All runtime is capped at 600 seconds, except for the UNL baseline, which we cap at 16 minutes, as it tends to be considerably slower than other methods. Our results are presented in Table 1. While LQR and SOS are fastest, our approach (DITL) is the next fastest, taking on average 8 seconds, with NLC and UNL considerably slower. However, DITL yields an ROA a factor of more than 4 larger than the nearest baseline (LQR), and a 100% success rate. Finally, the DITL-MILP ablation is two orders of magnitude slower (runtime > 300 seconds) and less effective (average ROA = 42, 80% success rate) than DITL. We visualize the maximum ROA produced by all methods in Figure 1 (left).

Table 1: Inverted Pendulum

| Method | Valid Region | Runtime (s) | ROA | Max ROA | Success Rate |
|---|---|---|---|---|---|
| NLC (free) | ‖x‖₂ ≤ 6.0 | 28 ± 29 | 11 ± 4.6 | 22 | 100% |
| NLC (max torque 6.0) | ‖x‖₂ ≤ 6.0 | 519 ± 184 | 13 ± 27 | 66 | 20% |
| UNL (max torque 6.0) | ‖x‖₂ ≤ 4.0 | 821 ± 227 | 1 ± 2 | 7 | 30% |
| LQR | ‖x‖∞ ≤ 5.8 | < 1 | 14 | 14 | success |
| SOS | ‖x‖∞ ≤ 1.7 | < 1 | 6 | 6 | success |
| DITL | ‖x‖∞ ≤ 12 | 8.1 ± 4.7 | 61 ± 31 | 123 | 100% |

**Path Tracking.** In path tracking, we initialize our approach using both the RL and LQR solutions, drawing a direct comparison between the two (see the Supplement for details). We set $\epsilon = 0.1$. The running time of RL is 155 seconds. The results are provided in Table 2. We can observe that both the RL- and LQR-initialized variants of our approach outperform all prior art, with RL exhibiting a factor of 5 advantage over the next best (SOS, in this case) in terms of ROA, and nearly a factor of 6 advantage in terms of maximum achieved ROA (NLC is the next best in this case). Moreover, our approach again has a 100% success rate. Our runtime is an order of magnitude lower than NLC or UNL. Overall, the RL-initialized variant slightly outperforms LQR initialization.

Table 2: Path Tracking

| Method | Valid Region | Runtime (s) | ROA | Max ROA | Success Rate |
|---|---|---|---|---|---|
| NLC | ‖x‖₂ ≤ 1.0 | 109 ± 81 | 0.5 ± 0.2 | 0.76 | 100% |
| NLC | ‖x‖₂ ≤ 1.5 | 151 ± 238 | 1.4 ± 0.9 | 2.8 | 80% |
| UNL | ‖x‖₂ ≤ 0.8 | 925 ± 110 | 0.1 ± 0.2 | 0.56 | 10% |
| LQR | ‖x‖∞ ≤ 0.7 | < 1 | 1.02 | 1.02 | success |
| SOS | ‖x‖∞ ≤ 0.8 | < 1 | 1.8 | 1.8 | success |
| DITL (LQR) | ‖x‖∞ ≤ 3.0 | 9.8 ± 4 | 8 ± 3 | 12.5 | 100% |
| DITL (RL) | ‖x‖∞ ≤ 3.0 | 14 ± 11 | 9 ± 3.5 | 16 | 100% |
The DITL-MILP ablation again performs far worse than DITL: its running time is several orders of magnitude slower (at > 550 seconds), with low efficacy (ROA of 1.1, success rate 10%). We visualize the comparison of maximum ROA produced by all methods in Figure 1 (right).

**Cartpole.** For cartpole, we used LQR for initialization, and set the bias terms of $V_\theta$ to zero. We set $\epsilon = 0.1$ (0.01% of the valid region area) and the running time limit to 2 hours for all approaches. None of the baselines successfully attained a provably stable control policy and associated Lyapunov function for this problem. Failure was either because we could find counterexamples within the target valid region, or because verification exceeded the time limit. DITL found a valid region of $\|x\|_\infty \le 1.0$ in 1.6 hours with a 100% success rate, and an average ROA of 0.021 ± 0.012.

**PVTOL.** The PVTOL setup was similar to cartpole. We set $\epsilon = 0.1$ (0.0001% of the valid region area), and the maximum running time to 24 hours. None of the baselines successfully identified a provably stable control policy. In contrast, DITL found one within 13 ± 6 hours on average, yielding a 100% success rate. We identified a valid region of $\|x\|_\infty \le 1.0$, and an ROA of 0.011 ± 0.008.

## 6 Conclusion

We presented a novel algorithmic framework for learning Lyapunov control functions and policies for discrete-time nonlinear dynamical systems. Our approach combines a mixed-integer linear programming verification tool with a training procedure that leverages gradient-based approximate verification. This combination enables a significant improvement in effectiveness compared to prior art: our experiments demonstrate that our approach yields regions of attraction that are several factors larger in inverted pendulum and path tracking, and ours is the only approach that successfully finds stable policies in cartpole and PVTOL.

**Acknowledgments.** This research was partially supported by the NSF (grants CNS-1941670, ECCS-2020289, IIS-1905558, IIS-2214141, and CNS-2231257), AFOSR (grant FA9550-22-1-0054), ARO (grant W911NF-19-1-0241), and NVIDIA.

## References

- Alessandro Abate, Daniele Ahmed, Mirco Giacobbe, and Andrea Peruffo. Formal synthesis of Lyapunov neural networks. IEEE Control Systems Letters, 5(3):773-778, 2020.
- Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. Advances in Neural Information Processing Systems, 30, 2017.
- Nicoletta Bof, Ruggero Carli, and Luca Schenato. Lyapunov theory for discrete time systems. arXiv preprint arXiv:1809.05289, 2018.
- Ya-Chien Chang and Sicun Gao. Stabilizing neural control using self-learned almost Lyapunov critics. In IEEE International Conference on Robotics and Automation, pages 1803-1809, 2021.
- Ya-Chien Chang, Nima Roohi, and Sicun Gao. Neural Lyapunov control. In Neural Information Processing Systems, 2019.
- Shaoru Chen, Mahyar Fazlyab, Manfred Morari, George J. Pappas, and Victor M. Preciado. Learning Lyapunov functions for hybrid systems. In Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control, pages 1-11, 2021a.
- Shaoru Chen, Mahyar Fazlyab, Manfred Morari, George J. Pappas, and Victor M. Preciado. Learning region of attraction for nonlinear systems. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 6477-6484. IEEE, 2021b.
- Jason Choi, Fernando Castaneda, Claire J. Tomlin, and Koushil Sreenath. Reinforcement learning for safety-critical control under model uncertainty, using control Lyapunov functions and control barrier functions. arXiv preprint arXiv:2004.07584, 2020.
- Hongkai Dai, Benoit Landry, Lujie Yang, Marco Pavone, and Russ Tedrake. Lyapunov-stable neural-network control. In Robotics: Science and Systems (RSS), 2021.
- Priya L. Donti, Melrose Roderick, Mahyar Fazlyab, and J. Zico Kolter. Enforcing robust control guarantees within neural network policies. In International Conference on Learning Representations, 2021.
- Sicun Gao, Soonho Kong, and Edmund M. Clarke. dReal: An SMT solver for nonlinear theories over the reals. In International Conference on Automated Deduction, pages 208-214. Springer, 2013.
- Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
- Minghao Han, Lixian Zhang, Jun Wang, and Wei Pan. Actor-critic reinforcement learning for control with stability guarantee. IEEE Robotics and Automation Letters, 5(4):6217-6224, 2020.
- Hassan K. Khalil. Nonlinear Control, volume 406. Pearson, New York, 2015.
- J. Zico Kolter and Gaurav Manek. Learning stable deep dynamics models. Advances in Neural Information Processing Systems, 32, 2019.
- Reza Lavaei and Leila J. Bridgeman. Systematic, Lyapunov-based, safe and stabilizing controller synthesis for constrained nonlinear systems. IEEE Transactions on Automatic Control, 2023.
- Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Neural Information Processing Systems, 2019.
- Antonin Raffin. RL Baselines3 Zoo. https://github.com/DLR-RM/rl-baselines3-zoo, 2020.
- Hadi Ravanbakhsh and Sriram Sankaranarayanan. Learning control Lyapunov functions from counterexamples and demonstrations. Autonomous Robots, 43(2):275-307, 2019.
- Rosana C. B. Rego and Fábio M. U. de Araújo. Learning-based robust neuro-control: A method to compute control Lyapunov functions. International Journal of Robust and Nonlinear Control, 32(5):2644-2661, 2022.
- Spencer M. Richards, Felix Berkenkamp, and Andreas Krause. The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems. In Conference on Robot Learning, pages 466-476. PMLR, 2018.
- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sumeet Singh, Spencer M. Richards, Vikas Sindhwani, Jean-Jacques E. Slotine, and Marco Pavone. Learning stabilizable nonlinear dynamics with contraction-based regularization. The International Journal of Robotics Research, 40(10-11):1123-1150, 2021.
- Jarrod M. Snider. Automatic steering methods for autonomous automobile path tracking. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-09-08, 2009.
- Russ Tedrake. Underactuated Robotics: Learning, Planning, and Control for Efficient and Agile Machines. Course notes for MIT 6.832, working draft edition, 3:4, 2009.
- Kaidi Xu, Zhouxing Shi, Huan Zhang, Yihan Wang, Kai-Wei Chang, Minlie Huang, Bhavya Kailkhura, Xue Lin, and Cho-Jui Hsieh. Automatic perturbation analysis for scalable certified robustness and beyond. Advances in Neural Information Processing Systems, 33:1129-1141, 2020.
- Ruikun Zhou, Thanin Quartz, Hans De Sterck, and Jun Liu. Neural Lyapunov control of unknown nonlinear systems with stability guarantees. In Neural Information Processing Systems, 2022.
## A Further Details about Experiments

### A.1 Dynamics

Since these are described as continuous-time domains, we set the time between state updates to 0.05 s in all cases for our discretized versions of these systems.

**Inverted Pendulum.** We use the following standard dynamics model for the inverted pendulum [Chang et al., 2019, Richards et al., 2018, Zhou et al., 2022]:
$$\ddot\theta = \frac{mg\ell\sin(\theta) + u - b\dot\theta}{m\ell^2},$$
with gravity constant $g = 9.81$, ball weight $m = 0.15$, friction $b = 0.1$, and pendulum length $\ell = 0.5$. We set the maximum torque allowed to be 6 Nm, meaning $u \in [-6.0, 6.0]$; this applies to all experiments except for the NLC (free) case.

**Path Tracking.** Our path tracking dynamics follow Snider [2009] and Chang et al. [2019]:
$$\dot s = \frac{v\cos(\theta_e)}{1 - e_{ra}\kappa(s)}, \qquad \dot e_{ra} = v\sin(\theta_e), \qquad \dot\theta_e = \frac{v\tan(\delta)}{L} - \frac{v\kappa(s)\cos(\theta_e)}{1 - e_{ra}\kappa(s)},$$
where the control policy determines the steering angle $\delta$; $e_{ra}$ is the distance error and $\theta_e$ is the angle error. We can simplify the dynamics by defining $u = \tan(\delta)$, so that the nonlinearity of the dynamics is only in terms of state variables. Since in practice $|\delta| \le 40°$, we set $u \in [\tan(-40°), \tan(40°)]$; this applies to all experiments. We set the speed $v = 2.0$, while the track is a circle with radius 10.0 (and, thus, $\kappa = 0.1$) and $L = 1.0$. For RL in path tracking, the reward function is $1 - 10(e_{ra}^2 + \theta_e^2)^{1/2}$, and we use PPO [Schulman et al., 2017], training over 50K steps.

**Cartpole.** For cartpole we use the dynamics from Tedrake [2009]. The original dynamics are
$$\ddot x = \frac{1}{m_c + m_p\sin^2\theta}\left[f_x + m_p\sin\theta\,(l\dot\theta^2 + g\cos\theta)\right] \tag{6}$$
$$\ddot\theta = \frac{1}{l(m_c + m_p\sin^2\theta)}\left[-f_x\cos\theta - m_p l\dot\theta^2\cos\theta\sin\theta - (m_c + m_p)g\sin\theta\right]. \tag{7}$$
We change variables and rename $\pi - \theta$ as $\theta$, so that in our setting $\theta$ represents the angle between the pole and the upward direction, with the pole angle positive if the pole is to the right. The purpose of this transformation is that the equilibrium point becomes the origin. Substituting $y = \pi - \theta$ (so that $\ddot y = -\ddot\theta$) yields the new dynamics
$$\ddot x = \frac{1}{m_c + m_p\sin^2\theta}\left[f_x + m_p\sin\theta\,(l\dot\theta^2 - g\cos\theta)\right] \tag{8}$$
$$\ddot\theta = \frac{1}{l(m_c + m_p\sin^2\theta)}\left[-f_x\cos\theta - m_p l\dot\theta^2\cos\theta\sin\theta + (m_c + m_p)g\sin\theta\right]. \tag{9}$$
Here $m_c = 1.0$ is the weight of the cart, $m_p = 0.1$ is the weight of the pole, $x$ is the horizontal position of the cart, $l = 1.0$ is the length of the pole, $g = 9.81$ is the gravity constant, and $f_x$ is the control (force applied to the cart). Furthermore, the maximum force allowed on the cart is set to be 30 N, meaning $f_x \in [-30, 30]$; this applies to all experiments.

**PVTOL.** For PVTOL, our dynamics follow Singh et al. [2021]. Specifically, the system has the state representation $x = (p_x, p_z, v_x, v_z, \phi, \dot\phi)$, where $(p_x, p_z)$ and $(v_x, v_z)$ are the 2D position and (body-frame) velocity, respectively, and $(\phi, \dot\phi)$ are the roll and angular rate. The control $u \in \mathbb{R}^2_{>0}$ corresponds to the controlled motor thrusts. The dynamics $f(x, u)$ are given component-wise by
$$\dot p_x = v_x\cos\phi - v_z\sin\phi, \qquad \dot p_z = v_x\sin\phi + v_z\cos\phi, \qquad \dot v_x = v_z\dot\phi - g\sin\phi,$$
$$\dot v_z = -v_x\dot\phi - g\cos\phi + \frac{u_1 + u_2}{m}, \qquad \ddot\phi = \frac{l\,(u_1 - u_2)}{J},$$
where $g = 9.8$ is the acceleration due to gravity, $m = 4.0$ is the mass, $l = 0.25$ is the moment arm of the thrusters, and $J = 0.0475$ is the moment of inertia about the roll axis. The maximum force allowed for both $u_1$ and $u_2$ is set to 39.2 N, which is $mg$, meaning $u_1, u_2 \in [0, 39.2]$; this applies to all experiments.

### A.2 Computation of the LQR Solution

Given the system $dx/dt = Ax + Bu$, the objective of LQR is to compute the optimal controller $u = -Kx$ that minimizes the quadratic cost $\int_0^\infty (x^\top Q x + u^\top R u)\,dt$, where $R$ and $Q$ are identity matrices.
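As an illustration, the inverted-pendulum gain reported below can be reproduced with a standard continuous-time algebraic Riccati solve; the sketch uses SciPy purely for exposition and is not necessarily the tooling used by the authors.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Inverted-pendulum linearization about the upright equilibrium (see below).
g, m, l, b = 9.81, 0.15, 0.5, 0.1
A = np.array([[0.0, 1.0], [g / l, -b / (m * l ** 2)]])
B = np.array([[0.0], [1.0 / (m * l ** 2)]])
Q, R = np.eye(2), np.eye(1)

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)   # u = -K x; K is approximately [1.977, 0.976]
```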
The details of the system linearization as well as the LQR solution are described below.

**Inverted Pendulum.** We linearize the system as
$$A = \begin{pmatrix} 0 & 1 \\ \frac{g}{\ell} & -\frac{b}{m\ell^2} \end{pmatrix}, \qquad B = \begin{pmatrix} 0 \\ \frac{1}{m\ell^2} \end{pmatrix}.$$
The final LQR solution is $u = -1.97725234\,\theta - 0.97624064\,\dot\theta$.

**Path Tracking.** We linearize the system as
$$A = \begin{pmatrix} 0 & 2 \\ -0.04 & 0 \end{pmatrix}.$$
The final LQR solution is $u = \bar u/2 + 0.1$, where $\bar u = -e_{ra} - 2.19642572\,\theta_e$.

**Cartpole.** We linearize the system as
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & -0.98 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 10.78 & 0 \end{pmatrix}.$$
The final LQR solution is $u = x + 2.4109461\,\dot x + 34.36203947\,\theta + 10.70094483\,\dot\theta$.

**PVTOL.** We linearize the system as
$$A = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & -g & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \frac{1}{m} & \frac{1}{m} \\ \frac{l}{J} & -\frac{l}{J} \end{pmatrix},$$
with the state ordered as $(x, y, \theta, \dot x, \dot y, \dot\theta)$, where $(x, y)$ denotes the position and $\theta$ the roll. The final LQR solution is $u_1 = 0.70710678\,x - 0.70710678\,y - 5.03954871\,\theta + 1.10781077\,\dot x - 1.82439774\,\dot y - 1.20727555\,\dot\theta$ and $u_2 = -0.70710678\,x - 0.70710678\,y + 5.03954871\,\theta - 1.10781077\,\dot x - 1.82439774\,\dot y + 1.20727555\,\dot\theta$.

### A.3 PPO Training for Path Tracking

We use the implementation in Stable-Baselines [Raffin, 2020] for PPO training. The hyperparameters are fine-tuned using their auto-tuning script with a budget of 1000 trials and a maximum of 50000 steps. The final hyperparameters are given in Table 3.

Table 3: Hyperparameters for PPO (Path Tracking)

| Parameter | Value |
|---|---|
| policy | MlpPolicy |
| n_timesteps | !!float 100000 |
| batch_size | 32 |
| n_steps | 64 |
| gamma | 0.95 |
| learning_rate | 0.000208815 |
| ent_coef | 1.90E-06 |
| clip_range | 0.1 |
| n_epochs | 10 |
| gae_lambda | 0.99 |
| max_grad_norm | 0.8 |
| vf_coef | 0.550970466 |
| activation_fn | nn.ReLU |
| log_std_init | -0.338380542 |
| ortho_init | False |
| net_arch | pi: [8, 8], vf: [8, 8] |
| sde_sample_freq | 128 |

### A.4 Benchmark Models

When relevant, benchmark models (as well as our approach) are run for 10 seeds (seeds 0 to 9). Mean ± standard deviation are reported in Table 1 and Table 2. For the SOS benchmark, we use polynomials of degree 6 for all environments.

**NLC and UNL.** For cartpole and PVTOL, we train against diameter 1.0 under the ℓ2 norm for NLC and UNL. The main reason for the failure of training is that the dReal verifier becomes extremely slow after some iterations. For example, for seed 0, NLC only finished 167 certifications within the 2-hour limit for cartpole, and 394 certifications within the 24-hour limit for PVTOL.

**LQR and SOS.** For the cartpole environment, we first certify against the target region $0.1 \le \|x\| \le 0.16$ for the LQR benchmark, where dReal returns a counterexample; we then attempt to certify against $0.1 \le \|x\| \le 0.15$, and dReal did not return any result within the 2-hour limit. For the SOS benchmark, the dReal verifier did not return any result within the 2-hour limit against the target region $0.1 \le \|x\| \le 0.5$. For the PVTOL environment, the LQR benchmark returns a counterexample against the target region $0.1 \le \|x\| \le 0.5$ after 10 hours; we then attempt to certify against $0.1 \le \|x\| \le 0.4$ and fail to obtain any result within the remaining 14-hour time limit. For the SOS benchmark, the dReal verifier did not return any result within the 24-hour limit against the target region $0.1 \le \|x\| \le 0.5$.

### A.5 Bounding the Functions

There are three types of bounding in our paper: 1) bounding the dynamics $f(x, u)$ with linear upper/lower bound functions for verification; 2) bounding the dynamics with constants for the calculation of $M$ for ReLU; and 3) bounding the dynamics with constants for the calculation of the ROA. Note that the bounding for 2) and 3) is similar, except that for 2) it is calculated for each subgrid, and for 3) it is calculated for the entire valid region.

**Inverted Pendulum.** Split the region: we split the region over the domain of $\theta$; please refer to our code for the details of the splitting.
(1) Bound for verification: since the only nonlinear part of the function is $\sin(\theta)$, we use the upper/lower bounds in α,β-CROWN.
(2) Bound for $M$: $|\ddot\theta| \le \frac{mg\ell + |u|_{\max} + b|\dot\theta|_{\max}}{m\ell^2}$, where $|u|_{\max}$ is calculated using the IBP bound.
(3) Bound for ROA: $\|f(x, u)\|_\infty = \gamma + \max\left(\gamma, \frac{mg\ell + |u|_{\max} + b\gamma}{m\ell^2}\right)dt$, where $|u|_{\max}$ is the maximum force allowed.

**Path Tracking.** Split the region: we split the region over the domain of $\theta_e$; please refer to our code for the details of the splitting.
(1) Bound for verification: for $\dot e_{ra}$, since the only nonlinear function is $\sin(\theta_e)$, we use the bound in α,β-CROWN; for $\dot\theta_e$ we bound the term $\frac{v\kappa(s)\cos(\theta_e)}{1 - e_{ra}\kappa(s)}$, which is concave on the interval $[-\pi/2, \pi/2]$ and convex otherwise.
(2) Bound for $M$: $|\dot e_{ra}| \le v$ and $|\dot\theta_e| \le \frac{v|u|_{\max}}{L} + \frac{v}{R - v}$, where $|u|_{\max}$ is calculated using the IBP bound.
(3) Bound for ROA: $\|f(x, u)\|_\infty = \gamma + \max\left(v, \frac{v|u|_{\max}}{L} + \frac{v}{R - v}\right)dt$, where $|u|_{\max}$ is the maximum force allowed.

**Cartpole.** Split the region: we split the region over the domains of $\theta$ and $\dot\theta$ with interval 0.1 when either $\theta$ or $\dot\theta$ is greater than 0.1. In the case when both $\theta$ and $\dot\theta$ are smaller than 0.1, we split $\theta, \dot\theta$ into intervals of 0.05. Please refer to our code for the details of the splitting.
(1) Bound for verification: we split the dynamics into six functions and bound them separately, for both one variable and multiple variables:
$$f_1 = \frac{1}{m_c + m_p\sin^2\theta}, \quad f_2 = \frac{\cos\theta}{l(m_c + m_p\sin^2\theta)}, \quad f_3 = \frac{g m_p\sin\theta\cos\theta}{m_c + m_p\sin^2\theta},$$
$$f_4 = \frac{(m_c + m_p)g\sin\theta}{l(m_c + m_p\sin^2\theta)}, \quad f_5 = \frac{m_p l\sin\theta\,\dot\theta^2}{m_c + m_p\sin^2\theta}, \quad f_6 = \frac{m_p l\,\dot\theta^2\cos\theta\sin\theta}{l(m_c + m_p\sin^2\theta)}.$$
(2) Bound for $M$: $|\ddot x| \le |u|_{\max} + m_p l|\dot\theta|^2_{\max} + m_p g$ and $|\ddot\theta| \le \frac{1}{l}\left(|u|_{\max} + \frac{m_p l|\dot\theta|^2_{\max}}{2} + (m_c + m_p)g\right)$, where $|u|_{\max}$ is calculated using the IBP bound.
(3) Bound for ROA: $\|f(x, u)\|_\infty = \gamma + \max\left(\gamma,\ |u|_{\max} + m_p l\gamma^2 + m_p g,\ \frac{1}{l}\left(|u|_{\max} + \frac{m_p l\gamma^2}{2} + (m_c + m_p)g\right)\right)dt$, where $|u|_{\max}$ is the maximum force allowed.

**PVTOL.** Split the region: we split the region over the domain of $\theta$ with interval 0.25, and split the region over the domains of $\dot x, \dot y, \dot\theta$ with interval 0.5. Furthermore, in the MILP we add a constraint to ensure that the $\ell_\infty$ norm of the state value is no less than 0.1. Please refer to our code for the details of the splitting.
(1) Bound for verification: we split the dynamics into three functions and bound them separately for multiple variables: $f_1 = x\cos y$, $f_2 = x\sin y$, $f_3 = xy$.
(2) Bound for $M$: we bound each term of the dynamics with its min/max values using monotonicity.
(3) Bound for ROA: $\|f(x, u)\|_\infty = \gamma + \max\left(\gamma,\ \gamma^2 + g + \frac{2|u|_{\max}}{m},\ 2\gamma,\ \frac{2l|u|_{\max}}{J}\right)dt$, where $|u|_{\max}$ is the maximum force allowed.

### A.6 Speeding up Verification

When there is a limit on parallel resources, so that verification must be done serially over the subgrids $R_k$, we can reduce the number of grids to verify in each step as follows. The first time the verifier is called, we run it over all subgrids; generally, only a subset will return a counterexample. Subsequently, we only verify these subgrids until none returns a counterexample, and only then attempt full verification.

### A.7 ROA Calculation

For the inverted pendulum and path tracking, we approximate the ROA using a grid of 2000 cells along each coordinate. For cartpole, we use a grid of 150 cells along each coordinate, and for PVTOL, we use a grid of 50 cells along each coordinate. For our baselines, where the systems are continuous, the level sets are $\{x \in R \mid V(x) < \rho\}$ where $\rho = \min_{x \in \partial R} V(x)$. Note that for discrete-time systems, having $\rho = \min_{x \in \partial R} V(x)$ is not sufficient. For example, assume $R$ is the valid region, every point outside of $R$ satisfies $V(f(x, u)) > V(x)$, and $V(x) = \rho$ for all $x \in \partial R$.
Assume we have $x \in R \setminus \partial R$ and $f(x, u) = y$ with $y \notin R$ (this is possible since the system is discrete). Since $x$ is in the valid region, we have $V(y) < V(x) < \rho$. However, since $y$ is no longer within the valid region, if $V(f(y, u)) > \rho > V(y)$, the system will not converge to the equilibrium point starting from point $y$. This means $\{x \in R \mid V(x) < \rho\}$ is not a subset of the ROA.

## B Computing Linear Bounds on Lipschitz-Continuous Dynamics

**Theorem B.1.** Given a function $g(x) : \mathbb{R}^n \mapsto \mathbb{R}$ with Lipschitz constant $\lambda$, we have $g(x) \le g(x_{1,l}, x_{2,l}, \ldots, x_{n,l}) + \lambda\sum_i (x_i - x_{i,l})$ and $g(x_{1,l}, x_{2,l}, \ldots, x_{n,l}) - \lambda\sum_i (x_i - x_{i,l}) \le g(x)$, where $x_i \in [x_{i,l}, x_{i,u}]$, $i \in [n]$.

*Proof.* For any $i \in [n]$, given $x_i \in [x_{i,l}, x_{i,u}]$ and arbitrary fixed values along dimensions $[n] \setminus \{i\}$ (denoted $x_{-i}$), we have $|g(x_i, x_{-i}) - g(x_{i,l}, x_{-i})| \le \lambda|x_i - x_{i,l}|$.

First, we show that $g(x_i, x_{-i}) \le g(x_{i,l}, x_{-i}) + \lambda(x_i - x_{i,l})$:
- if $g(x_i, x_{-i}) \ge g(x_{i,l}, x_{-i})$, then $g(x_i, x_{-i}) - g(x_{i,l}, x_{-i}) \le \lambda(x_i - x_{i,l})$, meaning $g(x_i, x_{-i}) \le g(x_{i,l}, x_{-i}) + \lambda(x_i - x_{i,l})$;
- if $g(x_i, x_{-i}) < g(x_{i,l}, x_{-i})$, then since $\lambda(x_i - x_{i,l}) \ge 0$, we have $g(x_i, x_{-i}) \le g(x_{i,l}, x_{-i}) + \lambda(x_i - x_{i,l})$.

Second, we show that $g(x_{i,l}, x_{-i}) - \lambda(x_i - x_{i,l}) \le g(x_i, x_{-i})$:
- if $g(x_i, x_{-i}) \ge g(x_{i,l}, x_{-i})$, then since $\lambda(x_i - x_{i,l}) \ge 0$, we have $g(x_{i,l}, x_{-i}) - \lambda(x_i - x_{i,l}) \le g(x_i, x_{-i})$;
- if $g(x_i, x_{-i}) < g(x_{i,l}, x_{-i})$, then $g(x_{i,l}, x_{-i}) - g(x_i, x_{-i}) \le \lambda(x_i - x_{i,l})$, meaning $g(x_{i,l}, x_{-i}) - \lambda(x_i - x_{i,l}) \le g(x_i, x_{-i})$.

By applying the above result along all $n$ dimensions recursively, we obtain the upper bound $g(x) \le g(x_{1,l}, x_{2,l}, \ldots, x_{n,l}) + \sum_i \lambda(x_i - x_{i,l})$ and the lower bound $g(x_{1,l}, x_{2,l}, \ldots, x_{n,l}) - \sum_i \lambda(x_i - x_{i,l}) \le g(x)$. ∎

## C Ablation Study

In this section, we conduct an ablation study on the MILP solver. Table 4 reports the results for DITL-dReal, in which we use our framework but replace all the MILP components with dReal. As we can see, using MILP is an essential ingredient in the success of the proposed approach.

Table 4: Ablation study on the MILP solver.

| Environment | Runtime | ROA | Max ROA | Success Rate |
|---|---|---|---|---|
| Inverted Pendulum (DITL-dReal) | 6.0 ± 1.7 (s) | 57 ± 24 | 75 | 100% |
| Inverted Pendulum (DITL, main paper) | 8.1 ± 4.7 (s) | 61 ± 31 | 123 | 100% |
| Path Tracking (RL, DITL-dReal) | 600 ± 0 (s) | 0 ± 0 | 0 | 0% |
| Path Tracking (RL, DITL, main paper) | 14 ± 11 (s) | 9 ± 3.5 | 16 | 100% |
| Path Tracking (LQR, DITL-dReal) | 420.9 ± 182 (s) | 4 ± 4 | 11 | 60% |
| Path Tracking (LQR, DITL, main paper) | 9.8 ± 4 (s) | 8 ± 3 | 12.5 | 100% |
| Cartpole (DITL-dReal) | > 2 (hours) | N/A | N/A | 0% |
| Cartpole (DITL, main paper) | 0.9 ± 0.3 (hours) | 0.021 ± 0.012 | 0.045 | 100% |
| PVTOL (DITL-dReal) | > 24 (hours) | N/A | N/A | 0% |
| PVTOL (DITL, main paper) | 13 ± 6 (hours) | 0.011 ± 0.008 | 0.028 | 100% |
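For completeness, here is a small self-contained sketch of the box bounds from Theorem B.1 (Appendix B), checked numerically on a toy function; the function and variable names are placeholders.

```python
import numpy as np

def lipschitz_box_bounds(g, lam, box_lo):
    """Affine bounds of Theorem B.1 on a box with lower corner box_lo:
         g(box_lo) - lam*sum_i(x_i - lo_i) <= g(x) <= g(box_lo) + lam*sum_i(x_i - lo_i)."""
    box_lo = np.asarray(box_lo, dtype=float)
    g0 = g(box_lo)
    lower = lambda x: g0 - lam * np.sum(np.asarray(x) - box_lo)
    upper = lambda x: g0 + lam * np.sum(np.asarray(x) - box_lo)
    return lower, upper

# Small numerical check on g(x) = sin(x0) + 0.5*x1, whose per-coordinate Lipschitz constant is 1.
g = lambda x: np.sin(x[0]) + 0.5 * x[1]
lower, upper = lipschitz_box_bounds(g, lam=1.0, box_lo=[0.2, -0.5])
for _ in range(1000):
    x = np.random.uniform([0.2, -0.5], [0.7, 0.5])
    assert lower(x) <= g(x) <= upper(x)
```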