# finitetime_convergence_in_continuoustime_optimization__0a384169.pdf

Finite-Time Convergence in Continuous-Time Optimization

Orlando Romero 1 Mouhacine Benosman 2

In this paper, we investigate a Lyapunov-like differential inequality that allows us to establish ﬁnite-time stability of a continuous-time state-space dynamical system represented via a multivariate ordinary differential equation or differential inclusion. Equipped with this condition, we synthesize ﬁrst and second-order (in an optimization variable) dynamical systems that achieve ﬁnite-time convergence to the minima of a given sufﬁciently regular cost function. As a byproduct, we show that the q-rescaled gradient ﬂow (q-RGF) proposed by Wibisono et al. (2016) is indeed ﬁnite-time convergent, provided the cost function is gradient dominated of order p (1, q). This way, we effectively bridge a gap between the q-RGF and the ﬁnite-time convergent normalized gradient ﬂow (NGF) (q = ) proposed by Cort es (2006) in his seminal paper in the context of multiagent systems. We discuss strategies to discretize our proposed ﬂows and conclude by conducting some numerical experiments to illustrate our results.

1. Introduction

In recent years, there has been a surge of research papers aiming to leverage ideas from dynamical systems and control theory (both in continuous and discrete time) into optimization and machine learning. As a simple example to illustrate the connection between these ﬁelds, consider the gradient ﬂow (GF)

x(t) = f(x(t)), (1)

where x(t) dx(t)

dt , for a given convex and differentiable cost function (or functional) f : X R deﬁned over a

1Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, USA. 2Mitsubishi Electric Research Laboratories, Cambridge, MA, USA.. Correspondence to: Orlando Romero <oromero@seas.upenn.edu>, Mouhacine Benosman <benosman@merl.com>.

Proceedings of the 37 th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

smooth Banach space X (typically a Euclidean or otherwise ﬁnite-dimensional real vector space in nonlinear programming, but inﬁnite-dimensional in optimal control, calculus of variations, and trajectory optimization). Indeed, this system has been long studied in the mathematical community due to its provable asymptotic stability (in the sense of Lyapunov), and thus its ability for solutions x(t) to converge, as t , to a minimum of f. This idea dates back to at least Hadamard (1908), as noted by Courant (1943), where solving the differential equation (1) as an optimization method is referred to as the method of gradients. For a modern instances of papers dealing with the GF in abstract and inﬁnite-dimensional (or otherwise non-Euclidean) spaces, see (Ambrosio et al., 2005), (Danieri & Savar e, 2014), and (Feehan, 2016). On the other hand, the standard gradient descent (GD) algorithm

xk+1 = xk η f(xk) (2)

with ﬁxed step size η > 0 (also known as learning rate in deep learning) is likely to be older even. Its origin, or at least its main inspiration, is often attributed to Cauchy (1847). Clearly, the GF and GD methods are connected since the GD (2) is nothing more than the forward-Euler discretization of the GF (1). Likewise, the backward-Euler discretization (also known as implicit discretization)

xk+1 = xk η f(xk+1), (3)

of the GF (1) can be readily rewritten as1

xk+1 = arg min x Rn

2η x xk 2 , (4)

which is simply the usual proximal point algorithm (PPA).

In continuous-time optimization, an ordinary differential equation (ODE), partial differential equation (PDE), or differential inclusion is designed to be explicitly computable under assumed oracles of a cost function or some surrogate of it, in such a way to lead their solutions to converge to a minimizer or extremum of the cost function. The gradient ﬂow (1) naturally becomes the archetype gradient-based system. To achieve this, tools from Lyapunov stability theory are often employed, mainly due to the rich body of work

1In this paper, . denotes the ℓ2-norm.

Finite-Time Convergence in Continuous-Time Optimization

within the nonlinear systems and control theory community for this purpose. In particular, we often seek asymptotically Lyapunov stable gradient-based systems with an equilibrium (stationary point) at an isolated extremum of the given cost function, thus certifying local convergence. Naturally, global asymptotic stability leads to global convergence, though such an analysis will typically require the cost function to be strongly convex everywhere.

For early work in this direction, see the work of Botsaris (1978a;b), Zghier (1981), Snyman (1982; 1983), and Brown (1989). In particular, Brockett (1988) and, subsequently, Helmke & Moore (1994), studied relationships between linear programming, ODEs, and general matrix theory. Further, Schropp (1995) and Schropp & Singer (2000) explored several aspects linking nonlinear dynamical systems to gradient-based optimization, including nonlinear constraints. Cort es (2006) proposed two discontinuous normalized modiﬁcations of gradient ﬂows to attain ﬁnite-time convergence. Later, Wang & Elia (2011) proposed a controltheoretic perspective on centralized and distributed convex optimization.

More recently, Su et al. (2014) derived a second-order ODE as the limit of Nesterov s accelerated gradient method, when the gradient step sizes vanish. This ODE is then used to study Nesterov s scheme from a new perspective, particularly in an larger effort to better understand acceleration without substantially increasing computational burden. Expanding upon the aforementioned idea, Franc a et al. (2018) derived a ﬁrst-order ODE that models the continuous-time limit of the sequence of iterates generated by the alternating direction method of multipliers (ADMM). Then, the authors employ Lyapunov theory to analyze the stability at critical points of the dynamical systems and to obtain associated convergence rates.

Later, Franc a et al. (2019) analyzed general non-smooth and linearly constrained optimization problems by deriving equivalent (at the limit) non-smooth dynamical systems related to variants of the relaxed and accelerated ADMM. In particular, two new ADMM-like algorithms were proposed: one based on Nesterov s acceleration and the other inspired by Polyak s heavy ball method, and derive differential inclusions modeling these algorithms in the continuous-time limit. Using a non-smooth Lyapunov analysis, results on rate of convergence are obtained for these dynamical systems in the convex and strongly convex settings.

In the more traditional context of machine learning, there are multiple papers that have adopted the approach of explicitly borrowing or connecting ideas from control and dynamical systems. For unsupervised learning, Plumbley (1995) proposed Lyapunov stability theory as an approach to establish convergence of principal component algorithms. Pequito et al. (2011) and Aquilanti et al.

(2019) proposed continuous-time generalized expectationmaximization (EM) algorithms, based on mean-ﬁeld games, for clustering of ﬁnite mixture models. Romero et al. (2019) established convergence of the EM algorithm, and a class of generalized EM algorithms denoted δ-EM, via discrete-time Lyapunov stability theory. For supervised learning, Liu & Theodorou (2019) provided a review of deep learning from the perspective of control and dynamical systems, with a focus in optimal control. Zhu (2018) and Rahnama et al. (2019) explored connections between control theory and adversarial machine learning.

Statement of Contribution

In this work, we ﬁrst provide a Lyapunov-based tool to both check and construct continuous-time dynamical systems that are ﬁnite-time stable represented via differential inclusions. We then use this condition to construct multiple dynamical systems with ﬁnite-time convergence to (strict) local minima that can be viewed as continuous-time optimization algorithms. In particular, for continuously differentiable and gradient dominated cost functions, we provide a ﬁrst-order method that only assumes access to the cost function and its gradient. Finally, for twice continuously differentiable and strongly convex functions, we also provide a family of ﬁnitetime convergent second-order methods, whose convergence time can be prescribed near the desired minimum.

2. Finite-Time Convergence in Optimization via Finite-Time Stability

Consider some objective cost function f : Rn R that we wish to minimize. In particular, let x Rn be an arbitrary local minimum of f that is unknown to us. In continuoustime optimization, we typically proceed by designing a nonlinear state-space dynamical system

x = F(t, x) (5)

for which F(t, x) can be computed without explicit knowledge2 of x and for which (5) is certiﬁably asymptotically Lyapunov stable at x . For instance, we often seek systems that use only up to second-order information on the cost function, thus we design F through an oracle Of(x) = {f(x), f(x), 2f(x)}.

In this work, however, we seek dynamical systems for which (5) is certiﬁably ﬁnite-time Lyapunov stable at x . As will be clear later, such systems need to be possibly discontinuous or non-Lipschitz, which can more naturally be expressed and analyzed in the lense of differential inclusions

2In other words, we typically design some G( ) that can be explicitly computed for any input, and we set F(t, x) G(t, Of(x)), where Of( ) denotes some oracle function such that Of(x) encompasses all available data regarding f near x.

Finite-Time Convergence in Continuous-Time Optimization

instead of ODEs. To achieve this objective, our approach is largely based on exploiting the Lyapunov-like differential inequality

E(t) c E(t)α, a.e. t 0, (6)

with constants c > 0 and α < 1, for absolutely continuous (AC) functions E such that E(0) > 0. Indeed, under the aforementioned conditions, exact convergence E(t) 0 will be reached in ﬁnite time t t E(0)1 α

We will restrict ourselves to the design of time-invariant systems F(t, x) = F(x). We now summarize the problem statement: Problem 1. Given a sufﬁciently smooth cost function f : Rn R with a sufﬁciently regular local minimizer x , and an initial approximation x0 Rn sufﬁciently near x , solve the following tasks:

1. Design a sufﬁciently smooth3 candidate Lyapunov function V which is deﬁned and positive deﬁnite near and with respect to (w.r.t.) x 4.

2. Design a function G that can be explicitly computed for any input and such that, for the (possibly discontinuous) system5 (5) with F G Of, the differential inequality (6) holds for E V x( ), with x( ) a Filippov solution6 with x(0) = x0.

By following this strategy, we will therefore achieve (local and strong) ﬁnite-time stability, and thus ﬁnite-time convergence. Furthermore, if V (x0) can be upper bounded, then F can be readily tuned to achieve ﬁnite-time convergence under a prescribed range for the settling time, or even with exact prescribed settling time if V (x0) can be explicitly computed and (6) holds with equality.

One variant of Problem 1 that we do not explore in this paper is to remove the possible dependence of G on x0 by replacing ﬁnite-time stability with ﬁxed-time stability. In many (perhaps most) practical situations, we have full control on the initial approximation, so we ﬁnd it reasonable to design optimization algorithms around it.

3. First-Order Convergent Flows

Given some continuously differentiable cost function f : Rn R with a local minimizer and isolated stationary

3At least locally Lipschitz continuous and regular (see supplementary material (SM)). 4In other words, there exists some open neighborhood D of x

such that V is deﬁned in D and satisﬁes V (x) 0 with equality if and only if x = x , for every x D \ {x }. 5Right-hand side deﬁned at least a.e., Lebesgue measurable. Furthermore, we may require it to be locally essentially bounded to ensure existence of solutions. 6See SM.

point x Rn, Cort es (2006) proposed the (discontinuous) normalized gradient ﬂows

and x = sign( f(x)), (8)

where sign( ) denotes the sign function (applied elementwise for real-valued vectors). He then established ﬁnitetime stability based on the candidate Lyapunov function V (x) = f(x) f , with f = f(x ), and two differential inequality assumptions: a ﬁrst-order one akin to (6) for α = 0, and another which essentially boils down to the corresponding energy function E( ) being non-increasing and strongly convex. However, the ﬁrst-order conditions proved insufﬁcient to establish the ﬁnite-time convergence of his proposed ﬂows, whereas the second-order condition is sufﬁcient, but also requires twice continuously differentiability and that the Hessian of the cost function is positive deﬁnite near the local minimum of interest.

More precisely, Cort es (2006) showed that, if f is twice continuously differentiable and strongly convex in an open neighborhood D Rn of x , then the solutions to his proposed ﬂows (7) and (8) converge in ﬁnite time to x , provided they start in some positively invariant compact subset S D. He further showed that the convergence times are upper bounded by

t f(x0) min x S λmin[ 2f(x)] (9)

for (7), and

t f(x0) 1 min x S λmin[ 2f(x)] (10)

We will now see how our approach can be used to generalize (7) and (8) while still maintaining ﬁnite-time convergence, but without requiring second differentiability or strong convexity, instead focusing on the notion of gradient dominance.

Borrowing terminology from Wilson et al. (2019), we say that a continuously differentiable function f : Rn R is µ-gradient dominated of order p (1, ] (with µ > 0) in some neighborhood D of a local minimizer x Rn if

p p 1 µ 1 p 1 (f(x) f ) (11)

for every x D, where f = f(x ). When µ > 0 is unknown or unimportant, but known to exist, we will omit it in the previous deﬁnition.

Finite-Time Convergence in Continuous-Time Optimization

For strongly convex functions, gradient dominance of order p = 2 can be established. In fact, gradient dominance is usually deﬁned exclusively for order p = 2, often referred to as the Polyak-Łojasiewicz (PL) inequality, which was introduced by Polyak (1963) to relax the (strong) convexity assumption commonly used to show convergence of the GD algorithm (2). The PL inequality can also be used to relax convexity assumptions of similar gradient and proximalgradient methods (Karimi et al., 2016). Our adopted generalized notion of gradient dominance is strongly tied to the Łojasiewicz gradient inequality from real analytic geometry, established by Łojasiewicz (1963; 1965)7 independently and simultaneously from (Polyak, 1963), and generalizing the PL inequality. More precisely, this inequality is typically written as

f(x) C |f(x) f |θ (12)

for every x Rn in a small enough open neighborhood of the stationary point x = x , for some C > 0 and θ 1

2, 1 . This inequality is guaranteed for analytic8

functions (Łojasiewicz, 1965). More precisely, when x is a local minimizer of f, the aforementioned relationship is explicitly given by

p µ 1 p , θ = p 1

Therefore, analytic functions are always gradient dominated. However, while analytic functions are always smooth, smoothness is not required to attain gradient dominance. Continuous differentiability is still required, however.

Let us now consider the candidate Lyapunov function V (x) f(x) f from which we will construct ﬁrstorder ﬁnite-time convergent ﬂows. Notice that (6) with E(t) V (x(t)) then becomes

f(x(t)) x(t) c(f(x(t)) f )α. (14)

Naturally, an immediate candidate for x(t) is

x = c(f(x) f )α

f(x) 2 f(x), (15)

but, unfortunately, it requires knowledge of f , which is usually not available. See SM for further discussion. More generally, to satisfy (14) without knowledge of f , we could design x = c f(x) β 2 f(x) (16)

with β chosen such that f(x) β (f(x) f )α. Notice that the RHS of (16) is continuous (but non-Lipschitz, unless

7For more modern treatments in English, see (Łojasiewicz & Zurro, 1999; Bolte et al., 2007) 8Analytic functions are functions that are locally given by convergent power series.

β 2), provided that f is continuously differentiable near x and β > 1. From the gradient dominance, it follows that

p p 1 p p 1

µ 1 p 1 (f f ), (17)

f(x) α( p p 1) p p 1

α µ α p 1 (f(x) f )α (18)

for every x near x . Since f(x) 0 as x x , then it clearly sufﬁces to choose β α p p 1 . On the other hand, the particular choice of c and α < 1 are unimportant to attain ﬁnite-time convergence, and thus we may choose any β < p p 1. In other words, it sufﬁces that β = q q 1 with q (p, ], which results in

q 2 q 1 , (19)

known as the q-rescaled gradient ﬂow (q-RGF), originally proposed by Wibisono et al. (2016). We can use a similar reasoning to generalize (8) into q-signed gradient ﬂow (q SGF)

1 q 1 1 sign( f(x)), (20)

which naturally coincides with (19) for n = 1. Due to inequalities between different ℓr-norms, we could also replace the norms in either of the proposed ﬂows by r, subject to a suitable range for r 1. Other generalizations could include replacing the norms altogether by some function x 7 α( s ) with α a class K function (Khalil, 2001).

Theorem 1. Suppose that f : Rn R is continuously differentiable and µ-gradient dominated of order p (1, ) near a strict local minimizer x Rn. Let c > 0 and q (p, ]. Then, any maximal solution (in the sense of Filippov) to the q-RGF given by (19) or the q-SGF ﬂow (20) will converge in ﬁnite time to x , provided that x(0) x > 0 is sufﬁciently small. Furthermore, their convergence times are both upper bounded by

t f(x0) 1 θ 1

c C 1 θ 1 θ

where x0 = x(0) and f = f(x ), with C, θ given by (13) and θ = q 1

q . In particular, given any compact and positively invariant subset S D, both ﬂows converge in ﬁnite with the aforementioned convergence time upper bound (which can be tightened by replacing D with S) for any x0 S. Furthermore, if D = Rn, then we have global ﬁnite-time convergence, i.e. ﬁnite-time convergence to any maximal solution (in the sense of Filippov) x( ) with arbitrary x0 Rn.

Finite-Time Convergence in Continuous-Time Optimization

Proof. The main idea is to show that the differential inequality (6) is satisﬁed for the energy function E(t) V (x(t)), deﬁned in terms of the candidate Lyapunov function V (x) f(x) f . Refer to SM for full details.

Wibisono et al. (2016) showed, for convex cost functions, that solutions x( ) to the q-RGF (19) (with c = 1) satisfy the convergence rate f(x(t)) f = O 1 tq 1 . However, as concluded in Theorem 1, the q-RGF actually converges in ﬁnite time, provided that f is gradient dominated of order p (1, q). Therefore, the q-RGF will be ﬁnite-time convergent for strongly convex functions, provided we choose q > 2. For analytic functions, the q-RGF is also ﬁnite-time convergent, provided that q > 1 is chosen large enough. More precisely, any q > p = 1 1 θ achieves this. Unfortunately, however, bounding the exponent θ for non-strongly convex and non-polynomial functions appears to be mathematically intractable in general (D Acunto & Kurdyka, 2005; Pham, 2012). Notice that, in principle, the convergence time can be prescribed, as t T with used-selected T > 0 by appropriate choice of c > 0 and q > p that make the RHS of (21) equal to T. However, the set of candidate hyperparameters c, q will naturally depend on µ, p. These observations are equally valid for the ﬂow (8).

We believe a reciprocal result to Theorem 1 is likely to be true. More precisely, that, if f is not gradient dominated of order p for some p (1, q), then the convergence of the q-RGF will only be asymptotic. To illustrate this notion, let us explore the objective function considered in Appendix F of (Wibisono et al., 2016), given by

p x p, (22)

with p (1, ), which is µ-gradient dominated of order p near x = 0, with µ = (p 1)p 1. Furthermore, f is not gradient dominated of order p for any p < p. Therefore, in order to apply our theory and thus ensure ﬁnite-time convergence, we must choose q > p.

Notice that the p-RGF reduces to

x = c x, (23)

and thus x(t) = e c tx0. In other words, the solutions to the p-RGF converge asymptotically to the minimum x = 0, and not in ﬁnite time.

On the other hand, the q-RGF for a general q > 1 becomes

x = c x εx, (24)

with ε = q p

q 1. It appears that this ODE cannot be analytically solved in the multivariate case, so for simplicity we assume x(t) R. The solutions thus become

x(t) = sign(x0) max{0, (|x0|ε ε c t) 1 ε }. (25)

Clearly, if q > p, then ε > 0, and x(t) t as t t , with t = 1 c ε|x0|ε < . On the other hand, if 1 < q < p, then ε < 0, and thus x(t) x only as t . The case q = p, corresponds to ε 0, which leads to x(t) = e tx0 as previously discussed.

In terms of the settling time upper bound (21) (assuming now q > p and multivariate x(t) Rn), in this case it turns out to hold with equality. Indeed, it simpliﬁes to t 1 c ε x0 ε with ε = q p

q 1, which generalizes the exact settling time derived analytically in the scalar case. The settling time upper bound is tight in this case precisely because the inequality originating from the gradient dominance actually holds with equality, and thus the Lyapunov differential inequality (6) will hold with equality as well.

4. Second-Order Convergent Flows

Let us now investigate the candidate Lyapunov function V (x) f(x) 2, with x now assumed a local minimizer and isolated stationary point. Setting E(t) V (x(t)), then, provided that f is twice continuously differentiable, (6) becomes

2 f(x) 2f(x) x c f(x) 2α. (26)

Clearly, there are many possible ﬂows that can be constructed to satisfy the previous condition, so let us focus on a family constructed for strongly convex f. Given a symmetric and positive deﬁnite matrix P Rn n with SVD decomposition P = V ΣV , Σ = diag(λ1 . . . , λn), λ1, . . . , λn > 0, we deﬁne P r V Σr V , where Σr diag (λr 1, . . . , λr n). Equipped with this deﬁnition, we propose the family

x = c f(x) 2α [ 2f(x)]r f(x) f(x) [ 2f(x)]r+1 f(x) (27)

with c > 0, α < 1, and r R tunable hyperparameters, as (27) leads directly to (26). In a similar fashion, we propose the family

x = c f(x) 2α 1 1 [ 2f(x)]r sign( f(x)) sign( f(x)) [ 2f(x)]r+1 sign( f(x)) (28)

with the same hyperparameters, constructed via the candidate Lyapunov function V (x) f(x) 1.

We are now ready to state the ﬁnite-time convergence of these proposed ﬂows.

Theorem 2. Suppose that f : Rn R is twice continuously differentiable and strongly convex in an open neighborhood D Rn of a stationary point x Rn. Let c > 0, α < 1, and r R. Then, any maximal Filippov solution to the discontinuous second-order generalized Newton-like ﬂows (27) and (28) with sufﬁciently small x0 x > 0

Finite-Time Convergence in Continuous-Time Optimization

(where x0 = x(0)) will converge in ﬁnite time to x . Furthermore, their convergence times are given exactly by

t = f(x0) 2(1 α)

2c(1 α) , t = f(x0) 2α 1 2cα , (29)

respectively. In particular, given any compact and positively invariant subset S D, the ﬂows converge in ﬁnite for any x0 S. Furthermore, if D = Rn, then we have global ﬁnite-time convergence.

Proof. Refer to SM for a detailed proof.

One point that we want to emphasise here is the fact that with these second-order ﬂows, one can much more readily (compared to our ﬁrst-order ﬂows) prescribe the ﬁnite convergence time by appropriate choice of c, α. This is a clear advantage comparatively to the ﬁrst-order methods, for which the obtained ﬁnite-time convergence upper bound (21) is less practical. In particular, we may prescribe t = T with arbitrary T > 0 by choosing, for instance, α = 1

2 and c = f(x0) /T. In particular, for instance, we propose the rescaled Newton ﬂow (RNF)

T [ 2f(x)] 1 f(x)

by further choosing r = 1 in (27). Therefore, for (30), we obtain the prescribed ﬁnite-time convergence x(t) x as t T, where T > 0 fully tunable.

As a simple example, let us reconsider the function (22), this time only with p = 2. Indeed, the ﬂow (30) reduces to x = x0

T x x . In particular, for n = 1, its solution is given by x(t) = max 0, 1 t

T x0, which clearly satisﬁes x(t) x = 0 as t T.

5. Numerical Experiments

In this section, we illustrate the ﬁnite-time convergence properties of the q-RGF (19) and our designed second-order ﬂow (27) on academic optimization test functions. Then, we investigate preliminary discretization strategies for the ﬂows discussed in this paper.

5.1. First-Order Flow

Consider once again the cost function (22), in the scalar case x R and with p = 3. We will illustrate the performance of the RGF (19) with c = 2.

First, we ﬁx x0 = 3/4 and vary q > 1. The results are reported in Figure 1. As we can see, choosing any q > p ensures ﬁnite-time convergence, but as q p+, the convergence becomes purely asymptotic. Furthermore, we can

1 c ε (q ) |x0|ε(q)

28 25/9 0.514

1 c ε (q ) |x0|ε(q)

0 1 2 3 4 5 0.0

0 1 2 3 4 5

Figure 1. Solutions to the q-RGF (24) with x0 = 3/4, c = 2, and various values of q > 1, on the cost function (22) (scalar case) with p = 3. Note: ε(q) = (q p)/(q 1).

see that the settling time upper bound (21) from Theorem 1, which simpliﬁes to t 1 c ε x0 ε with ε = q p

q 1, is tight.

Next, we ﬁx q = 10 and vary x0 R near x = 0, while maintaining every other parameter the same as before. The results are reported in Figure 2. As we can see, we have ﬁnite-time convergence near x = 0 and the settling time upper bound (21) is tight. In reality, for this simple example, the ﬁnite-time convergence property holds globally, meaning that any x0 Rn actually leads to x(t) x = 0 as t t = 1 c ε x0 ε.

0.1 0.2 0.3 0.4 0.5 0.6

1 c ε |x0|ε

28 25/9 0.514

1 c ε |x0|ε

14 27/9 0.375

Figure 2. Solutions to the q-RGF (24) with c = 2, q = 10, and various values of x0 R near x = 0, on the cost function (22) (scalar case) with p = 3. Note: ε = (q p)/(q 1) = 7/9.

5.2. Second-Order Flow

We now test the second-order ﬂow (27) with (c, α, r) = ( f(x0) , 1/2, 1) on the optimization testbed function known as the Rosenbrock function, namely f : R2 R given by

f(x1, x2) = (a x1)2 + b(x2 x2 1)2, (31)

with parameters a, b R. This function admits exactly one stationary point (x 1, x 2) = (a, a2) for b 0, which is a

Finite-Time Convergence in Continuous-Time Optimization

strict global minimum for b > 0. If b < 0, then (x 1, x 2) is a saddle point. Finally, if b = 0, then {(a, x2) : x2 R} are the stationary points of f, and they are all non-strict global minima.

Figure 3. Trajectories of the proposed ﬂow (27) with (c, α, r) = ( f(x0) , 1/2, 1) and four different initial conditions x0 R2, for the Rosenbrock function with parameters (a, b) = (3, 100), which has a unique minimum x = (a, a2) = (3, 9).

As we can see in Figure 3, this ﬂow converges correctly to the minimum (a, a2) = (3, 9) for all the tested initial conditions with an exact prescribed settling time T = 1. Note that, at any given point in the trajectory x( ), the functions t 7 x(t) x and t 7 |f(x(t)) f(x )| are not guaranteed to not increase, indeed only t 7 f(x(t)) can be guaranteed to do so, which explains the increase in (d) that could never have occurred in (e).

5.3. Preliminary Numerical Investigation of Potential Discretization Schemes

It is unclear if ﬁnite-time convergence in continuous time translates into some useful property when discretized, or if it should merely serve as a warning to better understand the limitations of a continuous-time representation and analysis of optimization algorithms that are ultimately intended to be implemented on a digital computer (and thus any continuous-time representation requires discretization to be implementable). Nevertheless, in this subsection we provide a preliminary investigation of potential discretization approaches that seek to combine the ﬁnite-time convergent ﬂows studied in this paper together with well-established acceleration ideas for iterative (discrete-time) algorithms.

Recall that the Nesterov s accelerated GD is given by

yk = xk + βk(xk xk 1) (32a)

xk+1 = yk η f(yk) (32b)

with 0 βk < 1, often βk = k 1

k+2 or βk = β [0, 1).

We argue that Nesterov s acceleration can be interpreted as actually being a modiﬁed Forward-Euler discretization of the GF (1), that thus achieves acceleration. More generally, given some optimization ﬂow represented by the continuous-time system x = F(x), locally convergent to a local minimizer x Rn of a cost function f : Rn R, then we can replicate Nesterov s acceleration of (1). More precisely, we obtain the algorithm

yk = xk + βk(xk xk 1) (33a)

xk+1 = yk + ηF(yk) (33b)

where naturally choosing F = f results in Nesterov s accelerated GD.

We can now choose amongst the different ﬂows discussed in this paper, in order to test this simple discretization idea. To achieve further acceleration, we can also make the step size η > 0 adaptive, i.e. we replace η ηk in (33). The parameter µ > 0 can also (and often is, for F = f) be made adaptive. In particular, we will adopt an accelerated backtracking approach (and thus an inexact line search) borrowed from (Almeida et al., 1997). More precisely, we choose the (tunable) hyperparameters 0 < d < 1 < u and set ηk = uηk 1 if

f(yk + uηk 1F(yk))

min{f(xk), f(yk + ηk 1F(yk))}, (34)

and ηk = drkηk 1 otherwise, where rk is deﬁned by the smallest r {0, 1, . . .} such that

f(yk + drηk 1F(yk)) f(xk) (35)

In other words, we increase the step size by a factor u > 0 (i.e., η u η) whenever helpful, but otherwise reduce it by a factor d > 0 (i.e., η d η) repeatedly until the objective function actually decreases, or at least until it does not increase.

We now test this discretization idea with on the log-sum-exp function given by

f(x) = ρ log

i=1 exp a i x bi

with n = 20, m = 50, ρ = 5. Each entry of a1, . . . , am and b1, . . . , bm is independently sampled from a N(0, 1) distribution.

The results are presented in Figures 4 and 5 and illustrate the potential of our proposed discretization approach, particularly when combined with the ﬁnite-time convergent ﬂows studied in this paper. All of the hyperparameters were manually tuned to achieve near-optimal performance. The

Finite-Time Convergence in Continuous-Time Optimization

ﬁgures represent the average result of 50 random initializations x0 N(0, 1) (component-wise, independently), with a ﬁxed sample for a1, . . . , am, b1, . . . , bm as previously described.

We notice that, regarding the ﬁrst-order ﬂows, namely the q-RGF (19), we can slightly improve convergence of its discretization, compared to discretizations of the standard gradient ﬂow (1), by tuning q, particularly so if we re-tune the other parameters in the discretization. This is the case for both a simple forward-Euler discretization, or our proposed Nesterov-style discretization described in (33). However, we also noted in our experiments that backtracking does not appear to combine well with the ﬁrst-order methods considered in this paper, including for the gradient ﬂow case. However, once we move to the second-order ﬂows, namely for the RNF (30), we see that (accelerated) backtracking synergizes remarkably well with the Nestorov-like forward-Euler discretization scheme proposed earlier. It is also interesting to note that a simple forward-Euler discretization with ﬁxed step sizes of the RNF (30) originally appears slower than the standard Newton method, but indeed eventually surpasses it, for the example considered.

0 50 100 150 200 250 300 350 400 450 500 k

GD (fixed stepsizes) RGF (forward-Euler discretization w/ fixed stepsizes) NAGD (fixed stepsizes) RGF (proposed discretization w/ fixed stepsizes)

Figure 4. Numerical experiments for the ﬁrst-order discrete algorithms

0 50 100 150 200 250 300 350 400 450 500 k

f(xk) - f* (+ 10-13)

RNF (forward-Euler w/ fixed stepsizes) Newton NF (proposed discretization w/ fixed stepsizes) NF (proposed discretization w/ accelerated backtracking) RNF (proposed discretization w/ fixed stepsizes) Newton (accelerated backtracking) RNF (proposed discretization w/ accelerated backtracking)

Figure 5. Numerical experiments for the second-order discrete algorithms

6. Conclusion

We have introduced a new family of discontinuous ﬁrstorder and second-order ﬂows for continuous-time optimization. The main characteristic of the proposed ﬂows is their ﬁnite-time convergence guarantees, with, in some cases, an

arbitrary pre-deﬁned convergence time. To analyze these discontinuous ﬂows, we ﬁrst extended an exiting Lyapunovbased inequality condition for ﬁnite-time stability in the case of smooth dynamics to the case of non-smooth dynamics modeled by differential inclusions. We then derived and established ﬁnite-time stability (and thus convergence) for the proposed family of continuous-time optimization algorithms. We also proposed a preliminary discretization method of the proposed ﬂows. Finally, we conducted numerical experiments on known optimization benchmarks.

Several questions remain open, which we will target in our future work. First, while we have used commonly available numerical solvers in part of our (small-scale) numerical experiments, and have proposed a ﬁrst step towards a discretization method, more work will be done in this constructive discretization research direction. Furthermore, we also seek to adapt our methods to allow for linear and nonlinear constraints, and to develop distributed and decentralized variants. Lastly, many real-life problems that require a timevarying optimization framework, such as in motion planning or formation control in robotics, do not allow direct access to gradients, Hessian matrices, or time-derivatives of the gradient. Instead, these are typically estimated based on measurements (e.g. of the cost function) that often occur in discrete time and carry noisy perturbations. Therefore, future work will also be dedicated to the robustiﬁcation of our proposed ﬂows, including zeroth-order (gradient-free) schemes.

Acknowledgements

The research for this paper was mostly conducted during a research internship of OR supervised by MB in the summer of 2019 at MERL in Cambridge, MA. OR also thanks the department of Industrial and Systems Engineering at Rensselaer Polytechnic Institute in Troy, NY, where he was a graduate student during the time of writing of the paper.

Almeida, L. B., Langlois, T., Amaral, J. D., and Redol, R. A. On-line step size adaptation. Technical report, INESC, 1997.

Ambrosio, L., Gigli, N., and Savare, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Springer, January 2005.

Aquilanti, L., Cacace, S., Camilli, F., and De Maio, R. A mean ﬁeld games approach to cluster analysis. July 2019.

Bolte, J., Daniilidis, A., and Lewis, A. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. Society

Finite-Time Convergence in Continuous-Time Optimization

for Industrial and Applied Mathematics, 17:1205 1223, January 2007.

Botsaris, C. A class of methods for unconstrained minimization based on stable numerical integration techniques. Journal of Mathematical Analysis and Applications, 63 (3):729 749, 1978a.

Botsaris, C. Differential gradient methods. Journal of Mathematical Analysis and Applications, 63(1):177 198, 1978b.

Brockett, R. Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems. pp. 799 803, 1988.

Brown, A. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62(2):211 224, August 1989.

Cauchy, A. Methode generale pour la resolution des syst emes d equations simultane. . C. R. Acad. Sci. Paris, 25:536 538, 1847.

Cort es, J. Finite-time convergent gradient ﬂows with applications to network consensus. Automatica, 42(11): 1993 2000, November 2006.

Courant, R. Variational methods for the solution of problems of equilibrium and vibrations. Bulletin of the American Mathematical Society, 49(1):1 23, January 1943.

D Acunto, D. and Kurdyka, K. Explicit bounds for the Łojasiewicz exponent in the gradient inequality for polynomials. Annales Polonici Mathematici, 87:51 61, March 2005.

Danieri, S. and Savar e, G. Lecture notes on gradient ﬂows and optimal transport, pp. 100 144. London Mathematical Society Lecture Note Series. Cambridge University Press, 2014.

Feehan, P. M. N. Global existence and convergence of solutions to gradient systems and applications to yangmills gradient ﬂow. ar Xiv preprint 1409.1525v4, 2016.

Franc a, G., Robinson, D., and Vidal, R. ADMM and accelerated ADMM as continuous dynamical systems. July 2018.

Franc a, G., Robinson, D., and Vidal, R. A dynamical systems perspective on nonsmooth constrained optimization. ar Xiv preprint 1808.04048, 2019.

Hadamard, J. M emoire sur le probl eme d analyse relatif a l equilibre des plaques elastiques encastr ee. M emoires pr esent es par divers savants a l Acad emie des Sciences de l Institut National de France, Series 2, 33(4):1 128, 1908.

Helmke, U. and Moore, J. B. Optimization and Dynamical Systems. Springer-Verlag, 1994.

Karimi, H., Nutini, J., , and Schmidt, M. Linear convergence of gradient and proximalgradient methods under the Polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795 811. Springer, 2016.

Khalil, H. K. Nonlinear Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 2001.

Liu, G.-H. and Theodorou, E. Deep learning theory review: An optimal control and dynamical systems perspective. ar Xiv preprint 1908.10920, August 2019.

Łojasiewicz, S. A topological property of real analytic subsets (in French). Les equations aux d eriv ees partielles, pp. 87 89, 1963.

Łojasiewicz, S. Ensembles semi-analytiques. Centre de Physique Theorique de l Ecole Polytechnique, 1965. URL https://perso.univ-rennes1.fr/ michel.coste/Lojasiewicz.pdf.

Łojasiewicz, S. and Zurro, M.-A. On the gradient inequality. Bulletin of the Polish Academy of Sciences, Mathematics, 47, January 1999.

Pequito, S. D., Aguiar, A. P., Sinopoli, B., and Gomes, D. A. Unsupervised learning of ﬁnite mixture models using mean ﬁeld games. 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 321 328, 2011.

Pham, T. S. An explicit bound for the Łojasiewicz exponent of real polynomials. Kodai Mathematical Journal, 35(2): 311 319, 06 2012.

Plumbley, M. D. Lyapunov functions for convergence of principal component algorithms. Neural Networks, 8(1): 11 23, 1995.

Polyak, B. Gradient methods for the minimisation of functionals (in Russian). USSR Computational Mathematics and Mathematical Physics, 3:864 878, December 1963.

Rahnama, A., Nguyen, A. T., and Raff, E. Connecting lyapunov control theory to adversarial attacks. ar Xiv preprint 1907.07732, 2019.

Romero, O., Chaterjee, S., and Pequito, S. Convergence of the expectation-maximization algorithm through discretetime lyapunov stability theory. Proceedings of the American Control Conference (ACC), pp. 163 168, July 2019.

Schropp, J. Using dynamical systems methods to solve minimization problems. Applied Numerical Mathematics, 18(1):321 335, 1995.

Finite-Time Convergence in Continuous-Time Optimization

Schropp, J. and Singer, I. A dynamical systems approach to constrained minimization. Numerical Functional Analysis and Optimization, 21:537 551, May 2000.

Snyman, J. A new and dynamic method for unconstrained minimization. Applied Mathematical Modelling, 6(6): 448 462, December 1982.

Snyman, J. An improved version of the original leapfrog dynamic method for unconstrained minimization: LFOP1(b). Applied Mathematical Modelling, 7(3):216 218, June 1983.

Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510 2518. Curran Associates, Inc., 2014.

Wang, J. and Elia, N. A control perspective for centralized and distributed convex optimization. In IEEE Conference on Decision and Control and European Control Conference, pp. 3800 3805, December 2011.

Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47): E7351 E7358, 2016.

Wilson, A., Mackey, L., and Wibisono, A. Accelerating rescaled gradient descent: Fast optimization of smooth functions. In Advances in Neural Information Processing Systems. December 2019.

Zghier, A. K. The use of differential equations in optimization. Ph D thesis, Loughborough University, 1981.

Zhu, X. An optimal control view of adversarial machine learning. ar Xiv preprint 1811.04422, 2018.