Non-convex online learning via algorithmic equivalence

Udaya Ghai, Zhou Lu, Elad Hazan
Google AI Princeton; Princeton University

We study an algorithmic equivalence technique between non-convex gradient descent and convex mirror descent. We start by looking at a harder problem of regret minimization in online non-convex optimization. We show that under certain geometric and smoothness conditions, online gradient descent applied to non-convex functions is an approximation of online mirror descent applied to convex functions under reparameterization. In continuous time, the gradient flow with this reparameterization was shown to be exactly equivalent to continuous-time mirror descent by Amid and Warmuth [4], but theory for the analogous discrete-time algorithms was left as an open problem. We prove an O(T^{2/3}) regret bound for non-convex online gradient descent in this setting, answering this open problem. Our analysis is based on a new and simple algorithmic equivalence method.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

1 Introduction

Gradient descent is probably the simplest and most popular algorithm in convex optimization, with numerous and extensive studies of its convergence properties. For non-convex objectives, it is known that gradient descent does not necessarily converge to the global optimum, which is in general NP-hard to find. Thus recent research focuses on alternate objectives, such as finding first-order stationary points efficiently, or the importance of higher-order stationary points, e.g. [19, 22, 1].

However, the study of global convergence of gradient descent for non-convex objectives is increasingly important because, in practice, gradient descent and its variants can achieve zero error on the highly non-convex loss function of a deep neural network. To explain the success of modern deep learning, it is thus important to understand why gradient descent can converge to a global minimum for some highly non-convex functions. Several important research directions have stemmed from this motivation, including the study of optimization for deep linear networks [20, 21, 6], non-convex matrix factorization [7], provable convergence for neural networks in the linearization regime of the neural tangent kernel [26, 12, 13, 2, 32, 18, 9], and more.

A recent promising approach, proposed in [4, 3], is to consider reparameterizing mirror descent as gradient descent. In particular, [4] shows that continuous-time online mirror descent with a convex loss can be written in the equivalent form of a continuous-time online gradient descent whose loss is not necessarily convex after reparameterization, but has significant structure. [3] further explores this idea, showing that after reparameterization, online gradient descent has the same worst-case regret bound as the exponentiated gradient algorithm. Both [4] and [3] provide a new perspective on understanding the convergence of non-convex gradient descent. More recently, [23] provides a more general result, tightly characterizing when gradient flow can be written as mirror flow by introducing the notion of a commuting parameterization.

1.1 The Amid-Warmuth question

The works of [4, 3] are limited in two respects: either they are restricted to the continuous-time gradient-flow setting, which is impractical, or they address specific algorithms (Winnow, Hedge, EG for linear regression) with a relative entropy divergence.
Amid and Warmuth pose the question of extending the reparameterization idea to general online convex optimization, as well as the conditions under which this can be done. Solving the question from [4] would give an answer for realistic scenarios and algorithms, with precise regret and running-time bounds, for certain non-convex optimization problems that can be solved to global optimality using gradient descent.

1.2 Statement of results

Our study starts from the more general problem of regret minimization in online convex optimization. Specifically, we study the reparameterization of online mirror descent (OMD) as online gradient descent (OGD) [30] for general online convex optimization. We show that under certain geometric and smoothness conditions, a non-convex reparameterized OGD algorithm closely matches the original convex OMD algorithm. Quantitative bounds on this matching allow us to prove that the OGD algorithm after reparameterization achieves an O(T^{2/3}) regret bound. This bound is discrete-time and fairly general, applying broadly to smooth, bounded mirror descent regularizers rather than anything more specific. We also provide sufficient conditions on non-convex losses such that OGD implicitly has an equivalent mirror descent regularizer and hence attains O(T^{2/3}) regret.

Our analysis relies on a simple and new algorithmic equivalence technique, which may be of independent interest. The key step is to show that the outputs of OMD and reparameterized OGD are very close to each other when initialized from the same point. Considering the OGD update as a perturbed version of OMD, along with the fact that OMD can tolerate bounded (adversarial) perturbation per trial, allows us to prove an O(T^{2/3}) regret bound for the OGD update. This answers the open question raised in [4], generalizing from continuous-time gradient flow and specialized algorithms to general online convex optimization. The regularization term of MD is also generalized from relative entropy to any strongly convex function. Further, for discrete-time reparameterization, our results extend to arbitrary Lipschitz convex loss functions, as opposed to custom analyses for specific settings of expert prediction, linear prediction, and linear regression [3].

1.3 Paper structure

Section 2 introduces the setting and sketches background work in the continuous setting. Section 3 provides our main result, which is subsequently proven in Section 4. Section 5 provides analysis in the opposite direction, showing the existence of an equivalent mirror descent regularizer for some non-convex gradient descent problems. Some analysis is deferred to the Appendix.

1.4 Related works

Gradient descent is the simplest and one of the most popular algorithms in convex optimization. Extensive studies have been conducted to show the convergence of GD to the global minimizer in both the stochastic optimization setting [24] and the online convex optimization setting [31]. For the non-convex setting, however, finding the global minimum is NP-hard, and most work focuses on convergence to first-order stationary points instead. The well-known descent lemma guarantees the convergence of non-convex GD when the objective is smooth, and dropping smoothness is possible if other conditions are introduced [10]. Many works consider avoiding saddle points; [22] shows that GD converges to local minimizers almost surely if all saddle points are strict, while [19] proposes perturbed gradient descent to escape saddle points.
It is also shown that even in the online setting, chasing first-order stationary points is possible [17].

Mirror descent, first introduced by [25], generalizes the gradient descent method in the sense that it can adapt to the geometry of the optimization problem [11]. Its analogue in the online setting, online mirror descent, also achieves tight regret bounds [27] and, without projection, is shown to be equivalent to the classical regularized follow-the-leader (RFTL) method with constant step size. Though it is natural to think of mirror descent as changing the geometry of the optimization objective in gradient descent [15], the idea of explaining non-convex GD by a corresponding convex MD has been explored only recently. The works most relevant to ours are [4] and [3], which prove the equivalence between continuous-time OMD and OGD after reparameterization, and between special discrete-time algorithms when the regularization is the relative entropy, respectively. Our results extend [4] and [3] to general discrete-time OCO with general regularization.

2 Preliminaries

2.1 Notation

For a function f : R^d → R^d, we use J_f(x) ∈ R^{d×d} to denote the Jacobian of f at x. We use the notation ⊙ to represent the element-wise product of vectors. Given a strictly convex function R : R^d → R, we denote the Bregman divergence by

D_R(x‖y) := R(x) − R(y) − ∇R(y)·(x − y) .

For a convex set K and strictly convex regularizer R, we use Π^R_K(x) := argmin_{y∈K} D_R(y‖x) to denote Bregman projection, with the shorthand Π_K for Euclidean projection (the case R = (1/2)‖·‖₂²). Given a positive-definite matrix M ∈ R^{d×d}, we define the norm ‖x‖_M := √(xᵀMx). We use the notation B_p := {x ∈ R^d : ‖x‖_p ≤ 1} for an ℓ_p ball and B⁺_p to denote the intersection of the ℓ_p ball and the positive orthant.

2.2 Online convex optimization

We consider the online convex optimization (OCO) problem. At each round t the player A chooses x_t ∈ K, where K ⊆ R^d is some convex domain; then the adversary reveals the loss function f_t(x) and the player suffers loss f_t(x_t). The goal is to minimize regret:

Regret(A) = Σ_{t=1}^T f_t(x_t) − min_{x∈K} Σ_{t=1}^T f_t(x) .

The player is also allowed access to the (sub-)gradient ∇f_t(x_t).

2.3 Reparameterizing continuous-time mirror flow as gradient flow

In the continuous-time setting, we have exact conditions from [4] under which the trajectory of mirror flow exactly coincides with that of gradient descent in a reparameterized space. Concretely, mirror flow on f : R^d → R is defined by the following ordinary differential equation (ODE):

d/dt ∇R(x(t)) = −η ∇f(x(t)) .   (1)

In Theorem 2 of [4], this update is shown to be equivalent to the following gradient flow ODE with x(t) = q(u(t)):

d/dt u(t) = −η ∇_u f(q(u(t))) ,   (2)

if the mirror descent regularizer R and reparameterization function q satisfy

[∇²R(q(u))]^{−1} = J_q(u) J_q(u)ᵀ .

This relationship between the Hessian of the OMD regularizer and the reparameterization function assures that, up to factors of higher than second order in ‖u − v‖₂,

D_R(q(u)‖q(v)) ≈ (1/2)‖u − v‖₂² ,   (3)

so the geometry induced by R is approximately transformed into a Euclidean geometry by q^{−1}. Because higher-order factors vanish in continuous time, this assures that mirror flow and this reparameterized gradient flow coincide as desired.
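To make the Hessian-Jacobian condition and approximation (3) concrete, the following is a minimal numerical sketch (not from the paper; the regularizer, reparameterization, dimension, and perturbation size are illustrative choices) that checks both for the negative-entropy regularizer R(x) = Σ_i x_i log x_i with the quadratic reparameterization q(u) = (1/4) u ⊙ u, a pairing that reappears in Section 3.1.

```python
import numpy as np

# Numerical spot check (illustrative, not from the paper) of the condition
# [grad^2 R(q(u))]^{-1} = J_q(u) J_q(u)^T and of approximation (3), for the
# negative-entropy regularizer R(x) = sum_i x_i log x_i with q(u) = u*u/4.

def R(x):
    return np.sum(x * np.log(x))                 # negative entropy (strictly convex on R_+^d)

def bregman_R(x, y):
    # D_R(x||y) = R(x) - R(y) - grad R(y) . (x - y), with grad R(y) = log(y) + 1
    return R(x) - R(y) - (np.log(y) + 1.0) @ (x - y)

q = lambda u: 0.25 * u * u                       # reparameterization x = q(u)

rng = np.random.default_rng(0)
u = rng.uniform(0.5, 1.5, size=5)
x = q(u)

hess_R_inv = np.diag(x)                          # grad^2 R(x) = diag(1/x), so its inverse is diag(x)
JJt = np.diag(u / 2.0) @ np.diag(u / 2.0)        # J_q(u) = diag(u/2)
print(np.allclose(hess_R_inv, JJt))              # True: the condition holds for this pair

v = u + 1e-3 * rng.standard_normal(5)            # approximation (3): the two quantities agree
print(bregman_R(q(u), q(v)), 0.5 * np.sum((u - v) ** 2))  # ... up to higher-order terms
```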
2.4 Discretizing the updates

Unfortunately, continuous-time updates cannot be implemented in practice, so we must rely on discretization. Using the forward Euler scheme to discretize (1) yields the mirror descent algorithm. After discretization, the exact equivalence of mirror descent and reparameterized gradient descent no longer holds, as higher-order factors in (3) become relevant. This motivates the open problem of [4]: finding general conditions under which the discretizations of these continuous-time trajectories closely track each other. We tackle extending these results through the online setting described above.

The Online Mirror Descent (OMD) algorithm has the following update on x_t:

∇R(y_{t+1}) = ∇R(x_t) − η ∇f_t(x_t)
x_{t+1} = argmin_{x∈K} D_R(x‖y_{t+1})

where R : K → R is a 1-strongly convex regularizer over the domain⁴, D_R is the Bregman divergence, and y₁ is initialized to satisfy ∇R(y₁) = 0. Another equivalent interpretation of OMD is

x_{t+1} = argmin_{x∈K} { ∇f_t(x_t)·(x − x_t) + (1/η) D_R(x‖x_t) } .

The most important special case is online gradient descent (OGD):

x_{t+1} = Π_K(x_t − η ∇f_t(x_t)) ,

which is OMD with D_R(x‖x_t) = (1/2)‖x − x_t‖₂². In this paper we consider reparameterizing the general OMD update as a simple OGD update. We introduce assumptions on the regularizer, reparameterization, and losses in the sequel.

⁴ Usually, the strong convexity of a mirror descent regularizer is with respect to some norm that is not necessarily the ℓ₂ norm. In this work, the focus is on the dependence of the regret on T, so for simplicity we consider just ℓ₂ strong convexity, which will worsen constants but does not affect the dependence on T.

2.5 Assumptions

Assumption 1. There exists a convex domain K′ ⊆ R^d and a bijective reparameterization function q : K′ → K satisfying

[∇²R(x)]^{−1} = J_q(u) J_q(u)ᵀ , with x = q(u).

We also make the following smoothness/Lipschitz assumptions:

Assumption 2. Let G > 1 be a constant. We assume q is G-Lipschitz, and the 1-strongly convex regularizer R is smooth with its first and third derivatives upper bounded by the constant G. The first and second derivatives of q^{−1} are bounded by G. Furthermore, assume that for all x ∈ K, D_R(x‖·) is G-Lipschitz over K.

Finally, we assume the loss functions have bounded gradients and the convex domain has bounded diameter with respect to the Bregman divergence.

Assumption 3. The loss functions f_t have gradients bounded by ‖∇f_t(x)‖₂ ≤ G_F for all x ∈ K. The diameter of K satisfies sup_{x,y∈K} D_R(x‖y) ≤ D.

3 Algorithm

We now provide our main result, a regret bound for Algorithm 2. We note that Algorithm 1 and Algorithm 2 are the (projected) forward Euler discretizations of (1) and (2). It is tempting to directly analyze the regret of OGD with losses f̃_t(u) = f_t(q(u)). The main barrier to doing so is that f̃_t is not necessarily convex. However, we show that the OMD and OGD updates are in fact O(η^{3/2})-close to each other, and therefore the OGD update still suffers only sublinear regret.

Algorithm 1 Online Mirror Descent
1: Input: initialization x₁ ∈ K, regularizer R.
2: for t = 1, . . . , T do
3:   Predict x_t.
4:   Observe ∇f_t(x_t).
5:   Update
     y_{t+1} = (∇R)^{−1}(∇R(x_t) − η ∇f_t(x_t))
     x_{t+1} = Π^R_K(y_{t+1})

Algorithm 2 Online Gradient Descent
1: Input: initialization u₁ ∈ K′ = q^{−1}(K).
2: for t = 1, . . . , T do
3:   Predict u_t.
4:   Observe ∇f̃_t(u_t), where f̃_t(u) := f_t(q(u)).
5:   Update
     v_{t+1} = u_t − η ∇f̃_t(u_t)
     u_{t+1} = Π_{K′}(v_{t+1})

Remark 1. The gradient oracle for f̃_t can also be computed from an oracle for ∇f_t if we know the reparameterization function q, via the chain rule ∇f̃_t(u) = J_q(u)ᵀ ∇f_t(q(u)).

Remark 2. The reparameterized gradient descent uses Euclidean projection rather than Bregman projection. This can potentially be an advantage if Euclidean projection can be computed efficiently on q^{−1}(K) (e.g., via a method like [29]), while Bregman projection is less efficient on K.

Theorem 3. Under Assumptions 1-3, by setting η = T^{−2/3} D^{2/3} G^{−10/3} G_F^{−1}, the regret of Algorithm 2 is upper bounded by O(T^{2/3} D^{1/3} G^{10/3} G_F).

The proof of the theorem is in Section 4.
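For concreteness, the following is a minimal sketch (not the authors' code) of the two updates in Algorithms 1 and 2. Here grad_R, grad_R_inv, bregman_proj, q, Jq, and eucl_proj are placeholder callables that would be supplied for the particular regularizer, reparameterization, and domain at hand.

```python
import numpy as np

# Minimal sketch of Algorithms 1 and 2 (illustrative; not the authors' code).
# grad_R, grad_R_inv, bregman_proj, q, Jq, and eucl_proj are placeholders to be
# supplied for the particular regularizer / reparameterization / domain.

def omd(x1, grad_fs, grad_R, grad_R_inv, bregman_proj, eta, T):
    """Algorithm 1 (OMD): grad_fs[t](x) returns the loss gradient at round t."""
    x, xs = x1, []
    for t in range(T):
        xs.append(x)                                       # predict x_t
        y = grad_R_inv(grad_R(x) - eta * grad_fs[t](x))    # mirror step in the dual space
        x = bregman_proj(y)                                # Bregman projection onto K
    return xs

def reparam_ogd(u1, grad_fs, q, Jq, eucl_proj, eta, T):
    """Algorithm 2 (OGD on the reparameterized losses f~_t(u) = f_t(q(u)))."""
    u, us = u1, []
    for t in range(T):
        us.append(u)                                       # predict u_t, i.e. x_t = q(u_t)
        g = Jq(u).T @ grad_fs[t](q(u))                     # chain rule: grad of f~_t at u
        u = eucl_proj(u - eta * g)                         # Euclidean projection onto q^{-1}(K)
    return us
```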
3.1 Examples

We now provide a few example settings. Note that in all three settings, q is such that f̃(u) = f(q(u)) can be non-convex even for a convex f.

Exponentiated gradient using quadratic reparameterization. One of the most popular algorithms for online learning over the filled-in simplex⁵ K = B⁺₁ is Exponentiated Gradient (EG) [8]. In this case, we show that OGD with the quadratic reparameterization q(u) = (1/4) u ⊙ u has vanishing average regret. We empirically verify that reparameterized GD iterates and EG iterates stay close on a toy problem in Figure 1. In [3], a custom analysis is provided for linear losses that achieves optimal regret, but our approach has a simpler and more general analysis. If R(x) = Σ_{i=1}^d x_i log(x_i), then OMD is EG. In this case, if x = q(u), then

J_q(u) J_q(u)ᵀ = diag(u/2) diag(u/2) = diag(u ⊙ u / 4) = diag(q(u)) = diag(x) = [∇²R(x)]^{−1} ,
K′ = q^{−1}(B⁺₁) = {u ∈ R^d_+ : (1/4) Σ_{i=1}^d u_i² ≤ 1} = 2B⁺₂ .

Figure 1: Gradient descent using the reparameterization q(u) = (1/4) u ⊙ u produces iterates that closely track exponentiated gradient for a simple fixed quadratic loss example.

We note that for this example, Assumption 2 will not hold, as the relative entropy is not sufficiently well-behaved near the boundary of the simplex. This can be handled by modifying K to be the smoothed simplex where all weights are at least ε, for some ε > 0. This is a fairly standard technique, and ε can be chosen such that the full regret is sublinear (see Appendix B).

⁵ One drawback of our approach is that the current analysis does not work for the true simplex, as in [3], because q^{−1}(K) would then be the positive part of a sphere, which is not a convex set.

Log barrier with exponential reparameterization. Consider the case of a log-barrier regularizer R(x) = −Σ_{i=1}^d log(x_i) with K = [ε, 1]^d. It can readily be shown that the box constraint maps to a box constraint. Consider x = q(u) = exp(u), where the exponential is elementwise. We see Assumption 1 is satisfied, as

J_q(u) J_q(u)ᵀ = diag(exp(u)) diag(exp(u)) = diag(q(u)²) = diag(x²) = [∇²R(x)]^{−1} .

Tempered Bregman divergences with power reparameterization. A more recently studied family of mirror descent regularizers interpolates between ℓ₂ regularization and negative-entropy regularization [5]. Regularization for this family uses the link function

∇R(x) = log_τ(x) = (1/(1 − τ)) (x^{1−τ} − 1) ,

where τ is a temperature that can range between 0 and 1. As τ approaches 1, the link function approaches the natural logarithm, and hence the update approaches EG, while τ = 0 corresponds to ℓ₂ regularization. Such divergences are amenable to reparameterization when K = B⁺_p. To see this, note that ignoring constant factors, [∇²R(x)]^{−1} = diag(x^τ) for τ ∈ [0, 1). Now if q(u) = u^{2/(2−τ)}, where the power is elementwise, then (again ignoring constant factors)

J_q(u) = diag(u^{τ/(2−τ)}) = diag(q(u)^{τ/2}) = diag(x^{τ/2}) ,

as desired. We note that the reparameterization is a power, so B⁺_p gets mapped to B⁺_{2p/(2−τ)}, which is convex as long as p > 1 − τ/2.

3.2 Challenges

In some situations, we do not have a closed-form solution to the differential equation described by J_q. For example, with the β-hypentropy regularizer of [14], the reparameterization involves q^{−1}(x) = ∫ (x² + β²)^{−1/4} dx, and this integral does not have a known closed form. Furthermore, other cases of interest, like von Neumann entropy regularization for Matrix Multiplicative Weights [28], seem difficult to fit into this framework, as Assumption 1 is challenging to use with spectral functions.
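To complement Figure 1, the following is a minimal sketch (not from the paper; the quadratic loss, dimension, horizon, and step size are arbitrary illustrative choices) that runs EG and reparameterized gradient descent with q(u) = (1/4) u ⊙ u from a common starting point on the filled-in simplex, as in the first example of Section 3.1, and records the largest gap between their iterates.

```python
import numpy as np

# Illustrative comparison (not from the paper) of EG (entropy-regularized OMD) and
# reparameterized GD with q(u) = u*u/4 on the filled-in simplex, echoing Figure 1.
# The quadratic loss, dimension, horizon, and step size below are arbitrary choices.

d, eta, T = 5, 0.05, 200
rng = np.random.default_rng(0)
b = rng.uniform(0.1, 0.9, size=d)
grad_f = lambda x: x - b                        # f(x) = (1/2)||x - b||^2

def bregman_proj_filled_simplex(y):             # entropy-Bregman projection onto {x >= 0, sum x <= 1}
    y = np.maximum(y, 1e-12)                    # stay strictly positive (smoothed domain)
    s = y.sum()
    return y / s if s > 1.0 else y              # rescale only if the constraint is violated

def eucl_proj_pos_ball(u, r=2.0):               # Euclidean projection onto {u >= 0, ||u||_2 <= r}
    u = np.maximum(u, 0.0)
    n = np.linalg.norm(u)
    return u * (r / n) if n > r else u

x = np.full(d, 1.0 / d)                         # common starting point, x_1 = q(u_1)
u = 2.0 * np.sqrt(x)
gap = 0.0
for t in range(T):
    x = bregman_proj_filled_simplex(x * np.exp(-eta * grad_f(x)))       # EG / OMD step
    u = eucl_proj_pos_ball(u - eta * (u / 2.0) * grad_f(0.25 * u * u))  # reparameterized OGD step
    gap = max(gap, np.linalg.norm(x - 0.25 * u * u))
print("largest iterate gap:", gap)              # small: the two trajectories track each other
```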
4 Reparameterization analysis

We now move on to the proof of Theorem 3. The proof idea is to show that the OMD and OGD iterates are close to each other after a single update step starting from the same initial point. Then we can view the OGD update as a perturbed version of OMD, and combine this with the fact that the OMD algorithm can tolerate bounded noise per trial.

We begin with the following key lemma showing that the updates x_{t+1} and q(u_{t+1}), created by OMD and OGD respectively, are close to each other when started from the same initial point x_t = q(u_t). To do this, we carefully analyze the errors that occur due to the approximation in (3), starting from a proximal formulation of mirror descent. We show that the two updates are solutions to approximately the same strongly convex objective, hence the solutions must be close.

Lemma 4. Suppose Assumptions 1-3 hold and x_t = q(u_t). Then

‖x_{t+1} − q(u_{t+1})‖₂ = O(G⁴ G_F^{3/2} η^{3/2}) .

Proof. We consider the following forms of Algorithms 1 and 2:

x_{t+1} = argmin_{x∈K} { ∇f_t(x_t)·(x − x_t) + (1/η) D_R(x‖x_t) }
u_{t+1} = argmin_{u∈K′} { ∇f̃_t(u_t)·(u − u_t) + (1/(2η)) ‖u − u_t‖₂² }

We observe that both objectives being minimized can be written as the sum of a linear function and a strongly convex function. In particular, if ‖x − x_t‖₂ > 2ηG_F, then the first objective takes a positive value, which cannot be minimal because taking x = x_t gives 0. Define K_{r,x_t} = K ∩ {x : ‖x − x_t‖₂ ≤ r}; then the OMD iterate satisfies x_{t+1} ∈ K_t := K_{2ηG_F, x_t}. Likewise, u_{t+1} ∈ K′_t := K′_{2ηG_F G, u_t}.

Using a Taylor expansion with Assumption 2, for D_R(x‖x_t) with ‖x − x_t‖₂ ≤ 2ηG_F, we can write

D_R(x‖x_t) = R(x) − R(x_t) − ∇R(x_t)·(x − x_t)
           = (1/2)(x − x_t)ᵀ ∇²R(x_t)(x − x_t) + O(G G_F³ η³)
           = (1/2)‖x − x_t‖²_{∇²R(x_t)} + ε(x) ,

where ε(x) = O(G G_F³ η³) for x ∈ K_t. The OMD update becomes

x_{t+1} = argmin_{x∈K_t} { ∇f_t(x_t)·(x − x_t) + (1/(2η)) ‖x − x_t‖²_{∇²R(x_t)} + ε(x)/η } .

We now use a Taylor expansion to relate ∇f̃_t(u_t)·(u − u_t) to ∇f_t(x_t)·(x − x_t). Let x = q(u) for u ∈ K′_t, hence x ∈ K̃_t := K_{2ηG_F G², x_t}. Then we have

∇f̃_t(u_t)·(u − u_t) = ∇f_t(x_t)ᵀ J_q(u_t)(u − u_t)
                     = ∇f_t(x_t)ᵀ J_q(u_t)(q^{−1}(x) − q^{−1}(x_t))
                     = ∇f_t(x_t)ᵀ J_q(u_t)(J_q^{−1}(u_t)(x − x_t)) + ε(x)
                     = ∇f_t(x_t)·(x − x_t) + ε(x) ,

where ε(x) = O(G⁵ G_F³ η²). Using the above approximation, we can rewrite the OGD update as

q(u_{t+1}) = argmin_{x∈K̃_t} { ∇f_t(x_t)·(x − x_t) + (1/(2η)) ‖q^{−1}(x) − q^{−1}(x_t)‖₂² + ε(x) } .

On the other hand, we can rewrite (1/2)‖q^{−1}(x) − q^{−1}(x_t)‖₂² as

(1/2)‖q^{−1}(x) − q^{−1}(x_t)‖₂² = (1/2)‖J_q^{−1}(u_t)(x − x_t) + O(G⁵ G_F² η²)‖₂²
                                 = (1/2)‖x − x_t‖²_{J_q^{−1}(u_t)ᵀ J_q^{−1}(u_t)} + O(G⁸ G_F³ η³) .

Then by Assumption 1, ∇²R(x_t) = J_q^{−1}(u_t)ᵀ J_q^{−1}(u_t), so we conclude that (1/η) D_R(x‖x_t) and (1/(2η)) ‖q^{−1}(x) − q^{−1}(x_t)‖₂² are O(G⁸ G_F³ η²)-close. We can then rewrite the OGD update as

q(u_{t+1}) = argmin_{x∈K̃_t} { ∇f_t(x_t)·(x − x_t) + (1/(2η)) ‖x − x_t‖²_{∇²R(x_t)} + ε(x) } ,

where ε(x) = O(G⁸ G_F³ η²) for x ∈ K̃_t. We note that K_t ⊆ K̃_t, as needed. Comparing the OMD and OGD updates, the objective functions differ only by an O(G⁸ G_F³ η²) term; therefore the minimizers are also O(G⁴ G_F^{3/2} η^{3/2})-close, since the objective functions are both Θ(1/η)-strongly convex.

Remark 5. It can be shown that if projection is not required, ‖x_{t+1} − q(u_{t+1})‖₂ = O(η²) in Lemma 4. Furthermore, if the projection operations that Algorithms 1 and 2 perform are the same, the distance only gets closer. Such a bound would result in an O(√T) regret bound. For the case of EG over the simplex, the projection operations in both are the same weight-scaling operations, so this could be a part of making O(√T) regret possible, though this still requires a better approach for handling exploding Lipschitz constants (see Appendix B).
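As a numerical spot check of the one-step closeness in Lemma 4 (not from the paper), the following sketch uses the log-barrier example of Section 3.1, R(x) = −Σ_i log x_i with q(u) = exp(u) on K = [ε, 1]^d, where both projections reduce to coordinate-wise clipping; the loss, dimension, and step sizes are illustrative choices. In the interior, where no projection is active, the observed gap shrinks roughly like η², consistent with Remark 5.

```python
import numpy as np

# Spot check (illustrative) of the one-step closeness in Lemma 4 for the log-barrier
# example: R(x) = -sum_i log x_i, q(u) = exp(u), K = [eps, 1]^d. For a separable R,
# Bregman projection onto a box is coordinate-wise clipping, as is Euclidean projection
# onto q^{-1}(K) = [log eps, 0]^d. In the interior (no projection active), the gap is
# expected to shrink roughly like eta^2, consistent with Remark 5.

d, eps = 4, 0.05
rng = np.random.default_rng(1)
x0 = rng.uniform(0.3, 0.7, size=d)               # common start x0 = q(u0) in the interior of K
u0 = np.log(x0)
b = rng.uniform(0.3, 0.7, size=d)
grad_f = lambda x: x - b                         # f(x) = (1/2)||x - b||^2

clip_x = lambda x: np.clip(x, eps, 1.0)          # Bregman projection onto K
clip_u = lambda u: np.clip(u, np.log(eps), 0.0)  # Euclidean projection onto q^{-1}(K)

for eta in (1e-1, 1e-2, 1e-3):
    # OMD step: grad R(y) = grad R(x0) - eta * grad f(x0), with grad R(x) = -1/x,
    # i.e. 1/y = 1/x0 + eta * grad f(x0), followed by projection.
    x1 = clip_x(x0 / (1.0 + eta * grad_f(x0) * x0))
    # Reparameterized OGD step: grad f~(u) = J_q(u)^T grad f(q(u)) = exp(u) * grad f(exp(u)).
    u1 = clip_u(u0 - eta * np.exp(u0) * grad_f(np.exp(u0)))
    print(eta, np.linalg.norm(x1 - np.exp(u1)))  # one-step gap between x_{t+1} and q(u_{t+1})
```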
Next, we show that the OMD algorithm is actually robust to noise: with a (potentially non-stochastic) bounded perturbation of the output x_t per round, we can still get vanishing average regret.

Lemma 6. Suppose Assumptions 2 and 3 hold and Algorithm A performs the following update:

x_{t+1} = r_t + argmin_{x∈K} { ∇f_t(x_t)·(x − x_t) + (1/η) D_R(x‖x_t) } ,

where ‖r_t‖₂ ≤ C. The regret of Algorithm A can be upper bounded by

Regret(A) ≤ (CTG + D)/η + O(η G_F² T) .

This follows from a slightly modified analysis of OMD from [16], where the effect of the perturbation is bounded using the fact that the Bregman divergence is Lipschitz. In particular, we end up with the standard regret bound for mirror descent that follows from a telescoping argument, with an additional error term

Σ_{t=1}^T [ D_R(x*‖x_t) − D_R(x*‖x_t^R) ] ≤ CTG ,

where x_t^R is the counterfactual mirror descent iterate with r_t = 0. The full proof of Lemma 6 can be found in Appendix A.

Combining both lemmas, we complete our proof of Theorem 3:

Proof. Algorithm 2 gives a perturbed version of OMD with perturbation bounded by C = O(η^{3/2}), due to Lemma 4. Plugging C = O(G⁴ G_F^{3/2} η^{3/2}) into the bound of Lemma 6, the regret is upper bounded by O(G⁵ G_F^{3/2} η^{1/2} T + D/η), ignoring the lower-order term. Optimizing this and tracking all constants gives the choice η = T^{−2/3} D^{2/3} G^{−10/3} G_F^{−1} and the regret bound O(T^{2/3} D^{1/3} G^{10/3} G_F).

5 Implicit OMD reparameterization

We have shown that a general convex OMD can be reparameterized as a (potentially) non-convex OGD with sublinear O(T^{2/3}) regret. The other direction, from OGD to OMD, is even more interesting: given a non-convex OGD, can we show its global convergence by showing the existence of a convex OMD which corresponds to the OGD implicitly? In other words, given a convex and compact domain K′, what do we need to know about the domain K′ and the (not necessarily convex) losses f̃_t to ensure the existence of such an equivalence? Fully characterizing the necessary and sufficient conditions seems hard, and we provide here a simple sufficient condition which covers some known scenarios.

Assumption 4. We assume the following properties of q:
• There exists a function q such that f̃_t(u) can be written as f_t(q(u)), where f_t is convex.
• q is a C³-diffeomorphism, and J_q(u) is diagonal.
• q(K′) is convex and compact.

We argue that once the conditions above are satisfied, there exists a regularizer R and a corresponding OMD with the desired equivalence. The only thing we need to verify is the existence of a strongly convex regularizer R which satisfies Assumption 1. We first show that J_q(u) always has non-zero determinant (in fact, of the same sign everywhere, by continuity) due to the fact that q is a diffeomorphism. Then R can be reconstructed by twice integrating its Hessian, which we compute from the equation in Assumption 1. R is also strongly convex, as the Hessian is strictly positive definite. We note that such conditions may not be tight, but they are general and succinct, and they cover some interesting examples.

Theorem 7. Let K′ be a convex and compact domain, and let f̃_t be (not necessarily convex) losses satisfying Assumption 3. When Assumption 4 is met, there exists an OMD instance with convex losses f_t, a convex domain, and a strongly convex regularizer R satisfying Assumption 1. As a result, running Algorithm 2 on the losses f̃_t(u) has regret upper bounded by O(T^{2/3}).

Proof. The first two properties are satisfied by the assumption. We verify the third property by constructing a regularizer R which is strongly convex and satisfies Assumption 1. The rest follows from Theorem 3. By the fact that q is a diffeomorphism, J_q(u) is an invertible matrix.
The fact that J_q(u) is an invertible matrix implies that J_q(u)J_q(u)ᵀ is invertible, and further that J_q(u)J_q(u)ᵀ ≻ 0. Denote H(u) = [J_q(u)J_q(u)ᵀ]^{−1} ≻ 0, which is well-defined by the above argument; we want to construct R such that ∇²R(q(u)) = H(u). By the assumption that J_q(u) is diagonal, we can write q(u) as q(u) = (q₁(u₁), q₂(u₂), ..., q_d(u_d)), where each q_i is a scalar function, so that H(u) is diagonal with H(u)_{ii} = 1/(q_i′(u_i))², which depends only on u_i.

We set R(q(u)) = Σ_{i=1}^d R_i(q_i(u_i)) and denote by R̃_i(u_i) := R_i(q_i(u_i)) the corresponding function of the variable u_i; then by the chain rule we have

R̃_i′(u_i) = q_i′(u_i) R_i′(x_i)
R̃_i″(u_i) = q_i″(u_i) R_i′(x_i) + (q_i′(u_i))² R_i″(x_i) .

Plugging the first formula into the second one, we eliminate the R_i′(x_i) term:

R̃_i″(u_i) = q_i″(u_i) R̃_i′(u_i) / q_i′(u_i) + (q_i′(u_i))² R_i″(x_i) .

Recall that we want to find R such that R_i″(x_i) = 1/(q_i′(u_i))². Plugging this back into the equation above, the problem reduces to finding R̃_i satisfying the ODE

q_i′(u_i) R̃_i″(u_i) − q_i″(u_i) R̃_i′(u_i) − q_i′(u_i) = 0 .

It is known that there exists a solution to any linear and continuous second-order ODE. In fact, using the standard variation-of-constants method, one example solution R̃_i′(u_i) is

R̃_i′(u_i) = q_i′(u_i) ∫_{C_i}^{u_i} 1/q_i′(u) du + c_i q_i′(u_i) ,

where c_i, C_i are constants and C_i = min_{u∈K′} u_i. It is well-defined because q_i′(u_i) always has the same sign. To verify correctness, we calculate

R̃_i″(u_i) = q_i″(u_i) ∫_{C_i}^{u_i} 1/q_i′(u) du + 1 + c_i q_i″(u_i) ,

which, together with the expression for R̃_i′(u_i), solves the ODE. Next, plugging R̃_i(u_i) = ∫_{C_i}^{u_i} R̃_i′(u) du and u = q^{−1}(x) into the expression for R gives a solution for the regularizer. Its strong convexity is implied by the fact that ∇²R(x) = H(u) ≻ 0 and is continuous over a compact domain; thus there exists a constant c > 0 such that ∇²R(x) ⪰ c·I. Finally, q being a C³-diffeomorphism means that both q and q^{−1} are three times continuously differentiable. By the compactness of the domains and the continuity of the derivatives, there exists a constant G satisfying Assumption 2. A similar argument holds for R.

6 Conclusion

We study the convergence of non-convex gradient descent in this paper, by showing an (approximate) algorithmic equivalence between gradient descent and mirror descent in the discrete setting. We prove that under certain geometric and smoothness conditions, running OGD on non-convex losses obtained by reparameterizing a convex OMD has regret bound O(T^{2/3}), answering an open problem of [4]. Our analysis is based on a new algorithmic equivalence technique, combining the one-step closeness between OMD and the reparameterized OGD with the robustness of OMD. We further extend our result in the other direction, providing sufficient conditions for a non-convex OGD to have an implicit corresponding OMD and thus converge well. We leave several questions as future directions.

1. Is the O(T^{2/3}) regret bound improvable to O(√T) in general? The degradation in the bound in this work seems to be caused by the difference between projection in the reparameterized space and the original Bregman projection. Perhaps a different analysis technique can produce optimal bounds.

2. Can the assumptions for reparameterization be relaxed? For example, it is not clear that Assumption 1 is a necessary condition. In the implicit mirror descent direction, it may be possible to lift the diagonal J_q(u) assumption in Theorem 7 with more refined conditions on PDEs. Perhaps the commuting parameterization conditions from [23] can be adapted from the continuous to the discrete setting.
References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195-1199, 2017.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization, 2019.

[3] Ehsan Amid and Manfred K. Warmuth. Winnowing with gradient descent. In Conference on Learning Theory, pages 163-182. PMLR, 2020.

[4] Ehsan Amid and Manfred K. Warmuth. Reparameterizing mirror descent as gradient descent. Advances in Neural Information Processing Systems, 33:8430-8439, 2020.

[5] Ehsan Amid, Manfred K. Warmuth, Rohan Anil, and Tomer Koren. Robust bi-tempered logistic loss based on Bregman divergences. Advances in Neural Information Processing Systems, 32, 2019.

[6] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244-253. PMLR, 2018.

[7] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.

[8] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121-164, 2012.

[9] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619, 2019.

[10] Heinz H. Bauschke, Jérôme Bolte, and Marc Teboulle. A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Mathematics of Operations Research, 42(2):330-348, 2017.

[11] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

[12] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[13] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[14] Udaya Ghai, Elad Hazan, and Yoram Singer. Exponentiated gradient meets gradient descent. In Algorithmic Learning Theory, pages 386-407. PMLR, 2020.

[15] Suriya Gunasekar, Blake Woodworth, and Nathan Srebro. Mirrorless mirror descent: A natural derivation of mirror descent. In International Conference on Artificial Intelligence and Statistics, pages 2305-2313. PMLR, 2021.

[16] Elad Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

[17] Elad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games. In International Conference on Machine Learning, pages 1433-1441. PMLR, 2017.

[18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

[19] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724-1732. PMLR, 2017.

[20] Kenji Kawaguchi. Deep learning without poor local minima. Advances in Neural Information Processing Systems, 29, 2016.

[21] Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global.
In International Conference on Machine Learning, pages 2902-2907. PMLR, 2018.

[22] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246-1257. PMLR, 2016.

[23] Zhiyuan Li, Tianhao Wang, Jason D. Lee, and Sanjeev Arora. Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. arXiv preprint arXiv:2207.04036, 2022.

[24] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

[25] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem complexity and method efficiency in optimization. SIAM Review, 1983.

[26] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 2018.

[27] Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. Advances in Neural Information Processing Systems, 24, 2011.

[28] Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6(34):995-1018, 2005.

[29] Ilnura Usmanova, Maryam Kamgarpour, Andreas Krause, and Kfir Levy. Fast projection onto convex smooth constraints. In International Conference on Machine Learning, pages 10476-10486. PMLR, 2021.

[30] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, ICML '03, pages 928-935, 2003.

[31] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928-936, 2003.

[32] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.