# Reparameterizing Mirror Descent as Gradient Descent

Ehsan Amid and Manfred K. Warmuth
Google Research, Brain Team, Mountain View, CA
{eamid, manfred}@google.com

**Abstract.** Most of the recent successful applications of neural networks have been based on training with gradient descent updates. However, for some small networks, other mirror descent updates learn provably more efficiently when the target is sparse. We present a general framework for casting a mirror descent update as a gradient descent update on a different set of parameters. In some cases, the mirror descent reparameterization can be described as training a modified network with standard backpropagation. The reparameterization framework is versatile and covers a wide range of mirror descent updates, even cases where the domain is constrained. Our construction for the reparameterization argument is done for the continuous versions of the updates. Finding general criteria for the discrete versions to closely track their continuous counterparts remains an interesting open problem.

*An earlier version of this manuscript (with additional results on the matrix case) appeared as "Interpolating Between Gradient Descent and Exponentiated Gradient Using Reparameterized Gradient Descent" as a preprint.*

## 1 Introduction

Mirror descent (MD) [Nemirovski and Yudin, 1983, Kivinen and Warmuth, 1997] refers to a family of updates which transform the parameters $w \in C$ from a convex domain $C \subseteq \mathbb{R}^d$ via a link function (a.k.a. mirror map) $f: C \to \mathbb{R}^d$ before applying the descent step. The continuous-time mirror descent (CMD) update, which can be seen as the limit case of (discrete-time) MD, corresponds to the solution of the following ordinary differential equation (ODE) [Nemirovski and Yudin, 1983, Warmuth and Jagota, 1998, Raginsky and Bouvrie, 2012]:

$$\dot f(w(t)) = -\nabla L(w(t)) \,. \qquad \text{(CMD)} \quad (1)$$

Here $L$ denotes a differentiable real-valued loss and $\dot f(w(t)) = \partial f(w(t))/\partial t$ is the time derivative of the link function applied to the weights. The vanilla discretized MD update

$$w(t + h) = f^{-1}\big(f(w(t)) - h\, \nabla L(w(t))\big)$$

is obtained by setting the step size to $h$. The main link functions investigated in the past are $f(w) = w$ and $f(w) = \log(w)$, leading to the gradient descent (GD) and the unnormalized exponentiated gradient (EGU) families of updates.² These two link functions are associated with the squared Euclidean and the relative entropy divergences, respectively. For example, the classical Perceptron and Winnow algorithms are motivated using the identity and log links, respectively, when the loss is the hinge loss. A number of papers discuss the difference between the two updates [Kivinen and Warmuth, 1997, Kivinen et al., 2006, Nie et al., 2016, Ghai et al., 2020], and their rotational invariance properties have been explored in [Warmuth et al., 2014]. In particular, the Hadamard problem is a paradigmatic linear problem which shows that EGU can converge dramatically faster than GD when the instances are dense and the target weight vector is sparse [Kivinen et al., 1997, Warmuth and Vishwanathan, 2005]. This property is linked to the strong convexity of the relative entropy w.r.t. the L1-norm³ [Shalev-Shwartz et al., 2012], which motivates the discrete EGU update.

²The normalized version is called EG and the two-sided version EGU±. More about this later.
³Whereas the squared Euclidean divergence (which motivates GD) is strongly convex w.r.t. the L2-norm.
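To make the two classical links concrete, the following Python sketch (ours, not from the paper; the squared loss, sizes, and step size are illustrative assumptions) implements one discretized MD step $w_{s+1} = f^{-1}\big(f(w_s) - h\,\nabla L(w_s)\big)$ and instantiates it for the identity link (GD) and the log link (EGU):

```python
import numpy as np

# A minimal sketch (ours, not from the paper): one discretized MD step
# w_{s+1} = f^{-1}(f(w_s) - h * grad L(w_s)) for the identity link (GD)
# and the log link (unnormalized exponentiated gradient, EGU).

def md_step(w, grad, h, f, f_inv):
    """Generic discrete mirror descent step with link f and inverse f_inv."""
    return f_inv(f(w) - h * grad)

def gd_step(w, grad, h):
    return md_step(w, grad, h, lambda v: v, lambda v: v)   # f(w) = w

def egu_step(w, grad, h):
    return md_step(w, grad, h, np.log, np.exp)             # f(w) = log(w)

# Toy usage on a sparse linear problem with squared loss (all sizes and
# step sizes are illustrative assumptions and may need tuning).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
w_star = np.zeros(50)
w_star[:3] = 1.0                                           # sparse target
y = X @ w_star
w = np.full(50, 0.1)                                       # log link needs w > 0
for _ in range(2000):
    w = egu_step(w, X.T @ (X @ w - y), h=0.005)
```

Note that the log link yields a multiplicative update $w_{s+1} = w_s \odot \exp(-h\, \nabla L(w_s))$, which keeps the weights positive and drives irrelevant coordinates toward zero.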
**Contributions** Although other MD updates can be drastically more efficient than GD updates on certain classes of problems, it was assumed that such MD updates are not realizable using GD. In this paper, we show that in fact a large number of MD updates (such as EGU, and those motivated by the Burg and inverse divergences) can be reparameterized as GD updates. Concretely, our contributions can be summarized as follows. We cast continuous MD updates as minimizing a trade-off between a Bregman momentum and the loss. We also derive the dual, natural gradient, and constrained versions of the updates. We then provide a general framework that allows reparameterizing one CMD update by another. It requires the existence of a certain reparameterization function and a condition on the derivatives of the two link functions as well as the reparameterization function. Specifically, we show that on certain problems, the implicit bias of the GD updates can be controlled by considering a family of tempered updates (parameterized by a temperature $\tau \in \mathbb{R}$) that interpolate between GD (with $\tau = 0$) and EGU (with $\tau = 1$), while covering a wider class of updates. We conclude the paper with a number of open problems for future research directions.

**Previous work** There has been an increasing amount of interest recently in determining the implicit bias of learning algorithms [Gunasekar et al., 2017, 2018, Vaskevicius et al., 2019]. Here, we mainly focus on the MD updates. The special case of reparameterizing continuous EGU as continuous GD was already known [Akin, 1979, Amid and Warmuth, 2020]. In this paper, we develop a more general framework for reparameterizing one CMD update by another. We give a large variety of examples for reparameterizing CMD updates as continuous GD updates. The main new examples we consider are based on the tempered versions of the relative entropy divergence [Amid et al., 2019]. The main open problem regarding the CMD updates is whether the discretization of the reparameterized updates tracks the discretization of the original (discretized) MD updates. The strongest methodology for showing this would be to prove the same regret bounds for the discretized reparameterized update as for the original. This has been done on a case-by-case basis for the EG family [Amid and Warmuth, 2020]. For more discussion see the conclusion section, where we also discuss how our reparameterization method allows exploring the effect of the structure of the neural network on the implicit bias.

**Some basic notation** We use $\odot$, $\oslash$, and the superscript $(\cdot)^{\odot}$ for element-wise product, division, and power, respectively. We let $w(t)$ denote the weight or parameter vector as a function of time $t$. Learning proceeds in steps. During step $s$, we start with weight vector $w(sh) = w_s$ and go to $w((s+1)h) = w_{s+1}$ while processing a batch of examples. We also write the Jacobian of a vector-valued function $q$ as $J_q$ and use $H_F$ to denote the Hessian of a scalar function $F$. Furthermore, we let $\nabla_w F(w(t))$ denote the gradient of the function $F(w)$ evaluated at $w(t)$, and often drop the subscript $w$ for conciseness.

## 2 Continuous-time Mirror Descent
For a strictly convex, continuously-differentiable function $F: C \to \mathbb{R}$ with convex domain $C \subseteq \mathbb{R}^d$, the Bregman divergence between $\tilde w, w \in C$ is defined as

$$D_F(\tilde w, w) := F(\tilde w) - F(w) - f(w)^\top (\tilde w - w) \,, \qquad (2)$$

where $f := \nabla F$ denotes the gradient of $F$, sometimes called the link function.⁴ Trading off the divergence to the last parameter $w_s$ with the current loss lets us motivate the iterative mirror descent (MD) updates [Nemirovski and Yudin, 1983, Kivinen and Warmuth, 1997]:

$$w_{s+1} = \operatorname*{argmin}_w \; \tfrac{1}{h}\, D_F(w, w_s) + L(w) \,, \qquad (3)$$

where $h > 0$ is often called the learning rate. Solving for $w_{s+1}$ yields the so-called prox or implicit update [Rockafellar, 1976, Nemirovski and Yudin, 1983, Kivinen et al., 2006]:

$$f(w_{s+1}) = f(w_s) - h\, \nabla L(w_{s+1}) \,. \qquad (4)$$

This update is typically approximated by the following explicit update that uses the gradient at the old parameter $w_s$ instead (denoted here as the MD update):

$$f(w_{s+1}) = f(w_s) - h\, \nabla L(w_s) \,. \qquad \text{(MD)} \quad (5)$$

⁴The gradient of a scalar function is a special case of a Jacobian, and should therefore be denoted by a row vector. However, in this paper we use the more common column vector notation for gradients, i.e. $\nabla F(w) := (\partial F / \partial w)^\top$.

We now show that the CMD update (1) can be motivated similarly by replacing the Bregman divergence in the minimization problem (3) with a momentum version which quantifies the rate of change in the value of the Bregman divergence as $w(t)$ varies over time. For the convex function $F$, we define the Bregman momentum between $w(t), w_0 \in C$ as the time differential of the Bregman divergence induced by $F$,

$$\dot D_F(w(t), w_0) = \dot F(w(t)) - f(w_0)^\top \dot w(t) = \big(f(w(t)) - f(w_0)\big)^\top \dot w(t) \,.$$

**Theorem 1** (Main result #1). *The CMD update⁵ $\dot f(w(t)) = -\nabla L(w(t))$, with initial condition $w(0) = w_0$, is the solution of the following functional minimization:⁶*

$$\min_{\text{curve } w(t)} \Big\{ \dot D_F(w(t), w_0) + L(w(t)) \Big\} \,. \qquad (6)$$

*Proof.* Setting the derivatives w.r.t. $w(t)$ to zero, we have

$$\nabla_{w(t)} \Big[ \big(f(w(t)) - f(w_0)\big)^\top \dot w(t) + L(w(t)) \Big] = H_F(w(t))\, \dot w(t) + \nabla L(w(t)) = \dot f(w(t)) + \nabla L(w(t)) = 0 \,,$$

where we use the fact that $w(t)$ and $\dot w(t)$ are independent variables⁷ [Burke, 1985], thus $\partial \dot w(t) / \partial w(t) = 0$. ∎

Note that the implicit update (4) and the explicit update (5) can both be realized as the backward and the forward Euler approximations of (1), respectively, with step size $h$. Alternatively, (3) can be obtained from (6) via a simple discretization of the momentum term (see Appendix C).

⁵An equivalent integral form of the CMD update is $w(t) = f^{-1}\big( f(w_0) - \int_{z=0}^{t} \nabla L(w(z))\, dz \big)$.
⁶The objective of (3) is essentially a discretization of the objective of (6). See Appendix C.
⁷That is, the value of one variable does not depend on changes in the other. An expanded derivation is given in Appendix A.

We can provide an alternative definition of the Bregman momentum in terms of the dual of the function $F$. If $F^*(w^*) = \sup_{\tilde w \in C} \big( \tilde w^\top w^* - F(\tilde w) \big)$ denotes the Fenchel dual of $F$ and $w = \arg\sup_{\tilde w \in C} \big( \tilde w^\top w^* - F(\tilde w) \big)$, then the following relations hold between the pair of dual variables $(w, w^*)$:

$$w = f^*(w^*) \,, \quad w^* = f(w) \,, \quad \text{and} \quad f^* = f^{-1} \,. \qquad (7)$$

Taking the derivative of $w(t)$ and $w^*(t)$ w.r.t. $t$ yields

$$\dot w(t) = H_{F^*}(w^*(t))\, \dot w^*(t) \,, \qquad (8)$$
$$\dot w^*(t) = H_F(w(t))\, \dot w(t) \,. \qquad (9)$$

This pairing allows rewriting the Bregman momentum in its dual form:

$$\dot D_F(w(t), w_0) = \dot D_{F^*}(w_0^*, w^*(t)) = \big(w^*(t) - w_0^*\big)^\top H_{F^*}(w^*(t))\, \dot w^*(t) \,. \qquad (10)$$

Using (9), we can rewrite the CMD update (1) as

$$\dot w(t) = -H_F^{-1}(w(t))\, \nabla L(w(t)) \,, \qquad \text{(NGD)} \quad (11)$$

i.e. a natural gradient descent (NGD) update [Amari, 1998] w.r.t. the Riemannian metric $H_F$. Using $\nabla_w L(w) = H_F(w)\, \nabla_{w^*} L\big(f^*(w^*)\big)$ and $H_F(w) = H_{F^*}^{-1}(w^*)$, the CMD update (1) can be written equivalently in the dual domain $w^*$ as an NGD update w.r.t.
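As a concrete illustration of the equivalence between the link-space form (1) and the NGD form (11), the following sketch (ours; the log link, the test point, and step sizes are illustrative assumptions) compares their forward-Euler discretizations:

```python
import numpy as np

# A small numerical sketch (ours): the link-space discretization of (1),
# f(w_{s+1}) = f(w_s) - h * grad L(w_s), and the NGD form (11),
# w_{s+1} = w_s - h * H_F^{-1}(w_s) grad L(w_s), agree to O(h^2) per step.
# Here F(w) = sum_i (w_i log w_i - w_i), so f(w) = log(w), H_F^{-1}(w) = diag(w).

def link_space_step(w, grad, h):
    return np.exp(np.log(w) - h * grad)

def ngd_step(w, grad, h):
    return w - h * w * grad        # H_F^{-1}(w) grad = diag(w) grad, elementwise

w = np.array([0.5, 1.0, 2.0])
grad = np.array([0.3, -0.2, 0.1])
for h in (1e-2, 1e-3, 1e-4):
    # differences shrink like h^2, reflecting the continuous-time equivalence
    print(h, np.max(np.abs(link_space_step(w, grad, h) - ngd_step(w, grad, h))))
```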
the Riemannian metric $H_{F^*}$, or by applying (8) as a CMD update with the link $f^*$:

$$\dot w^*(t) = -H_{F^*}^{-1}(w^*(t))\, \nabla_{w^*} L\big(f^*(w^*(t))\big) \,, \qquad (12)$$
$$\dot f^*(w^*(t)) = -\nabla_{w^*} L\big(f^*(w^*(t))\big) \,. \qquad (13)$$

The equivalence of the primal-dual updates was already shown in [Warmuth and Jagota, 1998] for the continuous case and in [Raskutti and Mukherjee, 2015] for the discrete case (where it only holds in one direction). We will show that the equivalence relation is a special case of the reparameterization theorem, introduced in the next section. In the following, we discuss the projected CMD updates for the constrained setting.

**Proposition 1.** *The CMD update with the additional constraint $\Gamma(w(t)) = 0$ for some function $\Gamma: \mathbb{R}^d \to \mathbb{R}^m$ s.t. $\{w \in C \,|\, \Gamma(w) = 0\}$ is non-empty, amounts to the projected gradient updates*

$$\dot f(w(t)) = -P(w(t))\, \nabla L(w(t)) \quad \& \quad \dot f^*(w^*(t)) = -P(w(t))^\top\, \nabla_{w^*} L\big(f^*(w^*(t))\big) \,, \qquad (14)$$

*where $P := I_d - J_\Gamma^\top \big(J_\Gamma H_F^{-1} J_\Gamma^\top\big)^{-1} J_\Gamma H_F^{-1}$ is the projection matrix onto the tangent space of $\Gamma$ at $w(t)$ and $J_\Gamma := J_\Gamma(w(t))$. Equivalently, the update can be written as a projected natural gradient descent update*

$$\dot w(t) = -H_F^{-1}(w(t))\, P(w(t))\, \nabla L(w(t)) \quad \& \quad \dot w^*(t) = -H_{F^*}^{-1}(w^*(t))\, P(w(t))^\top\, \nabla_{w^*} L\big(f^*(w^*(t))\big) \,. \qquad (15)$$

**Example 1** ((Normalized) EG). The unnormalized EG update is motivated using the link function $f(w) = \log w$. Adding the linear constraint $\Gamma(w) = w^\top \mathbf{1} - 1$ to the unnormalized EG update results in the (normalized) EG update [Kivinen and Warmuth, 1997]. Since $J_\Gamma(w) = \mathbf{1}^\top$ and $H_F(w)^{-1} = \operatorname{diag}(w)$,

$$P = I - \frac{\mathbf{1}\mathbf{1}^\top \operatorname{diag}(w)}{\mathbf{1}^\top \operatorname{diag}(w)\, \mathbf{1}} = I - \mathbf{1} w^\top \quad \text{(using } \mathbf{1}^\top w = 1\text{)} \,,$$

and the projected CMD update (14) (the continuous EG update) and its NGD form (15) become

$$\frac{d}{dt}\log(w) = -(I - \mathbf{1} w^\top)\, \nabla L(w) = -\big(\nabla L(w) - \mathbf{1}\, w^\top \nabla L(w)\big) \,, \quad \dot w = -\big(\operatorname{diag}(w)\, \nabla L(w) - w\, w^\top \nabla L(w)\big) \,.$$

## 3 Reparameterization

We now establish the second main result of the paper.

**Theorem 2** (Main result #2). *Let $F$ and $G$ be strictly convex, continuously-differentiable functions with domains in $\mathbb{R}^d$ and $\mathbb{R}^k$, respectively, s.t. $k \geq d$. Let $q: \mathbb{R}^k \to \mathbb{R}^d$ be a reparameterization function expressing the parameters $w$ of $F$ uniquely as $q(u)$ where $u$ lies in the domain of $G$. Then the CMD update on parameter $w$ for the convex function $F$ (with link $f(w) = \nabla F(w)$) and loss $L(w)$,*

$$\dot f(w(t)) = -\nabla L(w(t)) \,,$$

*coincides with the CMD update on parameters $u$ for the convex function $G$ (with link $g(u) := \nabla G(u)$) and the composite loss $L \circ q$,*

$$\dot g(u(t)) = -\nabla_u (L \circ q)(u(t)) \,,$$

*provided that $w(0) = q(u(0))$ and $\operatorname{range}(q) \subseteq \operatorname{dom}(F)$ hold, and we have*

$$H_F^{-1}(w) = J_q(u)\, H_G^{-1}(u)\, J_q(u)^\top \,, \quad \text{for all } w = q(u) \,.$$

*Proof.* Note that (dropping $t$ for simplicity) we have $\dot w = J_q(u)\, \dot u$ and $\nabla_u (L \circ q)(u) = J_q(u)^\top \nabla_w L(w)$. The CMD update on $u$ with the link function $g(u)$ can be written in the NGD form as $\dot u = -H_G^{-1}(u)\, \nabla_u (L \circ q)(u)$. Thus,

$$\dot u = -H_G^{-1}(u)\, J_q(u)^\top\, \nabla_w L(w) \,.$$

Multiplying by $J_q(u)$ from the left yields

$$\dot w = -J_q(u)\, H_G^{-1}(u)\, J_q(u)^\top\, \nabla_w L(w) \,.$$

Comparing the result to (11) concludes the proof. ∎

In the following examples, we will mainly consider reparameterizing a CMD update with the link function $f(w)$ as a GD update on $u$, for which we have $H_G = I_k$.

**Example 2** (EGU as GD). The continuous-time EGU update can be reparameterized as continuous GD with the reparameterization function $w = q(u) = \frac14\, u \odot u = \frac14\, u^{\odot 2}$, i.e.

$$\frac{d}{dt}\log(w) = -\nabla L(w) \quad \text{equals} \quad \dot u = -\underbrace{\nabla_u (L \circ q)(u)}_{\nabla_u L(\frac14 u^{\odot 2})} = -\tfrac12\, u \odot \nabla L(w) \,.$$

This is proven by verifying the condition of Theorem 2:

$$J_q(u)\, J_q(u)^\top = \tfrac12 \operatorname{diag}(u)\, \big(\tfrac12 \operatorname{diag}(u)\big)^\top = \operatorname{diag}\big(\tfrac14\, u^{\odot 2}\big) = \operatorname{diag}(w) = H_F^{-1}(w) \,.$$

**Example 3** (Reduced EG in 2 dimensions). Consider the 2-dimensional normalized weights $w = [\omega, 1 - \omega]^\top$ where $0 \leq \omega \leq 1$. The normalized reduced EG update [Warmuth and Jagota, 1998] is motivated by the link function $f(\omega) = \log \frac{\omega}{1 - \omega}$, thus $H_F(\omega) = \frac{1}{\omega} + \frac{1}{1 - \omega} = \frac{1}{\omega (1 - \omega)}$. This update can be reparameterized as a GD update on $u \in \mathbb{R}$ via $\omega = q(u) = \frac12\big(1 + \sin(u)\big)$, i.e.

$$\frac{d}{dt}\log \frac{\omega}{1 - \omega} = -\nabla_\omega L(\omega) \quad \text{equals} \quad \dot u = -\underbrace{\nabla_u (L \circ q)(u)}_{\nabla_u L(\frac12 (1 + \sin(u)))} = -\tfrac12 \cos(u)\, \nabla_\omega L(\omega) \,.$$

This is verified by checking the condition of Theorem 2: $J_q(u) = \frac12 \cos(u)$ and

$$J_q(u)\, J_q(u)^\top = \tfrac14 \cos^2(u) = \tfrac14 \big(1 - \sin^2(u)\big) = \omega (1 - \omega) = H_F^{-1}(\omega) \,.$$
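A quick numerical sanity check of Example 2 is given below (ours; the quadratic loss, start vector, step size, and horizon are illustrative assumptions): Euler steps on the EGU ODE should track GD on $u$ for the composite loss $L(u^{\odot 2}/4)$.

```python
import numpy as np

# Numerical sanity check of Example 2 (a sketch, not from the paper).
target = np.array([0.2, 0.7, 0.1])

def grad_L(w):                             # toy loss L(w) = 0.5 * ||w - target||^2
    return w - target

h, steps = 1e-3, 5000
w = np.ones(3)                             # EGU iterate
u = 2.0 * np.sqrt(w)                       # matched start: w(0) = q(u(0)) = u^2/4
for _ in range(steps):
    w = np.exp(np.log(w) - h * grad_L(w))          # EGU, Euler on the log link
    u = u - h * 0.5 * u * grad_L(u ** 2 / 4.0)     # GD on the composite loss
print(np.max(np.abs(w - u ** 2 / 4.0)))    # small, and shrinks as h -> 0
```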
**Open problem** The generalization of the reduced EG link function to $d > 2$ dimensions becomes $f(w) = \log \frac{w}{1 - \sum_{i=1}^{d-1} w_i}$, which utilizes the first $(d-1)$ dimensions $w$ s.t. $[w^\top, w_d]^\top \in \Delta^{d-1}$. Reparameterizing the CMD update using this link as continuous GD is open. The update can be reformulated as

$$\dot w = -\big(\operatorname{diag}(w) - w w^\top\big)\, \nabla L(w) \,.$$

We will give a $d$-dimensional version of EG using a projection onto a constraint in Example 6.

**Example 4** (Burg updates as GD). The update associated with the negative Burg entropy $F(w) = -\sum_{i=1}^d \log w_i$ and link $f(w) = -\mathbf{1} \oslash w$ is reparameterized as GD with $w = q(u) := \exp(u)$, i.e.

$$\frac{d}{dt}\big({-\mathbf{1}} \oslash w\big) = -\nabla L(w) \quad \text{equals} \quad \dot u = -\underbrace{\nabla_u (L \circ q)(u)}_{\nabla_u L(\exp(u))} = -\exp(u) \odot \nabla L(w) \,.$$

This is verified by the condition of Theorem 2: $H_F(w) = \operatorname{diag}\big(\mathbf{1} \oslash w^{\odot 2}\big)$, $J_q(u) = \operatorname{diag}(\exp(u))$, and

$$J_q(u)\, J_q(u)^\top = \operatorname{diag}(\exp(u))^2 = \operatorname{diag}\big(w^{\odot 2}\big) = H_F^{-1}(w) \,.$$

**Example 5** (EGU as Burg). The reparameterization steps can be chained, and applied in reverse when the reparameterization function $q$ is invertible. For instance, we can first apply the inverse reparameterization of the Burg update as GD from Example 4, i.e. $u = q^{-1}(w) = \log w$. Subsequently, applying the reparameterization of EGU as GD from Example 2, i.e. $v = q(u) = \frac14\, u^{\odot 2}$, results in the reparameterization of the EGU update on $v$ as a Burg update on $w$, that is,

$$\frac{d}{dt}\log(v) = -\nabla L(v) \quad \text{equals} \quad \frac{d}{dt}\big({-\mathbf{1}} \oslash w\big) = -\underbrace{\nabla_w (L \circ q \circ q^{-1})(w)}_{\nabla_w L(\frac14 (\log w)^{\odot 2})} = -\big(\log(w) \oslash (2 w)\big) \odot \nabla L(v) \,.$$

For completeness, we also provide the constrained reparameterized updates (proof in Appendix B).

**Theorem 3.** *The constrained CMD update (14) coincides with the reparameterized projected gradient update on the composite loss,*

$$\dot g(u(t)) = -P_q(u(t))\, \nabla_u (L \circ q)(u(t)) \,,$$

*where $P_q := I_k - J_{\Gamma \circ q}^\top \big(J_{\Gamma \circ q} H_G^{-1} J_{\Gamma \circ q}^\top\big)^{-1} J_{\Gamma \circ q} H_G^{-1}$ is the projection matrix onto the tangent space at $u(t)$ and, by the chain rule, $J_{\Gamma \circ q}(u) := J_\Gamma(w)\, J_q(u)$.*

**Example 6** (EG as GD). We now extend the reparameterization of the EGU update as GD in Example 2 to the normalized case in terms of a projected GD update. Combining $q(u) = \frac14\, u^{\odot 2}$ with $\Gamma(w) = \mathbf{1}^\top w - 1$, we have $J_{\Gamma \circ q}(u) = \mathbf{1}^\top\, \tfrac12 \operatorname{diag}(u) = \tfrac12\, u^\top$ and (using $\|u\|^2 = 4\, \mathbf{1}^\top w = 4$ on the constraint)

$$P_q(u) = I - \frac{u u^\top}{\|u\|^2} = I - \tfrac14\, u u^\top \,.$$

Thus,

$$\dot u = -\big(I - \tfrac14\, u u^\top\big)\, \nabla_u L\big(\tfrac14\, u^{\odot 2}\big) \,, \quad \text{with } w(t) = \tfrac14\, u(t)^{\odot 2} \,,$$

equals the normalized EG update in Example 1. Note that similar ideas were explored in an evolutionary game theory context in [Sandholm, 2010].

## 4 Tempered Updates

In this section, we consider a richer class of examples derived using the tempered relative entropy divergence [Amid et al., 2019], parameterized by a temperature $\tau \in \mathbb{R}$. As we will see, the tempered updates allow interpolating between many well-known cases. We start with the tempered logarithm link function [Naudts, 2002]:

$$f_\tau(w) = \log_\tau(w) = \frac{1}{1 - \tau}\big(w^{1 - \tau} - 1\big) \,, \qquad (16)$$

for $w \geq 0$ and $\tau \in \mathbb{R}$. The $\log_\tau$ function is shown in Figure 1 for different values of $\tau \geq 0$. Note that $\tau = 1$ recovers the standard log function as a limit point.

*Figure 1: $\log_\tau(x)$, for different $\tau \geq 0$.*

The $\log_\tau(w)$ link function is the gradient of the convex function

$$F_\tau(w) = \sum_i \Big( w_i \log_\tau w_i + \frac{1}{2 - \tau}\big(1 - w_i^{2 - \tau}\big) \Big) = \sum_i \Big( \frac{1}{(1 - \tau)(2 - \tau)}\, w_i^{2 - \tau} - \frac{1}{1 - \tau}\, w_i + \frac{1}{2 - \tau} \Big) \,.$$

The convex function $F_\tau$ induces the following tempered Bregman divergence:⁸

$$D_{F_\tau}(\tilde w, w) = \sum_i \Big( \tilde w_i \log_\tau \tilde w_i - \tilde w_i \log_\tau w_i - \frac{\tilde w_i^{2 - \tau} - w_i^{2 - \tau}}{2 - \tau} \Big) = \sum_i \Big( \frac{\tilde w_i^{2 - \tau} - w_i^{2 - \tau}}{(1 - \tau)(2 - \tau)} - \frac{(\tilde w_i - w_i)\, w_i^{1 - \tau}}{1 - \tau} \Big) \,. \qquad (17)$$

⁸The second form is more commonly known as the β-divergence [Cichocki and Amari, 2010] with $\beta = 2 - \tau$.

For $\tau = 0$, we obtain the squared Euclidean divergence $D_{F_0}(\tilde w, w) = \frac12 \|\tilde w - w\|_2^2$ and for $\tau = 1$, the relative entropy $D_{F_1}(\tilde w, w) = \sum_i \big( \tilde w_i \log(\tilde w_i / w_i) - \tilde w_i + w_i \big)$ (see [Amid et al., 2019] for an extensive list of examples). In the following, we derive the CMD updates using the time derivative of (17) as the tempered Bregman momentum.
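The following sketch (ours) implements the tempered link (16), its inverse (used as the tempered exponential in Section 4.1), and the divergence (17), and checks the $\tau = 0$ special case:

```python
import numpy as np

# Sketch of the tempered link (16), its inverse, and the tempered Bregman
# divergence (17); tau = 1 is handled as the limit case.
def log_t(w, tau):
    return np.log(w) if tau == 1 else (w ** (1 - tau) - 1) / (1 - tau)

def exp_t(x, tau):                          # inverse of log_t, with the [.]_+ clamp
    return np.exp(x) if tau == 1 else np.maximum(1 + (1 - tau) * x, 0) ** (1 / (1 - tau))

def bregman_t(w_tilde, w, tau):
    if tau == 1:                            # relative entropy
        return np.sum(w_tilde * np.log(w_tilde / w) - w_tilde + w)
    return np.sum(w_tilde * log_t(w_tilde, tau) - w_tilde * log_t(w, tau)
                  - (w_tilde ** (2 - tau) - w ** (2 - tau)) / (2 - tau))

w_tilde, w = np.array([0.3, 0.9]), np.array([0.5, 0.4])
# tau = 0 recovers the squared Euclidean divergence:
print(np.isclose(bregman_t(w_tilde, w, 0.0), 0.5 * np.sum((w_tilde - w) ** 2)))
```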
Notice that the link function $\log_\tau(x)$ is only defined for $x \geq 0$ when $\tau > 0$. In order to have a weight $w \in \mathbb{R}^d$, we use the ±-trick [Kivinen and Warmuth, 1997] of maintaining two non-negative weight vectors $w_+$ and $w_-$ and setting $w = w_+ - w_-$. We call these the tempered EGU± updates, which contain the standard EGU± updates as the special case $\tau = 1$. As our final main result, we show that the continuous tempered EGU updates interpolate between continuous-time GD and continuous EGU (for $\tau \in [0, 1]$). Furthermore, these updates can be simulated by continuous GD on a new set of parameters $u$ using a simple reparameterization. We show that reparameterizing the tempered updates as GD updates on the composite loss $L \circ q$ changes the implicit bias of GD, making the updates converge to the solution with the smallest $L_{2-\tau}$-norm for arbitrary $\tau \in [0, 1]$.

### 4.1 Tempered EGU and Reparameterization

We first introduce the generalization of the EGU update using the tempered Bregman divergence (17). Let $w(t) \in \mathbb{R}^d_{\geq 0}$. The tempered EGU update is motivated by

$$\operatorname*{argmin}_{\text{curve } w(t) \in \mathbb{R}^d_{\geq 0}} \Big\{ \dot D_{F_\tau}(w(t), w_0) + L(w(t)) \Big\} \,.$$

This results in the CMD update

$$\frac{d}{dt}\log_\tau w(t) = -\nabla L(w(t)) \,. \qquad (18)$$

An equivalent integral version of this update is

$$w(t) = \exp_\tau\!\Big( \log_\tau w_0 - \int_0^t \nabla_w L(w(z))\, dz \Big) \,, \qquad (19)$$

where $\exp_\tau(x) := \big[1 + (1 - \tau)\, x\big]_+^{\frac{1}{1 - \tau}}$ is the inverse of the tempered logarithm (16). Note that $\tau = 1$ is a limit case which recovers the standard exp function, and the update (18) becomes the standard EGU update. Additionally, the GD update (on the non-negative orthant) is recovered at $\tau = 0$. As a result, the tempered EGU update (18) interpolates between GD and EGU for $\tau \in [0, 1]$ and generalizes beyond for values of $\tau > 1$ and $\tau < 0$.⁹

⁹For example, $\tau = 2$ corresponds to the Burg updates (Example 4).

We now show the reparameterization of the tempered EGU update (18) as GD. This corresponds to continuous-time gradient descent on the network of Figure 2.

*Figure 2: A reparameterized linear neuron where $w_i = |u_i|^{\frac{2}{2 - \tau}}$ as a two-layer sparse network: the value $\tau = 0$ reduces to GD while $\tau = 1$ simulates the EGU update.*

**Proposition 2** (Main result #3). *The tempered continuous EGU update can be reparameterized as continuous-time GD with the reparameterization function*

$$w = q_\tau(u) = \Big| \frac{2 - \tau}{2}\, u \Big|^{\odot \frac{2}{2 - \tau}} \,, \quad \text{for } u \in \mathbb{R}^d \text{ and } \tau \neq 2 \,, \qquad (20)$$

*i.e.*

$$\frac{d}{dt}\log_\tau(w) = -\nabla L(w) \quad \text{equals} \quad \dot u = -\underbrace{\nabla_u (L \circ q_\tau)(u)}_{\nabla_u L(|\frac{2-\tau}{2} u|^{\odot \frac{2}{2-\tau}})} = -\operatorname{sign}(u) \odot \Big| \frac{2 - \tau}{2}\, u \Big|^{\odot \frac{\tau}{2 - \tau}} \odot \nabla L(w) \,.$$

*Proof.* This is verified by checking the condition of Theorem 2. The lhs is

$$H_{F_\tau}^{-1}(w) = \big(J_{\log_\tau}(w)\big)^{-1} = \big(\operatorname{diag}(w^{\odot -\tau})\big)^{-1} = \operatorname{diag}\big(w^{\odot \tau}\big) \,.$$

Note that the Jacobian of $q_\tau$ is

$$J_{q_\tau}(u) = \operatorname{diag}\Big( \operatorname{sign}(u) \odot \Big| \frac{2 - \tau}{2}\, u \Big|^{\odot \frac{\tau}{2 - \tau}} \Big) = \operatorname{diag}\big( \operatorname{sign}(u) \odot q_\tau(u)^{\odot \frac{\tau}{2}} \big) \,.$$

Thus the rhs $J_{q_\tau}(u)\, J_{q_\tau}(u)^\top$ of the condition equals $\operatorname{diag}\big(q_\tau(u)^{\odot \tau}\big) = \operatorname{diag}\big(w^{\odot \tau}\big) = H_{F_\tau}^{-1}(w)$. ∎
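The reparameterization (20) is easy to instantiate. Below is a sketch (ours; the step size is an illustrative assumption) of $q_\tau$, its Jacobian diagonal, and one Euler step of the reparameterized update:

```python
import numpy as np

# A sketch of the reparameterization function (20) and one Euler step of the
# reparameterized update (the step size h is an illustrative assumption).
def q_t(u, tau):
    """w = q_tau(u) = |(2 - tau)/2 * u|^(2/(2 - tau)), elementwise; tau != 2."""
    return np.abs((2 - tau) / 2 * u) ** (2 / (2 - tau))

def jac_q_t_diag(u, tau):
    """Diagonal of J_{q_tau}: sign(u) * |(2 - tau)/2 * u|^(tau/(2 - tau))."""
    return np.sign(u) * np.abs((2 - tau) / 2 * u) ** (tau / (2 - tau))

def reparam_gd_step(u, grad_L, h, tau):
    """GD step on u for the composite loss L(q_tau(u)): u <- u - h * J_q^T grad."""
    return u - h * jac_q_t_diag(u, tau) * grad_L(q_t(u, tau))

u = np.array([1.0, -0.5, 2.0])
print(q_t(u, 1.0))   # tau = 1: w = u^2 / 4, matching Example 2 (simulates EGU)
print(q_t(u, 0.0))   # tau = 0: w = |u|, so the update on w reduces to GD
```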
### 4.2 Minimum-norm Solutions

We apply the (reparameterized) tempered EGU update to the underdetermined linear regression problem. For this, we first consider the ±-trick on (18), in which we set $w(t) = w_+(t) - w_-(t)$ where

$$\frac{d}{dt}\log_\tau w_+(t) = -\nabla_w L(w(t)) \,, \quad \frac{d}{dt}\log_\tau w_-(t) = +\nabla_w L(w(t)) \,. \qquad (21)$$

Note that using the ±-trick, we have $w(t) \in \mathbb{R}^d$. We call the updates (21) the tempered EGU±. The reparameterization of the tempered EGU± updates as GD can be written by applying Proposition 2,

$$\dot u_+(t) = -\nabla_{u_+} L\big( q_\tau(u_+(t)) - q_\tau(u_-(t)) \big) \,, \quad \dot u_-(t) = -\nabla_{u_-} L\big( q_\tau(u_+(t)) - q_\tau(u_-(t)) \big) \,, \qquad (22)$$

and setting $w(t) = q_\tau(u_+(t)) - q_\tau(u_-(t))$. The strong convexity of the $F_\tau$ function w.r.t. the $L_{2-\tau}$-norm (see [Amid et al., 2019]) suggests that the updates motivated by the tempered Bregman divergence (17) yield the minimum $L_{2-\tau}$-norm solution in certain settings. We verify this by considering the following underdetermined linear regression problem.

Let $\{x_n, y_n\}_{n=1}^N$, where $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$, denote the set of input-output pairs and let $X \in \mathbb{R}^{N \times d}$ with $N < d$ be the design matrix whose $n$-th row is equal to $x_n^\top$. Also, let $y \in \mathbb{R}^N$ denote the vector of targets. Consider the tempered EGU± updates (21) on the weights $w(t) = w_+(t) - w_-(t)$ where $w_+(t), w_-(t) \geq 0$ and $w_+(0) = w_-(0) = w_0$. Following (19), we have

$$w_+(t) = \exp_\tau\!\Big( \log_\tau w_0 - X^\top\! \int_0^t \delta(z)\, dz \Big) \,, \quad w_-(t) = \exp_\tau\!\Big( \log_\tau w_0 + X^\top\! \int_0^t \delta(z)\, dz \Big) \,,$$

where $\delta(t) = X\big(w_+(t) - w_-(t)\big) - y$ denotes the residual.

**Theorem 4.** *Consider the underdetermined linear regression problem where $N < d$. Let $E = \{w \in \mathbb{R}^d \,|\, X w = y\}$ be the set of solutions with zero error. Given that $w(\infty) \in E$, the tempered EGU± updates (21) with temperature $0 \leq \tau \leq 1$ and initial solution $w_0 = \epsilon\, \mathbf{1}$, $\epsilon > 0$, converge to the minimum $L_{2-\tau}$-norm solution in $E$ in the limit $\epsilon \to 0$.*

*Proof.* We show that the solution of the tempered EGU± updates satisfies the dual feasibility and complementary slackness KKT conditions for the following optimization problem (omitting $t$ for simplicity):

$$\min_{w_+, w_-} \|w_+ - w_-\|_{2 - \tau}^{2 - \tau} \,, \ \text{ for } 0 \leq \tau \leq 1 \,, \quad \text{s.t. } X(w_+ - w_-) = y \ \text{ and } \ w_+, w_- \geq 0 \,.$$

Imposing the constraints using a set of Lagrange multipliers $\mu_+, \mu_- \geq 0$ and $\lambda \in \mathbb{R}^N$, we have

$$\min_{w_+, w_-}\; \sup_{\mu_+, \mu_- \geq 0,\, \lambda} \Big\{ \|w_+ - w_-\|_{2 - \tau}^{2 - \tau} - \lambda^\top \big(X(w_+ - w_-) - y\big) - \mu_+^\top w_+ - \mu_-^\top w_- \Big\} \,.$$

The set of KKT conditions is (absorbing the constant $2 - \tau$ into $\lambda$)

$$\begin{cases} w_+, w_- \geq 0 \,, \quad X w = y \,, & \text{(primal feasibility)} \\ \mu_+ = \operatorname{sign}(w) \odot |w|^{\odot (1 - \tau)} - X^\top \lambda \geq 0 \,, \quad \mu_- = -\operatorname{sign}(w) \odot |w|^{\odot (1 - \tau)} + X^\top \lambda \geq 0 \,, & \text{(dual feasibility)} \\ \mu_+ \odot w_+ = 0 \,, \quad \mu_- \odot w_- = 0 \,, & \text{(complementary slackness)} \end{cases}$$

where $w = w_+ - w_-$. The first condition is imposed by the form of the updates and the second condition is satisfied by the assumption as $t \to \infty$. Using $w_0 = \epsilon\, \mathbf{1}$ with $\epsilon \to 0$, we have

$$w_+(t) = \exp_\tau\!\Big( \log_\tau \epsilon - X^\top\! \int_0^t \delta(z)\, dz \Big) \,, \quad w_-(t) = \exp_\tau\!\Big( \log_\tau \epsilon + X^\top\! \int_0^t \delta(z)\, dz \Big) \,.$$

Setting $\lambda = -(1 - \tau) \int_0^\infty \delta(z)\, dz$ satisfies the remaining KKT conditions. ∎

**Corollary 1.** *Under the assumptions of Theorem 4, the reparameterized tempered EGU± updates (22) also recover the minimum $L_{2-\tau}$-norm solution, where $w(t) = q_\tau(u_+(t)) - q_\tau(u_-(t))$.*

This corollary shows that reparameterizing the loss in terms of the parameters $u$ changes the implicit bias of the GD updates. Similar results were observed before in terms of sparse signal recovery [Vaskevicius et al., 2019] and matrix factorization [Gunasekar et al., 2017]. Here, we show that this is a direct result of the dynamics induced by the reparameterization Theorem 2.
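An illustrative numerical check of Theorem 4 is sketched below (ours, not the paper's experiment; the squared loss, problem sizes, step size, and iteration count are assumptions and may need tuning). With $\tau = 1$ the recovered solution should favor a small L1-norm relative to the minimum L2-norm solution:

```python
import numpy as np

# Illustrative check of Theorem 4 (a sketch): discretized tempered EGU±
# updates (21) on an underdetermined least-squares problem with
# L(w) = 0.5 * ||Xw - y||^2, start eps * 1, and tau = 1.
def log_t(w, tau):
    return np.log(w) if tau == 1 else (w ** (1 - tau) - 1) / (1 - tau)

def exp_t(x, tau):
    return np.exp(x) if tau == 1 else np.maximum(1 + (1 - tau) * x, 0) ** (1 / (1 - tau))

rng = np.random.default_rng(1)
N, d, tau, eps, h = 10, 40, 1.0, 1e-4, 1e-3
X = rng.standard_normal((N, d))
y = X @ np.where(np.arange(d) < 2, 1.0, 0.0)      # sparse underlying target
wp, wm = np.full(d, eps), np.full(d, eps)
for _ in range(200_000):
    g = X.T @ (X @ (wp - wm) - y)                 # X^T delta(t)
    wp = exp_t(log_t(wp, tau) - h * g, tau)
    wm = exp_t(log_t(wm, tau) + h * g, tau)
w = wp - wm
w_min_l2 = X.T @ np.linalg.solve(X @ X.T, y)      # minimum L2-norm solution
print(np.linalg.norm(w, 1), np.linalg.norm(w_min_l2, 1))  # expect: tempered < L2
```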
## 5 Conclusion and Future Work

In this paper, we motivated the continuous-time mirror descent updates and provided a general framework for reparameterizing these updates. We also introduced the tempered EGU updates and their reparameterizations. The tempered EGU updates include the two commonly used gradient descent and exponentiated gradient updates, and interpolations between them. For the underdetermined linear regression problem we showed that, under certain conditions, the tempered EGU updates converge to the minimum $L_{2-\tau}$-norm solution. The current work leads to many interesting future directions:

- The focus of this paper was to develop the reparameterization method in full generality. Our reparameterization equivalence theorem holds only in continuous time, and the equivalence relation breaks down after discretization. However, in many important cases the discretized reparameterized updates closely track the discretized original updates [Amid and Warmuth, 2020]. This was done by proving the same online worst-case regret bounds for the discretized reparameterized updates as for the originals. A key research direction is to find general conditions for which this is true.

- Perhaps the most important application of the current work is reparameterizing the weights of deep neural networks for achieving sparse solutions or obtaining an implicit form of regularization that mimics a trade-off between the ridge and lasso methods (e.g. elastic net regularization [Zou and Hastie, 2005]). Here the key open question is the following: Are sparse networks (as in Figure 2) necessary, if the goal is to learn sample efficiently when the solution is sparse?¹⁰

- A general treatment of the convergence results for underdetermined linear regression should allow any start vector. Also, developing a matrix form of the reparameterization theorem is left open.

¹⁰This has recently been proven in Warmuth et al. [2020].

**Acknowledgement** Thanks to Zakaria Mhammedi for valuable feedback and to NSF grant IIS-1546459 for partial support.

**Broader Impact** The results of the paper suggest that mirror descent updates can be effectively used in neural networks by running backpropagation on the reparameterized form of the neurons. This may have a potential use case for training these networks more efficiently. This is a theoretical paper and the broader ethical impact discussion is not applicable.

## References

Ethan Akin. The geometry of population genetics, volume 31 of Lecture Notes in Biomathematics. Springer-Verlag, Berlin-New York, 1979.

Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Ehsan Amid and Manfred K. Warmuth. Winnowing with gradient descent. In Conference on Learning Theory (COLT), pages 163–182. PMLR, 2020.

Ehsan Amid, Manfred K. Warmuth, Rohan Anil, and Tomer Koren. Robust bi-tempered logistic loss based on Bregman divergences. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NeurIPS'19, Cambridge, MA, USA, 2019.

William L. Burke. Applied Differential Geometry. Cambridge University Press, 1985.

Andrzej Cichocki and Shun-ichi Amari. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.

Udaya Ghai, Elad Hazan, and Yoram Singer. Exponentiated gradient meets gradient descent. In Conference on Learning Theory (COLT), volume 117, pages 1–23. PMLR, 2020.

Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems (NeurIPS), pages 6151–6159, 2017.

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 9461–9471, 2018.

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

Jyrki Kivinen, Manfred K. Warmuth, and Peter Auer. The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97:325–343, December 1997.

Jyrki Kivinen, Manfred K. Warmuth, and Babak Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5):1782–1793, 2006.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Jan Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A, 316(1–4):323–334, 2002.
Arkadi Nemirovski and David B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, New York, 1983.

Jiazhong Nie, Wojciech Kotłowski, and Manfred K. Warmuth. Online PCA with optimal regret. Journal of Machine Learning Research, 17(1):6022–6070, 2016.

Maxim Raginsky and Jake Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 6793–6800. IEEE, 2012.

Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457, 2015.

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

William H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.

Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Tomas Vaskevicius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery. In Advances in Neural Information Processing Systems (NeurIPS), pages 2968–2979, 2019.

Manfred K. Warmuth and Arun Jagota. Continuous and discrete time nonlinear gradient descent: relative loss bounds and convergence. In E. Boros and R. Greiner, editors, Electronic Proceedings of the Fifth International Symposium on Artificial Intelligence and Mathematics, 1998. http://rutcor.rutgers.edu/~amai.

Manfred K. Warmuth and S.V.N. Vishwanathan. Leaving the span. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), pages 366–381, 2005.

Manfred K. Warmuth, Wojciech Kotłowski, and Shuisheng Zhou. Kernelization of matrix updates. Theoretical Computer Science, 558:159–178, 2014. Special issue for the 23rd International Conference on Algorithmic Learning Theory (ALT).

Manfred K. Warmuth, Wojciech Kotłowski, and Ehsan Amid. A case where a spindly two-layer linear network whips any neural network with a fully connected input layer. arXiv preprint arXiv:2010.08625, 2020.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.