# Acceleration through Optimistic No-Regret Dynamics

Jun-Kun Wang
College of Computing, Georgia Institute of Technology, Atlanta, GA 30313
jimwang@gatech.edu

Jacob Abernethy
College of Computing, Georgia Institute of Technology, Atlanta, GA 30313
prof@gatech.edu

Abstract

We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using online learning dynamics, where a classical technique involves simulating two no-regret algorithms that play against each other and, after $T$ rounds, the average iterate is guaranteed to solve the original optimization problem with error decaying as $O(\log T/T)$. In this paper we show that the technique can be enhanced to a rate of $O(1/T^2)$ by extending recent work [22, 25] that leverages optimistic learning to speed up equilibrium computation. The resulting optimization algorithm derived from this analysis coincides exactly with the well-known NESTEROVACCELERATION [16] method, and indeed the same story allows us to recover several variants of Nesterov's algorithm via small tweaks. We are also able to establish the accelerated linear rate for a function which is both strongly convex and smooth. This methodology unifies a number of different iterative optimization methods: we show that the HEAVYBALL algorithm is precisely the non-optimistic variant of NESTEROVACCELERATION, and recent prior work already established a similar perspective on FRANKWOLFE [2, 1].

1 Introduction

One of the most successful and broadly useful tools recently developed within the machine learning literature is the no-regret framework, and in particular online convex optimization (OCO) [28]. In the standard OCO setup, a learner is presented with a sequence of (convex) loss functions $\ell_1(\cdot), \ell_2(\cdot), \ldots$, must make a sequence of decisions $x_1, x_2, \ldots$ from some set $K$ in an online fashion, and observes $\ell_t$ only after having committed to $x_t$. Assuming the sequence $\{\ell_t\}$ is chosen by an adversary, the learner aims to minimize the average regret
$$R_T := \tfrac{1}{T}\Big(\sum_{t=1}^T \ell_t(x_t) - \min_{x \in K} \sum_{t=1}^T \ell_t(x)\Big)$$
against any such loss functions. Many simple algorithms have been developed for OCO problems, including MIRRORDESCENT, FOLLOWTHEREGULARIZEDLEADER, FOLLOWTHEPERTURBEDLEADER, etc., and these algorithms exhibit regret guarantees that are strong even against adversarial opponents. Under very weak conditions one can achieve a regret rate of $R_T = O(1/\sqrt{T})$, or even $R_T = O(\log T/T)$ when the $\ell_t$ possess the required curvature.

One can apply online learning tools to several problems, but perhaps the simplest is to find the approximate minimum of a convex function, $\operatorname{argmin}_{x \in K} f(x)$. With a simple reduction we set $\ell_t = f$, and it is easy to show via Jensen's inequality that the average iterate $\bar{x}_T := \frac{x_1 + \ldots + x_T}{T}$ satisfies
$$f(\bar{x}_T) \le \tfrac{1}{T}\sum_{t=1}^T f(x_t) = \tfrac{1}{T}\sum_{t=1}^T \ell_t(x_t) \le \min_{x \in K} \tfrac{1}{T}\sum_{t=1}^T \ell_t(x) + R_T = \min_{x \in K} f(x) + R_T,$$
hence $R_T$ upper bounds the approximation error. But this reduction, while simple and natural, is quite limited. For example, we know that when $f(\cdot)$ is smooth, more sophisticated algorithms such as FRANKWOLFE and HEAVYBALL achieve convergence rates of $O(1/T)$, whereas the now-famous NESTEROVACCELERATION algorithm achieves a rate of $O(1/T^2)$. The fast rate shown by Nesterov was quite surprising at the time, and many researchers to this day find the result quite puzzling.
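To make the reduction above concrete, here is a minimal sketch (ours, not from the paper) that runs online gradient descent on the fixed loss $\ell_t = f$ for a toy quadratic; the function, step size, and dimension are illustrative assumptions.

```python
import numpy as np

# A minimal sketch (not from the paper) of the reduction described above:
# set every loss to f itself, run a no-regret learner, and return the
# average iterate, whose suboptimality is bounded by the average regret R_T.

def f(x):        # toy smooth objective: f(x) = ||x - 1||^2 / 2
    return 0.5 * np.sum((x - 1.0) ** 2)

def grad_f(x):
    return x - 1.0

def minimize_via_no_regret(T=1000, dim=5):
    x = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x.copy())
        # online gradient descent with the standard O(1/sqrt(t)) step size
        x = x - (1.0 / np.sqrt(t)) * grad_f(x)
    return np.mean(iterates, axis=0)   # Jensen: f(x_bar) <= avg of f(x_t)

print(f(minimize_via_no_regret()))      # approaches min f = 0 as T grows
```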
There has been a great deal of work aimed at providing a more natural explanation of acceleration, with a more intuitive convergence proof [27, 4, 10]. This is indeed one of the main topics of the present work, and we will soon return to this discussion.

Another application of the no-regret framework is the solution of so-called saddle-point problems, which are equivalently referred to as Nash equilibria for zero-sum games. Given a function $g(x, y)$ which is convex in $x$ and concave in $y$ (often called a payoff function), define $V^* = \inf_{x \in K} \sup_y g(x, y)$. An $\epsilon$-equilibrium of $g(\cdot, \cdot)$ is a pair $\hat{x}, \hat{y}$ such that
$$V^* - \epsilon \le \inf_{x \in K} g(x, \hat{y}) \le V^* \le \sup_y g(\hat{x}, y) \le V^* + \epsilon. \tag{1}$$
One can find an approximate saddle point of the game with the following setup: implement a no-regret learning algorithm for both the $x$- and $y$-players simultaneously, and after observing the actions $\{x_t, y_t\}_{t=1\ldots T}$ return the time-averaged iterates $(\hat{x}, \hat{y}) = \big(\frac{x_1+\ldots+x_T}{T}, \frac{y_1+\ldots+y_T}{T}\big)$. A simple proof shows that $(\hat{x}, \hat{y})$ is an approximate equilibrium, with approximation bounded by the average regret of both players (see Theorem 1). In the case where the function $g(\cdot, \cdot)$ is biaffine, the no-regret reduction guarantees a rate of $O(1/\sqrt{T})$, and it was assumed by many researchers this was the fastest possible using this framework. But one of the most surprising online learning results to emerge in recent years established that no-regret dynamics can obtain an even faster rate of $O(1/T)$. Relying on tools developed by [8], this fact was first proved by [21] and extended by [25]. The new ingredient in this recipe is the use of optimistic learning algorithms, where the learner seeks to benefit from the predictability of slowly-changing inputs $\{\ell_t\}$.

We will consider solving the classical convex optimization problem $\min_x f(x)$, for smooth functions $f$, by instead solving an associated saddle-point problem which we call the Fenchel game. Specifically, we take the payoff function $g$ of the game to be
$$g(x, y) = \langle x, y \rangle - f^*(y), \tag{2}$$
where $f^*(\cdot)$ is the Fenchel conjugate of $f(\cdot)$. This is an appropriate choice of payoff function since $V^* = \min_x f(x)$ and $\sup_y g(\hat{x}, y) = \sup_y \langle \hat{x}, y \rangle - f^*(y) = f(\hat{x})$. Therefore, by the definition of an $\epsilon$-equilibrium, we have:

Lemma 1. If $(\hat{x}, \hat{y})$ is an $\epsilon$-equilibrium of the Fenchel game (2), then $f(\hat{x}) - \min_x f(x) \le \epsilon$. (A small numerical illustration of this conjugate identity is given after the list of contributions below.)

One can imagine computing the equilibrium of the Fenchel game using no-regret dynamics, and indeed this was the result of recent work [2] establishing the FRANKWOLFE algorithm as precisely an instance of two competing learning algorithms. In the present work we will take this approach even further.

1. We show that, by considering a notion of weighted regret, we can compute equilibria in the Fenchel game at a rate of $O(1/T^2)$ using no-regret dynamics, where the only required condition is that $f$ is smooth. This improves upon recent work [1] on a faster FRANKWOLFE method, which required strong convexity of $f$ (see Appendix J).
2. We show that the secret sauce for obtaining the fast rate is precisely the use of an optimistic no-regret algorithm, OPTIMISTICFTL [1], combined with an appropriate weighting scheme.
3. We show that, when viewed simply as an optimization algorithm, this method is identical to the original NESTEROVACCELERATION method. In addition, we recover several variants of NESTEROVACCELERATION (see [15, 17, 19]) using small tweaks of the framework.
4. We show that if one simply plays FOLLOWTHELEADER without optimism, the resulting algorithm is precisely HEAVYBALL. The latter is known to achieve a suboptimal rate in general, and our analysis sheds light on this difference.
5. Under the additional assumption that the function $f(\cdot)$ is strongly convex, we show that an accelerated linear rate can also be obtained from the game framework.
6. Finally, we show that the same equilibrium framework can also be extended to composite optimization, leading to a variant of the Accelerated Proximal Method.
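As a sanity check of the identity $\sup_y g(\hat{x}, y) = f(\hat{x})$ behind Lemma 1, the following sketch (our toy example, not from the paper) instantiates the Fenchel game payoff for a quadratic $f(x) = \frac{1}{2}x^\top A x$, whose conjugate $f^*(y) = \frac{1}{2}y^\top A^{-1} y$ is available in closed form.

```python
import numpy as np

# Sketch (our toy check, not from the paper): for f(x) = x^T A x / 2 with
# A positive definite, the conjugate is f*(y) = y^T A^{-1} y / 2, so the
# Fenchel game payoff g(x, y) = <x, y> - f*(y) satisfies
# sup_y g(xhat, y) = f(xhat), the identity behind Lemma 1.

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(1.0, 3.0, size=4))   # simple positive-definite A

f      = lambda x: 0.5 * x @ A @ x
f_conj = lambda y: 0.5 * y @ np.linalg.inv(A) @ y
g      = lambda x, y: x @ y - f_conj(y)

xhat = rng.standard_normal(4)
y_star = A @ xhat              # argmax_y g(xhat, y) solves y = grad f(xhat)
assert np.allclose(g(xhat, y_star), f(xhat))  # sup_y g(xhat, y) = f(xhat)
```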
Related works: In recent years there has been growing interest in providing new interpretations of Nesterov's accelerated algorithms. For example, [26] gives a unified analysis of several of Nesterov's accelerated algorithms [17, 18, 19], using standard techniques and analysis from the optimization literature. [13] connects the design of accelerated algorithms with dynamical systems and control theory. [7] gives a geometric interpretation of Nesterov's method for unconstrained optimization, inspired by the ellipsoid method. [10] studies Nesterov's methods and the HEAVYBALL method for quadratic non-strongly convex problems by analyzing the eigenvalues of certain linear dynamical systems. [4] proposes a variant of accelerated algorithms by mixing the updates of gradient descent and mirror descent, showing that the two updates are complementary. [24, 27] connect acceleration algorithms with differential equations. In recent years there has also emerged a body of work in which learning problems are treated as repeated games [14, 3], and many researchers have been studying the relationship between game dynamics and provable convergence rates [5, 11, 9]. We would like to acknowledge George Lan for his excellent notes titled "Lectures on Optimization for Machine Learning" (unpublished). In parallel to the development of the results in this paper, we discovered that Lan had observed a similar connection between NESTEROVACCELERATION and repeated game playing (Chapter 3.4). A game interpretation was also given by George Lan and Yi Zhou in Section 2.2 of [12].

2 Preliminaries

Convex functions and conjugates. A function $f$ on $\mathbb{R}^d$ is $L$-smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere differentiable and has Lipschitz continuous gradient, $\|\nabla f(u) - \nabla f(v)\|_* \le L \|u - v\|$, where $\|\cdot\|_*$ denotes the dual norm. Throughout the paper, our goal will be to solve the problem of minimizing an $L$-smooth function $f(\cdot)$ over a convex set $K$. We also assume that the optimal solution $x^* := \operatorname{argmin}_{x \in K} f(x)$ has finite norm. For any convex function $f$, its Fenchel conjugate is $f^*(y) := \sup_{x \in \operatorname{dom}(f)} \langle x, y \rangle - f(x)$. If a function $f$ is convex, then its conjugate $f^*$ is also convex. Furthermore, when the function $f(\cdot)$ is differentiable, we have that $\nabla f(x) = \operatorname{argmax}_y \langle x, y \rangle - f^*(y)$.

Suppose we are given a differentiable function $\phi(\cdot)$. The Bregman divergence $V_c(x)$ with respect to $\phi(\cdot)$ at a point $c$ is defined as $V_c(x) := \phi(x) - \phi(c) - \langle \nabla\phi(c), x - c \rangle$. Let $\|\cdot\|$ be any norm on $\mathbb{R}^d$. When we have $V_c(x) \ge \frac{\sigma}{2}\|c - x\|^2$ for all $x, c \in \operatorname{dom}(\phi)$, we say that $\phi(\cdot)$ is a $\sigma$-strongly convex function with respect to $\|\cdot\|$. Throughout the paper we assume that $\phi(\cdot)$ is 1-strongly convex.
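The following sketch (our illustration, not from the paper) evaluates the Bregman divergence definition above for two standard generators: the squared Euclidean norm, whose divergence is the halved squared distance, and negative entropy, whose divergence on the simplex is the KL divergence.

```python
import numpy as np

# Sketch (ours) illustrating the Bregman divergence definition above with
# two standard generators.

def bregman(phi, grad_phi, x, c):
    """V_c(x) = phi(x) - phi(c) - <grad phi(c), x - c>."""
    return phi(x) - phi(c) - grad_phi(c) @ (x - c)

# phi(x) = ||x||_2^2 / 2 is 1-strongly convex w.r.t. ||.||_2; its Bregman
# divergence is the halved squared Euclidean distance.
sq = lambda x: 0.5 * x @ x
x, c = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(sq, lambda v: v, x, c), 0.5 * np.sum((x - c) ** 2))

# phi(x) = sum_i x_i log x_i (negative entropy) is 1-strongly convex w.r.t.
# ||.||_1 on the simplex; its Bregman divergence is the KL divergence.
ent = lambda x: np.sum(x * np.log(x))
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
kl = np.sum(p * np.log(p / q))
assert np.isclose(bregman(ent, lambda v: np.log(v) + 1.0, p, q), kl)
```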
No-regret zero-sum game dynamics. Let us now consider the process of solving a zero-sum game via repeated play by a pair of online learning strategies. The sequential procedure is described in Algorithm 1.

Algorithm 1 Computing equilibrium using no-regret algorithms
1: Input: sequence $\alpha_1, \ldots, \alpha_T > 0$
2: for $t = 1, 2, \ldots, T$ do
3:   y-player selects $y_t \in Y = \mathbb{R}^d$ by $\mathrm{OAlg}^y$.
4:   x-player selects $x_t \in X$ by $\mathrm{OAlg}^x$, possibly with knowledge of $y_t$.
5:   y-player suffers loss $\ell_t(y_t)$ with weight $\alpha_t$, where $\ell_t(\cdot) = g(x_t, \cdot)$.
6:   x-player suffers loss $h_t(x_t)$ with weight $\alpha_t$, where $h_t(\cdot) = g(\cdot, y_t)$.
7: end for
8: Output $(\bar{x}_T, \bar{y}_T) := \Big( \frac{\sum_{s=1}^T \alpha_s x_s}{\sum_{s=1}^T \alpha_s}, \frac{\sum_{s=1}^T \alpha_s y_s}{\sum_{s=1}^T \alpha_s} \Big)$

In this paper we consider the Fenchel game with weighted losses depicted in Algorithm 1, following the same setup as [1]. In this game, the y-player plays before the x-player, and the x-player sees the y-player's action before choosing its own. The y-player receives the loss function $\alpha_t \ell_t(\cdot)$ in round $t$, in which $\ell_t(y) := f^*(y) - \langle x_t, y \rangle$, while the x-player sees its loss function $\alpha_t h_t(\cdot)$ in round $t$, in which $h_t(x) := \langle x, y_t \rangle - f^*(y_t)$. Consequently, we can define the weighted regret of the x- and y-players as
$$\alpha\text{-REG}^y := \textstyle\sum_{t=1}^T \alpha_t \ell_t(y_t) - \min_y \sum_{t=1}^T \alpha_t \ell_t(y), \tag{3}$$
$$\alpha\text{-REG}^x := \textstyle\sum_{t=1}^T \alpha_t h_t(x_t) - \sum_{t=1}^T \alpha_t h_t(x^*). \tag{4}$$
Notice that the x-player's regret is computed relative to $x^*$, the minimizer of $f(\cdot)$, rather than the minimizer of $\sum_{t=1}^T \alpha_t h_t(\cdot)$. Although slightly non-standard, this allows us to handle the unconstrained setting while Theorem 1 still holds as desired. At times when we want to refer to the regret on another sequence $y'_1, \ldots, y'_T$ we may refer to this as $\alpha\text{-REG}(y'_1, \ldots, y'_T)$. We also denote by $A_t$ the cumulative sum of the weights, $A_t := \sum_{s=1}^t \alpha_s$, and define the weighted average regret $\overline{\alpha\text{-REG}} := \frac{\alpha\text{-REG}}{A_T}$. Finally, for offline constrained optimization (i.e. $\min_{x \in K} f(x)$), we let the decision space of the benchmark/comparator in the weighted regret definition be $X = K$; for offline unconstrained optimization, we let the decision space of the benchmark/comparator be a norm ball that contains the optimal solution of the offline problem (i.e. contains $\arg\min_{x \in \mathbb{R}^n} f(x)$), so that the comparator set $X$ is a norm ball. We let $Y = \mathbb{R}^d$ be unconstrained.

Theorem 1. [1] Assume a $T$-length sequence $\alpha$ is given. Suppose in Algorithm 1 the online learning algorithms $\mathrm{OAlg}^x$ and $\mathrm{OAlg}^y$ have $\alpha$-weighted average regret $\overline{\alpha\text{-REG}}^x$ and $\overline{\alpha\text{-REG}}^y$ respectively. Then the output $(\bar{x}_T, \bar{y}_T)$ is an $\epsilon$-equilibrium for $g(\cdot, \cdot)$, with $\epsilon = \overline{\alpha\text{-REG}}^x + \overline{\alpha\text{-REG}}^y$.

3 An Accelerated Solution to the Fenchel Game via Optimism

We are going to analyze more closely the use of Algorithm 1, with the help of Theorem 1, to establish a fast method to compute an approximate equilibrium of the Fenchel game. In particular, we will establish an approximation factor of $O(1/T^2)$ after $T$ iterations, and we recall that this leads to an $O(1/T^2)$ algorithm for our primary goal of solving $\min_{x \in K} f(x)$.

3.1 Analysis of the weighted regret of the y-player (i.e. the gradient player)

A very natural online learning algorithm is FOLLOWTHELEADER, which always plays the point with the lowest (weighted) historical loss:
$$\text{FOLLOWTHELEADER:} \quad \hat{y}_t := \operatorname{argmin}_y \Big\{ \textstyle\sum_{s=1}^{t-1} \alpha_s \ell_s(y) \Big\}.$$
FOLLOWTHELEADER is known not to perform well against arbitrary loss functions, but for strongly convex $\ell_t(\cdot)$ one can prove an $O(\log T/T)$ regret bound in the unweighted case. For the time being, we shall focus on a slightly different algorithm that utilizes optimism in selecting the next action:
$$\text{OPTIMISTICFTL:} \quad \tilde{y}_t := \operatorname{argmin}_y \Big\{ \alpha_t \ell_{t-1}(y) + \textstyle\sum_{s=1}^{t-1} \alpha_s \ell_s(y) \Big\}.$$
This procedure can be viewed as an optimistic variant of FOLLOWTHELEADER since the algorithm is effectively making a bet that, while $\ell_t(\cdot)$ has not yet been observed, it is likely to be quite similar to $\ell_{t-1}$. Within the online learning community, the origins of this trick go back to [8], although their algorithm was described in terms of a 2-step descent method. This was later expanded by [21], who coined the term optimistic mirror descent (OMD) and showed that the proposed procedure can accelerate zero-sum game dynamics when both players utilize OMD. OPTIMISTICFTL, defined as a batch procedure, was first presented in [1], and many of the tools of the present paper follow directly from that work.
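For intuition, the sketch below (ours, not from the paper) instantiates weighted FOLLOWTHELEADER and OPTIMISTICFTL on the Fenchel-game losses $\ell_s(y) = f^*(y) - \langle x_s, y\rangle$ for the special case $f(x) = \frac{1}{2}\|x\|_2^2$, where $f^* = f$ and both argmins reduce to weighted averages of past plays (this previews equations (5) and (6) below).

```python
import numpy as np

# Sketch (ours): weighted FTL vs OPTIMISTICFTL on the Fenchel-game losses
# l_s(y) = f*(y) - <x_s, y>. For f(x) = ||x||^2/2 we have f* = f, so
# argmin_y sum_s a_s l_s(y) is the weighted average of the x_s seen so far.

def ftl(xs, alphas, t):
    """FTL at round t: minimize the weighted loss of rounds 1..t-1."""
    num = sum(alphas[s] * xs[s] for s in range(t - 1))
    return num / sum(alphas[: t - 1])

def optimistic_ftl(xs, alphas, t):
    """OPTIMISTICFTL: pretend round t's loss equals round t-1's loss."""
    num = alphas[t - 1] * xs[t - 2] + sum(alphas[s] * xs[s] for s in range(t - 1))
    return num / sum(alphas[:t])

rng = np.random.default_rng(1)
xs = [rng.standard_normal(3) for _ in range(5)]
alphas = [float(t + 1) for t in range(5)]     # the alpha_t = t weighting
print(ftl(xs, alphas, 5), optimistic_ftl(xs, alphas, 5))
```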
For convenience, we'll define $\delta_t(y) := \alpha_t(\ell_t(y) - \ell_{t-1}(y))$. Intuitively, the regret will be small if the functions $\delta_t$ are not too big. This is formalized in the following lemma.

Lemma 2. For an arbitrary sequence $\{\alpha_t, \ell_t\}_{t=1\ldots T}$, the regret of OPTIMISTICFTL satisfies
$$\alpha\text{-REG}^y(\tilde{y}_1, \ldots, \tilde{y}_T) \le \textstyle\sum_{t=1}^T \big( \delta_t(\tilde{y}_t) - \delta_t(\hat{y}_{t+1}) \big).$$

Proof. Let $L_t(y) := \sum_{s=1}^t \alpha_s \ell_s(y)$ and also $\tilde{L}_t(y) := \alpha_t \ell_{t-1}(y) + \sum_{s=1}^{t-1} \alpha_s \ell_s(y)$, so that $\hat{y}_{t+1}$ minimizes $L_t$ and $\tilde{y}_t$ minimizes $\tilde{L}_t$. Then
$$\begin{aligned}
\alpha\text{-REG}(\tilde{y}_{1:T}) &:= \textstyle\sum_{t=1}^T \alpha_t \ell_t(\tilde{y}_t) - L_T(\hat{y}_{T+1}) \\
&= \textstyle\sum_{t=1}^T \alpha_t \ell_t(\tilde{y}_t) - \tilde{L}_T(\hat{y}_{T+1}) - \delta_T(\hat{y}_{T+1}) \\
&\le \textstyle\sum_{t=1}^T \alpha_t \ell_t(\tilde{y}_t) - \tilde{L}_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1}) \\
&= \textstyle\sum_{t=1}^{T-1} \alpha_t \ell_t(\tilde{y}_t) - L_{T-1}(\tilde{y}_T) + \delta_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1}) \\
&\le \textstyle\sum_{t=1}^{T-1} \alpha_t \ell_t(\tilde{y}_t) - L_{T-1}(\hat{y}_T) + \delta_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1}) \\
&= \alpha\text{-REG}(\tilde{y}_{1:T-1}) + \delta_T(\tilde{y}_T) - \delta_T(\hat{y}_{T+1}),
\end{aligned}$$
where the first inequality holds because $\tilde{y}_T$ minimizes $\tilde{L}_T$ and the second because $\hat{y}_T$ minimizes $L_{T-1}$. The bound follows by induction on $T$.

The result from Lemma 2 is generic and would hold for any online learning problem. But for the Fenchel game we have a very specific sequence of loss functions, $\ell_t(y) := g(x_t, y) = f^*(y) - \langle x_t, y \rangle$. With this in mind, let us further analyze the regret of the y-player. For the time being, let us assume that the sequence of $x_t$'s is arbitrary. We define
$$\bar{x}_t := \tfrac{1}{A_t} \textstyle\sum_{s=1}^t \alpha_s x_s \quad \text{and} \quad \tilde{x}_t := \tfrac{1}{A_t}\big(\alpha_t x_{t-1} + \textstyle\sum_{s=1}^{t-1} \alpha_s x_s\big).$$
It is critical that we have two parallel sequences of iterate averages for the x-player. Our final algorithm will output $\bar{x}_T$, whereas the Fenchel game dynamics will involve computing $\nabla f$ at the reweighted averages $\tilde{x}_t$ for each $t = 1, \ldots, T$. To prove the key regret bound for the y-player, we first need to state some simple technical facts:
$$\hat{y}_{t+1} = \operatorname{argmin}_y \textstyle\sum_{s=1}^t \alpha_s (f^*(y) - \langle x_s, y \rangle) = \operatorname{argmax}_y\, \langle \bar{x}_t, y \rangle - f^*(y) = \nabla f(\bar{x}_t), \tag{5}$$
$$\tilde{y}_t = \nabla f(\tilde{x}_t) \quad \text{(following the same reasoning as above)}, \tag{6}$$
$$\tilde{x}_t - \bar{x}_t = \tfrac{\alpha_t}{A_t}(x_{t-1} - x_t). \tag{7}$$
Equations (5) and (6) follow from elementary properties of Fenchel conjugation and the Legendre transform [23]. Equation (7) follows from a simple algebraic calculation.

Lemma 3. Suppose $f(\cdot)$ is a convex function that is $L$-smooth with respect to the norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$. Let $x_1, \ldots, x_T$ be an arbitrary sequence of points. Then we have
$$\alpha\text{-REG}^y(\tilde{y}_1, \ldots, \tilde{y}_T) \le L \textstyle\sum_{t=1}^T \tfrac{\alpha_t^2}{A_t} \|x_{t-1} - x_t\|^2. \tag{8}$$

Proof. Following Lemma 2, and noting that here we have $\delta_t(y) = \alpha_t \langle x_{t-1} - x_t, y \rangle$, we have
$$\begin{aligned}
\textstyle\sum_{t=1}^T \alpha_t \ell_t(\tilde{y}_t) - \alpha_t \ell_t(y^*) &\le \textstyle\sum_{t=1}^T \delta_t(\tilde{y}_t) - \delta_t(\hat{y}_{t+1}) = \textstyle\sum_{t=1}^T \alpha_t \langle x_{t-1} - x_t, \tilde{y}_t - \hat{y}_{t+1} \rangle \\
&= \textstyle\sum_{t=1}^T \alpha_t \langle x_{t-1} - x_t, \nabla f(\tilde{x}_t) - \nabla f(\bar{x}_t) \rangle && \text{(Eqns. (5), (6))} \\
&\le \textstyle\sum_{t=1}^T \alpha_t \|x_{t-1} - x_t\|\, \|\nabla f(\tilde{x}_t) - \nabla f(\bar{x}_t)\|_* && \text{(H\"older's inequality)} \\
&\le L \textstyle\sum_{t=1}^T \alpha_t \|x_{t-1} - x_t\|\, \|\tilde{x}_t - \bar{x}_t\| && \text{($L$-smoothness of $f$)} \\
&= L \textstyle\sum_{t=1}^T \tfrac{\alpha_t^2}{A_t} \|x_{t-1} - x_t\|^2 && \text{(Eqn. (7))}
\end{aligned}$$
as desired.

We notice that a similar bound is given in [1] for the gradient player using OPTIMISTICFTL, yet the above result is a strict improvement, as the previous work relied on the additional assumption that $f(\cdot)$ is strongly convex. The above lemma depends only on the fact that $f$ has Lipschitz gradients.

3.2 Analysis of the weighted regret of the x-player

In the present section we consider the case where the x-player uses MIRRORDESCENT to update its action, defined as follows:
$$x_t := \operatorname{argmin}_{x \in K} \Big\{ \alpha_t h_t(x) + \tfrac{1}{\gamma_t} V_{x_{t-1}}(x) \Big\} = \operatorname{argmin}_{x \in K} \Big\{ \gamma_t \langle x, \alpha_t y_t \rangle + V_{x_{t-1}}(x) \Big\}, \tag{9}$$
where we recall that the Bregman divergence $V_x(\cdot)$ is with respect to a 1-strongly convex regularizer $\phi$.
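A minimal sketch (ours) of one MIRRORDESCENT step (9) in the Euclidean case $\phi(x) = \frac{1}{2}\|x\|_2^2$, where $V_{x_{t-1}}(x) = \frac{1}{2}\|x - x_{t-1}\|_2^2$ and the update reduces to a projected gradient step; the constraint set and its projection are illustrative assumptions.

```python
import numpy as np

# Sketch (ours): the MIRRORDESCENT update (9) with the Euclidean generator
# phi(x) = ||x||^2/2, for which V_{x_{t-1}}(x) = ||x - x_{t-1}||^2/2 and the
# argmin reduces to a gradient step on the linear loss, projected onto K.

def md_step(x_prev, y_t, alpha_t, gamma_t, project):
    """One step of (9): argmin_{x in K} gamma_t*<x, alpha_t*y_t> + V_{x_prev}(x)."""
    return project(x_prev - gamma_t * alpha_t * y_t)

# Example constraint set K: the Euclidean ball of radius R.
def project_ball(v, R=1.0):
    n = np.linalg.norm(v)
    return v if n <= R else (R / n) * v

x_next = md_step(np.array([0.5, 0.5]), np.array([2.0, 0.0]),
                 alpha_t=3.0, gamma_t=0.1, project=project_ball)
```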
Also, we note that the x-player has an advantage in these game dynamics: $x_t$ is chosen with knowledge of $y_t$ and hence with knowledge of the incoming loss $h_t(\cdot)$.

Lemma 4. Let the sequence of $x_t$'s be chosen according to MIRRORDESCENT. Assume that the Bregman divergence is uniformly bounded on $K$, so that $D = \sup_{t=1,\ldots,T} V_{x_t}(x^*)$, where $x^*$ denotes the minimizer of $f(\cdot)$. Assume that the sequence $\{\gamma_t\}_{t=1,2,\ldots}$ is non-increasing. Then we have
$$\alpha\text{-REG}^x \le \tfrac{D}{\gamma_T} - \textstyle\sum_{t=1}^T \tfrac{1}{2\gamma_t} \|x_{t-1} - x_t\|^2.$$

The proof of this lemma is quite standard, and we postpone it to Appendix A. We also note that the benchmark $x^*$ is always within a finite norm ball by assumption. We give an alternative to this lemma in the appendix for the case where $\gamma_t = \gamma$ is fixed, in which case we can instead use the more natural constant $D = V_{x_1}(x^*)$.

3.3 Convergence Rate of the Fenchel Game

Theorem 2. Let us consider the output $(\bar{x}_T, \bar{y}_T)$ of Algorithm 1 under the following conditions: (a) the sequence $\{\alpha_t\}$ is positive but otherwise arbitrary, (b) $\mathrm{OAlg}^y$ is chosen to be OPTIMISTICFTL, (c) $\mathrm{OAlg}^x$ is MIRRORDESCENT with any non-increasing positive sequence $\{\gamma_t\}$, and (d) we have a bound $V_{x_t}(x^*) \le D$ for all $t$. Then the point $\bar{x}_T$ satisfies
$$f(\bar{x}_T) - \min_{x \in X} f(x) \le \tfrac{1}{A_T}\left( \tfrac{D}{\gamma_T} + \textstyle\sum_{t=1}^T \Big( \tfrac{\alpha_t^2 L}{A_t} - \tfrac{1}{2\gamma_t} \Big) \|x_{t-1} - x_t\|^2 \right). \tag{10}$$

Proof. We have already done the hard work to prove this theorem. Lemma 1 tells us we can bound the error of $\bar{x}_T$ by the $\epsilon$ error of the approximate equilibrium $(\bar{x}_T, \bar{y}_T)$. Theorem 1 tells us that the quality of the pair $(\bar{x}_T, \bar{y}_T)$ derived from Algorithm 1 is controlled by the sum of the averaged regrets of both players, $\frac{1}{A_T}(\alpha\text{-REG}^x + \alpha\text{-REG}^y)$. But we now have control over both of these regret quantities, from Lemmas 3 and 4. The right-hand side of (10) is the sum of these bounds.

Theorem 2 is somewhat opaque without specifying the sequence $\{\alpha_t\}$. But what we now show is that the summation term vanishes when we can guarantee that $\frac{\alpha_t^2}{A_t}$ remains constant! This is how we obtain the following fast rate.

Corollary 1. Following Theorem 2, with $\alpha_t = t$ and any non-increasing sequence $\gamma_t$ satisfying $\frac{1}{CL} \le \gamma_t \le \frac{1}{4L}$ for some constant $C \ge 4$, we have
$$f(\bar{x}_T) - \min_{x \in X} f(x) \le \tfrac{2CLD}{T^2}.$$

Proof. Observing $A_t := \frac{t(t+1)}{2}$, the choice of $\{\alpha_t, \gamma_t\}$ implies $\frac{D}{\gamma_T} \le CDL$ and $\frac{L\alpha_t^2}{A_t} = \frac{2Lt^2}{t(t+1)} \le 2L \le \frac{1}{2\gamma_t}$, which ensures that the summation term in (10) is non-positive. The rest is simple algebra.

A straightforward choice for the learning rate $\gamma_t$ is simply the constant sequence $\gamma_t = \frac{1}{4L}$. The corollary is stated with a changing $\gamma_t$ in order to bring out a connection to the classical NESTEROVACCELERATION in the following section.

Remark: It is worth dwelling on exactly how we obtained the above result. A less refined analysis of the MIRRORDESCENT algorithm would have simply ignored the negative summation term in Lemma 4 and upper bounded it by 0. But the negative terms $-\|x_t - x_{t-1}\|^2$ in this sum happen to correspond exactly to the positive terms one obtains in the regret bound for the y-player, and this is true only as a result of using the OPTIMISTICFTL algorithm. To obtain a cancellation of these terms, we need a $\gamma_t$ which is roughly constant, and hence we need to ensure that $\frac{\alpha_t^2}{A_t} = O(1)$. The final bound, of course, is determined by the inverse quantity $\frac{1}{A_T}$, and a quick inspection reveals that the best choice is $\alpha_t = \Theta(t)$. This is not the only choice that could work, and we conjecture that there are scenarios in which better bounds are achievable for different $\alpha_t$ tunings. We show in Section 4.3 that a linear rate is achievable when $f(\cdot)$ is also strongly convex, and there we tune $\alpha_t$ to grow exponentially in $t$ rather than linearly.
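Putting the pieces together, here is a sketch (ours, under the assumptions of Corollary 1 with $K = \mathbb{R}^n$) of the full dynamics: the y-player's OPTIMISTICFTL action is computed via equation (6), and the x-player takes the Euclidean mirror-descent step with $\alpha_t = t$ and $\gamma_t = \frac{1}{4L}$; the test function is a toy assumption.

```python
import numpy as np

# Sketch (ours) assembling the pieces of Corollary 1: the y-player plays
# OPTIMISTICFTL, which by equation (6) reduces to y_t = grad f(x_tilde_t),
# and the x-player plays MIRRORDESCENT with alpha_t = t, gamma_t = 1/(4L).
# We take K = R^n, so the mirror step is a plain gradient step.

def accelerated_fenchel_game(grad_f, L, x0, T):
    x, A = x0.copy(), 0.0
    wsum = np.zeros_like(x0)              # running sum of alpha_s * x_s
    for t in range(1, T + 1):
        alpha, gamma = float(t), 1.0 / (4.0 * L)
        A += alpha
        x_tilde = (alpha * x + wsum) / A  # x_tilde_t puts weight alpha_t on x_{t-1}
        y = grad_f(x_tilde)               # OPTIMISTICFTL via equation (6)
        x = x - gamma * alpha * y         # MIRRORDESCENT step (9), K = R^n
        wsum += alpha * x
    return wsum / A                       # the output x_bar_T

# Toy check on f(x) = ||x - b||^2 / 2 (L = 1): error decays roughly as 1/T^2.
b = np.arange(4.0)
xbar = accelerated_fenchel_game(lambda v: v - b, L=1.0, x0=np.zeros(4), T=200)
print(0.5 * np.sum((xbar - b) ** 2))
```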
4 Nesterov's methods are instances of our accelerated solution to the game

Starting from 1983, Nesterov has proposed three accelerated methods for smooth convex problems (i.e. [16, 15, 17, 19]). In this section, we show that our accelerated algorithm for the Fenchel game can generate all of these methods with some simple tweaks.

4.1 Recovering Nesterov's (1983) method for unconstrained smooth convex problems [16, 15]

In this subsection, we assume that the x-player's action space is unconstrained; that is, $K = \mathbb{R}^n$. Consider the following algorithm.

Algorithm 2 A variant of our accelerated algorithm
1: In the weighted loss setting of Algorithm 1:
2: y-player uses OPTIMISTICFTL as $\mathrm{OAlg}^y$: $y_t = \nabla f(\tilde{x}_t)$.
3: x-player uses ONLINEGRADIENTDESCENT as $\mathrm{OAlg}^x$:
4:   $x_t = x_{t-1} - \gamma_t \alpha_t \nabla h_t(x) = x_{t-1} - \gamma_t \alpha_t y_t = x_{t-1} - \gamma_t \alpha_t \nabla f(\tilde{x}_t)$.

Theorem 3. Let $\alpha_t = t$ and assume $K = \mathbb{R}^n$. Algorithm 2 is the case in which the x-player uses MIRRORDESCENT. Therefore, $\bar{x}_T$ is an $O(\frac{1}{T^2})$-approximate optimal solution of $\min_x f(x)$ by Theorem 2 and Corollary 1.

Proof. For the unconstrained case, we can let the distance-generating function of the Bregman divergence be the squared $L_2$ norm, i.e. $\phi(x) := \frac{1}{2}\|x\|_2^2$. Then the update becomes
$$x_t = \operatorname{argmin}_x\, \gamma_t \langle x, \alpha_t y_t \rangle + V_{x_{t-1}}(x) = \operatorname{argmin}_x\, \gamma_t \langle x, \alpha_t y_t \rangle + \tfrac{1}{2}\|x\|_2^2 - \langle x_{t-1}, x - x_{t-1} \rangle - \tfrac{1}{2}\|x_{t-1}\|_2^2.$$
Differentiating the objective w.r.t. $x$ and setting it to zero, one gets $x_t = x_{t-1} - \gamma_t \alpha_t y_t$.

Having shown that Algorithm 2 is an instance of our accelerated algorithm for the Fenchel game, we now show that Algorithm 2 has a direct correspondence with Nesterov's first acceleration method (Algorithm 3) [16, 15] (see also [24]).

Algorithm 3 Nesterov's method [16, 15]
1: Init: $w_0 = z_0$. Require: $\theta \le \frac{1}{L}$.
2: for $t = 1, 2, \ldots, T$ do
3:   $w_t = z_{t-1} - \theta \nabla f(z_{t-1})$.
4:   $z_t = w_t + \frac{t-1}{t+2}(w_t - w_{t-1})$.
5: end for
6: Output $w_T$.

To see the equivalence, let us rewrite $\bar{x}_t := \frac{1}{A_t}\sum_{s=1}^t \alpha_s x_s$ of Algorithm 2, using the identity $x_{t-1} = \frac{A_{t-1}\bar{x}_{t-1} - A_{t-2}\bar{x}_{t-2}}{\alpha_{t-1}}$:
$$\begin{aligned}
\bar{x}_t &= \tfrac{A_{t-1}\bar{x}_{t-1} + \alpha_t x_t}{A_t} = \tfrac{A_{t-1}\bar{x}_{t-1} + \alpha_t \big( x_{t-1} - \gamma_t \alpha_t \nabla f(\tilde{x}_t) \big)}{A_t} \\
&= \tfrac{A_{t-1}\bar{x}_{t-1} + \alpha_t \big( \frac{A_{t-1}\bar{x}_{t-1} - A_{t-2}\bar{x}_{t-2}}{\alpha_{t-1}} - \gamma_t \alpha_t \nabla f(\tilde{x}_t) \big)}{A_t} \\
&= \bar{x}_{t-1}\Big( \tfrac{A_{t-1}}{A_t} + \tfrac{\alpha_t(\alpha_{t-1} + A_{t-2})}{A_t \alpha_{t-1}} \Big) - \bar{x}_{t-2}\Big( \tfrac{\alpha_t A_{t-2}}{A_t \alpha_{t-1}} \Big) - \tfrac{\gamma_t \alpha_t^2}{A_t} \nabla f(\tilde{x}_t) \\
&= \bar{x}_{t-1} - \tfrac{\gamma_t \alpha_t^2}{A_t} \nabla f(\tilde{x}_t) + \tfrac{\alpha_t A_{t-2}}{A_t \alpha_{t-1}} \big( \bar{x}_{t-1} - \bar{x}_{t-2} \big) \\
&= \bar{x}_{t-1} - \tfrac{1}{4L} \nabla f(\tilde{x}_t) + \tfrac{t-2}{t+1} \big( \bar{x}_{t-1} - \bar{x}_{t-2} \big), \tag{11}
\end{aligned}$$
where the last line uses $\alpha_t = t$ and $\gamma_t = \frac{t+1}{8Lt}$.

Theorem 4. Algorithm 3 with $\theta = \frac{1}{4L}$ is equivalent to Algorithm 2 with $\gamma_t = \frac{t+1}{8Lt}$ in the sense that they generate equivalent sequences of iterates: for all $t = 1, 2, \ldots, T$, $w_t = \bar{x}_t$ and $z_{t-1} = \tilde{x}_t$.
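The correspondence in Theorem 4 can be checked numerically; the sketch below (ours, on a toy quadratic) runs Algorithm 2 with $\gamma_t = \frac{t+1}{8Lt}$ alongside Algorithm 3 with $\theta = \frac{1}{4L}$ and asserts $w_t = \bar{x}_t$ at every round.

```python
import numpy as np

# Sketch (ours): a numerical check of the Theorem 4 correspondence on a toy
# quadratic f(x) = ||x - b||^2 / 2, for which L = 1.

b = np.array([1.0, -2.0, 3.0]); L = 1.0
grad_f = lambda v: v - b

# Algorithm 2: the game view, tracking the weighted average x_bar_t.
x, wsum, A, xbars = np.zeros(3), np.zeros(3), 0.0, []
for t in range(1, 21):
    alpha, gamma = float(t), (t + 1) / (8.0 * L * t)
    A += alpha
    x_tilde = (alpha * x + wsum) / A
    x = x - gamma * alpha * grad_f(x_tilde)
    wsum += alpha * x
    xbars.append(wsum / A)

# Algorithm 3: Nesterov (1983) with theta = 1/(4L).
z, w = np.zeros(3), np.zeros(3)
for t in range(1, 21):
    w_new = z - (1.0 / (4.0 * L)) * grad_f(z)            # line 3
    z = w_new + (t - 1.0) / (t + 2.0) * (w_new - w)      # line 4 (momentum)
    w = w_new
    assert np.allclose(w, xbars[t - 1]), f"mismatch at t={t}"
```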
Let us now compare the update of Algorithm 2, which is (11), with the update of the HEAVYBALL algorithm. We see that (11) has the so-called momentum term (i.e. the $(\bar{x}_{t-1} - \bar{x}_{t-2})$ term). The difference, however, is that the gradient is evaluated at $\tilde{x}_t = \frac{1}{A_t}(\alpha_t x_{t-1} + \sum_{s=1}^{t-1} \alpha_s x_s)$, not at $\bar{x}_{t-1} = \frac{1}{A_{t-1}}\sum_{s=1}^{t-1} \alpha_s x_s$, which is a consequence of the y-player playing OPTIMISTICFTL. To elaborate, let us consider the scenario (shown in Algorithm 4) in which the y-player plays FOLLOWTHELEADER instead of OPTIMISTICFTL.

Algorithm 4 HEAVYBALL algorithm
1: In the weighted loss setting of Algorithm 1:
2: y-player uses FOLLOWTHELEADER as $\mathrm{OAlg}^y$: $y_t = \nabla f(\bar{x}_{t-1})$.
3: x-player uses ONLINEGRADIENTDESCENT as $\mathrm{OAlg}^x$:
4:   $x_t := x_{t-1} - \gamma_t \alpha_t \nabla h_t(x) = x_{t-1} - \gamma_t \alpha_t y_t = x_{t-1} - \gamma_t \alpha_t \nabla f(\bar{x}_{t-1})$.

Following what we did in (11), we can rewrite $\bar{x}_t$ of Algorithm 4 as
$$\bar{x}_t = \bar{x}_{t-1} - \tfrac{\gamma_t \alpha_t^2}{A_t} \nabla f(\bar{x}_{t-1}) + \tfrac{\alpha_t A_{t-2}}{A_t \alpha_{t-1}} \big( \bar{x}_{t-1} - \bar{x}_{t-2} \big), \tag{12}$$
by observing that (11) still holds except that $\nabla f(\tilde{x}_t)$ is replaced by $\nabla f(\bar{x}_{t-1})$, since the y-player now uses FOLLOWTHELEADER; this gives us the update of the HEAVYBALL algorithm as (12). Moreover, by the regret analysis, we have the following theorem. The proof is in Appendix C.

Theorem 5. Let $\alpha_t = t$, assume $K = \mathbb{R}^n$, and let $\gamma_t = O(\frac{1}{L})$. The output $\bar{x}_T$ of Algorithm 4 is an $O(\frac{1}{T})$-approximate optimal solution of $\min_x f(x)$.

To conclude, by comparing Algorithm 2 and Algorithm 4, we see that Nesterov's (1983) method enjoys the $O(1/T^2)$ rate because it adopts OPTIMISTICFTL, while the HEAVYBALL algorithm, which adopts FTL, may not enjoy the fast rate, as the distance terms may not cancel out. This also conforms to empirical studies showing that HEAVYBALL does not exhibit acceleration on general smooth convex problems.

4.2 Recovering Nesterov's (1988) 1-memory method [17] and Nesterov's (2005) ∞-memory method [19]

In this subsection we consider recovering Nesterov's (1988) 1-memory method [17] and Nesterov's (2005) ∞-memory method [19]. To be specific, we adopt the presentations of Nesterov's algorithms given in Algorithm 1 and Algorithm 3 of [26], respectively.

Algorithm 5 (A) Nesterov's 1-memory method [17] and (B) Nesterov's ∞-memory method [19]
1: Input: parameters $\beta_t = \frac{2}{t+1}$, $\gamma'_t = \frac{t}{4L}$, $\theta_t = t$, and $\eta = \frac{1}{4L}$.
2: Init: $w_0 = x_0$.
3: for $t = 1, 2, \ldots, T$ do
4:   $z_t = (1 - \beta_t) w_{t-1} + \beta_t x_{t-1}$.
5:   (A) $x_t = \operatorname{argmin}_{x \in K}\, \gamma'_t \langle \nabla f(z_t), x \rangle + V_{x_{t-1}}(x)$.
6:   Or, (B) $x_t = \operatorname{argmin}_{x \in K}\, \sum_{s=1}^t \theta_s \langle x, \nabla f(z_s) \rangle + \frac{1}{\eta} R(x)$, where $R(\cdot)$ is 1-strongly convex.
7:   $w_t = (1 - \beta_t) w_{t-1} + \beta_t x_t$.
8: end for
9: Output $w_T$.

Theorem 6. Let $\alpha_t = t$. Algorithm 5 with update option (A) is the case in which the y-player uses OPTIMISTICFTL and the x-player adopts MIRRORDESCENT with $\gamma_t = \frac{1}{4L}$ in the Fenchel game. Therefore, $w_T$ is an $O(\frac{1}{T^2})$-approximate optimal solution of $\min_{x \in K} f(x)$.

The proof is in Appendix D, which shows the direct correspondence of Algorithm 5 using option (A) to our accelerated solution in Section 3.

Theorem 7. Let $\alpha_t = t$. Algorithm 5 with update option (B) is the case in which the y-player uses OPTIMISTICFTL and the x-player adopts BETHEREGULARIZEDLEADER with $\eta = \frac{1}{4L}$ in the Fenchel game. Therefore, $w_T$ is an $O(\frac{1}{T^2})$-approximate optimal solution of $\min_{x \in K} f(x)$.

The proof is in Appendix E, which requires the regret bound of BETHEREGULARIZEDLEADER.
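For concreteness, here is a sketch (ours) of Algorithm 5 with option (A), taking the Bregman divergence to be Euclidean so that line 5 becomes a projected gradient step; the objective, constraint set, and projection below are toy assumptions.

```python
import numpy as np

# Sketch (ours) of Algorithm 5, option (A), with a Euclidean Bregman
# divergence: line 5 then reduces to a projected gradient step from x_{t-1}.

def nesterov_1_memory(grad_f, L, project, x0, T):
    w, x = x0.copy(), x0.copy()
    for t in range(1, T + 1):
        beta, gamma_p = 2.0 / (t + 1), t / (4.0 * L)
        z = (1 - beta) * w + beta * x          # line 4
        x = project(x - gamma_p * grad_f(z))   # line 5 with V Euclidean
        w = (1 - beta) * w + beta * x          # line 7
    return w

# Example: minimize f(x) = ||x - b||^2/2 (L = 1) over the unit ball.
b = np.array([2.0, 0.0])
proj = lambda v: v if np.linalg.norm(v) <= 1 else v / np.linalg.norm(v)
print(nesterov_1_memory(lambda v: v - b, 1.0, proj, np.zeros(2), 100))
```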
4.3 Accelerated linear rate

Nesterov observed that when $f(\cdot)$ is both $\mu$-strongly convex and $L$-smooth, one can achieve a rate that is exponentially decaying in $T$ (e.g. pages 71–81 of [18]). It is natural to ask if the zero-sum game and regret analysis in the present work also recovers this faster rate in the same fashion. We answer this in the affirmative. Denote $\kappa := \frac{L}{\mu}$. A property of a $\mu$-strongly convex $f(x)$ is that the function $\tilde{f}(x) := f(x) - \frac{\mu \|x\|_2^2}{2}$ is still a convex function. Now we define a new game whose payoff function is $\tilde{g}(x, y) := \langle x, y \rangle - \tilde{f}^*(y) + \frac{\mu \|x\|_2^2}{2}$. The minimax value of the game is $V^* := \min_x \max_y \tilde{g}(x, y) = \min_x \tilde{f}(x) + \frac{\mu \|x\|_2^2}{2} = \min_x f(x)$. Observe that, in this game, the loss of the y-player in round $t$ is $\alpha_t \ell_t(y) := \alpha_t (\tilde{f}^*(y) - \langle x_t, y \rangle)$, while the loss of the x-player in round $t$ is the strongly convex function $\alpha_t h_t(x) := \alpha_t (\langle x, y_t \rangle + \frac{\mu \|x\|_2^2}{2})$. The cumulative loss function of the x-player becomes more and more strongly convex over time, which is the key to allowing the exponential growth of the total weight $A_t$ that leads to the linear rate. In this setup we have a warmup round $t = 0$, and thus we denote $A_t := \sum_{s=0}^t \alpha_s$, which incorporates the additional step into the average. The proof of the following result is in Appendix H.

Theorem 8. For the game $\tilde{g}(x, y) := \langle x, y \rangle - \tilde{f}^*(y) + \frac{\mu \|x\|_2^2}{2}$, if the y-player plays OPTIMISTICFTL and the x-player plays BETHEREGULARIZEDLEADER, $x_t \leftarrow \arg\min_{x \in X} \sum_{s=0}^t \alpha_s h_s(x)$ with $\alpha_0 h_0(x) := \alpha_0 \frac{\mu \|x\|_2^2}{2}$, then the weighted average points $(\bar{x}_T, \bar{y}_T)$ form an $O(\exp(-\frac{T}{\sqrt{\kappa}}))$-approximate equilibrium of the game, where the weights $\alpha_0, \alpha_1, \ldots$ are chosen to grow at an appropriate geometric rate determined by $\sqrt{\kappa}$ (the precise condition is given in Appendix H). This implies that $f(\bar{x}_T) - \min_{x \in X} f(x) = O(\exp(-\frac{T}{\sqrt{\kappa}}))$.

5 Accelerated Proximal Method

In this section we consider solving composite optimization problems, $\min_{x \in \mathbb{R}^n} f(x) + \psi(x)$, where $f(\cdot)$ is smooth convex but $\psi(\cdot)$ is a possibly non-differentiable convex function (e.g. $\|\cdot\|_1$). We want to show that the game analysis still applies to this problem. We just need to change the payoff function $g$ to account for $\psi(x)$. Specifically, we consider the following two-player zero-sum game:
$$\min_x \max_y \{ \langle x, y \rangle - f^*(y) + \psi(x) \}.$$
Notice that the minimax value of the game is $\min_x f(x) + \psi(x)$, which is exactly the optimal value of the composite optimization problem. Let us denote the proximal operator as $\operatorname{prox}_{\lambda\psi}(v) = \operatorname{argmin}_x \big\{ \psi(x) + \frac{1}{2\lambda}\|x - v\|_2^2 \big\}$.¹

Algorithm 6 Accelerated Proximal Method
1: In the weighted loss setting of Algorithm 1 (let $\alpha_t = t$ and $\gamma_t = \frac{1}{4L}$):
2: y-player uses OPTIMISTICFTL as $\mathrm{OAlg}^y$: $y_t = \nabla f(\tilde{x}_t)$.
3: x-player uses MIRRORDESCENT with $\phi(x) := \frac{1}{2}\|x\|_2^2$ in the Bregman divergence as $\mathrm{OAlg}^x$:
4:   $x_t = \operatorname{argmin}_x\, \gamma_t \alpha_t h_t(x) + V_{x_{t-1}}(x) = \operatorname{argmin}_x\, \gamma_t \alpha_t \{ \langle x, y_t \rangle + \psi(x) \} + V_{x_{t-1}}(x)$
5:      $= \operatorname{argmin}_x\, \psi(x) + \frac{1}{2\alpha_t\gamma_t}\big( \|x\|_2^2 + 2\langle \alpha_t\gamma_t y_t - x_{t-1}, x \rangle \big) = \operatorname{prox}_{\alpha_t\gamma_t\psi}\big(x_{t-1} - \alpha_t\gamma_t \nabla f(\tilde{x}_t)\big)$

We notice that the loss function of the x-player here, $\alpha_t h_t(x) = \alpha_t(\langle x, y_t \rangle + \psi(x))$, is possibly nonlinear. Yet we can slightly adapt the analysis in Section 3 to show that the weighted average $\bar{x}_T$ is still an $O(1/T^2)$-approximate optimal solution of the offline problem; please see Appendix I for details. One can view Algorithm 6 as a variant of the so-called Accelerated Proximal Gradient method in [6], yet the design and analysis of our algorithm is simpler than that of [6].
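As an illustration, the sketch below (ours) instantiates Algorithm 6 with $\psi(x) = \lambda\|x\|_1$, whose proximal operator is the classical soft-thresholding map; the data and parameters are toy assumptions.

```python
import numpy as np

# Sketch (ours) of Algorithm 6 with psi(x) = lam * ||x||_1, whose proximal
# operator is the classical soft-thresholding map.

def soft_threshold(v, tau):
    """prox of tau * ||.||_1 at v."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accelerated_proximal(grad_f, L, lam, x0, T):
    x, wsum, A = x0.copy(), np.zeros_like(x0), 0.0
    for t in range(1, T + 1):
        alpha, gamma = float(t), 1.0 / (4.0 * L)
        A += alpha
        x_tilde = (alpha * x + wsum) / A
        # line 5 of Algorithm 6: prox step at a gradient of the smooth part
        x = soft_threshold(x - alpha * gamma * grad_f(x_tilde),
                           alpha * gamma * lam)
        wsum += alpha * x
    return wsum / A

# Toy lasso-style problem: min ||x - b||^2/2 + lam*||x||_1 (L = 1).
b = np.array([3.0, 0.05, -2.0])
print(accelerated_proximal(lambda v: v - b, 1.0, lam=0.1, x0=np.zeros(3), T=300))
```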
Acknowledgements: We would like to thank Kevin Lai and Kfir Levy for helpful discussions leading up to the results in this paper. This work was supported by funding from the Division of Computer Science and Engineering at the University of Michigan, from the College of Computing at the Georgia Institute of Technology, NSF TRIPODS award 1740776, and NSF CAREER award 1453304.

¹It is known that for some $\psi(\cdot)$, the corresponding proximal operations have closed-form solutions (see e.g. [20] for details).

References

[1] Jacob Abernethy, Kfir Levy, Kevin Lai, and Jun-Kun Wang. Faster rates for convex-concave games. COLT, 2018.
[2] Jacob Abernethy and Jun-Kun Wang. Frank-Wolfe and equilibrium computation. NIPS, 2017.
[3] Jacob Abernethy, Manfred K. Warmuth, and Joel Yellin. Optimal strategies from random walks. In Proceedings of the 21st Annual Conference on Learning Theory, pages 437–446, 2008.
[4] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ITCS, 2017.
[5] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
[6] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.
[7] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint, 2015.
[8] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. COLT, 2012.
[9] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
[10] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. COLT, 2015.
[11] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Gabriel Huang, Remi Lepriol, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. arXiv preprint arXiv:1807.04740, 2018.
[12] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 2017.
[13] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 2016.
[14] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems 26, pages 2724–2732, 2013.
[15] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR, 1983.
[16] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.
[17] Yurii Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonom. i Mat. Metody, 24:509–517, 1988.
[18] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
[19] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 2005.
[20] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 2014.
[21] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. COLT, 2013.
[22] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. NIPS, 2013.
[23] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1996.
[24] Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. NIPS, 2014.
[25] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. NIPS, 2015.
[26] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008.
[27] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
[28] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.