# Anytime Online-to-Batch, Optimism and Acceleration

Ashok Cutkosky¹

¹Google Research, California, USA. Correspondence to: Ashok Cutkosky.

*Proceedings of the 36th International Conference on Machine Learning*, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

A standard way to obtain convergence guarantees in stochastic convex optimization is to run an online learning algorithm and then output the average of its iterates: the actual iterates of the online learning algorithm do not come with individual guarantees. We close this gap by introducing a black-box modification to any online learning algorithm whose iterates converge to the optimum in stochastic scenarios. We then consider the case of smooth losses, and show that combining our approach with optimistic online learning algorithms immediately yields a fast convergence rate of $O(L/T^{3/2} + \sigma/\sqrt{T})$ on $L$-smooth problems with $\sigma^2$ variance in the gradients. Finally, we provide a reduction that converts any adaptive online algorithm into one that obtains the optimal accelerated rate of $O(L/T^2 + \sigma/\sqrt{T})$, while still maintaining $O(1/\sqrt{T})$ convergence in the nonsmooth setting. Importantly, our algorithms adapt to $L$ and $\sigma$ automatically: they do not need to know either to obtain these rates.

## 1. Online-to-Batch Conversions

We consider convex stochastic optimization problems, where our objective is to minimize some convex function $\mathcal{L} : D \to \mathbb{R}$, where $D$ is some convex domain. We do not have true access to $\mathcal{L}$, however. Instead, we have a stochastic gradient oracle that, given a point $x \in D$, will provide a random value $g$ such that $\mathbb{E}[g] \in \partial\mathcal{L}(x)$. Our objective is to use this noisy information to optimize $\mathcal{L}$.

A simple and extremely effective method for solving stochastic optimization problems is through online learning and online-to-batch conversion (Shalev-Shwartz, 2011; Cesa-Bianchi et al., 2004). These techniques require remarkably few assumptions about the nature of the expected loss or the stochasticity in the system and yet still obtain optimal or near-optimal guarantees. This has helped fuel the widespread adoption of online learning algorithms as the method of choice in training machine learning models.

Briefly, an online learning algorithm accepts a sequence of convex loss functions $\ell_1, \dots, \ell_T$ and outputs a sequence of iterates $w_1, \dots, w_T \in D$, where $D$ is some convex space and $w_t$ is output before the algorithm observes $\ell_t$. Performance is measured by the regret:
$$R_T(x^\star) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(x^\star).$$
A standard goal in online learning is to achieve sublinear regret, which means that $\lim_{T\to\infty} R_T(x^\star)/T = 0$. This indicates that the algorithm is doing just as well on average as the fixed benchmark point $x^\star$. In fact, most algorithms obtain non-asymptotic guarantees of the form $R_T(x^\star) = O(\sqrt{T})$, so that $R_T(x^\star)/T = O(1/\sqrt{T})$.

Online learning algorithms often adopt an adversarial model, in which no relationship is posited between the $\ell_t$, but in our stochastic optimization problem we know that the $\ell_t$ are generated by some random process. This is where the online-to-batch conversion technique comes in (Cesa-Bianchi et al., 2004). The classic argument is as follows: set $\ell_t(x) = \langle g_t, x\rangle$, where $g_t$ is a stochastic gradient evaluated at $w_t$. Then observe $\mathcal{L}(w_t) - \mathcal{L}(x^\star) \le \mathbb{E}[\langle g_t, w_t - x^\star\rangle]$ and apply Jensen's inequality to obtain
$$\mathbb{E}\big[\mathcal{L}(\hat{x}) - \mathcal{L}(x^\star)\big] \le \frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[\mathcal{L}(w_t) - \mathcal{L}(x^\star)\big] \le \mathbb{E}\left[\frac{R_T(x^\star)}{T}\right],$$
where $\hat{x} = \frac{1}{T}\sum_{t=1}^T w_t$. We therefore output $\hat{x}$ as an estimate of $x^\star$, and so long as the algorithm obtains sublinear regret, $\mathcal{L}(\hat{x}) - \mathcal{L}(x^\star)$ will approach zero in expectation.
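To make this recipe concrete, here is a minimal sketch (not from the paper) of the classic conversion, with projected online gradient descent playing the role of the online learner; the quadratic objective, noise level, and step size in the usage example are illustrative assumptions.

```python
import numpy as np

def classic_online_to_batch(grad_oracle, project, T, dim, lr=0.5):
    """Classic conversion: run projected online gradient descent on the linear
    losses l_t(x) = <g_t, x>, where g_t is a stochastic gradient evaluated at
    the learner's own iterate w_t, and output the average of the w_t."""
    w = np.zeros(dim)
    total = np.zeros(dim)
    for t in range(1, T + 1):
        g = grad_oracle(w)                      # stochastic gradient at w_t
        total += w                              # accumulate w_t for the average
        w = project(w - (lr / np.sqrt(t)) * g)  # OGD step producing w_{t+1}
    return total / T                            # x_hat = (1/T) * sum_t w_t

# Toy usage (illustrative): L(x) = 0.5 * ||x - x_star||^2 with additive
# gradient noise, optimized over the unit Euclidean ball.
rng = np.random.default_rng(0)
x_star = np.array([0.3, -0.2])
grad_oracle = lambda x: (x - x_star) + 0.1 * rng.standard_normal(2)
project = lambda x: x / max(1.0, float(np.linalg.norm(x)))
print(classic_online_to_batch(grad_oracle, project, T=2000, dim=2))
```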
In fact, with $R_T(x^\star) = O(\sqrt{T})$, one obtains a convergence rate of $O(1/\sqrt{T})$, which is often statistically optimal.

One drawback of the online-to-batch conversion is that the iterates $w_t$ produced by the algorithm (where the noisy gradients are actually evaluated) do not necessarily converge to the optimal loss value. In fact, there is typically very little known about the behavior of any individual $w_t$. This is aesthetically unsatisfying and may even reduce performance. For example, optimistic online algorithms can take advantage of stability in the gradients, performing well when $g_{t-1} \approx g_t$. We hope for this behavior because intuitively the iterates should converge to $x^\star$ and so become closer together. Unfortunately, because we usually have few guarantees about the individual iterates $w_t$, it may not hold that $g_{t-1} \approx g_t$. We would like to make intuition match theory by enforcing some kind of stability in the iterates.

We address this problem by providing a black-box online-to-batch conversion: the iterates $x_t$ produced by our algorithm converge in the sense that $\mathcal{L}(x_t) \to \mathcal{L}(x^\star)$ (Section 2). We call this property *anytime*, because the last iterate is always a good estimate of $x^\star$ at any time. Our reduction is quite simple, and bears strong similarity to the classical one. It stabilizes the iterates $x_t$, and we can exploit this stability when $\mathcal{L}$ is smooth. For example, when applied to an optimistic online algorithm, our reduction can leverage stability to improve the convergence rate on smooth losses from $O(L/T)$ to $O(L/T^{3/2})$ (Section 4.1). Further, our reduction also has a surprising connection to the linear coupling framework for accelerated algorithms (Allen-Zhu & Orecchia, 2014). We develop this connection to provide an algorithm that obtains a near-optimal (up to log factors) $O(L/T^2 + \sigma/\sqrt{T})$ convergence rate for stochastic smooth losses with $\sigma^2 = \operatorname{Var}(g_t)$, without knowledge of $L$ or $\sigma$, while still guaranteeing an $O(1/\sqrt{T})$ convergence rate for non-smooth losses (Section 4.2). In addition to these new algorithms, we feel that our analysis itself is interesting for its appealing simplicity.

### 1.1. Notation and Definitions

We frequently use the compressed-sum notation $\alpha_{1:t} = \sum_{i=1}^t \alpha_i$ for any indexed variables $\alpha_t$. A convex function $f$ is $L$-smooth if $f(x + \delta) \le f(x) + \langle\nabla f(x), \delta\rangle + \frac{L}{2}\|\delta\|^2$ for all $x, \delta$, and $f$ is $\mu$-strongly convex if $f(x + \delta) \ge f(x) + \langle\nabla f(x), \delta\rangle + \frac{\mu}{2}\|\delta\|^2$ for all $x, \delta$. Given a convex function $f$, we say that $g$ is a subgradient of $f$ at $x$, written $g \in \partial f(x)$, if $f(y) \ge f(x) + \langle g, y - x\rangle$ for all $y$. Note that $\nabla f(x) \in \partial f(x)$ if $f$ is differentiable.

## 2. Anytime Online-to-Batch

In this section we provide our anytime online-to-batch conversion. Our algorithm is actually nearly identical to the classic online-to-batch: we set the $t$-th iterate $x_t$ to be the (weighted) average of the first $t$ iterates of some online learning algorithm $\mathcal{A}$. The key difference is that we evaluate the stochastic gradient oracle at $x_t$, rather than at the iterates provided by $\mathcal{A}$. As a result, the outputs of $\mathcal{A}$ in some sense exist only for analysis and are not directly visible outside the algorithm. Further, we incorporate weights $\alpha_t$ into our conversion. Inspired by Levy (2017), these weights play a role in achieving faster rates on smooth losses, as well as removing log factors on strongly-convex losses. We provide specific pseudocode and analysis in Algorithm 1 and Theorem 1 below.

**Algorithm 1** Anytime Online-to-Batch
Input: online learning algorithm $\mathcal{A}$ with convex domain $D$; non-negative weights $\alpha_1, \dots, \alpha_T$ with $\alpha_1 > 0$.
Get initial point $w_1 \in D$ from $\mathcal{A}$.
for $t = 1$ to $T$ do
  $x_t \leftarrow \frac{\sum_{i=1}^t \alpha_i w_i}{\alpha_{1:t}}$
  Play $x_t$, receive subgradient $g_t$.
  Send $\ell_t(x) = \alpha_t\langle g_t, x\rangle$ to $\mathcal{A}$ as the $t$-th loss.
  Get $w_{t+1}$ from $\mathcal{A}$.
end for
return $x_T$
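Below is a minimal Python sketch of Algorithm 1 wrapping a generic online learner. The `initial_point`/`update` interface and the `ProjectedOGD` learner are illustrative assumptions made here, not part of the paper; any online learner with a linearized regret guarantee could be plugged in.

```python
import numpy as np

class ProjectedOGD:
    """Illustrative online learner A: projected online (sub)gradient descent
    over the Euclidean ball of a given radius (not prescribed by the paper)."""
    def __init__(self, dim, radius=1.0, lr=0.5):
        self.w, self.radius, self.lr, self.t = np.zeros(dim), radius, lr, 0
    def initial_point(self):
        return self.w.copy()
    def update(self, g):                       # g is the fed loss gradient alpha_t * g_t
        self.t += 1
        w = self.w - (self.lr / np.sqrt(self.t)) * g
        n = np.linalg.norm(w)
        self.w = w if n <= self.radius else w * (self.radius / n)
        return self.w.copy()

def anytime_online_to_batch(learner, grad_oracle, alphas):
    """Algorithm 1 (sketch): query stochastic gradients at the weighted running
    average x_t of the learner's iterates, and feed the linear loss
    alpha_t * <g_t, .> back to the learner."""
    w = learner.initial_point()
    weighted_sum, alpha_sum, x = np.zeros_like(w), 0.0, None
    for alpha in alphas:
        alpha_sum += alpha
        weighted_sum = weighted_sum + alpha * w
        x = weighted_sum / alpha_sum           # x_t = (sum_i alpha_i w_i) / alpha_{1:t}
        g = grad_oracle(x)                     # gradient queried at x_t, not at w_t
        w = learner.update(alpha * g)          # send l_t(w) = alpha_t <g_t, w>, get w_{t+1}
    return x                                   # x_T (every x_t is already a valid estimate)

# Toy usage with the same illustrative quadratic as before.
rng = np.random.default_rng(0)
x_star = np.array([0.3, -0.2])
oracle = lambda x: (x - x_star) + 0.1 * rng.standard_normal(2)
print(anytime_online_to_batch(ProjectedOGD(dim=2), oracle, alphas=np.ones(2000)))
```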
**Theorem 1.** Suppose $g_1, \dots, g_T$ satisfy $\mathbb{E}[g_t \mid x_t] \in \partial\mathcal{L}(x_t)$ for some convex function $\mathcal{L}$, and $g_t$ is independent of all other quantities given $x_t$. Let $R_T(x^\star)$ be a bound on the linearized regret of $\mathcal{A}$:
$$\sum_{t=1}^T \alpha_t\langle g_t, w_t - x^\star\rangle \le R_T(x^\star).$$
Then for all $x^\star \in D$, Algorithm 1 guarantees
$$\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le \mathbb{E}\left[\frac{R_T(x^\star)}{\alpha_{1:T}}\right].$$
Further, suppose that $D$ has diameter $B = \sup_{x,y\in D}\|x - y\|$ and $\|g_t\| \le G$ with probability 1 for some $G$. Then with probability at least $1 - \delta$,
$$\mathcal{L}(x_T) - \mathcal{L}(x^\star) \le \frac{R_T(x^\star) + 2BG\sqrt{\sum_{t=1}^T \alpha_t^2 \log(2/\delta)}}{\alpha_{1:T}}.$$

**Proof.** First, observe that $\alpha_t(x_t - w_t) = \alpha_{1:t-1}(x_{t-1} - x_t)$, where by mild abuse of notation we define $\alpha_{1:0} = 0$ and let $x_0$ be an arbitrary element of $D$. Now we use the standard convexity argument to say:
$$\mathbb{E}\left[\sum_{t=1}^T \alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))\right] \le \mathbb{E}\left[\sum_{t=1}^T \alpha_t\langle g_t, x_t - x^\star\rangle\right] = \mathbb{E}\left[\sum_{t=1}^T \alpha_t\langle g_t, x_t - w_t\rangle + \alpha_t\langle g_t, w_t - x^\star\rangle\right] \le \mathbb{E}[R_T(x^\star)] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t-1}\langle g_t, x_{t-1} - x_t\rangle\right].$$
Next we use convexity again to argue $\mathbb{E}[\langle g_t, x_{t-1} - x_t\rangle] \le \mathbb{E}[\mathcal{L}(x_{t-1}) - \mathcal{L}(x_t)]$, and then we subtract $\mathbb{E}[\sum_{t=1}^T \alpha_t\mathcal{L}(x_t)]$ from both sides:
$$\mathbb{E}[-\alpha_{1:T}\mathcal{L}(x^\star)] \le \mathbb{E}[R_T(x^\star)] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t-1}\mathcal{L}(x_{t-1}) - \alpha_{1:t}\mathcal{L}(x_t)\right].$$
Finally, telescope the above sum to conclude
$$\mathbb{E}[\alpha_{1:T}\mathcal{L}(x_T) - \alpha_{1:T}\mathcal{L}(x^\star)] \le \mathbb{E}[R_T(x^\star)],$$
from which the in-expectation statement of the Theorem follows.

For the high-probability statement, let $H_{t-1}$ be the history $g_{t-1}, x_{t-1}, \dots, g_1, x_1$. Let $G_t = \mathbb{E}[g_t \mid H_{t-1}, x_t, w_t]$. Note that $G_t$ is still a random variable, and satisfies $G_t \in \partial\mathcal{L}(x_t)$. Next, let $\epsilon_t = \alpha_t\langle G_t, w_t - x^\star\rangle - \alpha_t\langle g_t, w_t - x^\star\rangle$. Then we have $\mathbb{E}[\epsilon_t \mid H_{t-1}, x_t, w_t] = 0$ and $|\epsilon_t| \le 2\alpha_t BG$ with probability 1. So by the Azuma-Hoeffding bound, with probability at least $1 - \delta$:
$$\sum_{t=1}^T \epsilon_t \le 2BG\sqrt{\sum_{t=1}^T \alpha_t^2 \log(2/\delta)}.$$
Therefore with probability at least $1 - \delta$, we have
$$\sum_{t=1}^T \alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star)) \le \sum_{t=1}^T \alpha_t\langle G_t, x_t - x^\star\rangle = \sum_{t=1}^T \alpha_t\langle G_t, x_t - w_t\rangle + \sum_{t=1}^T \alpha_t\langle g_t, w_t - x^\star\rangle + \sum_{t=1}^T \epsilon_t \le \sum_{t=1}^T \alpha_t\langle G_t, x_t - w_t\rangle + R_T(x^\star) + 2BG\sqrt{\sum_{t=1}^T \alpha_t^2 \log(2/\delta)}.$$
Now an identical argument to the in-expectation part of the Theorem (but without the need for taking expectations) yields
$$\mathcal{L}(x_T) - \mathcal{L}(x^\star) \le \frac{R_T(x^\star) + 2BG\sqrt{\sum_{t=1}^T \alpha_t^2 \log(2/\delta)}}{\alpha_{1:T}}. \qquad\square$$
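The first step of the proof, the identity $\alpha_t(x_t - w_t) = \alpha_{1:t-1}(x_{t-1} - x_t)$, is just a rearrangement of the weighted-averaging update; here is a quick numerical sanity check (illustrative only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
alphas = rng.uniform(0.5, 2.0, size=10)        # arbitrary positive weights
ws = rng.standard_normal((10, 3))              # arbitrary "learner iterates" w_t

prev_x, prev_sum = None, 0.0
weighted = np.zeros(3)
for alpha, w in zip(alphas, ws):
    weighted += alpha * w
    x = weighted / (prev_sum + alpha)          # x_t = (sum_i alpha_i w_i) / alpha_{1:t}
    if prev_x is not None:                     # identity holds for t >= 2
        assert np.allclose(alpha * (x - w), prev_sum * (prev_x - x))
    prev_x, prev_sum = x, prev_sum + alpha
print("checked: alpha_t (x_t - w_t) = alpha_{1:t-1} (x_{t-1} - x_t)")
```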
As a corollary, we observe that the simple setting of $\alpha_t = 1$ for all $t$ yields a direct analog of the classic online-to-batch conversion guarantee:

**Corollary 1.** Under the assumptions of Theorem 1, set $\alpha_t = 1$ for all $t$. Then $R_T(x^\star) = \sum_{t=1}^T \langle g_t, w_t - x^\star\rangle$, which is the usual un-weighted regret. We have
$$\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le \mathbb{E}\left[\frac{R_T(x^\star)}{T}\right].$$
Further, $x_T = \frac{1}{T}\sum_{t=1}^T w_t$.

Corollary 1 is quite similar to the classic online-to-batch conversion result: in both cases, the average of the online learner's predictions has excess loss bounded by the average regret. Again, the critical difference is that in Algorithm 1, the points where the gradients are actually evaluated are the averaged outputs of the online learner. Thus the loss of the iterates converges to the minimum loss for Algorithm 1, which is not the case for the standard reduction.

In addition to this anytime online-to-batch result, we show below that Algorithm 1 also maintains low regret:

**Corollary 2.** Under the assumptions of Theorem 1, let $R^M(x^\star) \ge \max_t R_t(x^\star)$. Then we have
$$\mathbb{E}\left[\sum_{t=1}^T \alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))\right] \le R^M(x^\star)\left(1 + \log\frac{\alpha_{1:T}}{\alpha_1}\right).$$

**Proof.** From Theorem 1 we have
$$\mathbb{E}[\alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))] \le \mathbb{E}\left[\frac{\alpha_t R^M(x^\star)}{\alpha_{1:t}}\right].$$
Then observe that $\log(a) + b/(a+b) \le \log(a+b)$ and sum over $t$ to conclude the Corollary. $\square$

Recall that essentially all online learning regret bounds are non-decreasing in $T$, so that $\max_t R_t(x^\star) = R_T(x^\star)$. Thus the regret of Algorithm 1 is only a logarithmic factor worse than the regret of the original online learner. Moreover, in the typical case that $R_t(x^\star) = O(\sqrt{t})$, a trivial modification of the above proof shows that $\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le O(1/\sqrt{T})$, so that in many cases one should not even incur the log factor.

In fact, the anytime result is significantly more powerful than a standard regret bound because it provides point-wise bounds. This allows us to achieve a variety of different weighted regret bounds simultaneously:

**Corollary 3.** Under the assumptions of Theorem 1, further suppose that $R_T(x^\star)$ is non-decreasing in $T$ and set $\alpha_t = 1$. Let $s_t = t^k$ for some constant $k > 0$ (note that Algorithm 1 is not aware of $s_t$). Then
$$\mathbb{E}\left[\frac{\sum_{t=1}^T s_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))}{s_{1:T}}\right] \le O\left(\frac{\mathbb{E}[R_T(x^\star)]}{T}\right).$$

**Proof.** Observe $s_{1:T} = \Theta(T^{k+1})$, so that $\mathbb{E}[s_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))/s_{1:T}] \le O(\mathbb{E}[R_T(x^\star)]\,t^{k-1}/T^{k+1})$, and sum over $t$. $\square$

## 3. General Analysis

In this section we provide a more general version of our online-to-batch reduction. The previous analysis appears to rely critically on the linearized regret bound $\mathbb{E}[\sum_{t=1}^T \alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))] \le \mathbb{E}[\sum_{t=1}^T \alpha_t\langle g_t, x_t - x^\star\rangle]$. This inequality may be tight for general convex losses, but in many cases we may want to take advantage of some known non-linearity in the losses. For example, when the loss function is $\mu$-strongly convex, one can use the inequality $\mathcal{L}(x_t) - \mathcal{L}(x^\star) \le \ell_t(x_t) - \ell_t(x^\star)$ where $\ell_t(x) = \langle\nabla\mathcal{L}(x_t), x\rangle + \frac{\mu}{2}\|x - x_t\|^2$, leading to an $O(\log(T)/T)$ convergence rate rather than $O(1/\sqrt{T})$ (Hazan et al., 2007). In order to incorporate this information in our framework, we propose Algorithm 2.

Algorithm 2 modifies Algorithm 1 by considering an oracle that produces losses $\ell_t$ rather than stochastic gradients $g_t$. Specifically, we will require $\ell_t$ that are convex and lower-bound $\mathcal{L}$ in expectation. This generalizes the linear losses of Algorithm 1, and it may often be possible to construct nonlinear $\ell_t$ via only a gradient oracle, such as in the strongly-convex case. Our strategy for using these losses is essentially unchanged from that of Algorithm 1, but now our analysis is slightly more delicate since we cannot exploit the nice algebraic properties of linearity.

**Algorithm 2** General Anytime Online-to-Batch
Input: online learning algorithm $\mathcal{A}$ with convex domain $D$; non-negative weights $\alpha_1, \dots, \alpha_T$ with $\alpha_1 > 0$.
Get initial point $w_1 \in D$ from $\mathcal{A}$.
for $t = 1$ to $T$ do
  $x_t \leftarrow \frac{\sum_{i=1}^t \alpha_i w_i}{\alpha_{1:t}}$
  Play $x_t$, compute loss $\ell_t$.
  Send $\alpha_t\ell_t(x)$ to $\mathcal{A}$ as the $t$-th loss.
  Get $w_{t+1}$ from $\mathcal{A}$.
end for
return $x_T$

**Theorem 2.** Suppose $\ell_t$ is convex and satisfies $\mathcal{L}(x_t) - \mathcal{L}(x) \le \mathbb{E}[\ell_t(x_t) - \ell_t(x) \mid x_t]$ for all $t$ and for all $x$. Then, with $R_T(x^\star)$ a bound on the regret $\sum_{t=1}^T \alpha_t\ell_t(w_t) - \alpha_t\ell_t(x^\star)$, Algorithm 2 obtains
$$\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le \mathbb{E}\left[\frac{R_T(x^\star)}{\alpha_{1:T}}\right].$$

**Proof.** We have
$$\sum_{t=1}^T \alpha_t(\ell_t(x_t) - \ell_t(x^\star)) = \sum_{t=1}^T \alpha_t(\ell_t(x_t) - \ell_t(w_t)) + \sum_{t=1}^T \alpha_t(\ell_t(w_t) - \ell_t(x^\star)) \le R_T(x^\star) + \sum_{t=1}^T \alpha_t(\ell_t(x_t) - \ell_t(w_t)). \qquad (1)$$
Now observe that $x_t = \frac{\alpha_{1:t-1}x_{t-1} + \alpha_t w_t}{\alpha_{1:t}}$. Therefore by Jensen's inequality we have
$$\ell_t(x_t) \le \frac{\alpha_{1:t-1}\ell_t(x_{t-1}) + \alpha_t\ell_t(w_t)}{\alpha_{1:t}}, \qquad\text{so that}\qquad \alpha_t\ell_t(x_t) - \alpha_t\ell_t(w_t) \le \alpha_{1:t-1}(\ell_t(x_{t-1}) - \ell_t(x_t)).$$
Now plug this into (1):
$$\sum_{t=1}^T \alpha_t(\ell_t(x_t) - \ell_t(x^\star)) \le R_T(x^\star) + \sum_{t=1}^T \alpha_{1:t-1}(\ell_t(x_{t-1}) - \ell_t(x_t)).$$
Now observe that $\mathbb{E}[\ell_t(x_{t-1}) - \ell_t(x_t)] \le \mathbb{E}[\mathcal{L}(x_{t-1}) - \mathcal{L}(x_t)]$. So taking expectations yields
$$\mathbb{E}\left[\sum_{t=1}^T \alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))\right] \le \mathbb{E}[R_T(x^\star)] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t-1}(\mathcal{L}(x_{t-1}) - \mathcal{L}(x_t))\right].$$
Now the rest of the proof is identical to that of Theorem 1. $\square$

### 3.1. Strongly Convex Losses

In this section we apply the more general Algorithm 2 to $\mu$-strongly-convex losses. We recover standard convergence rates using only a gradient oracle and knowledge of the strong-convexity parameter $\mu$. We note that similar results also apply to exp-concave losses or other cases with lower-bounded Hessians.
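As a concrete sketch of this instantiation (anticipating Corollaries 4 and 5 below), the snippet runs Algorithm 2 with the surrogate losses $\ell_t(x) = \langle g_t, x\rangle + \frac{\mu}{2}\|x - x_t\|^2$ and Follow-the-Leader as the online learner. The closed-form FTL update and the ball projection are implementation choices made here for illustration, not prescribed by the paper.

```python
import numpy as np

def anytime_strongly_convex(grad_oracle, mu, alphas, dim, radius=1.0):
    """Algorithm 2 (sketch) with surrogates l_t(w) = <g_t, w> + (mu/2)||w - x_t||^2
    and Follow-the-Leader.  Each weighted sum of surrogates is an isotropic
    quadratic, so its constrained minimizer over a ball is the projection of
    the unconstrained minimizer."""
    def project(w):
        n = np.linalg.norm(w)
        return w if n <= radius else w * (radius / n)
    w = np.zeros(dim)                          # w_1
    x_num, a_sum = np.zeros(dim), 0.0          # running sums defining x_t
    ftl_num = np.zeros(dim)                    # sum_i alpha_i (x_i - g_i / mu)
    x = None
    for alpha in alphas:                       # all-ones (Cor. 4) or 1, 2, ..., T (Cor. 5)
        a_sum += alpha
        x_num += alpha * w
        x = x_num / a_sum                      # x_t
        g = grad_oracle(x)                     # stochastic gradient at x_t
        ftl_num += alpha * (x - g / mu)
        w = project(ftl_num / a_sum)           # FTL minimizer of summed surrogates -> w_{t+1}
    return x
```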
**Corollary 4.** Suppose $D$ has diameter $B$, $\|g_t\| \le G$ with probability 1, and $\mathcal{A}$ is Follow-the-Leader: $w_{t+1} = \operatorname{argmin}_{w\in D}\sum_{i=1}^t \ell_i(w)$. Suppose $\mathcal{L}$ is $\mu$-strongly convex and we set $\ell_t(x) = \langle g_t, x\rangle + \frac{\mu}{2}\|x - x_t\|^2$ where $\mathbb{E}[g_t \mid x_t] = \nabla\mathcal{L}(x_t)$. Let $\alpha_t = 1$ for all $t$. Then we have
$$R_T(x^\star) \le \frac{(\mu B + G)^2(\log(T) + 1)}{\mu}, \qquad \mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le \frac{(\mu B + G)^2(\log(T) + 1)}{\mu T}.$$

**Proof.** The fact that $\mathcal{L}(x_t) - \mathcal{L}(x^\star) \le \mathbb{E}[\ell_t(x_t) - \ell_t(x^\star) \mid x_t]$ follows from strong convexity. Observe that $\|\nabla\ell_t(w_t)\| = \|g_t + \mu(w_t - x_t)\| \le G + \mu B$, so that $\ell_t$ is $(G + \mu B)$-Lipschitz on $D$. The bound on $R_T$ then follows from the standard analysis of the Follow-the-Leader algorithm using the fact that $\sum_{i=1}^t \ell_i(w)$ is $t\mu$-strongly convex (McMahan, 2014), together with $\sum_{i=1}^T 1/i \le \log(T) + 1$. $\square$

This corollary provides the anytime analog of the standard online-to-batch result for strongly-convex losses. However, it is well-known that in the stochastic case the logarithmic factor is unnecessary. Prior work has removed it via diverse mechanisms, including restarting schemes (Hazan & Kale, 2014) and tail-averaging (Rakhlin et al., 2012). Here we use the weights $\alpha_t$ to easily remove the log factor, similar to the analogous scheme for the classic online-to-batch conversion (Lacoste-Julien et al., 2012; Bubeck et al., 2015).

**Corollary 5.** Under the assumptions of Corollary 4, suppose that $\alpha_t = t$ for all $t$. Then we have
$$R_T(x^\star) \le \frac{T(\mu B + G)^2}{\mu}, \qquad \mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le \frac{2(\mu B + G)^2}{\mu(T+1)}.$$

**Proof.** In this case, $\alpha_t\ell_t$ is $t(\mu B + G)$-Lipschitz and $\sum_{i=1}^t \alpha_i\ell_i$ is $\alpha_{1:t}\mu = \frac{t(t+1)\mu}{2}$-strongly convex. Thus the regret of Follow-the-Leader is bounded by $\frac{T(\mu B + G)^2}{\mu}$. Now divide by $\alpha_{1:T} = T(T+1)/2$ to see the claim. $\square$

## 4. Adaptivity and Smoothness

Many so-called adaptive online algorithms obtain regret bounds of the form $R_T(x^\star) \le O\big(\psi(x^\star)\sqrt{\sum_{t=1}^T \|g_t\|^2}\big)$ for various functions $\psi$. For example, Mirror-Descent and FTRL-based algorithms often obtain $\psi(x^\star) = B$, where $B$ is the diameter of the space $D$ (McMahan & Streeter, 2010; Duchi et al., 2010; Hazan et al., 2008), while so-called parameter-free algorithms can obtain $\psi(x^\star) = O(\|x^\star\|)$, providing optimal adaptivity to $x^\star$ at the expense of logarithmic factors (Cutkosky & Orabona, 2018). These adaptive bounds can be shown to obtain the better guarantee
$$\mathbb{E}\left[\sum_{t=1}^T \mathcal{L}(w_t) - \mathcal{L}(x^\star)\right] \le O\big(L\psi(x^\star)^2 + \psi(x^\star)\sigma\sqrt{T}\big)$$
when the loss $\mathcal{L}$ is $L$-smooth and $g_t$ has variance $\sigma^2$, by exploiting the self-bounding property $\|\nabla\mathcal{L}(x)\|^2 \le 2L(\mathcal{L}(x) - \mathcal{L}(x^\star))$ (Srebro et al., 2010; Cutkosky & Busa-Fekete, 2018; Levy et al., 2018). The appealing property of this argument is that the algorithm knows neither $L$ nor $\sigma$ and yet automatically adapts to both parameters, matching the performance of an optimally-tuned SGD algorithm. Since Algorithm 1 also obtains low regret, we can make a similar claim:

**Corollary 6.** Suppose $R_T(x^\star) \le \psi(x^\star)\sqrt{\sum_{t=1}^T \alpha_t^2\|g_t\|^2}$. Suppose $\mathcal{L}$ is $L$-smooth and obtains its minimum at $x^\star \in D$. Suppose $g_t$ has variance at most $\sigma^2$. Then with $\alpha_t = 1$ for all $t$, Algorithm 1 obtains:
$$\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le O\left(\frac{\psi(x^\star)^2 L\log^2(T)}{T} + \frac{\sigma\,\psi(x^\star)\log(T)}{\sqrt{T}}\right).$$

**Proof.** Define $\Delta_t = \mathbb{E}[\mathcal{L}(x_t) - \mathcal{L}(x^\star)]$. Observe that $\mathbb{E}[\|g_t\|^2] \le \mathbb{E}[\|\nabla\mathcal{L}(x_t)\|^2] + \sigma^2 \le 2L\Delta_t + \sigma^2$, so that
$$\mathbb{E}[R_T(x^\star)] \le \psi(x^\star)\sqrt{2L\sum_{t=1}^T \Delta_t + T\sigma^2}.$$
Then apply Corollary 2 and the quadratic formula to obtain $\sum_{t=1}^T \Delta_t \le O\big(\psi(x^\star)^2 L\log^2(T) + \sigma\,\psi(x^\star)\sqrt{T}\log(T)\big)$, recall $\alpha_t = 1$, and observe $\Delta_T \le \mathbb{E}[R_T(x^\star)]/T$ to prove the Corollary. $\square$

The assumption that $x^\star \in D$ and the log factors in this analysis are a bit troubling. In the next section we exploit optimism instead of the self-bounding property, which yields much better results with much less effort.
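For concreteness, a regret bound of the form assumed in Corollary 6 (and, up to an absolute constant $k$, in Theorem 4 below) is achieved for example by projected gradient descent with an AdaGrad-style global step size. The sketch below is one such illustrative learner with $\psi(x^\star)$ proportional to the diameter; it is not the specific algorithm analyzed in the cited works.

```python
import numpy as np

class AdaGradNormOGD:
    """Projected OGD with a global AdaGrad-style step size.  On a domain of
    radius R it enjoys regret of order R * sqrt(sum_t ||h_t||^2) on the linear
    losses <h_t, .> it is fed; when driven by Algorithm 1 those losses are
    h_t = alpha_t * g_t, giving a bound of the type assumed in Corollary 6."""
    def __init__(self, dim, radius):
        self.w = np.zeros(dim)
        self.radius = radius                   # domain D = {||w|| <= radius}
        self.sq_sum = 0.0                      # running sum of ||h_t||^2
    def initial_point(self):
        return self.w.copy()
    def update(self, h):
        self.sq_sum += float(h @ h)
        eta = self.radius / np.sqrt(self.sq_sum + 1e-12)
        w = self.w - eta * h
        n = np.linalg.norm(w)
        self.w = w if n <= self.radius else w * (self.radius / n)
        return self.w.copy()
```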
### 4.1. Optimism for Faster Rates

In this section we show how to leverage our online-to-batch scheme in combination with optimistic online learning to further speed up the convergence rate. We will achieve a rate of $O(L/T^{3/2} + \sigma/\sqrt{T})$ with no knowledge of either $L$ or $\sigma$, resulting in a kind of interpolation between the $O(L/T + \sigma/\sqrt{T})$ rate and the optimal accelerated rate of $O(L/T^2 + \sigma/\sqrt{T})$ (Lan, 2012).

An optimistic online learning algorithm is an online learner that is given access to a series of hints $\hat{g}_1, \dots, \hat{g}_T$, where $\hat{g}_t$ is revealed to the learner after $g_{t-1}$ but before it commits to $w_t$ (Hazan & Kale, 2010; Rakhlin & Sridharan, 2013; Chiang et al., 2012; Mohri & Yang, 2016). Optimistic algorithms attempt to guarantee small regret when $\hat{g}_t \approx g_t$, because in this scenario the learner has a good guess for what the future will contain. In particular, the optimistic algorithm of Mohri & Yang (2016) guarantees regret
$$R_T(x^\star) \le B\sqrt{2\sum_{t=1}^T \alpha_t^2\|\hat{g}_t - g_t\|^2},$$
where $B$ is the diameter of $D$. A common choice for $\hat{g}_t$ is $g_{t-1}$. Intuitively, this choice is optimistic in the sense that we are hoping $g_{t-1} \approx g_t$, which is the case on smooth losses if the iterates are close together. Fortunately, it is the case that $x_t$ is necessarily close to $x_{t-1}$, so we use this regret bound for faster convergence in Algorithm 3 and Theorem 3.

**Algorithm 3** Optimistic Anytime Online-to-Batch
Input: optimistic online algorithm $\mathcal{A}$ with domain $D$; non-negative weights $\alpha_1, \dots, \alpha_T$ with $\alpha_1 > 0$.
Get initial point $w_1 \in D$ from $\mathcal{A}$. Set $g_0 = 0$.
for $t = 1$ to $T$ do
  Send $\alpha_t g_{t-1}$ to $\mathcal{A}$ as the $t$-th hint.
  $x_t \leftarrow \frac{\sum_{i=1}^t \alpha_i w_i}{\alpha_{1:t}}$
  Play $x_t$, receive subgradient $g_t$.
  Send $\ell_t(x) = \alpha_t\langle g_t, x\rangle$ to $\mathcal{A}$ as the $t$-th loss.
  Get $w_{t+1}$ from $\mathcal{A}$.
end for
return $x_T$

**Theorem 3.** Suppose $D$ has diameter $B$ and $\mathcal{A}$ obtains the regret bound $R_T(x^\star) \le B\sqrt{2\sum_{t=1}^T \alpha_t^2\|\hat{g}_t - g_t\|^2}$ when given hints $\hat{g}_t$ ahead of the gradients $g_t$. Set $\alpha_t = t$ for all $t$. Suppose each $g_t$ has variance at most $\sigma^2$, and $\mathcal{L}$ is $L$-smooth. Then Algorithm 3 yields:
$$\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le O\left(\frac{LB^2}{T^{3/2}} + \frac{\sigma B}{\sqrt{T}}\right).$$

**Proof.** Since we set $\hat{g}_t = g_{t-1}$, the assumption on $\mathcal{A}$ implies
$$R_T(x^\star) \le B\sqrt{2\sum_{t=1}^T \alpha_t^2\|g_{t-1} - g_t\|^2}.$$
We can write $g_t = \nabla\mathcal{L}(x_t) + \zeta_t$ where $\zeta_t$ is some mean-zero random variable with $\mathbb{E}[\|\zeta_t\|^2] \le \sigma^2$. Then by smoothness, for $t > 1$ we have
$$\|g_t - g_{t-1}\| \le \|\nabla\mathcal{L}(x_t) - \nabla\mathcal{L}(x_{t-1})\| + \|\zeta_t - \zeta_{t-1}\| \le L\|x_t - x_{t-1}\| + \|\zeta_t\| + \|\zeta_{t-1}\| \le \frac{L\alpha_t B}{\alpha_{1:t}} + \|\zeta_t\| + \|\zeta_{t-1}\|,$$
so that
$$\mathbb{E}[\|\hat{g}_t - g_t\|^2] \le \frac{5L^2\alpha_t^2 B^2}{(\alpha_{1:t})^2} + 10\sigma^2,$$
where in the last step we used $(a + b + c)^2 \le 5(a^2 + b^2 + c^2)$. Further, for $t = 1$ we have $\mathbb{E}[\|g_1\|^2] \le \mathbb{E}[(\|\nabla\mathcal{L}(x_1) - \nabla\mathcal{L}(x^\star)\| + \|\zeta_1\|)^2]$, so that
$$\mathbb{E}[\|g_1 - \hat{g}_1\|^2] \le 2L^2B^2 + 2\sigma^2 \le \frac{5L^2B^2\alpha_1^2}{(\alpha_{1:1})^2} + 10\sigma^2.$$
Next, observe that $\alpha_{1:t} > t^2/2$, so that
$$\mathbb{E}[\|\hat{g}_t - g_t\|^2] \le \frac{20L^2B^2}{t^2} + 10\sigma^2.$$
Now observe $\sum_{t=1}^T t^2 < \frac{3(T+1)^3}{2}$ and apply Jensen's inequality:
$$\mathbb{E}[R_T(x^\star)] \le B\sqrt{2\sum_{t=1}^T \alpha_t^2\,\mathbb{E}[\|\hat{g}_t - g_t\|^2]} \le B\sqrt{30(T+1)^3\sigma^2 + 40L^2B^2T}.$$
And by Theorem 1 we have the desired result:
$$\mathbb{E}[\mathcal{L}(x_T) - \mathcal{L}(x^\star)] \le \frac{2B\sqrt{30(T+1)^3\sigma^2 + 40L^2B^2T}}{T(T+1)} \le O\left(\frac{\sigma B}{\sqrt{T}} + \frac{LB^2}{T^{3/2}}\right). \qquad\square$$

Note that the ordinary online-to-batch conversion may not be able to obtain this rate: here we are critically relying on the stability of the iterates $x_t$ to guarantee that $g_t$ and $g_{t-1}$ are not too far apart, while in the standard online-to-batch conversion one would require stability in the $w_t$, which may not occur.
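A hedged sketch of Algorithm 3 as a wrapper: compared to Algorithm 1 it only adds the hint $\alpha_t g_{t-1}$ before each round. The `hint`/`update`/`initial_point` interface of the optimistic learner is an assumption made here for illustration; an optimistic algorithm in the style of Mohri & Yang (2016) would supply the regret bound used in Theorem 3.

```python
import numpy as np

def optimistic_anytime_online_to_batch(learner, grad_oracle, alphas, dim):
    """Algorithm 3 (sketch): identical to Algorithm 1 except that the previous
    gradient, scaled by alpha_t, is passed to the optimistic learner as the
    hint for the coming round (with g_0 = 0)."""
    w = learner.initial_point()
    g_prev = np.zeros(dim)                     # g_0 = 0
    weighted_sum, alpha_sum, x = np.zeros(dim), 0.0, None
    for alpha in alphas:                       # e.g. alpha_t = t as in Theorem 3
        learner.hint(alpha * g_prev)           # hint: alpha_t * g_{t-1}
        alpha_sum += alpha
        weighted_sum += alpha * w
        x = weighted_sum / alpha_sum           # x_t
        g = grad_oracle(x)                     # subgradient at x_t
        w = learner.update(alpha * g)          # loss alpha_t <g_t, .>; get w_{t+1}
        g_prev = g
    return x
```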
### 4.2. Acceleration

In the deterministic setting, Levy et al. (2018) showed how to use adaptive step-sizes in conjunction with the linear-coupling framework (Allen-Zhu & Orecchia, 2014) to derive an accelerated algorithm that adapts to the smoothness parameter $L$. In this section we show that our Algorithm 1 and its analysis are actually very similar in spirit to the linear-coupling scheme, and so we can also derive an accelerated algorithm that adapts to both smoothness and variance optimally. To our knowledge this is the first accelerated algorithm to adapt to variance. Our analysis is arguably simpler than prior work: our proof is much shorter, we rely on only relatively simple properties of $\alpha_t$, and we do not use the internals of the online algorithm. Unlike previously in this paper, but similar to Levy et al. (2018), here we will require $\mathcal{L}$ to be defined on an entire vector space rather than a potentially bounded domain $D$. We will also assume knowledge of some parameter $B$ such that $\|x^\star\| \le B/2$. Lifting these restrictions is a valuable direction for future work.

**Algorithm 4** Adaptive Stochastic Acceleration
Input: bound $B \ge 2\|x^\star\|$, value $c$, online learning algorithm $\mathcal{A}$ with domain $D = \{w : \|w\| \le B/2\}$.
Get initial point $w_1 \in D$ from $\mathcal{A}$. Set $y_0 \leftarrow w_1$.
for $t = 1$ to $T$ do
  $\alpha_t \leftarrow t$, $\tau_t \leftarrow \frac{\alpha_t}{\sum_{i=1}^t \alpha_i}$
  $x_t \leftarrow (1 - \tau_t)y_{t-1} + \tau_t w_t$
  Play $x_t$, receive subgradient $g_t$.
  $\eta_t \leftarrow \frac{cB}{\sqrt{1 + \sum_{i=1}^t \alpha_{1:i}\|g_i\|^2}}$
  $y_t \leftarrow x_t - \eta_t g_t$
  Send $\ell_t(x) = \alpha_t\langle g_t, x\rangle$ to $\mathcal{A}$ as the $t$-th loss.
  Get $w_{t+1}$ from $\mathcal{A}$.
end for
return $y_T$

**Theorem 4.** Suppose $\mathbb{E}[g_t] = \nabla\mathcal{L}(x_t)$ for some $L$-smooth function $\mathcal{L}$ whose domain is an entire Hilbert space $H$. Suppose $\|g_t\| \le G$ with probability 1 and $g_t$ has variance at most $\sigma^2$ for all $t$. Suppose $\|x^\star\| \le B/2$. Let $D$ be the ball of radius $B/2$ in $H$ and suppose $\mathcal{A}$ guarantees regret
$$R_T(x^\star) \le kB\sqrt{\sum_{t=1}^T \alpha_t^2\|g_t\|^2}$$
for some $k$. Then with $c = 2k$, Algorithm 4 guarantees:
$$\mathbb{E}[\mathcal{L}(y_T) - \mathcal{L}(x^\star)] \le \frac{4kB\sigma\sqrt{\log(1 + G^2T^3)}}{\sqrt{T}} + \frac{4k^2LB^2\log(1 + G^2T^3) + 4kB}{T^2}.$$

**Proof.** The opening of our proof is again very similar to that of Theorem 1: since $x_t = \frac{\alpha_{1:t-1}y_{t-1} + \alpha_t w_t}{\alpha_{1:t}}$, the same argument as before gives
$$\mathbb{E}\left[\sum_{t=1}^T \alpha_t(\mathcal{L}(x_t) - \mathcal{L}(x^\star))\right] \le \mathbb{E}[R_T(x^\star)] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t-1}\langle g_t, y_{t-1} - x_t\rangle\right].$$
Next we use convexity again to argue $\mathbb{E}[\langle g_t, y_{t-1} - x_t\rangle] \le \mathbb{E}[\mathcal{L}(y_{t-1}) - \mathcal{L}(x_t)]$, and then we subtract $\mathbb{E}[\sum_{t=1}^T \alpha_t\mathcal{L}(x_t)]$ from both sides:
$$\mathbb{E}[-\alpha_{1:T}\mathcal{L}(x^\star)] \le \mathbb{E}[R_T(x^\star)] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t-1}\mathcal{L}(y_{t-1}) - \alpha_{1:t}\mathcal{L}(x_t)\right]. \qquad (2)$$
Writing $-\alpha_{1:t}\mathcal{L}(x_t) = -\alpha_{1:t}\mathcal{L}(y_t) + \alpha_{1:t}(\mathcal{L}(y_t) - \mathcal{L}(x_t))$ in (2) gives
$$\mathbb{E}[-\alpha_{1:T}\mathcal{L}(x^\star)] \le \mathbb{E}[R_T(x^\star)] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t-1}\mathcal{L}(y_{t-1}) - \alpha_{1:t}\mathcal{L}(y_t)\right] + \mathbb{E}\left[\sum_{t=1}^T \alpha_{1:t}(\mathcal{L}(y_t) - \mathcal{L}(x_t))\right]. \qquad (3)$$
Now we use smoothness to relate $\mathcal{L}(y_t)$ to $\mathcal{L}(x_t)$. Defining $\zeta_t = g_t - \nabla\mathcal{L}(x_t)$ and $\beta_t = \alpha_{1:t}$, we have
$$\mathbb{E}[\mathcal{L}(y_t)] \le \mathbb{E}\left[\mathcal{L}(x_t) + \langle\nabla\mathcal{L}(x_t), y_t - x_t\rangle + \frac{L}{2}\|y_t - x_t\|^2\right] \le \mathbb{E}\left[\mathcal{L}(x_t) - \eta_t\|g_t\|^2 + \eta_t\langle\zeta_t, g_t\rangle + \frac{L\eta_t^2\|g_t\|^2}{2}\right].$$
Then multiply by $\beta_t$:
$$\mathbb{E}[\beta_t(\mathcal{L}(y_t) - \mathcal{L}(x_t))] \le \mathbb{E}\left[-\frac{cB\beta_t\|g_t\|^2}{\sqrt{1 + \sum_{i=1}^t \beta_i\|g_i\|^2}} + \frac{L\beta_t\eta_t^2\|g_t\|^2}{2} + \eta_t\beta_t\langle\zeta_t, g_t\rangle\right].$$
Next, we borrow Lemma A.2 from Levy et al. (2018): for positive numbers $x_1, \dots, x_n$,
$$\sqrt{\sum_{i=1}^n x_i} \le \sum_{i=1}^n \frac{x_i}{\sqrt{\sum_{i'=1}^i x_{i'}}}.$$
Also, observe from concavity of $\log$ that
$$\sum_{i=1}^n \frac{x_i}{1 + \sum_{i'=1}^i x_{i'}} \le \log\left(1 + \sum_{i=1}^n x_i\right).$$
Using these we obtain
$$\mathbb{E}\left[\sum_{t=1}^T \beta_t(\mathcal{L}(y_t) - \mathcal{L}(x_t))\right] \le \mathbb{E}\left[-cB\sqrt{\sum_{t=1}^T \beta_t\|g_t\|^2} + cB + \frac{c^2B^2L\log(1 + G^2\beta_{1:T})}{2} + \sum_{t=1}^T \eta_t\langle\zeta_t, \beta_t g_t\rangle\right].$$
Next, use Cauchy-Schwarz:
$$\sum_{t=1}^T \eta_t\langle\zeta_t, \beta_t g_t\rangle \le \sqrt{\sum_{t=1}^T \beta_t\|\zeta_t\|^2}\sqrt{\sum_{t=1}^T \beta_t\|g_t\|^2\eta_t^2} \le cB\sqrt{\sum_{t=1}^T \beta_t\|\zeta_t\|^2}\sqrt{\log(1 + G^2\beta_{1:T})},$$
and then use Jensen's inequality:
$$\mathbb{E}\left[\sum_{t=1}^T \eta_t\langle\zeta_t, \beta_t g_t\rangle\right] \le cB\sigma\sqrt{\beta_{1:T}\log(1 + G^2\beta_{1:T})},$$
where in the last line we observed $\mathbb{E}[\|\zeta_t\|^2] \le \sigma^2$. Combining everything, we have
$$\mathbb{E}[-\alpha_{1:T}\mathcal{L}(x^\star)] \le \mathbb{E}\left[R_T(x^\star) - cB\sqrt{\sum_{t=1}^T \beta_t\|g_t\|^2} + cB + \frac{c^2LB^2\log(1 + G^2\beta_{1:T})}{2} + cB\sigma\sqrt{\beta_{1:T}\log(1 + G^2\beta_{1:T})} + \sum_{t=1}^T \big(\alpha_{1:t-1}\mathcal{L}(y_{t-1}) - \alpha_{1:t}\mathcal{L}(y_t)\big)\right].$$
Now observe that $\alpha_{1:t} > \alpha_t^2/2$ and recall $R_T(x^\star) \le kB\sqrt{\sum_{t=1}^T \alpha_t^2\|g_t\|^2}$. Therefore, since $c = 2k$, we can cancel the $R_T(x^\star)$ term; observing $\beta_{1:T} \le \sum_{t=1}^T t^2 \le T^3$ and telescoping, we obtain
$$\mathbb{E}[\alpha_{1:T}(\mathcal{L}(y_T) - \mathcal{L}(x^\star))] \le cB + \frac{c^2B^2L\log(1 + G^2T^3)}{2} + cBT^{3/2}\sigma\sqrt{\log(1 + G^2T^3)},$$
and dividing by $\alpha_{1:T} = \frac{T(T+1)}{2}$ completes the proof. $\square$

We remark also that, similar to the algorithm of Levy et al. (2018), our Algorithm 4 is *universal* in the sense that for non-smooth losses we recover the $O(1/\sqrt{T})$ rate with no modifications. In fact, our analysis improves somewhat over Levy et al. (2018) in that we maintain an adaptive convergence rate in the non-smooth setting.¹

¹We suspect this same adaptive non-smooth rate can be achieved by Levy et al. (2018) via a similar improved analysis.
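A hedged sketch of Algorithm 4; the online learner interface is the same illustrative assumption as in the earlier sketches, and to obtain Theorem 4 the learner must supply the adaptive regret bound $kB\sqrt{\sum_t \alpha_t^2\|g_t\|^2}$ (a learner like the AdaGrad-style sketch in Section 4 is one such choice).

```python
import numpy as np

def adaptive_stochastic_acceleration(learner, grad_oracle, T, dim, B, c):
    """Algorithm 4 (sketch): linear coupling of the online learner's iterates
    w_t with descent iterates y_t, using the adaptive step size eta_t."""
    w = learner.initial_point()               # w_1, inside the ball of radius B/2
    y = w.copy()                              # y_0 = w_1
    alpha_sum = 0.0                           # alpha_{1:t}
    eta_denom = 1.0                           # 1 + sum_i alpha_{1:i} * ||g_i||^2
    for t in range(1, T + 1):
        alpha = float(t)                      # alpha_t = t
        alpha_sum += alpha
        tau = alpha / alpha_sum               # tau_t = alpha_t / alpha_{1:t}
        x = (1.0 - tau) * y + tau * w         # x_t = (1 - tau_t) y_{t-1} + tau_t w_t
        g = grad_oracle(x)                    # subgradient at x_t
        eta_denom += alpha_sum * float(g @ g) # add alpha_{1:t} * ||g_t||^2
        eta = c * B / np.sqrt(eta_denom)      # eta_t
        y = x - eta * g                       # descent step from x_t
        w = learner.update(alpha * g)         # send alpha_t <g_t, .> to A; get w_{t+1}
    return y                                  # Theorems 4 and 5 bound L(y_T)
```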
**Theorem 5.** Suppose $\mathbb{E}[g_t] = \nabla\mathcal{L}(x_t)$ for some convex function $\mathcal{L}$. Then Algorithm 4 guarantees:
$$\mathbb{E}[\mathcal{L}(y_T) - \mathcal{L}(x^\star)] \le \frac{2\,\mathbb{E}\left[R_T(x^\star) + cB\sqrt{\sum_{t=1}^T t^2\|\nabla\mathcal{L}(y_t)\|^2\log(1 + G^2T^3)}\right]}{T^2}.$$

Note that in the setting with $\|g_t\| \le G$ and $R_T(x^\star) = O\big(\sqrt{\sum_{t=1}^T \alpha_t^2\|g_t\|^2}\big)$, Theorem 5 implies a convergence rate of $O\big(\sqrt{\log(1 + G^2T^3)/T}\big)$.

**Proof.** We start from (3), and again proceed to relate $\mathcal{L}(y_t)$ to $\mathcal{L}(x_t)$, this time without the aid of smoothness:
$$\mathbb{E}[\mathcal{L}(y_t) - \mathcal{L}(x_t)] \le \mathbb{E}[\langle\nabla\mathcal{L}(y_t), y_t - x_t\rangle] \le \mathbb{E}[\|\nabla\mathcal{L}(y_t)\|\,\|g_t\|\,\eta_t].$$
So by Cauchy-Schwarz, again defining $\beta_t = \alpha_{1:t}$, we have
$$\mathbb{E}\left[\sum_{t=1}^T \beta_t(\mathcal{L}(y_t) - \mathcal{L}(x_t))\right] \le \mathbb{E}\left[\sum_{t=1}^T \beta_t\|\nabla\mathcal{L}(y_t)\|\,\|g_t\|\,\eta_t\right] \le \mathbb{E}\left[\sqrt{\sum_{t=1}^T \beta_t\|\nabla\mathcal{L}(y_t)\|^2}\sqrt{\sum_{t=1}^T \beta_t\|g_t\|^2\eta_t^2}\right] \le cB\,\mathbb{E}\left[\sqrt{\sum_{t=1}^T \beta_t\|\nabla\mathcal{L}(y_t)\|^2\log(1 + G^2T^3)}\right].$$
And combining everything yields
$$\mathbb{E}[-\alpha_{1:T}\mathcal{L}(x^\star)] \le \mathbb{E}\left[R_T(x^\star) + cB\sqrt{\sum_{t=1}^T \beta_t\|\nabla\mathcal{L}(y_t)\|^2\log(1 + G^2T^3)} + \sum_{t=1}^T \big(\alpha_{1:t-1}\mathcal{L}(y_{t-1}) - \alpha_{1:t}\mathcal{L}(y_t)\big)\right].$$
Telescope the sum and rearrange to prove the theorem. $\square$

## 5. Conclusion

We have provided a variant of the standard online-to-batch conversion technique that enables us to compute gradients at the iterates produced by the conversion algorithm rather than at those produced by the online learning algorithm. This stabilizes the sequence of iterates and enables low regret even with respect to arbitrary polynomial weights. We show how to apply our approach to easily remove the log factors in stochastic strongly-convex optimization. Further, for smooth losses, we gain stability in the gradients, which can be exploited by optimistic online algorithms. Finally, a small modification allows us to achieve the optimal stochastic accelerated rates. Not only is this the first method to adapt to both variance and smoothness optimally, it is also more general than prior analyses by virtue of being a black-box reduction from any sufficiently adaptive online learning algorithm. Finally, a recent connection between optimism and acceleration by Wang & Abernethy (2018) suggests that it may be possible to improve our optimistic analysis to match the accelerated rate in an even simpler manner.

## References

Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

Bubeck, S. et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Conference on Learning Theory, 2012.

Cutkosky, A. and Busa-Fekete, R. Distributed stochastic optimization via adaptive SGD. In Advances in Neural Information Processing Systems, pp. 1914-1923, 2018.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. arXiv preprint arXiv:1802.06293, 2018.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.

Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2-3):165-188, 2010.

Hazan, E. and Kale, S. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489-2512, 2014.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.
Hazan, E., Rakhlin, A., and Bartlett, P. L. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pp. 65-72, 2008.

Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365-397, 2012.

Levy, K. Online to offline conversions, universality and adaptive minibatch sizes. In Advances in Neural Information Processing Systems, pp. 1613-1622, 2017.

Levy, Y. K., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, pp. 6501-6510, 2018.

McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. arXiv preprint arXiv:1403.3465, 2014.

McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.

Mohri, M. and Yang, S. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics, pp. 848-856, 2016.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pp. 993-1019, 2013. URL http://jmlr.org/proceedings/papers/v30/Rakhlin13.html.

Rakhlin, A., Shamir, O., Sridharan, K., et al. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, volume 12, pp. 1571-1578. Citeseer, 2012.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pp. 2199-2207, 2010.

Wang, J.-K. and Abernethy, J. D. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems, pp. 3828-3838, 2018.