# Temporal Variability in Implicit Online Learning

Nicolò Campolongo (Università di Milano, nicolo.campolongo@unimi.it)* and Francesco Orabona (Boston University, francesco@orabona.com)

*Work done while visiting the OPTIMAL Lab at Boston University.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

In the setting of online learning, Implicit algorithms turn out to be highly successful from a practical standpoint. However, the tightest regret analyses only show marginal improvements over Online Mirror Descent. In this work, we shed light on this behavior by carrying out a careful regret analysis. We prove a novel static regret bound that depends on the temporal variability of the sequence of loss functions, a quantity which is often encountered when considering dynamic competitors. We show, for example, that the regret can be constant if the temporal variability is constant and the learning rate is tuned appropriately, without the need for smooth losses. Moreover, we present an adaptive algorithm that achieves this regret bound without prior knowledge of the temporal variability, and we prove a matching lower bound. Finally, we validate our theoretical findings on classification and regression datasets.

1 Introduction

The online learning paradigm is a powerful tool to model common scenarios in the real world where the data comes in a streaming fashion, for example in the case of time series. In the last two decades there has been a tremendous amount of progress in this field (see, e.g., [30, 13, 24] for an introduction), which also led to advances in seemingly unrelated areas of machine learning and computer science. In this setting, a learning agent faces the environment in a game played sequentially. The protocol is the following: given a time horizon $T$, in every round $t = 1, \dots, T$ the agent chooses a model $x_t$ from a convex set $V$. Then, a convex loss function $\ell_t$ is revealed by the environment and the agent pays a loss $\ell_t(x_t)$. As usual in this setting, we do not make assumptions about the environment, but allow it to be adversarial. The agent's goal is to minimize her regret, i.e., the cumulative sum of her losses compared to the losses of a strategy that always commits to the same choice $u$. So, formally, the regret against any $u \in V$ is defined as
$$R_T(u) \triangleq \sum_{t=1}^{T} \ell_t(x_t) - \sum_{t=1}^{T} \ell_t(u)\,.$$

Much of the progress in this field is driven by the strictly related model of Online Linear Optimization (OLO): exploiting the assumption that the loss functions are convex, we can linearize them using a first-order approximation through a (sub)gradient and subsequently minimize the linearized regret. For example, the well-known Online Gradient Descent (OGD) [38] simply uses the direction of the negative (sub)gradient of the loss function, multiplied by a given learning rate, to update its model. Usually, a properly tuned learning rate gives a regret bound of $O(\sqrt{T})$, which is also optimal.

On the other hand, we can choose not to use any approximation of the loss function and instead update our model using directly the loss function rather than its subgradient [17]. This type of update is known as Implicit, and algorithms designed in this way are known to have practical advantages [18]. Unfortunately, their theoretical understanding is still limited at this point.

Our first contribution (Section 5) in this paper is a refined analysis of Implicit algorithms in the framework of Online Mirror Descent (OMD).
Doing this allows us to understand why Implicit algorithms might work better in practice compared to algorithms which use (sub)gradients in the update. In particular, we describe how these algorithms can potentially incur only a constant regret if the sequence of loss functions does not vary with time. More precisely, we measure the hardness of the sequence of loss functions with its temporal variability, which is defined as
$$V_T \triangleq \sum_{t=2}^{T} \max_{x \in V}\ \big(\ell_t(x) - \ell_{t-1}(x)\big)\,. \tag{1}$$
Our second contribution (Section 6) is a new adaptive Implicit algorithm, AdaImplicit, which retains the worst-case $O(\sqrt{T})$ regret bound but takes advantage of a slowly varying sequence of loss functions and achieves a regret of $O(V_T + 1)$. Also, we prove a lower bound which shows that our algorithm is optimal. Finally, in order to show the benefits of using Implicit algorithms in practice, in Section 7 we conduct an empirical analysis on real-world datasets in both classification and regression tasks.

2 Related Work

Implicit Updates. Implicit updates in online learning were proposed for the first time by Kivinen and Warmuth [17]. However, such an update with the Euclidean divergence is the proximal update from the optimization literature, dating back at least to 1965 [22, 19, 29, 27], and more recently used even in the stochastic setting [33, 2]. Later, this idea was re-invented by Crammer et al. [11] for the specific case of linear prediction with losses that have a range of values in which they are zero, e.g., the hinge loss and the epsilon-insensitive loss. Implicit updates were also used for online learning with kernels [9] and to deal with importance weights [16]. Kulis and Bartlett [18] provide the first regret bounds for implicit updates that match those of OMD, while McMahan [20] makes the first attempt to quantify the advantage of the implicit updates in the regret bound. Finally, Song et al. [31] generalize the results in McMahan [20] to Bregman divergences and strongly convex functions, and quantify the gain differently in the regret bound. Note that in [20, 31] the gain cannot be exactly quantified: it is just a non-negative data-dependent quantity subtracted from the regret bound.

Adaptivity. Our new analysis hinges on the concept of temporal variability $V_T$ of the losses, a quantity first defined in Besbes et al. [5] in the context of non-stationary stochastic optimization and later generalized in Chen et al. [8]. In general, the temporal variability has been used in works considering dynamic environments [e.g., 15, 37, 3, 36]. In particular, Jadbabaie et al. [15] consider different notions of adaptivity at the same time: if we consider the static regret case with no optimistic updates, their bound gives $R_T = O\big(\sqrt{\sum_{t=1}^{T}\|g_t\|^2} + 1\big)$, which is never better than ours. At first sight, our algorithm seems to achieve the same constant regret bound of Optimistic algorithms [10, 28] if the sequence of loss functions is such that $V_T = O(1)$. However, for this result Optimistic algorithms need either smooth or linear loss functions. In contrast, our algorithm does not need this assumption. Other examples of adaptivity to the sequence of loss functions can be found in [14, 32], which consider bounds in terms of the variance of the sequence of linear losses. Finally, it is worth mentioning that recently there have been attempts to analyze Implicit algorithms in dynamic environments [see, e.g., 12, 1, 7].
Nevertheless, these works are not directly comparable to ours since they either consider a different (noisy) setting and competitor or make stronger assumptions (i.e., smoothness and/or strong convexity of the loss functions).

3 Definitions

For a function $f : \mathbb{R}^d \to (-\infty, +\infty]$, we define a subgradient of $f$ in $x \in \mathbb{R}^d$ as a vector $g \in \mathbb{R}^d$ that satisfies $f(y) \ge f(x) + \langle g, y - x\rangle$ for all $y \in \mathbb{R}^d$. We denote the set of subgradients of $f$ in $x$ by $\partial f(x)$. The indicator function of the set $V$, $i_V : \mathbb{R}^d \to (-\infty, +\infty]$, is defined as $i_V(x) = 0$ if $x \in V$ and $+\infty$ otherwise. We denote the dual norm of $\|\cdot\|$ by $\|\cdot\|_\star$. A proper function $f : \mathbb{R}^d \to (-\infty, +\infty]$ is $\mu$-strongly convex over a convex set $V \subseteq \operatorname{int}\operatorname{dom} f$ w.r.t. $\|\cdot\|$ if, for all $x, y \in V$ and $g \in \partial f(x)$, we have
$$f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\mu}{2}\|x - y\|^2\,.$$
Let $\psi : X \to \mathbb{R}$ be strictly convex and continuously differentiable on $\operatorname{int} X$. The Bregman divergence w.r.t. $\psi$ is $B_\psi : X \times \operatorname{int} X \to \mathbb{R}_+$ defined as $B_\psi(x, y) = \psi(x) - \psi(y) - \langle\nabla\psi(y), x - y\rangle$. We assume that $\psi$ is strongly convex w.r.t. a norm $\|\cdot\|$ in $\operatorname{int} X$. We also assume w.l.o.g. the strong convexity constant to be 1, which implies
$$B_\psi(x, y) \ge \frac{1}{2}\|x - y\|^2, \quad \forall x \in X,\ y \in \operatorname{int} X\,. \tag{2}$$

4 Online Mirror Descent with Implicit Updates

In this section, we introduce the Implicit Online Mirror Descent (IOMD) algorithm, its relationship with OMD, and some of its properties. Consider a set $V \subseteq X \subseteq \mathbb{R}^d$. The Online Mirror Descent [35, 4] update over $V$ is
$$x_{t+1} = \operatorname*{arg\,min}_{x \in V}\ B_\psi(x, x_t) + \eta_t\big(\ell_t(x_t) + \langle g_t, x - x_t\rangle\big) = \operatorname*{arg\,min}_{x \in V}\ B_\psi(x, x_t) + \eta_t\langle g_t, x\rangle\,,$$
for $g_t \in \partial\ell_t(x_t)$ received as feedback. In words, OMD updates the solution minimizing a first-order approximation of the received loss $\ell_t$ around the predicted point $x_t$, constrained to be not too far from the predicted point as measured with the Bregman divergence. It is well known [e.g., 24] that the regret guarantee for OMD with a non-increasing sequence of learning rates $(\eta_t)_{t=1}^{T}$ is
$$R_T(u) \le \sum_{t=1}^{T}\frac{B_\psi(u, x_t) - B_\psi(u, x_{t+1})}{\eta_t} + \sum_{t=1}^{T}\frac{\eta_t}{2}\|g_t\|_\star^2, \quad \forall u \in V\,. \tag{3}$$
This gives an $O(\sqrt{T})$ regret with, e.g., $\max_{x,y \in V} B_\psi(x, y) < \infty$, Lipschitz losses, and $\eta_t \propto 1/\sqrt{t}$.

A natural variation of the classic OMD update is to use the actual loss function $\ell_t$, rather than its first-order approximation. This is called the implicit update [17] and is defined as
$$x_{t+1} = \operatorname*{arg\,min}_{x \in V}\ B_\psi(x, x_t) + \eta_t\ell_t(x)\,. \tag{4}$$
Note that, in general, this update does not have a closed form, but for many interesting cases it is still possible to compute it efficiently. Notably, for $\psi = \frac{1}{2}\|\cdot\|_2^2$ and linear prediction with the square, absolute, and hinge loss, these updates can all be computed in closed form when $V = \mathbb{R}^d$ [see, e.g., 11, 18]. This update leads to the Implicit Online Mirror Descent (IOMD) algorithm in Algorithm 1.

Algorithm 1 Implicit Online Mirror Descent (IOMD)
Require: Non-empty closed convex set $V \subseteq X \subseteq \mathbb{R}^d$, $\psi : X \to \mathbb{R}$, $\eta_t > 0$, $x_1 \in V$
1: for $t = 1, \dots, T$ do
2:   Output $x_t \in V$
3:   Receive $\ell_t : \mathbb{R}^d \to \mathbb{R}$ and pay $\ell_t(x_t)$
4:   Update $x_{t+1} = \operatorname*{arg\,min}_{x \in V} B_\psi(x, x_t) + \eta_t\ell_t(x)$
5: end for

We next show how the update in Eq. (4) yields new interesting properties which are not shared with its non-implicit counterpart. Their proofs can be found in Appendix B.

Proposition 4.1. Let $x_{t+1}$ be defined as in Eq. (4). Then, for any $g_t \in \partial\ell_t(x_t)$, there exists $g'_t \in \partial\ell_t(x_{t+1})$ such that
$$\ell_t(x_t) - \ell_t(x_{t+1}) - B_\psi(x_{t+1}, x_t)/\eta_t \ge 0\,, \tag{5}$$
$$\langle \eta_t g'_t + \nabla\psi(x_{t+1}) - \nabla\psi(x_t),\ u - x_{t+1}\rangle \ge 0, \quad \forall u \in V\,, \tag{6}$$
$$\langle g'_t,\ x_{t+1} - x_t\rangle \ge \langle g_t,\ x_{t+1} - x_t\rangle\,. \tag{7}$$

The first property implies that, in contrast to OMD, the value of the loss function in $x_{t+1}$ is always smaller than or equal to its value in $x_t$. This means that, if $\ell_t = \ell$, the value $\ell(x_t)$ will be monotonically decreasing over time. The second property gives an alternative way to write the update rule expressed in Eq. (4).
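As an aside on computability: when none of the closed forms mentioned above applies, line 4 of Algorithm 1 is still a strongly convex problem that can be solved numerically with an off-the-shelf solver. The following is a minimal illustrative sketch for the Euclidean case $\psi = \frac{1}{2}\|\cdot\|_2^2$; the function name, the box-shaped feasible set, and the squared loss in the example are arbitrary choices made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def iomd_update(x_t, loss_t, eta_t, bounds=None):
    """One step of Algorithm 1 (line 4) with psi = 0.5 * ||.||_2^2:
        x_{t+1} = argmin_{x in V}  0.5 * ||x - x_t||^2 + eta_t * loss_t(x).
    When no closed form is known, the strongly convex objective can be
    minimized numerically; here `bounds` encodes a box-shaped set V.
    """
    objective = lambda x: 0.5 * np.sum((x - x_t) ** 2) + eta_t * loss_t(x)
    return minimize(objective, x0=x_t, method="L-BFGS-B", bounds=bounds).x

# Example: squared loss of a linear predictor, l_t(x) = 0.5 * (<a_t, x> - y_t)^2,
# over the box V = [-1, 1]^d (illustrative choices).
d = 5
a_t, y_t, x_t = np.ones(d), 0.3, np.zeros(d)
x_next = iomd_update(x_t, lambda x: 0.5 * (a_t @ x - y_t) ** 2, eta_t=0.5,
                     bounds=[(-1.0, 1.0)] * d)
```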
Returning to the second property: using $\psi(x) = \frac{1}{2}\|x\|_2^2$ and $V = \mathbb{R}^d$, the update can be rewritten as $x_{t+1} = x_t - \eta_t g'_t$, motivating the name "implicit". Using this fact in the last property,² we have that with L2 regularization the dual norm of $g'_t$ is smaller than the dual norm of $g_t$, i.e., $\|g'_t\|_2 \le \|g_t\|_2$.

²Eq. (7) is nothing else than the fact that subgradients are monotone operators.

Let's gain some additional intuition on the implicit updates. Consider the case of $V = \mathbb{R}^d$ and $\psi(x) = \frac{1}{2}\|x\|_2^2$. We have that $x_{t+1} = x_t - \eta_t g'_t$, where $g'_t \in \partial\ell_t(x_{t+1})$. Now, if $\ell_{t+1} \approx \ell_t$, we would be updating the algorithm approximately with the next subgradient. On the other hand, knowing future gradients is a safe way to have constant regret. Hence, we can expect IOMD to have low regret if the functions are slowly varying over time. In the next sections, we will see that this is indeed the case.

5 Two Regret Bounds for IOMD

In the following, we present a new regret guarantee for IOMD. First, we give a simple lemma that provides a bound on the cumulative losses paid after the updates (proof in Appendix B).

Lemma 5.1. Let $V \subseteq X \subseteq \mathbb{R}^d$ be a non-empty closed convex set. Let $B_\psi$ be the Bregman divergence w.r.t. $\psi : X \to \mathbb{R}$. Then, Algorithm 1 guarantees, for all $u \in V$,
$$\sum_{t=1}^{T}\ell_t(x_{t+1}) \le \sum_{t=1}^{T}\ell_t(u) + \sum_{t=1}^{T}\frac{B_\psi(u, x_t) - B_\psi(u, x_{t+1})}{\eta_t} - \sum_{t=1}^{T}\frac{1}{\eta_t}B_\psi(x_{t+1}, x_t)\,. \tag{8}$$
Furthermore, assume that $(\eta_t)_{t=1}^{T}$ is a non-increasing sequence and let $D^2 \ge \max_{x, u \in V} B_\psi(u, x)$. Then the bound can further be expressed as
$$\sum_{t=1}^{T}\ell_t(x_{t+1}) - \sum_{t=1}^{T}\ell_t(u) \le \frac{D^2}{\eta_T} - \sum_{t=1}^{T}\frac{1}{\eta_t}B_\psi(x_{t+1}, x_t)\,. \tag{9}$$

Adding $\sum_{t=1}^{T}\ell_t(x_t)$ to both sides of Eq. (8) and rearranging, we immediately get our new regret bound.

Theorem 5.2. Under the assumptions of Lemma 5.1, the regret incurred by Algorithm 1 is bounded as
$$R_T(u) \le \sum_{t=1}^{T}\frac{B_\psi(u, x_t) - B_\psi(u, x_{t+1})}{\eta_t} + \sum_{t=1}^{T}\left(\ell_t(x_t) - \ell_t(x_{t+1}) - \frac{B_\psi(x_{t+1}, x_t)}{\eta_t}\right)\,. \tag{10}$$

We note that this result could also be extrapolated from [31], by carefully going through the proof of their Lemma 1. However, as in other previous works, they did not identify that the key quantity to be used in order to quantify an actual gain is the temporal variability $V_T$, as we will show later.

First Regret: Recovering OMD's Guarantee. To this point, the advantages of an implicit update are still not clear. Therefore, we now show how, from Theorem 5.2, one can get a possibly tighter bound than the usual $O(\sqrt{T})$. The key point in this new analysis is to introduce $g'_t$ as defined in Proposition 4.1 and relate it to the Bregman divergence between $x_t$ and $x_{t+1}$.

Theorem 5.3. Let $g'_t \in \partial\ell_t(x_{t+1})$ satisfy Eq. (6). Assume $\psi$ to be 1-strongly convex w.r.t. $\|\cdot\|$. Then, under the assumptions of Lemma 5.1, Algorithm 1 satisfies
$$\ell_t(x_t) - \ell_t(x_{t+1}) - \frac{B_\psi(x_{t+1}, x_t)}{\eta_t} \le \eta_t\,\|g_t\|_\star\,\min\left\{2\|g'_t\|_\star,\ \frac{\|g_t\|_\star}{2}\right\}, \quad \forall t,\ g_t \in \partial\ell_t(x_t)\,. \tag{11}$$

Proof. Using the convexity of the losses, we can bound the difference between $\ell_t(x_t)$ and $\ell_t(x_{t+1})$:
$$\ell_t(x_t) - \ell_t(x_{t+1}) \le \langle g_t, x_t - x_{t+1}\rangle \le \|g_t\|_\star\,\|x_t - x_{t+1}\|\,,$$
where $g_t \in \partial\ell_t(x_t)$. Given that $\psi$ is 1-strongly convex, we can use Eq. (2) to obtain
$$\ell_t(x_t) - \ell_t(x_{t+1}) \le \|g_t\|_\star\sqrt{2 B_\psi(x_{t+1}, x_t)}\,. \tag{12}$$
Note that $\ell_t(x_t) - \ell_t(x_{t+1}) - B_\psi(x_{t+1}, x_t)/\eta_t \le \ell_t(x_t) - \ell_t(x_{t+1})$. Hence, to get the first term in the min of Eq. (11), we can simply look for an upper bound on the term $\sqrt{2 B_\psi(x_{t+1}, x_t)}$ in Eq. (12) above. Using the fact that the Bregman divergence is convex in its first argument, we get
$$B_\psi(x_{t+1}, x_t) \le \langle\nabla\psi(x_{t+1}) - \nabla\psi(x_t),\ x_{t+1} - x_t\rangle \le \eta_t\langle g'_t,\ x_t - x_{t+1}\rangle \le \eta_t\|g'_t\|_\star\,\|x_{t+1} - x_t\| \le \eta_t\|g'_t\|_\star\sqrt{2 B_\psi(x_{t+1}, x_t)}\,,$$
where we used Eq. (6) in the second inequality and Eq. (2) in the last one. Solving this inequality with respect to $B_\psi(x_{t+1}, x_t)$, we get $\sqrt{2 B_\psi(x_{t+1}, x_t)} \le 2\eta_t\|g'_t\|_\star$. For the second term, it suffices to subtract $B_\psi(x_{t+1}, x_t)/\eta_t$ from both sides of Eq. (12) and use the fact that $b\sqrt{2x} - x/a \le a b^2/2$ for all $x \ge 0$, with $x = B_\psi(x_{t+1}, x_t)$, $a = \eta_t$, and $b = \|g_t\|_\star$.
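As a quick numerical sanity check of the quantities in Proposition 4.1 and Theorem 5.3, the snippet below evaluates the explicit and implicit subgradients on a one-dimensional quadratic loss, for which the implicit update of Eq. (4) has a closed form; the specific loss and step size are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative setup: l_t(x) = 0.5 * (x - y)^2, psi = 0.5 * x^2, V = R.
y, x_t, eta = 3.0, 0.0, 0.1
loss = lambda x: 0.5 * (x - y) ** 2

g_t = x_t - y                           # explicit subgradient at x_t
x_next = (x_t + eta * y) / (1.0 + eta)  # closed form of the implicit update (4)
g_impl = x_next - y                     # g'_t, subgradient at x_{t+1}

bregman = 0.5 * (x_next - x_t) ** 2
delta = loss(x_t) - loss(x_next) - bregman / eta

assert delta >= 0.0                                # Eq. (5): the loss decreases
assert abs(g_impl) <= abs(g_t)                     # Euclidean consequence of Eq. (7)
assert delta <= eta * abs(g_t) * min(2 * abs(g_impl), abs(g_t) / 2)  # Eq. (11)
print(f"g_t={g_t:.3f}, g'_t={g_impl:.3f}, delta={delta:.3f}")
```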
Theorem 5.3 immediately gives us that Algorithm 1 has regret upper bounded by
$$R_T(u) \le \sum_{t=1}^{T}\frac{B_\psi(u, x_t) - B_\psi(u, x_{t+1})}{\eta_t} + \sum_{t=1}^{T}\eta_t\,\|g_t\|_\star\,\min\left\{2\|g'_t\|_\star,\ \frac{\|g_t\|_\star}{2}\right\}\,,$$
where $g_t \in \partial\ell_t(x_t)$. The presence of the minimum makes this bound equivalent in a worst-case sense to the one of OMD in Eq. (3). Moreover, at least in the Euclidean case, from Eq. (7) we have that $\|g'_t\|_2 \le \|g_t\|_2$. However, it is difficult to quantify the gain over OMD because in general $g_t$ and $g'_t$ are data-dependent. Hence, as in the other previous analyses, the gain over OMD would be only marginal and not quantifiable. This is not a limit of our analysis: it is easy to realize that in the worst case the OMD update and the IOMD update can coincide. To show instead that a real gain is possible, we are now going to take a different path.

Second Regret: Temporal Variability in IOMD. Here we formalize our key intuition that IOMD is using an approximation of the future subgradient when the losses do not vary much over time. We use the notion of temporal variability of the losses, $V_T$, as given in Eq. (1). Considering again our regret bound in Theorem 5.2 and using $\eta_t = \eta$ for all $t$, we immediately have
$$R_T(u) \le \frac{B_\psi(u, x_1)}{\eta} + \sum_{t=1}^{T}\left(\ell_t(x_t) - \ell_t(x_{t+1}) - \frac{B_\psi(x_{t+1}, x_t)}{\eta}\right) \le \frac{B_\psi(u, x_1)}{\eta} + \ell_1(x_1) - \ell_T(x_{T+1}) + \sum_{t=2}^{T}\max_{x \in V}\big(\ell_t(x) - \ell_{t-1}(x)\big) \le \frac{B_\psi(u, x_1)}{\eta} + \ell_1(x_1) - \ell_T(x_{T+1}) + V_T\,.$$
This means that using a constant learning rate yields a regret bound of $O(V_T + 1)$, which might be better than $O(\sqrt{T})$ if the temporal variability is low. In particular, we can even get constant regret if $V_T = O(1)$. On the contrary, OMD cannot achieve a constant regret on all convex losses even if $V_T = 0$, since it would imply an impossible $O(1/T)$ rate for non-smooth batch black-box optimization [23, Theorem 3.2.1]. Instead, IOMD does not violate the lower bound since it is not a black-box method. As far as we know, the connection between IOMD and temporal variability has never been observed before. On the other hand, even when the temporal variability is high, we can still use an $O(1/\sqrt{T})$ learning rate to achieve a worst-case regret of the order $O(\sqrt{T})$.

We would like to point out that a similar behaviour arises from the Follow The Regularized Leader (FTRL) algorithm employed with full losses, rather than linearized ones. We show a detailed derivation in Appendix E. Unfortunately, contrarily to the OMD case, employing FTRL would entail solving a constrained convex optimization problem whose size (in terms of number of functions) grows at each step, which would result in a high running time even when the implicit updates have closed-form expressions, e.g., linear classification with the hinge loss.

Finally, a natural question arises: can we get a bound which interpolates between $O(V_T + 1)$ and $O(\sqrt{T})$, without any prior knowledge of the quantity $V_T$? We give a positive answer to this question by presenting an adaptive strategy in the next section.

6 Adapting to the Temporal Variability with AdaImplicit

In this section, we present an adaptive strategy to set the learning rates, in order to give a regret guarantee that depends optimally on the temporal variability. From the previous section, we saw that the key quantity in the IOMD regret bound is
$$\delta_t \triangleq \ell_t(x_t) - \ell_t(x_{t+1}) - \frac{B_\psi(x_{t+1}, x_t)}{\eta_t}\,.$$

Algorithm 2 AdaImplicit
Require: Non-empty closed convex set $V \subseteq X \subseteq \mathbb{R}^d$, $\psi : X \to \mathbb{R}$, $\lambda_1 = 0$, $\beta^2 > 0$, $x_1 \in V$
1: for $t = 1, \dots, T$ do
2:   Output $x_t \in V$
3:   Receive $\ell_t : \mathbb{R}^d \to \mathbb{R}$ and pay $\ell_t(x_t)$
4:   Update $x_{t+1} = \operatorname*{arg\,min}_{x \in V} \ell_t(x) + \lambda_t B_\psi(x, x_t)$
5:   Set $\delta_t = \ell_t(x_t) - \ell_t(x_{t+1}) - \lambda_t B_\psi(x_{t+1}, x_t)$
6:   Update $\lambda_{t+1} = \lambda_t + \frac{1}{\beta^2}\delta_t$
7: end for

From Eq. (5), we have that $\delta_t \ge 0$.
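To make the pseudocode concrete, the following is a minimal sketch of Algorithm 2 for the one-dimensional quadratic losses used in the synthetic experiment of Section 7, where line 4 has a closed form; the feasible set and the value $\beta = 1$ follow that experiment, while the variable names are our own illustrative choices (Theorem 6.2 below would instead suggest $\beta = D$).

```python
import numpy as np

def ada_implicit_quadratic(y, radius=75.0, beta=1.0):
    """Sketch of Algorithm 2 (AdaImplicit) for the losses
    l_t(x) = 0.25 * (x - y_t)^2 over V = [-radius, radius], with psi = 0.5 * x^2.
    Line 4 has the closed form
        argmin_x  0.25 * (x - y_t)^2 + lambda_t * 0.5 * (x - x_t)^2
          = clip((y_t + 2 * lambda_t * x_t) / (1 + 2 * lambda_t), -radius, radius),
    since in one dimension the constrained minimizer of a convex quadratic is the
    clipped unconstrained minimizer.  Returns the per-round losses l_t(x_t).
    """
    x, lam = 0.0, 0.0                    # x_1 in V, lambda_1 = 0
    losses = []
    for y_t in y:
        losses.append(0.25 * (x - y_t) ** 2)                     # pay l_t(x_t)
        x_new = (y_t + 2.0 * lam * x) / (1.0 + 2.0 * lam)         # line 4, closed form
        x_new = float(np.clip(x_new, -radius, radius))            # projection onto V
        delta = (0.25 * (x - y_t) ** 2 - 0.25 * (x_new - y_t) ** 2
                 - lam * 0.5 * (x_new - x) ** 2)                  # line 5
        lam += delta / beta ** 2                                  # line 6
        x = x_new
    return np.array(losses)

# Slowly varying targets, as in the synthetic experiment of Section 7.
T = 2000
y = 100.0 * np.sin(np.pi * np.arange(1, T + 1) / (10 * T))
print("cumulative loss:", ada_implicit_quadratic(y).sum())
```

Note that with $\lambda_1 = 0$ the first update simply jumps to the minimizer of $\ell_1$ over $V$, which is why $\delta_1 = \ell_1(x_1) - \ell_1(x_2) \ge 0$.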
At this point, one might think of using a doubling trick: monitor $\sum_{i=1}^{t}\delta_i$ over time and restart the algorithm with a different learning rate once it exceeds a certain threshold. In Appendix A, we show that it is indeed possible to use such a strategy. However, while theoretically effective, we cannot expect the doubling trick to have any decent performance in practice. Consequently, we are going to show how to use instead an adaptive learning rate.

AdaImplicit. Define $D^2 \triangleq \max_{x, u \in V} B_\psi(u, x)$ and assume $D < \infty$. For ease of notation, we let $\eta_t = 1/\lambda_t$, where $\lambda_t$ will be decided in the following. Assuming $(\lambda_t)_{t=1}^{T}$ to be an increasing sequence, from Theorem 5.2 we get
$$R_T(u) \le D^2\lambda_T + \sum_{t=1}^{T}\big[\ell_t(x_t) - \ell_t(x_{t+1}) - \lambda_t B_\psi(x_{t+1}, x_t)\big]\,. \tag{15}$$
Ideally, to minimize the regret we would like $\lambda_T$ to be as close as possible to the sum over time on the r.h.s. of this expression. However, setting $\lambda_t \propto \sum_{i=1}^{t}\delta_i$ would introduce an annoying recurrence in the computation of $\lambda_t$. To solve this issue, we explore the same strategy adopted in AdaFTRL [25], adapting it to the OMD case: we set $\lambda_{t+1} = \frac{1}{\beta^2}\sum_{i=1}^{t}\delta_i$, for a parameter $\beta$ to be defined later, and $\lambda_1 = 0$. We call the resulting algorithm AdaImplicit and describe it in Algorithm 2. Before proving a regret bound for it, we first provide a technical lemma for the analysis. This lemma can be found in [24, 26] and for completeness we give a proof in Appendix B.

Lemma 6.1. Let $\{a_t\}_{t=1}^{\infty}$ be any sequence of non-negative real numbers. Suppose that $\{\Delta_t\}_{t=1}^{\infty}$ is a sequence of non-negative real numbers satisfying $\Delta_1 = 0$ and³
$$\Delta_{t+1} \le \Delta_t + \min\left\{b\,a_t,\ \frac{c\,a_t^2}{2\Delta_t}\right\}, \quad \forall t \ge 1\,.$$
Then, for any $T \ge 0$, $\Delta_{T+1} \le \sqrt{(b^2 + c)\sum_{t=1}^{T}a_t^2}$.

³With a small abuse of notation, let $\min(x, y/0) = x$.

We are now ready to prove a regret bound for Algorithm 2.

Theorem 6.2. Let $V \subseteq X \subseteq \mathbb{R}^d$ be a non-empty closed convex set. Let $B_\psi$ be the Bregman divergence w.r.t. $\psi : X \to \mathbb{R}$ and let $D^2 = \max_{x, u \in V} B_\psi(u, x)$. Assume $\psi$ to be 1-strongly convex with respect to $\|\cdot\|$ in $V$. Then, for any $u \in V$, running Algorithm 2 with $\beta = D$ guarantees
$$R_T(u) \le \min\left\{2\big(\ell_1(x_1) - \ell_T(x_{T+1}) + V_T\big),\ 2\sqrt{3}\,D\sqrt{\sum_{t=1}^{T}\|g_t\|_\star^2}\right\}, \quad g_t \in \partial\ell_t(x_t)\,. \tag{16}$$

Proof. Using the definition of $\lambda_t$ and the fact that the sequence $(\lambda_t)_{t=1}^{T+1}$ is increasing over time, the regret in Eq. (15) can be upper bounded as $R_T(u) \le (D^2 + \beta^2)\lambda_{T+1}$. Therefore, we need an upper bound on $\lambda_{T+1}$. We split the proof in two parts, one for each term in the min in Eq. (16). For the first term, using the definition of $\lambda_t$ we have
$$\beta^2\lambda_{T+1} = \sum_{t=1}^{T}\big[\ell_t(x_t) - \ell_t(x_{t+1}) - \lambda_t B_\psi(x_{t+1}, x_t)\big] \le \ell_1(x_1) - \ell_T(x_{T+1}) + \sum_{t=2}^{T}\big[\ell_t(x_t) - \ell_{t-1}(x_t)\big] \le \ell_1(x_1) - \ell_T(x_{T+1}) + V_T\,,$$
from which, using $\beta = D$, the result follows. For the second term, from Theorem 5.3, for $t \ge 2$ we have $\delta_t \le \frac{\|g_t\|_\star^2}{2\lambda_t}$. On the other hand,
$$\delta_t = \ell_t(x_t) - \ell_t(x_{t+1}) - \lambda_t B_\psi(x_{t+1}, x_t) \le \ell_t(x_t) - \ell_t(x_{t+1}) \le \langle g_t, x_t - x_{t+1}\rangle \le \|g_t\|_\star\sqrt{2 B_\psi(x_{t+1}, x_t)} \le \sqrt{2}\,D\,\|g_t\|_\star\,,$$
where in the last steps we used Eq. (2) and the definition of $D$. Therefore, putting the last two results together we get
$$\delta_t \le \min\left\{\sqrt{2}\,D\,\|g_t\|_\star,\ \frac{\|g_t\|_\star^2}{2\lambda_t}\right\}, \quad g_t \in \partial\ell_t(x_t)\,.$$
Note that $\lambda_{t+1} = \lambda_t + \frac{1}{\beta^2}\delta_t$. Hence, $\lambda_1 = 0$, $\lambda_2 = (\ell_1(x_1) - \ell_1(x_2))/\beta^2 \le \sqrt{2}\,D\,\|g_1\|_\star/\beta^2$, and
$$\lambda_{t+1} = \lambda_t + \frac{1}{\beta^2}\delta_t \le \lambda_t + \frac{1}{\beta^2}\min\left\{\sqrt{2}\,D\,\|g_t\|_\star,\ \frac{\|g_t\|_\star^2}{2\lambda_t}\right\}\,.$$
Therefore, using Lemma 6.1 with $\Delta_t = \lambda_t$, $b = \frac{\sqrt{2}\,D}{\beta^2}$, $c = \frac{1}{\beta^2}$, and $a_t = \|g_t\|_\star$, we get
$$\lambda_{T+1} \le \sqrt{\left(\frac{2D^2}{\beta^4} + \frac{1}{\beta^2}\right)\sum_{t=1}^{T}\|g_t\|_\star^2}\,,$$
from which, setting $\beta = D$, we obtain the second term in the min in Eq. (16).

This last theorem shows that Algorithm 2 can have a low regret if the temporal variability of the losses $V_T$ is low. Moreover, differently from Optimistic algorithms, Algorithm 2 does not need additional assumptions on the losses (for example smoothness), as required for example in [15].

Lower Bound.
Next, we are going to prove a lower bound in terms of the temporal variability $V_T$, which shows that the regret bound in Theorem 6.2 cannot be improved further. The proof is a simple modification of the standard arguments used to prove lower bounds for constrained OLO and is reported in Appendix B.

Theorem 6.3. Let $d \ge 2$, $\|\cdot\|$ an arbitrary norm on $\mathbb{R}^d$, and $V = \{x \in \mathbb{R}^d : \|x\| \le D/2\}$. Let $A$ be a deterministic algorithm on $V$. Let $T$ be any non-negative integer. Then, for any $\bar{V}_T \ge 0$, there exists a sequence of convex loss functions $\ell_1(x), \dots, \ell_T(x)$ with temporal variability equal to $\bar{V}_T$ and $u \in V$ such that the regret of algorithm $A$ satisfies $R_T(u) \ge \bar{V}_T$.

7 Empirical Results

In this section, we compare the empirical performance of our algorithm AdaImplicit with standard baselines in online learning: OGD [38], OGD with adaptive learning rate $\eta_t = \beta/\sqrt{\sum_{i=1}^{t}\|g_i\|^2}$ (AdaOGD) [21], and IOMD with $\eta_t = \beta/\sqrt{t}$ (Implicit) [18].

Synthetic Experiment. We first show the benefits of AdaImplicit on a synthetic dataset. The loss functions are chosen to have a small temporal variability $V_T$. In particular, we consider a 1-d case using $\ell_t(x) = \frac{1}{4}(x - y_t)^2$ with $y_t = 100\sin\!\big(\frac{\pi t}{10 T}\big)$, a time horizon $T = 2000$, and the L2 ball of diameter $D = 150$. We set $\beta = 1$ in all algorithms. The update of the implicit algorithms can be computed in closed form: $x_{t+1} = x_t - \frac{\eta_t}{2 + \eta_t}(x_t - y_t)$. In Fig. 1 we show the cumulative loss $L_T = \sum_{t=1}^{T}\ell_t(x_t)$ of the algorithms (note that the y-axis is plotted in logarithmic scale). From the figure we can see that, contrarily to the other algorithms, the cumulative loss of AdaImplicit grows slowly over time, reflecting experimentally the bound given in Theorem 6.2. Also, even if not directly observable, OGD and IOMD basically incur the same total cumulative loss.

Figure 1: Synthetic experiment (slowly varying losses): cumulative loss as a function of $T$ for AdaImplicit, Implicit, AdaOGD, and OGD.

Real-world datasets. We are now going to show some experiments conducted on real data. Here, there is no reason to believe that the temporal variability is small. However, we still want to verify whether AdaImplicit can achieve a good worst-case performance. We consider both classification and regression tasks. Additional plots can be found in Appendix D. We used datasets from the LIBSVM library [6]. Before running the algorithms, we preprocess the data by dividing each feature by its maximum absolute value, so that all the values are in the range $[-1, 1]$, then we add a bias term. Details about the datasets can be found in Appendix D. Given that in the online setting we cannot tune the hyperparameter $\beta$ using hold-out data, we plot the average cumulative loss of each algorithm, i.e., $L_t/t = \frac{1}{t}\sum_{i=1}^{t}\ell_i(x_i)$, as a function of the hyperparameter $\beta$. This allows us to evaluate at the same time the sensitivity of the algorithms to $\beta$ and their best performance with oracle tuning. Note that in all the algorithms we consider the optimal worst-case setting of $\beta$ is proportional to the diameter of the feasible set, hence it is fair to plot their performance as a function of $\beta$.

Figure 2: Average cumulative loss as a function of $\beta$ for AdaImplicit, Implicit, AdaOGD, and OGD on classification tasks using the hinge loss (top) and regression tasks using the absolute loss (bottom); the regression panels include the abalone_scale and cpusmall_scale datasets.
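For concreteness, the following sketch shows a closed-form implicit update for linear classification with the hinge loss over $V = \mathbb{R}^d$, in the standard passive-aggressive form of Crammer et al. [11]; the exact updates used in our experiments are described in Appendix C, so this snippet should be read only as an illustration.

```python
import numpy as np

def implicit_hinge_step(x, a, label, eta):
    """Implicit (proximal) update for the hinge loss l_t(x) = max(0, 1 - label * <a, x>)
    with psi = 0.5 * ||.||_2^2 and V = R^d:
        x_{t+1} = argmin_x  0.5 * ||x - x_t||^2 + eta * l_t(x)
                = x_t + tau * label * a,   tau = min(eta, l_t(x_t) / ||a||^2),
    i.e., the passive-aggressive closed form of Crammer et al. [11].
    """
    loss = max(0.0, 1.0 - label * float(np.dot(a, x)))
    if loss == 0.0:
        return x                      # no update inside the zero-loss region
    tau = min(eta, loss / float(np.dot(a, a)))
    return x + tau * label * a
```

In this sketch, the Implicit baseline corresponds to calling the step with $\eta_t = \beta/\sqrt{t}$, while AdaImplicit replaces $\eta_t$ with $1/\lambda_t$ from Algorithm 2.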
We consider values of $\beta$ in $[2^{-20}, 2^{20}]$ on a grid of 41 points. Then, each algorithm is run 10 times and the results are averaged. For classification tasks we use the hinge loss, while for regression tasks we use the absolute loss. In both cases, we adopt the squared L2 function for $\psi$. The details about the implicit updates are discussed in Appendix C. Results are illustrated in Fig. 2. From the plots, we can see that when fine-tuned, all the algorithms achieve similar results, i.e., the minimum value of the average cumulative loss is very close for all the algorithms considered and there is not a clear winner. However, note that the range of values of $\beta$ for which an algorithm reaches the minimum is considerably wider for Implicit algorithms, which confirms their robustness to learning rate misspecification, as already investigated in other works [see, e.g., 33, 34]. This is a great advantage when considering online algorithms since, contrarily to the batch setting, algorithms cannot be fine-tuned in advance relying on training/validation sets.

8 Conclusions

In this paper, we investigated online Implicit algorithms from a theoretical perspective. Our analysis revealed interesting insights regarding the behavior of these algorithms and allowed us to design a new adaptive algorithm, which may take advantage of easy data. The obtained experimental results indicate that in real-world tasks (such as online classification with the hinge loss or online regression with the absolute loss), Implicit algorithms provide a better solution in terms of robustness, which is particularly relevant in online settings. Future directions include extending our analysis to a broader area, for example considering dynamic environments or strongly convex loss functions, to see if the same gains can be proved. Finally, other examples of easy data can be considered, such as the case of stochastic loss functions.

Broader Impact

We believe our investigation will foster further studies promoting the adoption of adaptive learning rates in online learning and beyond. Indeed, in recent years adaptive methods in optimization proved to be one of the preferred methods for training deep neural networks. On the other hand, this work confirms the robustness of implicit updates and opens up new possibilities in this field. From a societal aspect, this work is mainly theoretical and does not present any foreseeable consequence.

Acknowledgements

This material is based upon work supported by the National Science Foundation under grants no. 1925930 "Collaborative Research: TRIPODS Institute for Optimization and Learning" and no. 1908111 "AF: Small: Collaborative Research: New Representations for Learning Algorithms and Secure Computation". NC thanks Nicolò Cesa-Bianchi for supporting his visit to Boston University.

References

[1] A. Ajalloeian, A. Simonetto, and E. Dall'Anese. Inexact online proximal-gradient method for time-varying convex optimization. arXiv preprint arXiv:1910.02018, 2019.
[2] H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019.
[3] D. Baby and Y.-X. Wang. Online forecasting of total-variation-bounded sequences. In Advances in Neural Information Processing Systems, pages 11069–11079, 2019.
[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[5] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] N. Chen, G. Goel, and A. Wierman. Smoothed online convex optimization in high dimensions via online balanced descent. In Conference On Learning Theory, pages 1574–1594, 2018.
[8] X. Chen, Y. Wang, and Y.-X. Wang. Nonstationary stochastic optimization under L_{p,q}-variation measures. Operations Research, 67(6):1752–1765, 2019.
[9] L. Cheng, S. V. N. Vishwanathan, D. Schuurmans, S. Wang, and T. Caelli. Implicit online learning with kernels. In Advances in Neural Information Processing Systems 19, pages 249–256, 2007.
[10] C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Proc. of the Conference on Learning Theory (COLT), volume 23, pages 6.1–6.20, 2012.
[11] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
[12] R. Dixit, A. S. Bedi, R. Tripathi, and K. Rajawat. Online learning with inexact proximal online gradient descent algorithms. IEEE Transactions on Signal Processing, 67(5):1338–1352, 2019.
[13] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.
[14] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proc. of the 21st Conference on Learning Theory, 2008.
[15] A. Jadbabaie, A. Rakhlin, S. Shahrampour, and K. Sridharan. Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pages 398–406, 2015.
[16] N. Karampatziakis and J. Langford. Online importance weight aware updates. In Proc. of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI'11, pages 392–399, Arlington, Virginia, USA, 2011. AUAI Press.
[17] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, January 1997.
[18] B. Kulis and P. L. Bartlett. Implicit online learning. In International Conference on Machine Learning, pages 575–582, 2010.
[19] B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. Revue Française d'Informatique et de Recherche Opérationnelle, 4:154–158, 1970.
[20] H. B. McMahan. A unified view of regularized dual averaging and mirror descent with implicit updates. arXiv preprint arXiv:1009.3240, 2010.
[21] H. B. McMahan and M. J. Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.
[22] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.
[23] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2004.
[24] F. Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
[25] F. Orabona and D. Pál. Scale-free algorithms for online linear optimization. In International Conference on Algorithmic Learning Theory, pages 287–301. Springer, 2015.
[26] F. Orabona and D. Pál. Scale-free online learning. Theoretical Computer Science, 716:50–69, 2018. Special Issue on ALT 2015.
[27] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[28] A. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
[29] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[30] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.
[31] C. Song, J. Liu, H. Liu, Y. Jiang, and T. Zhang. Fully implicit online learning. arXiv preprint arXiv:1809.09350, 2018.
[32] J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proc. of the International Conference on Machine Learning (ICML), pages 1593–1601, 2014.
[33] P. Toulis and E. M. Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.
[34] P. Toulis, E. M. Airoldi, and J. Rennie. Statistical analysis of stochastic gradient methods for generalized linear models. In International Conference on Machine Learning, pages 667–675, 2014.
[35] M. K. Warmuth and A. K. Jagota. Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics, 1997.
[36] T. Yang, L. Zhang, R. Jin, and J. Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In International Conference on Machine Learning, pages 449–457, 2016.
[37] L. Zhang, T. Yang, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning, pages 5882–5891, 2018.
[38] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. of ICML, pages 928–936, 2003.