# Generalized Implicit Follow-The-Regularized-Leader

Keyi Chen¹, Francesco Orabona¹

¹Boston University, Boston, MA, USA. Correspondence to: Keyi Chen, Francesco Orabona.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We propose a new class of online learning algorithms, generalized implicit Follow-The-Regularized-Leader (FTRL), that expands the scope of the FTRL framework. Generalized implicit FTRL can recover known algorithms, such as FTRL with linearized losses and implicit FTRL, and it allows the design of new update rules, such as extensions of aProx and Mirror-Prox to FTRL. Our theory is constructive in the sense that it provides a simple unifying framework to design updates that directly improve the worst-case upper bound on the regret. The key idea is substituting the linearization of the losses with a Fenchel-Young inequality. We show the flexibility of the framework by proving that some known algorithms, like the Mirror-Prox updates, are instantiations of the generalized implicit FTRL. Finally, the new framework allows us to recover the temporal variation bound of implicit OMD, with the same computational complexity.

## 1. Introduction

Online learning is a setting where the learner receives an arbitrary sequence of loss functions, selects points before knowing the loss functions, and is evaluated on the values of the loss functions at the points it selects (Cesa-Bianchi & Lugosi, 2006; Orabona, 2019; Cesa-Bianchi & Orabona, 2021). More in detail, at round $t$ the learner outputs a point $x_t$ in a feasible set $V \subseteq \mathbb{R}^d$. Then, it receives a loss function $\ell_t : V \to \mathbb{R}$ and it pays the value $\ell_t(x_t)$. Given the arbitrary nature of the losses, the learner cannot guarantee a small cumulative loss, $\sum_{t=1}^T \ell_t(x_t)$. On the other hand, it is possible to minimize the regret, that is, the difference between the cumulative loss of the algorithm and that of any arbitrary comparator $u \in V$:
$$\text{Regret}_T(u) \triangleq \sum_{t=1}^T \ell_t(x_t) - \sum_{t=1}^T \ell_t(u).$$
In particular, a successful online learning algorithm must guarantee a regret that grows sublinearly in time for any $u \in V$. In this way, its average performance approaches that of the best comparator in hindsight.

There are two families of online learning algorithms: Online Mirror Descent (OMD) (Nemirovskij & Yudin, 1983; Warmuth & Jagota, 1997) and Follow-The-Regularized-Leader (FTRL) (Shalev-Shwartz, 2007; Abernethy et al., 2008; Hazan & Kale, 2008). They stem from two similar but complementary approaches: the update of OMD aims at minimizing a linearization of the current loss without going too far from its previous prediction $x_t$, while FTRL minimizes the sum of all the losses (or their linear approximations) plus a regularization term. In contrast to the first approaches in online learning, which focused on specific algorithms (e.g., the Winnow algorithm (Littlestone, 1988)), the theory of these two frameworks is particularly interesting because it allows both the design and the analysis of generic online learning algorithms. While FTRL and OMD provide similar bounds in most situations, they are not completely equivalent. For example, FTRL has an advantage over OMD in unbounded domains, where it allows the use of time-varying regularizers. In fact, OMD allows the use of time-varying stepsizes only in domains where its associated Bregman divergence is bounded.
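To make the protocol concrete, here is a minimal sketch of the interaction loop and of the regret computation against a fixed comparator. The `learner` interface, the representation of the losses, and all names below are ours and only illustrative; they are not part of the paper.

```python
def online_protocol(learner, losses, comparator):
    """Run the online learning protocol and return Regret_T(u) for a fixed comparator u.

    `learner` exposes predict() -> x_t and update(ell_t); `losses` is a list of loss
    functions ell_t; `comparator` is the fixed point u. All names are illustrative.
    """
    learner_loss, comparator_loss = 0.0, 0.0
    for ell_t in losses:
        x_t = learner.predict()            # output x_t before seeing ell_t
        learner_loss += ell_t(x_t)         # pay ell_t(x_t)
        comparator_loss += ell_t(comparator)
        learner.update(ell_t)              # the learner can now use any information in ell_t
    return learner_loss - comparator_loss  # Regret_T(u)
```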
On the other hand, in the cases where we can use time-varying stepsizes, OMD can achieve a superior adaptation to the gradients (see, e.g., Theorem 2 in Streeter & McMahan (2010) versus Theorem 2 in Orabona & Pál (2015)). In this view, these two frameworks are complementary.¹ Moreover, there exists another, orthogonal, axis: the use of the actual loss functions versus a linear surrogate, in both frameworks. We summarize all the variants of OMD and FTRL in Table 1.

¹See also the blog post on this topic by Tim van Erven at https://www.timvanerven.nl/blog/ftrl-vs-omd/.

Our motivation stems from the fact that, in practical cases, all the variants that use the full losses offer a big advantage in terms of empirical performance, at the cost of a higher computational complexity. On the theoretical side, the situation is not so clear, given that in the worst case using the full losses can be equivalent to using their linearized version, as should be clear considering linear losses. In particular, the standard theoretical framework for FTRL does not allow a clear analysis of the implicit case. Moreover, while for implicit OMD it has been proven that one can achieve lower regret if the temporal variation of the losses is small, it is unclear if the same guarantee can be achieved for FTRL without the computational cost of using the full losses.

Table 1. Summary of implicit and linearized updates for FTRL and OMD. (The Bregman divergence is defined as $B_\psi(x; y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y\rangle$. The superscript $\star$ denotes the Fenchel conjugate.)

| Algorithm | Update |
|---|---|
| OMD (Warmuth & Jagota, 1997) | $x_{t+1} = \operatorname{argmin}_{x \in V} B_\psi(x; x_t) + \eta_t(\ell_t(x_t) + \langle g_t, x - x_t\rangle)$ |
| Implicit OMD (Warmuth & Jagota, 1997) | $x_{t+1} = \operatorname{argmin}_{x \in V} B_\psi(x; x_t) + \eta_t \ell_t(x)$ |
| FTRL (linearized) (Abernethy et al., 2008) | $x_{t+1} = \operatorname{argmin}_{x \in V} \psi_{t+1}(x) + \sum_{i=1}^{t}(\ell_i(x_i) + \langle g_i, x - x_i\rangle)$ |
| FTRL (full losses) (McMahan, 2017) | $x_{t+1} = \operatorname{argmin}_{x \in V} \psi_{t+1}(x) + \sum_{i=1}^{t} \ell_i(x)$ |
| Implicit FTRL (McMahan, 2010) | $x_{t+1} = \operatorname{argmin}_{x \in V} \psi_{t+1}(x) + \ell_t(x) + \sum_{i=1}^{t-1}(\ell_i(x_i) + \langle g_i, x - x_i\rangle)$ |
| Generalized Implicit FTRL [This work] | $x_{t+1} = \operatorname{argmin}_{x \in V} \psi_{t+1}(x) + \sum_{i=1}^{t} \langle z_i, x\rangle$, with $z_i$ such that $\psi^\star_{i+1,V}\big(-\sum_{j=1}^{i} z_j\big) + \ell^\star_i(z_i) \le \psi^\star_{i+1,V}\big(-\sum_{j=1}^{i-1} z_j - g_i\big) + \ell^\star_i(g_i)$ |

In this paper, we aim at bridging this gap by proposing a generalized version of implicit FTRL. We go beyond implicit and linearized updates: we directly construct the update rule in a way that minimizes an upper bound on the regret. Our framework effectively expands the scope of the FTRL framework, fully retaining its coupling between design and analysis. Also, our updates come with a worst-case guarantee to never be worse than the standard linearized ones. We show the flexibility of our framework by recovering known update schemes, like the Mirror-Prox update (Nemirovski, 2004), and by extending updates specifically designed for OMD to the FTRL case, like the aProx one (Asi & Duchi, 2019). Moreover, for the first time, we show an implicit version of FTRL that recovers the temporal variation bound of implicit OMD (Campolongo & Orabona, 2020), but with the same computational complexity as implicit OMD.

**Related Work** While there are many works on implicit mirror descent in both the online and offline settings (see, e.g., Moreau, 1965; Martinet, 1970; Rockafellar, 1976; Kivinen & Warmuth, 1997; Parikh & Boyd, 2014; Campolongo & Orabona, 2020; Shtoff, 2022), the number of works that deal with implicit updates for FTRL is quite limited. We are only aware of McMahan (2010), which quantifies a gain only for specific regularizers.
However, the framework in McMahan (2010) is non-constructive, in the sense that it is difficult to see how to generalize implicit updates. Joulani et al. (2017) extend this last result, but they do not provide a link with the maximization of the dual function that governs the regret upper bound. The closest approach to our framework is the one of Shalev-Shwartz & Singer (2007a;b), who develop a theory of FTRL updates as maximization of a dual function. However, their framework is limited to a specific shape of regularizers and it does not deal with implicit updates.

For implicit OMD, Campolongo & Orabona (2020) showed that implicit updates give rise to regret guarantees that depend on the temporal variability of the losses, so that constant regret is achievable if the variability of the losses is zero. They suggest that FTRL with full losses can achieve the same guarantee, but they also point out that, given its computational complexity, it would not be worth pursuing. Here, we show how to achieve the same bound as implicit OMD with our generalized implicit FTRL, while retaining the same computational complexity as implicit OMD.

Proximal updates on truncated linear models were introduced in Asi & Duchi (2019) for the OMD algorithm. Chen et al. (2022b) used gradient flow on the same truncated linear models with a coin-betting algorithm (Orabona & Pál, 2016), but their approach does not seem to satisfy a regret guarantee. Chen et al. (2022a) have used truncated linear models in an FTRL-based parameter-free algorithm (Orabona & Pál, 2021) with a novel decomposition of the regret. However, their approach is ad-hoc and seems difficult to generalize.

## 2. Definitions and Basic Tools

We define here some basic concepts and tools of convex analysis; we refer the reader to, e.g., Rockafellar (1970); Bauschke & Combettes (2011) for a complete introduction to this topic. We will consider extended-real-valued functions that can assume infinite values too. A function $f$ is proper if it is nowhere $-\infty$ and finite somewhere. A function $f : V \subseteq \mathbb{R}^d \to [-\infty, +\infty]$ is closed if $\{x : f(x) \le \alpha\}$ is closed for every $\alpha \in \mathbb{R}$. For a proper function $f : \mathbb{R}^d \to (-\infty, +\infty]$, we define a subgradient of $f$ at $x \in \mathbb{R}^d$ as a vector $g \in \mathbb{R}^d$ that satisfies $f(y) \ge f(x) + \langle g, y - x\rangle$ for all $y \in \mathbb{R}^d$. We denote the set of subgradients of $f$ at $x$ by $\partial f(x)$. The indicator function of the set $V$, $i_V : \mathbb{R}^d \to (-\infty, +\infty]$, has value 0 for $x \in V$ and $+\infty$ otherwise. We denote the dual norm of a norm $\|\cdot\|$ by $\|\cdot\|_\star$. A proper function $f : \mathbb{R}^d \to (-\infty, +\infty]$ is $\mu$-strongly convex over a convex set $V \subseteq \operatorname{int} \operatorname{dom} f$ w.r.t. $\|\cdot\|$ if, for all $x, y \in V$ and $g \in \partial f(x)$, we have $f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\mu}{2}\|x - y\|^2$. A function $f : V \to \mathbb{R}$, differentiable in an open set containing $V$, is $L$-smooth w.r.t. $\|\cdot\|$ if $f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|x - y\|^2$ for all $x, y \in V$. For a function $f : \mathbb{R}^d \to [-\infty, +\infty]$, we define the Fenchel conjugate $f^\star : \mathbb{R}^d \to [-\infty, +\infty]$ as $f^\star(\theta) = \sup_{x \in \mathbb{R}^d} \langle\theta, x\rangle - f(x)$. From this definition, we immediately have the Fenchel-Young inequality: $f(x) + f^\star(\theta) \ge \langle\theta, x\rangle$ for all $x, \theta$. We will also make use of the following properties of Fenchel conjugates.

Theorem 2.1 ((Orabona, 2019, Theorem 5.7)). Let $f : \mathbb{R}^d \to (-\infty, +\infty]$ be proper. Then, the following conditions are equivalent: (a) $\theta \in \partial f(x)$; (b) $\langle\theta, y\rangle - f(y)$ achieves its supremum in $y$ at $y = x$; (c) $f(x) + f^\star(\theta) = \langle\theta, x\rangle$. Moreover, if $f$ is also convex and closed, we have the additional equivalent condition (d) $x \in \partial f^\star(\theta)$.

Theorem 2.2 ((Orabona, 2019, Theorem 6.11)). Let $\psi : \mathbb{R}^d \to (-\infty, +\infty]$ be a proper, closed, convex function with $\operatorname{dom} \psi$ non-empty. Then, $\psi$ is $\lambda > 0$ strongly convex w.r.t. $\|\cdot\|$ iff $\psi^\star$ is $\frac{1}{\lambda}$-smooth w.r.t. $\|\cdot\|_\star$ on $\mathbb{R}^d$.
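As a concrete instance of these tools (a standard example, not taken from the paper), consider the squared Euclidean regularizer used later in the experiments, $f(x) = \frac{\lambda}{2}\|x\|_2^2$:

```latex
f(x) = \frac{\lambda}{2}\|x\|_2^2
\quad\Longrightarrow\quad
f^\star(\theta)
  = \sup_{x \in \mathbb{R}^d} \Big( \langle \theta, x\rangle - \frac{\lambda}{2}\|x\|_2^2 \Big)
  = \frac{1}{2\lambda}\|\theta\|_2^2,
\qquad \text{attained at } x = \theta/\lambda .
```

Fenchel-Young then reads $\frac{\lambda}{2}\|x\|_2^2 + \frac{1}{2\lambda}\|\theta\|_2^2 \ge \langle\theta, x\rangle$, with equality iff $\theta = \lambda x = \nabla f(x)$, in agreement with Theorem 2.1. Moreover, $f$ is $\lambda$-strongly convex w.r.t. $\|\cdot\|_2$ and $f^\star$ is $\frac{1}{\lambda}$-smooth, matching Theorem 2.2.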
## 3. Generalized Implicit FTRL

In this section, we introduce our novel generalized formulation of the implicit FTRL algorithm. The main idea is to depart from the implicit or linearized updates, and to directly design updates that improve the upper bound on the regret.

More in detail, the basic analysis of most online learning algorithms is based on the definition of subgradients:
$$\ell_t(x_t) - \ell_t(u) \le \langle g_t, x_t - u\rangle, \quad \forall g_t \in \partial\ell_t(x_t). \tag{1}$$
This allows us to study the regret on the linearized losses as a proxy for the regret on the losses $\ell_t$. However, we can do better. We introduce a new, fundamental, and more general strategy: using the Fenchel-Young inequality, we have
$$\ell_t(x_t) - \ell_t(u) \le \ell_t(x_t) - \langle z_t, u\rangle + \ell_t^\star(z_t), \quad \forall z_t.$$
In particular, the algorithm will choose $z_t$ to make a certain upper bound involving this quantity tighter. This is a better inequality than (1) because, when we select $z_t = g_t \in \partial\ell_t(x_t)$, using Theorem 2.1 we recover (1). So, this inequality subsumes the standard one for subgradients, but, using $z_t \in \partial\ell_t(x_{t+1})$, it also subsumes the similar inequality used in the implicit case, as we show in Section 3.1. Moreover, we will see in Section 6 that it covers cases where $z_t$ is not a subgradient of $\ell_t$.

**Algorithm 1** Generalized Implicit FTRL
Require: Non-empty closed set $V \subseteq \mathbb{R}^d$, a sequence of regularizers $\psi_1, \dots, \psi_T : \mathbb{R}^d \to (-\infty, +\infty]$
1: $\theta_1 = 0$
2: for $t = 1$ to $T$ do
3: &nbsp;&nbsp; Output $x_t \in \operatorname{argmin}_{x \in V} \psi_t(x) - \langle\theta_t, x\rangle$
4: &nbsp;&nbsp; Receive $\ell_t : V \to \mathbb{R}$ and pay $\ell_t(x_t)$
5: &nbsp;&nbsp; Set $g_t \in \partial\ell_t(x_t)$
6: &nbsp;&nbsp; Set $z_t$ such that $H_t(z_t) \le H_t(g_t)$ or $\tilde{H}_t(z_t) \le \tilde{H}_t(g_t)$, where $H_t$ and $\tilde{H}_t$ are defined in (2) and (3)
7: &nbsp;&nbsp; Set $\theta_{t+1} = \theta_t - z_t$
8: end for

The analysis shows that the optimal setting of $z_t$ is the one that minimizes the function
$$H_t(z) \triangleq \psi^\star_{t+1,V}(\theta_t - z) + \ell^\star_t(z) \tag{2}$$
or
$$\tilde{H}_t(z) \triangleq \psi^\star_{t,V}(\theta_t - z) + \ell^\star_t(z), \tag{3}$$
where $\psi_{t,V}$ is the restriction of the regularizer used at time $t$ to the feasible set $V$, i.e., $\psi_{t,V} \triangleq \psi_t + i_V$. However, we can show that any setting of $z_t$ that guarantees $H_t(z_t) < H_t(g_t)$ (or $\tilde{H}_t(z_t) < \tilde{H}_t(g_t)$) guarantees a strict improvement in the worst-case regret w.r.t. using the linearized losses.

One might wonder why we need two different updates, using $H_t$ or $\tilde{H}_t$. The reason is that, when using time-varying regularizers that depend on the data, like in the FTRL version of AdaGrad (McMahan & Streeter, 2010; Duchi et al., 2011), if $\lambda_{t+1}$ depends on $z_t$ it might make the calculation of the update particularly difficult. This can be avoided using the update involving $\tilde{H}_t$.

Once we have the $z_t$, we treat them as the subgradients of surrogate linear losses. So, putting it all together, Algorithm 1 shows the final algorithm. We now show a regret guarantee for this algorithm. First, we state a general result and then instantiate it in a few interesting cases.

Theorem 3.1. Let $V \subseteq \mathbb{R}^d$ be closed and non-empty and $\psi_t : V \to \mathbb{R}$. With the notation in Algorithm 1, define $F_t(x) = \psi_t(x) + \sum_{i=1}^{t-1} \langle z_i, x\rangle$, so that $x_t \in \operatorname{argmin}_{x \in V} F_t(x)$. Finally, assume that $\operatorname{argmin}_{x \in V} F_t(x)$ and $\partial\ell_t(x_t)$ are not empty for all $t$. For any $z_t \in \mathbb{R}^d$ and any $u \in \mathbb{R}^d$, we have
$$\text{Regret}_T(u) \le \psi_{T+1}(u) - \min_{x \in V} \psi_1(x) + \sum_{t=1}^{T} \left[\psi^\star_{t+1,V}(\theta_t - g_t) - \psi^\star_{t,V}(\theta_t) + \langle x_t, g_t\rangle - \delta_t\right] + F_{T+1}(x_{T+1}) - F_{T+1}(u),$$
where $\delta_t \triangleq H_t(g_t) - H_t(z_t)$.
If $\psi_{t+1}(x) \ge \psi_t(x)$ for any $x \in V$, then, for any $z_t \in \mathbb{R}^d$, we have
$$\text{Regret}_T(u) \le \psi_{T+1}(u) - \min_{x \in V} \psi_1(x) + \sum_{t=1}^{T} \left[\psi^\star_{t,V}(\theta_t - g_t) - \psi^\star_{t,V}(\theta_t) + \langle x_t, g_t\rangle - \tilde{\delta}_t\right] + F_{T+1}(x_{T+1}) - F_{T+1}(u),$$
where $\tilde{\delta}_t \triangleq \tilde{H}_t(g_t) - \tilde{H}_t(z_t)$.

Proof. The proof is composed of simple but not obvious steps.
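To fix ideas, the following is a minimal sketch of Algorithm 1 in the unconstrained case $V = \mathbb{R}^d$ with $\psi_t(x) = \frac{\lambda_t}{2}\|x\|_2^2$, so that the FTRL prediction has the closed form $x_t = \theta_t/\lambda_t$. The routine `choose_z` is left abstract: any choice with $H_t(z_t) \le H_t(g_t)$ (or the analogous condition on $\tilde{H}_t$) is admissible, and returning $g_t$ recovers FTRL with linearized losses. All function names are ours.

```python
import numpy as np

def generalized_implicit_ftrl(losses, subgrads, choose_z, lambdas, d):
    """Sketch of Algorithm 1 with psi_t(x) = lambda_t/2 ||x||_2^2 and V = R^d.

    losses[t](x)   -> ell_t(x)
    subgrads[t](x) -> a subgradient g_t of ell_t at x
    choose_z(t, theta, x, g) -> any z_t with H_t(z_t) <= H_t(g_t); return g for linearized FTRL
    lambdas[t]     -> regularization weight lambda_t
    """
    theta = np.zeros(d)                       # theta_1 = 0
    total_loss = 0.0
    for t in range(len(losses)):
        x_t = theta / lambdas[t]              # x_t = argmin_x lambda_t/2 ||x||^2 - <theta_t, x>
        total_loss += losses[t](x_t)          # pay ell_t(x_t)
        g_t = subgrads[t](x_t)                # g_t in the subdifferential of ell_t at x_t
        z_t = choose_z(t, theta, x_t, g_t)    # any z_t improving the dual objective
        theta = theta - z_t                   # theta_{t+1} = theta_t - z_t
    return total_loss
```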
The first important observation is that the definition of xt in the algorithm corresponds exactly to the one of FTRL on the linear losses zt, . Hence, we can use the FTRL equality in Orabona (2019, Lemma 7.1): t=1 zt, u = t=1 [Ft(xt) Ft+1(xt+1)] + ψT +1(u) min x V ψ1(x) + FT +1(x T +1) FT +1(u), where we have cancelled out the terms zt, xt on both sides. Now, use Fenchel-Young inequality, to have zt, u ℓt(u) + ℓ t (zt). Hence, we have t=1 [Ft(xt) Ft+1(xt+1) + ℓ t (zt)] + ψT +1(u) min x V ψ1(x) + FT +1(x T +1) FT +1(u) . Observe that Ft(xt) = min x V ψt(x) + = max x V θt, x ψt(x) = ψ t,V (θt) . In the same way, we have Ft+1(xt+1) = ψ t+1,V (θt+1). Also, for any gt ℓt(xt), by Theorem 2.1 we have ℓ t (gt) = xt, gt ℓt(xt). Hence, each term in the sum can be written as Ft(xt) Ft+1(xt+1) + ℓ t (zt) = ψ t+1,V (θt+1) ψ t,V (θt) + ℓ t (zt) = Ht(zt) ψ t,V (θt) . Now, we just add and subtract Ht(gt) = ψ t+1,V (θt gt)+ gt, xt ℓt(xt) to obtain the stated bound. The second case is similar. We just have to observe that if ψt+1,V ψt,V , then ψ t+1,V ψ t,V . Hence, each term in the sum can be upper bounded as Ft(xt) Ft+1(xt+1) + ℓ t (zt) ψ t,V (θt+1) ψ t,V (θt) + ℓ t (zt) = H t(zt) ψ t,V (θt) . As before, adding and subtracting H t(gt) = ψ t,V (θt gt) + xt, gt ℓt(xt) gives the stated bound. The Theorem is stated with very weak assumption to show its generality, but it is immediate to obtain concrete regret guarantees just assuming, for example, strongly convex regularizers and convex and Lipschitz losses and using wellknown methods as Orabona (2019, Lemma 7.8) However, we can already understand why this is an interesting guarantee. Let s first consider the case that zt = gt. In this case, we exactly recover the linearized FTRL algorithm. Even the guarantee in the Theorem exactly recovers the best known one (Orabona, 2019, Corollary 7.9), with δt = 0 and δ t = 0. Now, if we set zt such that Ht(zt) < Ht(gt) or H t(zt) < H t(gt) we will have that δt > 0 or δ t > 0. Hence, in each single term of the sum we have a negative factor that makes the regret bound smaller. While it might be difficult to give a lower bound to δt and δ t without additional assumptions, the main value of this analysis is in giving a unifying way to design generalized implicit updates for FTRL. In fact, in the next sections we will show a number of possibilities that this framework enables. Next, we will gain more understanding on the updates in Algorithm 1, comparing them to implicit OMD. 3.1. Comparison with Implicit Online Mirror Descent In this section, we show that when zt is set to minimize Ht(z) or H t(z), we recover different variants of implicit updates. Assume that the ℓt are closed and convex. Also, assume that ψ t,V is differentiable, that is true, for example, when ψt is strongly convex by Theorem 2.2. Then, observe that by the first-order optimality condition and Theorem 2.1, we have zt = argmin z Ht(z) ψ t+1,V (θt zt) ℓ t (zt) zt ℓt( ψ t+1,V (θt zt)) = ℓt(xt+1) . (4) Hence, in this case, we have that the optimal zt is the gradient at the next point xt+1. This is exactly what happens in the implicit updates. This condition that depends on the gradient in the next iterate is similar to other implicit updates. Generalized Implicit Follow-The-Regularized-Leader Under the same assumptions, we also have zt = argmin z H t(z) ψ t,V (θt zt) ℓ t (zt) zt ℓt( ψ t,V (θt+1)) . 
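As an illustration of the optimality condition $z_t \in \partial\ell_t(x_{t+1})$ in (4), in the unconstrained quadratic case the minimizer of $H_t$ can sometimes be computed exactly. The sketch below is our example, with our names: it solves the fixed point for the square loss $\ell_t(x) = \frac{1}{2}(\langle s_t, x\rangle - y_t)^2$, where the condition is linear in $z_t$.

```python
import numpy as np

def z_star_square_loss(theta_t, s_t, y_t, lam_next):
    """Exact minimizer of H_t(z) when psi_{t+1}(x) = lam_next/2 ||x||_2^2, V = R^d,
    and ell_t(x) = 0.5 * (<s_t, x> - y_t)^2.

    The condition z_t = grad ell_t(x_{t+1}) with x_{t+1} = (theta_t - z_t)/lam_next
    is linear in z_t, so it can be solved in closed form.
    """
    x_tilde = theta_t / lam_next                         # prediction before the implicit correction
    residual = float(s_t @ x_tilde - y_t)
    c = lam_next * residual / (lam_next + float(s_t @ s_t))
    z_t = c * s_t                                        # z_t = grad ell_t(x_{t+1})
    x_next = (theta_t - z_t) / lam_next                  # the resulting x_{t+1}
    return z_t, x_next
```

One can check that the returned pair satisfies $z_t = (\langle s_t, x_{t+1}\rangle - y_t)\, s_t$, i.e., the gradient of the square loss at the next iterate, as prescribed by (4).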
(5) In this other case, the update also has an implicit flavor but the subgradient is queried on a point different from the next point, where the difference depends on how much ψ t,V differs from ψ t+1,V . Let s see this connection even more precisely, considering proximal updates. Hence, for simplicity, let s consider the case that V = Rd, similar considerations hold in the constrained case. Consider the case that ψt(x) = λt 2 x 2 2. In this case, the update can be written with the proximal operator of the loss functions. In particular, the proximal operator of ηf, is defined as Proxηf(y) argmin x Rd 1 2 x y 2 2 + ηf(x) . If the function f is differentiable we have that Proxηf(y) = y η f(Proxηf(y)). In words, the proximal update moves by a quantity that depends on the gradient on the updated point. The implicit nature of these updates justifies the name implicit updates used in the online learning literature. More generally, we have that Proxηf(y) y η f(Proxηf(y)). We list some common proximal operators in Appendix A. Assuming λt+1 does not depend on zt, using the proximal operator we can rewrite the update in (4) as xt+1 = θt+1 λt+1 = Prox ℓt λt+1 = Prox ℓt λt+1 Similarly, we can rewrite the update in (5) as λt ℓt( ψ t,V (θt+1)) Hence, we have that θt+1 λt = Prox ℓt λt (xt) and we get xt+1 = θt+1 λt+1 = λt λt+1 Prox ℓt λt (xt) . (7) It is instructive to compare both updates with the one of Implicit Online Mirror Descent using ψ(x) = 1 2 x 2 2 as distance generating function and stepsizes 1 λt . In this case, we would update with xt+1 = argmin x 1 2 xt x 2 2 + 1 λt (xt) . (8) Comparing (4) and (5) to (8), we see, when λt λt+1 as it is usual, the two updates above shrink a bit towards the zero vector, that is the initial point x1, before or after the proximal operator. This shrinking is given by the FTRL update and it is the key difference with Implicit OMD update. The different update also corresponds to a different guarantee: the regret of the generalized implicit FTRL holds for unbounded domains too, while in Implicit OMD with time-varying stepsizes can have linear regret on unbounded domains (Orabona & P al, 2018). Interestingly, a similar shrinking has been proposed in Fang et al. (2020) to fix the unbounded issue in OMD. Clearly, the updates (4) and (5) become equivalent to (8) for λt constant in t, that is exactly the only case when implicit/proximal online mirror descent works for unbounded domains. 4. Temporal Variability Bound In this section, we quantify the advantage of the generalized implicit FTRL updates in the case of slow temporal variability of the loss functions. It was observed in Campolongo & Orabona (2020) that implicit OMD satisfies regret guarantees that depend on the temporal Variability VT : t=2 max x V ℓt(x) ℓt 1(x) . In Campolongo & Orabona (2020, Appendix E) they also show that FTRL with full losses guarantees a similar guarantee, but at a much higher computational price. Indeed, FTRL with full losses requires solving a finite sum optimization problem at each step, whose size increases with the number of iterations. Such computational burden induced Campolongo & Orabona (2020) to say that such approach is not worth of pursuing. Here, we show that the Algorithm 1 can satisfy the same guarantee of implicit OMD with the same computational complexity too. First, we show the following Lemma. Lemma 4.1. Under the assumptions of Theorem 3.1, further assume V to be convex, ψt : V R closed, λt-strongly convex w.r.t. 
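The difference between these updates and implicit OMD can be made explicit in code. Below, `prox(y, eta)` stands for a generic oracle returning $\operatorname{argmin}_x \frac{1}{2}\|x - y\|_2^2 + \eta\,\ell_t(x)$, and we work in the unconstrained case $V = \mathbb{R}^d$ with $\psi_t(x) = \frac{\lambda_t}{2}\|x\|_2^2$. The reading of update (6) as a prox step at scale $1/\lambda_{t+1}$ applied to the shrunk point is our reconstruction of the display above, consistent with (4); updates (7) and (8) are as stated in the text.

```python
def ftrl_update_eq6(x_t, lam_t, lam_next, prox):
    """Update (6): shrink towards x_1 = 0, then prox step at the new scale 1/lam_next."""
    return prox((lam_t / lam_next) * x_t, 1.0 / lam_next)

def ftrl_update_eq7(x_t, lam_t, lam_next, prox):
    """Update (7): prox step at the current scale 1/lam_t, then shrink towards x_1 = 0."""
    return (lam_t / lam_next) * prox(x_t, 1.0 / lam_t)

def implicit_omd_update_eq8(x_t, lam_t, prox):
    """Update (8): plain implicit/proximal OMD step with stepsize 1/lam_t."""
    return prox(x_t, 1.0 / lam_t)
```

For $\lambda_t$ constant in $t$ the shrinking factor equals 1 and all three updates coincide, as noted above.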
, and subdifferentiable in V , ℓt closed, convex, and subdifferentiable in V , and λt+1 λt. Set zt argminz Ht(z). Then, we have Regret T (u) ψT +1(u) min x V ψ1(x) ℓt(xt) ℓt(xt+1) λt 2 xt+1 xt 2 , u V. Proof. First of all, the existence and unicity of xt is guaranteed by ψt being closed and strongly convex (see, e.g., Orabona, 2019, Theorem 6.8). Generalized Implicit Follow-The-Regularized-Leader From Theorem 2.1, for any g t ℓt(xt+1), we have ℓ t (g t) = xt+1, g t ℓt(xt+1). Hence, from (4), we have ψ t+1,V (θt+1) ψ t,V (θt) + ℓ t (zt) = ψ t+1,V (θt zt) ψ t,V (θt) + xt+1, zt ℓt(xt+1) . Using this identity, we have ψ t+1,V (θt zt) ψ t,V (θt) + xt+1, zt = θt zt, xt+1 ψt+1(xt+1) θt, xt + ψt(xt) ψt(xt) ψt(xt+1) + θt, xt+1 xt . From the first-order optimality condition of xt, we have that θt ψt(xt) + i V (xt). Moreover, for all g t i V (xt), by definition we have g t , y xt 0 for all y V . Hence, for g t ψt(xt) and g t i V (xt) such that θt = g t + g t , we have ψt(xt) ψt(xt+1) + θt, xt+1 xt = ψt(xt) ψt(xt+1) + g t + g t , xt+1 xt 2 xt+1 xt 2, where in the inequality we also used the strong convexity of ψt. Using this inequality in Theorem 3.1 and summing over time, we have t=1 ℓt(xt+1) ψT +1(u) min x V ψ1(x) 2 xt+1 xt 2 . By adding and subtracting PT t=1 ℓt(xt) to both sides and reordering the terms, we have the stated bound. This Lemma mirrors Theorem 5.2 in Campolongo & Orabona (2020), with the important difference that here we do not need the Bregman divergence to be bounded on the feasible set V , thanks to the use of FTRL instead of OMD. We can now state the immediate corollary on a regret bound that depends on the temporal variation. Corollary 4.2. Under the assumptions of Lemma 4.1, for any u V , we have Regret T (u) ψT +1(u) min x V ψ1(x) + ℓ1(x1) ℓT (x T +1) + VT . From this result, following (Campolongo & Orabona, 2020), it is relatively easy to obtain the following adaptive regret guarantee. The only difficulty is the fact that we need ψt+1 to be independent of zt to have a simpler update rule. We solve this problem using an increasing regularizer that is behind of two steps . In this way, we have that λt+1 depends on quantities that are all known at the beginning of round t. The proof is in Appendix B. Corollary 4.3. Under the assumptions of Lemma 4.1, further assume gt G for all t. Define γt = ℓt(xt) ℓt(xt+1) λt 2 xt+1 xt 2 and λt = 1 β2 Gβ + Pt 2 i=1 γi . Assume that ψ is closed and 1-strongly convex w.r.t. and set ψt = λtψ. Then, for any u V , we have Regret T (u) min 1 β (ℓ1(x1) ℓT (x T +1) + VT ), 5. Two-step Updates The choice of zt that minimizes the regret upper bound requires solving the optimization problem minz Ht(z) or minz H t(z). We have seen in Section 3.1 that this corresponds to (some variant) of a implicit/proximal update and, depending on ℓt, it can be of difficult calculation. However, as we said, any choice better than gt will cause a provable gain. Hence, a viable solution is to approximately solve for the optimal zt. Here, we propose a simple approximation: set zt as zt ℓt( ψ t+1,V (θt gt)) (9) or as zt ℓt( ψ t,V (θt gt)) . (10) In words, we set zt to be a subgradient after one fake update. This is exactly the approach used in the Mirror-Prox algorithm (Nemirovski, 2004), an offline optimization algorithm. In the next theorem, when the loss functions ℓt are smooth and the regularizer is chosen appropriately, we show that this choice can be used in the generalized implicit FTRL too and it cannot be worse than using gt. Theorem 5.1. 
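For the two-step choice (9), with $V = \mathbb{R}^d$ and $\psi_{t+1}(x) = \frac{\lambda_{t+1}}{2}\|x\|_2^2$ we have $\nabla\psi^\star_{t+1,V}(\theta_t - g_t) = (\theta_t - g_t)/\lambda_{t+1}$, so the update only requires one extra subgradient query per round. A minimal sketch under these assumptions, with our names:

```python
def two_step_z(theta_t, g_t, lam_next, subgrad_t):
    """Two-step (Mirror-Prox style) choice (9) for psi_{t+1}(x) = lam_next/2 ||x||_2^2, V = R^d.

    Take one 'fake' update with g_t, then use the subgradient at the resulting point as z_t.
    subgrad_t(x) returns an element of the subdifferential of ell_t at x.
    """
    x_fake = (theta_t - g_t) / lam_next    # nabla psi*_{t+1}(theta_t - g_t)
    return subgrad_t(x_fake)               # z_t in the subdifferential of ell_t at x_fake
```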
Assume ψt(x) proper, closed, and λtstrongly convex with respect to . Assume ℓt(x) closed and λt-smooth w.r.t. for all t. Then, using (9) and assuming λt+1 Lt, we have Ht(zt) Ht(gt). On the other hand, when using (10) and assuming λt Lt we have H t(zt) H t(gt). Proof. We only prove that statement for (9), the other one is similar. We would like to prove that Ht(zt) = ψ t+1,V (θt zt) + ℓ t (zt) ψ t+1,V (θt gt) + ℓ t (gt) = Ht(gt) . Generalized Implicit Follow-The-Regularized-Leader This is equivalent to prove ψ t+1,V (θt zt) ψ t+1,V (θt gt) ℓ t (gt) ℓ t (zt) . Given that ψt+1(xt) is λt+1-strongly convex, by Theorem 2.2, we have ψ t (θ) is 1/λt+1-smooth with respect to . By the definition of smoothness, we have ψ t+1,V (θt zt) ψ t+1,V (θt gt) ψ t+1(θt gt), gt zt + 1 2λt+1 gt zt 2 . Given that ℓt(xt) is Lt-smooth w.r.t , by Theorem 2.2 ℓ t (g) is 1/Lt strongly convex w.r.t. . So, by the definition of the strong convexity, we have ℓ t (gt) ℓ t (zt) qt, gt zt + 1 2Lt gt zt 2 , for all qt ℓ t (zt). Defining x t+1 ψ t+1(θt gt), by Theorem 2.1, we have x t+1 ℓ t (zt). Hence, we can select qt such that x t+1 = qt. Finally, using the assumption on λt+1 Lt, we have the stated bound. 6. Going Beyond Subgradients with a Prox Till now, in all the updates we have considered zt was set to be a subgradient of ℓt in a specific point. In this section, we show that we can go beyond this idea. Asi & Duchi (2019) introduced a Prox updates, that is proximal updates on surrogate loss functions. In particular, they used truncated linear lower bounds to the loss functions as surrogate functions. These simple surrogates are motivated by the fact that they are strictly better than linear approximation and at the same time they allow writing the proximal update in a closed form. Moreover, they showed empirically that in certain situations the performance of the algorithms becomes much more resistant to the tuning of the stepsizes. One might just use the same truncated lower bounds in implicit FTRL, but it would not be clear why this should give any advantage in the theoretical bound. Indeed, even in Asi & Duchi (2019) it is not completely clear what part of the theory tells us that we should expect a better performance from these updates. Here, we show how the updates in the generalized implicit FTRL are actually a generalization of the a Prox ones. In particular, we generalize the a Prox updates to arbitrary regularizers and show that all of them satisfy Ht(zt) Ht(gt) and H t(zt) H t(gt). In words, the a Prox updates are guaranteed to be at least as good as the subgradient gt in minimizing the worst-case regret. In order to consider truncated linear lower bounds to the functions ℓt, in this section we will assume that the loss functions ℓt are lower bounded. Given that the regret is invariant to additive constants in the losses, without loss of generality we can assume the lower bound to be 0 for all the loss functions. Hence, define the truncated linear model ˆℓt : V R around xt to be ˆℓt(x) max(ℓt(xt) + gt, x xt , 0), where gt ℓt(xt). For brevity of notation, our notation does not stress the fact that the truncated linear model depends on xt and the specific subgradient gt. The idea to extend a Prox to the case of generalized implicit FTRL, we use the truncated linear lower bound in the update of zt. So, we define zt = argmin z ψ t+1,V (θt zt) + ˆℓ t (zt) (11) or zt = argmin z ψ t,V (θt zt) + ˆℓ t (zt) . (12) Theorem 6.1. Assume the loss functions ℓt : V R to be convex, closed, and subdifferentiable in V for all t. 
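For the aProx-style choice, the truncated linear model makes the minimization of the dual objective explicit. With $V = \mathbb{R}^d$ and $\psi_t(x) = \frac{\lambda_t}{2}\|x\|_2^2$, the update built on $\tilde{H}_t$ as in (12) reduces to a rescaled subgradient, which matches the "Trunc" choice used in the experiments of Section 7. A sketch under those assumptions (losses lower bounded by 0, as assumed above); the function name is ours:

```python
import numpy as np

def aprox_z(g_t, loss_val, lam_t):
    """aProx-style z_t for psi_t(x) = lam_t/2 ||x||_2^2 and V = R^d, built on the
    truncated linear model; loss_val = ell_t(x_t) >= 0.
    """
    g_norm_sq = float(g_t @ g_t)
    if g_norm_sq == 0.0:
        return np.zeros_like(g_t)              # flat point: g_t = 0, keep z_t = 0
    scale = min(1.0, lam_t * loss_val / g_norm_sq)
    return scale * g_t                         # z_t = min(1, lam_t ell_t(x_t)/||g_t||^2) g_t
```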
Set zt using (11) or (12). Then, we have that Ht(zt) Ht(gt) or H t(zt) H t(gt) respectively. Proof. We consider the update (11), the other case is very similar and we omit it. First, we derive some inequalities on the quantities of interest. From Theorem 2.1, given that gt ˆℓt(xt) and gt ℓt(xt) we have both ℓt(xt) + ℓ (gt) = gt, xt and ˆℓt(xt) + ˆℓ (gt) = gt, xt . Moreover, given that ˆℓt(x) ℓt(x) for any x, we have ˆℓ t (z) ℓ t (z) for any z. Finally, by the definition of truncated linear lower bound, we have ℓt(xt) = ˆℓt(xt). Hence, we have ψ t+1,V (θt zt) + ℓ t (zt) ψ t+1,V (θt zt) + ˆℓ t (zt) = min z ψ t+1,V (θt z) + ˆℓ t (z) ψ t+1,V (θt gt) + ˆℓ t (gt) = ψ t+1,V (θt gt) + gt, xt ˆℓt(xt) = ψ t+1,V (θt gt) + gt, xt ℓt(xt) = ψ t+1,V (θt gt) + ℓ t (gt) = Ht(gt) . We can also immediately write closed form updates for generalized implicit FTRL with regularizer ψt(x) = λt 2 x 2, that mirror the ones of a Prox. The proof is in Section C. Corollary 6.2. Set ψt = λt 2 x 2 2 and gt ℓt(xt). Setting zt as in (11), we have that the update of generalized implicit FTRL is xt+1 = λt λt+1 xt min 1 λt+1 , ℓt(xt) Generalized Implicit Follow-The-Regularized-Leader Figure 1. Hinge loss, averaged loss vs. hyperparameter β. On the other hand, setting zt as in (12), the update is xt+1 = λt λt+1 xt min 1 λt+1 , λt 7. Empirical Evaluation As we said, in the worst case scenario any kind of implicit update cannot give any advantage over the usual updates. However, in practice it is well-known that things are vastly different. Hence, in this section, we compare the performance of different choices of zt in Algorithm 1 when ψt(x) = λt 2 x 2 2. In particular, we consider: FTRL with linearized losses (Linear): zt = gt; Implicit FTRL with a Prox updates (Trunc): zt = Figure 2. Logistic loss, averaged loss vs. hyperparameter β. min n 1, λtℓt(xt) gt 2 o gt; Implicit FTRL with two-step updates (Twostep): zt = ℓt(xt gt/λt); Implicit FTRL with (6) when the proximal operator has a closed form (Proximal). We adopt the choice of λt from Corollary 4.3. We conduct linear prediction experiments on datasets from Lib SVM (Chang & Lin, 2011). We show here experiments on classification tasks using the hinge loss and the logistic loss, and regression tasks with absolute loss. We normalize the datasets and added a constant bias term to the features. Given that in the online learning setting, we do not have the training data and validation data to tune the β, we will plot the averaged loss, 1 t Pt i=1 ℓi(xi), versus different choice of Generalized Implicit Follow-The-Regularized-Leader Figure 3. Absolute loss, averaged loss vs. hyperparameter β. β, that at the same time show the algorithms sensitivity to the hyperparameter β and their best achievable performance. We consider β [10 3, 103] for hinge loss and logistic loss, and β [10 3, 108] for the absolute loss. Each algorithm is run 15 times, we plot the average of the averaged losses and the 95% confidence interval. Note that the confidence intervals so small to be invisible, but for the larger values of the β for the Linear updates. Figure 1 and Figure 2 show the averaged loss versus different selections of hyperparameter β for classification tasks with hinge loss and logistic loss respectively. Note that with the hinge loss a Prox updates and proximal updates are completely equivalent. In all experiments, FTRL with linearized updates is more sensitive to the setting of β, and its performance is almost uniformly worse than all the other generalized implicit updates. 
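The following sketch summarizes the experimental protocol of this section for the hinge loss: one FTRL loop with $\psi_t(x) = \frac{\lambda_t}{2}\|x\|_2^2$ and the four choices of $z_t$ listed above. For simplicity the regularization weight is kept constant at $\lambda_t = G/\beta$; the experiments instead use the increasing schedule of Corollary 4.3, which we omit here. Data handling, names, and this simplification are ours.

```python
import numpy as np

def hinge(x, s, y):                          # ell_t(x) = max(1 - y <s, x>, 0)
    return max(1.0 - y * float(s @ x), 0.0)

def hinge_subgrad(x, s, y):
    return -y * s if hinge(x, s, y) > 0.0 else np.zeros_like(s)

def run_hinge_experiment(data, beta, choice="Trunc", G=1.0):
    """Average hinge loss of generalized implicit FTRL for one choice of z_t.

    data: list of (s_t, y_t) with y_t in {-1, +1}; beta: the scanned hyperparameter.
    choice in {"Linear", "Trunc", "Twostep", "Proximal"} as in Section 7.
    """
    d = len(data[0][0])
    theta, lam = np.zeros(d), G / beta       # constant lambda_t = G/beta (simplification)
    cum_loss = 0.0
    for s, y in data:
        x = theta / lam                      # FTRL prediction
        loss = hinge(x, s, y)
        cum_loss += loss
        g = hinge_subgrad(x, s, y)
        gsq = float(g @ g)
        if choice == "Linear" or gsq == 0.0:
            z = g                            # linearized FTRL
        elif choice == "Trunc":
            z = min(1.0, lam * loss / gsq) * g
        elif choice == "Twostep":
            z = hinge_subgrad(x - g / lam, s, y)
        else:                                # "Proximal": for the hinge loss it equals "Trunc"
            z = min(1.0, lam * loss / gsq) * g
        theta = theta - z
    return cum_loss / len(data)
```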
This is in line with previous results in Asi & Duchi (2019) in the offline setting. With the logistic loss, the proximal operator does not have a closedform solution. In all the classification experiments, the performance of generalized implicit FTRL with two-step updates seems remarkable and a possible viable alternative to a Prox. The confidence intervals for all implicit updates have a width smaller than 0.01, making them too narrow to be visible in the figures. In contrast, when using hinge loss, the performance of FTRL with linear models exhibits significant fluctuations across different repetitions when a large learning rate is used. This observation provides evidence supporting our assertion that the selection of hyperparameter β greatly affects the performance of FTRL with linear models, while implicit updates demonstrate robustness. Figure 3 shows that FTRL with linearized updates is very sensitive to the choice of the hyperparameter β, while the implicit FTRL updates are robust. Again, Implicit FTRL with two-step updates achieves essentially the best performance. The confidence intervals in the regression tasks lead to a similar conclusion as in the classification tasks. 8. Conclusion and Future Work In this work, we propose a new framework: generalized implicit Follow-the-Regularized-Leader. We show that generalized implicit FTRL can not only recover known algorithms, e.g., implicit FTRL and FTRL with linearized losses, but it also provides a theoretical guideline to design new algorithms, such as the extensions of a Prox and Mirror-Prox. Indeed, we believe that the main contribution of our work lies precisely in the fact that it provides a unifying framework that is general, flexible, and theoretically grounded. In the future, we plan to explore further this framework designing new zt with low computational complexity. This is a promising direction because the two-step update seems to be already a valid alternative to the a Prox updates, even if it comes at the computational expense of querying an additional gradient in each round. Acknowledgements We thank Alex Shtoff for discussion and feedback on a preliminary version of this paper. Francesco Orabona is supported by the National Science Foundation under the grants no. 2022446 Foundations of Data Science Institute and no. 2046096 CAREER: Parameter-free Optimization Algorithms for Machine Learning . Generalized Implicit Follow-The-Regularized-Leader Abernethy, J. D., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In Servedio, R. A. and Zhang, T. (eds.), Proc. of Conference on Learning Theory (COLT), pp. 263 274. Omnipress, 2008. Asi, H. and Duchi, J. C. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257 2290, 2019. Bauschke, H. H. and Combettes, P. L. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011. Campolongo, N. and Orabona, F. Temporal variability in implicit online learning. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020. Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006. Cesa-Bianchi, N. and Orabona, F. Online learning algorithms. Annual Review of Statistics and Its Application, 8:165 190, 2021. Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1 27, 2011. 
Software available at http://www.csie.ntu.edu. tw/ cjlin/libsvm. Chen, K., Cutkosky, A., and Orabona, F. Implicit parameterfree online learning with truncated linear models. In International Conference on Algorithmic Learning Theory, pp. 148 175. PMLR, 2022a. Chen, K., Langford, J., and Orabona, F. Better parameterfree stochastic optimization with ODE updates for coinbetting. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022b. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551 585, 2006. Duchi, J. C., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121 2159, 2011. Fang, H., Harvey, N., Portella, V., and Friedlander, M. Online mirror descent and dual averaging: keeping pace in the dynamic case. In Daum, III, H. and Singh, A. (eds.), Proc. of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 3008 3017. PMLR, 13 18 Jul 2020. Hazan, E. and Kale, S. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proc. of the 21st Conference on Learning Theory, 2008. Joulani, P., Gy orgy, A., and Szepesv ari, C. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, and variational bounds. In Proc. of the International Conference on Algorithmic Learning Theory (ALT), volume 76, pp. 681 720, 2017. Kivinen, J. and Warmuth, M. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1 63, January 1997. Kulis, B. and Bartlett, P. L. Implicit online learning. In International Conference on Machine Learning, pp. 575 582, 2010. Littlestone, N. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4):285 318, 1988. Martinet, B. R egularisation din equations variationnelles par approximations successives. rev. franc aise informat. Recherche Op erationnelle, 4:154 158, 1970. Mc Mahan, H. B. A unified view of regularized dual averaging and mirror descent with implicit updates. ar Xiv preprint ar Xiv:1009.3240, 2010. Mc Mahan, H. B. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117 3166, 2017. Mc Mahan, H. B. and Streeter, M. J. Adaptive bound optimization for online convex optimization. In COLT, 2010. Moreau, J.-J. Proximit e et dualit e dans un espace hilbertien. Bulletin de la Soci et e math ematique de France, 93:273 299, 1965. Nemirovski, A. Prox-method with rate of convergence O(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229 251, 2004. Nemirovskij, A. S. and Yudin, D. Problem complexity and method efficiency in optimization. Wiley, New York, NY, USA, 1983. Orabona, F. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2019. Version 6. Orabona, F. and P al, D. Scale-free algorithms for online linear optimization. In International Conference on Algorithmic Learning Theory, pp. 287 301. Springer, 2015. Generalized Implicit Follow-The-Regularized-Leader Orabona, F. and P al, D. Coin betting and parameter-free online learning. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 
577 585. Curran Associates, Inc., 2016. Orabona, F. and P al, D. Scale-free online learning. Theoretical Computer Science, 716:50 69, 2018. Special Issue on ALT 2015. Orabona, F. and P al, D. Parameter-free stochastic optimization of variationally coherent functions. ar Xiv preprint ar Xiv:2102.00236, 2021. Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in optimization, 1(3):127 239, 2014. Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970. Rockafellar, R. T. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877 898, 1976. Shalev-Shwartz, S. Online Learning: Theory, Algorithms, and Applications. Ph D thesis, The Hebrew University, 2007. Shalev-Shwartz, S. and Singer, Y. A primal-dual perspective of online learning algorithms. Machine Learning, 69:115 142, 2007a. Shalev-Shwartz, S. and Singer, Y. Convex repeated games and Fenchel duality. In Advances in neural information processing systems, pp. 1265 1272, 2007b. Shtoff, A. Efficient implementation of incremental proximalpoint methods. ar Xiv preprint ar Xiv:2205.01457, 2022. Streeter, M. and Mc Mahan, H. B. Less regret via online conditioning. ar Xiv preprintar Xiv:1002.4862, 2010. Warmuth, M. K. and Jagota, A. K. Continuous and discretetime nonlinear gradient descent: Relative loss bounds and convergence. In Electronic proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics, volume 326, 1997. Generalized Implicit Follow-The-Regularized-Leader A. Update Rule for Common Losses In this section, we report the proximal operator of common losses for easy referencing. These formulas are well-known and they can be found, for example, in Crammer et al. (2006); Kulis & Bartlett (2010). ℓt(x) = max(1 yt st, x , 0) Prox ℓt λ (x) = x + min 1 λ, max(1 yt st, x , 0) ℓt(x) = | st, x yt| Prox ℓt λ (x) = x min 1 λ, | st, x yt| 2( st, x yt)2 Prox ℓt λ (x) = x ( st, x yt)st λ + st 2 2 . B. Proof of Corollary 4.3 Proof. From the regret guarantee in Lemma 4.1, we have that Regret T (u) ψT +1(u) + β2 T X t=1 γt λT +1(ψ(u) + β2), u V . Now, we upper bound PT t=1 γt in two different ways. In the first upper bound, we have ℓt(xt) ℓ(xt+1) λt 2 xt+1 xt 2 = ℓ1(x1) ℓT (x T +1) + t=2 (ℓt(xt) ℓt 1(xt)) ℓ1(x1) ℓT (x T +1) + VT . For the second upper bound, we have γt = ℓt(xt) ℓ(xt+1) λt 2 xt+1 xt 2 gt, xt xt+1 λt 2 xt+1 xt 2 gt 2 2λt + λt 2 xt+1 xt 2 λt 2 xt+1 xt 2 = gt 2 2λt β gt where we used Fenchel-Young inequality and the second lower bound is obtained by using the fact that λt λ1 = G β . Hence, we have λt+1 = λt + γt β2 λt + min gt 2β , gt 2 2β2λt Using Lemma 6.1 in Campolongo & Orabona (2020) and taking into account the fact that λ1 = G β , we have Putting all together, we have the stated bound. C. Proof of Corollary 6.2 Proof. The proximal operator of ˆℓt λ is λ (x) = x min 1 Generalized Implicit Follow-The-Regularized-Leader Hence, from (7), we have xt+1 = λt λt+1 λt , ℓt(xt) = λt λt+1 xt min 1 λt+1 , λt Instead, from (6), we have xt+1 = λt λt+1 xt min 1 λt+1 , ℓt(xt)
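For easy reference, the closed forms of Appendix A can be implemented directly. The sketch below spells out the $\|s_t\|_2^2$ factors and signs that were lost in the display; these are the standard proximal operators (cf. Crammer et al., 2006; Kulis & Bartlett, 2010), written for $\mathrm{Prox}_{\ell_t/\lambda}$, with our function names.

```python
import numpy as np

def prox_hinge(x, s, y, lam):
    """Prox_{ell/lam}(x) for ell(w) = max(1 - y <s, w>, 0)."""
    loss = max(1.0 - y * float(s @ x), 0.0)
    tau = min(1.0 / lam, loss / float(s @ s))
    return x + tau * y * s

def prox_absolute(x, s, y, lam):
    """Prox_{ell/lam}(x) for ell(w) = |<s, w> - y|."""
    r = float(s @ x - y)
    tau = min(1.0 / lam, abs(r) / float(s @ s))
    return x - tau * np.sign(r) * s

def prox_square(x, s, y, lam):
    """Prox_{ell/lam}(x) for ell(w) = 0.5 * (<s, w> - y)^2."""
    r = float(s @ x - y)
    return x - (r / (lam + float(s @ s))) * s
```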