# Best-case lower bounds in online learning

Cristóbal Guzmán¹,² · Nishant A. Mehta³ · Ali Mortazavi³
¹ Department of Applied Mathematics, University of Twente
² Institute for Mathematical and Computational Eng., Pontificia Universidad Católica de Chile
³ Department of Computer Science, University of Victoria
c.guzman@utwente.nl, nmehta@uvic.ca, alithemorty@gmail.com

Abstract

Much of the work in online learning focuses on the study of sublinear upper bounds on the regret. In this work, we initiate the study of best-case lower bounds in online convex optimization, wherein we bound the largest improvement an algorithm can obtain relative to the single best action in hindsight. This problem is motivated by the goal of better understanding the adaptivity of a learning algorithm. Another motivation comes from fairness: it is known that best-case lower bounds are instrumental in obtaining algorithms for decision-theoretic online learning (DTOL) that satisfy a notion of group fairness. Our contributions are a general method to provide best-case lower bounds for Follow the Regularized Leader (FTRL) algorithms with time-varying regularizers, which we use to show that best-case lower bounds are of the same order as existing upper regret bounds: this includes situations with a fixed learning rate, decreasing learning rates, timeless methods, and adaptive gradient methods. In stark contrast, we show that the linearized version of FTRL can attain negative linear regret. Finally, in DTOL with two experts and binary losses, we fully characterize the best-case sequences, which provides a finer understanding of the best-case lower bounds.

1 Introduction

A typical work in online learning would develop algorithms that provably achieve low regret for some family of problems, where low regret means that a learning algorithm's cumulative loss is not much larger than that of the best expert (or action) in hindsight. Such a work often focuses on algorithms that exhibit various forms of adaptivity, including anytime algorithms, which adapt to an unknown time horizon T; timeless algorithms, which obtain first-order regret bounds that replace dependence on the time horizon by the cumulative loss of the best expert; and algorithms like AdaGrad [8], which adapt to the geometry of the data. These examples of adaptivity all involve competing with the best expert in hindsight, but adaptivity comes in many guises. Another form of adaptivity involves upgrading the comparator itself: in the shifting regret (also known as the tracking regret) [11], the learning algorithm competes with the best sequence of experts that shifts, or switches, k times for some small k. Naturally, an algorithm with low shifting regret can potentially perform much better than the single best expert in hindsight, thereby obtaining classical regret that is substantially negative.

Our work serves as a counterpoint to previous works: we show that a broad class of learning strategies provably fails, against any sequence of data, to substantially outperform the best expert in hindsight. Thus, these strategies are unable to obtain low shifting regret and, more generally, low regret against any comparator sequence that can be substantially better than the best expert in hindsight. More concretely, this paper initiates the study of best-case lower bounds in online convex optimization (OCO) for the general family of learning strategies known as Follow the Regularized Leader (FTRL) [1].
That is, we study the minimum possible regret of a learning algorithm over all possible sequences. As we will show, many instances of FTRL (including adaptive instances that are anytime, timeless, or that adapt to gradients like AdaGrad) never have regret that is much less than the negation of the corresponding regret upper bounds. Thus, while these instances can be adaptive in some ways, they are in a sense prohibited from uniformly obtaining low regret for adaptive notions of regret like the shifting regret. For example, in the setting of decision-theoretic online learning (DTOL) with d experts [9], the well-known anytime version of Hedge (which uses the time-varying learning rate $\eta_t \propto \sqrt{(\log d)/t}$) enjoys $O(\sqrt{T \log d})$ worst-case regret and, as we show, has best-case regret no smaller than $-O(\sqrt{T \log d})$. Moreover, in the same setting under the restriction of two experts and binary losses, we exactly identify the best-case sequence, thereby showing that our best-case lower bound for this setting is tight. The structure of this sequence is surprisingly simple, but the arguments we use to pinpoint this sequence are playfully complex, bearing some similarity to the techniques of [20] and [15]. The latter work [15] considers the regret of Thompson Sampling in adversarial bit prediction; they use swapping rules and identify best-case sequences, as do we. However, the algorithms and problem settings have important differences.

A key motivation for our work is a recent result [4] which shows, in the setting of DTOL, that Hedge with a constant learning rate has best-case regret lower bounded by $-O(\sqrt{T})$. This result, taken together with worst-case upper bounds of order $O(\sqrt{T})$, is then used to show that if each of finitely many experts approximately satisfies a certain notion of group fairness, then a clever use of the Hedge algorithm (running it separately on each group) also approximately satisfies the same notion of group fairness while still enjoying $O(\sqrt{T})$ regret. However, we stress that their result is very limited in that it applies only to Hedge when run with a known time horizon. The fixed time horizon assumption also implies that their notion of group fairness is inherently tied to a fixed time horizon (see Section 4.2 for a detailed discussion), and this latter implication can lead to experts that seem very unfair but which, based on a fixed-horizon view of group fairness, are technically considered to be fair. Our best-case lower bounds enable the results of [4] to hold in much greater generality; in particular, our results enable the use of an anytime version of group fairness, which we feel is truly needed.

To our knowledge, our work is the first to study best-case lower bounds for Adaptive FTRL [16], i.e., FTRL with time-varying regularizers that can adapt to the learning algorithm's past observations. The most closely related work is a paper by Gofer and Mansour (GM) [10] which, in the setting of online linear optimization (OLO) and when using FTRL with a fixed regularizer,¹ provides various lower bounds on the regret. For instance, they show that for any sequence of data, the regret is nonnegative; we recover this result as a special case of our analysis, and our analysis extends to OCO as well. GM also lower bound what they call the anytime regret, which superficially may seem similar to our providing best-case lower bounds for anytime algorithms. Yet, as we explain in Section 3, these two notions greatly differ.

¹ [10] uses learners that follow the gradient of a concave potential, which is essentially equivalent to FTRL.
In short, their analysis lower bounds the maximum regret (over all prefixes of a sequence) for fixed-horizon algorithms, whereas our analysis lower bounds the regret for all prefixes (including the minimum) for adaptively regularized algorithms, which includes anytime algorithms.

A natural question is whether results similar to our results for FTRL also hold for online mirror descent (OMD). In some situations, such as in OLO when the action space is the probability simplex, the regularizer is the negative Shannon entropy, and the learning rate is constant, our results automatically apply to OMD because the methods are then the same. More generally, it is known that OMD with a time-varying learning rate can fail spectacularly by obtaining linear regret (see Theorem 4 of [18]). Since so much of our work is tied to obtaining anytime guarantees (which would require a time-varying learning rate), we forego providing best-case lower bounds for OMD.

Our main contributions are as follows:

1. We give a general best-case lower bound on the regret for Adaptive FTRL (Section 3). Our analysis crucially centers on the notion of adaptively regularized regret, which serves as a potential function to keep track of the regret.

2. We show that this general bound can easily be applied to yield concrete best-case lower bounds for FTRL with time-varying negative regularizers, one special case being the negative Shannon entropy. We also show that an adaptive gradient FTRL algorithm (which can be viewed as a non-linearized version of the dual averaging version of AdaGrad [8]; see Section 4.3 for details) admits a best-case lower bound that is essentially the negation of its upper bound (Section 4).

3. A widely used variant of FTRL for OCO is to first linearize the losses, leading to linearized FTRL. This method works well with respect to upper bounds, as a basic argument involving convexity goes in the right direction. However, with regards to best-case lower bounds, we show a simple construction (Section 5) for which linearized FTRL obtains $-\Omega(T)$ regret.²

4. In the setting of DTOL with 2 experts and binary losses, we explicitly identify the best-case sequence, proving that our best-case lower bounds are tight in this setting (Section 6).

² This negative construction, combined with the fact that the dual averaging version of AdaGrad is a linearized version of FTRL, is why we only prove best-case lower bounds for the adaptive gradient FTRL algorithm.

The next section formalizes the problem setting and FTRL. We then develop the main results.

2 Problem Setting and General Prediction Strategies

Before giving the problem setting, we first set some notation. We denote the norm of a vector $w \in \mathcal{W}$ by $\|w\|$, the corresponding dual norm by $\|\cdot\|_\star$, and $\log$ always denotes the natural logarithm.

Problem setting. We consider the OCO setting. This is a game between Learner and Nature. In each round $t = 1, 2, \ldots, T$, Learner selects an action $w_t$ belonging to a closed, bounded, convex set $\mathcal{W} \subseteq \mathbb{R}^d$. Then, with knowledge of $w_1, \ldots, w_t$, Nature responds with a convex loss function $f_t \colon \mathcal{W} \to \mathbb{R}$. Learner then observes $f_t$ and suffers loss $f_t(w_t)$. Learner's goal is to minimize its regret, defined as
$$R_T := \sup_{w \in \mathcal{W}} \sum_{t=1}^{T} \big[ f_t(w_t) - f_t(w) \big],$$
which is the gap between Learner's cumulative loss and that of the best action in hindsight.

This paper will cover several examples of OCO. The first example is the subclass of OLO problems. In OLO, the loss functions $f_t$ are linear, with $f_t(w) = \langle \ell_t, w \rangle$ for some loss vector $\ell_t \in \mathbb{R}^d$. A noteworthy special case of OLO is DTOL, also known as the Hedge setting.
In DTOL, we take $\mathcal{W}$ to be the simplex $\Delta_d$ over $d$ outcomes and restrict the loss vectors to $\ell_t \in [0,1]^d$. We introduce some notation that will be useful in the DTOL setting. For any expert $j \in [d]$, let $L_{t,j} := \sum_{s=1}^{t} \ell_{s,j}$ denote the cumulative loss of expert $j$ until the end of round $t$. We denote the loss of Learner in round $t$ by $\hat{\ell}_t := \langle \ell_t, w_t \rangle$ and Learner's cumulative loss at the end of round $t$ by $\hat{L}_t := \sum_{s=1}^{t} \hat{\ell}_s$.

In this work, we consider the general prediction strategy of FTRL.

FTRL. Let $\Phi_1, \Phi_2, \ldots$ be a possibly data-dependent sequence of regularizers where, for each $t$, the regularizer $\Phi_t$ is a mapping $\Phi_t \colon \mathcal{W} \to \mathbb{R}$ which is allowed to depend on $(f_s)_{s \le t}$. Then FTRL chooses actions according to the past regularized cumulative loss:
$$w_{t+1} = \operatorname*{arg\,min}_{w \in \mathcal{W}} \Big\{ \sum_{s=1}^{t} f_s(w) + \Phi_t(w) \Big\}. \qquad (1)$$
We would like to emphasize that this is a very general template. It includes fixed learning rate regularization, $\Phi_t \equiv \frac{1}{\eta}\Phi$, as well as its variable learning rate counterpart, $\Phi_t = \frac{1}{\eta_t}\Phi$, and arbitrary forms of adaptive choices of $\Phi_t$ based on the past. Despite this adaptivity, in the next section we will show that proving best-case lower bounds for this strategy is quite straightforward. The regret attained by FTRL is summarized in the following known result.

Theorem 1 (Theorem 1 of [16]). Let $(\Phi_t)_{t \ge 1}$ be a sequence of nonnegative regularizers such that, for each $t \ge 1$, $\Phi_t$ is 1-strongly convex with respect to a norm $\|\cdot\|_{(t)}$. The regret of the FTRL algorithm (1) with this sequence of regularizers is upper bounded as
$$R_T \le \Phi_T(w^\star) + \frac{1}{2} \sum_{t=1}^{T} \|\nabla f_t(w_t)\|_{(t-1),\star}^2, \qquad (2)$$
where $w^\star \in \mathcal{W}$ is the best action in hindsight. If we do not require the regularizers to be nonnegative but instead assume that, for each $w \in \mathcal{W}$, the sequence $(\Phi_t(w))_{t \ge 1}$ is non-increasing, then (2) still holds if we replace $\Phi_T(w^\star)$ by $\Phi_T(w^\star) - \inf_{w \in \mathcal{W}} \Phi_T(w)$.

3 A general best-case lower bound

We now present our general best-case lower bound for FTRL. Key to our analysis is the concept of adaptively regularized regret (hereafter abbreviated as regularized regret), defined as
$$R_t^{\Phi_t} = \sup_{w \in \mathcal{W}} \Big\{ \sum_{s=1}^{t} \big[ f_s(w_s) - f_s(w) \big] - \Phi_t(w) \Big\}. \qquad (3)$$
The regularized regret can easily be related to the regret as
$$R_t^{\Phi_t} \le R_t - \inf_{w \in \mathcal{W}} \Phi_t(w). \qquad (4)$$
Also, applying (1), the following re-expression of the regularized regret is immediate:
$$R_t^{\Phi_t} = \sum_{s=1}^{t} \big[ f_s(w_s) - f_s(w_{t+1}) \big] - \Phi_t(w_{t+1}). \qquad (5)$$

Theorem 2 (Best-case lower bound on regret for adaptive FTRL). Consider the setting of online convex optimization and the adaptive FTRL strategy (1). Suppose that there exists a sequence $(\alpha_t)_{t \in [T]}$ such that $\Phi_t(w_t) \le \Phi_{t-1}(w_t) + \alpha_t$ for all $t \in [T]$. Then
$$R_T \ge \inf_{w \in \mathcal{W}} \Phi_T(w) - \inf_{w \in \mathcal{W}} \Phi_0(w) - \sum_{t=1}^{T} \alpha_t.$$

Proof. We start by inductively bounding the adaptively regularized regret:
$$R_{t+1}^{\Phi_{t+1}} - R_t^{\Phi_t} = \sup_{w \in \mathcal{W}} \Big\{ \sum_{s=1}^{t+1} \big[ f_s(w_s) - f_s(w) \big] - \Phi_{t+1}(w) \Big\} - \sum_{s=1}^{t} \big[ f_s(w_s) - f_s(w_{t+1}) \big] + \Phi_t(w_{t+1}) \ge \Phi_t(w_{t+1}) - \Phi_{t+1}(w_{t+1}) \ge -\alpha_{t+1},$$
where the first equality is from (5) and the first inequality follows by taking $w = w_{t+1}$ in the supremum (the term $f_{t+1}(w_{t+1}) - f_{t+1}(w_{t+1})$ vanishes). We conclude that $R_T^{\Phi_T} \ge R_0^{\Phi_0} - \sum_{t=1}^{T} \alpha_t$. Next, from (4),
$$R_T \ge R_T^{\Phi_T} + \inf_{w \in \mathcal{W}} \Phi_T(w) \ge \inf_{w \in \mathcal{W}} \Phi_T(w) + R_0^{\Phi_0} - \sum_{t=1}^{T} \alpha_t = \inf_{w \in \mathcal{W}} \Phi_T(w) - \inf_{w \in \mathcal{W}} \Phi_0(w) - \sum_{t=1}^{T} \alpha_t,$$
where in the last equality we used that $R_0^{\Phi_0} = \sup_{w \in \mathcal{W}} \{-\Phi_0(w)\} = -\inf_{w \in \mathcal{W}} \Phi_0(w)$.

The closest results to Theorem 2 of which we are aware are in the intriguing work of Gofer and Mansour (GM) [10], who provided lower bounds for FTRL with a fixed regularizer for OLO. Their Theorem 1 shows that the best-case regret is nonnegative.
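As a concrete sanity check on Theorem 2 (and on the nonnegativity just noted for a fixed regularizer), the following minimal simulation sketch instantiates adaptive FTRL (1) in the DTOL setting with the entropic regularizer $\Phi_t = \Phi/\eta_t$ and a decreasing learning rate, anticipating Section 4.1. The specific schedule $\eta_t = \sqrt{2\log(d)/(t+1)}$, the random losses, and all names below are our own illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 2000

def neg_entropy(w):
    # Phi(w) = sum_j w_j log w_j, the negative Shannon entropy (<= 0 on the simplex)
    return float(np.sum(w * np.log(w)))

def eta(t):
    # decreasing learning rate, so Phi_t = Phi / eta_t is an adaptive regularizer
    return np.sqrt(2.0 * np.log(d) / (t + 1))

L = np.zeros(d)                      # cumulative expert losses
learner_loss, alpha_sum = 0.0, 0.0
for t in range(1, T + 1):
    w = np.exp(-eta(t - 1) * L)      # w_t from the FTRL update with regularizer Phi_{t-1}
    w /= w.sum()
    loss = rng.random(d)             # Nature's loss vector in [0, 1]^d
    learner_loss += float(w @ loss)
    L += loss
    # alpha_t = Phi_t(w_t) - Phi_{t-1}(w_t) = (1/eta_t - 1/eta_{t-1}) * Phi(w_t)
    alpha_sum += (1.0 / eta(t) - 1.0 / eta(t - 1)) * neg_entropy(w)

regret = learner_loss - L.min()
# Theorem 2: R_T >= inf_w Phi_T(w) - inf_w Phi_0(w) - sum_t alpha_t,
# where the infimum of the negative entropy over the simplex is -log(d).
bound = -np.log(d) / eta(T) + np.log(d) / eta(0) - alpha_sum
print(f"regret = {regret:.3f}, Theorem 2 bound = {bound:.3f}, holds: {regret >= bound}")
```

On random loss sequences the realized regret is typically close to zero, comfortably above the (negative) Theorem 2 bound; carefully chosen sequences can push it much closer to the bound, as Section 6 makes precise for two experts.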
One wonders if the doubling trick can extend GM's results to anytime best-case lower bounds; regrettably, the doubling trick's restarts can allow the algorithm to achieve $-\Omega(T)$ regret (see Appendix A). GM also lower bound a notion they call the anytime regret (see Theorem 5 of [10]). The anytime regret for a sequence, as they define it, is actually the maximum regret over all prefixes of the sequence (where the sequence has a fixed time horizon); ultimately, the lower bound they obtain depends on the quadratic variation of the sequence as computed on a fixed time horizon. Related to this, GM's analysis is for algorithms that use the same regularizer in all rounds. They lament that it is unclear how to extend their analysis to handle time-varying learning rates.³ Our goal is rather different, as taking the maximum regret over all prefixes says nothing about how large (in the negative direction) the regret could be for some particular prefix. Thus, our style of analysis, which provides a lower bound on the regret for all time horizons (and which, in particular, provides a lower bound on the minimum over all prefixes), differs greatly from theirs and is what is needed. We think the difference between the styles of our work and theirs stems from the regret being nonnegative in their paper, whereas it need not be for time-varying regularizers.

Theorem 2 possesses considerable generality, owing to its applying to FTRL with adaptive regularizers. We highlight just a few applications in the next section.

³ A time-varying learning rate gives rise to perhaps the most basic form of an adaptive regularizer.

4 Best-case lower bounds in particular settings

Theorem 2, presented in the last section, is, despite its simplicity, an extremely powerful method, capable of addressing several of the settings where FTRL attains sublinear regret. Next, we proceed to enumerate some important examples of instances of FTRL, comparing existing worst-case upper bounds on the regret with our best-case lower bounds.

4.1 Non-increasing learning rates and timeless algorithms

We first present several examples in which the elements of the sequence $(\Phi_t)_{t \ge 1}$ take the form $\Phi_t = \frac{1}{\eta_t}\Phi$ for a fixed regularizer $\Phi$ and a time-varying learning rate $\eta_t$.

Constant learning rate. The simplest example is that of a fixed learning rate $\eta_t \equiv \eta$, which means $\Phi_t(w) = \frac{1}{\eta}\Phi(w)$. Taking $\alpha_t = 0$ for all $t$, Theorem 2 immediately implies that the regret is always nonnegative; this implication was previously shown by Gofer and Mansour (see Thm. 1 of [10]). A simple consequence is that the Follow the Leader (FTL) strategy (i.e., $\Phi_t \equiv 0$) has nonnegative regret, which also can be inferred from the results of [10]. Although we cannot find a precise reference, we believe that it was already known that FTL always obtains nonnegative regret (even prior to [10]).

Time-varying learning rate. More generally, taking a time-varying learning rate, Theorem 2 gives
$$R_T \ge \Big( \frac{1}{\eta_T} - \frac{1}{\eta_0} \Big) \inf_{w \in \mathcal{W}} \Phi(w) - \sum_{t=0}^{T-1} \Big( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Big) \Phi(w_{t+1}). \qquad (6)$$
A typical strategy to obtain sublinear anytime worst-case regret is to set the learning rate as $\eta_t = \eta/\sqrt{t+1}$ for some constant $\eta$ that depends on various known problem-dependent constants. Continuing from (6) with this setting, bounding $\Phi(w_{t+1}) \le \sup_{w \in \mathcal{W}} \Phi(w)$, and telescoping the coefficients, further implies that
$$R_T \ge \frac{\sqrt{T+1}-1}{\eta} \inf_{w \in \mathcal{W}} \Phi(w) - \frac{\sqrt{T+1}-1}{\eta} \sup_{w \in \mathcal{W}} \Phi(w). \qquad (7)$$
If $\Phi$ is nonpositive, as holds when $\mathcal{W}$ is the $d$-dimensional simplex and $\Phi$ is the negative Shannon entropy, the above is further lower bounded by
$$R_T \ge \frac{\sqrt{T+1}-1}{\eta} \inf_{w \in \mathcal{W}} \Phi(w).$$
A particularly interesting example is the DTOL setting.
In this setting, when we take $\Phi$ to be the negative Shannon entropy $\Phi(w) = \sum_{j=1}^{d} w_j \log w_j$ and set $\eta = \sqrt{2 \log d}$, so that $\eta_t = \sqrt{2 (\log d)/(t+1)}$, we recover the standard anytime version of Hedge, which has also been called Decreasing Hedge [17]. In round $t$, this algorithm plays $w_t$ such that $w_{t,j} \propto \exp(-\eta_{t-1} L_{t-1,j})$ for $j \in [d]$, and we have the following anytime best-case lower bound and worst-case upper bound on the regret:
$$-\sqrt{\tfrac{T \log d}{2}} \le R_T \le \sqrt{2 T \log d}.$$
The lower bound holds from (7) combined with $\sqrt{T+1} - 1 \le \sqrt{T}$ and $-\log d \le \Phi \le 0$, while the upper bound is from Theorem 2 of [6]. The upper bound is minimax optimal in terms of the rate and, asymptotically (letting both $d$ and $T$ go to infinity), has a constant that is optimal up to a factor of 2. As we show in Section 6, in the case of $d = 2$ the lower bound also has the optimal rate.

Timeless algorithms. In the context of DTOL, it is straightforward to adapt our analysis for Decreasing Hedge to Hedge with any non-increasing learning rate. One example of interest is a self-confident, non-increasing learning rate of the form $\eta_t = \log\big(1 + \min\{\tfrac{1}{4}, \sqrt{(\log d)/L^*_{t-1}}\}\big)$, where $L^*_T = \min_{j \in [d]} L_{T,j}$ is the cumulative loss of the best expert. This choice of adaptive learning rate yields an anytime upper bound on the regret of $O\big(\sqrt{L^*_T \log d} + \log d\big)$ in the DTOL setting (see Theorem 2.1 of [2], who actually prove this result in the more general setting of prediction with expert advice). Such a bound is called timeless because rounds in which all experts suffer the same loss have no effect on the bound [7]. This is a natural property to have in this setting, and with the above choice of learning rate, we have the following timeless⁴ best-case lower bound:
$$R_T \ge -O\big(\sqrt{L^*_T \log d} + \log d\big);$$
a brief derivation, with explicit constants, is in Appendix B. Again, notice the similarity between the upper and lower bounds.

⁴ Technically, the upper and lower bounds as stated are not timeless, but it is easy to see that any round for which all experts have the same loss can be replaced by a round where all experts have zero loss, with no change in the algorithm's behavior nor its regret. Our regret bounds on the modified loss sequence then become timeless.

4.2 Group fairness in online learning

In a pioneering work, Blum, Gunasekar, Lykouris, and Srebro (BGLS) [4] considered a notion of group fairness in DTOL. In their setup, the DTOL protocol is augmented so that, at the start of each round $t$, Nature selects and reveals to Learner a group $g_t$ belonging to a set of groups $\mathcal{G}$ prior to Learner's playing its action $w_t$. They assume that for a known, fixed time horizon $T$, each expert $j \in [d]$ has balanced mistakes across groups in the following sense: For any $g \in \mathcal{G}$, let $\mathcal{T}(g) := \{ t \in [T] \colon g_t = g \}$ denote the rounds belonging to group $g$, and let $L_{\mathcal{T}(g),j} := \sum_{t \in \mathcal{T}(g)} \ell_{t,j}$ denote the cumulative loss of expert $j \in [d]$ when considering only the rounds in $\mathcal{T}(g)$; then we say that expert $j$ is fair in isolation if
$$\frac{L_{\mathcal{T}(g),j}}{|\mathcal{T}(g)|} = \frac{L_{\mathcal{T}(g'),j}}{|\mathcal{T}(g')|} \quad \text{for all } g, g' \in \mathcal{G}.$$
(A small checker for this definition is sketched below.) BGLS devised a strategy based on the multiplicative weights algorithm which simultaneously satisfies an approximate notion of group fairness while still enjoying $O(\sqrt{T \log d})$ worst-case regret. The multiplicative weights algorithm (which we hereafter refer to as Hedge, as the algorithms are equivalent for a constant learning rate) sets $w_t$ as $w_{t,j} \propto (1-\eta)^{L_{t-1,j}}$ for $j \in [d]$, for a learning rate parameter $\eta$. Their strategy is to run a separate copy of Hedge for each group, so that for any group $g$, the copy corresponding to group $g$ is run on the subsequence corresponding to the rounds in $\mathcal{T}(g)$. For brevity, we call this Interleaved Hedge.

In BGLS's analysis (see the proof of their Theorem 3), they first give worst-case regret upper and lower bounds for Hedge. Specifically, they show that if Hedge is run with a constant learning rate, then
$$(1 - 4\eta) L^*_T \le \hat{L}_T \le (1 + \eta) L^*_T + \frac{\log d}{\eta}. \qquad (9)$$
An optimal, non-anytime worst-case tuning of $\eta$ then yields matching-magnitude regret lower and upper bounds of $-O(\sqrt{T \log d})$ and $O(\sqrt{T \log d})$, respectively.
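As an aside, the fairness-in-isolation condition above is easy to check mechanically. The sketch below is our own illustration (the function name, tolerance, and toy data are ours, not from the paper); it computes each expert's per-group average loss and tests whether all groups agree.

```python
import numpy as np

def fair_in_isolation(losses, groups, tol=1e-9):
    """Check whether every expert has equal per-group average loss, i.e.,
    L_{T(g),j}/|T(g)| == L_{T(g'),j}/|T(g')| for all groups g, g' and experts j.

    losses: (T, d) array of per-round expert losses; groups: length-T labels."""
    losses, groups = np.asarray(losses, dtype=float), np.asarray(groups)
    rates = []
    for g in np.unique(groups):
        in_g = (groups == g)
        rates.append(losses[in_g].sum(axis=0) / in_g.sum())   # expert rates on T(g)
    rates = np.stack(rates)                                    # shape (num_groups, d)
    return bool(np.all(rates.max(axis=0) - rates.min(axis=0) <= tol))

# toy example with two groups of two rounds each:
# expert 0 has average loss 1/2 in both groups; expert 1 has averages 1 and 0
losses = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
groups = np.array([0, 0, 1, 1])
print(fair_in_isolation(losses, groups))   # False: expert 1 is not balanced
```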
Then on the one hand, the regret of Interleaved Hedge satisfies $R_T = O\big(\sqrt{|\mathcal{G}| T \log d}\big)$. In addition, by virtue of BGLS's $-O(\sqrt{T \log d})$ lower bound for Hedge when fed $T$ rounds (along with the standard upper bound), their Interleaved Hedge enjoys the following group fairness guarantee:
$$\frac{\hat{L}_{\mathcal{T}(g)}}{|\mathcal{T}(g)|} - \frac{\hat{L}_{\mathcal{T}(g')}}{|\mathcal{T}(g')|} = O\bigg( \sqrt{\frac{\log d}{T_0}} \bigg) \quad \text{for all } g, g' \in \mathcal{G} \text{ and } T_0 := \min_g |\mathcal{T}(g)|,$$
where we adopt the notation $\hat{L}_{\mathcal{T}(g)} := \sum_{t \in \mathcal{T}(g)} \hat{\ell}_t$. By using our improved nonnegative best-case lower bound for Hedge with constant learning rate (which, again, is not a new result) together with an optimal, non-anytime worst-case tuning of $\eta$, we can obtain the following improvement over (9):
$$L^*_T \le \hat{L}_T \le (1 + \eta) L^*_T + \frac{\log d}{\eta}.$$
As explained in the appendix (the remaining steps are essentially due to BGLS), we can then obtain the following improved group fairness guarantee:
$$\frac{\hat{L}_{\mathcal{T}(g)}}{|\mathcal{T}(g)|} - \frac{\hat{L}_{\mathcal{T}(g')}}{|\mathcal{T}(g')|} \le 2\sqrt{\frac{\log d}{T_0}} \quad \text{for all } g, g' \in \mathcal{G} \text{ and } T_0 := \min_g |\mathcal{T}(g)|.$$
The above result holds when the learning rate for each group $g$ is set as $\eta(g) = \sqrt{\frac{\log d}{|\mathcal{T}(g)|}}$. Of course, running Hedge instances for each group with the correct constant learning rates means that the algorithm must know $|\mathcal{T}(g)|$ for each group $g \in \mathcal{G}$, at least within a reasonable constant factor. This is a far stronger assumption than the already strong assumption of a known time horizon.

Using our anytime best-case lower bound for Decreasing Hedge, combined with the analysis of BGLS, it is straightforward to vastly extend their results in each of the following ways:

1. Using BGLS's fixed-horizon notion of fairness and a copy of Decreasing Hedge for each group's instance, we can drop the assumption that each group's cardinality $|\mathcal{T}(g)|$ is known.

2. The most interesting extension, whose possibility was the original basis of our entire work, is that we can now upgrade BGLS's notion of group fairness to its anytime sibling. This involves measuring, for every prefix of the length-$T$ game, the discrepancy between the error rates of any pair of groups. This is an arguably more natural notion of fairness, as it avoids situations where an expert purports to be fair while having all of its mistakes for one group occur in the first half of the game. Since we now have an anytime best-case lower bound for Decreasing Hedge, we have the requisite piece needed to show that Interleaved (Decreasing) Hedge satisfies the same notion of anytime group fairness. Our timeless best-case lower bounds also apply here, giving that extension as well.

We now briefly sketch each of these extensions, leaving more detailed derivations to the appendix.

Decreasing Hedge (anytime results). Suppose now that Interleaved Hedge uses copies of Decreasing Hedge with time-varying learning rate $\eta_t = \sqrt{2 (\log d)/(t+1)}$. Note that in this case, the copy for group $g$ increments its internal round only each time a new round for group $g$ appears (a code sketch of this interleaving appears below). We can then automatically apply our anytime lower bound on the regret of Decreasing Hedge to obtain
$$\frac{\hat{L}_{\mathcal{T}(g)}}{|\mathcal{T}(g)|} - \frac{\hat{L}_{\mathcal{T}(g')}}{|\mathcal{T}(g')|} \le 2\sqrt{\frac{\log d}{|\mathcal{T}(g)|}} + \frac{1}{\sqrt{2}}\sqrt{\frac{\log d}{|\mathcal{T}(g')|}}$$
for $g'$ such that $\frac{\hat{L}_{\mathcal{T}(g')}}{|\mathcal{T}(g')|} = \min_{g \in \mathcal{G}} \frac{\hat{L}_{\mathcal{T}(g)}}{|\mathcal{T}(g)|}$. Therefore, if in hindsight we have $T_0 = \min_{g \in \mathcal{G}} |\mathcal{T}(g)|$, then the following fairness guarantee holds:
$$\frac{\hat{L}_{\mathcal{T}(g)}}{|\mathcal{T}(g)|} - \frac{\hat{L}_{\mathcal{T}(g')}}{|\mathcal{T}(g')|} \le 3\sqrt{\frac{\log d}{T_0}} \quad \text{for all } g, g' \in \mathcal{G}.$$
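The following sketch illustrates the interleaving strategy just described: one Decreasing Hedge copy per group, each advancing its internal round only on that group's rounds, with per-group average learner losses reported at the end. It is a minimal illustration under our own choices (the learning-rate constant, toy data, and all names are ours), not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

def decreasing_hedge_weights(cum_loss, internal_round, d):
    # Decreasing Hedge on its own clock: eta = sqrt(2 * log(d) / (round + 1)).
    eta = np.sqrt(2.0 * np.log(d) / (internal_round + 1))
    w = np.exp(-eta * cum_loss)
    return w / w.sum()

def interleaved_hedge(groups, losses, d):
    """One Decreasing Hedge copy per group; the copy for group g advances its
    internal round only on rounds where group g appears."""
    cum = defaultdict(lambda: np.zeros(d))    # per-group cumulative expert losses
    rounds = defaultdict(int)                 # per-group internal round counters
    total = defaultdict(float)                # per-group cumulative learner loss
    count = defaultdict(int)
    for g, loss in zip(groups, losses):
        w = decreasing_hedge_weights(cum[g], rounds[g], d)
        total[g] += float(w @ loss)
        count[g] += 1
        cum[g] += loss
        rounds[g] += 1
    return {g: total[g] / count[g] for g in count}   # per-group average learner loss

rng = np.random.default_rng(1)
d, T = 4, 4000
groups = rng.integers(0, 2, size=T)     # two groups, revealed each round by Nature
losses = rng.random((T, d))             # expert losses in [0, 1]^d
rates = interleaved_hedge(groups, losses, d)
print(rates, abs(rates[0] - rates[1]))
```

For data on which the experts are (approximately) fair in isolation, the gap between the reported per-group rates is exactly what the guarantee above controls.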
Note that even though Learner need not know the time horizon $T$ nor $|\mathcal{T}(g)|$ for each $g$, the above guarantee still can only hold for a fixed time horizon $T$. This is because the definition of fairness in isolation only holds for a fixed time horizon.

Anytime fairness. As mentioned earlier, the fixed-horizon view of fairness in isolation can be very limiting. Rather than opting for fairness in isolation to hold for a fixed time horizon, we argue that it is more natural for this definition to hold in the following anytime sense:
$$\frac{L_{\mathcal{T}(g),j}}{|\mathcal{T}(g)|} = \frac{L_{\mathcal{T}(g'),j}}{|\mathcal{T}(g')|} \quad \text{for all } g, g' \in \mathcal{G} \text{ and for all } T;$$
note that the time horizon $T$ is implicit in each $\mathcal{T}(g)$. The attentive reader may notice that this definition can be overly restrictive on Nature (perhaps impossibly so), and therefore it is natural to allow the equality to hold only approximately (which BGLS did explore). Just as in BGLS's work, it is possible to extend our results to cases where fairness holds only approximately, and such an extension is certainly warranted in the case of anytime fairness. This extension requires only straightforward modifications to the analysis and so we do not give further details here. In the case of anytime fairness, it further makes sense for approximate fairness to be defined according to a rate. We leave this extension to a future, longer paper.

Using the above anytime version of fairness in isolation, it is straightforward to extend the previous result to the following new result. Just like above, suppose now that Interleaved Hedge uses copies of Decreasing Hedge with a time-varying learning rate. Then, for all time horizons $T$,⁵
$$\frac{\hat{L}_{\mathcal{T}(g)}}{|\mathcal{T}(g)|} - \frac{\hat{L}_{\mathcal{T}(g')}}{|\mathcal{T}(g')|} \le 3\sqrt{\frac{\log d}{T_0}} \quad \text{for all } g, g' \in \mathcal{G}.$$

⁵ Note that in the above, each $\mathcal{T}(g)$ (and hence $T_0$) implicitly depends on the time horizon $T$.

Mini-conclusion. Whereas a key message of [4] is that adaptive algorithms (in the sense of low shifting regret) cannot satisfy group fairness, at least when using the interleaved approach, our best-case lower bounds do cover many other types of adaptivity. This shows that, with regards to group fairness and the interleaved strategy, the group-fair adversarial online learning tent is actually quite large.

4.3 Adaptive gradient FTRL

Inspired by the adaptive gradient (AdaGrad) algorithm for OCO [8], we consider an adaptive gradient FTRL algorithm with quadratic regularizers, $\Phi_t(w) = \frac{1}{2\eta} \langle w, H_t w \rangle$; this corresponds to the adaptive FTRL strategy (1) with regularizers $\Phi_t$ as above. We include its two variants. In the below, we use the notation $g_t := \nabla f_t(w_t)$, and $\delta > 0$ is a fixed number:

(a) Diagonal: $H_t = \delta I + \mathrm{Diag}(s_t)$, where $s_{t,j} = \sqrt{\sum_{\tau=1}^{t} g_{\tau,j}^2}$;

(b) Full-matrix: $H_t = \delta I + G_t^{1/2}$, where $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$.

We emphasize that the proposed algorithm does not exactly match AdaGrad from [8]. To resolve this inconsistency, we use the regret upper bounds from Theorem 1, combined with upper bounds on the gradient norms that appear in such upper bounds, proved in [8]. This leads to the following result.

Theorem 3 (From [8]). Consider the setting of OCO. Given $1 \le p \le \infty$, we denote by $D_p$ the diameter of $\mathcal{W}$ in the $\ell_p$ norm, and $M_p$ is a bound on the $\ell_p$ norm of the gradients for any possible loss. Then the adaptive gradient FTRL algorithm satisfies the following bounds:

(a) Diagonal: If $\delta = M_\infty$, then $\sum_{t=1}^{T} \|\nabla f_t(w_t)\|_{(t-1),\star}^2 \le 2\eta \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,j}^2}$. In particular, setting $\eta = D_\infty$, we obtain an upper bound on the regret of $\frac{D_2^2 M_\infty}{2 D_\infty} + D_\infty \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,j}^2}$.

(b) Full-matrix: If $\delta = M_2$, then $\sum_{t=1}^{T} \|\nabla f_t(w_t)\|_{(t-1),\star}^2 \le 2\eta \, \mathrm{Tr}\big(G_T^{1/2}\big)$.
In particular, setting $\eta = D_2$, we obtain an upper bound on the regret of $D_2\big[ \tfrac{M_2}{2} + \mathrm{Tr}\big(G_T^{1/2}\big) \big]$.

We proceed to best-case lower bounds, deferring the proof to Appendix D. The proof strategy follows similar telescopic recursions to those used in [8, 16]. Notice that these bounds closely match the respective worst-case upper bounds.

Proposition 4. In the OCO setting, the adaptive gradient FTRL strategy with parameter tuning as in Theorem 3 attains best-case lower bounds on the regret of $-\frac{D_\infty}{2} \sum_{j=1}^{d} \sqrt{\sum_{t \in [T]} g_{t,j}^2}$ in the diagonal case and $-\frac{D_2}{2} \mathrm{Tr}\big(G_T^{1/2}\big)$ in the full-matrix case.

5 On best-case lower bounds for other algorithms

So far, our study of best-case lower bounds has focused on adaptive FTRL algorithms. A major drawback of FTRL is that the iteration cost of each subproblem grows linearly with the number of rounds. It is then tempting to consider more efficient algorithms attaining comparable regret upper bounds. We discuss such possibilities for two natural algorithms: linearized FTRL and OMD.

5.1 Linearized FTRL

In this section, we give a simple construction in which linearized FTRL obtains negative linear regret, i.e., regret which is $-\Omega(T)$. The problem is a forecasting game with binary outcomes, the action space equal to the 2-simplex (identified with $[0,1]$), and the squared loss $\ell(p, y) = \frac{1}{2}(p - y)^2$. This can be cast as an OCO problem by taking $\mathcal{W} = [0,1]$ and, for each $t \in [T]$, setting $f_t(p) = \frac{1}{2}(p - y_t)^2$ for some outcome $y_t \in \{0, 1\}$ selected by Nature. We consider linearized FTRL using the negative Shannon entropy regularizer; this is equivalent to exponentiated gradient descent [12] and uses the update $p_t = \frac{1}{1 + \exp(\eta_t G_{t-1})}$, where $G_{t-1} = \sum_{s=1}^{t-1} g_s$ and each $g_s = p_s - y_s$ is the gradient of the loss with respect to $p_s$ under outcome $y_s$. In this situation, an $O(\sqrt{T})$ anytime regret upper bound can be obtained by employing the time-varying learning rate $\eta_t = 1/\sqrt{t}$ (see Theorem 2.3 of [5] or Theorem 2 of [6]).

Let Nature's outcome sequence be the piecewise constant sequence consisting of $T/2$ zeros followed by $T/2$ ones. Therefore, the best action in hindsight is $p^\star = 0.5$. We now state our negative result and briefly sketch a proof; the full proof can be found in Appendix E.

Theorem 5. In the construction above, linearized FTRL obtains regret $R_T = -\Omega(T)$.

Proof (sketch). The core idea behind the proof is that linearized FTRL exhibits a type of switching behavior. Let $q_0$ and $q_1$ be constants satisfying $0 < q_0 < 1/2 < q_1 < 1$. Observe that in the first half of the game (considered in isolation), the optimal action is 0. Thus, if $p_t$ drops below $q_0$ in the first half of the game, then for all rounds thereafter in the first half, Learner picks up negative constant regret in each round. We show that this drop happens after only a constant number of rounds for a simple reason: the gradient must be at least $q_0$ whenever $p_t > q_0$ in the first half, and $p_t$ changes according to the exponentiated cumulative gradient. Thus, Learner picks up negative linear regret in the first half of the game. Next, consider the second half of the game. For this half (considered in isolation), the optimal action is 1. Thus, once $p_t$ surpasses $q_1$, for all rounds thereafter Learner will again pick up negative constant regret. Via an arguably coarse analysis, we show that the number of rounds in the second half before Learner's action $p_t$ surpasses $q_1$ is at most a small constant fraction of $T$.
To show this, we rely on the fact that whenever $p_t < q_1$ in the second half, the gradient must be at most $-(1 - q_1)$, i.e., at least $1 - q_1$ in magnitude (combined with the exponential effect of the gradients on $p_t$). That is, Learner quickly switches from having $p_t$ below $q_0$ to having $p_t$ above $q_1$. Thus, for most of the second half, Learner again picks up negative linear regret.

5.2 Online mirror descent

For OLO over the simplex, the online mirror descent method with entropic regularization and a constant learning rate attains a best-case lower bound of $-O(\sqrt{T \log d})$ [4]. This algorithm is known to attain upper bounds on the regret of the same order. The extension of this result to non-increasing learning rates,⁶ albeit possible, seems of limited interest, as this algorithm is known to achieve linear regret in the worst case [18], which is attributed to the unboundedness of the Bregman divergence.

⁶ In fact, we have found a simple argument showing the regret is nonnegative, strengthening the result in [4].

6 Best-case loss sequence for binary DTOL with two experts

In this section, we characterize the binary loss sequence for DTOL with two experts on which the regret of Decreasing Hedge is minimized. Recall that Decreasing Hedge chooses expert $i$ at round $t$ with probability
$$p_{i,t} = \frac{\exp(-\eta_t L_{i,t-1})}{\sum_{j \in \{1,2\}} \exp(-\eta_t L_{j,t-1})}, \quad \text{where } \eta_t = \sqrt{1/t}.$$
Losses are of the form $\ell_t = \binom{\ell_{1,t}}{\ell_{2,t}}$ where $\ell_{1,t}, \ell_{2,t} \in \{0,1\}$. Note that $\eta_{t+1} < \eta_t$ for all $t$.

Our approach is to introduce a set of operations on a sequence of binary loss vectors $\ell_1, \ell_2, \ldots, \ell_T$ that do not increase the regret of Decreasing Hedge. Using this set of operations, we will show that any loss sequence with $T = 2K$ rounds, $\ell_1, \ell_2, \ldots, \ell_{2K}$, can be converted to a specific loss sequence which we call the canonical best-case sequence.

Definition 6. For any length $T = 2K \ge 16$, the canonical best-case loss sequence is defined as
$$\mathrm{Canonical}(2K) = \binom{0}{1}^{K-1} \binom{0}{0}^{2} \binom{1}{0}^{K-1}. \qquad (10)$$

We now sketch our proof that (10) is indeed the best-case sequence. The full proof is in Appendix F. First, consider an arbitrary binary loss sequence for two experts, $\ell_1, \ldots, \ell_T$. We can substitute all $\binom{1}{1}$ loss vectors with $\binom{0}{0}$ loss vectors, as it can easily be shown that this operation does not change the regret. Thus, there is no need for $\binom{1}{1}$ loss vectors in the best-case sequence.

Next, we introduce the notion of a leader change and show that there is no leader change in the best-case sequence. Therefore, we only need to consider loss sequences without leader changes. Formally, we say expert $j$ is a leader at round $t$ if $L_{j,t} = \min_{i \in \{1,2\}} L_{i,t}$. Moreover, if $j$ is the only leader at round $t$, then we call $i^\star_t = j$ the strict leader in round $t$. We say that a sequence has a leader change if there exist times $t_1$ and $t_2$ at which the strict leader is defined⁷ and $i^\star_{t_1} \ne i^\star_{t_2}$. Defining $\Delta_t := L_{2,t} - L_{1,t}$, observe that if a sequence has a leader change, then the sign of $\Delta$ should change at least once. We show (Lemma 15) that any loss sequence $\ell_1, \ell_2, \ldots, \ell_T$ can be converted, without increasing the regret, to a sequence $\ell'_1, \ell'_2, \ldots, \ell'_T$ where $\Delta$ is always nonnegative (hence, the resulting sequence has no leader change). Removing all leader changes facilitates the next operation to successfully convert loss sequences.

⁷ If in round $t$, $L_{1,t} = L_{2,t}$, then the strict leader is not defined in that round.

Next, consider the operation of swapping two consecutive loss vectors. When we swap a loss sequence $\ell_1, \ldots, \ell_{t-1}, \ell_t, \ell_{t+1}, \ell_{t+2}, \ldots, \ell_T$ at rounds $(t, t+1)$, the resulting sequence becomes $\ell_1, \ldots, \ell_{t-1}, \ell_{t+1}, \ell_t, \ell_{t+2}, \ldots, \ell_T$.
It can easily be shown that this swap only changes the loss of Decreasing Hedge at rounds $t$ and $t+1$. Since we only consider loss sequences with no leader change, we can assume without loss of generality that expert 1 is always a leader. Now, based on the swapping rule lemma (Lemma 16), we can always move $\binom{0}{1}$ loss vectors to earlier rounds and $\binom{1}{0}$ loss vectors to later rounds. By repeatedly applying this swapping rule, the resulting loss sequence will be of the form $\ell_1, \ldots, \ell_T = \binom{0}{1}^{a} \binom{0}{0}^{c} \binom{1}{0}^{b}$. Note that, as we have shown that $\Delta$ is always nonnegative, $a \ge b$.

So far, we have shown that any loss sequence can be converted to a loss sequence of the form $\ell_1, \ldots, \ell_T = \binom{0}{1}^{a} \binom{0}{0}^{c} \binom{1}{0}^{b}$. We then show in Lemma 17 that if $c$ is not even, then the sequence $\binom{0}{1}^{a-1} \binom{0}{0}^{c+1} \binom{1}{0}^{b}$ has smaller regret and $a - 1 \ge b$. Therefore, we only need to consider loss sequences of the form $\binom{0}{1}^{a} \binom{0}{0}^{c} \binom{1}{0}^{b}$ where $a \ge b$ and $c$ is even. As we assume $T$ is even, $a + b$ is also even. It is then shown (in Lemma 18) that among all loss sequences of the form $\ell_1, \ldots, \ell_T = \binom{0}{1}^{a} \binom{0}{0}^{c} \binom{1}{0}^{b}$ with $a + b = 2K$, the loss sequence $\binom{0}{1}^{K} \binom{0}{0}^{c} \binom{1}{0}^{K}$ has the least regret. Therefore, we only need to consider loss sequences of this latter form, where $2K + c = T$. Finally, among all possible $(K, c)$ for loss sequences $\binom{0}{1}^{K} \binom{0}{0}^{c} \binom{1}{0}^{K}$ such that $2K + c = T$, we show that the regret is minimized when $c = 2$. This form coincides with the canonical best-case sequence.

Bounding the regret. As shown in Appendix F.6, the regret on the canonical best-case loss sequence is lower and upper bounded by negative constant multiples of $\sqrt{T}$, up to an additive constant (the precise constants are derived there). As both the lower and upper bounds are of order $\sqrt{T}$ in magnitude, the analysis in Section 3 is tight in this case.

7 Discussion

In this work, we provided a systematic treatment of best-case lower bounds in online learning for adaptive FTRL algorithms, discussed the impossibility of certain natural extensions, and also provided a tighter analysis of such lower bounds in the binary prediction experts setting. As one application, we have shown that our results for adaptive FTRL enable the use of a broad class of adaptive online learning algorithms that satisfy a balanced-mistakes notion of group fairness. Naturally, many questions still remain open, and we hope this work motivates further research on this intriguing topic.

A first question relates to algorithms that can achieve negative regret: can we characterize the conditions under which this happens? The goal would be to reveal, beyond the particular example we gave in Section 5.1, structural properties of an instance that lead to substantial outperformance of the best fixed action. Returning to that example, another question arises. We have shown an example of OCO with a strongly convex loss (the squared loss) for which linearized FTRL obtains negative linear regret, whereas our best-case lower bounds for (non-linearized) FTRL with a constant learning rate imply nonnegative regret, and known upper bounds imply $O(\log T)$ regret. Thus, while in this situation linearized FTRL pays a price with respect to regret upper bounds, it can exhibit a switching behavior that might allow it to compete with a shifting sequence of comparators. Investigating this blessing of linearization would be a fascinating direction for future work.

Finally, for DTOL with two experts, we showed that our best-case lower bounds are tight by constructing the best-case sequence.
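As a quick empirical companion to Section 6, the sketch below runs two-expert Decreasing Hedge (with $\eta_t = \sqrt{1/t}$, as recalled above) on Canonical(2K) and compares its regret with the best regret found over random binary loss sequences; the $\sqrt{T}$ scaling check and all names here are our own illustrative choices, not part of the paper.

```python
import numpy as np

def decreasing_hedge_regret(losses):
    """Run two-expert Decreasing Hedge (eta_t = sqrt(1/t)) on a sequence of
    binary loss vectors and return its regret against the best expert."""
    L = np.zeros(2)          # cumulative expert losses
    learner = 0.0
    for t, loss in enumerate(losses, start=1):
        eta = np.sqrt(1.0 / t)
        p = np.exp(-eta * L)
        p /= p.sum()
        learner += float(p @ loss)
        L += loss
    return learner - L.min()

def canonical(two_k):
    """Canonical(2K) from Definition 6: (0,1)^(K-1), then (0,0)^2, then (1,0)^(K-1)."""
    k = two_k // 2
    return [(0, 1)] * (k - 1) + [(0, 0)] * 2 + [(1, 0)] * (k - 1)

rng = np.random.default_rng(0)
for T in (64, 256, 1024, 4096):
    canon = decreasing_hedge_regret(canonical(T))
    best_random = min(decreasing_hedge_regret(rng.integers(0, 2, size=(T, 2)))
                      for _ in range(20))
    print(T, round(canon, 2), round(canon / np.sqrt(T), 3), round(best_random, 2))
```

The canonical sequence's regret should come out markedly negative and scale like a constant times $-\sqrt{T}$, consistent with the bounds above, whereas random sequences rarely get close.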
It would be interesting to show tightness without an explicit construction, as we could then say that the best-case lower bounds are achievable more generally. For instance, while our swapping-based technique is most related to that of [20] for worst-case sequences and of [15] (who give both worst- and best-case sequences), it could also be worthwhile to draw from previous works [19, 13, 3, 14] which explicitly work out the minimax regret and minimax strategies. Even so, our results on identifying the structure of the best-case loss sequence show that Decreasing Hedge has the lowest regret for loss sequences for which the best shifting sequence of experts has precisely one shift. One research direction would be to see if this pattern holds more generally, for other FTRL-type algorithms.

We conclude our discussion by pointing out that the theoretical nature of our work does not entail direct negative societal consequences. As far as our discussions on group fairness are concerned, one should be careful when applying these methods, as the notion of balanced mistakes may be susceptible to strong forms of discrimination (see, e.g., the Discussion section in [4]). In particular, balanced mistakes is a sound notion of non-discrimination when all examples that contribute towards the performance also contribute towards fairness.

Acknowledgments and Disclosure of Funding

CG's work was partially supported by FONDECYT Regular project 1210362, INRIA through the INRIA Associate Teams project, and ANID Millennium Science Initiative Program NCN17_059. NM and AM were supported by the NSERC Discovery Grant RGPIN-2018-03942. We also thank Sajjad Azami for his involvement in the earlier stages of this work.

References

[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory (COLT 2008), pages 263-273, 2008.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.
[3] Peter L. Bartlett, Wouter M. Koolen, Alan Malek, Eiji Takimoto, and Manfred K. Warmuth. Minimax fixed-design linear regression. In Conference on Learning Theory, pages 226-239. PMLR, 2015.
[4] Avrim Blum, Suriya Gunasekar, Thodoris Lykouris, and Nati Srebro. On preserving non-discrimination when combining expert advice. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/2e855f9489df0712b4bd8ea9e2848c5a-Paper.pdf.
[5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[6] Alexey Chernov and Fedor Zhdanov. Prediction with expert advice under discounted loss. In International Conference on Algorithmic Learning Theory, pages 255-269. Springer, 2010.
[7] Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(1):1281-1316, 2014.
[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[9] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[10] Eyal Gofer and Yishay Mansour. Lower bounds on individual sequence regret. Machine Learning, 103(1):1-26, 2016.
[11] Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.
[12] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.
[13] Wouter M. Koolen, Alan Malek, and Peter L. Bartlett. Efficient minimax strategies for square loss games. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper/2014/file/8d420fa35754d1f1c19969c88780314d-Paper.pdf.
[14] Wouter M. Koolen, Alan Malek, Peter L. Bartlett, and Yasin Abbasi-Yadkori. Minimax time series prediction. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/4dcf435435894a4d0972046fc566af76-Paper.pdf.
[15] Yuval Lewi, Haim Kaplan, and Yishay Mansour. Thompson sampling for adversarial bit prediction. In Algorithmic Learning Theory, pages 518-553. PMLR, 2020.
[16] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. Journal of Machine Learning Research, 18(90):1-50, 2017. URL http://jmlr.org/papers/v18/14-428.html.
[17] Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the Hedge algorithm in the stochastic regime. Journal of Machine Learning Research, 20:1-28, 2019.
[18] Francesco Orabona and Dávid Pál. Scale-free online learning. Theoretical Computer Science, 716:50-69, 2018.
[19] Eiji Takimoto and Manfred K. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 100-106, 2000.
[20] Tim van Erven, Wojciech Kotłowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In Conference on Learning Theory, pages 949-974. PMLR, 2014.