# A Regret-Variance Trade-Off in Online Learning

Dirk van der Hoeven (dirk@dirkvanderhoeven.com), Dept. of Computer Science, Università degli Studi di Milano, Italy
Nikita Zhivotovskiy (zhivotovskiy@berkeley.edu), Dept. of Statistics, University of California, Berkeley
Nicolò Cesa-Bianchi (nicolo.cesa-bianchi@unimi.it), Dept. of Computer Science, Università degli Studi di Milano, Italy

*36th Conference on Neural Information Processing Systems (NeurIPS 2022).*

**Abstract.** We consider prediction with expert advice for strongly convex and bounded losses, and investigate trade-offs between regret and variance (i.e., the squared difference between the learner's predictions and the best expert's predictions). With $K$ experts, the Exponentially Weighted Average (EWA) algorithm is known to achieve $O(\log K)$ regret. We prove that a variant of EWA either achieves a negative regret (i.e., the algorithm outperforms the best expert), or guarantees an $O(\log K)$ bound on both variance and regret. Building on this result, we show several examples of how the variance of predictions can be exploited in learning. In the online to batch analysis, we show that a large empirical variance allows us to stop the online to batch conversion early and outperform the risk of the best predictor in the class. We also recover the optimal rate of model selection aggregation when we do not consider early stopping. In online prediction with corrupted losses, we show that the effect of corruption on the regret can be compensated by a large variance. In online selective sampling, we design an algorithm that samples less when the variance is large, while guaranteeing the optimal regret bound in expectation. In online learning with abstention, we use a term similar to the variance to derive the first high-probability $O(\log K)$ regret bound in this setting. Finally, we extend our results to the setting of online linear regression.

## 1 Introduction

In the online learning protocol, the learner interacts with an unknown environment in a sequence of rounds. In each round $t = 1, 2, \ldots, T$ the learner makes a prediction $\hat{y}_t \in [-\tfrac{M}{2}, \tfrac{M}{2}]$ and suffers loss $\ell_t(\hat{y}_t)$, where $\ell_1, \ell_2, \ldots$ is a sequence of differentiable loss functions unknown to the learner. The goal is to achieve small regret $R_T$, defined by
$$R_T = \sum_{t=1}^{T} \big(\ell_t(\hat{y}_t) - \ell_t(y^{\star}_t)\big),$$
where the $y^{\star}_t$ are the predictions of some reference forecaster. We consider two special cases of online learning. In prediction with expert advice, the learner receives the predictions $y_t(1), \ldots, y_t(K) \in [-\tfrac{M}{2}, \tfrac{M}{2}]$ of $K$ experts at the beginning of each round $t$. The learner then predicts with a convex combination $\hat{y}_t = \sum_{i=1}^{K} p_t(i)\, y_t(i)$ of the experts' predictions, where $p_t = (p_t(1), \ldots, p_t(K))$ is a probability distribution over experts, updated after each round. The goal in this setting is to have small regret with respect to the predictions $y^{\star}_t = y_t(i^{\star})$ of any fixed expert $i^{\star}$. In online linear regression, the learner has access to a feature vector $x_t \in \mathbb{R}^d$ at each round. This is used to compute predictions $\hat{y}_t = \langle w_t, x_t \rangle$, where $w_t \in \mathbb{R}^d$ is a parameter vector to be updated at the end of the round. The goal is to have small regret with respect to the predictions $y^{\star}_t = \langle u, x_t \rangle$ of any fixed linear forecaster $u \in \mathbb{R}^d$ such that $\|u\|_2 \le D$ and $|y^{\star}_t| \le \tfrac{M}{2}$ for all $t$. In both settings, we assume that the losses $\ell_t$ are $\mu$-strongly convex, which is to say that there exists $\mu > 0$ such that for all $y, x \in [-\tfrac{M}{2}, \tfrac{M}{2}]$,
$$\ell_t(y) - \ell_t(x) \;\le\; (y - x)\,\ell'_t(y) - \frac{\mu}{2}(x - y)^2, \qquad t = 1, \ldots, T, \tag{1}$$
where we write $\ell'_t$ to denote the derivative of $\ell_t$.
Without loss of generality, throughout the paper we assume that $\mu \le 2$. A typical example of a strongly convex loss is the squared loss $\ell_t(y) = (y - y_t)^2$, with $\mu = 2$. While standard online learning algorithms, such as the Exponentially Weighted Average (EWA) algorithm [Vovk, 1990, Littlestone and Warmuth, 1994], are directly applied to the (mixable) losses $\ell_t$, we take a different approach. Our algorithms provide bounds on the linearized regret $\tilde{R}_T$ of the form
$$\tilde{R}_T = \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)\,\ell'_t(\hat{y}_t) \;\le\; \frac{C_T}{\eta} + B_T + \eta \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)^2 \tag{2}$$
for some $\eta \in (0, H]$ to be chosen freely by the learner, where $H, C_T, B_T \ge 0$ are problem-specific parameters. In the expert setting, several algorithms provide such a guarantee: Squint [Koolen and Van Erven, 2015], Adapt-ML-Prod [Gaillard et al., 2014], BOA [Wintenberger, 2017], and Squint+C and Squint+L [Mhammedi et al., 2019]. In online linear regression, the MetaGrad algorithm [Van Erven et al., 2021] and its variants by Mhammedi et al. [2019], Wang et al. [2020], and Chen et al. [2021] satisfy bounds like (2). Although in both settings the aforementioned algorithms achieve the optimal tuning of $\eta$ in (2) without any preliminary information, our applications do not require this tuning property. As a consequence, the algorithms we derive in Section 3 are simpler and have slightly better guarantees. More importantly, unlike the aforementioned algorithms, we focus on exploiting the curvature of the loss. In particular, by combining equations (1) and (2), we obtain the following lemma, which features the central inequality of our work.

**Lemma 1.** Consider a sequence $\ell_1, \ldots, \ell_T$ of $\mu$-strongly convex differentiable losses. Suppose that the predictions $\hat{y}_t$ guarantee the second-order bound (2). Then
$$R_T \;\le\; \frac{C_T}{\eta} + B_T - \Big(\frac{\mu}{2} - \eta\Big) \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)^2 .$$

Next, we show a prototypical example of the bounds we derive in the rest of this work.

**Example 1.** Consider the expert setting with the squared loss, $\ell_t(y) = (y - y_t)^2$, where $y, y_t \in [-1, 1]$ for all $t \ge 1$. Then the predictions $\hat{y}_t$ of our algorithm satisfy
$$R_T \;\le\; 32 \log(K) - \frac{1}{2} \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)^2 . \tag{3}$$

Even though (3) seems relatively inconsequential, we exploit the negative variance term in several applications. A straightforward implication is that (3) recovers the usual constant regret bound $R_T = O(\log(K))$. More interestingly, however:

- if we attain the worst-case performance $O(\log(K))$, then the variance is small and bounded by $O(\log(K))$;
- if the variance is large enough, then the regret becomes negative.

For the role of negative variance terms in the analysis of statistical and online learning, we refer to Section 2.

**Remark.** Negative quadratic terms similar to the one in Lemma 1 appear in online convex optimization, for example in the analysis of online gradient descent [Hazan, 2016, Theorem 3.3]. However, as far as we are aware, in online convex optimization the algorithms are not tuned to obtain a negative term in the regret bounds. In this paper we argue that for online regression, and especially for prediction with expert advice, setting $\eta < \mu/2$ in Lemma 1 and thus obtaining a negative term can be a powerful tool, which plays a central role in all of our applications.
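Lemma 1 follows by summing (1) with $y = \hat{y}_t$ and $x = y^{\star}_t$ and then applying (2); since the trade-off in all later sections rests on this single step, we spell it out:

$$
R_T = \sum_{t=1}^{T}\big(\ell_t(\hat{y}_t) - \ell_t(y^{\star}_t)\big)
\;\overset{(1)}{\le}\; \sum_{t=1}^{T}(\hat{y}_t - y^{\star}_t)\,\ell'_t(\hat{y}_t) - \frac{\mu}{2}\sum_{t=1}^{T}(\hat{y}_t - y^{\star}_t)^2
\;\overset{(2)}{\le}\; \frac{C_T}{\eta} + B_T - \Big(\frac{\mu}{2} - \eta\Big)\sum_{t=1}^{T}(\hat{y}_t - y^{\star}_t)^2 .
$$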
In prediction with expert advice, the standard approach to exploiting the curvature of the loss is through mixability rather than through negative quadratic terms. In particular, we show that the regret bound in (3) cannot be achieved by EWA with a fixed learning rate, despite the fact that constant regret is achievable by this algorithm via mixability.

**Proposition 1 (Informal).** Consider the setup of online prediction with expert advice and bounded strongly convex losses. The bound (3) cannot be achieved by the standard EWA algorithm.

For the sake of presentation, we defer the formal description of this result to Section 3.1. The proof of Proposition 1 is motivated by a result of Audibert [2007] on the sub-optimality of online to batch converted EWA in deviation.

**Contributions and Outline.** In Section 4, we show our first three applications of Lemma 1 in the framework of statistical learning. In the expert setting, we show that the online to batch conversion may be stopped early if the empirical variance of our predictions is sufficiently large. In particular, in high-variance regimes we may stop early because the excess risk bound is negative with high probability, and when the variance is small, we recover the optimal excess risk bound up to a $\log \log T$ factor. By exploiting the negative variance term again, we show an optimal high-probability excess risk bound for the online to batch conversion of algorithms that satisfy (2). The optimal high-probability excess risk bound (called the optimal rate of model selection aggregation) was previously known to be achieved by several estimators appearing in [Audibert, 2007, Lecué and Mendelson, 2009, Lecué and Rigollet, 2014], whose analyses are specific to the statistical learning setup. Our result can also be seen as a simplification and strengthening of a result by Wintenberger [2017]. We also show a high-probability excess risk bound for the online to batch conversion of an online regression algorithm in the bounded setup, which is to say that both the feature vectors and the derivatives of the losses are bounded. It was previously shown by Mourtada et al. [2021] that online to batch converted versions of the optimal Vovk-Azoury-Warmuth forecaster [Vovk, 2001, Azoury and Warmuth, 2001] have constant excess risk with constant probability. In Section 5, we use the negative variance term to counteract corrupted feedback in online learning. Our result complements previous results where losses are assumed to be stochastic and corrupted by an adversary; see, e.g., Lykouris et al. [2018], Zimmert and Seldin [2021], Amir et al. [2020], Ito [2021]. In Section 6 we consider the selective sampling setting [Atlas et al., 1989], where the learner's goal is to control regret while saving on the number of times the current loss is observed. We show that the optimal bound can be recovered while only observing a fraction of all losses, provided losses are observed proportionally to the cumulative variance. Finally, in Section 7 we discuss how our ideas are not limited to online learning with strongly convex losses, but may also be applied to online learning with abstention and to online multiclass classification. All bounds in the main text with suppressed constants have detailed statements in the appendix. Before discussing our applications, we introduce some notation and discuss the related work. The algorithms that we use to derive most of our results are introduced in Section 3.

**Notation.** We use the standard $O(\cdot)$ to suppress constants and $\tilde{O}(\cdot)$ to suppress constants and $\log(T)$ factors. We use $\log(\cdot)$ to denote the logarithm with base $e$. The symbol $\mathbb{1}[A]$ denotes the indicator of the event $A$. For an integer $K$ we denote $[K] = \{1, \ldots, K\}$. We use $p(i) \propto g(i)$ to denote $p(i) = g(i) \big/ \sum_{i'=1}^{K} g(i')$, and $\partial_y \ell(y, Y)$ to denote the derivative of $\ell$ with respect to $y$. The symbol $I$ denotes the identity matrix, whose dimensions are clear from the context. For a set of random variables $Y, X_1, \ldots, X_{t-1}$ we write $\mathbb{E}_{t-1}[Y] = \mathbb{E}[Y \mid X_1, \ldots, X_{t-1}]$.
## 2 Related work

**Online learning with losses with curvature.** For a thorough introduction to online learning, we refer the reader to [Cesa-Bianchi and Lugosi, 2006, Hazan, 2016, Orabona, 2019]. Strongly convex losses are a special case of mixable losses. With mixable losses, EWA on a finite set of experts achieves $O(\log(K))$ regret; see, e.g., [Cesa-Bianchi and Lugosi, 2006, Mhammedi and Williamson, 2018]. Mhammedi and Williamson [2018] observe that in some cases the regret may be negative for mixable losses, but they do not explore this topic further.

**The role of negative terms due to curvature in online and statistical learning.** Although bounds of the form (3) are not explicitly present in the literature, the appearance of negative terms is common when proving fast rates in statistical learning with the squared loss. In particular, the empirical star algorithm of Audibert [2007], as well as other aggregation algorithms [Lecué and Mendelson, 2009, Lecué and Rigollet, 2014, Wintenberger, 2017], exploit the curvature of the loss through a negative term which compensates the variance term. We also refer to some more recent statistical learning results whose analysis is based on the same idea [Mendelson, 2019, Bousquet and Zhivotovskiy, 2021, Puchkin and Zhivotovskiy, 2022, Saad and Blanchard, 2021]. Similarly, in the context of online learning, a negative quadratic term appears in Rakhlin and Sridharan [2014], where the so-called sequential offset Rademacher complexity is studied. Van Erven et al. [2021] also obtain a bound that is very similar to the bound in Lemma 1 (specifically in the proof of their Theorem 1), but they choose $\eta$ to match the negative term in their bound. Importantly, in these papers the role of the negative term is only to obtain the fast rate by compensating the variance term. In contrast, in our case the negative variance terms also appear in the final regret bound and play their role in the applications.

**Suboptimality of EWA for prediction with expert advice.** In the setup of prediction with expert advice, the classical Exponentially Weighted Average (EWA) algorithm [Vovk, 1990, Littlestone and Warmuth, 1994] is known to give constant regret in the case of strongly convex losses. However, despite being optimal in this setting, this algorithm has several known drawbacks. In the case of general losses, EWA does not deliver the second-order bound (2). As a result, unlike more advanced algorithms such as Squint [Koolen and Van Erven, 2015], EWA with a fixed learning rate (and even with a decreasing learning rate) cannot adapt to certain benign stochastic environments where, for example, the Bernstein assumption holds [Mourtada and Gaïffas, 2019]. The second source of suboptimality comes from online to batch conversions in the strongly convex case. Although in the statistical setting EWA performs optimally in expectation, it does not do so with high probability [Audibert, 2007]. As a matter of fact, it will be clear from our analysis that this is related to the fact that EWA does not satisfy a bound of the form (3) (see Theorem 1). This problem is one of the motivations behind the work of Wintenberger [2017].
**Exploiting negative regret.** The possibility of getting a negative excess risk has recently been explicitly exploited by Puchkin and Zhivotovskiy [2022, see their Algorithm 3.3 and Theorem 3.4] in the setup of active learning with abstentions. In the context of online to batch conversion of online learning algorithms, a similar idea is exploited in Section 4.1. Moreover, our selective sampling results in Section 6 are of the same flavor.

## 3 Our Algorithms

In the following, $g_1, g_2, \ldots$ are real numbers in a bounded interval. However, in most applications we use $g_t = \ell'_t(\hat{y}_t)$. The algorithms in this section are simplified versions of algorithms in the literature. Namely, in Section 3.1 we present a simplified version of Squint [Koolen and Van Erven, 2015], and in Section 3.2 we present a simplified version of MetaGrad [Van Erven et al., 2021]. The simplifications lie in the tuning of the learning rate. Squint and MetaGrad optimize the learning rate online at a small cost, whereas in our applications we only need a fixed learning rate that is known in advance, in which case we do not pay the small cost for optimizing the learning rate. The proofs of the results in this section are postponed to Appendix A.

### 3.1 A Simple Algorithm for Prediction with Expert Advice

**Algorithm 1: An algorithm for prediction with expert advice**
- **Input:** $\eta \in (0, \tfrac{1}{2}]$, $M > 0$
- **Initialize:** $\gamma = \eta / M^2$, $p_1(i) = \tfrac{1}{K}$ for all $i$
- **For** $t = 1, \ldots, T$:
  - Receive expert predictions $y_t(1), \ldots, y_t(K)$
  - Predict $\hat{y}_t = \sum_{i=1}^{K} p_t(i)\, y_t(i)$
  - Receive $g_t$ and $\kappa_t$
  - Set $\tilde{\ell}_t(i) = \gamma (y_t(i) - \hat{y}_t)\, g_t + \kappa_{t-1} \big(\gamma (y_t(i) - \hat{y}_t)\, g_t\big)^2$
  - Set $p_{t+1}(i) \propto \exp\big(-\kappa_t \sum_{s=1}^{t} \tilde{\ell}_s(i)\big)$

Algorithm 1 is a simplified version of Squint [Koolen and Van Erven, 2015]. The $\kappa_t$ parameter is relevant only to the selective sampling setting, where it is used to control the range of the loss estimates. In all other settings we set $\kappa_t = 1$ for all $t$. Note that in round $t$ both $\kappa_t$ and $\kappa_{t-1}$ are used to update the algorithm. This is due to a technicality in the analysis of the algorithm, where a $\kappa_{t-1} \tilde{\ell}_t(i)$ term appears and we want to use the inequality $x - x^2 \le \log(1 + x)$ for $|x| \le \tfrac{1}{2}$ (specifically in equation (7)). The regret bound of Algorithm 1 can be found in Lemma 2.

**Lemma 2.** For all $g_1, \ldots, g_T \in [-M, M]$ and $\kappa_0 = \kappa_1 \ge \kappa_2 \ge \cdots \ge \kappa_T$ such that $\kappa_t \in (0, 1]$, the predictions $\hat{y}_t$ of Algorithm 1 run with input $M$ and $\eta \in (0, \tfrac{1}{2}]$ satisfy
$$\sum_{t=1}^{T} (\hat{y}_t - y_t(i))\, g_t \;\le\; \frac{M^2 \log(K)}{\eta\, \kappa_T} + \eta \sum_{t=1}^{T} \kappa_{t-1} (\hat{y}_t - y_t(i))^2$$
provided $\max_i |\hat{y}_t - y_t(i)| \le M$ for all $t \ge 1$.

As an immediate corollary of Lemma 2 and Lemma 1 we have the following regret bound.

**Corollary 1.** Fix an arbitrary sequence $\ell_1, \ldots, \ell_T$ of $\mu$-strongly convex differentiable losses such that $\max_t |\ell'_t| \le M$. Provided that $\max_i |\hat{y}_t - y_t(i)| \le M$ for all $t \ge 1$, the predictions $\hat{y}_t$ of Algorithm 1 run with inputs $M$ and $\eta \in (0, \tfrac{1}{2}]$, $\kappa_t = 1$ for all $t$, and feedback $g_t = \ell'_t(\hat{y}_t)$, satisfy
$$R_T \;\le\; \frac{M^2 \log(K)}{\eta} - \Big(\frac{\mu}{2} - \eta\Big) \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)^2 .$$

As an example, let us consider the squared loss, which is 2-strongly convex. In the setup of Example 1, since $|\ell'_t(\hat{y}_t)| = 2|\hat{y}_t - y_t| \le 4$, Algorithm 1 with $g_t = \ell'_t(\hat{y}_t)$ and $\eta = \tfrac{1}{2}$ gives us the regret bound claimed in Example 1, namely $R_T \le 32 \log(K) - \tfrac{1}{2} \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)^2$.
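To make the update concrete, below is a minimal NumPy sketch of Algorithm 1 specialized to the setup of Example 1 (squared loss, $\kappa_t = 1$, $\eta = \tfrac{1}{2}$, $M = 4$). The function name and interface are ours, not the paper's.

```python
import numpy as np

def algorithm1_squared_loss(expert_preds, outcomes, eta=0.5, M=4.0):
    """Sketch of Algorithm 1 (simplified Squint) with kappa_t = 1.

    expert_preds: (T, K) array with y_t(i) in [-1, 1]
    outcomes:     (T,)  array with y_t in [-1, 1]
    Returns the learner's predictions y_hat_t.
    """
    T, K = expert_preds.shape
    gamma = eta / M**2
    cum_surrogate = np.zeros(K)       # cumulative surrogate losses per expert
    y_hat = np.zeros(T)
    for t in range(T):
        w = np.exp(-(cum_surrogate - cum_surrogate.min()))  # shifted for stability
        p = w / w.sum()                       # exponential-weights distribution p_t
        y_hat[t] = p @ expert_preds[t]
        g = 2.0 * (y_hat[t] - outcomes[t])    # g_t: derivative of the squared loss
        r = gamma * (expert_preds[t] - y_hat[t]) * g
        cum_surrogate += r + r**2             # surrogate loss of Algorithm 1
    return y_hat
```

On sequences where one expert is consistently good, the variance $\sum_t (\hat{y}_t - y^{\star}_t)^2$ stays small and the weights concentrate; on sequences that force a large variance, the bound (3) drives the regret negative.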
Our next result is a formal version of Proposition 1, saying that the above regret bound cannot be achieved by the standard EWA algorithm. For standard notation and explicit details on this algorithm we refer to Appendix E.

**Theorem 1.** Consider the squared loss and two experts $y_t(1) = 0$ and $y_t(2) = 1$ for all $t \ge 1$. Let $\hat{y}^{\mathrm{EWA}}_t$ be the EWA predictions. There is a sequence $y_1, y_2, \ldots$ such that $y_t \in [0, 1]$ for all $t \ge 1$ and, for large enough $T$, the regret of EWA with $\eta = \tfrac{1}{2}$ satisfies
$$-\frac{1}{12} \log T \;\le\; R_T \;\le\; 2 \log 2$$
and, at the same time,
$$\sum_{t=1}^{T} \big(\hat{y}^{\mathrm{EWA}}_t - y^{\star}_t\big)^2 \;\ge\; T/2 .$$

### 3.2 A Simple Algorithm for Online Regression

In the following we use $y_t(u) = \langle u, x_t \rangle$.

**Algorithm 2: An algorithm for online linear regression**
- **Input:** $\eta > 0$, $\sigma > 0$, $G > 0$, $Z > 0$
- **Initialize:** $\gamma = \eta / G^2$, $w_1 = 0$, and $\Sigma_1^{-1} = \tfrac{1}{\sigma} I$
- **For** $t = 1, \ldots, T$:
  - Receive $x_t$
  - Set $W_t = \bigcap_{s=1}^{t} \{ w : |\langle w, x_s \rangle| \le Z \}$
  - Set $w_t = \arg\min_{w \in W_t} (w - \tilde{w}_t)^\top \Sigma_t^{-1} (w - \tilde{w}_t)$
  - Predict $\hat{y}_t = \langle w_t, x_t \rangle$
  - Receive $g_t$ and $\kappa_t$
  - Set $z_t = \gamma x_t g_t$
  - Set $\Sigma_{t+1}^{-1} = 2 \kappa_t z_t z_t^\top + \Sigma_t^{-1}$
  - Set $\tilde{w}_{t+1} = w_t - \Sigma_{t+1} z_t$

We prove a regret bound for Algorithm 2, which is a simplified version of MetaGrad [Van Erven et al., 2021]. The role of the parameter $Z$ in the algorithm is to ensure that the predictions $\hat{y}_t$ are bounded, which will be important in the statistical learning setting in Section 4. Similarly to Algorithm 1, the $\kappa_t$ parameter is only used in the selective sampling setting.

**Lemma 3.** For all $g_1, \ldots, g_T \in \mathbb{R}$ and $\kappa_1 \ge \cdots \ge \kappa_T$ with $\kappa_t \in (0, 1]$, the predictions $\hat{y}_t$ of Algorithm 2 run with inputs $\eta > 0$, $\sigma = D^2$, $G \ge \max_t |g_t|$, and $Z > 0$ satisfy
$$\sum_{t=1}^{T} (\hat{y}_t - y_t(u))\, g_t \;\le\; \frac{d\, G^2}{2 \kappa_T\, \eta} \log\Big(1 + \frac{D^2 \eta^2\, T \max_{t=1,\ldots,T} \|x_t\|_2^2}{d\, G^2}\Big) + \eta \sum_{t=1}^{T} \kappa_t (\hat{y}_t - y_t(u))^2 ,$$
for any $x_1, \ldots, x_T \in \mathbb{R}^d$, and for any $u \in W_T = \bigcap_{t=1}^{T} \{w : |\langle w, x_t \rangle| \le Z\}$ such that $\|u\|_2 \le D$.

**Example 2.** Consider the setup of Example 1 and suppose that $\max_t \|x_t\|_2, \|u\|_2 \le 1$ and $M = 1$. We have $|\ell'_t(\hat{y}_t)| = |2(\hat{y}_t - y_t)| \le 4$ and thus, by Lemma 3, an appropriately tuned Algorithm 2 with $g_t = \ell'_t(\hat{y}_t)$ satisfies (2) with $C_T = 8 + 8d \log\big(1 + \frac{\eta^2 T}{d}\big)$ and $B_T = 0$. Thus, by Lemma 1, setting $\eta = \tfrac{1}{2}$ yields
$$R_T \;\le\; 16 + 16 d \log\Big(1 + \frac{T}{4d}\Big) - \frac{1}{2} \sum_{t=1}^{T} (\hat{y}_t - y_t(u))^2 .$$
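Below is a minimal NumPy sketch of Algorithm 2 for the squared loss with $\kappa_t = 1$. For brevity we omit the projection onto $W_t$ (in the bounded setups of our examples it is rarely active); the function name and this omission are ours.

```python
import numpy as np

def algorithm2_squared_loss(xs, ys, eta=0.5, sigma=1.0, G=4.0):
    """Sketch of Algorithm 2 (simplified MetaGrad) with kappa_t = 1.

    xs: (T, d) feature vectors, ys: (T,) outcomes. The projection of
    w_tilde onto W_t is skipped, so this is the unconstrained update only.
    """
    T, d = xs.shape
    gamma = eta / G**2
    w_tilde = np.zeros(d)
    Sigma_inv = np.eye(d) / sigma
    y_hat = np.zeros(T)
    for t in range(T):
        w = w_tilde                      # projection onto W_t omitted
        y_hat[t] = w @ xs[t]
        g = 2.0 * (y_hat[t] - ys[t])     # squared-loss derivative
        z = gamma * g * xs[t]
        Sigma_inv = Sigma_inv + 2.0 * np.outer(z, z)
        w_tilde = w - np.linalg.solve(Sigma_inv, z)   # w_t - Sigma_{t+1} z_t
    return y_hat
```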
## 4 Statistical Learning

We discuss an application of our general results in the context of statistical learning, where we are interested in the generalization of estimators to unseen samples. A tool often used in converting online learning algorithms to the statistical learning setting is the online to batch conversion [Cesa-Bianchi et al., 2004]. Let us recall the setup. Assume that we are given a family $\mathcal{F}$ of real-valued functions defined on the instance space $\mathcal{X}$. We observe $T$ i.i.d. observations $(X_t, Y_t)_{t=1}^{T}$ distributed according to some unknown distribution $P$ on $\mathcal{X} \times \mathbb{R}$. Given the loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$, define the risk $R(f)$ of $f : \mathcal{X} \to \mathbb{R}$ as $R(f) = \mathbb{E}\, \ell(f(X), Y)$, where the expectation is taken with respect to the joint distribution of $X$ and $Y$. We are interested in bounding the excess risk
$$R(\hat{f}) - \inf_{f \in \mathcal{F}} R(f),$$
where $\hat{f}$ is constructed based on the sample $(X_t, Y_t)_{t=1}^{T}$. Assume that there is a sequence of predictors $\hat{f}_1, \ldots, \hat{f}_T$ trained in an online manner on $(X_t, Y_t)_{t=1}^{T}$ (that is, $\hat{f}_k$ depends on $(X_t, Y_t)_{t=1}^{k-1}$) such that almost surely $\sum_{t=1}^{T} \big(\ell(\hat{f}_t(X_t), Y_t) - \ell(f(X_t), Y_t)\big) \le R_T$, where $R_T$ is non-random. In this case, a standard online to batch conversion gives the in-expectation excess risk bound
$$\mathbb{E}\, R\Big(\frac{1}{T} \sum_{t=1}^{T} \hat{f}_t\Big) - \inf_{f \in \mathcal{F}} R(f) \;\le\; \frac{R_T}{T},$$
for any loss convex in its first argument, where the expectation is taken with respect to the learning sample $(X_t, Y_t)_{t=1}^{T}$. However, getting a high-probability version of this result is a known challenge if one wants to get the fast rate $O\big(\tfrac{1}{T}\big)$. A standard way of proving a high-probability result is to apply Freedman's inequality for martingales [Kakade and Tewari, 2008], which leads to a variance term scaling as $O\big(\tfrac{1}{\sqrt{T}}\big)$. For example, Audibert [2007] showed that this is the case if one wants to prove a high-probability excess risk bound based on EWA. A way to handle the variance term in Freedman's inequality is to exploit the Bernstein assumption, as in Kakade and Tewari [2008]. Unfortunately, this assumption is not necessarily satisfied by the stochastic environments we are considering. The main idea in this section is to use the negative term from Lemma 1 to cancel the variance term appearing due to Freedman's inequality.¹

¹We remark that Wintenberger [2017] uses a similar but technically more involved idea to compensate the variance of predictions using the term appearing because of the curvature of the loss.

We use the following notation when applying the online algorithms in the statistical setting: $\ell_t(\cdot) = \ell(\cdot, Y_t)$ and $y_t(f) = f(X_t)$.

### 4.1 Statistical Learning: Model Selection Aggregation

In this section, we discuss the application of our results to model selection (MS) aggregation. This setup was introduced by Nemirovski [2000] and further studied by Tsybakov [2003] and by Audibert [2007], Lecué and Mendelson [2009], Lecué and Rigollet [2014], Wintenberger [2017], Mourtada et al. [2021], among other works. We introduce some notation. Assume that we are given a finite dictionary $\mathcal{F} = \{f_1, \ldots, f_K\}$ of real-valued functions defined on the instance space $\mathcal{X}$. We observe $T$ i.i.d. observations $(X_t, Y_t)_{t=1}^{T}$ distributed according to some unknown distribution $P$ on $\mathcal{X} \times \mathbb{R}$. Given the loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$, define the risk $R(f)$ of $f : \mathcal{X} \to \mathbb{R}$ as $R(f) = \mathbb{E}\, \ell(f(X), Y)$, where the expectation is taken with respect to the joint distribution of $X$ and $Y$. In model selection aggregation, one is interested in constructing an estimator $\hat{f}$ based on the random sample $(X_t, Y_t)_{t=1}^{T}$ such that, with probability at least $1 - \delta$,
$$R(\hat{f}) - \min_{f \in \mathcal{F}} R(f) = O\Big(\frac{\log(K) + \log(1/\delta)}{T}\Big) \tag{4}$$
under appropriate boundedness and curvature assumptions on the loss function $\ell$. Analogously to Tsybakov [2003], a bound of the form (4) will be called the optimal rate of aggregation.

**Algorithm 3: Early-stopping online to batch conversion for model selection aggregation**
- **Input:** $T$, $M$, $\eta$, stopping threshold $\bar{S}$
- **Initialize:** $S = 0$; provide $\eta$ and $M$ as input to Algorithm 1
- **While** $S < T$ and $\bar{S} > \frac{\mu}{8} \min_{f \in \mathcal{F}} \sum_{t=1}^{S} (\hat{y}_t - f(X_t))^2$:
  - Receive $X_t$ and send $f_1(X_t), \ldots, f_K(X_t)$ as expert predictions to Algorithm 1
  - Receive $p_t$ and $\hat{y}_t = \sum_{i=1}^{K} p_t(i) f_i(X_t)$ from Algorithm 1
  - Predict $\hat{y}_t$ and receive $\ell_t$
  - Send $g_t = \ell'_t(\hat{y}_t)$ and $\kappa_t = 1$ to Algorithm 1
  - Set $S = S + 1$
- **Output:** $\hat{f} = \frac{1}{S} \sum_{t=1}^{S} \sum_{i=1}^{K} p_t(i) f_i$

We make use of a variant of the online to batch conversion [Cesa-Bianchi et al., 2004] in which we stop the procedure early if the empirical variance of the predictions is sufficiently large. We sketch the idea. Let $S$ be the number of samples used before the procedure terminates. We use Algorithm 1 as our aggregation procedure and set $\hat{f} = \frac{1}{S} \sum_{t=1}^{S} \sum_{i=1}^{K} p_t(i) f_i$. By Jensen's inequality we have $R(\hat{f}) \le \frac{1}{S} \sum_{t=1}^{S} \mathbb{E}_{t-1}\big[\ell_t\big(\sum_{i=1}^{K} p_t(i) f_i(X_t)\big)\big]$. To motivate early stopping, observe that if the empirical variance in Lemma 1 is sufficiently large, we may conclude that the excess risk is negative, so we have outperformed the best $f \in \mathcal{F}$. The result can be found in Theorem 2 below, whose proof is implied by Theorem 8 in Appendix B.

**Theorem 2.** Suppose that $|f(X)| \le \tfrac{1}{2}$ almost surely for all $f \in \mathcal{F}$, that $|\partial_y \ell(y, Y)| \le 1$ almost surely for all $y$ such that $|y| \le \tfrac{1}{2}$, and that $\ell$ is $\mu$-strongly convex in its first argument. Then, with probability at least $1 - \delta$, Algorithm 3 with input parameters $T$, $\eta = \tfrac{\mu}{4}$, $\bar{S} = O\big(\log(K) + \log(\log(T)/\delta)\big)$, and $M = 1$ satisfies
$$R(\hat{f}) \;\le\; \begin{cases} \min_{f \in \mathcal{F}} R(f) & \text{if } S < T, \\[2pt] \min_{f \in \mathcal{F}} R(f) + O\Big(\dfrac{\log(K) + \log(\log(T)/\delta)}{\mu T}\Big) & \text{if } S = T, \end{cases}$$
where $S$ is the number of steps of Algorithm 3.

When Algorithm 3 terminates at step $S = T$, we recover the optimal high-probability bound for model selection aggregation (4) up to an additive $\log \log T$ term. However, when $S = T$ our bound tells us slightly more, because we know that for all $t < T$,
$$\min_{f \in \mathcal{F}} \sum_{s=1}^{t} (\hat{y}_s - f(X_s))^2 = O(\log K + \log \log T),$$
which means that on the sequence $(X_t, Y_t)_{t=1}^{T}$ our predictions $\hat{y}_t$ essentially follow the prediction of the currently best expert at each round.
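A compact sketch of the early-stopping loop, reusing the Algorithm 1 update from Section 3.1 (squared loss, $M = 1$). The function and variable names, and the exact placement of the constants, are ours.

```python
import numpy as np

def algorithm3_early_stopping(X, Y, experts, eta, mu, S_bar):
    """Sketch of Algorithm 3: online to batch conversion for model selection
    aggregation, stopped early once the empirical variance crosses S_bar.

    X, Y: samples; experts: list of callables f_i(x).
    Returns the mixture weights of the averaged predictor f_hat.
    """
    T, K = len(X), len(experts)
    gamma = eta                       # gamma = eta / M**2 with M = 1
    cum = np.zeros(K)                 # cumulative surrogate losses (Algorithm 1)
    p_sum = np.zeros(K)               # running sum of the mixtures p_t
    cum_var = np.zeros(K)             # per-expert sum of (y_hat - f_i(X_t))^2
    S = 0
    while S < T and S_bar > (mu / 8.0) * cum_var.min():
        f = np.array([fi(X[S]) for fi in experts])
        w = np.exp(-(cum - cum.min()))
        p = w / w.sum()
        y_hat = p @ f
        g = 2.0 * (y_hat - Y[S])      # squared-loss derivative as feedback
        r = gamma * (f - y_hat) * g
        cum += r + r**2
        p_sum += p
        cum_var += (y_hat - f) ** 2
        S += 1
    return p_sum / max(S, 1)          # f_hat = sum_i (p_sum[i]/S) * f_i
```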
In the special case where we are solely interested in the best possible performance of the online to batch conversion of Algorithm 1, we can remove the $\log \log T$ term appearing in the previous bound. We remark that, apart from the work of Wintenberger [2017], no known analysis based on the online to batch conversion achieves the optimal rate of aggregation (4). We also believe that our analysis is simpler than those of previously known algorithms. The result can be found in Theorem 3 below, which is implied by Theorem 9 in Appendix B.

**Theorem 3.** Suppose that $|f(X)| \le \tfrac{1}{2}$ almost surely for all $f \in \mathcal{F}$, that $|\partial_y \ell(y, Y)| \le 1$ almost surely for all $y$ such that $|y| \le \tfrac{1}{2}$, and that $\ell$ is $\mu$-strongly convex in its first argument. Then, with probability at least $1 - \delta$, Algorithm 3 with input parameters $T$, $\eta = \tfrac{\mu}{4}$, $\bar{S} = \infty$, and $M = 1$ satisfies
$$R(\hat{f}) - \min_{f \in \mathcal{F}} R(f) = O\Big(\frac{\log(K) + \log(1/\delta)}{\mu T}\Big).$$

### 4.2 Statistical Learning: Regression

We consider the statistical learning setting where one has access to $T$ i.i.d. samples of pairs $(X_t, Y_t) \in \mathbb{R}^d \times \mathbb{R}$. In this case, we consider $\mathcal{F} \subseteq \{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$. For $w \in \mathbb{R}^d$ we define the risk as $R(w) = \mathbb{E}[\ell(\langle w, X \rangle, Y)]$. We assume that $\ell$ is $\mu$-strongly convex in its first argument. To the best of our knowledge, there are no known high-probability excess risk bounds in regression based on online to batch conversions with convergence rate $O\big(\frac{d \log(T)}{T}\big)$. We provide such a result in the bounded setup, where the feature vectors, the derivatives of the losses, and the norm of the reference vector are bounded. As before, for a result that holds with high probability one needs to control the cumulative variance of the predictions. For standard online learning algorithms, controlling this variance may prove troublesome. For example, Mourtada et al. [2021] showed that a version of the Vovk-Azoury-Warmuth forecaster [Vovk, 2001, Azoury and Warmuth, 2001] may incur constant excess risk with constant probability, whereas in expectation the Vovk-Azoury-Warmuth forecaster guarantees an $O\big(\frac{d \log(T)}{T}\big)$ excess risk bound. Instead, we leverage the negative empirical variance term of Lemma 1 to control the variance of the online to batch conversion, leading to the following excess risk bound, which is implied by Theorem 10 in Appendix B.

**Theorem 4.** Suppose that $\|X\|_2 \le 1$ and $\sup_{y \in [-1,1]} |\partial_y \ell(y, Y)| \le 1$ almost surely, and that $\ell$ is $\mu$-strongly convex in its first argument. Then, with probability at least $1 - \delta$,
$$R\Big(\frac{1}{T}\sum_{t=1}^{T} w_t\Big) - \min_{\|w\|_2 \le 1} R(w) = O\Big(\frac{d \log(T) + \log(1/\delta)}{\mu T}\Big),$$
where the $w_t$ are given by Algorithm 2 with $\eta = \tfrac{\mu}{4}$, $\sigma = 1$, $Z = 1$, $G = 1$, $\kappa_t = 1$, and feedback $g_t = \ell'_t(\langle w_t, X_t \rangle)$ for $t = 1, \ldots, T$.

## 5 Corrupted Feedback

In this section, we study a setting where the loss derivatives $\ell'_t(\hat{y}_t)$ observed by the learner at each round $t$ may be adversarially corrupted by unknown additive constants $c_t$, and we are interested in the best possible dependence on $c_1, \ldots, c_T$ in the regret bound. To better explain our setting, we start with the following example.

**Example 3.** In the online regression setting, suppose that $\ell_t(\hat{y}_t) = (\hat{y}_t - y_t)^2$ for all $t$, but the learner observes corrupted outcomes $y_t - c_t/2$. Hence, the squared loss derivative computed by the learner is $2(\hat{y}_t - y_t + c_t/2) = \ell'_t(\hat{y}_t) + c_t$, which can be handled by the algorithms developed in this section.
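As a concrete illustration of Example 3, the feedback fed to Algorithm 1 or 2 under corruption is simply the derivative computed from the corrupted outcome; a two-line sketch (the naming is ours):

```python
def corrupted_gradient(y_hat: float, y_obs: float) -> float:
    """Squared-loss derivative computed from a corrupted outcome
    y_obs = y_true - c/2.  Expanding: 2*(y_hat - y_obs)
    = 2*(y_hat - y_true) + c, i.e. the clean derivative plus c."""
    return 2.0 * (y_hat - y_obs)
```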
Several variants of this setting have been studied in prior work; see, for example, [Lykouris et al., 2018, Amir et al., 2020, Zimmert and Seldin, 2021, Ito, 2021] and the references therein. The main difference between our setting and these previous ones is that we assume our losses to be strongly convex while the environment is not necessarily stochastic. Although the results in this section are rather straightforward corollaries of our bounds, we believe it is instructive to provide some explicit results. All proofs of the results in this section are postponed to Appendix C. Our first result shows the performance of Algorithm 1 in the setup with corrupted gradients. The proof follows from observing that $\ell'_t(\hat{y}_t)(\hat{y}_t - y^{\star}_t) = (\ell'_t(\hat{y}_t) + c_t)(\hat{y}_t - y^{\star}_t) - c_t(\hat{y}_t - y^{\star}_t)$ and that for any $\lambda > 0$,
$$|c_t(\hat{y}_t - y^{\star}_t)| \;\le\; \frac{c_t^2}{\lambda} + \frac{\lambda}{4}(\hat{y}_t - y^{\star}_t)^2 .$$
The $\frac{\lambda}{4}(\hat{y}_t - y^{\star}_t)^2$ term can be compensated by the negative $\frac{\mu}{2}(\hat{y}_t - y^{\star}_t)^2$ term appearing in Lemma 1, leading to an additive $\sum_{t=1}^{T} c_t^2$ term in the regret bound. In particular, our result implies that as long as $\sum_{t=1}^{T} c_t^2$ is of order $O(\log K)$, we obtain the same regret bound as if the losses were not corrupted. The formal statement can be found in Theorem 5 below.

**Theorem 5.** Fix an arbitrary sequence $\ell_1, \ldots, \ell_T$ of $\mu$-strongly convex differentiable losses and corruptions $c_1, \ldots, c_T \in \mathbb{R}$. Then the predictions $\hat{y}_t$ of Algorithm 1 run with inputs $M \ge \max_t |\ell'_t(\hat{y}_t) + c_t|$, $\eta = \tfrac{\mu}{4}$, feedback $g_t = \ell'_t(\hat{y}_t) + c_t$, and $\kappa_t = 1$ satisfy
$$R_T \;\le\; \frac{8 M^2 \log(K)}{\mu} + \frac{8 \sum_{t=1}^{T} c_t^2}{\mu} - \frac{\mu}{8} \sum_{t=1}^{T} (\hat{y}_t - y_t(i^{\star}))^2$$
provided $\max_i \max_t |\hat{y}_t - y_t(i)| \le M$.

Next we prove an analog of Theorem 5 in the online regression setup.

**Theorem 6.** Fix an arbitrary sequence $\ell_1, \ldots, \ell_T$ of $\mu$-strongly convex differentiable losses and corruptions $c_1, \ldots, c_T \in \mathbb{R}$. Then the predictions $\hat{y}_t$ of Algorithm 2 run with inputs $\eta = \tfrac{\mu}{8}$, $\sigma = D^2$, $G \ge \max_t |\ell'_t(\hat{y}_t) + c_t|$, $Z > 0$, feedback $g_t = \ell'_t(\hat{y}_t) + c_t$, and $\kappa_t = 1$ satisfy
$$R_T \;\le\; \frac{8 d G^2}{\mu} \log\Big(1 + \frac{T D^2 \mu^2 \max_t \|x_t\|_2^2}{2 d G^2}\Big) + \frac{8 \sum_{t=1}^{T} c_t^2}{\mu} - \frac{\mu}{8} \sum_{t=1}^{T} (\hat{y}_t - y^{\star}_t)^2,$$
for any $x_1, \ldots, x_T \in \mathbb{R}^d$ and any $u \in W_T = \bigcap_{t=1}^{T} \{w : |\langle w, x_t \rangle| \le Z\}$ such that $\|u\|_2 \le D$ and $y^{\star}_t = \langle u, x_t \rangle$ for all $t \ge 1$.

## 6 Selective Sampling

We consider a variant of the selective sampling setting (see, e.g., [Atlas et al., 1989, Freund et al., 1997, Cesa-Bianchi et al., 2003, 2006, Orabona and Cesa-Bianchi, 2011]) where the learner has access to the expert predictions (or, equivalently, to feature vectors), but can observe its own loss only upon request. The goal is to trade off the number of loss requests against regret guarantees. We show that if the variance is high, then with only a fraction of all losses requested we obtain the same guarantee (in expectation and up to constants) as when all losses are requested. Let $o_t = 1$ with probability $q_t$ and $o_t = 0$ with probability $1 - q_t$. In each round, if $o_t = 1$ the loss $\ell_t$ at round $t$ is requested, and we use the loss estimator $\frac{o_t}{q_{t-1}} \ell_t$ in the update. Note that this is not the importance-weighted estimator, since $\mathbb{E}_{t-1}\big[\frac{o_t}{q_{t-1}}\big] = \frac{q_t}{q_{t-1}}$. The reason for choosing this particular loss estimator is that it gives better control of the range of the loss estimates, which allows us to tune $\kappa_t$ in Algorithm 1 accordingly. The probability of requesting a loss is
$$q_t = \min\Big\{1, \; \frac{\beta}{\min_i \sum_{s=1}^{t-1} (\hat{y}_s - y_s(i))^2}\Big\},$$
where $\beta > 0$ is chosen by the learner. Our result for the selective sampling setting can be found in Theorem 7 below, whose statement is implied by Theorem 11 in Appendix D.
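A sketch of the sampling mechanism follows. The exact constants and the precise form of $q_t$ are those of Theorem 11 in the appendix; here we only reconstruct the mechanism, and the names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def request_probability(beta: float, cum_var: np.ndarray) -> float:
    """q_t: inversely proportional to the cumulative variance
    min_i sum_s (y_hat_s - y_s(i))^2, capped at 1."""
    v = cum_var.min()
    return 1.0 if v <= 0.0 else min(1.0, beta / v)

def sampled_feedback(g: float, q: float, q_prev: float) -> tuple[float, int]:
    """Request the loss with probability q; rescale the derivative by the
    *previous* probability q_prev, mirroring the o_t / q_{t-1} estimator."""
    o = int(rng.random() < q)
    return (o / q_prev) * g, o
```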
Theorem 7 implies the following: if $\beta = O(\mu^{3/2} \log(K))$, then with only an expected number $\sum_{t=1}^{T} q_t$ of loss requests we obtain (up to constants) the same regret guarantee as if we had requested all losses. With this particular choice of $\beta$, $q_t < 1$ as soon as the bound in Lemma 1 becomes negative. In other words, when the variance is high we only need a fraction of the losses to recover the worst-case optimal regret bound (in expectation). A similar result can be obtained in the regression setting; see Appendix D.1.

**Theorem 7.** Fix an arbitrary sequence $\ell_1, \ldots, \ell_T$ of $\mu$-strongly convex differentiable losses. Provided $\max_i \max_t |\hat{y}_t - y_t(i)| \le 1$, the predictions $\hat{y}_t$ of Algorithm 1 run with inputs $M = 1 \ge \max_t |\ell'_t(\hat{y}_t)|$, $\eta = \tfrac{\mu}{4}$, feedback $g_t = \frac{o_t}{q_{t-1}} \ell'_t(\hat{y}_t)$, and $\kappa_t = q_t$ satisfy
$$\mathbb{E}\Big[\sum_{t=1}^{T} \big(\ell_t(\hat{y}_t) - \ell_t(y^{\star}_t)\big)\Big] = O\Big(\frac{\log(K)}{\mu} + \frac{\log(K)^2}{\beta}\Big).$$

## 7 Further Extensions

In Appendix G we present another application of Lemma 1. Namely, we show that we may restart Algorithm 1 for free whenever the regret becomes negative. This gives a regret bound in which we compete with a new expert after each restart, a stronger notion of regret than competing with a fixed expert in all rounds. The ideas behind Lemma 1 also extend naturally beyond online learning with strongly convex losses. We discuss one such extension for the online prediction with expert advice setting. The online prediction with abstention setting was introduced by Neu and Zhivotovskiy [2020] and proceeds as follows. In each round $t = 1, \ldots, T$ the learner receives expert predictions $y_t(i) \in \{-1, 1\}$, $i = 1, \ldots, K$, and can then either predict $\tilde{y}_t \in \{-1, 1\}$ or abstain from predicting. If the learner predicts $\tilde{y}_t$, it suffers the binary loss $\ell_t(\tilde{y}_t) = \mathbb{1}[\tilde{y}_t \ne y_t]$, where $y_t \in \{-1, 1\}$. If the learner abstains, it suffers the abstention cost $\rho \in [0, \tfrac{1}{2})$. Let $a_t = 1$ for prediction and $a_t = 0$ for abstention. The total loss of the learner is therefore
$$\sum_{t=1}^{T} \big(a_t \mathbb{1}[\tilde{y}_t \ne y_t] + (1 - a_t)\rho\big).$$
Assume that the prediction strategy is randomized, in the sense that the $a_t$ are Bernoulli random variables whose means may depend on previous observations. The work of Neu and Zhivotovskiy [2020] shows that there is a randomized prediction strategy such that, for any data generating mechanism,
$$\mathbb{E}\Big[\sum_{t=1}^{T} \big(a_t \mathbb{1}[\tilde{y}_t \ne y_t] + (1 - a_t)\rho\big)\Big] - \sum_{t=1}^{T} \mathbb{1}[y^{\star}_t \ne y_t] = O(\log K) \tag{6}$$
independently of $T$, where the expectation is taken with respect to the randomness of the $a_t$; here $y^{\star}_t = y_t(i^{\star})$ and $i^{\star} = \arg\min_i \sum_{t=1}^{T} \mathbb{1}[y_t(i) \ne y_t]$. Although it was shown that randomization is necessary to achieve the regret bound (6), it was unclear whether the same regret bound can be achieved with high probability with respect to the randomization of the algorithm. In Appendix F, we answer this question using the techniques we developed and provide a randomized algorithm such that, with probability at least $1 - \delta$,
$$\sum_{t=1}^{T} \big(a_t \mathbb{1}[\tilde{y}_t \ne y_t] + (1 - a_t)\rho\big) - \sum_{t=1}^{T} \mathbb{1}[y^{\star}_t \ne y_t] = O\big(\log K + \log(1/\delta)\big).$$
In Appendix F we prove Lemma 7, which is the analog of Lemma 1 for the abstention setting. The equivalent of the negative term in Lemma 1 is used to compensate the variance of the high-probability statement, which allows us to recover the above bound. This also implies that our other applications of Lemma 1 can be exported to the online learning with abstention setting.
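For concreteness, the learner's total loss in the abstention protocol above can be computed as follows (a direct transcription of the displayed formula; the function name is ours):

```python
import numpy as np

def abstention_total_loss(a, y_tilde, y, rho):
    """Total loss in the abstention protocol: binary loss on rounds where the
    learner predicts (a_t = 1), fixed cost rho on rounds where it abstains."""
    a, y_tilde, y = np.asarray(a), np.asarray(y_tilde), np.asarray(y)
    return np.sum(a * (y_tilde != y) + (1 - a) * rho)
```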
## Acknowledgments and Disclosure of Funding

Dirk van der Hoeven and Nicolò Cesa-Bianchi gratefully acknowledge partial support from the MIUR PRIN grant "Algorithms, Games, and Digital Markets" (ALGADIMAR), the EU Horizon 2020 ICT-48 research and innovation action under grant agreement 951847, project ELISE (European Learning and Intelligent Systems Excellence), and the project "One Health Action Hub: University Task Force for the resilience of territorial ecosystems" funded by Università degli Studi di Milano. A significant part of this work was done while Nikita Zhivotovskiy was at ETH Zürich. During this period, Nikita Zhivotovskiy was funded in part by ETH Foundations of Data Science (ETH-FDS).

## References

Idan Amir, Idan Attias, Tomer Koren, Yishay Mansour, and Roi Livni. Prediction with corrupted expert advice. Advances in Neural Information Processing Systems, 33:14315–14325, 2020.

Les Atlas, David Cohn, and Richard Ladner. Training connectionist networks with queries and selective sampling. Advances in Neural Information Processing Systems, 2, 1989.

Jean-Yves Audibert. Progressive mixture rules are deviation suboptimal. Advances in Neural Information Processing Systems, 2007.

Katy S. Azoury and Manfred K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.

Olivier Bousquet and Nikita Zhivotovskiy. Fast classification rates without standard margin assumptions. Information and Inference: A Journal of the IMA, 10(4):1389–1421, 2021.

Sébastien Bubeck. Introduction to online optimization. Lecture Notes, 2011.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In Learning Theory and Kernel Machines, pages 373–387. Springer, 2003.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni, and Manfred Warmuth. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7(7), 2006.

Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Impossible tuning made possible: A new expert algorithm and its applications. In Conference on Learning Theory, pages 1216–1259, 2021.

Ashok Cutkosky. Artificial constraints and hints for unbounded online learning. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 874–894, 2019.

Tim van Erven, Wouter M. Koolen, and Dirk van der Hoeven. MetaGrad: Adaptation using multiple learning rates in online learning. Journal of Machine Learning Research, 22(161):1–61, 2021.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133–168, 1997.

Pierre Gaillard, Gilles Stoltz, and Tim van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196, 2014.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
Dirk van der Hoeven. Exploiting the surrogate gap in online multiclass classification. Advances in Neural Information Processing Systems, 33:9562–9572, 2020.

Dirk van der Hoeven, Tim van Erven, and Wojciech Kotlowski. The many faces of exponential weights in online learning. In Conference On Learning Theory, pages 2067–2092, 2018.

Shinji Ito. On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409–7420, 2021.

Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. Advances in Neural Information Processing Systems, 21, 2008.

Wouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175, 2015.

Guillaume Lecué and Shahar Mendelson. Aggregation via empirical risk minimization. Probability Theory and Related Fields, 145(3-4):591–613, 2009.

Guillaume Lecué and Philippe Rigollet. Optimal learning with Q-aggregation. The Annals of Statistics, 42(1):211–224, 2014.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018.

Shahar Mendelson. An unrestricted learning procedure. Journal of the ACM, 66(6):1–42, 2019.

Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2001.

Zakaria Mhammedi and Robert C. Williamson. Constant regret, generalized mixability, and mirror descent. Advances in Neural Information Processing Systems, 31, 2018.

Zakaria Mhammedi, Wouter M. Koolen, and Tim van Erven. Lipschitz adaptivity with multiple learning rates in online learning. In Conference on Learning Theory, pages 2490–2511, 2019.

Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the Hedge algorithm in the stochastic regime. Journal of Machine Learning Research, 20:1–28, 2019.

Jaouad Mourtada, Tomas Vaskevicius, and Nikita Zhivotovskiy. Distribution-free robust linear regression. Mathematical Statistics and Learning, 4(3-4):253–292, 2021.

Arkadi Nemirovski. Topics in non-parametric statistics. Ecole d'Eté de Probabilités de Saint-Flour, 28:85, 2000.

Gergely Neu and Nikita Zhivotovskiy. Fast rates for online prediction with abstention. In Conference on Learning Theory, pages 3030–3048, 2020.

Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

Francesco Orabona and Nicolò Cesa-Bianchi. Better algorithms for selective sampling. In International Conference on Machine Learning, pages 433–440, 2011.

Nikita Puchkin and Nikita Zhivotovskiy. Exponential savings in agnostic active learning through abstention. IEEE Transactions on Information Theory, 68(7):4651–4665, 2022.

Alexander Rakhlin and Karthik Sridharan. Online non-parametric regression. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 1232–1264, 2014.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pages 1571–1578, 2012.

Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the Leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.
El Mehdi Saad and Gilles Blanchard. Fast rates for prediction with limited expert advice. Advances in Neural Information Processing Systems, 34:23582–23591, 2021.

Alexandre B. Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel Machines, pages 303–313. Springer, 2003.

Volodimir G. Vovk. Aggregating strategies. In Proceedings of the 3rd Annual Workshop on Computational Learning Theory, 1990.

Volodimir G. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.

Guanghui Wang, Shiyin Lu, and Lijun Zhang. Adaptivity and optimality: A universal algorithm for online convex optimization. In Uncertainty in Artificial Intelligence, pages 659–668, 2020.

Olivier Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106(1):119–141, 2017.

Lijun Zhang, Tianbao Yang, and Zhi-Hua Zhou. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning, pages 5882–5891, 2018.

Julian Zimmert and Yevgeny Seldin. Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits. Journal of Machine Learning Research, 22(28):1–49, 2021.