Luckiness in Multiscale Online Learning

Muriel Felipe Pérez-Ortiz (Centrum Wiskunde & Informatica (CWI), muriel.perez@cwi.nl)
Wouter M. Koolen (CWI and University of Twente, wmkoolen@cwi.nl)

Abstract

Algorithms for full-information online learning are classically tuned to minimize their worst-case regret. Modern algorithms additionally provide tighter guarantees outside the adversarial regime, most notably in the form of constant pseudoregret bounds under statistical margin assumptions. We investigate the multiscale extension of the problem, in which the loss ranges of the experts are vastly different. Here, the regret with respect to each expert needs to scale with its range, instead of the maximum overall range. We develop new multiscale algorithms, tuning schemes and analysis techniques to show that worst-case robustness and adaptation to easy data can be combined at a negligible cost. We further develop an extension with optimism and apply it to solve multiscale two-player zero-sum games. We demonstrate experimentally the superior performance of our scale-adaptive algorithm and discuss the subtle relationship of our results to Freund's 2016 open problem.

1 Introduction

The abstract problem of online prediction with expert advice [Littlestone and Warmuth, 1994, Freund and Schapire, 1997] is of fundamental importance in computational learning theory. Efficient and optimal algorithms for solving it have a substantial impact on various problems in general online convex optimization [Hazan, 2019], online model selection [Foster et al., 2017], boosting [Freund and Schapire, 1997], and maximal probabilistic inequalities [Rakhlin and Sridharan, 2017], to name a few. Concretely, a decision maker chooses among experts' advice sequentially, and the environment assigns each advice a scalar loss. If all losses have the same numerical range $[-\sigma, \sigma]$, the situation is well understood. Indeed, Freund and Schapire [1997] showed that, for $K$ experts and $t$ rounds, the Hedge algorithm guarantees the minimax regret (defined below) $\sigma\sqrt{2t\ln K}$. Furthermore, modern algorithms additionally guarantee lower or even constant regret when the sequence of losses is more benign [see De Rooij et al., 2014, Koolen and Van Erven, 2015, Mourtada and Gaïffas, 2019].

In the multiscale setting, where the experts' loss ranges may differ by orders of magnitude, it is natural to ask about the existence of algorithms that guarantee an optimal worst-case regret bound that scales with the loss range of the best expert instead of the maximum range. This question has been answered affirmatively [Chen et al., 2021, Bubeck et al., 2019, Cutkosky and Orabona, 2018, Foster et al., 2017]. The algorithms developed in this line of work have had a significant impact in different areas of computational learning theory and practice. Unfortunately, as we will see, the best known algorithms still fail to guarantee lower regret even for the simplest benign statistical cases. Ensuring these goals poses serious technical challenges. In particular, Bernstein's inequality, the engine of classical same-scale luckiness arguments, has no suitable multiscale upgrade. Moreover, intuitive candidate upgrades of same-scale results would contradict recent lower bounds (see Section 7). To make things worse, in order to obtain multiscale regret bounds, close attention needs to be paid to terms that are conventionally insignificant but now carry the maximum scale of the problem.
This motivates our main question: can a single algorithm have multiscale worst-case regret guarantees and, in addition, exhibit constant (pseudo)regret in stochastic lucky cases?

We answer this question affirmatively. The key contribution of this article is MUSCADA (multiscale adaptive), a computationally efficient algorithm that simultaneously guarantees a worst-case regret that grows with the scale of the best expert, and constant expected pseudoregret under a stochastic margin condition. MUSCADA uses a refined version of Follow the Regularized Leader based on the multiscale entropy of Bubeck et al. [2019]. Its crucial improvement is a second-order, variance-like adaptation, the tightest possible for the analysis of this regularizer. This second-order adaptation is close in spirit to, and an improvement of, that of AdaHedge by De Rooij et al. [2014] and those of Chen et al. [2021]. As a result of careful analysis, MUSCADA has the following attractive properties: it does not need knowledge of the length of the game in advance and does not resort to any doubling trick; the presence of zero-regret rounds does not change the state of the algorithm or its regret guarantees; and it is invariant both under per-round, possibly unknown, translations of each expert's losses, and under a global known scaling common to all losses and ranges.

As an application of MUSCADA and its analysis techniques, we build an optimistic variant of the algorithm and use it to solve two-player zero-sum games that have a multiscale structure. The optimistic variant makes use of a guess of what the losses in the next round will be, and achieves lower regret when the guesses are adequate. This interest originates in the fact that optimistic algorithms converge to the solutions of such games at faster rates than their nonoptimistic counterparts [Syrgkanis et al., 2015]. We find experimentally that MUSCADA outperforms existing single-scale algorithms when the payoff matrix of the game exhibits a multiscale structure.

In the rest of this introduction we lay out formally the multiscale experts problem, review existing work, present a summary of the main contributions (Section 1.1), and outline the rest of the article.

Full-information online learning. In its simplest form, we must decide sequentially, in rounds, how to aggregate the predictions made by a fixed number $K$ of experts. At each round $t$, we choose an aggregation strategy, a probability distribution $w_t \in \mathcal{P}(K)$ over experts. After choosing $w_t$, we assess the quality of the experts' predictions with a numerical loss $\ell_t = (\ell_{t,k})_{k \in K}$ and judge the performance of our aggregation strategy by the $w_t$-weighted loss $\langle w_t, \ell_t \rangle = \sum_{k \in K} w_{t,k}\ell_{t,k}$. Our objective is to minimize the cumulative gap between the losses incurred by our aggregation strategy $t \mapsto w_t$ and the best expert in hindsight. This cumulative gap is the regret
\[
R_t = \sum_{s=1}^t \langle w_s, \ell_s \rangle - \min_{k \in K} \sum_{s=1}^t \ell_{s,k}.
\]
Other than range restrictions on the losses, no assumptions are made about the mechanism that generates them. More precisely, for each expert $k \in K$ and all rounds $t$, we only assume that $\ell_{t,k} \in [-\sigma_k, \sigma_k]$ for known nonnegative scales $\{\sigma_k\}_{k \in K}$. We call $R_t$ the vector of regrets with respect to each expert, that is, the vector with entries $R_{t,k} = \sum_{s=1}^t \{\langle w_s, \ell_s \rangle - \ell_{s,k}\}$. A minimal sketch of this protocol is given below.
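To make the protocol concrete, here is a minimal Python sketch (our illustration, not code from the paper; the `learner` placeholder is hypothetical and stands for any aggregation strategy):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 1000
sigma = np.array([100.0, 1.0, 0.01])     # vastly different loss ranges

def learner(R):
    """Placeholder aggregation strategy; MUSCADA (Figure 1) replaces this.
    Uniform weights are the naive baseline."""
    return np.ones(K) / K

R = np.zeros(K)                          # regret vector (R_{t,k})
for t in range(T):
    w = learner(R)
    loss = rng.uniform(-sigma, sigma)    # any mechanism respecting the ranges
    R += np.dot(w, loss) - loss          # instantaneous regret w.r.t. each expert
# Multiscale goal: R[k] should grow like sigma[k] * sqrt(t ln K),
# not like sigma.max() * sqrt(t ln K).
print(R)
```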
Existing results. Several algorithms have been proposed that achieve the optimal worst-case regret in the multiscale setting, but none of them achieves constant regret in stochastic lucky cases. Motivated by the problem of online model selection, Foster et al. [2017] used a technique of adaptive relaxations to produce randomized algorithms that guarantee
\[
\mathbb{E}_P[R_{t,k}] = O\Bigl(\sigma_k\sqrt{t\bigl(\ln t + \ln(1/\pi_k) + \ln(\sigma_k/\sigma_{\min})\bigr)}\Bigr) \quad \text{as } t \to \infty,
\]
where $\pi$ is a prior distribution on experts that generalizes the uniform $1/K$ of the Hedge algorithm and the expectation is over the algorithm's randomness. Bubeck et al. [2019] first proposed a Follow-the-Regularized-Leader algorithm with a multiscale entropy regularization that guarantees $R_{t,k} = O\bigl(\sigma_k\sqrt{t(\ln K + \ln(\sigma_{\max}/\sigma_{\min}))}\bigr)$ as $t \to \infty$ when the number of rounds $t$ is known in advance. Bubeck et al. [2019, Theorem 20] also construct an instance of the $K = 2$ experts problem in which there exists a time $t$ for which any algorithm must have $R_{t,k} \gtrsim \sigma_k\sqrt{t(\ln K + \ln(\sigma_{\max}/\sigma_{\min}))}$ for some expert $k$, shedding some light on the minimax picture.

Recently, Chen et al. [2021] designed an optimistic algorithm that uses the same regularization as Bubeck et al. [2019] with an additional ingredient: at each round, a second-order correction is added to the losses before computing the next round's weights. At every round, their algorithm makes use of a guess vector $m_t$ that can depend on the losses up to time $t-1$. The scale of the guesses $m_t$ is assumed to be the same as that of the losses: $|m_{t,k}| \le \sigma_k$. For instance, valid choices for the guess $m_t$ are $0$ and the loss $\ell_{t-1}$ of the previous round. The algorithm of Chen et al. [2021] achieves $R_{t,k} = O\bigl(\sigma_k\sqrt{\beta_{t,k}\ln t} + \sigma_{\max}\ln t\bigr)$ as $t \to \infty$, now scaling with the expert-dependent time $\beta_{t,k} = \sum_{s=1}^t (\ell_{s,k} - m_{s,k})^2/\sigma_k^2 \le 4t$. Furthermore, they show that a different, single-scale tuning of their algorithm exhibits stochastic luckiness: if the losses are sampled from a distribution with a gap $d_{\min} > 0$ between the expected loss of the best expert $k^\star$ and that of any other expert, their algorithm guarantees constant expected pseudoregret, where $P$ is the distribution of the losses. Their technique for stochastic luckiness uses the upcoming learner's loss as the guess, $m_{t,k} = \langle w_t, \ell_t \rangle$. Unfortunately, this approach cannot be extended to the multiscale case, as these guesses may violate the experts' loss ranges.

1.1 Main results

In this section we present succinctly the regret guarantees for MUSCADA. Firstly, we present multiscale worst-case regret guarantees. Secondly, we present the stochastic luckiness results and Massart's margin condition. We then prove analogs of these results for an optimistic modification of MUSCADA in Section 4. We close this introduction with an outline of the rest of the article.

Worst-case bounds. We propose two tunings for MUSCADA; they cover the cases where there is or is not an expert with loss range equal to zero. Our results imply Theorem 1.1 below; it contains the regret guarantees for MUSCADA, expressed in terms of $v_t$, an implicitly defined, variance-like, second-order data-dependent quantity. The quantity $v_t$, defined by the algorithm, is the tightest allowed by our analysis and enables our luckiness result, Theorem 3.1. We interpret $v_t$, through the upper bounds of Theorem 1.2 below, as an internal scale-free measure of time, as $v_t \le 4t$.

Theorem 1.1 (Regret Bounds). Consider MUSCADA, with $t \mapsto v_t$ defined in Figure 1, and any initial probability distribution $\pi$. If $\sigma_{\min} = \min_{k \in K} \sigma_k > 0$, Tuning 1 guarantees, for any loss sequence,
\[
R_{t,k} \le c\,\sigma_k\sqrt{v_t\bigl(\ln(1/\pi_k) + \ln(\sigma_k/\sigma_{\min})\bigr)} + O(1) \quad \text{as } t \to \infty, \tag{1}
\]
where $c$ is a constant depending only on $\pi$. The constant $c$ is well behaved: if $\max_{k \in K} \pi_k = 1 - \varepsilon$, then $c \le 4\sqrt{2}\bigl(1 + 1/(2\ln(1+\varepsilon))\bigr)$.
Even if $\min_{k \in K} \sigma_k = 0$, Tuning 2 ensures, for any loss sequence,
\[
R_{t,k} \le \sigma_k\sqrt{2\,v_t\bigl(\ln(1/\pi_k) + \ln(1+v_t)\bigr)}\,\bigl(1 + o(1)\bigr) \quad \text{as } t \to \infty. \tag{2}
\]
The following theorem (proven in Appendix G) shows that $\Delta v_t$ is bounded by a second-order quantity. If $w_{t,k}$ are the weights played by MUSCADA at round $t$ and $\eta_{t-1,k}$ are its learning rates, $\Delta v_t$ is bounded by the variance over experts of the losses w.r.t. a tilted probability distribution $\tilde w_{t,k} \propto w_{t,k}\eta_{t-1,k}$. The shape of this quantity may seem surprising, but it is not artificial; our analysis shows that it is the tightest and, consequently, the natural second-order quantity associated to this choice of regularization. In Appendix G, we further motivate, via a Taylor approximation, the shape of the resulting upper bound.

Theorem 1.2. Let $\tilde w_t$ be the probability distribution with $\tilde w_{t,k} \propto w_{t,k}\eta_{t-1,k}$ and let $\Delta v_t = v_t - v_{t-1}$. Then, with either tuning from Figure 2, the quantity $v_t$ from Figure 1 satisfies
\[
\Delta v_t \;\le\; \frac{4\,\mathrm{var}_{\tilde w_t}(\ell_t)}{\langle \tilde w_t, \sigma^2 \rangle} \;\le\; 4, \qquad \text{where } \mathrm{var}_{\tilde w_t}(\ell_t) = \bigl\langle \tilde w_t, (\ell_t - \langle \tilde w_t, \ell_t \rangle)^2 \bigr\rangle.
\]

Stochastic luckiness. We now turn to our results for stochastic easy data. Not all stochastic scenarios are easy (in fact, worst-case regret lower bounds are proved using stochastic data). We use Massart's margin condition, a standard benchmark for easy data.

Definition 1.3 (Massart's easiness condition). The losses $\ell_1, \ell_2, \ldots$ satisfy Massart's easiness condition if they are generated i.i.d. from a distribution $P$ with the following property: there exist a constant $c_M$ and an expert $k^\star \in K$ such that $\mathbb{E}_P[(\ell_{t,k} - \ell_{t,k^\star})^2] \le c_M\,\mathbb{E}_P[\ell_{t,k} - \ell_{t,k^\star}]$ for all $k \in K$ and $t \ge 1$. In that case, $k^\star = \arg\min_{k \in K} \mathbb{E}_P[\ell_{t,k}]$ for all $t$.

Massart's condition is implied by a more interpretable gap condition [Koolen et al., 2016, Lemma 3]. If there exists a gap $d_{\min} > 0$ in expectation between the loss of any expert and that of the best one $k^\star$, that is, if, for every $k \ne k^\star$, $\mathbb{E}_P[\ell_{1,k}] \ge d_{\min} + \mathbb{E}_P[\ell_{1,k^\star}]$, then Massart's condition is satisfied with $c_M = 1/d_{\min}$. We show the following theorem.

Theorem 1.4 (Constant regret under Massart's condition). Under Massart's condition (Definition 1.3), MUSCADA with either Tuning 1 or 2 has constant expected pseudoregret over time, that is, $\mathbb{E}_P[R_{t,k^\star}] \lesssim 1$.

Outline. The rest of this article is organized as follows. In Section 2, we introduce and analyze MUSCADA. In Section 3, we state the main results on stochastic luckiness for MUSCADA. In Section 4, we introduce an optimistic variant of MUSCADA, give remarks about its numerical implementation in Section 5, and apply it to accelerating the solution of multiscale games in Section 6. We end this article with a discussion of our results in Section 7.

2 The MUSCADA Multiscale Online Learning Algorithm

In this section, we describe our algorithm and motivate its design. We present two useful tunings and prove the corresponding worst-case regret guarantees. For the sake of intuition, we specialize the algorithm to the case of same-scale experts with uniform prior and compare its resulting closed form to AdaHedge [De Rooij et al., 2014]. Stochastic luckiness results are found in Section 3. We begin by introducing some notation.

Notation. We use boldface type for vectors in $\mathbb{R}^K$ ($R_t$, $L_t$, $\mu_t$, $\eta_t$, $\sigma$, $u$) and distributions on $K$ experts ($p$, $w$, $\pi$). We number rounds so that all quantities indexed by $t$ depend on the information witnessed by the learner in the first $t$ rounds; exceptionally, the weights $w_t$ are played at round $t$ (they depend on the first $t-1$ rounds only). For two functions $f$ and $g$ we write $f = O(g)$ as $t \to \infty$ if there exists $c > 0$ such that $\lim_{t\to\infty} f(t)/g(t) \le c$. Similarly, we write $f(t) \sim g(t)$ as $t \to \infty$ if $\lim_{t\to\infty} f(t)/g(t) = 1$, and $f \lesssim g$ if there is $c > 0$ so that $f \le cg$.
We denote the simplex of probability distributions on $K$ experts by $\mathcal{P}(K)$ and use $K$ interchangeably for a number $K \in \mathbb{N}$ and the set $\{1, \ldots, K\}$.

We define MUSCADA in Figure 1 and give its two main tunings in Figure 2. At round $t$, after observing cumulative corrected losses $L_{t-1} + \mu_{t-1}$, MUSCADA plays the weights
\[
w_{t,k} = u_k e^{-\eta_{t-1,k}(L_{t-1,k} + \mu_{t-1,k} + a^\star_{t-1})},
\]
where $u_k > 0$ is a tuning parameter related to the prior weights, $\eta_{t-1}$ are learning rates that decrease over time, $\mu_t$ are corrections incrementally computed at every round, and the scalar $a^\star_{t-1}$ ensures normalization (see Lemma F.7). The weights $w_t$ are reminiscent of those played by the Hedge algorithm, but the normalization $a^\star_t$ cannot be computed explicitly in general. The weights $w_t$ are the result of a Follow-the-Regularized-Leader update on the vector of corrected losses $L_{t-1} + \mu_{t-1}$. The regularizer employed is the multiscale entropy: for a fixed $u > 0$, its Bregman divergence is
\[
w \mapsto D_\eta(w, u) = \sum_{k \in K} \frac{w_k \ln(w_k/u_k) - w_k + u_k}{\eta_k}, \qquad w \in \mathcal{P}(K) \tag{3}
\]
[see Bubeck et al., 2019, Chen et al., 2021]. The goal of subtracting the data-dependent second-order corrections $\mu_t$ from the experts' regrets is to keep a scalar potential function $\Phi_t$ negative. Here, the potential $t \mapsto \Phi_t$ is defined by convex conjugacy with respect to the multiscale entropy as
\[
\Phi_t := \Phi(R_t - \mu_t, \eta_t) = \max_{w \in \mathcal{P}(K)} \langle w, R_t - \mu_t \rangle - D_{\eta_t}(w, u), \tag{4}
\]
for which $w_{t+1}$ is the maximizer. The corrections $\mu_t$ and the consequent negativity of the potential $\Phi_t$ are the main ingredients in the regret analysis of MUSCADA. We next motivate these choices.

Parameters: A vector $u_k > 0$ of initial weights, initial strictly positive learning rates $\eta_{0,k} \le 1/(2\sigma_k)$, and real, continuous, nonincreasing functions $H_k : \mathbb{R}_+ \to \mathbb{R}$ with $H_k(0) = 1$.
Initialization: Let $\mu_{0,k} = 0$, $v_0 = 0$, $R_{0,k} = 0$ and $L_{0,k} = 0$.
For each round $t = 1, 2, 3, \ldots$:
1. Play (follow the multiscale-entropy-regularized leader of the corrected losses)
\[
w_t = \arg\min_{w \in \mathcal{P}(K)} \langle w, L_{t-1} + \mu_{t-1} \rangle + D_{\eta_{t-1}}(w, u), \tag{5}
\]
where $D_\eta$ is the multiscale relative entropy given in (3).
2. Observe the loss $\ell_t$. Update $R_{t,k} = R_{t-1,k} + \langle w_t, \ell_t \rangle - \ell_{t,k}$ and $L_{t,k} = L_{t-1,k} + \ell_{t,k}$.
3. Compute $\Delta v_t$, the value $v \ge 0$ such that
\[
\Phi(R_t - \mu_{t-1} - \sigma^2 \eta_{t-1} v, \eta_{t-1}) = \Phi(R_{t-1} - \mu_{t-1}, \eta_{t-1}), \tag{6}
\]
where $\Phi$ is the potential function defined in (4) (the product $\sigma^2\eta_{t-1}$ is entrywise).
4. Compute $\Delta\mu_{t,k} = \sigma_k^2 \eta_{t-1,k} \Delta v_t$. Update $\mu_{t,k} = \mu_{t-1,k} + \Delta\mu_{t,k}$ and $v_t = v_{t-1} + \Delta v_t$.
5. Set the new learning rate $\eta_{t,k} = \eta_{0,k} H_k(v_t)$.

Figure 1: MUSCADA.

The shape of the corrections $\mu_t$. We designed MUSCADA to favor experts with low corrected regret $R_t - \mu_t$. For the sake of informal discussion, our goal is to obtain $\mu_{t,k} \approx \sigma_k\sqrt{v_t \ln(1/\pi_k)}$. The algorithm achieves this by additively correcting the regrets in each round. Indeed, from the analysis of entropy-regularized algorithms, one would expect learning rates of the shape $\eta_{t,k} \approx \frac{1}{\sigma_k}\sqrt{\ln(1/\pi_k)/v_t}$ to be optimal. With these learning rates in mind, the desired correction $\mu_t$ can be approximated using a Riemann-sum approximation of $\sqrt{v_t} = \int_0^{v_t} \frac{1}{2\sqrt{v}}\,dv$. Indeed, for the conjectured learning rates, our target $\mu_{t,k}$ satisfies $\mu_{t,k} \approx \sigma_k^2 \sum_{s \le t} \eta_{s-1,k}\Delta v_s$, where $\Delta v_t = v_t - v_{t-1}$. This implies that the choice $\Delta\mu_{t,k} = \sigma_k^2\eta_{t-1,k}\Delta v_t$ as our per-round additive correction is helpful for achieving our goal. We discuss our precise choice of learning rates after the formal statement of Proposition 2.2 below. A schematic implementation of one round of Figure 1 is sketched next.
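The following Python sketch (ours, not the authors' reference implementation; it assumes simple bisections suffice for both scalar searches, as Section 5 justifies) implements one round of Figure 1. It uses that the maximizer of (4) has the form $w_k = u_k e^{\eta_k(G_k - a)}$ for a normalizing offset $a$:

```python
import numpy as np

def solve_weights(G, eta, u):
    """Maximizer of <w, G> - D_eta(w, u) over the simplex:
    w_k = u_k exp(eta_k (G_k - a)), with a such that sum_k w_k = 1.
    The total mass is decreasing in a, so bisection applies."""
    lo = np.max(G + np.log(u) / eta)             # total mass >= 1 here
    hi = np.max(G + np.log(len(u) * u) / eta)    # total mass <= 1 here
    for _ in range(80):
        a = (lo + hi) / 2
        lo, hi = (a, hi) if np.sum(u * np.exp(eta * (G - a))) > 1 else (lo, a)
    return u * np.exp(eta * (G - lo))

def potential(G, eta, u):
    """Phi(G, eta) from (4), evaluated at its maximizer."""
    w = solve_weights(G, eta, u)
    return np.dot(w, G) - np.sum((w * np.log(w / u) - w + u) / eta)

def muscada_round(R, mu, v, eta, loss, sigma, u, eta0, H):
    # Step 1: FTRL on corrected losses; equivalent to (5) since shifting all
    # coordinates of G = R - mu by the learner's cumulative loss is absorbed in a.
    w = solve_weights(R - mu, eta, u)
    R_new = R + np.dot(w, loss) - loss               # step 2
    # Step 3: find dv >= 0 solving the potential equation (6) by bisection.
    target = potential(R - mu, eta, u)
    f = lambda dv: potential(R_new - mu - sigma**2 * eta * dv, eta, u) - target
    dv, hi = 0.0, 1.0
    if f(0.0) > 0:                # else dv = 0 (zero-regret rounds change nothing)
        while f(hi) > 0:
            hi *= 2               # Theorem 1.2 gives dv <= 4 for the tunings used
        lo = 0.0
        for _ in range(60):
            dv = (lo + hi) / 2
            lo, hi = (dv, hi) if f(dv) > 0 else (lo, dv)
    mu_new = mu + sigma**2 * eta * dv                # step 4
    v_new = v + dv
    eta_new = eta0 * H(v_new)                        # step 5 (H vectorized over experts)
    return R_new, mu_new, v_new, eta_new, w
```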
Negativity of $\Phi$. Our regret bounds are a direct consequence of the negativity of the potential $t \mapsto \Phi_t$. Indeed, by its definition, $\Phi_0 \le 0$, and, because of our choice of nonincreasing learning rates and corrections, the change in potential $\Delta\Phi_t = \Phi_t - \Phi_{t-1}$ can be bounded as
\[
\Delta\Phi_t \le \Phi(R_t - \mu_t, \eta_{t-1}) - \Phi(R_{t-1} - \mu_{t-1}, \eta_{t-1}) = 0,
\]
where the last equality follows from (6), the choice of corrections $\mu_t$. This implies the following lemma, of which we give a more general proof in Section C.1.

Lemma 2.1. The potential $t \mapsto \Phi_t$ starts at $\Phi_0 \le 0$ and is decreasing for $t \ge 0$.

Once we prove that the potential $\Phi_t$ is negative, we are ready to derive regret guarantees for MUSCADA. The maximal nature of the potential $t \mapsto \Phi_t$ and its nonpositivity together imply that, simultaneously for all distributions $p \in \mathcal{P}(K)$,
\[
\langle p, R_t - \mu_t \rangle \le D_{\eta_t}(p, u). \tag{7}
\]
We choose $p$ concentrated on each expert $k \in K$ to deduce the next proposition (proof in Section C.1).

Proposition 2.2. Assume that the learning rates $t \mapsto \eta_t$ are decreasing. MUSCADA guarantees that, for any $t = 1, 2, 3, \ldots$ and all $k \in K$,
\[
R_{t,k} \le \mu_{t,k} + \frac{\ln(1/u_k) - 1}{\eta_{t,k}} + \sum_{j \in K} \frac{u_j}{\eta_{t,j}}, \tag{8}
\]
where $\mu_{t,k} = \sigma_k^2 \sum_{s \le t} \eta_{s-1,k}\Delta v_s$. Furthermore, for $\eta_{t,k} = \eta_{0,k}H_k(v_t)$ as in Figure 1, $\mu_t$ satisfies
\[
\mu_{t,k} \le \sigma_k^2\eta_{0,k}\int_0^{v_t} H_k(x)\,dx + \sigma_k^2(\eta_{0,k} - \eta_{t,k})\max_{s \le t}\Delta v_s. \tag{9}
\]

Choice of learning rates. Proposition 2.2 guides us in choosing the learning rates presented in Figure 2. The starting value of the learning rates influences our ability to control $v_t$ in terms of the variance of the losses of the algorithm, while their behavior for large $v_t$ determines the long-term growth of the regret bounds. The learning rates presented in Figure 2 interpolate smoothly between these two regimes by taking the form $\eta^{(1)}_{t,k} = \eta_{0,k}H_{1,k}(v_t)$ and $\eta^{(2)}_{t,k} = \eta_{0,k}H_{2,k}(v_t)$. Here, the starting learning rates are set to $\eta_{0,k} = 1/(2\sigma_{\max})$. The functions $H_{1,k}, H_{2,k} \le 1$ decrease monotonically from their initial values $H_{1,k}(0) = H_{2,k}(0) = 1$ in such a way that, as $v_t \to \infty$, $\eta^{(1)}_{t,k}$ scales as $\frac{1}{\sigma_k}\sqrt{\ln(1/u_k)/v_t}$ and $\eta^{(2)}_{t,k}$ as $\frac{1}{\sigma_k}\sqrt{(\ln(1/\pi_k) + \ln v_t)/v_t}$. The asymptotic expression for $\eta^{(1)}_{t,k}$ is reminiscent of the optimal learning rates for the Hedge algorithm, with the number of rounds $t$ replaced by the refined $v_t$ and the uniform $\ln K$ replaced by $\ln(1/\pi_k)$. Finally, with the Riemann-sum bound (9) from Proposition 2.2 in mind, the learning rates were chosen as the derivatives of functions that will become the dominant term in the regret guarantees.

Let $\pi \in \mathcal{P}(K)$ be a probability distribution on $K$ experts.

Tuning 1. Requires $\sigma_{\min} > 0$. Set $u_k = \pi_k\frac{\sigma_{\min}}{\sigma_k}$, $\eta_{0,k} = \frac{1}{2\sigma_{\max}}$, $\gamma_k = 8\frac{\sigma_{\max}^2}{\sigma_k^2}\ln(1/u_k)$ and
\[
H_{1,k}(v) = \frac{d}{dv}\,\frac{v}{\sqrt{1 + v/\gamma_k}} = \frac{v/\gamma_k + 2}{2(1 + v/\gamma_k)^{3/2}}.
\]

Tuning 2. Set $u_k = \pi_k$, $\eta_{0,k} = \frac{1}{2\sigma_{\max}}$, $\alpha_k = 32\frac{\sigma_{\max}^2}{\sigma_k^2}$, $\gamma_k = \alpha_k\ln(1/u_k)$ and
\[
H_{2,k}(v) = \frac{d}{dv}\sqrt{\alpha_k^2\Bigl\{\Bigl(1 + \tfrac{v}{\alpha_k}\Bigr)\ln\Bigl(1 + \tfrac{v}{\alpha_k}\Bigr) - \tfrac{v}{\alpha_k}\Bigr\} + \tfrac{v^2}{2(1 + v/(2\gamma_k))}}
= \frac{\alpha_k\ln(1 + v/\alpha_k) + \tfrac{1}{2}\,\tfrac{2v + v^2/(2\gamma_k)}{(1 + v/(2\gamma_k))^2}}{2\sqrt{\alpha_k^2\bigl\{(1 + v/\alpha_k)\ln(1 + v/\alpha_k) - v/\alpha_k\bigr\} + \tfrac{v^2}{2(1 + v/(2\gamma_k))}}}.
\]
If, for some $k$, $\sigma_k = 0$, define $H_{2,k}$ to be the limit value $\lim_{\sigma_k \to 0} H_{2,k}(v) = 1$.

Figure 2: Tunings. (A numerical sketch of $H_{1,k}$ and $H_{2,k}$ follows.)
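As a quick sanity check (ours; the parameter values below are arbitrary), the tuning functions of Figure 2 can be implemented directly and verified to start at $1$ and decrease:

```python
import numpy as np

def H1(v, gamma):
    # Tuning 1: derivative of v / sqrt(1 + v/gamma); equals 1 at v = 0.
    return (v / gamma + 2) / (2 * (1 + v / gamma) ** 1.5)

def H2(v, alpha, gamma):
    # Tuning 2: derivative of sqrt(F(v)) for the primitive F below; tends to 1 as v -> 0.
    if v == 0:
        return 1.0
    F = alpha**2 * ((1 + v/alpha) * np.log1p(v/alpha) - v/alpha) \
        + v**2 / (2 * (1 + v / (2*gamma)))
    dF = alpha * np.log1p(v/alpha) + (2*v + v**2/(2*gamma)) / (2 * (1 + v/(2*gamma))**2)
    return dF / (2 * np.sqrt(F))

sigma_max, sigma_k, pi_k, sigma_min = 1.0, 0.1, 0.02, 0.01   # arbitrary example values
u_k = pi_k * sigma_min / sigma_k                              # Tuning 1 prior weight
gamma1 = 8 * (sigma_max / sigma_k)**2 * np.log(1 / u_k)
alpha = 32 * (sigma_max / sigma_k)**2
gamma2 = alpha * np.log(1 / pi_k)
for v in [0.0, 1.0, 10.0, 1e4]:
    print(v, H1(v, gamma1), H2(v, alpha, gamma2))             # both decrease from 1
```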
Tuned regret bounds. The learning rates from Figure 2 can be readily used in Proposition 2.2 to derive regret guarantees for MUSCADA. However, to facilitate interpretation, we bound the learning rates and their reciprocals in order to obtain the regret bounds contained in the following proposition (proof in Appendix C.2). After its statement, we prove Theorem 1.1 from the introduction.

Proposition 2.3. Let $\pi$ be a probability distribution on $K$. MUSCADA run with Tuning 1 depicted in Figure 2 guarantees that, for any $t = 1, 2, \ldots$,
\[
R_{t,k} \le 2\sigma_k\sqrt{2v_t\ln(1/u_k)} + c_{\sigma,\pi}\sigma_{\min}\sqrt{2v_t} + 8\sigma_{\max}\ln(1/u_k) + 4\sigma_{\max} + \frac{\sigma_k}{2}\max_{s \le t}\Delta v_s, \tag{10}
\]
where the constant $c_{\sigma,\pi} = \sum_{k \in K} \pi_k/\sqrt{\ln(1/u_k)}$ and $u_k = \pi_k\sigma_{\min}/\sigma_k$. MUSCADA run with Tuning 2 depicted in Figure 2 guarantees that, for any $t = 1, 2, \ldots$,
\[
R_{t,k} \le 2\sigma_k\sqrt{2v_t\Bigl(\ln\Bigl(1 + \frac{\sigma_k^2 v_t}{32\sigma_{\max}^2}\Bigr) + \ln(1/\pi_k)\Bigr)} + \sigma_k\ln(1/\pi_k)\,Z_k + \sum_{j \in K} \pi_j\sigma_j Z_j + \frac{\sigma_k}{2}\max_{s \le t}\Delta v_s, \tag{11}
\]
where
\[
Z_k = \sqrt{\frac{v_t}{1 + \frac{\sigma_k^2 v_t}{32\sigma_{\max}^2}}}\;\cdot\;\frac{\min\bigl\{\ln(1/\pi_k),\,\frac{\sigma_k^2 v_t}{16\sigma_{\max}^2}\bigr\}}{1 + \frac{\sigma_k^2 v_t}{32\sigma_{\max}^2}}.
\]

Proof of Main Theorem 1.1. With Proposition 2.3 at hand, we can prove the claims made in Section 1.1. Use the fact that $\sigma_{\min} \le \sigma_k$ to conclude from (10) that, as $t \to \infty$,
\[
R_{t,k} \le 2\sigma_k\sqrt{2v_t\ln(1/u_k)} + 2c_{\sigma,\pi}\sigma_k\sqrt{2v_t} + O(1).
\]
We can bound $c_{\sigma,\pi}/\sqrt{\ln(1/u_k)} \le 1/\ln(1/\pi_{\max})$, where $\pi_{\max} = \max_{k \in K}\pi_k$. Consequently, $R_{t,k} \le 2\sigma_k\{1 + 1/(2\ln(1+\varepsilon))\}\sqrt{2v_t\ln(1/u_k)} + O(1)$ as $t \to \infty$ any time that $\pi_{\max} = 1 - \varepsilon$. This coincides with (1). Similarly, (11) implies (2).

2.1 Closed-form solutions in the single-scale uniform-prior case

To help interpretation and to illustrate the challenges of the multiscale problem, we instantiate MUSCADA in a situation where all calculations can be carried out in closed form: when all scales are the same and equal to $\sigma$, and the initial weights $\pi_{\mathrm{Unif}}$ are uniform, $\pi_{\mathrm{Unif},k} = 1/K$. This is the setting in which AdaHedge [De Rooij et al., 2014] operates. In this case, the learning rates and corrections of MUSCADA are the same for all experts: $\eta_{t,k} = \eta_t$ and $\mu_{t,k} = \mu_t$. The potential $\Phi_t$ and the corrections $\Delta\mu_t$ take the familiar form
\[
\Phi_t = \frac{1}{\eta_t}\ln\Bigl(\frac{1}{K}\sum_{k \in K} e^{\eta_t(R_{t,k} - \mu_{t,k})}\Bigr), \quad\text{and}\quad \Delta\mu_t = \frac{1}{\eta_{t-1}}\ln\sum_{k \in K} w_{t,k}e^{\eta_{t-1}(\langle w_t, \ell_t \rangle - \ell_{t,k})}.
\]
These two quantities play a central role in the analysis of AdaHedge, where De Rooij et al. [2014] called $\Delta\mu_t$ the mixability gap, the difference between the average $\langle w_t, \ell_t \rangle$ and the mixed average $-\frac{1}{\eta_{t-1}}\ln\sum_{k \in K} w_{t,k}e^{-\eta_{t-1}\ell_{t,k}}$. The main quantity in our analysis, $\Delta v_t$, becomes
\[
\Delta v_t = \frac{1}{\eta_{t-1}^2\sigma^2}\ln\sum_{k \in K} w_{t,k}e^{\eta_{t-1}(\langle w_t, \ell_t \rangle - \ell_{t,k})}.
\]
Using well-known estimates for cumulant generating functions, $\Delta v_t$ can be bounded by the ratio $\mathrm{var}_{w_t}(\ell_t)/\sigma^2$. Indeed, Hoeffding's inequality implies the worst-case bound $\Delta v_t \le 1/2$; Bernstein's, the second-order $\Delta v_t \lesssim \mathrm{var}_{w_t}(\ell_t)/\sigma^2$. Since it is $v_t$ that appears in the regret bounds of Proposition 2.3, they are a refinement over those of AdaHedge.¹ Additionally, the present analysis yields improvements that are apparent in lower-order terms. Indeed, the last two terms in the regret bound (8) in Proposition 2.2 vanish, and the analysis used in the proof of Proposition 2.3, with $\eta_0 = 2/\sigma$ and the instantiation of $H_1$ from Figure 2,
\[
H_1(x) = \frac{x/\ln K + 2}{2(1 + x/\ln K)^{3/2}},
\]
gives the regret bound
\[
R_t \le \begin{cases} c_1\sigma v_t + c_2\sigma\ln K + \sigma/2 & \text{if } v_t \le \ln K,\\ 2\sigma\sqrt{2v_t\ln K} + \sigma/2 & \text{if } v_t > \ln K,\end{cases}
\]
with $c_1 = 3/\sqrt{2}$ and $c_2 = 1/\sqrt{2}$.

¹Our algorithm with learning-rate tuning function $H(v) = \sqrt{1/(4v)}$ comes closest to AdaHedge.

Unfortunately, multiscale analogs of Bernstein's and Hoeffding's inequalities for $\Delta v_t$ are not available; considerably more technical work needs to be carried out to prove Theorem 1.2. A multiscale analog of Bernstein's estimate for $\Delta v_t$ is only available when all the learning rates are smaller than $1/(2\sigma_{\max})$ (see the proof of Theorem 1.2 in Appendix G). The closed forms above are easy to check numerically; a sketch follows.
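The following snippet (ours) evaluates the closed forms above for one round, and checks the identity $\Delta\mu_t = \sigma^2\eta_{t-1}\Delta v_t$ together with Hoeffding's bound $\Delta v_t \le 1/2$:

```python
import numpy as np

rng = np.random.default_rng(1)
K, sigma = 5, 1.0
eta = 0.3                                    # some current learning rate eta_{t-1}
w = rng.dirichlet(np.ones(K))                # current weights w_t
loss = rng.uniform(-sigma, sigma, size=K)    # one round of losses

h = np.dot(w, loss)                                          # learner's loss <w_t, l_t>
mix_gap = h + np.log(np.dot(w, np.exp(-eta * loss))) / eta   # mixability gap (Delta mu_t)
dv = np.log(np.dot(w, np.exp(eta * (h - loss)))) / (eta**2 * sigma**2)  # Delta v_t

assert abs(mix_gap - eta * sigma**2 * dv) < 1e-12   # Delta mu_t = sigma^2 eta Delta v_t
assert dv <= 0.5                                    # Hoeffding: Delta v_t <= 1/2
print(dv, np.dot(w, (loss - h)**2) / sigma**2)      # Bernstein-style comparison
```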
3 Multiscale Stochastic Luckiness

In this section we show, under easiness conditions, that the expected pseudoregret of MUSCADA is constant. Assume that the loss vectors $\ell_1, \ell_2, \ldots$ are i.i.d. and generated according to a distribution $P$ that satisfies Massart's easiness condition (see Definition 1.3). For Tuning 1, assume that the minimum scale among experts $\sigma_{\min}$ is strictly positive. The analysis technique in this case is similar to that of Koolen et al. [2016], with an extra step. A use of Theorem 1.2 shows that $v_t$ can be estimated in terms of $\mathrm{var}_{w_t}(\ell_t)$. This estimate possibly incurs a multiplicative factor that can be as high as $1/\sigma_{\min}^2$. There are examples for which this constant is necessary (not shown). After this, standard arguments show that the expected pseudoregret is constant. See Appendix E for proofs.

Theorem 3.1. Under Massart's condition and using Tuning 1 from Figure 2, the expected pseudoregret of MUSCADA is bounded by a constant in the number of rounds. Specifically, for any $t \ge 0$, $\mathbb{E}_P[R_{t,k^\star}] \le a^2 c_M + b$, where
\[
a = \sqrt{2\max_{i,j \in K}\Bigl\{\frac{\sigma_i}{\sigma_j}\,\frac{\ln(1/\pi_i) + \ln(\sigma_i/\sigma_{\min})}{\ln(1/\pi_j) + \ln(\sigma_j/\sigma_{\min})}\Bigr\}}\,\Bigl(4\sigma_{k^\star}\sqrt{2\ln(1/u_{k^\star})} + 2\sqrt{2}\,c_{\sigma,\pi}\sigma_{\min}\Bigr)
\]
and $b = 8\sigma_{\max}\ln(1/u_{k^\star}) + 4\sigma_{\max} + 2\sigma_{k^\star}$.

For Tuning 2, where we do not assume that $\sigma_{\min} > 0$, still $\mathbb{E}_P[R_{t,k^\star}] \lesssim 1$, via a different proof technique. Using the expression for the weights of the algorithm, we show that they concentrate on the best expert $k^\star$. The analysis here is similar to that of Mourtada and Gaïffas [2019], but the lack of an expression for the normalizing $a^\star_t$ presents an additional technical difficulty. The result is the following theorem.

Theorem 3.2. Let $d_k = \mathbb{E}_P[\ell_{t,k} - \ell_{t,k^\star}]$ and assume that $\min_{k \ne k^\star} d_k > 0$. Using Tuning 2 in Figure 2, MUSCADA guarantees constant expected pseudoregret. Specifically,
\[
\mathbb{E}_P[R_{t,k^\star}] \le \sum_{k \in K} f(d_k), \quad\text{where } f(d) = O\Bigl(\frac{\sigma_{\max}^2}{d}\ln\frac{\sigma_{\max}^2}{d^2}\Bigr) \text{ as } d \to 0.
\]

Standard modifications of the arguments presented may be used to prove that the pseudoregret is constant with $P$-high probability (not shown).

4 Optimistic MUSCADA

In this section we present an optimistic variant of MUSCADA. Suppose that, before round $t$, we count on guesses $m_t$ for what $\ell_t$ will be. Assume that $m_t$ is of the same scale as $\ell_t$, that is, $|m_{t,k}| \le \sigma_k$. In particular, this entails that $|\ell_{t,k} - m_{t,k}| \le 2\sigma_k$. A modification of the algorithm of Figure 1 puts these guesses to good use. These modifications allow for regret guarantees similar to those contained in Proposition 2.3, but in this case $\Delta v^\star_t \lesssim \mathrm{var}_{\tilde w^\star_t}(\ell_t - m_t)/\langle \tilde w^\star_t, \sigma^2 \rangle$, where the superscript $\star$ signals the optimistic analogs of the quantities from MUSCADA. These modifications are shown in Figure 3 and the regret bounds in the following proposition (proofs in Appendix D).

Proposition 4.1. If $t \mapsto v^\star_t$ is the variance process defined by Optimistic MUSCADA in Figure 3, the same regret bounds presented in Proposition 2.3 hold with two modifications: $v^\star_t$ instead of $v_t$, and all scales doubled, that is, $2\sigma$ instead of $\sigma$. Furthermore, for each $t = 1, 2, \ldots$,
\[
\Delta v^\star_t \le \frac{4\,\mathrm{var}_{\tilde w^\star_t}(\ell_t - m_t)}{\langle \tilde w^\star_t, \sigma^2 \rangle}, \quad\text{hence } v^\star_t \le 4t, \qquad\text{where } \tilde w^\star_{t,k} \propto w^\star_{t,k}\eta_{t-1,k}.
\]

1⋆. Compute the guess $m_t$ and play
\[
w^\star_t = \arg\min_{w \in \mathcal{P}(K)} \langle w, L_{t-1} + m_t + \mu^\star_{t-1} \rangle + D_{\eta_{t-1}}(w, u).
\]
3⋆. Let $\Delta v^\star_t$ be the value $v \ge 0$ such that
\[
\Phi(R^\star_t - \mu^\star_{t-1} - \eta_{t-1}\sigma^2 v, \eta_{t-1}) = \Phi\bigl(R^\star_{t-1} + \langle w^\star_t, m_t \rangle - m_t - \mu^\star_{t-1}, \eta_{t-1}\bigr). \tag{12}
\]
Tunings 1⋆ and 2⋆. As in Figure 2, but with halved starting learning rate $\eta_{0,k} = \frac{1}{4\sigma_{\max}}$.

Figure 3: Optimistic MUSCADA, given as an update w.r.t. Figure 1.

5 Computation

At each round, MUSCADA requires two computations. We now argue that both can be executed to machine precision in $O(K)$ time. First, computing the weights (5) given the losses $L_{t-1}$ and correction terms $\mu_{t-1}$ can be reduced, by Lemma F.6, to a single scalar convex minimization problem. Cancelling the derivative of the objective amounts to searching for the normalizing offset $a_t$. To that end, binary search to machine precision takes $O(K)$ time per round; a Newton-type variant is sketched below.
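As an illustration (ours; the function name is hypothetical), this scalar search can also be written with Newton's method, exploiting that the normalization equation has a convex, decreasing residual in the offset:

```python
import numpy as np

def normalizing_offset(G, eta, u, iters=40):
    """Newton search for the offset a solving f(a) = sum_k u_k e^{eta_k (G_k - a)} - 1 = 0.
    f is convex and decreasing in a; starting at a point with f(a) >= 0, the
    Newton iterates increase monotonically to the root."""
    a = np.max(G + np.log(u) / eta)      # largest summand equals 1 here, so f(a) >= 0
    for _ in range(iters):
        z = u * np.exp(eta * (G - a))
        f, df = z.sum() - 1.0, -(eta * z).sum()
        a -= f / df
    return a                              # weights are then u * exp(eta * (G - a))
```

Here $G$ stands for the negated corrected losses $-(L_{t-1} + \mu_{t-1})$, so that $w_{t,k} = u_k e^{\eta_{t-1,k}(G_k - a)}$ matches the expression for the weights; the same routine, inside an outer root search over $v$, yields $\Delta v_t$ as described next.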
Notice that this also allows us to compute the potential value. Second, for computing the variance contribution (6), we observe that the side of (6) containing $v$ is decreasing in $v$. Since the potential can be computed in $O(K)$ time, we can use an outer binary search to compute $\Delta v_t$ to machine precision in $O(K)$ time as well. Alternatively, Newton's method may be employed; both of the previous problems require finding the root of a convex function. When deferring to a convex optimization library, a convenient expression is the jointly convex minimization form (see Lemma F.6)
\[
\Delta v_t = \inf_{a,\,v}\; v \quad\text{subject to}\quad a + \sum_{k \in K} w_{t,k}\,\frac{e^{\eta_{t-1,k}(\langle w_t, \ell_t \rangle - \ell_{t,k} - a) - \eta_{t-1,k}^2\sigma_k^2 v} - 1}{\eta_{t-1,k}} \le 0.
\]

6 Experiments on Synthetic Data

We investigate the performance of our multiscale method in two experiments: one illustrating the performance of MUSCADA under Massart's condition, the other solving multiscale two-player zero-sum games.

The aim of the first experiment is to compare the performance of MUSCADA on easy and hard stochastic data sequences. To this end, we compared a sequence of hard stochastic data with no gap against easy data sampled i.i.d. from a distribution satisfying Massart's condition. We witnessed constant regret for the easy data, as shown in Figure 4 (left). We take $K = 50$ experts and set $\sigma_k = 1/k$ for each $k \in K$. To generate our data, we fix means $\lambda_k \in [-\sigma_k, \sigma_k]$ and generate binary expert losses $\ell_{t,k} \in \{-\sigma_k, +\sigma_k\}$, independently between rounds and experts, with probability $P\{\ell_{t,k} = \sigma_k\} = \frac{\sigma_k + \lambda_k}{2\sigma_k}$. For the hard case, we set $\lambda_k = 0$ for all $k$. For the lucky case, we set $\lambda_2 = -1/5$ instead, so that expert 2 is the best expert by a gap. Generating this figure with the code in the supplementary material takes 3 seconds on an Intel i7-7700 processor. A sketch of this data-generating process is given at the end of this section.

The aim of the second experiment is to show the performance of MUSCADA in solving multiscale zero-sum games. Here, the payoff matrix is unknown, but row and column scales are available and vastly different. As detailed in Appendix A, we run two instances of appropriately tuned Optimistic MUSCADA against each other. As shown in Figure 4 (right), the pair of time-averaged iterates converges to the saddle point with a suboptimality gap of order $\sigma_{\mathrm{real}}/t$ instead of the worst-case $\sigma_{\max}/t$, where $\sigma_{\mathrm{real}}$ is the maximum range within the support of the saddle point. In Appendix A, we conjecture that this rate holds for any such game and prove a weaker result: without optimism, the slower but scale-adaptive rate $\sigma_{\mathrm{real}}/\sqrt{t}$ is achieved.

Figure 4: Left: empirical mean and quartiles of 2000 realizations of the regret $t \mapsto R_{t,k^\star}$ of MUSCADA. For an easy i.i.d. Massart distribution, the regret is constant; for a hard distribution without a gap, it is $\Omega(\sqrt{t})$. Right: Optimistic MUSCADA (solid red) achieves an iterate-average saddle-point gap of order $\sigma_{\mathrm{real}}/t$, where $\sigma_{\mathrm{real}} = \sigma_{\max}/100$ is the relevant scale of the Nash equilibrium. Other methods scale as $\sigma_{\max}/t$.
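For concreteness, a sketch (ours, not the supplementary code) of the first experiment's data-generating process; the sign of the lucky mean $\lambda_2 = -1/5$ encodes that lower losses are better:

```python
import numpy as np

# K = 50 experts with scales 1/k; binary losses in {-sigma_k, +sigma_k} with means lambda_k.
rng = np.random.default_rng(0)
K, T = 50, 10_000
sigma = 1 / np.arange(1, K + 1)

def sample_losses(lam):
    p = (sigma + lam) / (2 * sigma)               # P{loss = +sigma_k}
    signs = 2 * (rng.random((T, K)) < p) - 1      # +1 w.p. p, else -1
    return signs * sigma

hard_losses = sample_losses(np.zeros(K))          # no gap: lambda_k = 0 for all k
lam = np.zeros(K)
lam[1] = -1/5                                     # lucky: expert 2 best by a gap
lucky_losses = sample_losses(lam)
```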
7 Discussion

We developed a new algorithm for multiscale online learning that is both worst-case safe and achieves constant pseudoregret in stochastic lucky cases. Our method is a refinement of the Follow-the-Regularized-Leader template with a weighted entropy. The main innovation is in the correction terms added to the losses, which are the tightest the technique admits. This suggests that these variance-like terms are in fact intrinsic to the problem of obtaining scale-dependent regret bounds. Lastly, we relate this newfound variance to the variance asked for by Freund [2016], we comment on the advantage of second-order guarantees over zeroth-order ones, and we state an open problem.

Quantile bounds and solving Freund's problem. Freund [2016] asked whether quantile adaptivity and variance adaptivity are compatible, that is, whether one can have $\langle p, R_t \rangle \lesssim \sqrt{\sum_{s \le t} \mathrm{var}_{w_s}(\ell_s)}$ for all comparator distributions $p \in \mathcal{P}(K)$ simultaneously. Even though our tuning of $\eta_t$ does not yield quantile bounds, these can be added employing a now-standard method [Koolen and Van Erven, 2015]. Namely, instead of only including every expert with a private learning rate tuned to its prior complexity level (the typical $\ln K$ or $\ln(1/\pi_k)$ term), we include multiple copies of each expert, each with a learning rate tuned to a smaller complexity level. We then start from (7) with comparator distribution $p$ concentrated on the $\varepsilon$-quantile of interest and carry out all further steps (from Proposition 2.2 on), ending up with the quantile regret bound $\langle p, R_t \rangle \lesssim \max_{k: p_k > 0} \sigma_k\sqrt{v_t(\ln C + D_{\eta_0}(p, u))}$, where $C$ is the number of learning rates thus created. As these learning rates can be exponentially spaced in an interval of width $\ln K$, $C$ is of order $\ln\ln K$.

Does this procedure answer Freund's question? For our notion of variance, $v_t$, which our results suggest is a rather useful notion, the answer is yes. However, to relate $v_t$ to $\mathrm{var}_{w_t}(\ell_t)$, we incur a multiplicative ratio $\eta_{t,\max}/\eta_{t,\min}$, which, for the quantile case, is of order $\ln K$, turning the prior-in-the-square-root bound into a prior-outside-the-square-root bound. The latter was already achievable by not tuning $\eta$ to the prior complexities at all. This problem does not arise in the same-scale uniform-prior case; there, $v_t$ is bounded by a small multiple of $\mathrm{var}_{w_t}(\ell_t)$ [De Rooij et al., 2014]. Note that this problem is present even when $K$ is fixed while $t$ grows, which is narrowly outside the scope of the impossibility results of Marinov and Zimmert [2021]. This discussion sheds light from another angle on why Freund's problem is hard; we present a desirable multiscale alternative.

Luckiness, gap, and Massart's condition. We now address the advantage of MUSCADA's refined second-order measure of time $v_t$ over the zeroth-order number of rounds $t$. Multiscale zeroth-order regret bounds (growing with $t$) can be guaranteed either by tuning MUSCADA crudely to a constant multiple of $t$ or by building an anytime improvement of the algorithm of Bubeck et al. [2019], also tuned to $t$. Both $t$-tuned and $v_t$-tuned algorithms have constant expected pseudoregret in stochastic lucky cases, but the constants can be widely different. Indeed, the constant for $t$-tuned algorithms scales with the inverse $1/d_{\min}$ of the gap $d_{\min} = \min_{k \ne k^\star} \mathbb{E}[\ell_{t,k} - \ell_{t,k^\star}]$, while the constant for $v_t$-tuned algorithms scales with the constant $c_M$ from Massart's condition (see Definition 1.3). The difference stems from the fact that $c_M$ is at most $1/d_{\min}$, but it can be arbitrarily smaller.

This separation appears to be fundamental. In the single-scale uniform-prior case, the above $t$-tuned algorithms are closely related to Decreasing Hedge [Mourtada and Gaïffas, 2019], just as MUSCADA is related to AdaHedge (see Section 2.1). Mourtada and Gaïffas [2019] show that, in the single-scale case, even under Massart's condition with $c_M = 1$, Decreasing Hedge and, consequently, Bubeck et al.'s algorithm with decreasing learning rates, has expected pseudoregret $\mathbb{E}[R^B_{t,k^\star}] \gtrsim 1/d_{\min}$. If the smallest scale $\sigma_{\min} > 0$, by taking $d_{\min}$ small, this lower bound can be made arbitrarily worse than the guarantee of MUSCADA, $\mathbb{E}[R^{\mathrm{MUSCADA}}_{t,k^\star}] \lesssim c_M + 1$, from Theorem 3.1.

Open problem. Our ability to incorporate an arbitrary prior suggests that the results should extend to countably many experts. However, the current techniques do break down. When $\max_{k \in \mathbb{N}} \sigma_k < \infty$, MUSCADA with Tuning 1 (if $\inf_{k \in \mathbb{N}} \sigma_k > 0$) or Tuning 2 would still deliver the worst-case bound. Yet our luckiness result currently requires $\max_{k,l,t} \frac{\eta_{t,k}}{\eta_{t,l}}\sigma_l^2 < \infty$.
Even with a common scale $\sigma$, this is never the case, due to the dependence of $\eta_t$ on the prior $\pi$, which is necessarily decreasing. Is luckiness actually possible, for example, in the online-learning analog of the elegant challenge example presented by Talagrand [2014, Chapter 2]?

References

Sébastien Bubeck, Nikhil R. Devanur, Zhiyi Huang, and Rad Niazadeh. Multi-scale online learning: Theory and applications to online auctions and pricing. Journal of Machine Learning Research, 20(62):1–37, 2019.

Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Impossible tuning made possible: A new expert algorithm and its applications. arXiv:2102.01046 [cs], June 2021.

Ashok Cutkosky and Francesco Orabona. Black-box reductions for parameter-free online learning in Banach spaces. In Conference on Learning Theory, pages 1493–1529. PMLR, July 2018.

Dylan J. Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems 30, pages 6020–6030. Curran Associates, Inc., 2017.

Yoav Freund. Open problem: Second order regret bounds based on scaling time. In Conference on Learning Theory, pages 1651–1654, 2016.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

Elad Hazan. Introduction to online convex optimization. arXiv:1909.05207 [cs, math, stat], September 2019.

Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In Proceedings of the Thirty-Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 2388–2422. PMLR, August 2021.

Wouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175. PMLR, June 2015.

Wouter M. Koolen, Peter Grünwald, and Tim van Erven. Combining adversarial guarantees and stochastic fast rates in online learning. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, February 1994.

Teodor Vanislavov Marinov and Julian Zimmert. The Pareto frontier of model selection for general contextual bandits. In Advances in Neural Information Processing Systems, 2021.

Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the Hedge algorithm in the stochastic regime. Journal of Machine Learning Research, 20(83):1–28, 2019.

Alexander Rakhlin and Karthik Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. In Conference on Learning Theory, pages 1704–1722. PMLR, June 2017.

Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(37):1281–1316, 2014.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
Michel Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014. ISBN 978-3-642-54074-5.