# parameterfree_online_learning_via_model_selection__eab7e923.pdf

Parameter-Free Online Learning via Model Selection

Dylan J. Foster Cornell University

Satyen Kale Google Research

Mehryar Mohri NYU and Google Research

Karthik Sridharan

Cornell University

We introduce an efﬁcient algorithmic framework for model selection in online learning, also known as parameter-free online learning. Departing from previous work, which has focused on highly structured function classes such as nested balls in Hilbert space, we propose a generic meta-algorithm framework that achieves online model selection oracle inequalities under minimal structural assumptions. We give the ﬁrst computationally efﬁcient parameter-free algorithms that work in arbitrary Banach spaces under mild smoothness assumptions; previous results applied only to Hilbert spaces. We further derive new oracle inequalities for matrix classes, non-nested convex sets, and Rd with generic regularizers. Finally, we generalize these results by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. These results are all derived through a uniﬁed meta-algorithm scheme using a novel multi-scale algorithm for prediction with expert advice based on random playout, which may be of independent interest.

1 Introduction

A key problem in the design of learning algorithms is the choice of the hypothesis set F. This is known as the model selection problem. The choice of F is driven by inherent trade-offs. In the statistical learning setting, this can be analyzed in terms of the estimation and approximation errors. A richer or more complex F helps better approximate the Bayes predictor (smaller approximation error). On the other hand, a hypothesis set that is too complex may have too large a VC-dimension or have unfavorable Rademacher complexity, thereby resulting in looser guarantees on the difference between the loss of a hypothesis and that of the best-in class (large estimation error).

In the batch setting, this problem has been extensively studied with the main ideas originating in the seminal work of [40] and [39] and the principle of Structural Risk Minimization (SRM). This is typically formulated as follows: let (Fi)i N be an inﬁnite sequence of hypothesis sets (or models); the problem consists of using the training sample to select a hypothesis set Fi with a favorable estimation-approximation trade-off and choosing the best hypothesis f in Fi.

If we had access to a hypothetical oracle informing us of the best choice of i for a given instance, the problem would reduce to the standard one of learning with a ﬁxed hypothesis set. Remarkably though, techniques such as SRM or similar penalty-based model selection methods return a hypothesis f that enjoys ﬁnite-sample learning guarantees that are almost as favorable as those that would be obtained had an oracle informed us of the index i of the best-in-class classiﬁer s hypothesis set [39; 13; 36; 21; 4; 24]. Such guarantees are sometimes referred to as oracle inequalities. They can be derived even for data-dependent penalties [21; 4; 3].

Such results naturally raise the following questions in the online setting: can we develop an analogous theory of model selection in online learning? Can we design online algorithms for model selection with solutions beneﬁting from strong guarantees, analogous to the batch ones? Unlike the statistical setting, in online learning one cannot split samples to ﬁrst learn the optimal predictor within each subclass and then later learn the optimal subclass choice.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A series of recent works on online learning provide some positive results along that direction. On the algorithmic side, [25; 27; 30; 31] present solutions that efﬁciently achieve model selection oracle inequalities for the important special case where F1,F2,... is a sequence of nested balls in a Hilbert space. On the theoretical side, a different line of work focusing on general hypothesis classes [14] uses martingale-based sequential complexity measures to show that, information-theoretically, one can obtain oracle inequalities in the online setting at a level of generality comparable to that of the batch statistical learning. However, this last result is not algorithmic.

The ﬁrst approach that a familiar reader might think of for tackling the online model selection problem is to run for each i an online learning algorithm that minimizes regret against Fi, and then aggregate over these algorithms using the multiplicative weights algorithm for prediction with expert advice. This would work if all the losses or experts considered were uniformly bounded by a reasonably small quantity. However, in many reasonable problems particularly those arising in the context of online convex optimization the losses of predictors or experts for each Fi may grow with i. Using simple aggregation would scale our regret with the magnitude of the largest Fi and not the i we want to compare against. This is the main technical challenge faced in this context, and one that we fully address in this paper.

Our results are based on a novel multi-scale algorithm for prediction with expert advice. This algorithm works in a situation where the different experts losses lie in different ranges, and guarantees that the regret to each individual expert is adapted to the range of its losses. The algorithm can also take advantage of a given prior over the experts reﬂecting their importance. This general, abstract setting of prediction with expert advice yields online model selection algorithms for a host of applications detailed below in a straightforward manner.

First, we give efﬁcient algorithms for model selection for nested linear classes that provide oracle inequalities in terms of the norm of the benchmark to which the algorithm s performance is compared. Our algorithm works for any norm, which considerably generalizes previous work [25; 27; 30; 31] and gives the ﬁrst polynomial time online model selection for a number of online linear optimization settings. This includes online oracle inequalities for high-dimensional learning tasks such as online PCA and online matrix prediction. We then generalize these results even further by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. This yields algorithms for applications such as online penalized risk minimization and multiple kernel learning.

1.1 Preliminaries

Notation. For a given norm , let denote the dual norm. Likewise, for any function F, F will denote its Fenchel conjugate. For a Banach space (B, ), the dual is (B , ). We use x1 n as shorthand for a sequence of vectors (x1,...,xn). For such sequences, we will use xt[i] to denote the tth vector s ith coordinate. We let ei denote the ith standard basis vector. p denotes the p norm, σ denotes the spectral norm, and denotes the trace norm. For any p [1, ], let p be such that 1

Setup and goals. We work in two closely related settings: online convex optimization (Protocol 1) and online supervised learning (Protocol 2). In online convex optimization, the learner selects decisions from a convex subset W of some Banach space B. Regret to a comparator w W in this setting is deﬁned as n

t=1 ft(wt) n

Suppose W can be decomposed into sets W1,W2,.... For a ﬁxed set Wk, the optimal regret, if one tailors the algorithm to compete with Wk, is typically characterized by some measure of intrinsic complexity of the class (such as Littlestone s dimension [5] and sequential Rademacher complexity [33]), denoted Compn(Wk). We would like to develop algorithms that predict a sequence (wt) such that

ft(w) Compn(Wk) + Penn(k) k. (1)

This equation is called an oracle inequality and states that the performance of the sequence (wt) matches that of a comparator that minimizes the bias-variance tradeoff mink{minw Wk n

t=1 ft(w) + Compn(Wk)}, up to a penalty Penn(k) whose scale ideally matches that of Compn(Wk). We shall see shortly that ensuring that the scale of Penn(k) does

Protocol 1 Online Convex Optimization

for t = 1,...,n do

Learner selects strategy qt (W) for convex decision set W. Nature selects convex loss ft W R. Learner draws wt qt and incurs loss ft(wt). end for

indeed match is the core technical challenge in developing online oracle inequalities for commonly used classes.

In the supervised learning setting we measure regret against a benchmark class F =

k=1 Fk of functions f X R, where X is some abstract context space, also called feature space. In this case, the desired oracle inequality has the form:

(ˆyt,yt) inf

(f(xt),yt) Compn(Fk) + Penn(k) k. (2)

Protocol 2 Online Supervised Learning

for t = 1,...,n do

Nature provides xt X. Learner selects randomized strategy qt (R). Nature provides outcome yt Y. Learner draws ˆyt qt and incurs loss (ˆyt,yt). end for

2 Online Model Selection

2.1 The need for multi-scale aggregation

Let us brieﬂy motivate the main technical challenge overcome by the model selection approach we consider. The most widely studied oracle inequality in online learning has the following form

ft(w) O ( w 2 + 1)

n log(( w 2 + 1)n) w Rd. (3)

In light of (1), a model selection approach to obtaining this inequality would be to split the set W = Rd into 2 norm balls of doubling radius, i.e. Wk = w w 2 2k . A standard fact [15] is that such a set has Compn(Wk) = 2k n if one optimizes over it using Mirror Descent, and so obtaining the oracle inequality (1) is sufﬁcient to recover (3), so long as Penn(k) is not too large relative to Compn(Wk).

Online model selection is fundamentally a problem of prediction with expert advice [8], where the experts correspond to the different model classes one is choosing from. Our basic meta-algorithm, MULTISCALEFTPL (Algorithm 3), operates in the following setup. The algorithm has access to a ﬁnite number, N, of experts. In each round, the algorithm is required to choose one of the N experts. Then the losses of all experts are revealed, and the algorithm incurs the loss of the chosen expert.

The twist from the standard setup is that the losses of all the experts are not uniformly bounded in the same range. Indeed, for the setup described for the oracle inequality (3), class Wk will produce predictions with norm as large as 2k. Therefore, here, we assume that expert i incurs losses in the range [ ci,ci], for some known parameter ci 0. The goal is to design an online learning algorithm whose regret to expert i scales with ci, rather than maxi ci, which is what previous algorithms for learning from expert advice (such as the standard multiplicative weights strategy or Ada Hedge [12]) would achieve. Indeed, any regret bound scaling in maxi ci will be far too large to achieve (3), as the term Penn(k) will dominate. This new type of scale-sensitive regret bound, achieved by our algorithm MULTISCALEFTPL, is stated below.

Algorithm 3

procedure MULTISCALEFTPL(c, ) Scale vector c with ci 1, prior distribution . for time t = 1,...,n: do

Draw sign vectors σt+1,...,σn { 1}N each uniformly at random. Compute distribution

pt(σt+1 n) = arg min

sup gt gt[i] ci

σs[i]ci B(i)

where B(i) = 5ci

i n i . Play it pt. Observe loss vector gt. end for end procedure

Theorem 1. Suppose the loss sequence (gt)t n satisﬁes gt[i] ci for a sequence (ci)i [N] with each ci 1. Let N be a given prior distribution on the experts. Then, playing the strategy (pt)t n given by Algorithm 3, MULTISCALEFTPL yields the following regret bound:1

nlog(nci i) i [N]. (4)

The proof of the theorem is deferred to Appendix A in the supplementary material due to space constraints. Brieﬂy, the proof follows the technique of adaptive relaxations from [14]. It relies on showing that the following function of the ﬁrst t loss vectors g1 t is an admissible relaxation (see [14] for deﬁnitions):

Rel(g1 t) E σt+1,...,σn { 1}N sup

σs[i]ci B(i) .

This implies that if we play the strategy (pt)t n given by Algorithm 3, the regret to the ith expert is bounded by B(i) + Rel( ), where Rel( ) indicates the Rel function applied to an empty sequence of loss vectors. As a ﬁnal step, we bound Rel( ) as O(1) using a probabilistic maximal inequality (Lemma 2 in the supplementary material), yielding the bound (4). Compared to related FTPL algorithms [34], the analysis is surprisingly delicate, as additive ci factors can spoil the desired regret bound (4) if the cis differ by orders of magnitude.

The min-max optimization problem in MULTISCALEFTPL can be solved in polynomial-time using linear programming see Appendix A.1 in the supplementary material for a full discussion.

In related work, [7] simultaneously developed a multi-scale experts algorithm which could also be used in our framework. Their regret bound has sub-optimal dependence on the prior distribution over experts, but their algorithm is more efﬁcient and is able to obtain multiplicative regret guarantees.

2.2 Online convex optimization

One can readily apply MULTISCALEFTPL for online optimization problems whenever it is possible to bound the losses of the different experts a-priori. One such application is to online convex optimization, where each expert is a a particular OCO algorithm, and for which such a bound can be obtained via appropriate bounds on the relevant norms of the parameter vectors and the gradients of the loss functions. We detail this application which yields algorithms for parameter-free online learning and more below. All of the algorithms in this section are derived using a uniﬁed meta-algorithm strategy MULTISCALEOCO.

1This regret bound holds under expectation over the player s randomization. It is assumed that each gt is selected before the randomized strategy pt is revealed, but may adapt to the distribution over pt. In fact, a slightly stronger version of this bound holds, namely E n

t=1 eit, gt mini [N] n

t=1 ei, gt + O ci

n log(nci i) 0. A similar strengthening applies to all subsequent bounds.

The setup is as follows. We have access to N sub-algorithms, denoted ALGi for i [N]. In round t, each sub-algorithm ALGi produces a prediction wi

t Wi, where Wi is a set in a vector space V over R containing 0. Our meta-algorithm is then required to choose one of the predictions wi

t. Then, a loss function ft V R is revealed, whereupon ALGi incurs loss ft(wi

t), and the meta-algorithm suffers the loss of the chosen prediction. We make the following assumption on the sub-algorithms: Assumption 1. The sub-algorithms satisfy the following conditions:

For each i [N], there is an associated norm (i) such that supw Wi w (i) Ri.

For each i [N], the sequence of functions ft are Li-Lipschitz on Wi with respect to (i).

For each sub-algorithm ALGi, the iterates (wi

t)t n enjoy a regret bound n

t) infw Wi n

t=1 ft(w) Regn(i), where Regn(i) may be dataor algorithm-dependent.

Algorithm 4

procedure MULTISCALEOCO({ALGi, Ri, Li}i [N], ) Collection of sub-algorithms, prior . c (Ri Li)i [N] Sub-algorithm scale parameters. for t = 1, . . . , n do

t ALGi( f1, . . . , ft 1) for each i A. it MULTISCALEFTPL[c, ](g1, . . . , gt 1). Play wt = wit

t . Observe loss function ft and let ft(w) = ft(w) ft(0). gt ft(wi

t) i [N]. end for end procedure

In most applications, Wi will be a convex set and ft a convex function; this convexity is not necessary to prove a regret bound for the meta-algorithm. We simply need boundedness of the set Wi and Lipschitzness of the functions ft, as speciﬁed in Assumption 1. This assumption implies that for any i, we have ft(w) ft(0) Ri Li for any w Wi. Thus, we can design a meta-algorithm for this setup by using MULTISCALEFTPL with ci = Ri Li, which is precisely what is described in Algorithm 4. The following theorem provides a bound on the regret of MULTISCALEOCO; a direct consequence of Theorem 1. Theorem 2. Without loss of generality, assume that Ri Li 12. Suppose that the inputs to Algorithm 4 satisfy Assumption 1. Then the iterates (wt)t n returned by Algorithm 4 follow the regret bound

ft(w) E[Regn(i)] + O Ri Li

nlog(Ri Lin i) i [N]. (5)

Theorem 2 shows that if we use Algorithm 4 to aggregate the iterates produced by a collection of sub-algorithms (ALGi)i [N], the regret against any sub-algorithm i will only depend on that algorithm s scale, not the regret of the worst sub-algorithm.

Application 1: Parameter-free online learning in uniformly convex Banach spaces. As the ﬁrst application of our framework, we give a generalization of the parameter-free online learning bounds found in [25; 27; 30; 31; 10] from Hilbert spaces to arbitrary uniformly convex Banach spaces. Recall that a Banach space (B, ) is (2,λ)-uniformly convex if 1

2 2 is λ-strongly convex with respect to itself [32]. Our algorithm obtains a generalization of the oracle inequality (3) for any uniformly convex (B, ) by running multiple instances of Mirror Descent the workhorse of online convex optimization and aggregating their iterates using MULTISCALEOCO. This strategy is thus efﬁcient whenever Mirror Descent can be implemented efﬁciently. The collection of sub-algorithms used by MULTISCALEOCO, which was alluded to at the beginning of this section is as follows: For each 1 i N = n + 1, set Ri = ei 1, Li = L, Wi = {w B w Ri},

λ n, and ALGi = MIRRORDESCENT( i,Wi, 2). Finally, set = Uniform([n + 1]).

Mirror Descent is reviewed in detail in Appendix A.2 in the supplementary material, but the only feature of its performance of importance to our analysis is that, when conﬁgured as described above, the

2For notational convenience all Lipschitz bounds are assumed to be at least 1 without loss of generality for the remainder of the paper.

iterates (wi

t)t n produced by ALGi speciﬁed above will satisfy n

t) infw Wi n

t=1 ft(w) O(Ri L

λn) on any sequence of losses that are L-Lipschitz with respect to . Using just this simple fact, combined with the regret bound for MULTISCALEOCO and a few technical details in Appendix A.2, we can deduce the following parameter-free learning oracle inequality: Theorem 3 (Oracle inequality for uniformly convex Banach spaces). The iterates (wt)t n produced by MULTISCALEOCO on any L-Lipschitz (w.r.t. ) sequence of losses (ft)t n satisfy

ft(w) O L ( w + 1)

n log(( w + 1)Ln) λ w B. (6)

Note that the above oracle inequality applies for any uniformly convex norm . Previous results only obtain bounds of this form efﬁciently when is a Hilbert space norm. As is standard for such oracle inequality results, the bound is weaker than the optimal bound if w were selected in advance, but only by a mild

log(( w + 1)Ln) factor. Proposition 1. The algorithm can be implemented in time O(TMD poly(n)) per iteration, where TMD is the time complexity of a single Mirror Descent update.

In the example above, the (2,λ)-uniform convexity condition was mainly chosen for familiarity. The result can easily be generalized to related notions such as q-uniform convexity (see [37]). More generally, the approach can be used to derive oracle inequalities with respect to general strongly convex regularizer R deﬁned over the space W. Such a bound would have the form O L

n(R(w) + 1) log((R(w) + 1)n) for typical choices of R.

This example captures well-known quantile bounds [22] when one takes R to be the KL-divergence and W to be the simplex, or, in the matrix case, takes R to be the quantum relative entropy and W to be the set of density matrices, as in [18].

Application 2: Oracle inequality for many p norms. It is instructive to think of MULTISCALEOCO as executing a (scale-sensitive) online analogue of the structural risk minimization principle. We simply specify a set of subclasses and a prior specifying the importance of each subclass, and we are guaranteed that the algorithm s performance matches that of each sub-class, plus a penalty depending on the prior weight placed on that subclass. The advantage of this approach is that the nested structure used in the Theorem 3 is completely inessential. This leads to the exciting prospect of developing parameter-free algorithms over new and exotic set systems. One such example is given now: The MULTISCALEOCO framework allows us to obtain an oracle inequality with respect to many p norms in Rd simultaneously. To the best of our knowledge all previous works on parameter-free online learning have only provided oracle inequalities for a single norm. Theorem 4. Fix δ > 0. Suppose that the loss functions (ft)t n are Lp-Lipschitz w.r.t. p for each p [1 + δ,2]. Then there is a computationally efﬁcient algorithm that guarantees regret

ft(w) O ( w p + 1)Lp

nlog(( w p + 1)Lp log(d)n) (p 1) (7)

for all w Rd and all p [1 + δ,2].

The conﬁguration in the above theorem is described in full in Appendix A.2 in the supplementary material. This strategy can be trivially extended to handle p in the range (2, ). The inequality holds for p 1 + δ rather than for p 1 because the 1 norm is not uniformly convex, but this is easily rectiﬁed by changing the regularizer at p = 1; we omit this for simplicity of presentation.

We emphasize that the choice of p norms for the result above was somewhat arbitrary any ﬁnite collection of norms will also work. For example, the strategy can also be applied to matrix optimization over Rd d by replacing the p norm with the Schatten Sp norm. The Schatten Sp norm has strong convexity parameter on the order of p 1 (which matches the p norm up to absolute constants [2]) so the only change to practical change to the setup in Theorem 4 will be the running time TMD. Likewise, the approach applies to (p,q)-group norms as used in multi-task learning [20].

Application 3: Adapting to rank for online PCA For the online PCA task, the learner predicts from a class Wk = W Rd d W 0, W σ 1, W,I = k . For a ﬁxed value of k, such a class is

a convex relaxation of the set of all rank k projection matrices. After producing a prediction Wt, we experience afﬁne loss functions ft(Wt) = I Wt,Yt , where Yt Y = Y Rd d Y 0, Y σ 1 . We leverage an analysis of online PCA due to [29] together with MULTISCALEOCO to derive an algorithm that competes with many values of the rank simultaneously. This gives the following result: Theorem 5. There is an efﬁcient algorithm for Online PCA with regret bound

I Wt, Yt min W projection

O k n k [d 2].

For a ﬁxed value of k, the above bound is already optimal up to log factors, but it holds for all k simultaneously.

Application 4: Adapting to norm for Matrix Multiplicative Weights In the MATRIX MULTIPLICATIVE WEIGHTS setting [1] we consider hypothesis classes of the form Wr = W Rd d W 0, W r . Losses are given by ft(W) = W,Yt , where Yt σ 1. For a ﬁxed value of r, the well-known MATRIX MULTIPLICATIVE WEIGHTS strategy has regret against Wr bounded by O(r nlog d). Using this strategy for ﬁxed r as a sub-algorithm for MULTISCALEOCO, we achieve the following oracle inequality efﬁciently: Theorem 6. There is an efﬁcient matrix prediction strategy with regret bound

W,Yt ( W + 1)

nlog dlog(( W + 1)n)) W 0. (8)

A remark on efﬁciency All of our algorithms that provide bounds of the form (6) instantiate O(n) experts with MULTISCALEFTPL because, in general, the worst case w for achieving (6) can have norm as large as en. If one has an a priori bound say B on the range at which each ft attains its minimum, then the number of experts be reduced to O(log(B)).

2.3 Supervised learning

We now consider the online supervised learning setting (Protocol 2), with the goal being to compete with a sequence of hypothesis classes (Fk)k [N] simultaneously. Working in this setting makes clear a key feature of the meta-algorithm approach we have adopted: We can efﬁciently obtain online oracle inequalities for arbitrary nonlinear function classes so long as we have an efﬁcient algorithm for each Fk.

We obtain a supervised learning meta-algorithm by simply feeding the observed losses ( ,yt) (which may even be non-convex) to the meta-algorithm MULTISCALEFTPL in the same fashion as MULTISCALEOCO. The resulting strategy, which is described in detail in Appendix A.3 for completeness, is called MULTISCALELEARNING. We make the following assumptions analogous to Assumption 1, which lead to the performance guarantee for MULTISCALELEARNING given in Theorem 7 below. Assumption 2. The sub-algorithms used by MULTISCALELEARNING satisfy the following conditions:

For each i [N], the iterates (ˆyi

t)t n produced by sub-algorithm ALGi satisfy ˆyi

For each i [N], the function ( ,yt) is Li-Lipschitz on [ Ri,Ri].

For each sub-algorithm ALGi, the iterates (ˆyi

t)t n enjoy a regret bound n

t,yt) inff Fi n

t=1 (f(xt),yt) Regn(i), where Regn(i) may be dataor algorithmdependent.

Theorem 7. Suppose that the inputs to Algorithm 5 satisfy Assumption 2. Then the iterates (ˆyt)t n produced by the algorithm enjoy the regret bound

(f(xt), yt) E[Regn(i)] + O Ri Li

n log(Ri Lin i) i [N]. (9)

Online penalized risk minimization In the statistical learning setting, oracle inequalities for arbitrary sequences of hypothesis classes F1,...,FN are readily available. Such inequalities are typically stated in terms of complexity parameters for the classes (Fk) such as VC dimension or Rademacher complexity. For the online learning setting, it is well-known that sequential Rademacher complexity Radn(F) provides a sequential counterpart to these complexity measures [33], meaning that it generically characterizes the minimax optimal regret for Lipschitz losses. We will obtain an oracle inequality in terms of this parameter. Assumption 3. The sequence of hypothesis classes F1,...,FN are such that

1. There is an efﬁcient algorithm ALGk producing iterates (ˆyk

t )t n satisfying n

t ,yt) inff Fk n

t=1 (f(xt),yt) C L Radn(Fk) for any L-Lipschitz loss, where C is some constant. (an algorithm with this regret is always guaranteed to exist, but may not be efﬁcient).

2. Each Fk has output range [ Rk,Rk], where Rk 1 without loss of generality.

3. Radn(Fk) = (Rk

n) this is obtained by most non-trivial classes.

Theorem 8 (Online penalized risk minimization). Under Assumption 3 there is an efﬁcient (in N) algorithm that achieves the following regret bound for any L-Lipschitz loss:

(ˆyt, yt) inf

(f(xt), yt) O L Radn(Fk)

log(L Radn(Fk) k) k [N]. (10)

As in the previous section, one can derive tighter regret bounds and more efﬁcient (e.g. sublinear in N) algorithms if F1,F2,... are nested.

Application: Multiple kernel learning

Theorem 9. Let H1,...,HN be reproducing kernel Hilbert spaces for which each Hk has a kernel K such that supx X

K(x,x) Bk. Then there is an efﬁcient learning algorithm that guarantees

(f(xt), yt) O LBk( f Hk + 1)

log(LBkkn( f Hk + 1)) k, f Hk

for any L-Lipschitz loss, whenever an efﬁcient algorithm is available for the norm ball in each Hk.

3 Discussion and Further Directions

Related work There are two directions in parameter-free online learning that have been explored extensively. The ﬁrst considers bounds of the form (3); namely, the Hilbert space version of the more general setting explored in Section 2.2. Beginning with [26], which obtained a slightly looser rate than (3), research has focused on obtaining tighter dependence on w 2 and log(n) in this type of bound [25; 27; 30; 31]; all of these algorithms run in linear time per update step. Recent work [10; 11] has extended these results to the case where the Lipschitz constant is not known in advance. These works give lower bounds for general norms, but only give efﬁcient algorithms for Hilbert spaces. Extending Algorithm 4 to reach the Pareto frontier of regret in the unknown Lipschitz setting as described in [11] may be an interesting direction for future research.

The second direction concerns so-called quantile bounds [9; 22; 23; 31] for experts setting, where the learner s decision set W is the simplex d and losses are bounded in . The multi-scale machinery developed in the present work is not needed to obtain bounds for this setting because the losses are uniformly bounded across all model classes. Indeed, [14] recovered a basic form of quantile bound using the vanilla multiplicative weights strategy as a meta-algorithm. It is not known whether the more sophisticated data-dependent quantile bounds given in [22; 23] can be recovered in the same fashion.

Losses with curvature. The O( n)-type regret bounds provided by Algorithm 3 are appropriate when the sub-algorithms themselves incur O( n) regret bounds. However, assuming certain curvature properties (such as strong convexity, exp-concavity, stochastic mixability, etc. [16; 38]) of the loss functions it is possible to construct sub-algorithms that admit signiﬁcantly more favorable regret bounds (O(log n) or even O(1)). These are also referred to as fast rates in online learning. A natural direction for further study is to design a meta-algorithm that admits logarithmic or constant

regret to each sub-algorithm, assuming that the loss functions of interest satisfy similar curvature properties, with the regret to each individual sub-algorithm adapted to the curvature parameters for that sub-algorithm. Perhaps surprisingly, for the special case of the logistic loss, improper prediction and aggregation strategies similar to those proposed in this paper offer a way to circumvent known proper learning lower bounds [17]. This approach will be explored in detail in a forthcoming companion paper.

Computational efﬁciency. We suspect that a running-time of O(n) to obtain inequalities like (6) may be unavoidable through our approach, since we do not make use of the relationship between sub-algorithms beyond using the nested class structure. Whether the runtime of MULTISCALEFTPL can be brought down to match O(n) is an open question. This boils down to whether or not the min-max optimization problem in the algorithm description can simultaneously be solved in 1) Linear time in the number of experts 2) strongly polynomial time in the scales ci.

Acknowledgements

We thank Francesco Orabona and D avid P al for inspiring initial discussions. Part of this work was done while DF was an intern at Google Research and while DF and KS were visiting the Simons Institute for the Theory of Computing. DF is supported by the NDSEG fellowship.

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a

meta-algorithm and applications. Theory of Computing, 8(1):121 164, 2012.

[2] Keith Ball, Eric A Carlen, and Elliott H Lieb. Sharp uniform convexity and smoothness

inequalities for trace norms. Inventiones mathematicae, 115(1):463 482, 1994.

[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds

and structural results. Journal of Machine Learning Research, 3:463 482, 2003. ISSN 15324435.

[4] Peter L. Bartlett, St ephane Boucheron, and G abor Lugosi. Model selection and error estimation.

Machine Learning, 48(1-3):85 113, 2002.

[5] Shai Ben-David, David Pal, and Shai Shalev-Shwartz. Agnostic online learning. In Proceedings

of the 22th Annual Conference on Learning Theory, 2009.

[6] St ephane Boucheron, G abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

[7] Sebastien Bubeck, Nikhil Devanur, Zhiyi Huang, and Rad Niazadeh. Online auctions and multi-

scale online learning. Accepted to The 18th ACM conference on Economics and Computation (EC 17), 2017.

[8] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[9] Kamalika Chaudhuri, Yoav Freund, and Daniel J Hsu. A parameter-free hedging algorithm. In

Advances in neural information processing systems, pages 297 305, 2009.

[10] Ashok Cutkosky and Kwabena A Boahen. Online convex optimization with unconstrained

domains and losses. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 748 756. 2016.

[11] Ashok Cutkosky and Kwabena A. Boahen. Online learning without prior information. The 30th

Annual Conference on Learning Theory, 2017.

[12] Steven De Rooij, Tim Van Erven, Peter D Gr unwald, and Wouter M Koolen. Follow the leader

if you can, hedge if you must. Journal of Machine Learning Research, 15(1):1281 1316, 2014.

[13] Luc Devroye, L azl o Gy orﬁ, and G abor Lugosi. A Probabilistic Theory of Pattern Recognition.

Springer, 1996.

[14] Dylan J Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In

Advances in Neural Information Processing Systems, pages 3375 3383, 2015.

[15] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Opti-

mization, 2(3-4):157 325, 2016.

[16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex

optimization. Machine Learning, 69(2-3):169 192, 2007.

[17] Elad Hazan, Tomer Koren, and Kﬁr Y Levy. Logistic regression: Tight bounds for stochastic

and online optimization. In Proceedings of The 27th Conference on Learning Theory, pages 197 209, 2014.

[18] Elad Hazan, Satyen Kale, and Shai Shalev-Shwartz. Near-optimal algorithms for online matrix

prediction. SIAM J. Comput., 46(2):744 773, 2017. doi: 10.1137/120895731.

[19] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction:

Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793 800. MIT Press, 2009.

[20] Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for

learning with matrices. Journal of Machine Learning Research, 13(Jun):1865 1890, 2012.

[21] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans.

Information Theory, 47(5):1902 1914, 2001.

[22] Wouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combina-

torial games. In Proceedings of The 28th Conference on Learning Theory, pages 1155 1175, 2015.

[23] Haipeng Luo and Robert E Schapire. Achieving all with no parameters: Adanormalhedge. In

Conference on Learning Theory, pages 1286 1304, 2015.

[24] Pascal Massart. Concentration inequalities and model selection. Lecture Notes in Mathematics,

1896, 2007.

[25] Brendan Mc Mahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear

optimization. In Advances in Neural Information Processing Systems, pages 2724 2732, 2013.

[26] Brendan Mcmahan and Matthew Streeter. No-regret algorithms for unconstrained online convex

optimization. In Advances in neural information processing systems, pages 2402 2410, 2012.

[27] H. Brendan Mc Mahan and Francesco Orabona. Unconstrained online linear learning in hilbert

spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, pages 1020 1039, 2014.

[28] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequali-

ties with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229 251, 2004.

[29] Jiazhong Nie, Wojciech Kotłowski, and Manfred K Warmuth. Online pca with optimal regrets.

In International Conference on Algorithmic Learning Theory, pages 98 112. Springer, 2013.

[30] Francesco Orabona. Simultaneous model selection and optimization through parameter-free

stochastic learning. In Advances in Neural Information Processing Systems, pages 1116 1124, 2014.

[31] Francesco Orabona and D avid P al. From coin betting to parameter-free online learning. ar Xiv

preprint ar Xiv:1602.04128, 2016.

[32] Gilles Pisier. Martingales in banach spaces (in connection with type and cotype). course ihp,

feb. 2 8, 2011. 2011.

[33] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages,

combinatorial parameters, and learnability. Advances in Neural Information Processing Systems 23, pages 1984 1992, 2010.

[34] Alexander. Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to

algorithms. In Advances in Neural Information Processing Systems 25, pages 2150 2158, 2012.

[35] James Renegar. A polynomial-time algorithm, based on newton s method, for linear program-

ming. Mathematical Programming, 40(1):59 93, 1988.

[36] John Shawe-Taylor, Peter L Bartlett, Robert C Williamson, and Martin Anthony. Structural risk

minimization over data-dependent hierarchies. IEEE transactions on Information Theory, 44 (5):1926 1940, 1998.

[37] Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent.

In Advances in neural information processing systems, pages 2645 2653, 2011.

[38] Tim van Erven, Peter D. Gr unwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson.

Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793 1861, 2015.

[39] Vladimir Vapnik. Estimation of dependences based on empirical data, volume 40. Springer-

Verlag New York, 1982.

[40] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies

of events to their probabilities. Theory of Probability and its Applications, 16(2):264 280, 1971.