# Boosting for Online Convex Optimization

Elad Hazan (Google AI Princeton; Princeton University, Princeton, NJ, USA) and Karan Singh (Microsoft Research, Redmond, WA, USA)

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.*

**Abstract.** We consider the decision-making framework of online convex optimization with a very large number of experts. This setting is ubiquitous in contextual and reinforcement learning problems, where the size of the policy class renders enumeration and search within the policy class infeasible. Instead, we consider generalizing the methodology of online boosting. We define a weak learning algorithm as a mechanism that guarantees multiplicatively approximate regret against a base class of experts. In this access model, we give an efficient boosting algorithm that guarantees near-optimal regret against the convex hull of the base class. We consider both full and partial (a.k.a. bandit) information feedback models. We also give an analogous efficient boosting algorithm for the i.i.d. statistical setting. Our results simultaneously generalize online boosting and gradient boosting guarantees to the contextual learning model, and to the online convex optimization and bandit linear optimization settings.

## 1. Introduction

In the classical problem of prediction from expert advice (Cesa-Bianchi et al., 1997), a learner iteratively makes decisions and receives loss according to an arbitrarily chosen loss function. Guaranteeing an absolute bound on the loss incurred would be a hopeless task, so the learner is assisted by a pool of experts (or hypotheses). The goal of the learner is thus to minimize regret: the difference between its total loss and that of the best expert in hindsight. For this canonical problem, numerous methods have been devised and refined, some notably based on the multiplicative weights algorithm (Littlestone & Warmuth, 1994; Arora et al., 2012). It is well established that the regret can be bounded by $O(\sqrt{T \log |\mathcal{H}|})$, where $|\mathcal{H}|$ is the number of experts, and that this bound is tight.

However, in many problems of interest, the class of experts is too large to manipulate efficiently. This is particularly evident in contextual learning, where the experts are policies: functions mapping contexts to actions. In such instances, even if a regret bound of $O(\sqrt{T \log |\mathcal{H}|})$ is meaningful, the algorithms achieving this bound are computationally inefficient: their running time is linear in $|\mathcal{H}|$.

One approach to dealing with this computational difficulty is to grant the learner a best-in-hindsight (or ERM) oracle. This has been explored in both the stochastic bandit (Langford & Zhang, 2008; Agarwal et al., 2014; Dudik et al., 2011) and (transductive) online settings (Rakhlin & Sridharan, 2016; Syrgkanis et al., 2016a;b; Hazan & Koren, 2016).

In this paper, we consider a different approach towards addressing computational intractability, motivated by the observation that it is often possible to design simple rules-of-thumb (Viola & Jones, 2001) that perform slightly better than random guesses. We propose that the learner has access to a weak learner: a computationally cheap mechanism capable of guaranteeing multiplicatively approximate regret against a base hypothesis class.
Such considerations arise in the theory of boosting (Schapire & Freund, 2012b), which was originally devised in the context of binary classification in the statistical (i.i.d.) setting. There, a weak learner, defined as one capable of better-than-random classification, is used by a boosting algorithm to create an accurate classifier. By analogy, the regret-based notion of a weak learner we define here is a natural extension of the concept of weak learning to online convex optimization. We design efficient algorithms that, when provided with such weak learners, compete with the convex hull of the base hypothesis class at near-optimal regret. We further consider different information feedback models, both full gradient feedback and linear bandit (function value) feedback, and derive efficient sublinear-regret algorithms for these settings.

### 1.1. Setting and contributions

The setting of online convex optimization (OCO) generalizes the problem of prediction from expert advice to a general convex decision set $\mathcal{K} \subseteq \mathbb{R}^d$ and adversarially chosen convex loss functions $f_t : \mathbb{R}^d \to \mathbb{R}$. We consider a hypothesis class $\mathcal{H} \subseteq \{C \to \mathcal{K}\}$, where each hypothesis $h \in \mathcal{H}$ maps a context (or feature vector) $c_t$ in the context set $C$ to an action $h(c_t) \in \mathcal{K}$.

**Definition 1.** An online learning algorithm $\mathcal{W}$ is a *$\gamma$-weak OCO learner* (WOCL) for a class $\mathcal{H}$ and edge $\gamma \in (0, 1)$ if, for any sequence of contexts chosen from $C$ and unit linear loss functions $f_1, \dots, f_T$ (a loss $f$ is *unit* if $\max_{x \in \mathcal{K}} f(x) - \min_{x \in \mathcal{K}} f(x) \le 1$; this scaling is w.l.o.g., and all results hold with a different constant otherwise), we have
$$\sum_{t=1}^{T} f_t(\mathcal{W}(c_t)) \;\le\; \gamma \min_{h \in \mathcal{H}} \sum_{t=1}^{T} f_t(h(c_t)) \;+\; (1-\gamma) \sum_{t=1}^{T} f_t^{\mu} \;+\; \mathrm{Regret}_T(\mathcal{W}),$$
where $f^{\mu} = \int_{x \in \mathcal{K}} f(x)\, d\mu$ is the average value of $f$ under the uniform distribution $\mu$ over $\mathcal{K}$.

In classical boosting, a weak learner is an algorithm that offers an edge over a random guess. Since $f_t^{\mu}$ is precisely the expected loss associated with a uniform random guess, the above definition is a generalization of the classical notion. In addition, note that such a definition is invariant under constant offsets to the function value.

It is convenient to henceforth assume that the convex decision set is centered, that is, $\int_{x \in \mathcal{K}} x\, d\mu = 0$. If so, note that for any linear function $f : \mathcal{K} \to \mathbb{R}$, we have $f^{\mu} = f\big(\int_x x\, d\mu\big) = 0$. Under this assumption, we can rephrase the $\gamma$-WOCL guarantee as
$$\sum_{t=1}^{T} f_t(\mathcal{W}(c_t)) \;\le\; \gamma \min_{h \in \mathcal{H}} \sum_{t=1}^{T} f_t(h(c_t)) \;+\; \mathrm{Regret}_T(\mathcal{W}).$$

Let $\mathrm{CH}(\mathcal{H})$ be the convex hull of the hypothesis class. Our goal is to design an algorithm $\mathcal{A}$ that minimizes regret against the best hypothesis in the convex hull of the base class:
$$\mathrm{Regret}_T(\mathcal{A}) = \sum_{t=1}^{T} f_t(\mathcal{A}(c_t)) - \min_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t(h(c_t)).$$

In this model, our main result is an efficient algorithm, whose guarantee is given by:

**Theorem 2 (Main).** Algorithm 1 with parameters $\gamma, N$ maintains $N$ copies of a $\gamma$-WOCL and generates a sequence of actions $x_1, \dots, x_T$ such that
$$\mathrm{Regret} = O\!\left(\frac{T}{\gamma \sqrt{N}} + \frac{1}{\gamma}\,\mathrm{Regret}_T(\mathcal{W})\right).$$

The formal theorem, which we state and prove below, has explicit constants that depend on the diameter of $\mathcal{K}$ (as opposed to $|\mathcal{H}|$) and the Lipschitz constants of the loss functions. These are hidden here for brevity. Notably, Algorithm 1 achieves the following goals:

1. Its running time does not depend on $|\mathcal{H}|$, but rather only on the other natural parameters of the problem, notably the parameter $N$.
2. The regret it guarantees is with respect to a stronger benchmark than the best hypothesis in the base class: it competes with the best convex combination of hypotheses from the base class, rather than the best single one.
3. Finally, the regret guarantee is no longer multiplicatively approximate; it holds directly against the best convex combination, without the $\gamma$ factor.
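To make the access model of Definition 1 concrete, the following minimal Python sketch shows the interface a $\gamma$-WOCL could expose; the class and method names are hypothetical (the paper prescribes no implementation), and Algorithm 1, introduced in Section 3, interacts with weak learners solely through calls of this shape.

```python
import numpy as np

class WeakOCOLearner:
    """Sketch of a gamma-weak OCO learner (Definition 1); names are illustrative.

    A concrete instance must guarantee, for any sequence of contexts and unit
    linear losses f_t(x) = <g_t, x> over a centered decision set K,
        sum_t f_t(predict(c_t)) <= gamma * min_h sum_t f_t(h(c_t)) + Regret_T.
    """

    def __init__(self, gamma: float):
        self.gamma = gamma

    def predict(self, context: np.ndarray) -> np.ndarray:
        """Return an action in K for the given context."""
        raise NotImplementedError

    def update(self, loss_vector: np.ndarray) -> None:
        """Receive the linear loss f_t(x) = <loss_vector, x> for this round."""
        raise NotImplementedError
```

Any online algorithm meeting Definition 1 through such an interface can be plugged into the boosting procedures developed below.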
### 1.2. Related work

Boosting and ensemble methods are fundamental in machine learning and data science; see (Olshen, 2001) for a retrospective viewpoint and (Schapire & Freund, 2012a) for a comprehensive text. Boosting was originally studied for statistical learning in the realizable setting; for a framework that encompasses the different boosting methods, see (Brukhim et al., 2020). More recently, boosting was extended to real-valued prediction via the theory of gradient boosting (Friedman, 2002), and to the online setting (Leistner et al., 2009; Chen et al., 2012; 2014; Beygelzimer et al., 2015a;b; Agarwal et al., 2019; Jung et al., 2017; Jung & Tewari, 2018; Brukhim & Hazan, 2020). In each of these works, a certain component prevents the result from applying to the full OCO setting: either the weak learner has parameter $\gamma = 1$, or the convex decision sets are restricted, or the loss functions are restricted.

Our work is the first to generalize online boosting to real-valued prediction in the full setting of online convex optimization, allowing arbitrary convex decision sets and an arbitrary approximation factor $\gamma$. This requires the use of a new extension operator, defined below. For extensive background on online learning and online convex optimization, see (Hazan, 2019; Shalev-Shwartz, 2012).

The contextual experts and bandits problems were proposed in (Langford & Zhang, 2008) as a decision-making framework with a large number of policies. In the online setting, several works study the problem with an emphasis on efficient algorithms given access to an ERM oracle (Rakhlin & Sridharan, 2016; Syrgkanis et al., 2016a;b).

Another related line of work in online learning studies the possibility of $\alpha$-regret minimization given access to approximation oracles, i.e., an approximate weak learner (Kakade et al., 2009; Hazan et al., 2018). This weaker notion of $\alpha$-regret only yields approximately optimal decisions. In contrast, our result gives the stronger guarantee of standard regret, and the regret is measured against the stronger benchmark of convex combinations of experts.

## 2. Improper learning and the extension operator

### 2.1. Preliminaries

For a convex set $\mathcal{K}$ and scalar $c > 0$, we denote by $c\mathcal{K}$ the set of all scaled points, i.e., $c\mathcal{K} = \{cx \;\text{s.t.}\; x \in \mathcal{K}\}$.

The algorithmic primitives introduced next use a smoothing operator. Let $f$ be $G$-Lipschitz continuous. The smoothing operator may be defined via inf-convolution, or Moreau-Yosida regularization:
$$M_\delta[f](x) = \inf_{y \in \mathbb{R}^d} \left\{ f(y) + \frac{1}{2\delta}\,\|x - y\|^2 \right\}.$$

The following lemma states well-known properties of this operator (see Beck, 2017).

**Lemma 3.** The smoothed function $\hat{f}_\delta = M_\delta[f]$ satisfies:
1. $\hat{f}_\delta$ is $\frac{1}{\delta}$-smooth and $G$-Lipschitz.
2. $|\hat{f}_\delta(x) - f(x)| \le \frac{\delta G^2}{2}$ for all $x \in \mathbb{R}^d$.
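As a quick numerical illustration of Lemma 3 (a sketch under stated assumptions, not part of the paper), the snippet below evaluates $M_\delta[f]$ by brute force for the 1-Lipschitz function $f(x) = |x|$ and checks the uniform approximation bound $\delta G^2 / 2$; for this particular $f$, the envelope is exactly the Huber function.

```python
import numpy as np

def moreau_envelope(f, x, delta, grid):
    """Brute-force M_delta[f](x) = inf_y { f(y) + ||x - y||^2 / (2 delta) },
    minimized over a dense 1-D grid of candidate points y."""
    return np.min(f(grid) + (x - grid) ** 2 / (2 * delta))

f, G, delta = np.abs, 1.0, 0.1           # f(x) = |x| is G-Lipschitz with G = 1
grid = np.linspace(-3.0, 3.0, 20001)     # candidate y's; covers all minimizers
xs = np.linspace(-2.0, 2.0, 9)
env = np.array([moreau_envelope(f, x, delta, grid) for x in xs])

# Lemma 3, part 2: |M_delta[f](x) - f(x)| <= delta * G^2 / 2 everywhere.
assert np.all(np.abs(env - f(xs)) <= delta * G**2 / 2 + 1e-5)
print(np.round(env, 4))   # Huber values: |x| - delta/2 away from the origin
```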
### 2.2. The extension operator

The main difficulty we face is that the weak learner only has a $\gamma$-approximate regret guarantee. To cope with this, the algorithm we describe below scales the predictions returned by the weak learner by a factor of $\frac{1}{\gamma}$. This algorithm, a rescaled variant of the Frank-Wolfe method, would guarantee competitiveness with the convex hull of the base class if we could somehow play actions in $\frac{1}{\gamma}\mathcal{K}$ instead of $\mathcal{K}$. Since these actions need not belong to the decision set, they may be infeasible. This presents a significant challenge.

Ideally, we would like that for every action in $\frac{1}{\gamma}\mathcal{K}$, one could find an action in $\mathcal{K}$ of comparable (or lesser) loss, for all loss functions. It can be seen that some natural families of functions, e.g. linear functions, do not admit such an operation: for linear loss functions, extrema always occur on the boundary of the set. To remedy this, we modify the loss function being fed to the Frank-Wolfe steps in the first place. We call this modification an *extension* of the loss function outside $\mathcal{K}$.

Define $\mathrm{Dist}(x, \mathcal{K})$, the Euclidean distance to $\mathcal{K}$, and the Euclidean projection $\Pi_{\mathcal{K}}$, as
$$\mathrm{Dist}(x, \mathcal{K}) = \min_{y \in \mathcal{K}} \|y - x\|, \qquad \Pi_{\mathcal{K}}(x) = \arg\min_{y \in \mathcal{K}} \|y - x\|.$$

**Definition 4** (($\mathcal{K}, \delta, \kappa$)-extension). The extension operator over $\mathcal{K} \subseteq \mathbb{R}^d$ is defined as
$$X_{\mathcal{K},\delta,\kappa}[f](x) = M_\delta\big[f + \kappa\,\mathrm{Dist}(\cdot, \mathcal{K})\big](x).$$

We henceforth denote by $G$ an upper bound on the norm of the gradients of the loss functions, $G = \max_t \max_{x \in \mathbb{R}^d} \|\nabla f_t(x)\|$, and set $\kappa = G$.

The important takeaway from these operators is the following lemma, which is crucial to the OCO boosting algorithm (Algorithm 1). The extended loss functions are smooth, agree with the original loss on $\mathcal{K}$ up to a small error, and have the property that projecting any action in $\mathbb{R}^d$ onto $\mathcal{K}$ barely increases the extended loss value. In short, the extension permits projecting the infeasible points obtained from the weak learners back to the feasible domain without an increase in cost.

**Lemma 5.** The $(\mathcal{K}, \delta, \kappa)$-extension $\hat{f} = X_{\mathcal{K},\delta,\kappa}[f]$ of a $G$-Lipschitz function $f$ satisfies:
1. For every point $x \in \mathcal{K}$, we have $|\hat{f}(x) - f(x)| \le \frac{\delta G^2}{2}$.
2. Projecting a point $x$ onto $\mathcal{K}$ does not increase the extended function value by more than $G^2\delta$:
$$\hat{f}(\Pi_{\mathcal{K}}(x)) \le \hat{f}(x) + G^2\delta.$$

*Proof.* Since $\mathrm{Dist}(x, \mathcal{K}) = 0$ for all $x \in \mathcal{K}$, the first claim follows immediately from the properties of the smoothing operator as stated in Lemma 3. For the second claim, denote $x_\pi = \Pi_{\mathcal{K}}(x)$. Then
$$\begin{aligned}
\hat{f}(x_\pi) - \hat{f}(x) &\le f(x_\pi) - f(x) - \kappa\,\mathrm{Dist}(x, \mathcal{K}) + G^2\delta \\
&= f(x_\pi) - f(x) - \kappa\,\|x - x_\pi\| + G^2\delta \\
&\le G\|x - x_\pi\| - \kappa\|x - x_\pi\| + G^2\delta = G^2\delta.
\end{aligned}$$
Here the first inequality follows from Lemma 3 (part 2), the second follows from the Lipschitz assumption on $f$, and the last equality results from the choice $\kappa = G$.

## 3. Algorithm and main theorems

Algorithm 1 efficiently converts a weak online learning algorithm into an OCO algorithm with vanishing regret, in a black-box manner. The idea is to apply the weak learning algorithm to linear functions that are gradients of the loss; recursively, each subsequent weak learner is applied to the gradients of the residual loss, and so forth.

**Algorithm 1** BoOCO (Boosting for Online Convex Optimization)
1. Input: $N$ copies of the $\gamma$-WOCL, $\mathcal{W}_1, \mathcal{W}_2, \dots, \mathcal{W}_N$; parameters $(\eta_1, \dots, \eta_N)$, $\delta$; $\kappa = G$.
2. For $t = 1$ to $T$:
   1. Receive context $c_t$; choose $x_t^0 \in \mathcal{K}$ arbitrarily.
   2. For $i = 1$ to $N$: define $x_t^i = (1 - \eta_i)\,x_t^{i-1} + \eta_i\,\tfrac{1}{\gamma}\mathcal{W}_i(c_t)$.
   3. Predict $x_t = \Pi_{\mathcal{K}}(x_t^N)$ and suffer loss $f_t(x_t)$.
   4. Obtain the loss function $f_t$; create $\hat{f}_t = X_{\mathcal{K},\delta,\kappa}[f_t]$.
   5. For $i = 1$ to $N$: pass to $\mathcal{W}_i$ the linear loss function $f_t^i(x) = \nabla \hat{f}_t(x_t^{i-1})^\top x$.

The main performance guarantee we prove is summarized in the following theorem.

**Theorem 6 (Main).** The actions $x_t$ generated by Algorithm 1 with parameters $\delta = \sqrt{\frac{D^2}{G^2 \gamma^2 N}}$ and $\eta_i = \min\{\frac{2}{i}, 1\}$ satisfy
$$\sum_{t=1}^{T} f_t(x_t) - \min_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t(h(c_t)) \;\le\; \frac{4GDT}{\gamma\sqrt{N}} + \frac{2GD}{\gamma}\,\mathrm{Regret}_T(\mathcal{W}),$$
where $D$ denotes the diameter of $\mathcal{K}$.
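The following Python sketch mirrors one round of Algorithm 1 over the Euclidean ball, where the projection is available in closed form. The weak learner predict/update interface is the hypothetical one sketched in Section 1.1, and, as a simplification, the gradient of the Moreau-smoothed extension $\nabla \hat{f}_t$ is replaced by a subgradient of the unsmoothed extension $f_t + G\,\mathrm{Dist}(\cdot, \mathcal{K})$; the analysis requires the smoothed version.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto K = {x : ||x|| <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def ext_subgrad(g, x, G, radius=1.0):
    """Subgradient at x of f + G * Dist(., K), where g is a gradient of f
    at x; stands in for the smoothed extension gradient (a simplification)."""
    pi = project_ball(x, radius)
    dist = np.linalg.norm(x - pi)
    return g if dist < 1e-12 else g + G * (x - pi) / dist

def booco_round(weak_learners, context, grad_f, gamma, G, dim, radius=1.0):
    """One round of Algorithm 1 (BoOCO) over the Euclidean ball. Each weak
    learner is assumed to expose predict(context) -> point in K and
    update(loss_vector) for the linear loss x -> <loss_vector, x>."""
    iterates = []                        # stores x_t^0, ..., x_t^{N-1}
    x = np.zeros(dim)                    # x_t^0 in K, chosen arbitrarily
    for i, W in enumerate(weak_learners, start=1):
        iterates.append(x.copy())
        eta = min(2.0 / i, 1.0)          # eta_i as in Theorem 6
        x = (1 - eta) * x + (eta / gamma) * W.predict(context)
    x_played = project_ball(x, radius)   # x_t = Pi_K(x_t^N)
    for W, x_prev in zip(weak_learners, iterates):
        # pass f_t^i(x) = <grad of extended loss at x_t^{i-1}, x> to W_i
        W.update(ext_subgrad(grad_f(x_prev), x_prev, G, radius))
    return x_played
```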
## 4. Analysis

Define $\hat{f}_t = X_{\mathcal{K},\delta,G}[f_t] = M_\delta[f_t + G\,\mathrm{Dist}(\cdot, \mathcal{K})]$. This applies the setting $\kappa = G$, as required by Lemma 5, and by Lemma 3, $\hat{f}_t$ is $\frac{1}{\delta}$-smooth. Let $h^\star = \arg\min_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t(h(c_t))$ be the best hypothesis in the convex hull in hindsight, and let $x_t^\star = h^\star(c_t)$ denote the decisions of this hypothesis.

The main crux of the proof is given by the following lemma.

**Lemma 7.** Suppose the $\hat{f}_t$ are $\beta$-smooth and $\hat{G}$-Lipschitz. Then
$$\sum_{t=1}^{T} \hat{f}_t(x_t^N) - \sum_{t=1}^{T} \hat{f}_t(x_t^\star) \;\le\; \frac{2\beta D^2 T}{\gamma^2 N} + \frac{\hat{G}D}{\gamma}\,\mathrm{Regret}_T(\mathcal{W}).$$

*Proof.* Define, for all $i = 0, 1, 2, \dots, N$,
$$\Delta_i = \sum_{t=1}^{T} \left( \hat{f}_t(x_t^i) - \hat{f}_t(x_t^\star) \right).$$
Recall that $\hat{f}_t$ is $\beta$-smooth by our assumption. Therefore,
$$\begin{aligned}
\Delta_i &= \sum_{t=1}^{T} \left[ \hat{f}_t\!\left(x_t^{i-1} + \eta_i\big(\tfrac{1}{\gamma}\mathcal{W}_i(c_t) - x_t^{i-1}\big)\right) - \hat{f}_t(x_t^\star) \right] \\
&\le \sum_{t=1}^{T} \left[ \hat{f}_t(x_t^{i-1}) - \hat{f}_t(x_t^\star) + \eta_i\,\nabla\hat{f}_t(x_t^{i-1})^\top\big(\tfrac{1}{\gamma}\mathcal{W}_i(c_t) - x_t^{i-1}\big) + \frac{\eta_i^2 \beta}{2}\,\big\|\tfrac{1}{\gamma}\mathcal{W}_i(c_t) - x_t^{i-1}\big\|^2 \right].
\end{aligned}$$
Using the definition and linearity of $f_t^i$, and the fact that both $\tfrac{1}{\gamma}\mathcal{W}_i(c_t)$ and $x_t^{i-1}$ lie in $\tfrac{1}{\gamma}\mathcal{K}$, a set of diameter $D/\gamma$, we have
$$\Delta_i \le \Delta_{i-1} + \eta_i \sum_{t=1}^{T} \left( f_t^i\big(\tfrac{1}{\gamma}\mathcal{W}_i(c_t)\big) - f_t^i(x_t^{i-1}) \right) + \frac{\eta_i^2 \beta D^2 T}{2\gamma^2}.$$
Now, note the following equivalent restatement of the WOCL guarantee, which again utilizes the linearity of $f_t^i$: a linear loss applied to a convex combination of points equals the same convex combination of the linear losses of the individual points, hence
$$\sum_{t=1}^{T} f_t^i(\mathcal{W}_i(c_t)) \;\le\; \gamma \min_{h \in \mathcal{H}} \sum_{t=1}^{T} f_t^i(h(c_t)) + \hat{G}D\,\mathrm{Regret}_T(\mathcal{W}) \;=\; \gamma \min_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t^i(h(c_t)) + \hat{G}D\,\mathrm{Regret}_T(\mathcal{W}).$$
(The factor $\hat{G}D$ accounts for the fact that the $f_t^i$ are not unit losses; their range over $\mathcal{K}$ is at most $\hat{G}D$.) Using the above and $h^\star \in \mathrm{CH}(\mathcal{H})$, we have
$$\Delta_i \;\le\; \Delta_{i-1} + \eta_i \sum_{t=1}^{T} \nabla\hat{f}_t(x_t^{i-1})^\top (x_t^\star - x_t^{i-1}) + \frac{\eta_i^2 \beta D^2 T}{2\gamma^2} + \frac{\eta_i \hat{G}D}{\gamma}\,\mathrm{Regret}_T(\mathcal{W}) \;\le\; \Delta_{i-1}(1 - \eta_i) + \frac{\eta_i^2 \beta D^2 T}{2\gamma^2} + \eta_i R_T,$$
where the last inequality uses the convexity of $\hat{f}_t$, and we write $R_T = \frac{\hat{G}D}{\gamma}\mathrm{Regret}_T(\mathcal{W})$. We thus have the recurrence
$$\Delta_i \le \Delta_{i-1}(1 - \eta_i) + \frac{\eta_i^2 \beta D^2 T}{2\gamma^2} + \eta_i R_T.$$
Denoting $\hat{\Delta}_i = \Delta_i - R_T$, this simplifies to
$$\hat{\Delta}_i \le \hat{\Delta}_{i-1}(1 - \eta_i) + \frac{\eta_i^2 \beta D^2 T}{2\gamma^2}.$$

**Lemma 8.** Let $\{h_t\}$ be a sequence that satisfies the recurrence $h_{t+1} \le h_t(1 - \eta_t) + \eta_t^2 c$. Then taking $\eta_t = \min\{1, \frac{2}{t}\}$ implies $h_t \le \frac{4c}{t}$.

This is a recursive formula that can be solved by applying Lemma 8 with $c = \frac{\beta D^2 T}{2\gamma^2}$; we obtain $\hat{\Delta}_N \le \frac{2\beta D^2 T}{\gamma^2 N}$, and hence $\Delta_N \le \frac{2\beta D^2 T}{\gamma^2 N} + R_T$, which proves the lemma.

We are ready to prove the main guarantee of Algorithm 1.

*Proof of Theorem 6.* Using both parts of Lemma 5 in succession, we have
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^\star) \;\le\; \sum_{t=1}^{T} \hat{f}_t(x_t) - \sum_{t=1}^{T} \hat{f}_t(x_t^\star) + \delta G^2 T \;\le\; \sum_{t=1}^{T} \hat{f}_t(x_t^N) - \sum_{t=1}^{T} \hat{f}_t(x_t^\star) + 2\delta G^2 T.$$
Next, recall by Lemma 3 that $\hat{f}_t$ is $\frac{1}{\delta}$-smooth. Applying Lemma 7 with $\beta = \frac{1}{\delta}$, and then choosing $\delta$ to balance the two $\delta$-dependent terms, we have
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^\star) \;\le\; 2\delta G^2 T + \frac{2D^2 T}{\delta \gamma^2 N} + \frac{\hat{G}D}{\gamma}\,\mathrm{Regret}_T(\mathcal{W}) \;\le\; \frac{4GDT}{\gamma\sqrt{N}} + \frac{2GD}{\gamma}\,\mathrm{Regret}_T(\mathcal{W}),$$
using $\hat{G} \le 2G$, which is established next.

**Lemma 9.** $\mathrm{Dist}(\cdot, \mathcal{K})$ is 1-Lipschitz.

Using this lemma, we claim that $\hat{G} \le 2G$: the function $f_t + G\,\mathrm{Dist}(\cdot, \mathcal{K})$ is $2G$-Lipschitz, and therefore, by Lemma 3, $\|\nabla f_t^i\| = \|\nabla \hat{f}_t(x_t^{i-1})\| \le 2G$.

*Proof of Lemma 9.* Observe that
$$\mathrm{Dist}(x, \mathcal{K}) - \mathrm{Dist}(y, \mathcal{K}) = \|x - \Pi_{\mathcal{K}}(x)\| - \|y - \Pi_{\mathcal{K}}(y)\| \le \|x - \Pi_{\mathcal{K}}(y)\| - \|y - \Pi_{\mathcal{K}}(y)\| \le \|x - y\|,$$
where the first inequality holds since $\Pi_{\mathcal{K}}(y) \in \mathcal{K}$ and $\Pi_{\mathcal{K}}(x)$ is the closest point of $\mathcal{K}$ to $x$, and the second is the triangle inequality.

*Proof of Lemma 8.* This is proved by induction on $t$.

Induction base: for $t = 1$ we have $\eta_1 = 1$, so $h_2 \le h_1(1 - \eta_1) + \eta_1^2 c = c \le \frac{4c}{2}$.

Induction step: using the induction hypothesis,
$$h_{t+1} \le (1 - \eta_t)h_t + \eta_t^2 c \le \left(1 - \frac{2}{t}\right)\frac{4c}{t} + \frac{4c}{t^2} = \frac{4c}{t}\left(1 - \frac{1}{t}\right) \le \frac{4c}{t+1}.$$
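The recurrence of Lemma 8 is easy to check numerically; the following snippet (an illustrative check, iterating the worst case of equality in the recurrence) confirms the $4c/t$ bound.

```python
# Numerical check of Lemma 8: with eta_t = min(1, 2/t), the recurrence
# h_{t+1} <= h_t (1 - eta_t) + eta_t^2 c stays below 4c/t. Equality is the
# worst case, so we iterate it; h_1 is arbitrary since eta_1 = 1 wipes it out.
c, h = 1.0, 100.0
for t in range(1, 10_000):
    eta = min(1.0, 2.0 / t)
    h = h * (1 - eta) + eta**2 * c
    assert h <= 4 * c / (t + 1) + 1e-12, (t, h)
print("h_t <= 4c/t holds over all rounds checked")
```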
## 5. Boosting for Bandit Linear Optimization

Recall that the setting of Bandit Linear Optimization (BLO) is exactly the same as OCO, except that (1) the cost functions are linear, and (2) the only feedback available to the decision maker is the cost it incurs (no gradient information). We use the notation that the linear cost function at time $t$ is $f_t(x) = f_t^\top x$, with $f_t \in \mathbb{R}^d$.

To boost in the linear bandit setting, we apply techniques for unbiased gradient estimation. Namely, we use randomness to create a random linear function whose expectation is the actual linear function. We can then run the previous algorithm on the random linear functions and obtain a regret bound that holds in expectation.

For the rest of this section, we make the following assumption.

**Assumption 1.** The convex decision set $\mathcal{K}$ contains the unit simplex, $\Delta_d \subseteq \mathcal{K}$.

This assumption is very unrestrictive: by scaling and rotation, any non-degenerate set $\mathcal{K}$ can be made to contain the simplex. Scaling changes the regret by a constant factor, an axis-aligning rotation does not affect the regret bound at all, and we use the standard-basis-aligned simplex purely for notational convenience.

We can now use standard randomized exploration to create an unbiased estimator of the loss function, as per the following scheme. Let $b_t$ be a Bernoulli random variable with parameter $\eta$, and let $i_t$ be chosen uniformly at random from $\{1, 2, \dots, d\}$. Define
$$\tilde{f}_t(i) = \begin{cases} \frac{d}{\eta}\, f_t(i_t) & \text{if } i = i_t \text{ and } b_t = 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Clearly, the expectation of the random vector $\tilde{f}_t$ equals $\mathbb{E}[\tilde{f}_t] = f_t$. We can now use this random vector as feedback for Algorithm 1, as given in Algorithm 2.

**Algorithm 2** BoBLO (Boosting for Bandit Linear Optimization)
1. Input: parameter $\eta$; Algorithm 1 instance $\mathcal{A}$.
2. For $t = 1$ to $T$:
   1. Observe context $c_t$.
   2. Draw a Bernoulli random variable $b_t$ with parameter $\eta$.
   3. If $b_t = 1$: pick a coordinate $i_t \in \{1, \dots, d\}$ uniformly at random, play the basis vector $x_t = e_{i_t}$, and pass $\tilde{f}_t$ as per Equation (2) to $\mathcal{A}$.
   4. Else: play $x_t = \mathcal{A}(c_t)$, and pass the zero loss vector to $\mathcal{A}$.

Notice that Algorithm 2 calls upon weak learners for OCO with full information. For this algorithm, our main performance guarantee is as follows. The resulting bound is sublinear whenever the full-information regret bound is sublinear; therefore, boosting in the bandit linear setting is feasible whenever the full-information version is feasible, albeit at a slower rate.

**Theorem 10 (Boosting for Bandit Linear Optimization).** The predictions $x_t$ generated by Algorithm 2, with a suitable choice of $\eta$, satisfy
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t)\right] - \inf_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t(h(c_t)) \;\le\; O\!\left( T\sqrt{\frac{dGD}{\gamma\sqrt{N}}} + \sqrt{\frac{dGDT}{\gamma}\,\mathrm{Regret}_T(\mathcal{W})} \right).$$

*Proof.* Let $y_t = \mathcal{A}(c_t)$, and let $\tilde{G}$ denote an upper bound on the gradients of the estimated losses $\tilde{f}_t$. Exploration rounds occur with probability $\eta$ and contribute at most $O(\eta T)$ to the expected regret for bounded losses; on the remaining rounds, the losses seen by $\mathcal{A}$ are unbiased estimates of the $f_t$. Hence
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t)\right] - \inf_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t(h(c_t)) \;\le\; \mathbb{E}\left[\sum_{t=1}^{T} \big(\tilde{f}_t(y_t) - \tilde{f}_t(x_t^\star)\big)\right] + O(\eta T) \;\le\; \frac{4\tilde{G}DT}{\gamma\sqrt{N}} + \frac{2\tilde{G}D}{\gamma}\,\mathrm{Regret}_T(\mathcal{W}) + O(\eta T),$$
where the second inequality is Theorem 6 applied to the estimated losses. Lastly, by construction, the gradients of the loss estimates $\tilde{f}_t$ are bounded by $\tilde{G} \le \frac{dG}{\eta}$. Balancing $\eta$ concludes the claim.

### 5.1. Application to contextual bandits

The contextual bandits problem is one of the most well-studied special cases of contextual BLO. In multi-armed contextual bandits, the decision set is a set of $K$ discrete arms, the losses are $[0,1]$-bounded, and the only feedback available to the learner is the loss of the arm it picks. This is exactly BLO over the decision set $\mathcal{K} = \Delta_K$. We state a corollary of our result for this setting.

**Corollary 11 (Multi-armed Contextual Bandits).** The predictions $x_t$ generated by Algorithm 2 with the decision set $\mathcal{K} = \Delta_K$ and the loss set $[0,1]^K$ satisfy the guarantee of Theorem 10 with $d = K$:
$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(x_t)\right] - \min_{h \in \mathrm{CH}(\mathcal{H})} \sum_{t=1}^{T} f_t(h(c_t)) \;\le\; O\!\left( T\sqrt{\frac{KGD}{\gamma\sqrt{N}}} + \sqrt{\frac{KGDT}{\gamma}\,\mathrm{Regret}_T(\mathcal{W})} \right).$$
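A short sketch of one round of Algorithm 2 follows, built around the estimator of Equation (2). The predict/update interface of the Algorithm 1 instance is the assumed one used in the earlier sketches; `loss_vec` stands for the hidden loss vector $f_t$, of which only the played coordinate is actually observed under bandit feedback.

```python
import numpy as np

rng = np.random.default_rng(0)

def boblo_round(A, context, loss_vec, d, eta):
    """One round of Algorithm 2 (BoBLO). A is an Algorithm 1 instance with
    assumed predict(context)/update(loss_vector) methods; only loss_vec[i_t]
    is observed on exploration rounds, matching bandit feedback."""
    if rng.random() < eta:                       # b_t = 1: explore
        i_t = rng.integers(d)                    # uniform coordinate i_t
        x = np.zeros(d)
        x[i_t] = 1.0                             # play the basis vector e_{i_t}
        f_tilde = np.zeros(d)
        f_tilde[i_t] = (d / eta) * loss_vec[i_t] # Equation (2): E[f_tilde] = f_t
        A.update(f_tilde)
        return x
    x = A.predict(context)                       # b_t = 0: exploit
    A.update(np.zeros(d))                        # pass the zero loss vector
    return x
```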
## 6. Boosting for Stochastic Contextual Optimization

In this section we give an alternative viewpoint of our boosting results, through the lens of stochastic contextual optimization. The mathematical development is analogous to the previous results, but the statements may be of independent interest in statistical learning theory and contextual learning.

Let $\mathcal{K} \subseteq \mathbb{R}^d$ be the (convex) decision set, and $C$ the set of possible contexts. In the stochastic contextual optimization problem, the aggregate loss attributable to a hypothesis is the expected cost of its actions with respect to a joint distribution $\mathcal{D}$ over the context set $C$ and the set of convex cost functions over the decision set $\mathcal{K}$:
$$F_{\mathcal{D}}(h) = \mathbb{E}_{(f,c) \sim \mathcal{D}}[f(h(c))].$$

Instead of explicit optimization over the hypothesis class, we assume that we have access to experts that are approximately competitive with the base hypothesis class. More formally, we define a weak optimizer as follows.

**Definition 12.** A *$\gamma$-weak contextual optimizer* is a learning algorithm that, when given samples from a distribution $\mathcal{D}$ over unit linear loss functions and contexts, outputs a mapping $\mathcal{W} : C \to \mathcal{K}$ such that
$$F_{\mathcal{D}}(\mathcal{W}) \le \gamma \min_{h \in \mathcal{H}} F_{\mathcal{D}}(h) + \varepsilon.$$

Typically, $\varepsilon$ scales as $\frac{1}{\sqrt{m}}$ when the learner is given $m$ samples from the distribution $\mathcal{D}$.

The results in this section do not follow in a black-box manner from the previous ones. While online boosting algorithms operate in more permissive settings than statistical ones, in that they do not require the contexts and loss functions to be sampled i.i.d., the corresponding assumption on the weak learners in the online setting is stronger too.

### 6.1. Algorithm and main theorem

Algorithm 3 below makes use of $N$ weak optimizers, and generates a hypothesis $h : C \to \mathcal{K}$ for which the following theorem holds. Notice that the resulting guarantee is stronger than the weak optimization oracle in two respects:

1. The resulting solution competes with the convex hull of $\mathcal{H}$, rather than the best fixed $h \in \mathcal{H}$.
2. The resulting solution removes the $\gamma$ approximation factor of the weak optimizers.

**Theorem 13 (Main-SCO).** The hypothesis $h$ generated by Algorithm 3 with parameter $\delta = \sqrt{\frac{D^2}{G^2\gamma^2 N}}$ satisfies
$$F_{\mathcal{D}}(h) - \inf_{h^\star \in \mathrm{CH}(\mathcal{H})} F_{\mathcal{D}}(h^\star) \;\le\; \frac{4GD}{\gamma\sqrt{N}} + \frac{2GD}{\gamma}\,\varepsilon.$$

**Algorithm 3** B4CO (Boosting for Contextual Optimization)
1. Input: $\gamma$-weak contextual optimizer $\mathcal{O}$; parameters $(\eta_1, \dots, \eta_N)$, $\delta$; $\kappa = G$; distribution $\mathcal{D}$.
2. Choose $h^0$ arbitrarily from $\mathcal{H}$.
3. For $i = 1$ to $N$:
   1. Define a distribution $\mathcal{D}^i$ specified via the following sampling oracle: sample $(f, c) \sim \mathcal{D}$; define $\hat{f} = X_{\mathcal{K},\delta,\kappa}[f]$; define the linear loss function $f^i(x) = \nabla\hat{f}(h^{i-1}(c))^\top x$; output $(f^i, c)$.
   2. Call the weak optimizer on the distribution $\mathcal{D}^i$ to get a hypothesis $\mathcal{W}_i$.
   3. Define $h^i = (1 - \eta_i)\,h^{i-1} + \eta_i\,\tfrac{1}{\gamma}\mathcal{W}_i$.
4. Return the predictor $h$, defined as $h(c) = \Pi_{\mathcal{K}}(h^N(c))$.

As before, for a convex loss function $f$, define $\hat{f} = X_{\mathcal{K},\delta,G}[f]$, and note that $\hat{f}$ is $\frac{1}{\delta}$-smooth. We use the following notation below:
1. $F(h) = \mathbb{E}_{(f,c) \sim \mathcal{D}}[f(h(c))]$
2. $\hat{F}(h) = \mathbb{E}_{(f,c) \sim \mathcal{D}}[\hat{f}(h(c))]$

**Lemma 14.** Suppose that, for every $f \in \mathcal{F}$, $\hat{f}$ is $\beta$-smooth and $\hat{G}$-Lipschitz. Then
$$\hat{F}(h^N) - \hat{F}(h^\star) \;\le\; \frac{2\beta D^2}{\gamma^2 N} + \frac{\hat{G}D}{\gamma}\,\varepsilon.$$

*Proof.* Let $h^\star = \arg\min_{h \in \mathrm{CH}(\mathcal{H})} F(h)$. Denote, for all $i = 0, 1, 2, \dots, N$, $\Delta_i = \hat{F}(h^i) - \hat{F}(h^\star)$. Recall that $\hat{f}$ is $\beta$-smooth by our assumption. Thus,
$$\begin{aligned}
\Delta_i &= \mathbb{E}_{(f,c) \sim \mathcal{D}}\left[\hat{f}\!\left(h^{i-1}(c) + \eta_i\big(\tfrac{1}{\gamma}\mathcal{W}_i(c) - h^{i-1}(c)\big)\right)\right] - \hat{F}(h^\star) \\
&\le \mathbb{E}\left[\hat{f}(h^{i-1}(c)) + \eta_i\,\nabla\hat{f}(h^{i-1}(c))^\top\big(\tfrac{1}{\gamma}\mathcal{W}_i(c) - h^{i-1}(c)\big) + \frac{\eta_i^2\beta}{2}\big\|\tfrac{1}{\gamma}\mathcal{W}_i(c) - h^{i-1}(c)\big\|^2\right] - \hat{F}(h^\star) \\
&\le \mathbb{E}\left[\hat{f}(h^{i-1}(c)) + \eta_i\big(f^i\big(\tfrac{1}{\gamma}\mathcal{W}_i(c)\big) - f^i(h^{i-1}(c))\big)\right] - \hat{F}(h^\star) + \frac{\eta_i^2 \beta D^2}{2\gamma^2}.
\end{aligned}$$
Using the linearity of $f^i$, which implies that the optimal aggregate loss values over the classes $\mathcal{H}$ and $\mathrm{CH}(\mathcal{H})$ coincide, together with the definition of weak learning, we have
$$\Delta_i \;\le\; \Delta_{i-1} + \eta_i\,\mathbb{E}_{(f,c) \sim \mathcal{D}}\left[\nabla\hat{f}(h^{i-1}(c))^\top \big(h^\star(c) - h^{i-1}(c)\big)\right] + \frac{\eta_i^2 \beta D^2}{2\gamma^2} + \frac{\eta_i \hat{G}D\,\varepsilon}{\gamma} \;\le\; \Delta_{i-1}(1 - \eta_i) + \frac{\eta_i^2 \beta D^2}{2\gamma^2} + \frac{\eta_i \hat{G}D\,\varepsilon}{\gamma},$$
where the last step follows from the convexity of $\hat{f}$. We thus have the recurrence
$$\Delta_i \le \Delta_{i-1}(1 - \eta_i) + \frac{\eta_i^2 \beta D^2}{2\gamma^2} + \frac{\eta_i \hat{G}D\,\varepsilon}{\gamma}.$$
As before, substituting $\hat{\Delta}_i = \Delta_i - \frac{\hat{G}D\varepsilon}{\gamma}$ and applying Lemma 8 gives us $\hat{\Delta}_N \le \frac{2\beta D^2}{\gamma^2 N}$.

*Proof of Theorem 13.* Using Lemma 5, we have
$$F(h) - F(h^\star) = \mathbb{E}_{(f,c)}[f(h(c))] - \mathbb{E}_{(f,c)}[f(h^\star(c))] \le \mathbb{E}_{(f,c)}[\hat{f}(h(c))] - \mathbb{E}_{(f,c)}[\hat{f}(h^\star(c))] + \delta G^2 \le \mathbb{E}_{(f,c)}[\hat{f}(h^N(c))] - \mathbb{E}_{(f,c)}[\hat{f}(h^\star(c))] + 2\delta G^2.$$
By Lemma 3, $\hat{f}$ is $\frac{1}{\delta}$-smooth. By applying Lemma 14,
$$F(h) - F(h^\star) \le 2\delta G^2 + \frac{2D^2}{\delta\gamma^2 N} + \frac{\hat{G}D}{\gamma}\,\varepsilon.$$
Because $\mathrm{Dist}(\cdot, \mathcal{K})$ is 1-Lipschitz (Lemma 9), Lemma 3 gives $\hat{G} \le 2G$. Choosing the suggested value of $\delta$ yields the proof.
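A compact Python sketch of Algorithm 3 follows. The weak-optimizer call signature, the choice of $h^0 \equiv 0$, and the use of a raw loss gradient in place of the smoothed extension gradient are all simplifying assumptions of the sketch, not the paper's prescription.

```python
import numpy as np

def b4co(weak_optimizer, sample_D, project, gamma, N, dim):
    """Sketch of Algorithm 3 (B4CO). sample_D() is assumed to yield (grad_f, c),
    with grad_f(x) a gradient oracle for a loss drawn from D; weak_optimizer
    consumes a sampler of (linear-loss-vector, context) pairs and returns a
    gamma-weak hypothesis W: context -> K. All interfaces are assumptions."""
    learned, etas = [], []

    def h_i(c, upto):
        """Evaluate the running combination h^{upto}(c)."""
        x = np.zeros(dim)                      # h^0 plays 0, for simplicity
        for W, eta in zip(learned[:upto], etas[:upto]):
            x = (1 - eta) * x + (eta / gamma) * W(c)
        return x

    for i in range(1, N + 1):
        def sample_linear(i=i):                # sampling oracle for D^i
            grad_f, c = sample_D()
            return grad_f(h_i(c, i - 1)), c    # f^i(x) = <grad at h^{i-1}(c), x>
        learned.append(weak_optimizer(sample_linear))
        etas.append(min(2.0 / i, 1.0))

    return lambda c: project(h_i(c, N))        # h(c) = Pi_K(h^N(c))
```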
**Figure 1.** Performance of the boosting algorithm on 3 different datasets, averaged across 20 runs, as a function of the weak learner class and the number of weak learners $N$. Entries indicate normalized loss values, with the base (weak) learner's loss set to one; the improvement column reports the gain of the best boosted model over the single weak learner.

**Boston dataset**

| Weak learner | WL (N=1) | N=2 | N=3 | N=4 | N=5 | Improvement |
|---|---|---|---|---|---|---|
| Decision Stumps | 1.000 | 0.941 | 0.891 | 0.853 | 0.821 | 17.9% |
| Ridge Regression | 1.000 | 0.980 | 0.985 | 1.009 | 1.060 | 2.0% |
| Tiny MLP | 1.000 | 0.987 | 0.966 | 0.934 | 0.920 | 8.0% |

**Diabetes dataset**

| Weak learner | WL (N=1) | N=2 | N=3 | N=4 | N=5 | Improvement |
|---|---|---|---|---|---|---|
| Decision Stumps | 1.000 | 0.975 | 0.960 | 0.952 | 0.941 | 5.9% |
| Ridge Regression | 1.000 | 0.986 | 0.974 | 0.965 | 0.956 | 4.4% |
| Tiny MLP | 1.000 | 0.982 | 0.981 | 0.987 | 0.965 | 3.5% |

**California Housing dataset**

| Weak learner | WL (N=1) | N=2 | N=3 | N=4 | N=5 | Improvement |
|---|---|---|---|---|---|---|
| Decision Stumps | 1.000 | 0.969 | 0.946 | 0.932 | 0.922 | 7.8% |
| Ridge Regression | 1.000 | 1.019 | 1.041 | 1.066 | 1.094 | -1.9% |
| Tiny MLP | 1.000 | 0.860 | 0.861 | 0.888 | 0.835 | 16.5% |

## 7. Experimental Results

While the primary contribution of this work is theoretical, empirically testing our proposal serves as a sanity check for the theoretical results. Also, as stated in the related work, the proposed algorithm is the first of its kind in being able to operate simultaneously with general convex losses and sets, and with multiplicatively approximate (inexact) weak learners. Consequently, the role of the experiments here is to corroborate the theory; they are not aimed at achieving state-of-the-art results on the considered datasets.

To validate our results, we check whether the proposed algorithm is indeed capable of boosting the accuracy of concrete instantiations of weak learners. Three online weak learners, implemented with machine learning models from Scikit-Learn (Pedregosa et al., 2011), were considered: decision stumps, ridge regression, and a tiny multi-layer perceptron with one hidden unit trained via online gradient descent. For each choice of weak learner class, we measured the performance of the boosting algorithm (Algorithm 1) across multiple rounds of boosting, i.e., numbers of weak learners; the computational burden of the algorithm scales linearly with the latter. We evaluated the weak learners and the boosting algorithm on 3 publicly available datasets for regression with square loss, averaging each across 20 runs. (The California Housing dataset (Pace & Barry, 1997) and the Boston dataset (Harrison Jr & Rubinfeld, 1978) can be found at http://lib.stat.cmu.edu/datasets/; the Diabetes dataset used here is in the UCI ML Repository, https://archive.ics.uci.edu/ml/datasets/Diabetes.) The step size $\eta$ was set to 0.01, and $\gamma$ was chosen to be 0.1; these values were not tuned. We present the results in Figure 1. Further experimental details and code are provided in the supplement.

The results demonstrate that boosting for online convex optimization does indeed succeed in boosting accuracy while using only a few weak learners (equivalently, within a few rounds of boosting), and therefore with a reasonably small increase in computation. The improvement is most pronounced for decision stumps, since this is an especially impoverished base class. Also, note that a convex combination of linear models is itself linear; therefore, boosting a linear model (like ridge regression) has no expressivity benefit. Mirroring this, the improvements in the ridge regression setting are the least prominent, and in some cases negative.

## 8. Conclusions

We study the problem of online convex optimization in the contextual setting. To cope with the computational challenges, we take a boosting approach, and generalize the methodology of gradient boosting and online boosting to contextual online convex optimization. Our derivation introduces a new extension operator for convex functions over convex domains, which may be of independent interest.

### Acknowledgements

Elad Hazan was supported in part by National Science Foundation Award 1704860.
## References

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.

Agarwal, N., Brukhim, N., Hazan, E., and Lu, Z. Boosting for dynamical systems. arXiv preprint arXiv:1906.08720, 2019.

Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

Beck, A. First-Order Methods in Optimization. SIAM, 2017.

Beygelzimer, A., Hazan, E., Kale, S., and Luo, H. Online gradient boosting. In Advances in Neural Information Processing Systems, pp. 2458–2466, 2015a.

Beygelzimer, A., Kale, S., and Luo, H. Optimal and adaptive algorithms for online boosting. In International Conference on Machine Learning, pp. 2323–2331, 2015b.

Brukhim, N. and Hazan, E. Online boosting with bandit feedback. arXiv preprint arXiv:2007.11975, 2020.

Brukhim, N., Chen, X., Hazan, E., and Moran, S. Online agnostic boosting via regret minimization, 2020.

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., and Warmuth, M. K. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.

Chen, S.-T., Lin, H.-T., and Lu, C.-J. An online boosting algorithm with theoretical justifications, 2012.

Chen, S.-T., Lin, H.-T., and Lu, C.-J. Boosting with online binary learners for the multiclass bandit problem. In International Conference on Machine Learning, pp. 342–350, 2014.

Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.

Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.

Harrison Jr, D. and Rubinfeld, D. L. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.

Hazan, E. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Hazan, E. and Koren, T. The computational power of optimization in online learning. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pp. 128–141, 2016.

Hazan, E., Hu, W., Li, Y., and Li, Z. Online improper learning with an approximation oracle. In Advances in Neural Information Processing Systems, pp. 5652–5660, 2018.

Jung, Y. H. and Tewari, A. Online boosting algorithms for multi-label ranking. In International Conference on Artificial Intelligence and Statistics, pp. 279–287, 2018.

Jung, Y. H., Goetz, J., and Tewari, A. Online multiclass boosting. In Advances in Neural Information Processing Systems, pp. 919–928, 2017.

Kakade, S. M., Kalai, A. T., and Ligett, K. Playing games with approximation algorithms. SIAM Journal on Computing, 39(3):1088–1106, 2009.

Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824, 2008.

Leistner, C., Saffari, A., Roth, P. M., and Bischof, H. On robustness of on-line boosting - a competitive study. In IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1362–1369. IEEE, 2009.

Littlestone, N. and Warmuth, M. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

Olshen, R. A conversation with Leo Breiman. Statistical Science, 16(2):184–198, 2001.

Pace, R. K. and Barry, R. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Rakhlin, A. and Sridharan, K. BISTRO: An efficient relaxation-based method for contextual bandits. In International Conference on Machine Learning, pp. 1977–1985, 2016.

Schapire, R. E. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012a. ISBN 9780262017183.

Schapire, R. E. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012b.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Syrgkanis, V., Krishnamurthy, A., and Schapire, R. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, pp. 2159–2168, 2016a.

Syrgkanis, V., Luo, H., Krishnamurthy, A., and Schapire, R. E. Improved regret bounds for oracle-based adversarial contextual bandits. In Advances in Neural Information Processing Systems, pp. 3135–3143, 2016b.

Viola, P. and Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1. IEEE, 2001.