# online_agnostic_multiclass_boosting__546adfb5.pdf

Online Agnostic Multiclass Boosting

Vinod Raman Department of Statistics

University of Michigan

Ann Arbor, MI 48104 vkraman@umich.edu

Ambuj Tewari Department of Statistics

University of Michigan

Ann Arbor, MI 48018 tewaria@umich.edu

Boosting is a fundamental approach in machine learning that enjoys both strong theoretical and practical guarantees. At a high-level, boosting algorithms cleverly aggregate weak learners to generate predictions with arbitrarily high accuracy. In this way, boosting algorithms convert weak learners into strong ones. Recently, Brukhim et al. [6] extended boosting to the online agnostic binary classiﬁcation setting. A key ingredient in their approach is a clean and simple reduction to online convex optimization, one that efﬁciently converts an arbitrary online convex optimizer to an agnostic online booster. In this work, we extend this reduction to multiclass problems and give the ﬁrst boosting algorithm for online agnostic mutliclass classiﬁcation. Our reduction also enables the construction of algorithms for statistical agnostic, online realizable, and statistical realizable multiclass boosting.

1 Introduction

Boosting is a fundamental technique in machine learning that cleverly aggregates the predictions of weak learners to produce a strong learner. Originally studied in the batch (realizable) PAC learning setting for binary classiﬁcation, boosting has now been extended to a wide variety of settings, including multiclass classiﬁcation, online learning, and agnostic learning. Recently, Brukhim et al. [6] extended boosting to online agnostic binary classiﬁcation, marking the completion of boosting algorithms for all four regimes of statistical/online and agnostic/realizable binary classiﬁcation. However, less can be said about multiclass classiﬁcation, where boosting algorithms are only well studied under the assumption of realizability. Designing online agnostic multiclass boosting algorithms is important for several reasons. First, realizability is a very strong assumption in real-life: it requires the existence of an expert that perfectly labels the data. Second, the vast majority of classiﬁcation tasks require more than two labels, a prominent example being image classiﬁcation. Lastly, modern machine learning tasks often require sequential processing of data, making the design and development of online algorithms increasingly relevant. In this work we ﬁll this gap in literature by studying online agnostic boosting for multiclass problems.

1.1 Main Results

We give the ﬁrst weak learning conditions and algorithms for online agnostic multiclass boosting. A key idea of our algorithms is a reduction from boosting to online convex optimization (OCO), an idea borrowed from Brukhim et al. [6]. As a consequence of this reduction, we are also able to give algorithms for the three other settings of statistical agnostic, online realizable, and statistical realizable multiclass boosting. Finally, we give empirical results showcasing that our OCO-based boosting algorithms are fast and competitive with existing state-of-the-art multiclass boosting algorithms.

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

1.2 Related Works

Boosting was ﬁrst studied for binary classiﬁcation under the realizable PAC learning setting [33, 16 18, 32] and then later extended to the agnostic PAC learning setting [15, 21, 27, 26, 25, 28, 2]. The

success of boosting for binary classiﬁcation led to signiﬁcant interest in designing boosting algorithms for multiclass problems. As a result, several multiclass boosting algorithms were proposed for the realizable batch setting [19, 14, 22, 20], culminating in the work by Mukherjee and Schapire [30], who uniﬁed the previous approaches under a cost matrix framework. Beyond the batch setting, online boosting algorithms for binary [31, 9, 4] and multiclass classiﬁcation [10, 24] have been designed assuming realizability (mistake-bound). More recently, Brukhim et al. [6] give the ﬁrst online agnostic (regret-bound) boosting algorithm for binary classiﬁcation, marking complete all four regimes of statistical/online and agnostic/realizable boosting for binary classiﬁcation.

Moving to agnostic multiclass boosting, Brukhim et al. [7] study the resources required for boosting in the statistical/batch setting as the number of labels k grows. However, they consider an alternative model of boosting where the weak learner is a strong agnostic PAC learner for a simple "easy-to-learn" base hypothesis class, and the goal is to learn target concepts outside the base class. Speciﬁcally, they assume the target concept can be represented by weighted plurality votes over the base class. In this way, the weakness of a weak learner is manifested in the base hypothesis class. Instead, in our work, we consider the standard boosting model of ﬁxing the base hypothesis class, and ask whether a weak learner s performance can be improved relative to the best ﬁxed hypothesis in that class.

Beyond classiﬁcation, several other works have studied online agnostic boosting for real-valued loss functions under both full-information and bandit feedback settings [3, 1, 5]. In particular, the work by Brukhim and Hazan [5] reduce online boosting for regression tasks under bandit feedback to online linear optimization. A key difference between these works and ours is in the weak learning assumption: these works consider a weak learner that is a strong learner for a small base class of regression functions. The goal of boosting then is to produce a strong online learner for a larger class which contains linear spans of the base class. This is in contrast to this work, where again, we ﬁx the base hypothesis class, and boost the regret bound.

2 Preliminaries and Notation

We ﬁrst describe the basic setup for online agnostic multiclass boosting. There are k possible labels Bk := {1, ..., k} and k is known to the booster and the weak learners. The booster maintains N copies of a weak learner, W, which themselves are (randomized) online learning algorithms that sequentially process examples from instance space X and output predictions in Bk, which we denote as the set of basis vectors of length k. At each iteration t = 1, ..., T, an adversary picks a labeled example (xt, yt) 2 X Bk and reveals xt to the booster. Once the booster observes the unlabeled data xt, it gathers the weak learners predictions and makes a ﬁnal (possibly randomized) prediction

ˆyt 2 Bk. After observing the booster s ﬁnal decision, the adversary reveals the true label yt, and the booster suffers the loss

{ˆyt 6= yt} or equivalently, gains the reward 2

{ˆyt = yt} 1. Finally, the booster, after observing the true label yt, updates each weak learner. Note that the loss/gain of the booster in round t can be written as 1 ˆyt yt and 2ˆyt yt 1 respectively. We also let k represent the (k 1)-dimensional probability simplex and k

γ represent the 1

γ -scaled (k 1)-dimensional probability simplex for γ 2 (0, 1). Finally, we let

k denote the k-dimensional ones vector.

Evaluation. Unlike the realizable setting, in the agnostic setting, we place no restrictions on how the stream of examples x1, ..., x T are labelled. Thus, for a ﬁxed hypothesis class H BX

k , the goal of the booster is to output predictions ˆyt such that the expected regret,

(2h(xt) yt 1)

(2ˆyt yt 1)

is minimized, where the expectation is over the randomness of the booster and that of the possibly adaptive adversary. Note, this is in contrast to the realizable setting where the stream is labelled by a h 2 H and we wish the booster to minimize the (expected) number of mistakes (mistake-bound).

Agnostic Boosting. A key technique in agnostic boosting, ﬁrst appearing in the work by Kanade and Kalai [27], is to update weak learners by feeding randomly relabelled examples. This is in contrast to the realizable setting where we typically update weak learners by passing reweighted examples.

Accordingly, in order to design a good boosting algorithm, we need to design the booster s strategy for random relabelling while also quantifying the weak learner s ability to maximize cumulative gain, even under relabelled data. The ﬁrst task will be resolved by allowing the booster to use an Online Convex Optimization (OCO) oracle. In this way, we reduce boosting to OCO, an idea borrowed from Brukhim et al. [6]. For the second task, we give different possible weak learning conditions for the same algorithm, all of which capture the ability of a weak learner to maximize cumulative gain with respect to the best ﬁxed competitor in hindsight.

Online Convex Optimization. Our booster will use an OCO oracle to update its weak learners. The OCO setting is a sequential game between an online player and adversary over N rounds (see [23] for an in-depth introduction). In each round, the player plays a point xi in a compact convex set K

d, the adversary reveals a loss function fi chosen from a family of bounded convex functions over K, and the player suffers the loss fi(xi). The goal of player is to also minimize regret, deﬁned as

We will denote an algorithm in this setting as OCO(K, N). If A is a OCO(K, N), then we will denote its regret by RA(N). For many OCO algorithms, like Online Gradient Descent (OGD), the regret RA(N) is a sub-linear function of the time-horizon N.

3 Online Agnostic Boosting

In this section, we present an online agnostic boosting algorithm and analyze its performance. We begin in Subsection 3.1 by formally describing our algorithm which reduces boosting to OCO. Then, in Subsection 3.2, we state a weak learning condition and subsequently prove a regret bound for our proposed algorithm.

3.1 Algorithm

Pseudocode for our online agnostic boosting algorithm is provided in Algorithm 1. The booster maintains oracle access to N copies of a weak learner W1, ..., WN as well as a OCO( k, N) algorithm A. Each weak learner is characterized by some advantage parameter γ 2 (0, 1] (directly proportional to its strength), and we will be precise about what exactly γ quantiﬁes in Subsection 3.2. In round t, the booster uses the weak learners to make a prediction ˆyt, observes the true label yt, and ﬁnally simulates a game with the OCO algorithm A to update each weak learner Wi. Speciﬁcally, the booster uses the outputs of A to feed relabelled examples (xt, yi

t) to Wi. A critical component of our algorithm is the L2 projection operator onto the probability simplex k, which we denote by Q. This is used by the booster in line 4 to make randomized predictions ˆyt.

We now highlight some desirable properties of Algorithm 1. First, it is easy to implement and efﬁcient assuming access to an efﬁcient weak learner. In particular, for each round t, if the running time of the weak learner is Q, then the running time of our booster is O(NQ + Nk log(k)). This is in contrast to the Online MBBM algorithm proposed by Jung et al. [24] for the realizable setting which is not efﬁcient even assuming access to an efﬁcient weak learner. Second, when k = 2, our algorithm reduces down to the online agnostic boosting algorithm proposed by Brukhim et al. [6] in the binary case. Indeed, one can verify that when k = 2, the L2 projection operator reduces to the same projection used by Brukhim et al. [6] and the distribution over labels pi

t induced by the outputs of OCO algorithm are equal for each weak learner in every round t.

While the framework in Algorithm 1 is inspired from Brukhim et al. [6] in the binary setting, several new complications arise in the multiclass setting. Most notably, when k > 2, we must ﬁgure out how the booster should make predictions, what loss functions the booster should construct and pass to A, and lastly, how the booster should use the output of A to update each weak learner. The interplay behind these three algorithmic pieces is delicate and they have been carefully designed in Algorithm 1 to enable the analysis in Subsection 3.2. Below, we provide some intuition behind these algorithmic decisions.

Randomized Prediction. At the start of each round, the booster averages the weak learners votes, scales the average by the parameter γ, projects the scaled vector back into the simplex, and ﬁnally samples a random label. When the L2 projection operator is selected, this approach for randomized

Algorithm 1: Online Agnostic Multiclass Boosting via OCO Input: Weak Learners W1...WN, OCO( k, N) algorithm A, Advantage parameter γ

1 for t = 1, ..., T do

2 Receive example xt

3 Accumulate weak predictions ht = PN

4 Set Dt = Q( ht

5 Predict ˆyt Dt 6 Receive true label yt 7 for i = 1, ..., N do

8 If i > 1, obtain pi

t , ..., li 1

t ). Else, initialize p1

9 Reveal loss function: li

10 Sample random label yi

t, and pass (xt, yi

Figure 1: The red, blue, and black lines correspond to 2

γ for γ = 0.50, 0.75, and, 1.0 respectively. p1 and p2 are γ-scaled votes and p

2 are their corresponding L2 projection onto 2. p1 and p2 denote the orthogonal projections onto the plane containing 2.

prediction achieves a polarization effect: as γ gets smaller, the projection concentrates mass on a fewer number of labels, speciﬁcally those labels that have achieved the majority of the votes from the weak learners. When γ nears 0, the L2 projection eventually places all mass on the label with the most votes.

One might think that the L2 projection is not a natural projection operator for k (as KL or L1 projections might seem better suited to the geometry of the probability simplex). However, in the case where we project from k

γ to k, we ﬁnd that it is a natural choice from a geometric perspective. Figure 1 provides a visualization of the L2 projection operator from k

γ to k for k = 2. Because the spaces k

γ and k are parallel, given any point p 2 k

γ , one can think of its L2 projection in the following procedural manner:

1. Compute the orthogonal projection, p, onto the plane containing k. 2. If p 2 k, output p = p. 3. Else, output p = arg minp2 k ||p p||2.

From this procedural perspective, Figure 1 lends a geometric intuition behind the polarization effect of the L2 projection. As γ shrinks, the number of points in k

γ that lie orthogonally above k shrinks. Thus, the orthogonal projection of the vast majority of points in k

γ will lie outside k,

leading to a greater number of sparse projections that lie on the boundary of k. From the booster s viewpoint, this property of the L2 projection is desirable. If the weak learners are very weak (small γ) but somehow concentrate votes on a few labels, then it may be likely that the true response is amongst these few labels. In this sense, γ controls how many weak learners need to agree on a particular label to convince the booster to deterministically predict that label.

Updating Weak Learners. Once the true label yt is revealed in each round, the booster must update each weak learner. As mentioned in Section 2, one strategy for updating weak learners in the agnostic setting is via random relabelling. Indeed, in line 10, the booster passes back to the weak learner the example xt with a random label yi

t. Together, the speciﬁed loss function in line 9 and random relabelling strategy in line 10 achieve the following effect: if more weak learners make mistakes, the distribution over labels output by the OCO algorithm A in line 8 concentrates on the true label yt, increasing the likelihood that yt is passed to subsequent weak learners. This is desirable as the outputs of the OCO guide each weak learner to correct for the mistakes of preceding learners.

3.2 Regret Analysis

Before we give the regret bound of Algorithm 1, we need to specify the capacity to which the weak learners can make predictions, even under potentially relabelled data. Unfortunately, in the agnostic setting, there is no canonical weak learning condition for multiclass problems. In this paper, we give several possible deﬁnitions of an Agnostic Weak Online Learner (AWOL) for the multiclass setting, all of which enable a regret analysis of Algorithm 1. We emphasize that we can derive different regret bounds for Algorithm 1 based on what condition we assume our weak learners to satisfy. For this section, we present a weak learning condition based loosely on a one-vs-one approach to multiclass classiﬁcation. In Appendix C, we provide alternative weak learning conditions.

Deﬁne the gain function for input z 2 Bk,

{z = } = z (y ).

For some y 2 Bk and 2 Bk \{y}, σy, ( ) can be thought of as the binary classiﬁcation task between labels y and . Deﬁnition 1 requires that for any such sequence of binary classiﬁcation tasks, an online agnostic weak learner must be able to eventually distinguish between every pair of labels to some non-trivial, but far from optimal, degree.

Deﬁnition 1 (Agnostic Weak Online Learning). Let H BX

k be a class of experts and let 0 < γ 1 denote the advantage . An online learning algorithm W is a (γ, T)-agnostic weak online learner (AWOL) for H if for any adaptively chosen sequence of tuples (xt, yt, t) 2 X Bk Bk where

t 6= yt, the algorithm outputs W(xt) 2 Bk at every iteration t 2 [T] such that,

σyt, t(h(xt))

σyt, t(W(xt))

where the expectation is taken w.r.t. the randomness of the weak learner W and that of the possibly adaptive adversary, and RW :

+ is the additive regret: a non-decreasing, sub-linear function of T.

The precise dependence of RW(T, k) on k is explored in more detail in Appendix D where we explicitly construct learners satisfying AWOL. We make few remarks about Deﬁnition 1. First, the strength of the weak learner varies directly with γ. Second, for k = 2, Deﬁnition 1 reduces to the weak learning condition by Brukhim et al. [6] in the binary setting. Finally, we emphasize that Deﬁnition 1 holds under an adaptive adversary, one that can choose (yt, t) based on {W(xi)}t 1

i=1 and its own internal random bits. Importantly, we also allow the adversary to pick t even after it has observed W(xt) as this only controls how much loss the weak learner suffers when it is incorrect.

Under the assumption that weak learners satisfy Deﬁnition 1, Theorem 1 bounds the regret of Algorithm 1 under an oblivious adversary. Using a standard reduction, our results can then be generalized to an adaptive adversary (see Chapter 4 in [8]). In particular, a key requirement allowing an oblivious regret bound to generalize to an adaptive regret bound is that the learner s predictions on round t should not depend on any of its past predictions from previous rounds. This is indeed true for our Booster.

Theorem 1 (Regret Bound). Assuming weak learners satisfy Deﬁnition 1, the expected regret bound of Algorithm 1 is,

(2h(xt) yt 1)

(2ˆyt yt 1)

where the expectation is over the randomness of the algorithm and weak learners, and RW(T, k), RA(N) are the regret terms of the AWOL, OCO algorithms respectively.

If one picks A to be Online Gradient Descent (OGD), then RA(N) = O(GD

N), where D is the diameter of k and G is the upper-bound on ||rpli

t(p)||. In our setting, D =

2 and G = O( 1

γ ) (using Lemma 19) and hence the average regret further simpliﬁes to:

(2h(xt) yt 1)

(2ˆyt yt 1)

To exemplify the role of N, consider a scenario where RW(T, k) = O(k

T) (see Appendix D for examples). By setting N = T γ2 the overall regret of the booster becomes O( k

T γ ). In the next subsection, we give the proof of Theorem 1.

3.2.1 Proof of Theorem 1

We follow a similar procedure to Brukhim et al. [6] by lower and upper bounding the expected sum of losses passed to A in terms of the regret of the weak learner and the regret of A respectively. These bounds rely on several lemmas that have been abstracted out and provided in Appendix E. As mentioned previously, we also assume an oblivious adversary.

Let (x1, y1), ..., (x T , y T ) be any sequence of example-label pairs. We start by giving a lower bound on the expected sum of losses passed to A using Deﬁnition 1. Deﬁne h = arg maxh2H

t=1(2h(xt) yt 1) as the optimal competitor in hindsight and let

Wi(xt), if Wi(xt) 6= yi

t 2 Bk \ {yi

t}, otherwise .

Note that the precise choice of i

t in the second case is not important. The proof below only uses the fact that i

t 2 Bk \ {yi

t} in that case. We then have,

Using the weak learning condition in Deﬁnition 1,

γ (Deﬁnition 1)

γ . (Lemma 15)

Putting things together, we ﬁnd,

t (2h (xt) 2yt)

E [2(h (xt) yt 1)] NRW(T, k)

γ (Lemma 16)

2(h (xt) yt 1) NRW(T, k)

Now, we compute an upper bound. For any t 2 [T] and arbitrary p 2 k:

N (OCO Regret)

(p E [2Wi(xt)

N (Lemma 17)

= 2 (E [ˆyt] yt 1) + RA(N)

N . (Law of total expectation)

Summing over T,

2 (E [ˆyt] yt 1) + TRA(N)

Combining lower and upper bounds for E

(2h(xt) yt 1)

(2ˆyt yt 1)

which completes the proof. Since 2h(xt) yt 1 = 1 2

k{h(xt) 6= yt}, we can also write the average regret in terms of the number of mistakes,

{ˆyt 6= yt} min

{h(xt) 6= yt}

2γT + RA(N)

4 Beyond Online Agnostic Boosting

In this section, we present results of extending our reduction to the three other boosting settings, namely statistical agnostic, online realizable, and statistical realizable learning. The purpose of this section is to showcase the generality of the OCO-based boosting framework and not to achieve state-of-the-art bounds for these settings. Boosting algorithms and all associated proofs can be found in Appendix A and B. Throughout this section we will let WS denote the hypothesis output by a weak learner trained on a sample S.

4.1 Statistical Agnostic Boosting

In the statistical setting, our objective of interest is the correlation, which we deﬁne below. Let D be a distribution over X Bk and let h : X ! Bk be an hypothesis. Deﬁne the multiclass correlation of h with respect to D as

cor D(h) = E (x,y) D [h(x) (2y

Like the online agnostic setting, the boosting algorithm for this setting (provided in Appendix A) can be analyzed under several candidate weak learning conditions. To showcase the dependence on k, we provide a weak learning condition based loosely on a one-vs-all approach to multiclass classiﬁcation.

Deﬁnition 2 (Empirical Agnostic Weak Learning). Let H BX

k be a hypothesis class and let 0 < γ 1 denote the advantage . Let x = (x1, ..., xm) 2 X denote an unlabeled sample. A learning algorithm W is a (γ, 0, m0)-agnostic weak learner (AWL) for H with respect to x if for any labels y = (y1, ..., ym) 2 Bk, and every reference label 2 Bk,

WS0(xi) (2yi

where S0 is an independent sample of size m0 drawn from the distribution which uniformly assigns to each example (xi, yi) probability 1/m.

Under the assumption that our weak learners satisfy Deﬁnition 2, Corollary 2 bounds the expected correlation of a statistical agnostic boosting algorithm.

Corollary 2 (Empirical Agnostic Correlation Bound). There exists a boosting algorithm whose output after T rounds, denoted by h, satisﬁes:

h2H E [cor S(h)] RA(T)

where S is the distribution which uniformly assigns to each example (xi, yi) probability 1/m.

Letting A be OGD, and setting T = O( 1 γ2 2 ) for any > 0 gives an error rate of k 0

γ + . Although, to our best knowledge, there are no existing multiclass boosting algorithms in the agnostic setting, we point out that there is evidence to suggest that a sub-optimal dependence of the error rate on k might be unavoidable (see [7]).

4.2 Online Realizable Boosting

In the online realizable setting, we are guaranteed that the stream of examples (x1, y1), ..., (x T , y T ) is perfectly labelled by some hypothesis h 2 H. Deﬁnition 3 then gives an appropriate online realizable weak learning condition. Deﬁnition 3 (Realizable Weak Online Learning). Let H BX

k be a class of experts and let 0 < γ 1 denote the advantage . An online learning algorithm W is a (γ, T)-realizable weak online learner (RWOL) for H if for any sequence (x1, y1), ..., (x T , y T ) 2 X Bk that is realizable by H, at every

iteration t 2 [T], the algorithm outputs W(xt) 2 Bk such that,

(2W(xt) yt 1)

γT RW(T, k),

where the expectation is taken w.r.t. the randomness of the weak learner W and RW :

+ is the additive regret: a non-decreasing, sub-linear function of T.

Under Deﬁnition 3, Corollary 3 bounds the expected gain of an online boosting algorithm. Corollary 3 (Mistake Bound). There exists a boosting algorithm whose outputs ˆy1, ..., ˆy T satisfy:

(2 E [ˆyt] yt 1) 1 RA(N)

where RW(T, k) = 2RW(T, k) + O(

If we consider a scenario where RW(T, k) = O(k

T), taking A to be OGD, N to be O( 1 γ2 2 ) and T to be O( 1 γ2 2 ), gives a mistake-bound at most k T. We point out that we cannot readily compare this bound with the existing bounds for online realizable boosting by Jung et al. [24] because the dependence of RW(T,k) on k depends on the weak learner of choice.

4.3 Statistical Realizable Boosting

In the statistical realizable setting, our metric of interest is again the multiclass correlation. However, under the realizability assumption, maxh 2H cor D(h ) = 1. Accordingly, Deﬁnition 4 gives an appropriate realizable weak learning condition. Deﬁnition 4 (Empirical Realizable Weak Learning). Let H BX

k be a hypothesis class and let 0 < γ 1 denote the advantage . Let S = {(x1, y1), ..., (xm, ym)} 2 X Bk be a sample. A learning algorithm W is a (γ, m0) - realizable weak learner (RWL) for H with respect to S if for any distribution p = (p1, ..., pm) over the examples,

E S0 [corp(WS0)] γ,

where S0 is an independent sample of size m0 drawn from p.

Under Deﬁnition 4, Corollary 4 bounds the expected correlation of a statistical realizable boosting algorithm. Corollary 4 (Empirical Correlation Bound). There exists a boosting algorithm whose output after T rounds, denoted h, satisﬁes,

where S is the distribution which uniformly assigns to each example (xi, yi) probability 1/m.

Note that the cost of weak learning manifests in the regret term of the OCO algorithm. Precisely, if one picks OGD to be the OCO algorithm, then the bound on correlation can be expressed as E

, which exhibits a decreasing lower bound as γ shrinks. Setting

T = O( 1 γ2 2 ) for any > 0 ensures that at most error is obtained. It is difﬁcult to compare our error rates to that achieved by existing multiclass boosting algorithm in the realizable setting because our weak learning condition is different and potentially stronger. Nevertheless, our bounds are sub-optimal to those in Mukherjee and Schapire [30] which only require setting T = O( 1

)) to achieve error.

Table 1: Average accuracy and runtime of algorithms on 7 UCI datasets.

Dataset k Accuracy(%) Runtime(s) Agn Opt Ada OCOR Agn Opt Ada OCOR Balance 3 84.4 79.5 75.2 78.8 11.5 57.6 7.1 4.5 Cars 4 80.6 69.9 75.8 69.6 42.5 104.1 20.4 12.1 Landsat 6 67.0 80.8 56.9 79.6 1255.6 1814.3 81.8 440.4 Segment 7 75.0 78.9 68.6 79.3 647.2 2448.3 51.8 154.7 Mice 8 86.0 77.6 71.3 79.6 1025.8 1811.6 90.1 258.2 Yeast 10 42.3 39.8 41.7 47.6 348.3 1468.8 25.2 56.6 Abalone 28 22.1 24.9 19.2 22.0 4036.7 9775.3 108.0 248.2

5 Experiments

We performed experiments with Algorithm 1 on seven UCI datasets [13]. Since, to our knowledge, there are no known online agnostic boosting algorithms for multiclass problems, we benchmark performance against three online realizable multiclass boosting algorithms: the state-of-the-art optimal (Online MBBM) and adaptive (Adaboost.OLM) online boosting algorithms by Jung et al. [24], and the OCO-based online boosting algorithm alluded to in Section 4 (see Appendix B). For weak learners, we used the implementation of the Very Fast Decision Tree from the River package [29] and restricted the maximum depth of the tree to 1. We used Projected OGD [35] for the OCO algorithm and set the number of weak learners, N, to 100 for each boosting algorithm.

Our experiments using Algorithm 1 are performed using fractional relabeling, a technique borrowed from Kanade and Kalai [27]. That is, instead of passing just a single example-label pair (xt, yi

t) where yi

t, we pass all k possible weighted example-label tuples {(xt, , pi

=1 to weak learner i in round t. Experiments with random relabeling showed that random relabeling runs faster but performs worse than fractional relabeling.

Table 1 summarizes the average accuracy and runtime over ﬁve independent shufﬂes of each dataset for each boosting algorithm. "Agn" refers to Algorithm 1, "Opt" to Online MBBM, "Ada" to Adaboost.OLM, and "OCOR" to the OCO-based online realizable algorithm. Standard errors for accuracies are provided in Appendix F. All algorithms except Ada Boost.OLM are parameterized by an advantage parameter γ 2 (0, 1). Thus, γ was tuned separately for each respective cell of Table 1. See Appendix F for more experimental details. All code is available at https://github.com/ vinodkraman/Online Agnostic Multiclass Boosting.

Despite having the fastest overall runtime, Adaboost.OLM had the poorest performance. Compared to Online MBBM, our OCO-based boosting algorithms achieve comparable performance at a fraction of the runtime. Speciﬁcally, for three of our datasets, our online agnostic boosting algorithm achieves the highest accuracy. For the remaining datasets, our OCO-based realizable boosting algorithm achieves comparable accuracy to Online MBBM with shorter runtimes.

6 Discussion

We give the ﬁrst weak learning conditions and algorithm for online agnostic multiclass boosting. Our algorithm relies on a clean and simple reduction from boosting to online convex optimization. This fruitful connection allows us to go beyond the online agnostic setting and design multiclass boosting algorithms for all four regimes of statistical/online and agnostic/realizable learning. As future work, we leave it open to identify the correct weak learning condition, construct adaptive versions of our boosting algorithms, improve regret upper bounds/prove lower bounds, design agnostic boosting algorithms under bandit feedback, and study the impact of the choice of weak learner and OCO algorithm on the empirical performance of our OCO-based boosting algorithms.

Acknowledgements. AT acknowledges the support of NSF via grant IIS-2007055.

[1] N. Agarwal, N. Brukhim, E. Hazan, and Z. Lu. Boosting for control of dynamical systems. In

International Conference on Machine Learning, pages 96 103. PMLR, 2020.

[2] S. Ben-David, P. M. Long, and Y. Mansour. Agnostic boosting. In International Conference on

Computational Learning Theory, pages 507 516. Springer, 2001.

[3] A. Beygelzimer, E. Hazan, S. Kale, and H. Luo. Online gradient boosting. Advances in neural

information processing systems, 28, 2015.

[4] A. Beygelzimer, S. Kale, and H. Luo. Optimal and adaptive algorithms for online boosting. In

International Conference on Machine Learning, pages 2323 2331. PMLR, 2015.

[5] N. Brukhim and E. Hazan. Online boosting with bandit feedback. In Algorithmic Learning

Theory, pages 397 420. PMLR, 2021.

[6] N. Brukhim, X. Chen, E. Hazan, and S. Moran. Online agnostic boosting via regret minimization.

Advances in Neural Information Processing Systems, 33:644 654, 2020.

[7] N. Brukhim, E. Hazan, S. Moran, I. Mukherjee, and R. E. Schapire. Multiclass boosting and

the cost of weak learning. Advances in Neural Information Processing Systems, 34, 2021.

[8] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press,

USA, 2006. ISBN 0521841089.

[9] S.-T. Chen, H.-T. Lin, and C.-J. Lu. An online boosting algorithm with theoretical justiﬁcations.

ar Xiv preprint ar Xiv:1206.6422, 2012.

[10] S.-T. Chen, H.-T. Lin, and C.-J. Lu. Boosting with online binary learners for the multiclass

bandit problem. In International Conference on Machine Learning, pages 342 350. PMLR, 2014.

[11] Y. Chen and X. Ye. Projection onto a simplex. ar Xiv preprint ar Xiv:1101.6081, 2011.

[12] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the

erm principle. J. Mach. Learn. Res., 16(1):2377 2404, 2015.

[13] C. L. Du Bois. UCI network data repository, 2008. URL http://networkdata.ics.uci.

[14] G. Eibl and K.-P. Pfeiffer. Multiclass boosting for weak classiﬁers. Journal of Machine Learning

Research, 6(2), 2005.

[15] V. Feldman. Distribution-speciﬁc agnostic boosting. ar Xiv preprint ar Xiv:0909.2927, 2009.

[16] Y. Freund. Boosting a weak learning algorithm by majority. Information and computation, 121

(2):256 285, 1995.

[17] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an

application to boosting. Journal of computer and system sciences, 55(1):119 139, 1997.

[18] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and

Economic Behavior, 29(1-2):79 103, 1999.

[19] Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In icml, volume 96,

pages 148 156. Citeseer, 1996.

[20] Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal-Japanese Society

For Artiﬁcial Intelligence, 14(771-780):1612, 1999.

[21] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal

of Machine Learning Research, 4(May):101 117, 2003.

[22] T. Hastie, S. Rosset, J. Zhu, and H. Zou. Multi-class adaboost. Statistics and its Interface, 2(3):

349 360, 2009.

[23] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends R in

Optimization, 2(3-4):157 325, 2016.

[24] Y. H. Jung, J. Goetz, and A. Tewari. Online multiclass boosting. Advances in neural information

processing systems, 30, 2017.

[25] A. T. Kalai and R. A. Servedio. Boosting in the presence of noise. Journal of Computer and

System Sciences, 71(3):266 290, 2005.

[26] A. T. Kalai, Y. Mansour, and E. Verbin. On agnostic boosting and parity learning. In Proceedings

of the fortieth annual ACM symposium on Theory of computing, pages 629 638, 2008.

[27] V. Kanade and A. Kalai. Potential-based agnostic boosting. Advances in neural information

processing systems, 22, 2009.

[28] Y. Mansour and D. Mc Allester. Boosting using branching programs. Journal of Computer and

System Sciences, 64(1):103 112, 2002.

[29] J. Montiel, M. Halford, S. M. Mastelini, G. Bolmier, R. Sourty, R. Vaysse, A. Zouitine, H. M.

Gomes, J. Read, T. Abdessalem, and A. Bifet. River: machine learning for streaming data in python, 2020.

[30] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. Journal of Machine Learning

Research, 2013.

[31] N. C. Oza and S. J. Russell. Online bagging and boosting. In International Workshop on

Artiﬁcial Intelligence and Statistics, pages 229 236. PMLR, 2001.

[32] R. E. Schapire. The strength of weak learnability. Machine learning, 5(2):197 227, 1990.

[33] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. Kybernetes, 2013.

[34] M. Sion. On general minimax theorems. Paciﬁc Journal of mathematics, 8(1):171 176, 1958.

[35] M. Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In

Proceedings of the 20th international conference on machine learning (icml-03), pages 928 936, 2003.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reﬂect the paper s

contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] See Section 6 where we identify

future work. (c) Did you discuss any potential negative societal impacts of your work? [No] Due to the

theoretical/general nature of the work, we do not foresee any major/direct ethical or societal consequences of the research we presented. (d) Have you read the ethics review guidelines and ensured that your paper conforms to

them? [Yes]

2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section

3.2, Section 4, Appendix A, Appendix B, and Appendix C (b) Did you include complete proofs of all theoretical results? [Yes] See Section 3.2.1,

Appendix A, Appendix B, Appendix D, and Appendix E

3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main exper-

imental results (either in the supplemental material or as a URL)? [Yes] In order to preserve anonymity, we will include a URL to a github repository if accepted

(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they

were chosen)? [Yes] See Section 5 and Appendix F (c) Did you report error bars (e.g., with respect to the random seed after running experi-

ments multiple times)? [Yes] See Appendix F (d) Did you include the total amount of compute and the type of resources used (e.g., type

of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix F 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [Yes] See Section 5 (b) Did you mention the license of the assets? [Yes] See Appendix F

(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]

(d) Did you discuss whether and how consent was obtained from people whose data you re

using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identiﬁable

information or offensive content? [N/A] None of the datasets we are using contain information about people. 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if

applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review

Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount

spent on participant compensation? [N/A]