# online_multiclass_boosting__e893e184.pdf

Online Multiclass Boosting

Young Hun Jung Jack Goetz Department of Statistics University of Michigan Ann Arbor, MI 48109 {yhjung, jrgoetz, tewaria}@umich.edu

Ambuj Tewari

Recent work has extended the theoretical analysis of boosting algorithms to multiclass problems and to online settings. However, the multiclass extension is in the batch setting and the online extensions only consider binary classiﬁcation. We ﬁll this gap in the literature by deﬁning, and justifying, a weak learning condition for online multiclass boosting. This condition leads to an optimal boosting algorithm that requires the minimal number of weak learners to achieve a certain accuracy. Additionally, we propose an adaptive algorithm which is near optimal and enjoys an excellent performance on real data due to its adaptive property.

1 Introduction

Boosting methods are a ensemble learning methods that aggregate several (not necessarily) weak learners to build a stronger learner. When used to aggregate reasonably strong learners, boosting has been shown to produce results competitive with other state-of-the-art methods (e.g., Korytkowski et al. [1], Zhang and Wang [2]). Until recently theoretical development in this area has been focused on batch binary settings where the learner can observe the entire training set at once, and the labels are restricted to be binary (cf. Schapire and Freund [3]). In the past few years, progress has been made to extend the theory and algorithms to more general settings.

Dealing with multiclass classiﬁcation turned out to be more subtle than initially expected. Mukherjee and Schapire [4] unify several different proposals made earlier in the literature and provide a general framework for multiclass boosting. They state their weak learning conditions in terms of cost matrices that have to satisfy certain restrictions: for example, labeling with the ground truth should have less cost than labeling with some other labels. A weak learning condition, just like the binary condition, states that the performance of a learner, now judged using a cost matrix, should be better than a random guessing baseline. One particular condition they call the edge-over-random condition, proves to be sufﬁcient for boostability. The edge-over-random condition will also ﬁgure prominently in this paper. They also consider a necessary and sufﬁcient condition for boostability but it turns out to be computationally intractable to be used in practice.

A recent trend in modern machine learning is to train learners in an online setting where the instances come sequentially and the learner has to make predictions instantly. Oza [5] initially proposed an online boosting algorithm that has accuracy comparable with the batch version, but it took several years to design an algorithm with theoretical justiﬁcation (Chen et al. [6]). Beygelzimer et al. [7] achieved a breakthrough by proposing an optimal algorithm in online binary settings and an adaptive algorithm that works quite well in practice. These theories in online binary boosting have led to several extensions. For example, Chen et al. [8] combine one vs all method with binary boosting algorithms to tackle online multiclass problems with bandit feedback, and Hu et al. [9] build a theory of boosting in regression setting.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we combine the insights and techniques of Mukherjee and Schapire [4] and Beygelzimer et al. [7] to provide a framework for online multiclass boosting. The cost matrix framework from the former work is adopted to propose an online weak learning condition that deﬁnes how well a learner can perform over a random guess (Deﬁnition 1). We show this condition is naturally derived from its batch setting counterpart. From this weak learning condition, a boosting algorithm (Algorithm 1) is proposed which is theoretically optimal in that it requires the minimal number of learners and sample complexity to attain a speciﬁed level of accuracy. We also develop an adaptive algorithm (Algorithm 2) which allows learners to have variable strengths. This algorithm is theoretically less efﬁcient than the optimal one, but the experimental results show that it is quite comparable and sometimes even better due to its adaptive property. Both algorithms not only possess theoretical proofs of mistake bounds, but also demonstrate superior performance over preexisting methods.

2 Preliminaries

We ﬁrst describe the basic setup for online boosting. While in the batch setting, an additional weak learner is trained at every iteration, in the online setting, the algorithm starts with a ﬁxed count of N weak learners and a booster which manages the weak learners. There are k possible labels [k] := {1, , k} and k is known to the learners. At each iteration t = 1, , T, an adversary picks a labeled example (xt, yt) X [k], where X is some domain, and reveals xt to the booster. Once the booster observes the unlabeled data xt, it gathers the weak learners predictions and makes a ﬁnal prediction. Throughout this paper, index i takes values from 1 to N; t from 1 to T; and l from 1 to k.

We utilize the cost matrix framework, ﬁrst proposed by Mukherjee and Schapire [4], to develop multiclass boosting algorithms. This is a key ingredient in the multiclass extension as it enables different penalization for each pair of correct label and prediction, and we further develop this framework to suit the online setting. The booster sequentially computes cost matrices {Ci t Rk k | i = 1, , N}, sends (xt, Ci t) to the ith weak learner WLi, and gets its prediction li t [k]. Here the cost matrix Ci t plays a role of loss function in that WLi tries to minimize the cumulative cost P

t Ci t[yt, li t]. As the booster wants each learner to predict the correct label, it wants to set the diagonal entries of Ci t to be minimal among its row. At this stage, the true label yt is not revealed yet, but the previous weak learners predictions can affect the computation of the cost matrix for the next learner. Given a matrix C, the (i, j)th entry will be denoted by C[i, j], and ith row vector by C[i].

Once all the learners make predictions, the booster makes the ﬁnal prediction ˆyt by majority votes. The booster can either take simple majority votes or weighted ones. In fact for the adaptive algorithm, we will allow weighted votes so that the booster can assign more weights on well-performing learners. The weight for WLi at iteration t will be denoted by αi t. After observing the booster s ﬁnal decision, the adversary reveals the true label yt, and the booster suffers 0-1 loss 1(ˆyt = yt). The booster also shares the true label to the weak learners so that they can train on this data point.

Two main issues have to be resolved to design a good boosting algorithm. First, we need to design the booster s strategy for producing cost matrices. Second, we need to quantify weak learner s ability to reduce the cumulative cost PT t=1 Ci t[yt, li t]. The ﬁrst issue will be resolved by introducing potential functions, which will be thoroughly discussed in Section 3.1. For the second issue, we introduce our online weak learning condition, a generalization of the weak learning assumption in Beygelzimer et al. [7], stating that for any adaptively given sequence of cost matrices, weak learners can produce predictions whose cumulative cost is less than that incurred by random guessing. The online weak learning condition will be discussed in the following section. For the analysis of the adaptive algorithm, we use empirical edges instead of the online weak learning condition.

2.1 Online weak learning condition

In this section, we propose an online weak learning condition that states the weak learners are better than a random guess. We ﬁrst deﬁne a baseline condition that is better than a random guess. Let [k] denote a family of distributions over [k] and ul γ [k] be a uniform distribution that puts γ more weight on the label l. For example, u1 γ = ( 1 γ

k ). For a given sequence of examples {(xt, yt) | t = 1, , T}, Uγ RT k consists of rows uyt γ . Then we restrict the booster s

choice of cost matrices to

Ceor 1 := {C Rk k | l, r [k], C[l, l] = 0, C[l, r] 0, and ||C[l]||1 = 1}.

Note that diagonal entries are minimal among the row, and Ceor 1 also has a normalization constraint. A broader choice of cost matrices is allowed if one can assign importance weights on observations, which is possible for various learners. Even if the learner does not take the importance weight as an input, we can achieve a similar effect by sending to the learner an instance with probability that is proportional to its weight. Interested readers can refer Beygelzimer et al. [7, Lemma 1]. From now on, we will assume that our weak learners can take weight wt as an input.

We are ready to present our online weak learning condition. This condition is in fact naturally derived from the batch setting counterpart that is well studied by Mukherjee and Schapire [4]. The link is thoroughly discussed in Appendix A. For the scaling issue, we assume the weights wt lie in [0, 1]. Deﬁnition 1. (Online multiclass weak learning condition) For parameters γ, δ (0, 1), and S > 0, a pair of online learner and an adversary is said to satisfy online weak learning condition with parameters δ, γ, and S if for any sample length T, any adaptive sequence of labeled examples, and for any adaptively chosen series of pairs of weight and cost matrix {(wt, Ct) [0, 1] Ceor 1 | t = 1, , T}, the learner can generate predictions ˆyt such that with probability at least 1 δ,

t=1 wt Ct[yt, ˆyt] C U γ + S = 1 γ

k ||w||1 + S, (1)

where C RT k consists of rows of wt Ct[yt] and A B denotes the Frobenius inner product Tr(AB ). w = (w1, , w T ) and the last equality holds due to the normalized condition on Ceor 1 . γ is called an edge, and S an excess loss. Remark. Notice that this condition is imposed on a pair of learner and adversary instead of solely on a learner. This is because no learner can satisfy this condition if the adversary draws samples in a completely adaptive manner. The probabilistic statement is necessary because many online algorithms predictions are not deterministic. The excess loss requirement is needed since an online learner cannot produce meaningful predictions before observing a sufﬁcient number of examples.

3 An optimal algorithm

In this section, we describe the booster s optimal strategy for designing cost matrices. We ﬁrst introduce a general theory without specifying the loss, and later investigate the asymptotic behavior of cumulative loss suffered by our algorithm under the speciﬁc 0-1 loss. We adopt the potential function framework from Mukherjee and Schapire [4] and extend it to the online setting. Potential functions help both in designing cost matrices and in proving the mistake bound of the algorithm.

3.1 A general online multiclass boost-by-majority (Online MBBM) algorithm

We will keep track of the weighted cumulative votes of the ﬁrst i weak learners for the sample xt by si t := Pi j=1 αj telj t, where αi t is the weight of WLi, li t is its prediction and ej is the jth standard basis vector. For the optimal algorithm, we assume that αi t = 1, i, t. In other words, the booster makes the ﬁnal decision by simple majority votes. Given a cumulative vote s Rk, suppose we have a loss function Lr(s) where r denotes the correct label. We call a loss function proper, if it is a decreasing function of s[r] and an increasing function of other coordinates (we alert the reader that proper loss has at least one other meaning in the literature). From now on, we will assume that our loss function is proper. A good example of proper loss is multiclass 0-1 loss:

Lr(s) := 1(max l =r s[l] s[r]). (2)

The purpose of the potential function φr i (s) is to estimate the booster s loss when there remain i learners until the ﬁnal decision and the current cumulative vote is s. More precisely, we want potential functions to satisfy the following conditions:

φr 0(s) = Lr(s), φr i+1(s) = El urγφr i (s + el). (3)

Algorithm 1 Online Multiclass Boost-by-Majority (Online MBBM)

1: for t = 1, , T do 2: Receive example xt 3: Set s0 t = 0 Rk

4: for i = 1, , N do 5: Set the normalized cost matrix Di t according to (5) and pass it to WLi

6: Get weak predictions li t = WLi(xt) and update si t = si 1 t + eli t 7: end for 8: Predict ˆyt := argmaxl s N t [l] and receive true label yt 9: for i = 1, , N do 10: Set wi[t] = Pk l=1[φyt N i(si 1 t + el) φyt N i(si 1 t + eyt)] 11: Pass training example with weight (xt, yt, wi[t]) to WLi

12: end for 13: end for

Readers should note that φr i (s) also inherits the proper property of the loss function, which can be shown by induction. The condition (3) can be loosened by replacing both equalities by inequalities , but in practice we usually use equalities.

Now we describe the booster s strategy for designing cost matrices. After observing xt, the booster sequentially sets a cost matrix Ci t for WLi, gets the weak learner s prediction li t and uses this in the computation of the next cost matrix Ci+1 t . Ultimately, booster wants to set

Ci t[r, l] = φr N i(si 1 t + el). (4)

However, this cost matrix does not satisfy the condition of Ceor 1 , and thus should be modiﬁed in order to utilize the weak learning condition. First to make the cost for the true label equal to 0, we subtract Ci t[r, r] from every element of Ci t[r]. Since the potential function is proper, our new cost matrix still has non-negative elements after the subtraction. We then normalize the row so that each row has ℓ1 norm equal to 1. In other words, we get new normalized cost matrix

Di t[r, l] = φr N i(si 1 t + el) φr N i(si 1 t + er) wi[t] , (5)

where wi[t] := Pk l=1 φr N i(si 1 t +el) φr N i(si 1 t +er) plays the role of weight. It is still possible that a row vector Ci t[r] is a zero vector so that normalization is impossible. In this case, we just leave it as a zero vector. Our weak learning condition (1) still works with cost matrices some of whose row vectors are zeros because however the learner predicts, it incurs no cost.

After deﬁning cost matrices, the rest of the algorithm is straightforward except we have to estimate ||wi|| to normalize the weight. This is necessary because the weak learning condition assumes the weights lying in [0, 1]. We cannot compute the exact value of ||wi|| until the last instance is revealed, which is ﬁne as we need this value only in proving the mistake bound. The estimate wi for ||wi|| requires to specify the loss, and we postpone the technical parts to Appendix B.2. Interested readers may directly refer Lemma 10 before proceeding. Once the learners generate predictions after observing cost matrices, the ﬁnal decision is made by simple majority votes. After the true label is revealed, the booster updates the weight and sends the labeled instance with weight to the weak learners. The pseudocode for the entire algorithm is depicted in Algorithm 1. The algorithm is named after Beygelzimer et al. [7, Online BBM], which is in fact Online MBBM with binary labels.

We present our ﬁrst main result regarding the mistake bound of general Online MBBM. The proof appears in Appendix B.1 where the main idea is adopted from Beygelzimer et al. [7, Lemma 3]. Theorem 2. (Cumulative loss bound for Online MBBM) Suppose weak learners and an adversary satisfy the online weak learning condition (1) with parameters δ, γ, and S. For any T and N satisfying δ 1

N , and any adaptive sequence of labeled examples generated by the adversary, the ﬁnal loss suffered by Online MBBM satisﬁes the following inequality with probability 1 Nδ:

t=1 Lyt(s N t ) φ1 N(0)T + S

i=1 wi . (6)

Here φ1 N(0) plays a role of asymptotic error rate and the second term determines the sample complexity. We will investigate the behavior of those terms under the 0-1 loss in the following section.

3.2 Mistake bound under 0-1 loss and its optimality

From now on, we will specify the loss to be multiclass 0-1 loss deﬁned in (2), which might be the most relevant measure in multiclass problems. To present a speciﬁc mistake bound, two terms in the RHS of (6) should be bounded. This requires an approximation of potentials, which is technical and postponed to Appendix B.2. Lemma 9 and 10 provide the bounds for those terms. We also mention another bound for the weight in the remark after Lemma 10 so that one can use whichever tighter. Combining the above lemmas with Theorem 2 gives the following corollary. The additional constraint on γ comes from Lemma 10.

Corollary 3. (0-1 loss bound of Online MBBM) Suppose weak learners and an adversary satisfy the online weak learning condition (1) with parameters δ, γ, and S, where γ < 1

2. For any T and N satisfying δ 1 N and any adaptive sequence of labeled examples generated by the adversary, Online MBBM can generate predictions ˆyt that satisfy the following inequality with probability 1 Nδ:

t=1 1(yt = ˆyt) (k 1)e γ2N

2 T + O(k5/2

Therefore in order to achieve error rate ϵ, it sufﬁces to use N = Θ( 1

ϵ ) weak learners, which

gives an excess loss bound of Θ( k5/2

Remark. Note that the above excess loss bound gives a sample complexity bound of Θ( k5/2

ϵγ S). If we use alternative weight bound to get k NS as an upper bound for the second term in (6), we end up having O(k NS). This will give an excess loss bound of Θ( k

We now provide lower bounds on the number of learners and sample complexity for arbitrary online boosting algorithms to evaluate the optimality of Online MBBM under 0-1 loss. In particular, we construct weak learners that satisfy the online weak learning condition (1) and have almost matching asymptotic error rate and excess loss compared to those of Online MBBM as in (7). Indeed we can prove that the number of learners and sample complexity of Online MBBM is optimal up to logarithmic factors, ignoring the inﬂuence of the number of classes k. Our bounds are possibly suboptimal up to polynomial factors in k, and the problem to ﬁll the gap remains open. The detailed proof and a discussion of the gap can be found in Appendix B.3. Our lower bound is a multiclass version of Beygelzimer et al. [7, Theorem 3].

Theorem 4. (Lower bounds for N and T) For any γ (0, 1

4), δ, ϵ (0, 1), and S k ln( 1

δ ) γ , there exists an adversary with a family of learners satisfying the online weak learning condition (1) with parameters δ, γ, and S, such that to achieve asymptotic error rate ϵ, an online boosting algorithm requires at least Ω( 1 k2γ2 ln 1

ϵ ) learners and a sample complexity of Ω( k

4 An adaptive algorithm

The online weak learning condition imposes minimal assumptions on the asymptotic accuracy of learners, and obviously it leads to a solid theory of online boosting. However, it has two main practical limitations. The ﬁrst is the difﬁculty of estimating the edge γ. Given a learner and an adversary, it is by no means a simple task to ﬁnd the maximum edge that satisﬁes (1). The second issue is that different learners may have different edges. Some learners may in fact be quite strong with signiﬁcant edges, while others are just slightly better than a random guess. In this case, Online MBBM has to pick the minimum edge as it assumes common γ for all weak learners. It is obviously inefﬁcient in that the booster underestimates the strong learners accuracy.

Our adaptive algorithm will discard the online weak learning condition to provide a more practical method. Empirical edges γ1, , γN (see Section 4.2 for the deﬁnition) are measured for the weak learners and are used to bound the number of mistakes made by the boosting algorithm.

4.1 Choice of loss function

Adaboost, proposed by Freund et al. [10], is arguably the most popular boosting algorithm in practice. It aims to minimize the exponential loss, and has many variants which use some other surrogate loss. The main reason of using a surrogate loss is ease of optimization; while 0-1 loss is not even continuous, most surrogate losses are convex. We adopt the use of a surrogate loss for the same reason, and throughout this section will discuss our choice of surrogate loss for the adaptive algorithm.

Exponential loss is a very strong candidate in that it provides a closed form for computing potential functions, which are used to design cost matrices (cf. Mukherjee and Schapire [4, Theorem 13]). One property of online setting, however, makes it unfavorable. Like Online MBBM, each data point will have a different weight depending on weak learners performance, and if the algorithm uses exponential loss, this weight will be an exponential function of difference in weighted cumulative votes. With this exponentially varying weights among samples, the algorithm might end up depending on very small portion of observed samples. This is undesirable because it is easier for the adversary to manipulate the sample sequence to perturb the learner.

To overcome exponentially varying weights, Beygelzimer et al. [7] use logistic loss in their adaptive algorithm. Logistic loss is more desirable in that its derivative is bounded and thus weights will be relatively smooth. For this reason, we will also use multiclass version of logistic loss:

l =r log(1 + exp(s[r] s[r])). (8)

We still need to compute potential functions from logistic loss in order to calculate cost matrices. Unfortunately, Mukherjee and Schapire [4] use a unique property of exponential loss to get a closed form for potential functions, which cannot be adopted to logistic loss. However, the optimal cost matrix induced from exponential loss has a very close connection with the gradient of the loss (cf. Mukherjee and Schapire [4, Lemma 22]). From this, we will design our cost matrices as following:

Ci t[r, l] :=

( 1 1+exp(si 1 t [r] si 1 t [l]) , if l = r

j =r 1 1+exp(si 1 t [r] si 1 t [j]) , if l = r. (9)

Readers should note that the row vector Ci t[r] is simply the gradient of Lr(si 1 t ). Also note that this matrix does not belong to Ceor 1 , but it does guarantee that the correct prediction gets the minimal cost.

The choice of logistic loss over exponential loss is somewhat subjective. The undesirable property of exponential loss does not necessarily mean that we cannot build an adaptive algorithm using this loss. In fact, we can slightly modify Algorithm 2 to develop algorithms using different surrogates (exponential loss and square hinge loss). However, their theoretical bounds are inferior to the one with logistic loss. Interested readers can refer Appendix D, but it assumes understanding of Algorithm 2.

4.2 Adaboost.OLM

Our work is a generalization of Adaboost.OL by Beygelzimer et al. [7], from which the name Adaboost.OLM comes with M standing for multiclass. We introduce a new concept of an expert. From N weak learners, we can produce N experts where expert i makes its prediction by weighted majority votes among the ﬁrst i learners. Unlike Online MBBM, we allow varying weights αi t over the learners. As we are working with logistic loss, we want to minimize P t Lyt(si t) for each i, where the loss is given in (8). We want to alert the readers to note that even though the algorithm tries to minimize the cumulative surrogate loss, its performance is still evaluated by 0-1 loss. The surrogate loss only plays a role of a bridge that makes the algorithm adaptive.

We do not impose the online weak learning condition on weak learners, but instead just measure the

performance of WLi by γi :=

t Ci t[yt,li t] P

t Ci t[yt,yt]. This empirical edge will be used to bound the number of mistakes made by Adaboost.OLM. By deﬁnition of cost matrix, we can check

Ci t[yt, yt] Ci t[yt, l] Ci t[yt, yt], l [k],

from which we can prove 1 γi 1, i. If the online weak learning condition is met with edge γ, then one can show that γi γ with high probability when the sample size is sufﬁciently large.

Algorithm 2 Adaboost.OLM

1: Initialize: i, vi 1 = 1, αi 1 = 0 2: for t = 1, , T do 3: Receive example xt 4: Set s0 t = 0 Rk

5: for i = 1, , N do 6: Compute Ci t according to (9) and pass it to WLi

7: Set li t = WLi(xt) and si t = si 1 t + αi teli t 8: Set ˆyi t = argmaxl si t[l], the prediction of expert i 9: end for 10: Randomly draw it with P(it = i) vi t 11: Predict ˆyt = ˆyit t and receive the true label yt 12: for i = 1, , N do 13: Set αi t+1 = Π(αi t ηtf i t (αi t)) using (10) and ηt = 2

14: Set wi[t] = Ci t[yt,yt]

k 1 and pass (xt, yt, wi[t]) to WLi

15: Set vi t+1 = vi t exp( 1(yt = ˆyi t)) 16: end for 17: end for

Unlike the optimal algorithm, we cannot show the last expert that utilizes all the learners has the best accuracy. However, we can show at least one expert has a good predicting power. Therefore we will use classical Hedge algorithm (Littlestone and Warmuth [11] and Freund and Schapire [12]) to randomly choose an expert at each iteration with adaptive probability weight depending on each expert s prediction history.

Finally we need to address how to set the weight αi t for each weak learner. As our algorithm tries to minimize the cumulative logistic loss, we want to set αi t to minimize P

t Lyt(si 1 t + αi teli t). This is again a classical topic in online learning, and we will use online gradient descent, proposed by Zinkevich [13]. By letting, f i t(α) := Lyt(si 1 t + αeli t), we need an online algorithm ensuring P

t f i t(αi t) minα F P

t f i t(α) + Ri(T) where F is a feasible set to be speciﬁed later, and Ri(T) is a regret that is sublinear in T. To apply Zinkevich [13, Theorem 1], we need f i t to be convex and F to be compact. The ﬁrst assumption is met by our choice of logistic loss, and for the second assumption, we will set F = [ 2, 2]. There is no harm to restrict the choice of αi t by F because we can always scale the weights without affecting the result of weighted majority votes.

By taking derivatives, we get

f i t (α) =

( 1 1+exp(si 1 t [yt] si 1 t [li t] α) , if li t = yt P

j =yt 1 1+exp(si 1 t [j]+α si 1 t [yt]) , if li t = yt. (10)

This provides |f i t (α)| k 1. Now let Π( ) represent a projection onto F: Π( ) := max{ 2, min{2, }}. By setting αi t+1 = Π(αi t ηtf i t (αi t)) where ηt = 2

T. Readers should note that any learning rate of the form ηt = c

t would work, but our choice is optimized to ensure the minimal regret.

The pseudocode for Adaboost.OLM is presented in Algorithm 2. In fact, if we put k = 2, Adaboost.OLM has the same structure with Adaboost.OL. As in Online MBBM, the booster also needs to pass the weight along with labeled instance. According to (9), it can be inferred that the weight is proportional to Ci t[yt, yt].

4.3 Mistake bound and comparison to the optimal algorithm

Now we present our second main result that provides a mistake bound of Adaboost.OLM. The main structure of the proof is adopted from Beygelzimer et al. [7, Theorem 4] but in a generalized cost matrix framework. The proof appears in Appendix C.

Theorem 5. (Mistake bound of Adaboost.OLM) For any T and N, with probability 1 δ, the number of mistakes made by Adaboost.OLM satisﬁes the following inequality:

t=1 1(yt = ˆyt) 8(k 1) PN i=1 γ2 i T + O( k N 2 PN i=1 γ2 i ),

where O notation suppresses dependence on log 1

Remark. Note that this theorem naturally implies Beygelzimer et al. [7, Theorem 4]. The difference in coefﬁcients is due to different scaling of γi. In fact, their γi ranges from [ 1

Now that we have established a mistake bound, it is worthwhile to compare the bound with the optimal boosting algorithm. Suppose the weak learners satisfy the weak learning condition (1) with edge γ. For simplicity, we will ignore the excess loss S. As we have γi =

t Ci t[yt,li t] P

t Ci t[yt,yt] γ

with high probability, the mistake bound becomes 8(k 1)

γ2N T + O( k N

γ2 ). In order to achieve error

rate ϵ, Adaboost.OLM requires N 8(k 1)

ϵγ2 learners and T = Ω( k2

ϵ2γ4 ) sample size. Note that

Online MBBM requires N = Ω( 1

ϵ ) and T = min{ Ω( k5/2

ϵγ2 )}. Adaboost.OLM is obviously suboptimal, but due to its adaptive feature, its performance on real data is quite comparable to that by Online MBBM.

5 Experiments

We compare the new algorithms to existing ones for online boosting on several UCI data sets, each with k classes1. Table 1 contains some highlights, with additional results and experimental details in the Appendix E. Here we show both the average accuracy on the ﬁnal 20% of each data set, as well as the average run time for each algorithm. Best decision tree gives the performance of the best of 100 online decision trees ﬁt using the VFDT algorithm in Domingos and Hulten [14], which were used as the weak learners in all other algorithms, and Online Boosting is an algorithm taken from Oza [5]. Both provide a baseline for comparison with the new Adaboost.OLM and Online MBBM algorithms. Best MBBM takes the best result from running the Online MBBM with ﬁve different values of the edge parameter γ.

Despite being theoretically weaker, Adaboost.OLM often demonstrates similar accuracy and sometimes outperforms Best MBBM, which exempliﬁes the power of adaptivity in practice. This power comes from the ability to use diverse learners efﬁciently, instead of being limited by the strength of the weakest learner. Online MBBM suffers from high computational cost, as well as the difﬁculty of choosing the correct value of γ, which in general is unknown, but when the correct value of γ is used it peforms very well. Finally in all cases Adaboost.OLM and Online MBBM algorithms outperform both the best tree and the preexisting Online Boosting algorithm, while also enjoying theoretical accuracy bounds.

Table 1: Comparison of algorithm accuracy on ﬁnal 20% of data set and run time in seconds. Best accuracy on a data set reported in bold.

Data sets k Best decision tree Online Boosting Adaboost.OLM Best MBBM

Balance 3 0.768 8 0.772 19 0.754 20 0.821 42 Mice 8 0.608 105 0.399 263 0.561 416 0.695 2173 Cars 4 0.924 39 0.914 27 0.930 59 0.914 56 Mushroom 2 0.999 241 1.000 169 1.000 355 1.000 325 Nursery 4 0.953 526 0.941 302 0.966 735 0.969 1510 ISOLET 26 0.515 470 0.149 1497 0.521 2422 0.635 64707 Movement 5 0.915 1960 0.870 3437 0.962 5072 0.988 18676

1Codes are available at https://github.com/yhjung88/Online Boosting With VFDT

Acknowledgments

We acknowledge the support of NSF under grants CAREER IIS-1452099 and CIF-1422157.

[1] Marcin Korytkowski, Leszek Rutkowski, and Rafał Scherer. Fast image classiﬁcation by boosting fuzzy classiﬁers. Information Sciences, 327:175 182, 2016.

[2] Xiao-Lei Zhang and De Liang Wang. Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection. In INTERSPEECH, pages 1534 1538, 2014.

[3] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT press, 2012.

[4] Indraneel Mukherjee and Robert E Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14(Feb):437 497, 2013.

[5] Nikunj C Oza. Online bagging and boosting. In 2005 IEEE international conference on systems, man and cybernetics, volume 3, pages 2340 2345. IEEE, 2005.

[6] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justiﬁcations. ICML, 2012.

[7] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. ICML, 2015.

[8] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In Proceedings of The 31st ICML, pages 342 350, 2014.

[9] Hanzhang Hu, Wen Sun, Arun Venkatraman, Martial Hebert, and Andrew Bagnell. Gradient boosting on stochastic data streams. In Artiﬁcial Intelligence and Statistics, pages 595 603, 2017.

[10] Yoav Freund, Robert Schapire, and N Abe. A short introduction to boosting. Journal-Japanese Society For Artiﬁcial Intelligence, 14(771-780):1612, 1999.

[11] Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. In Foundations of Computer Science, 1989., 30th Annual Symposium on, pages 256 261. IEEE, 1989.

[12] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23 37. Springer, 1995.

[13] Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In Proceedings of 20th ICML, 2003.

[14] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 71 80. ACM, 2000.

[15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the erm principle. In COLT, pages 207 232, 2011.

[16] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4):285 318, 1988.

[17] Volodimir G Vovk. Aggregating strategies. In Proc. Third Workshop on Computational Learning Theory, pages 371 383. Morgan Kaufmann, 1990.

[18] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.

[19] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016.

[20] Robert E Schapire. Drifting games. Machine Learning, 43(3):265 291, 2001.

[21] Eric V Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404 412, 1977.

[22] C.L. Blake and C.J. Merz. UCI machine learning repository, 1998. URL http://archive.ics.uci. edu/ml.

[23] Cios KJ Higuera C, Gardiner KJ. Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PLo S ONE, 2015.

[24] Wallace Ugulino, Débora Cardador, Katia Vega, Eduardo Velloso, Ruy Milidiú, and Hugo Fuks. Wearable computing: Accelerometers data classiﬁcation of body postures and movements. In Advances in Artiﬁcial Intelligence-SBIA 2012, pages 52 61. Springer, 2012.