
Self-Paced Boost Learning for Classification

Te Pi¹, Xi Li¹, Zhongfei Zhang¹, Deyu Meng², Fei Wu¹, Jun Xiao¹, Yueting Zhuang¹

¹Zhejiang University, Hangzhou, China; ²Xi'an Jiaotong University, Xi'an, China

Effectiveness and robustness are two essential aspects of supervised learning studies. For effective learning, ensemble methods are developed to build a strong, effective model from an ensemble of weak models. For robust learning, self-paced learning (SPL) is proposed to learn at a self-controlled pace from easy samples to complex ones. Motivated by simultaneously enhancing the learning effectiveness and robustness, we propose a unified framework, Self-Paced Boost Learning (SPBL). With an adaptive from-easy-to-hard pace in the boosting process, SPBL asymptotically guides the model to focus more on the insufficiently learned samples with higher reliability. Via a max-margin boosting optimization with self-paced sample selection, SPBL is capable of capturing the intrinsic inter-class discriminative patterns while ensuring the reliability of the samples involved in learning. We formulate SPBL as a fully-corrective optimization for classification. The experiments on several real-world datasets show the superiority of SPBL in terms of both effectiveness and robustness.

1 Introduction

Effectiveness and robustness are two essential principles of generic supervised learning studies. Effective learning focuses on the discriminativeness of the model, so that it captures the intrinsic data patterns needed for accurate prediction. Robust learning typically relies on distinguishing reliable data from noisy, confusing data, such that the learning is guided by the reliable samples and less influenced by the confusing ones. The efforts of most approaches for learning from data generally come down to these two aspects.

For effective learning, the key issue lies in the complex distributions of data with local nonlinear structures. To effectively explore these patterns, the boosting scheme [Zhou, 2012] is developed. Generally, boosting methods build a strong ensemble model as a combination of multiple weak models, where each weak model focuses on the samples mispredicted by the previous model ensembles. Through this, boosting performs an asymptotic piecewise approximation to the data distributions to fit each sample sufficiently. On the other hand, since only the mispredicted samples are considered in each step, boosting is sensitive to noisy and confusing data, which greatly affect the optimization, especially at the later learning stage. Figure 1 shows a toy example of the decision boundary of a boosting classifier for synthetic data with confusing data points. The boosting scheme is very discriminative but lacks learning robustness.

Figure 1: Decision boundaries of boosting and SPBL classifiers and the ground truth for synthetic data with confusing/noisy points. The decision boundary of SPBL is more robust and effective (closer to the ground truth) than that of boosting, since SPBL focuses on the misclassified samples with high reliability based on a self-paced boosting optimization.

{peterpite, xilizju, zhongfei}@zju.edu.cn; dymeng@mail.xjtu.edu.cn; {wufei, junx, yzhuang}@cs.zju.edu.cn. Corresponding authors

For robust learning, the goal is to relieve the influence of the noisy and confusing data. The confusing data generally correspond to the highly nonlinear local patterns hardly learnable for the model space, and the noisy ones are the outliers that should not be learned. Typically, the learning robustness relies on a sample selection to distinguish the reliable samples from the confusing ones. The recently studied self-paced learning (SPL) [Zhao et al., 2015] is such a representative effort. SPL is a learning paradigm that dynamically incorporates the samples into learning from easy ones to complex ones. With a self-controlled sample selection embedded in learning, the model is calibrated at a pace adaptively controlled by what it has already learned. Thus, SPL smoothly guides the learning to emphasize the patterns of the reliable discriminative data rather than those of the noisy and confusing ones, and obtains the learning robustness.

Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Based on the above analysis, we notice that boosting and SPL are consistent in basic principles and complementary in methodology. For consistency, both schemes are based on an asymptotic learning process from a weak/simple state to a strong/hard state. On the other hand, boosting and SPL are complementary in three aspects. First, the two schemes respectively address the two essential tasks of machine learning, effectiveness and robustness. Second, while boosting imposes a negative suppression on the insufficiently learned samples, SPL positively encourages the easily learned ones at a controlled pace. Third, boosting focuses more on the inter-class margins by striving to fit each sample, while SPL is more concerned with the intra-class variations by dynamically selecting easy samples with different patterns. Thus, boosting tends to reflect the local patterns and is more sensitive to the noisy data, while SPL tends to explore the data smoothly with more robustness. As a result, the two learning schemes stand to benefit from each other.

To simultaneously enhance the learning effectiveness and robustness, in this paper, we propose a unified framework, Self-Paced Boost Learning (SPBL). With an adaptive pace from easy to hard in the boosting optimization, SPBL asymptotically guides the learning to focus on the insufficiently learned samples with high reliability. Through this, SPBL learns a model in both directions of positive encouragement (on reliable samples) and negative suppression (on misclassified samples), and is capable of capturing the intrinsic inter-class discriminative patterns while ensuring the reliability of the samples involved in learning. Figure 1 further shows the decision boundary of SPBL on the toy dataset, which demonstrates its robustness and effectiveness.

We formulate SPBL as a fully-corrective optimization for the classification problem. Note that SPBL is a general framework for supervised learning and could be formulated for other supervised applications. The contributions of this paper are summarized as follows:

1. We propose a unified learning framework SPBL that learns in a joint manner from weak models to a strong model and from easy samples to complex ones. To the best of our knowledge, this is the first work that reveals and utilizes the association of boosting and SPL to simultaneously enhance the effectiveness and the robustness for supervised learning.

2. We formulate SPBL as a fully-corrective max-margin boosting optimization with self-paced sample selection for the classification task.

2 Related Work

We review the literature from the aspects of boost learning and self-paced learning.

Boosting is a family of supervised ensemble learning approaches which convert weak learners to strong ones [Zhou, 2012]. The boosting methods construct a strong (highly accurate) model by iteratively learning and combining many weak, inaccurate models, where each weak model focuses on the samples mispredicted by the previous models. The main variation among different boosting methods is their ways of weighting training samples and weak learners. Examples of boosting methods include AdaBoost [Freund and Schapire, 1997], SoftBoost [Rätsch et al., 2007], TotalBoost [Warmuth et al., 2006], LPBoost [Demiriz et al., 2002], LogitBoost [Friedman et al., 2000], and MadaBoost [Domingo and Watanabe, 2000]. Boosting methods are applied in extensive applications, such as multi-class classification [Shen et al., 2012b; Zhu et al., 2009], regression [Duffy and Helmbold, 2002], metric learning [Shen et al., 2012a], and statistical modeling [Tutz and Binder, 2006; Mayr et al., 2014]. The effectiveness of boosting lies in its piecewise approximation of a nonlinear decision function to sufficiently fit the data patterns [Schapire and Freund, 2012]. However, [Long and Servedio, 2010] indicates that many boosting methods cannot withstand random classification noise.

First proposed by [Kumar et al., 2010], self-paced learning is inspired by the learning process of humans, which gradually incorporates the training samples into learning from easy ones to complex ones. Different from curriculum learning [Bengio et al., 2009], which learns the data in a predefined order based on prior knowledge, SPL learns the training data in an easy-to-hard order that is dynamically determined by the feedback of the learner itself, and was initially developed to avoid bad local minima. SPL is applied in different applications, such as image segmentation [Kumar et al., 2011], multimedia reranking [Jiang et al., 2014a], matrix factorization [Zhao et al., 2015], and multiple instance learning [Zhang et al., 2015]. Variants of SPL are also developed, such as self-paced curriculum learning [Jiang et al., 2015], and SPL with diversity [Jiang et al., 2014b]. Furthermore, [Meng and Zhao, 2015] provides a theoretical analysis of the robustness of SPL, which reveals the consistency of SPL with non-convex regularization. Such regularization is upper-bounded to restrict the contributions of noisy examples to the objective, and thus enhances the learning robustness.

3 Self-Paced Boost Learning

3.1 Problem Formulation

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of $n$ multi-class training samples, where $x_i \in \mathbb{R}^d$ is the feature of sample $i$, $y_i \in \{1, 2, \dots, C\}$ is the class label of $x_i$, and $C$ is the number of classes. Based on the standard supervised learning scheme, a classification model lies in learning a score function $F_r(\cdot): \mathbb{R}^d \to \mathbb{R}$ for each class, with which the prediction is made:

$$\tilde{y}(x) = \arg\max_{r \in \{1,\dots,C\}} F_r(x; \Theta), \tag{1}$$

where $F_r(x; \Theta)$ serves as the confidence score of classifying sample $x$ to class $r$, parameterized by $\Theta$. Following the max-margin formulation, the general objective function for multi-class classification is given by:

$$\min_{\Theta} \sum_{i=1}^{n} \sum_{r=1}^{C} L(\rho_{ir}) + \nu R(\Theta) \tag{2}$$

$$\text{s.t. } \forall i, r,\ \rho_{ir} = F_{y_i}(x_i; \Theta) - F_r(x_i; \Theta),$$

where $\rho_{ir}$ is the score margin of $x_i$ between its ground truth class $y_i$ and class $r$; $L: \mathbb{R} \to \mathbb{R}^{+}$ is a loss function; $R(\Theta)$ is a regularization for $\Theta$; $\nu > 0$ is a trade-off hyperparameter. Generally, the loss function $L(\cdot)$ should be convex as a convex surrogate of the 0-1 loss, and be monotonically decreasing for a large margin. The regularization $R(\Theta)$ is introduced to impose prior constraints on $\Theta$ to relieve overfitting.
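The prediction rule of Eq. (1) and the margins used in Eq. (2) can be illustrated with a minimal Python sketch. The score values and the helper names `predict` and `margins` are hypothetical and chosen only for illustration; classes are 1-based as in the text.

```python
def predict(scores):
    """Return the 1-based class index with the highest score (Eq. 1)."""
    return max(range(len(scores)), key=lambda r: scores[r]) + 1


def margins(scores, y):
    """Score margins F_y - F_r for one sample with 1-based true label y."""
    return [scores[y - 1] - s for s in scores]


# Hypothetical scores F_r(x) for C = 3 classes and a sample of class 2.
scores = [0.2, 1.5, 0.9]
label = 2
pred = predict(scores)
rho = margins(scores, label)
```

A sample is correctly classified exactly when all of its margins against the other classes are positive, which is what a monotonically decreasing loss on the margin rewards.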

The two key issues of learning the classifier lie in an effective formulation of the score function $F_r(\cdot)$, and a robust formulation of the loss function $L(\cdot)$. For an effective modeling, we adopt the boosting strategy that learns the classifier $F_r(\cdot)$ from weak models to a strong model. The effectiveness of boosting for classification lies in its asymptotic piecewise approximation for a nonlinear decision function to sufficiently fit the underlying data distributions. Specifically, a strong classifier $F_r(\cdot)$ is formulated as an ensemble of weak classifiers $\{h_j(\cdot) \in \mathcal{H}\}_{j=1}^{k}$ in the space of weak models $\mathcal{H}$:

$$F_r(x; W) = \sum_{j=1}^{k} w_{rj} h_j(x), \quad r = 1, \dots, C, \tag{3}$$

where each $h_j(\cdot): \mathbb{R}^d \to \{0, 1\}$ is a binary weak classifier; $w_{rj} \ge 0$ is the weight parameter to be learned. Here $\Theta$ is specified as the weight matrix $W$, defined as $W = [w_1, \dots, w_C] \in \mathbb{R}^{k \times C}$ with each $w_r = [w_{r1}, \dots, w_{rk}]^T$.
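As a small sketch of the ensemble in Eq. (3), the class scores of one sample are the weighted sums of the weak classifier responses. The toy weights and responses below are hypothetical, and the function name `ensemble_scores` is an assumption made for this illustration.

```python
def ensemble_scores(h_row, W):
    """Compute F_r(x) = sum_j w_rj * h_j(x) for every class r.

    h_row: responses [h_1(x), ..., h_k(x)] of the k weak classifiers.
    W:     k x C nested list; column r holds the weights of class r.
    """
    k, num_classes = len(W), len(W[0])
    return [sum(W[j][r] * h_row[j] for j in range(k)) for r in range(num_classes)]


# Hypothetical toy values: k = 3 weak classifiers, C = 2 classes.
W = [[0.5, 0.0],
     [0.0, 1.0],
     [0.2, 0.3]]
h_row = [1, 0, 1]  # binary responses h_j(x)
F = ensemble_scores(h_row, W)
```

Stacking the response rows of all samples into a matrix gives the same computation as a single matrix product of the responses with W.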

On the other hand, the learning robustness relies on the formulation of the loss function $L(\cdot)$ to relieve the influence of noisy and confusing data. Instead of directly learning from the whole data batch, we aim to guide the boosting model to learn asymptotically from the easy/faithful samples to the complex/confusing ones at a smooth pace. Therefore, inspired by the self-paced learning (SPL) scheme [Kumar et al., 2010], we reformulate the boosting model with a self-paced loss formulation, and propose a unified framework, Self-Paced Boost Learning (SPBL).

The general objective of SPBL is formulated as:

$$\min_{W, v} \sum_{i=1}^{n} v_i \sum_{r=1}^{C} L(\rho_{ir}) + \sum_{i=1}^{n} g(v_i; \lambda) + \nu R(W) \tag{4}$$

$$\text{s.t. } \forall i, r,\ \rho_{ir} = H_{i:} w_{y_i} - H_{i:} w_r;\quad W \ge 0;\quad v \in [0, 1]^n,$$

where $H \in \mathbb{R}^{n \times k}$ is the matrix of weak classifier responses for the training data with $[H_{ij}] = [h_j(x_i)]$, and $H_{i:}$ is the $i$-th row of $H$; $v_i \in [0, 1]$ is the SPL weight of sample $x_i$ that indicates its learning easiness; $g(\cdot; \lambda): [0, 1] \to \mathbb{R}$ is the SPL function that specifies how the samples are selected (the reweighting scheme of $v$), controlled by the SPL parameter $\lambda > 0$.

In Eq. (4), a weight $v_i$ is assigned to each sample as a measure of its easiness. These SPL weights are tuned based on the current losses of samples and the SPL function $g(v_i; \lambda)$ to dynamically select the easily learned samples that are more reliable and discriminative. With a joint optimization of sample selection (for $v$) and boost learning (for $W$), the SPBL model gradually incorporates the training samples into learning from easy ones to complex ones, so as to control the pace of boost learning by what the model has already learned.

For a specific formulation of Eq. (4), we specify $L(\cdot)$ as a smooth loss function, the logistic loss, for the convenience of derivation, and specify $R(W)$ as the $l_{2,1}$-norm to exploit the group structure of the weak classifier ensembles:

$$\min_{W, v} \sum_{i=1}^{n} v_i \sum_{r=1}^{C} \ln\left(1 + e^{-\rho_{ir}}\right) + \sum_{i=1}^{n} g(v_i; \lambda) + \nu \|W\|_{2,1} \tag{5}$$

$$\text{s.t. } \forall i, r,\ \rho_{ir} = H_{i:} w_{y_i} - H_{i:} w_r;\quad W \ge 0;\quad v \in [0, 1]^n,$$

where $\|W\|_{2,1} = \sum_{j=1}^{k} \|W_{j:}\|_2$. Note that the above objective is $l_{2,1}$-norm regularized to impose a group sparsity constraint on the rows of $W$. The optimization would encourage the columns of $W$ (each class) to select a relatively concentrated and shared subset of base classifiers, instead of learning them independently. We present the optimization of Eq. (5) and the specification of $g(v_i; \lambda)$ in the next subsection.
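To make the terms of Eq. (5) concrete, the following sketch evaluates the weighted logistic loss and the row-wise l2,1-norm on a toy problem. It is only an illustration under stated assumptions: the SPL term g is left out because it is specified later, the toy matrices are hypothetical, and the function names are not from the paper. The per-sample loss sums over all classes as the formula is written, so the true class contributes a constant ln 2.

```python
import math


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (group sparsity on rows)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def sample_loss(h_row, W, y):
    """Logistic loss sum_r ln(1 + exp(-rho_r)) with rho_r = F_y - F_r."""
    k, num_classes = len(W), len(W[0])
    F = [sum(W[j][r] * h_row[j] for j in range(k)) for r in range(num_classes)]
    return sum(math.log1p(math.exp(-(F[y - 1] - F[r]))) for r in range(num_classes))


def weighted_objective(H, W, y, v, nu):
    """SPL-weighted data term plus nu * ||W||_{2,1} (the g term is omitted)."""
    data_term = sum(v[i] * sample_loss(H[i], W, y[i]) for i in range(len(H)))
    return data_term + nu * l21_norm(W)


# Hypothetical toy problem: n = 2 samples, k = 2 weak classifiers, C = 2 classes.
H = [[1, 0], [0, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 2]
v = [1.0, 0.5]
objective = weighted_objective(H, W, y, v, nu=0.1)
```

A sample with a small SPL weight contributes little to the data term, which is how the weights v control which samples drive the update of W.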

3.2 Optimization

We use an alternating optimization to solve Eq. (5), which optimizes each of the two variables with the other one fixed in an alternating manner. For the optimization of $v$, we have

$$v_i^{*} = \arg\min_{v_i}\ v_i l_i + g(v_i; \lambda), \quad \text{s.t. } v_i \in [0, 1], \tag{6}$$

where $l_i = \sum_{r} \ln\left(1 + e^{-\rho_{ir}}\right)$ denotes the loss of sample $x_i$. To solve $v_i$ in Eq. (6), the self-paced function $g(v_i; \lambda)$ needs to be specified. [Jiang et al., 2014a] has summarized the general properties of a self-paced function in three aspects. First, $g(v_i; \lambda)$ is convex w.r.t. $v_i \in [0, 1]$ to guarantee the uniqueness of $v_i^{*}$. Second, $v_i^{*}(l_i; \lambda)$ is monotonically decreasing w.r.t. $l_i$, which guides the model to select easy samples with smaller losses in preference to complex samples with larger losses. Third, $v_i^{*}(l_i; \lambda)$ is monotonically increasing w.r.t. $\lambda$, which means that a larger $\lambda$ has a higher tolerance to the losses and can incorporate more complex samples. Several examples of the self-paced function have been listed in [Jiang et al., 2014a], such as hard weighting, linear weighting, and mixture weighting. We specify the self-paced function as the one for mixture weighting, due to its overall better performance in the experiments:

$$g(v_i; \lambda, \zeta) = -\zeta \ln\left(v_i + \zeta/\lambda\right), \quad \lambda, \zeta > 0, \tag{7}$$

where an extra SPL parameter $\zeta$ is introduced in addition to $\lambda$. The corresponding optimal $v_i^{*}$ is given by:

$$v_i^{*} = \begin{cases} 1, & l_i \le \zeta\lambda/(\zeta + \lambda) \\ 0, & l_i \ge \lambda \\ \zeta/l_i - \zeta/\lambda, & \text{otherwise} \end{cases} \tag{8}$$

which is a mixture of a hard 0-1 weighting and a soft real-valued weighting.
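The mixture weighting rule can be sketched in a few lines of Python. The sketch assumes the self-paced function g(v) = -ζ ln(v + ζ/λ): setting the derivative of v·l - ζ ln(v + ζ/λ) to zero gives v = ζ/l - ζ/λ, which is then clipped to [0, 1]. The function name `spl_weight` and the toy losses are hypothetical.

```python
def spl_weight(loss, lam, zeta):
    """Mixture weighting: hard 1/0 outside the band, soft weight inside it."""
    if loss <= zeta * lam / (zeta + lam):
        return 1.0                      # easy sample: fully selected
    if loss >= lam:
        return 0.0                      # too hard for the current pace: excluded
    return zeta / loss - zeta / lam     # in between: soft real-valued weight


# Hypothetical losses with lam = 1 and zeta = 1, so the soft band is (0.5, 1).
losses = [0.4, 0.8, 2.0]
weights = [spl_weight(l, lam=1.0, zeta=1.0) for l in losses]
```

The soft branch meets the two hard branches continuously: it equals 1 at the lower threshold and 0 at the upper one, and it grows when λ is increased.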

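For the optimization of $W$, we have

$$\min_{W} \sum_{i=1}^{n} v_i \sum_{r=1}^{C} \ln\left(1 + e^{-\rho_{ir}}\right) + \nu \|W\|_{2,1}, \tag{9}$$

$$\text{s.t. } \forall i, r,\ \rho_{ir} = H_{i:} w_{y_i} - H_{i:} w_r;\quad W \ge 0.$$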

To solve $W$ in Eq. (9), we adopt the column generation method [Demiriz et al., 2002], due to the potentially infinite number of candidate weak models in the $\mathcal{H}$ space. The column generation is applied in the dual space of $W$ to maintain a small set of weak models as the active dual constraints. This active set is augmented during optimization until it is sufficient to reach a solution within a tolerance threshold. We check the dual problem of Eq. (9):

$$\min_{U, Q} \sum_{i=1}^{n} \sum_{r=1}^{C} \left\{ U_{ir} \ln U_{ir} + (v_i - U_{ir}) \ln (v_i - U_{ir}) \right\} \tag{10}$$

$$\text{s.t. } \forall r,\ \sum_{i=1}^{n} \Big[ \delta_{r, y_i} \Big( \sum_{l} U_{il} \Big) - U_{ir} \Big] H_{i:} \le \nu Q_{:r}^{T};\quad \forall j,\ \|Q_{j:}\|_2 \le 1,$$

where $\delta_{r, y_i} = \mathbb{1}(r = y_i)$ is an indicator function. $U \in \mathbb{R}^{n \times C}$ is the Lagrangian multiplier of the equality constraints of Eq. (9), with a relation to the primal solution:

$$U_{ir} = \frac{v_i}{1 + e^{\rho_{ir}}}, \quad i = 1, \dots, n,\ r = 1, \dots, C. \tag{11}$$

The derivation of Eqs. (10) and (11) is similar to that of [Shen et al., 2012b].
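As an illustration of Eq. (11), read here as U_ir = v_i / (1 + exp(ρ_ir)), the following sketch computes the dual weights from given margins and SPL weights. The toy values and the name `dual_weights` are hypothetical.

```python
import math


def dual_weights(rho, v):
    """U_ir = v_i / (1 + exp(rho_ir)) for an n x C nested list of margins."""
    return [[v[i] / (1.0 + math.exp(r)) for r in rho[i]] for i in range(len(rho))]


# Hypothetical margins for n = 2 samples and C = 2 classes.
rho = [[0.0, 2.0],    # sample 1: well separated from class 2
       [0.0, -2.0]]   # sample 2: misclassified against class 2
v = [1.0, 0.5]        # sample 2 is considered less easy
U = dual_weights(rho, v)
```

Each entry lies between 0 and v_i. A negative margin pushes the entry toward v_i, while a small SPL weight caps it, so a misclassified sample only gets a large weight if it is also considered reliable.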

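Based on the column generation, the set of active weak classifiers is augmented by a weak model $\hat{h}(\cdot)$ that most violates the current dual constraints in Eq. (10):

$$\{\hat{h}(\cdot), \hat{r}\} = \arg\max_{h(\cdot) \in \mathcal{H},\ r} \sum_{i=1}^{n} \Big[ \delta_{r, y_i} \Big( \sum_{l} U_{il} \Big) - U_{ir} \Big] h(x_i). \tag{12}$$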

Then the optimization continues with the new set of active weak models, until the violation score (objective value of Eq. (12)) reaches a tolerance threshold.
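The selection in Eq. (12) can be sketched for the simplified case of a finite pool of candidate weak classifiers, each given by its responses on the training samples. This is an assumption made for illustration only, since the text considers a potentially infinite space of weak models; the names `violation_score`, `most_violating`, and the candidates `h_a`, `h_b` are hypothetical.

```python
def violation_score(responses, r, U, y):
    """Sum_i [delta(r, y_i) * sum_l U_il - U_ir] * h(x_i) for 1-based class r."""
    total = 0.0
    for i, h_xi in enumerate(responses):
        row_sum = sum(U[i]) if y[i] == r else 0.0
        total += (row_sum - U[i][r - 1]) * h_xi
    return total


def most_violating(candidates, U, y, num_classes):
    """Search a finite pool; return (score, name, class) with the largest score."""
    scored = [
        (violation_score(resp, r, U, y), name, r)
        for name, resp in candidates.items()
        for r in range(1, num_classes + 1)
    ]
    return max(scored, key=lambda item: item[0])


# Hypothetical toy problem: n = 2 samples, C = 2 classes, two candidates.
U = [[0.5, 0.5], [0.5, 0.5]]
y = [1, 2]
candidates = {"h_a": [1, 0], "h_b": [1, 1]}
best_score, best_name, best_class = most_violating(candidates, U, y, 2)
```

In this toy case the candidate that fires only on the class-1 sample scores highest for class 1, whereas the candidate that fires on every sample carries no discriminative signal and scores zero.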

Eq. (12) indicates that the matrix $U$ serves as the sample importance for learning a new weak classifier. Moreover, from Eq. (11) we see that $U$ gives high weights to not only the misclassified samples with small margins $\rho_{ir}$, but also the easy samples with high SPL weights $v_i$. That means that $U$ is actually a composite measure of learning insufficiency and learning easiness. Since the $v_i$ weights are set in the previous iteration, based on Eq. (12), the future weak learners will put emphasis on samples that are both insufficiently learned currently and easily learned previously. The interactions of the update of the model parameters are summarized in Figure 2. As a balance and trade-off between boosting and SPL, the proposed SPBL performs learning in both directions of positive encouragement (on reliability) and negative suppression (on learning insufficiency), and takes both effectiveness and robustness into account for learning a classification model.

Further, it is easily seen that the multi-class boosting classification model of [Shen et al., 2012b] is a special case of SPBL with all SPL weights $v$ fixed as $\mathbf{1}_n$. By replacing $v_i$ in Eq. (11) with 1, the matrix $U$ only emphasizes the misclassified samples with small margins, with the new weak classifier learned accordingly. Thus, the boosting method tends to be sensitive to the noisy and hardly learnable data by striving to correctly classify these samples. Therefore, the proposed SPBL is a robust generalization of boosting models.
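A short worked example may make this special case concrete. Reading Eq. (11) as $U_{ir} = v_i / (1 + e^{\rho_{ir}})$, fixing $v_i = 1$ leaves a weight that depends on the margin alone; the margin values $\pm 2$ below are chosen only for illustration.

```latex
\[
U_{ir}\Big|_{v_i = 1} = \frac{1}{1 + e^{\rho_{ir}}},
\qquad
\rho_{ir} = -2 \;\Rightarrow\; U_{ir} = \frac{1}{1 + e^{-2}} \approx 0.88,
\qquad
\rho_{ir} = 2 \;\Rightarrow\; U_{ir} = \frac{1}{1 + e^{2}} \approx 0.12 .
\]
```

A mislabeled sample with margin $-2$ therefore receives a weight of about 0.88 under plain boosting, while under SPBL the same sample receives $0.88\,v_i$, which vanishes when the sample is excluded ($v_i = 0$).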

Figure 2: The interactions of the update of the model parameters. The blue blocks represent the boosting stage while the green block represents the SPL stage. The updates of $W$ and $v$ interact with each other in successive iterations, while the current $W$ and the previous $v$ jointly influence the learning of the new weak classifier through $U$.

We summarize the optimization procedure in Algorithm 1. The algorithm alternates between learning a new $\hat{h}(\cdot)$ (Line 5), updating $W$ (Line 6), updating $U$ (Line 7) and reweighting $v$ in an SPL manner (Line 8). Note that the SPL parameters $(\lambda, \zeta)$ are iteratively increased (annealed) if they are small (Line 10 to 12), so as to introduce more (difficult) samples in the future learning. Furthermore, we adopt an early stopping criterion on a held-out validation set when the iteration number exceeds $T_{ES}$ to maintain a better generalization performance and a reasonable running time.
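The annealing and stopping logic described above can be sketched as two small helpers. This is a simplified reading under stated assumptions: both SPL parameters are multiplied by the same factor µ while λ is below λmax, and the loop stops either when the violation score of the newest weak classifier falls below ν + ε or when, after T_ES iterations, the validation error is worse than the best earlier one. The function and argument names are hypothetical.

```python
def anneal(lam, zeta, mu, lam_max):
    """Grow both SPL parameters by the factor mu while lam is below lam_max."""
    if lam < lam_max:
        return mu * lam, mu * zeta
    return lam, zeta


def should_stop(violation, nu, eps, t, t_es, errs):
    """Stopping test; errs holds validation errors err(1), ..., err(t)."""
    converged = violation < nu + eps
    early = t > t_es and len(errs) > 1 and errs[-1] > min(errs[:-1])
    return converged or early


# Hypothetical values for one pass through the checks.
lam, zeta = anneal(0.5, 0.1, mu=2.0, lam_max=1.0)
stop = should_stop(violation=0.5, nu=0.1, eps=0.01, t=11, t_es=10,
                   errs=[0.30, 0.20, 0.25])
```

The iteration with the lowest validation error would then be the one whose weak classifiers and weights are returned.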

4 Experiments

We evaluate the performance of SPBL classification on three real-world datasets. The comparative methods include softmax regression (SR), multi-class SVM (MultiSVM), MultiBoost [Shen et al., 2012b], and multi-class AdaBoost (AdaBoost) [Zhu et al., 2009]. SR and MultiSVM are also embedded with a self-paced learning scheme for comparison, denoted as SR-S and MultiSVM-S.

Specifically, the two baseline methods, SR and MultiSVM, are formulated based on a linear classifier, where SR optimizes a log-likelihood and MultiSVM optimizes a hinge loss. MultiBoost is a fully-corrective formulation of multi-class boosting classification, which is a special case of SPBL with $v$ fixed as $\mathbf{1}_n$. Multi-class AdaBoost is a multi-class generalization of AdaBoost as a stagewise additive boosting model. It is worth comparing SPBL with the above methods to verify its effectiveness by learning a classifier in a joint boosting and self-paced manner.

4.1 Dataset Description

Three real-world image datasets are used. We choose the image data for experiments because the underlying patterns of image features tend to have rich nonlinear correlations. The three datasets are Caltech256¹, Animals with Attributes (AWA)², and Corel10k³. All of them are publicly available and fully labeled with each sample belonging to only one class. The statistics of the datasets are summarized in Table 2.

¹ http://www.vision.caltech.edu/Image_Datasets/Caltech256/
² http://attributes.kyb.tuebingen.mpg.de/
³ http://www.ci.gxnu.edu.cn/cbir/dataset.aspx

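Algorithm 1: SPBL for Classification

Input: Training set $\{(x_i, y_i)\}_{i=1}^{n}$; $\nu > 0$; initial SPL parameters $\lambda_0, \zeta_0 > 0$; initial SPL weights $v_0$; $\lambda_{\max}$; $T_{ES}$; $\mu > 1$; $\epsilon > 0$.

Output: A set of $k$ weak classifiers $\{h_j(\cdot)\}_{j=1}^{k}$ and the weight matrix $W$.

1 Initialize: $v^{(0)} \leftarrow v_0$; $(\lambda, \zeta) \leftarrow (\lambda_0, \zeta_0)$; $U \leftarrow v^{(0)} \mathbf{1}^{T}$;

2 $t \leftarrow 0$;

3 repeat

4 $t \leftarrow t + 1$;

Boosting:

5 Learn a new weak classifier: solve Eq. (12) to obtain $\{h_t(\cdot), \hat{r}\}$ based on $U$;

6 Update $W$: solve Eq. (9) for $W^{(t)}$ based on $v^{(t-1)}$;

7 Update $U$: compute $U$ by Eq. (11) based on $v^{(t-1)}$;

SPL:

8 Update $v$: compute $v^{(t)}$ by Eq. (8) based on $W^{(t)}$;

Validation:

9 Test $\{h_j(\cdot)\}_{j=1}^{t}$ and $W^{(t)}$ on the validation set, to obtain the error rate $err(t)$;

Annealing:

10 if $\lambda < \lambda_{\max}$ then

11 $\lambda \leftarrow \mu\lambda$; $\zeta \leftarrow \mu\zeta$;

12 end

13 until $\sum_{i} \big[ \delta_{\hat{r}, y_i} \big( \sum_{l} U_{il} \big) - U_{i\hat{r}} \big] h_t(x_i) < \nu + \epsilon$, or $t > T_{ES}$ and $err(t) > \min_{1 \le s \le t-1} err(s)$;

14 $k \leftarrow \arg\min_{s} err(s)$;

Return: $\{h_j(\cdot)\}_{j=1}^{k}$, $W = W^{(k)}$.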

We use the spatial pyramid features for Caltech256 and Corel10k extracted based on [Lazebnik et al., 2006], and use the available Decaf feature for AWA. We reduce the dimensions of all the features to 512 by PCA.

4.2 Experimental Settings

For convenience of optimization, we first extend the output of a weak classifier $h(\cdot)$ to real values in $[0, 1]$. We assume a logistic linear form for $h(\cdot)$:

$$h(x; \beta_h, b_h) = \frac{1}{1 + \exp\left(-(\beta_h^{T} x + b_h)\right)}, \tag{13}$$

where $\beta_h \in \mathbb{R}^d$, $b_h \in \mathbb{R}$ are the parameters of $h(\cdot)$.
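A real-valued weak classifier of this kind can be sketched as follows, assuming that the "logistic linear form" means a sigmoid applied to an affine function of the features. The parameter names `beta` and `b` and the toy values are hypothetical.

```python
import math


def weak_classifier(x, beta, b):
    """Sigmoid of a linear score; the output lies strictly between 0 and 1."""
    z = sum(bi * xi for bi, xi in zip(beta, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for d = 2 features.
beta = [1.0, -2.0]
b = 0.5
response = weak_classifier([0.5, 0.5], beta, b)   # linear score is exactly 0
```

Because the response is differentiable in the parameters, a weak classifier of this form can be fitted with gradient-based methods when maximizing the weighted selection score.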

We adopt the strategy in [Jiang et al., 2014b] for the annealing of the SPL parameters $(\lambda, \zeta)$ (Line 10 to 12 in Algorithm 1). Specifically, at each iteration, we sort the samples in ascending order of their losses, and set $(\lambda, \zeta)$ based on the number of samples to be selected by now. Instead of annealing the absolute values of $(\lambda, \zeta)$, we anneal the proportion of the number of selected samples. It is shown in [Jiang et al., 2014b] that such an annealing scheme is more stable.
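One possible reading of proportion-based annealing is sketched below: the losses are sorted and λ is placed just above the loss of the last sample to be admitted, so that the chosen fraction of samples has a loss strictly below λ. The exact rule of [Jiang et al., 2014b], and how the second SPL parameter is tied to λ, are not given in the text, so the function `lambda_for_proportion` and its `slack` argument are assumptions made for illustration.

```python
import math


def lambda_for_proportion(losses, proportion, slack=1e-8):
    """Return a threshold so that about `proportion` of losses fall below it."""
    ordered = sorted(losses)
    m = max(1, min(len(ordered), math.ceil(proportion * len(ordered))))
    # Place lambda slightly above the m-th smallest loss (slack is hypothetical).
    return ordered[m - 1] + slack


# Hypothetical losses; admit half of the samples now, all of them later.
losses = [0.9, 0.1, 0.5, 0.3]
lam_half = lambda_for_proportion(losses, 0.5)
lam_all = lambda_for_proportion(losses, 1.0)
selected_half = sum(1 for l in losses if l < lam_half)
```

Annealing the proportion rather than λ itself keeps the number of admitted samples predictable even when the scale of the losses changes between iterations.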

We implement a grid search for the tuning of the hyperparameter $\nu$. Further, in order to test the robustness of our model, we manually add label noise into the training set by randomly selecting and relabeling $s\%$ of the training samples with labels different from the true ones. We conduct experiments with $s \in \{0, 5, 10, 15\}$ for the three datasets.
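The label noise protocol can be sketched as follows. The sketch assumes that the replacement label is drawn uniformly from the other classes, which the text does not specify; the function name `add_label_noise` and the fixed seed are hypothetical.

```python
import random


def add_label_noise(labels, num_classes, s_percent, seed=0):
    """Relabel s_percent of the samples with a different class (1-based labels)."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(len(labels) * s_percent / 100.0)
    for i in rng.sample(range(len(labels)), n_flip):
        wrong = [c for c in range(1, num_classes + 1) if c != labels[i]]
        noisy[i] = rng.choice(wrong)   # assumption: uniform over the other classes
    return noisy


# Hypothetical training labels: 60 samples over 3 classes, 10% noise.
labels = [1, 2, 3] * 20
noisy_labels = add_label_noise(labels, num_classes=3, s_percent=10)
num_changed = sum(1 for a, b in zip(labels, noisy_labels) if a != b)
```

Since every selected sample receives a label different from its true one, the number of corrupted labels equals the requested fraction of the training set.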

4.3 Experimental Results

Table 1 shows the error rate performance of SPBL and the comparative methods on the three datasets, with different proportions of noisy samples. The best results are shown in bold face. To give a concise demonstration of the performances, we show in Figure 3 the error rates for three datasets w.r.t. the noise ratio. We see that SPBL has a better overall performance than the comparative methods.

Figure 3 further shows that the performances of boosting methods (MultiBoost, AdaBoost) are sensitive to the noisy data, and that the comparative methods embedded with SPL (SR-S, MultiSVM-S) are more robust than their original counterparts. This is expected, since the noise suppression in these methods stems from the self-paced learning scheme. By effectively utilizing the complementarity of boosting and SPL, the proposed SPBL demonstrates a stable performance improvement over the SPL-embedded methods, and an increasing performance improvement over the other comparative methods.

Further, we show in Figure 4 the change of the error rates on the training set and the test set w.r.t. the learning iterations of SPBL and MultiBoost, for $s = 0$. We see that the test and training error rate curves of SPBL are generally in between the corresponding curves of MultiBoost. Therefore, Figure 4 shows that SPBL relieves the overfitting problem of boosting methods, since it has a smaller gap between the training errors and the test errors. This is due to the smooth learning pace of SPBL based on a self-paced boosting optimization from easy samples to hard ones, instead of learning from the whole data batch as MultiBoost does. Through this, SPBL guides the model to focus on the samples that are not only insufficiently learned, but also highly reliable, and thus relieves the overfitting and obtains a better generalization performance.

5 Conclusions

In this work, we propose a unified learning framework, Self-Paced Boost Learning (SPBL), that learns in a joint manner from weak models to a strong model and from easy samples to complex ones, for both effective learning and robust learning. With an adaptive pace from easy to hard in the boosting optimization, SPBL asymptotically guides the model to focus on the samples that are not only insufficiently learned but also highly reliable. Through this, SPBL learns a model in both directions of positive encouragement (on reliable samples) and negative suppression (on misclassified samples), and is capable of capturing the intrinsic inter-class discriminative patterns while ensuring the reliability of the samples involved in learning. To the best of our knowledge, this is the first work that reveals and utilizes the association of boosting and SPL to simultaneously enhance the effectiveness and the robustness for supervised learning. We formulate SPBL as a fully-corrective optimization for the classification task. The experiments on real-world datasets show the superiority of SPBL in terms of both effectiveness and robustness.

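Table 1: The classification error rate performance of each approach on the three datasets, for noise ratios s = 0, 5, 10, 15 (in %). The lowest error rate in each column is shown in bold.

| Method | Caltech256, s=0 | Caltech256, s=5 | Caltech256, s=10 | Caltech256, s=15 | AWA, s=0 | AWA, s=5 | AWA, s=10 | AWA, s=15 | Corel10k, s=0 | Corel10k, s=5 | Corel10k, s=10 | Corel10k, s=15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPBL | **0.3321** | **0.3634** | **0.3890** | **0.4058** | **0.3462** | **0.3545** | 0.3846 | **0.3841** | **0.2182** | **0.2573** | **0.2716** | **0.2899** |
| MultiBoost | 0.3682 | 0.3903 | 0.4320 | 0.4361 | 0.3585 | 0.3898 | 0.4139 | 0.4256 | 0.2332 | 0.2762 | 0.3091 | 0.3262 |
| AdaBoost | 0.3719 | 0.4011 | 0.4216 | 0.4298 | 0.3573 | 0.3906 | 0.4200 | 0.4379 | 0.2316 | 0.2790 | 0.3002 | 0.3388 |
| SR | 0.4093 | 0.4188 | 0.4265 | 0.4293 | 0.3805 | 0.3852 | 0.3970 | 0.4085 | 0.2900 | 0.3012 | 0.3026 | 0.3284 |
| SR-S | 0.3964 | 0.3997 | 0.4131 | 0.4122 | 0.3790 | 0.3711 | **0.3835** | 0.3921 | 0.2762 | 0.2810 | 0.2922 | 0.3165 |
| MultiSVM | 0.4332 | 0.4440 | 0.4455 | 0.4787 | 0.3986 | 0.4183 | 0.4227 | 0.4354 | 0.2868 | 0.3112 | 0.3126 | 0.3580 |
| MultiSVM-S | 0.4041 | 0.4164 | 0.4109 | 0.4335 | 0.3830 | 0.3932 | 0.3995 | 0.4090 | 0.2772 | 0.3001 | 0.3049 | 0.3356 |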

(a) Caltech256 (b) AWA (c) Corel10k

Figure 3: The error rate results w.r.t. the noise ratio s% for the three datasets. The proposed SPBL has a better overall performance than the comparative methods.

(a) Caltech256 (b) AWA (c) Corel10k

Figure 4: The error rates on the training and the test set of SPBL and MultiBoost w.r.t. the iterations for s = 0. The learning pace of the proposed SPBL is smoother, with a smaller gap between the training and the test performance. SPBL relieves the overfitting of boosting methods.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant U1509206, Grant 61472353, and Grant 61572431, the National Basic Research Program of China under Grant 2012CB316400 and Grant 2015CB352300, the Fundamental Research Funds for the Central Universities, Zhejiang Provincial Engineering Center on Media Data Cloud Processing and Analysis, and the equipment donation by Nvidia.

Table 2: Statistics of the datasets

| Dataset | Feature | Classes | Samples & Partition (training/validation/test) |
|---|---|---|---|
| Caltech256 | SP | 256 | 29780 (50%/20%/30%) |
| AWA | Decaf | 50 | 30475 (50%/20%/30%) |
| Corel10k | SP | 100 | 10000 (50%/20%/30%) |

References

[Bengio et al., 2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning, pages 41–48. ACM, 2009.

[Demiriz et al., 2002] Ayhan Demiriz, Kristin P Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.

[Domingo and Watanabe, 2000] Carlos Domingo and Osamu Watanabe. Madaboost: A modification of adaboost. In Annual Conference on Computational Learning Theory, pages 180–189, 2000.

[Duffy and Helmbold, 2002] Nigel Duffy and David Helmbold. Boosting methods for regression. Machine Learning, 47(2-3):153–200, 2002.

[Freund and Schapire, 1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[Friedman et al., 2000] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.

[Jiang et al., 2014a] Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In Proceedings of the ACM International Conference on Multimedia, pages 547–556. ACM, 2014.

[Jiang et al., 2014b] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.

[Jiang et al., 2015] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[Kumar et al., 2010] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.

[Kumar et al., 2011] M Pawan Kumar, Haithem Turki, Dan Preston, and Daphne Koller. Learning specific-class segmentation from diverse data. In International Conference on Computer Vision, pages 1800–1807. IEEE, 2011.

[Lazebnik et al., 2006] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, pages 2169–2178, 2006.

[Long and Servedio, 2010] Philip M Long and Rocco A Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.

[Mayr et al., 2014] Andreas Mayr, Harald Binder, Olaf Gefeller, and Matthias Schmid. The evolution of boosting algorithms - from machine learning to statistical modelling. arXiv preprint arXiv:1403.1452, 2014.

[Meng and Zhao, 2015] Deyu Meng and Qian Zhao. What objective does self-paced learning indeed optimize? arXiv preprint arXiv:1511.06049, 2015.

[Rätsch et al., 2007] Gunnar Rätsch, Manfred K Warmuth, and Karen A Glocer. Boosting algorithms for maximizing the soft margin. In Advances in Neural Information Processing Systems, pages 1585–1592, 2007.

[Schapire and Freund, 2012] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT Press, 2012.

[Shen et al., 2012a] Chunhua Shen, Junae Kim, Lei Wang, and Anton van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. The Journal of Machine Learning Research, 13(1):1007–1036, 2012.

[Shen et al., 2012b] Chunhua Shen, Sakrapee Paisitkriangkrai, and Anton van den Hengel. A direct approach to multi-class boosting and extensions. arXiv preprint arXiv:1210.4601, 2012.

[Tutz and Binder, 2006] Gerhard Tutz and Harald Binder. Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics, 62(4):961–971, 2006.

[Warmuth et al., 2006] Manfred K Warmuth, Jun Liao, and Gunnar Rätsch. Totally corrective boosting algorithms that maximize the margin. In International Conference on Machine Learning, pages 1001–1008. ACM, 2006.

[Zhang et al., 2015] Dingwen Zhang, Deyu Meng, Chao Li, Lu Jiang, Qian Zhao, and Junwei Han. A self-paced multiple-instance learning framework for co-saliency detection. In International Conference on Computer Vision, pages 594–602, 2015.

[Zhao et al., 2015] Qian Zhao, Deyu Meng, Lu Jiang, Qi Xie, Zongben Xu, and Alexander G Hauptmann. Self-paced learning for matrix factorization. In AAAI Conference on Artificial Intelligence, 2015.

[Zhou, 2012] Zhi-Hua Zhou. Ensemble methods: foundations and algorithms. CRC Press, 2012.

[Zhu et al., 2009] Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.