Self-Paced Learning with Diversity

Lu Jiang1, Deyu Meng1,2, Shoou-I Yu1, Zhenzhong Lan1, Shiguang Shan1,3, Alexander G. Hauptmann1
1School of Computer Science, Carnegie Mellon University
2School of Mathematics and Statistics, Xi'an Jiaotong University
3Institute of Computing Technology, Chinese Academy of Sciences
lujiang@cs.cmu.edu, dymeng@mail.xjtu.edu.cn, {iyu, lanzhzh}@cs.cmu.edu, sgshan@ict.ac.cn, alex@cs.cmu.edu

Abstract

Self-paced learning (SPL) is a recently proposed learning regime inspired by the learning process of humans and animals that gradually incorporates easy to more complex samples into training. Existing methods are limited in that they ignore an important aspect of learning: diversity. To incorporate this information, we propose an approach called self-paced learning with diversity (SPLD) which formalizes the preference for both easy and diverse samples into a general regularizer. This regularization term is independent of the learning objective, and thus can be easily generalized to various learning tasks. Albeit non-convex, the optimization of the variables included in this SPLD regularization term for sample selection can be globally solved in linearithmic time. We demonstrate that our method significantly outperforms conventional SPL on three real-world datasets. Specifically, SPLD achieves the best MAP so far reported in the literature on the Hollywood2 and Olympic Sports datasets.

1 Introduction

Since it was introduced in 2009, Curriculum Learning (CL) [1] has been attracting increasing attention in the fields of machine learning and computer vision [2]. The learning paradigm is inspired by the learning principle underlying the cognitive process of humans and animals, which generally starts with the easier aspects of a task and then gradually takes more complex examples into consideration. It has been empirically demonstrated to be beneficial in avoiding bad local minima and in achieving better generalization [1]. A sequence of gradually added training samples [1] is called a curriculum. A straightforward way to design a curriculum is to select samples based on certain heuristic easiness measures [3, 4, 5]. This ad-hoc implementation, however, is problem-specific and lacks generality. To alleviate this deficiency, Kumar et al. [6] proposed a method called Self-Paced Learning (SPL) that embeds curriculum design into model learning. SPL introduces a regularization term into the learning objective so that the model is jointly learned with a curriculum consisting of easy to complex samples. As its name suggests, the curriculum is gradually determined by the model itself based on what it has already learned, as opposed to predefined heuristic criteria. Since the curriculum in SPL is independent of the model objective of the specific problem, SPL represents a general implementation [7, 8] of curriculum learning.

In SPL, samples in a curriculum are selected solely in terms of "easiness". In this work, we reveal that diversity, an important aspect of learning, should also be considered. Ideal self-paced learning should utilize not only easy but also diverse examples that are sufficiently dissimilar from what has already been learned. Theoretically, considering diversity in learning is consistent with the increasing-entropy principle in CL that a curriculum should increase the diversity of training examples [1]. This can be intuitively explained in the context of human education.
A rational curriculum for a pupil not only needs to include examples of suitable easiness matching her learning pace, but, importantly, should also include some diverse examples on the subject in order for her to develop more comprehensive knowledge.

[Figure 1: Illustrative comparison of SPL and SPLD on the Rock Climbing event using real samples [15]. Positive training samples come from sub-events such as Outdoor bouldering, Artificial wall climbing, and Snow mountain climbing, ordered from easy to hard by their losses. SPL tends to first select the easiest samples from a single group (a1, a2, a3, a4, c4), whereas SPLD inclines to select easy and diverse samples from multiple groups (a2, b1, c1, a1, c4).]

Likewise, learning from easy and diverse samples is expected to be better than learning from either criterion alone. We name the learning paradigm that considers both easiness and diversity Self-Paced Learning with Diversity (SPLD). SPLD proves to be a general learning framework, as its intuition is embedded as a regularization term that is independent of specific model objectives. In addition, by considering diversity in learning, SPLD is capable of obtaining better solutions. For example, Fig. 1 shows some positive samples for the event Rock Climbing on a real dataset, named MED [15]. Three groups of samples are depicted for illustration. The number under each keyframe indicates the loss, and a smaller loss corresponds to an easier sample. Every group has easy and complex samples. Having learned some samples from a group, the SPL model prefers to select more samples from the same group, as they appear easy given what the model has already learned. This may lead to overfitting to a data subset while easy samples in other groups are ignored. For example, in Fig. 1, the samples selected in the first iterations of SPL are all from the Outdoor bouldering sub-event because they all look like a1. This matters because the overfitting becomes more and more severe as samples from the same group keep being added to the training set. The phenomenon is even more evident in real-world data, where the collected samples are usually biased towards some groups. In contrast, SPLD, considering both easiness and diversity, produces a curriculum that reasonably mixes easy samples from multiple groups. The diverse curriculum is expected to help the model quickly grasp easy and comprehensive knowledge and to obtain better solutions. This hypothesis is substantiated by our experiments.

The contribution of this paper is threefold: (1) We propose the novel idea of considering both easiness and diversity in self-paced learning, and formulate it as a concise regularization term that can be generally applied to various problems (Section 4.1). (2) We introduce an algorithm that globally optimizes a non-convex problem w.r.t. the variables included in this SPLD regularization term for sample selection (Section 4.2). (3) We demonstrate that the proposed SPLD significantly outperforms SPL on three real-world datasets. Notably, SPLD achieves the best MAP so far reported in the literature on two action datasets.

2 Related work

Bengio et al. [1] proposed a new learning paradigm called curriculum learning (CL), in which a model is learned by gradually including samples into training from easy to complex so as to increase the entropy of training samples.
Afterwards, Bengio and his colleagues [2] presented insightful explorations of the rationale underlying this learning paradigm, and discussed its relationship to conventional optimization techniques, e.g., continuation and annealing methods. From a human behavioral perspective, Khan et al. [10] provided evidence that CL is consistent with the principles of teaching. The curriculum is often derived from predetermined heuristics in particular problems. For example, Basu and Christensen [3] took the negative distance to the boundary as the indicator of easiness in classification. Spitkovsky et al. [4] used sentence length as an indicator in studying grammar induction: shorter sentences have fewer possible solutions and thus were learned earlier. Lapedriza et al. [5] proposed a similar approach that first ranks examples based on certain training values and then greedily trains the model on the sorted examples. The ad-hoc curriculum design in CL turns out to be onerous or conceptually difficult to implement across different problems. To alleviate this issue, Kumar et al. [6] designed a new formulation, called self-paced learning (SPL). SPL embeds curriculum design (from easy to more complex samples) into model learning. By virtue of its generality, various applications based on SPL have been proposed very recently [7, 8, 11, 12, 13]. For example, Jiang et al. [7] discovered that pseudo relevance feedback is a type of self-paced learning, which explains the rationale of this iterative algorithm starting from the easy examples, i.e., the top-ranked documents/videos. Tang et al. [8] formulated a self-paced domain adaptation approach by training target domain knowledge starting with easy samples in the source domain. Kumar et al. [11] developed an SPL strategy for the specific-class segmentation task. Supančič and Ramanan [12] designed an SPL method for long-term tracking by setting the smallest increase in the SVM objective as the loss function. To the best of our knowledge, there have been no studies that incorporate diversity into SPL.

3 Self-Paced Learning

Before introducing our approach, we first briefly review SPL. Given the training dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^m$ denotes the $i$th observed sample and $y_i$ represents its label, let $L(y_i, f(x_i, w))$ denote the loss function which calculates the cost between the ground-truth label $y_i$ and the estimated label $f(x_i, w)$. Here $w$ represents the model parameter inside the decision function $f$. In SPL, the goal is to jointly learn the model parameter $w$ and the latent weight variable $v = [v_1, \ldots, v_n]$ by minimizing:

$$\min_{w,v} E(w, v; \lambda) = \sum_{i=1}^{n} v_i L(y_i, f(x_i, w)) - \lambda \sum_{i=1}^{n} v_i, \quad \text{s.t. } v \in [0, 1]^n, \tag{1}$$

where $\lambda$ is a parameter for controlling the learning pace. Eq. (1) indicates that the loss of a sample is discounted by a weight. The objective of SPL is to minimize the weighted training loss together with the negative $l_1$-norm regularizer $-\|v\|_1 = -\sum_{i=1}^{n} v_i$ (since $v_i \ge 0$). This regularization term is general and applicable to various learning tasks with different loss functions [7, 11, 12]. ACS (Alternative Convex Search) is generally used to solve Eq. (1) [6, 8]. It is an iterative method for biconvex optimization, in which the variables are divided into two disjoint blocks. In each iteration, one block of variables is optimized while the other block is kept fixed. When $v$ is fixed, existing off-the-shelf supervised learning methods can be employed to obtain the optimal $w^*$. With $w$ fixed, the global optimum $v^* = [v_1^*, \ldots, v_n^*]$ can be easily calculated by [6]:

$$v_i^* = \begin{cases} 1, & L(y_i, f(x_i, w)) < \lambda, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

There is an intuitive explanation behind this alternating search strategy: 1) when updating $v$ with $w$ fixed, a sample whose loss is smaller than a certain threshold $\lambda$ is taken as an "easy" sample and is selected for training ($v_i^* = 1$), and is otherwise unselected ($v_i^* = 0$); 2) when updating $w$ with $v$ fixed, the classifier is trained only on the selected easy samples. The parameter $\lambda$ controls the pace at which the model learns new samples, and physically $\lambda$ corresponds to the "age" of the model. When $\lambda$ is small, only easy samples with small losses are considered. As $\lambda$ grows, more samples with larger losses are gradually appended to train a more "mature" model.
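For concreteness, the following is a minimal Python sketch of this alternating scheme (not from the paper): `fit_model` and `sample_loss` are placeholders for whatever supervised learner and per-sample loss a given task uses.

```python
import numpy as np

def spl_train(X, y, fit_model, sample_loss, lam=0.1, mu=1.3, n_iters=10):
    """Alternative Convex Search for Eq. (1): alternate the w- and v-updates."""
    v = np.random.rand(len(y)) < 0.5        # random initial selection, as in [6]
    model = None
    for _ in range(n_iters):
        model = fit_model(X[v], y[v])       # w-step: train on selected easy samples
        losses = sample_loss(model, X, y)   # per-sample losses under the current w
        v = losses < lam                    # v-step: closed-form update of Eq. (2)
        lam *= mu                           # grow the model's "age": admit harder samples
    return model
```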
4 Self-Paced Learning with Diversity

In this section we detail the proposed learning paradigm, SPLD. We first formally define its objective in Section 4.1, and then discuss an efficient algorithm for solving the problem in Section 4.2.

4.1 SPLD Model

Diversity implies that the selected samples should be less similar to, or clustered with, one another. An intuitive way to realize this is to select samples from different groups scattered in the sample space. We assume that the correlation of samples between groups is lower than that within a group. This auxiliary group membership is either given, e.g., in object recognition, frames from the same video can be regarded as belonging to the same group, or can be obtained by clustering the samples.

The aim of SPLD can be mathematically described as follows. Assume that the training samples $X = (x_1, \ldots, x_n) \in \mathbb{R}^{m \times n}$ are partitioned into $b$ groups: $X^{(1)}, \ldots, X^{(b)}$, where the columns of $X^{(j)} \in \mathbb{R}^{m \times n_j}$ correspond to the samples in the $j$th group, $n_j$ is the number of samples in that group, and $\sum_{j=1}^{b} n_j = n$. Accordingly, denote the weight vector as $v = [v^{(1)}, \ldots, v^{(b)}]$, where $v^{(j)} = (v_1^{(j)}, \ldots, v_{n_j}^{(j)})^T \in [0, 1]^{n_j}$. SPLD on one hand needs to assign nonzero weights of $v$ to easy samples, as in conventional SPL, and on the other hand needs to disperse the nonzero elements across as many groups $v^{(j)}$ as possible to increase diversity. Both requirements can be uniformly realized through the following optimization model:

$$\min_{w,v} E(w, v; \lambda, \gamma) = \sum_{i=1}^{n} v_i L(y_i, f(x_i, w)) - \lambda \sum_{i=1}^{n} v_i - \gamma \|v\|_{2,1}, \quad \text{s.t. } v \in [0, 1]^n, \tag{3}$$

where $\lambda, \gamma$ are the parameters imposed on the easiness term (the negative $l_1$-norm: $-\|v\|_1$) and the diversity term (the negative $l_{2,1}$-norm: $-\|v\|_{2,1}$), respectively. The diversity term is defined as:

$$\|v\|_{2,1} = \sum_{j=1}^{b} \|v^{(j)}\|_2. \tag{4}$$

SPLD introduces a new regularization term in Eq. (3) which consists of two components. One is the negative $l_1$-norm inherited from conventional SPL, which favors selecting easy over complex examples. The other is the proposed negative $l_{2,1}$-norm, which favors selecting diverse samples residing in more groups. It is well known that the $l_{2,1}$-norm leads to a group-wise sparse representation of $v$ [14], i.e., non-zero entries of $v$ tend to be concentrated in a small number of groups. Contrariwise, the negative $l_{2,1}$-norm has the opposite, anti-group-sparsity effect: the nonzero entries of $v$ tend to be scattered across a large number of groups. In other words, this anti-group-sparsity representation realizes the desired diversity. Note that when each group contains only a single sample, Eq. (3) degenerates to Eq. (1). Unlike the convex regularization term in Eq. (1) of SPL, the term in SPLD is non-convex; consequently, traditional (sub)gradient-based methods cannot be directly applied to optimize $v$. Section 4.2 presents an algorithm that resolves this issue.
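To see the anti-group-sparsity effect numerically, consider selecting four samples out of sixteen arranged in four groups: concentrating all four in one group yields $\|v\|_{2,1} = \sqrt{4} = 2$, whereas spreading one per group yields $4\sqrt{1} = 4$; since $\|v\|_1 = 4$ in both cases, the diverse selection makes the objective in Eq. (3) smaller. A small Python check (ours, not from the paper):

```python
import numpy as np

def l21_norm(v, groups):
    """||v||_{2,1} over a group partition, as in Eq. (4)."""
    return sum(np.linalg.norm(v[list(g)]) for g in groups)

groups = [range(0, 4), range(4, 8), range(8, 12), range(12, 16)]
concentrated = np.zeros(16); concentrated[:4] = 1        # 4 samples, 1 group
diverse = np.zeros(16); diverse[[0, 4, 8, 12]] = 1       # 4 samples, 4 groups

# Equal easiness term (||v||_1 = 4 for both), but the diversity term differs:
print(l21_norm(concentrated, groups))  # 2.0 -> contributes -2*gamma to Eq. (3)
print(l21_norm(diverse, groups))       # 4.0 -> contributes -4*gamma: lower, preferred
```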
4.2 SPLD Algorithm

As in SPL, the alternating search strategy can be employed to solve Eq. (3). A challenge, however, is that optimizing $v$ with $w$ fixed becomes a non-convex problem. We propose a simple yet effective algorithm for extracting the global optimum of this problem, listed in Algorithm 1. It takes as input the groups of samples, the up-to-date model parameter $w$, and the two self-paced parameters, and outputs the optimal $v$ of $\min_v E(w, v; \lambda, \gamma)$. Global optimality is established in the following theorem (see the proof in the supplementary materials):

Theorem 1. Algorithm 1 attains the global optimum of $\min_v E(w, v)$ for any given $w$ in linearithmic time.

Algorithm 1: Algorithm for solving $\min_v E(w, v; \lambda, \gamma)$.
input: Input dataset $D$, groups $X^{(1)}, \ldots, X^{(b)}$, $w$, $\lambda$, $\gamma$
output: The global solution $v^* = (v^{(1)}, \ldots, v^{(b)})$ of $\min_v E(w, v; \lambda, \gamma)$.
1  for j = 1 to b do  // for each group
2    Sort the samples in $X^{(j)}$ as $(x_1^{(j)}, \ldots, x_{n_j}^{(j)})$ in ascending order of their loss values $L$;
3    Accordingly, denote the labels and weights of $X^{(j)}$ as $(y_1^{(j)}, \ldots, y_{n_j}^{(j)})$ and $(v_1^{(j)}, \ldots, v_{n_j}^{(j)})$;
4    for i = 1 to $n_j$ do  // easy samples first
5      if $L(y_i^{(j)}, f(x_i^{(j)}, w)) < \lambda + \gamma \frac{1}{\sqrt{i} + \sqrt{i-1}}$ then $v_i^{(j)} = 1$;  // select this sample
6      else $v_i^{(j)} = 0$;  // do not select this sample
7    end
8  end
9  return $v^*$

As shown, Algorithm 1 selects samples in terms of both easiness and diversity. Specifically:
- Samples with $L(y_i, f(x_i, w)) < \lambda$ will be selected in training ($v_i^* = 1$) in Step 5. These represent the easy examples with small losses.
- Samples with $L(y_i, f(x_i, w)) > \lambda + \gamma$ will not be selected in training ($v_i^* = 0$) in Step 6. These represent the complex examples with larger losses.
- Other samples are selected by comparing their losses to the threshold $\lambda + \gamma/(\sqrt{i} + \sqrt{i-1})$, where $i$ is the sample's rank w.r.t. its loss value within its group. A sample with a loss smaller than this threshold is selected in training. Since the threshold decreases considerably as the rank $i$ grows, Step 5 penalizes repeatedly selecting samples from the same group.

We study a tractable example that allows for clearer diagnosis in Fig. 2, where each keyframe represents a video sample for the event Rock Climbing from the TRECVID MED data [15], and the number below it indicates its loss. The samples are clustered into four groups based on visual similarity.

[Figure 2: An example of samples selected by Algorithm 1. The four groups are Outdoor bouldering, Artificial wall climbing, Snow mountain climbing, and Bear climbing a rock. A colored block denotes a curriculum with given λ and γ, and the bold (red) box indicates the easy sample selected by Algorithm 1: (a) curriculum a, b, c, d; (b) curriculum a, j, g, b; (c) curriculum a, j, g, n.]

When γ = 0, as shown in Fig. 2(a), SPLD is identical to SPL and selects only the easy samples (with the smallest losses) from a single cluster. Its curriculum thus includes duplicate samples like b, c, d with the same loss value. When λ ≠ 0 and γ ≠ 0, as in Fig. 2(b), SPLD balances easiness and diversity and produces a reasonable and diverse curriculum: a, j, g, b. Note that even though there exist three duplicate samples b, c, d, SPLD selects only one of them due to the decreasing threshold in Step 5 of Algorithm 1. Likewise, samples e and j share the same loss, but only j is selected, as it does more to increase diversity. In the extreme case where λ = 0 and γ ≠ 0, as illustrated in Fig. 2(c), SPLD selects only diverse samples and thus may choose outliers, such as sample n, which is a confusable video about a bear climbing a rock. Therefore, considering both easiness and diversity is more reasonable than considering either alone. Physically, the parameters λ and γ together correspond to the "age" of the model, where λ focuses on easiness whereas γ stresses diversity.
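To make the selection rule concrete, the following Python sketch implements Algorithm 1 directly (our transcription, assuming per-sample losses are precomputed and `groups` maps each group id to the indices of its samples):

```python
import numpy as np

def spld_select(losses, groups, lam, gamma):
    """Globally optimal v for min_v E(w, v; lambda, gamma) with w fixed."""
    v = np.zeros(len(losses))
    for idx in groups.values():
        ranked = sorted(idx, key=lambda i: losses[i])   # Step 2: easy samples first
        for rank, i in enumerate(ranked, start=1):
            # Step 5: the threshold starts at lam + gamma for the easiest sample
            # in the group and decays toward lam, penalizing repeated picks
            # from the same group.
            if losses[i] < lam + gamma / (np.sqrt(rank) + np.sqrt(rank - 1)):
                v[i] = 1.0
    return v
```

The per-group sort dominates the cost, which is how the overall linearithmic complexity of Theorem 1 arises.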
Since Algorithm 1 finds the optimal $v^*$, the alternating search strategy can be readily applied to solving Eq. (3). The details are listed in Algorithm 2.

Algorithm 2: Self-Paced Learning with Diversity.
input: Input dataset $D$, self-paced parameters $\mu_1$, $\mu_2$
output: Model parameter $w$
1  if no prior clusters exist then cluster the training samples $X$ into $b$ groups $X^{(1)}, \ldots, X^{(b)}$;
2  Initialize $v^*$, $\lambda$, $\gamma$;  // assign the starting values
3  while not converged do
4    Update $w^* = \arg\min_w E(w, v^*; \lambda, \gamma)$;  // train a classification model
5    Update $v^* = \arg\min_v E(w^*, v; \lambda, \gamma)$ using Algorithm 1;  // select easy and diverse samples
6    $\lambda \leftarrow \mu_1 \lambda$; $\gamma \leftarrow \mu_2 \gamma$;  // update the learning pace
7  end
8  return $w^*$

As aforementioned, Step 4 can be implemented with existing off-the-shelf learning methods. Following [6], we initialize $v^*$ by setting $v_i = 1$ for randomly selected samples. Following SPL [6], the self-paced parameters are enlarged by factors $\mu_1, \mu_2$ ($\mu_1, \mu_2 \ge 1$) in Step 6 at the end of every iteration. In practice, it appears more robust to first sort the samples in ascending order of their losses, and then set $\lambda$ and $\gamma$ according to statistics collected from the ranked samples (see the discussion in the supplementary materials). Following [6], the alternating search in Algorithm 2 converges, as the objective function is monotonically decreasing and bounded from below.
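Putting the pieces together, a sketch of Algorithm 2 in Python might look as follows, reusing `spld_select` from the previous snippet; `fit_model`, `sample_loss`, the group construction, and the size of the random initial selection are again our placeholders for task-specific choices:

```python
import numpy as np

def spld_train(X, y, groups, fit_model, sample_loss,
               lam=0.1, gamma=0.1, mu1=1.3, mu2=1.3, n_iters=10):
    """Alternating search for Eq. (3); `groups` maps group ids to sample indices."""
    n = len(y)
    v = np.zeros(n)
    v[np.random.choice(n, size=max(1, n // 10), replace=False)] = 1  # random start [6]
    model = None
    for _ in range(n_iters):                  # Step 3: iterate until converged
        mask = v > 0
        model = fit_model(X[mask], y[mask])   # Step 4: train on selected samples
        v = spld_select(sample_loss(model, X, y), groups, lam, gamma)  # Step 5
        lam, gamma = mu1 * lam, mu2 * gamma   # Step 6: update the learning pace
    return model
```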
5 Experiments

We present experimental results for the proposed SPLD on two tasks: event detection and action recognition. We demonstrate that our approach significantly outperforms SPL on three challenging real-world datasets. The code is available at http://www.cs.cmu.edu/~lujiang/spld.

SPLD is compared against four baseline methods: 1) Random Forest is a robust bootstrapping method that trains multiple decision trees using randomly selected samples and features [16]. 2) AdaBoost is a classical ensemble approach that combines sequentially trained "base" classifiers in a weighted fashion [18]; samples misclassified by one base classifier are given greater weight when training the next classifier in the sequence. 3) BatchTrain represents the standard training approach in which a model is trained on all samples simultaneously. 4) SPL is a state-of-the-art method that trains models gradually from easy to more complex samples [6]. The baselines are a mixture of well-known and state-of-the-art methods for training models on sampled data.

5.1 Multimedia Event Detection (MED)

Problem Formulation: Given a collection of videos, the goal of MED is to detect events of interest, e.g., "Birthday Party" and "Parade", solely based on the video content. The task is very challenging due to complex scenes, camera motion, occlusions, etc. [17, 19, 8].

Dataset: The experiments are conducted on the largest collection for event detection: TRECVID MED13Test, which consists of about 32,000 Internet videos. There are a total of 3,490 videos from 20 complex events, and the rest are background videos. For each event, 10 positive examples are given to train a detector, which is tested on about 25,000 videos. The official test split released by NIST (National Institute of Standards and Technology) is used [15].

Experimental setting: A Deep Convolutional Neural Network is trained on 1.2 million ImageNet challenge images from 1,000 classes [20] to represent each video as a 1,000-dimensional vector. Algorithm 2 is then applied. By default, the group membership is generated by spectral clustering, and the number of groups is set to 64. Following [9, 8], LibLinear is used as the solver in Step 4 of Algorithm 2 due to its robust performance on this task. Performance is evaluated using MAP, as recommended by NIST. The parameters of all methods are tuned on the same validation set.

Table 1 lists the overall MAP comparison. To reduce the influence of initialization, we repeated the experiments for SPL and SPLD 10 times with random starting values, and report the best run and the mean (with the 95% confidence interval) of the 10 runs. The proposed SPLD outperforms all baseline methods with statistically significant differences at the p-value level of 0.05, according to the paired t-test. It is worth emphasizing that MED is very challenging [15], and the 26% relative (2.5 absolute) improvement over SPL is a notable gain. SPLD outperforms the other baselines on both the best run and the 10-run average. Random Forest and AdaBoost yield poorer performance, which agrees with studies in the literature [15, 9] finding that SVM is more robust for event detection.

Table 1: MAP (×100) comparison with the baseline methods on MED.

|                | Random Forest | AdaBoost | BatchTrain | SPL        | SPLD       |
|----------------|---------------|----------|------------|------------|------------|
| Best Run       | 3.0           | 2.8      | 8.3        | 9.6        | 12.1       |
| 10-Run Average | 3.0           | 2.8      | 8.3        | 8.6 ± 0.42 | 9.8 ± 0.45 |

BatchTrain, SPL and SPLD are all performed using SVM. Regarding the best run, SPL boosts the MAP of BatchTrain by a relative 15.6% (absolute 1.3), and SPLD yields another relative 26% (absolute 2.5) over SPL. The MAP gain suggests that optimizing objectives with the diversity term is inclined to attain a better solution. Fig. 3 plots the validation and test AP on three representative events. As illustrated, SPLD attains a better solution within fewer iterations than SPL; e.g., in Fig. 3(a), SPLD obtains its best test AP (0.14) by 6 iterations, as opposed to AP 0.12 by 11 iterations in SPL.

[Figure 3: The validation and test AP across iterations for (a) E006: Birthday party, (b) E008: Flash mob gathering, and (c) E023: Dog show. The top row plots the SPL results and the bottom row the proposed SPLD results. The x-axis represents the training iteration. The blue solid curve (Dev AP) denotes the AP on the validation set, the red curve marked by squares (Test AP) denotes the AP on the test set, and the green dashed curve denotes the Test AP of BatchTrain, which remains the same across iterations.]
Studies [1, 6] have shown that SPL converges quickly, and this observation further suggests that SPLD may converge even faster. We hypothesize that this is because the diverse samples learned in the early iterations of SPLD tend to be more informative. The best Test APs of both SPL and SPLD are better than BatchTrain, which is consistent with the observation in [5] that removing some samples can be beneficial for training a better detector. As shown, Dev AP and Test AP share a similar pattern, justifying the rationale of tuning parameters on the validation set.

Fig. 4 plots the curricula generated by SPL and SPLD in the first few iterations on two representative events. As we see, SPL tends to select easy samples similar to what it has already learned, whereas SPLD selects samples that are both easy and diverse with respect to the model. For example, for the event "E006 Birthday Party", SPL keeps selecting indoor scenes due to the sample learned in the first place, whereas the samples learned by SPLD are a mixture of indoor and outdoor birthday parties. Both methods leave the complex samples to the last iterations, e.g., the 10th video in "E007".

[Figure 4: Comparison of positive samples used in each training iteration (Iter 1 through Iter 10) by (a) SPL and (b) SPLD, for the events E006: Birthday party (indoor vs. outdoor scenes) and E007: Changing a vehicle tire (bicycle/scooter vs. car/truck).]

5.2 Action Recognition

Problem Formulation: The goal is to recognize human actions in videos.

Datasets: Two representative datasets are used. Hollywood2 was collected from 69 different Hollywood movies [21]; it contains 1,707 videos belonging to 12 actions, split into a training set (823 videos) and a test set (884 videos). Olympic Sports consists of athletes practicing different sports, collected from YouTube [22]; there are 16 sports actions in 783 clips. We use 649 clips for training and 134 for testing, as recommended in [22].

Experimental setting: The improved dense trajectory features are extracted and further represented as Fisher vectors [23, 24]. A setting similar to that in Section 5.1 is applied, except that the groups are generated by K-means (K=128).

Table 2 lists the MAP comparison on the two datasets. A similar pattern can be observed: SPLD outperforms SPL and the other baseline methods with statistically significant differences.

Table 2: MAP (×100) comparison with the baseline methods on Hollywood2 and Olympic Sports.

|                | Random Forest | AdaBoost | BatchTrain | SPL   | SPLD  |
|----------------|---------------|----------|------------|-------|-------|
| Hollywood2     | 28.20         | 41.14    | 58.16      | 63.72 | 66.65 |
| Olympic Sports | 63.32         | 69.25    | 90.61      | 90.83 | 93.11 |

We then compare our MAP with the state-of-the-art MAP in Table 3. Admittedly, this comparison may be less fair, since different methods use different features. Nevertheless, with the help of SPLD, we are able to achieve the best MAP reported so far on both datasets. Note that the MAPs in Table 3 were obtained by recent and very competitive methods for action recognition. This improvement confirms the assumption that considering diversity in learning is instrumental.

Table 3: Comparison of SPLD to the state-of-the-art on Hollywood2 and Olympic Sports.

| Hollywood2             |        | Olympic Sports           |        |
|------------------------|--------|--------------------------|--------|
| Vig et al. 2012 [25]   | 59.4%  | Brendel et al. 2011 [28] | 73.7%  |
| Jiang et al. 2012 [26] | 59.5%  | Jiang et al. 2012 [26]   | 80.6%  |
| Jain et al. 2013 [27]  | 62.5%  | Gaidon et al. 2012 [29]  | 82.7%  |
| Wang et al. 2013 [23]  | 64.3%  | Wang et al. 2013 [23]    | 91.2%  |
| SPLD (ours)            | 66.65% | SPLD (ours)              | 93.11% |
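For reference, the group-generation preprocessing used in both experimental pipelines above can be sketched as follows, assuming scikit-learn and row-wise feature vectors; the paper specifies spectral clustering with 64 groups for MED and K-means with K=128 for the action datasets, but the exact clustering configuration (e.g., the spectral affinity) is our assumption. The sensitivity of SPLD to these choices is examined next.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def make_groups(X, method="kmeans", n_groups=128, seed=0):
    """Cluster feature vectors into groups for SPLD (group id -> sample indices)."""
    if method == "kmeans":
        labels = KMeans(n_clusters=n_groups, random_state=seed).fit_predict(X)
    else:
        labels = SpectralClustering(n_clusters=n_groups, affinity="nearest_neighbors",
                                    random_state=seed).fit_predict(X)
    return {g: np.flatnonzero(labels == g) for g in range(n_groups)}
```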
5.3 Sensitivity Study

We conduct experiments using different numbers of groups generated by two clustering algorithms: K-means and spectral clustering. Each experiment is fully tuned for the given number of groups and clustering algorithm, and the best run is reported in Table 4. The results suggest that SPLD is relatively insensitive to the clustering method and the given number of groups. We hypothesize that SPLD may not improve on SPL in cases where the assumption in Section 4.1 is violated and the given groups, e.g., random clusters, cannot reflect the latent variety in the data.

Table 4: MAP (×100) comparison of different clustering algorithms and numbers of clusters.

|                           | #Groups=32  | #Groups=64  | #Groups=128 | #Groups=256 |
|---------------------------|-------------|-------------|-------------|-------------|
| MED (K-means)             | 9.16 ± 0.31 | 9.20 ± 0.36 | 9.25 ± 0.32 | 9.03 ± 0.28 |
| MED (Spectral)            | 9.29 ± 0.42 | 9.79 ± 0.45 | 9.22 ± 0.41 | 9.38 ± 0.43 |
| Hollywood2 (K-means)      | 66.372      | 66.358      | 66.653      | 66.365      |
| Hollywood2 (Spectral)     | 66.639      | 66.504      | 66.264      | 66.709      |
| Olympic Sports (K-means)  | 91.86       | 92.37       | 93.11       | 92.65       |
| Olympic Sports (Spectral) | 91.08       | 92.51       | 93.25       | 92.54       |

6 Conclusion

We advanced the frontier of self-paced learning by proposing the novel idea of considering both easiness and diversity in learning. We introduced a non-convex regularization term that favors selecting both easy and diverse samples. The proposed regularization term is general and can be applied to various problems. We proposed a linearithmic algorithm that finds the global optimum of this non-convex problem when updating the samples to be included. Using three real-world datasets, we showed that the proposed SPLD outperforms state-of-the-art approaches. Possible directions for future work include studying diversity for samples in mixture models, e.g., mixtures of Gaussians, in which a sample is assigned to a mixture of clusters. Another possible direction is studying how to assign reliable starting values for SPL/SPLD.

Acknowledgments

This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. Deyu Meng was partially supported by the 973 Program of China (3202013CB329404) and the NSFC project (61373114). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. PAMI, 35(8):1798-1828, 2013.
[3] S. Basu and J. Christensen. Teaching classification boundaries to humans. In AAAI, 2013.
[4] V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Baby steps: How "Less is More" in unsupervised dependency parsing. In NIPS, 2009.
[5] A. Lapedriza, H. Pirsiavash, Z. Bylinskii, and A. Torralba. Are all training examples equally valuable? CoRR abs/1311.6510, 2013.
[6] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
[7] L. Jiang, D. Meng, T. Mitamura, and A. Hauptmann. Easy samples first: self-paced reranking for zero-example multimedia search. In MM, 2014.
[8] K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller.
Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
[9] Z. Lan, L. Jiang, S.-I. Yu, et al. CMU-Informedia at TRECVID 2013 multimedia event detection. In TRECVID, 2013.
[10] F. Khan, X. J. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In NIPS, 2011.
[11] M. P. Kumar, H. Turki, D. Preston, and D. Koller. Learning specific-class segmentation from diverse data. In ICCV, 2011.
[12] J. S. Supančič and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, 2013.
[13] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
[14] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49-67, 2006.
[15] P. Over, G. Awad, M. Michel, et al. TRECVID 2013 - an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID, 2013.
[16] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[17] L. Jiang, A. Hauptmann, and G. Xiang. Leveraging high-level and low-level features for multimedia event detection. In MM, 2012.
[18] J. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38(4):367-378, 2002.
[19] L. Jiang, T. Mitamura, S. Yu, and A. Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In ICMR, 2014.
[20] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[21] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[22] J. C. Niebles, C. W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
[23] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[24] Z. Lan, X. Li, and A. Hauptmann. Temporal extension of scale pyramid and spatial pyramid matching for action recognition. arXiv preprint arXiv:1408.7071, 2014.
[25] E. Vig, M. Dorr, and D. Cox. Space-variant descriptor sampling for action recognition based on saliency and eye movements. In ECCV, 2012.
[26] Y. G. Jiang, Q. Dai, X. Xue, W. Liu, and C. W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
[27] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.
[28] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011.
[29] A. Gaidon, Z. Harchaoui, and C. Schmid. Recognizing activities with cluster-trees of tracklets. In BMVC, 2012.