10,000+ Times Accelerated Robust Subset Selection (ARSS)

Feiyun Zhu, Bin Fan, Xinliang Zhu, Ying Wang, Shiming Xiang and Chunhong Pan
Institute of Automation, Chinese Academy of Sciences
{fyzhu, bfan, ywang, smxiang, chpan}@nlpr.ia.ac.cn, zhuxinliang2012@ia.ac.cn

Abstract

Subset selection from massive data with noisy information is increasingly popular for various applications. The problem remains highly challenging, as current methods are generally slow and sensitive to outliers. To address these two issues, we propose an accelerated robust subset selection (ARSS) method. Specifically, in the subset selection area, this is the first attempt to employ the ℓp (0 < p ≤ 1)-norm based measure for the representation loss, preventing large errors from dominating our objective. As a result, the robustness against outlier elements is greatly enhanced. In practice, data size is generally much larger than feature length, i.e. N ≫ L. Based on this observation, we propose a speedup solver (via ALM and equivalent derivations) that greatly reduces the computational cost, theoretically from O(N⁴) to O(N²L). Extensive experiments on ten benchmark datasets verify that our method not only outperforms state-of-the-art methods, but also runs 10,000+ times faster than the most related method.

Introduction

Due to the explosive growth of data (Wang, Kumar, and Chang 2012), subset selection methods are increasingly popular for a wide range of machine learning and computer vision applications (Frey and Dueck 2007; Jenatton, Audibert, and Bach 2011). These methods offer the potential to select a few highly representative samples, or exemplars, to describe the entire dataset. By analyzing a few, we can roughly know all. This is very useful for summarizing and visualizing huge datasets of texts, images, videos, etc. (Bien and Tibshirani 2011; Elhamifar et al. 2012b). Besides, by only using the selected exemplars for succeeding tasks, the cost of memory and computation is greatly reduced (Garcia et al. 2012). Additionally, as outliers are generally less representative, their side effect is reduced, which boosts the performance of subsequent applications (Elhamifar et al. 2012a).

There have been several subset selection methods. The most intuitive method is to randomly select a fixed number of samples. Although highly efficient, it gives no guarantee of an effective selection. For the other methods, depending on how representative exemplars are defined, there are mainly three categories of selection methods.

Figure 1: Comparisons of four algorithms on Optdigit. Two conclusions can be drawn. First, our method (ARSSour) is much faster than all the others; with the help of an elegant new theorem, RRSSour is significantly faster than the authors' original algorithm RRSSNie. Second, ARSSour achieves highly promising prediction accuracies.

One category relies on the assumption that the data points lie in one or multiple low-dimensional subspaces. Specifically, the Rank Revealing QR (RRQR) method (Chan 1987; Boutsidis, Mahoney, and Drineas 2009) selects the subset that gives the best-conditioned sub-matrix. Unfortunately, this method has suboptimal properties, as it is not guaranteed to find the global optimum in polynomial time.
Another category assumes that the samples are distributed around centers (Frey and Dueck 2007; Liu et al. 2010). The centers, or their nearest neighbours, are selected as exemplars. Kmeans and Kmedoids are perhaps the most typical such methods (Kmedoids is a variant of Kmeans). Both employ an EM-like algorithm; thus the results depend heavily on the initialization, and they are highly unstable for large K (i.e. the number of centers or selected samples).

Recently, a few methods assume that exemplars are the samples that can best represent the whole dataset. However, for (Yu, Bi, and Tresp 2006), the optimization is a combinatorial (NP-hard) problem (Nie et al. 2013; Yu et al. 2008), which is computationally intractable. Besides, the representation loss is measured by the least squares measure, which is sensitive to outliers in the data (Wang et al. 2014; Zhu et al. 2014; Nie et al. 2013). (Nie et al. 2013) improves (Yu, Bi, and Tresp 2006) by employing a robust loss via the ℓ2,1-norm: the ℓ1-norm is applied across samples, and the ℓ2-norm across features. In this way, the side effect of outlier samples is relieved. The solver of (Nie et al. 2013) is theoretically appealing due to its guaranteed convergence to the global optimum. Unfortunately, in terms of computational cost, the solver is highly complex: it takes O(N⁴) per iteration, as shown in Table 1. This is infeasible for large N (e.g. it takes 2000+ hours for a case of N = 13000). Moreover, the representation loss is only robust against outlier samples. This is worth improving, as there may also exist outlier elements in real data.

Contributions. In this paper, we propose an accelerated robust subset selection method that greatly raises the speed on the one hand, and boosts the robustness on the other. To this end, we use the ℓp (0 < p ≤ 1)-norm based loss to measure the representation error, so that large errors caused by outlier elements cannot dominate the objective; the intuition is illustrated in the sketch below.
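As a quick illustration of why an ℓp (p ≤ 1) loss resists outlier elements better than a squared loss, consider a toy residual vector containing one large error. The numbers below are made up for illustration and are not from the paper's experiments:

```python
import numpy as np

residuals = np.array([0.1, 0.2, 0.1, 5.0])   # one outlier element among small errors

l2_loss = np.sum(residuals ** 2)              # squared-l2 loss: the outlier dominates
lp_loss = np.sum(np.abs(residuals) ** 0.5)    # lp loss with p = 0.5: contributions are balanced

print(l2_loss, residuals[-1] ** 2 / l2_loss)      # ~25.06, outlier share ~0.998
print(lp_loss, residuals[-1] ** 0.5 / lp_loss)    # ~3.32,  outlier share ~0.67
```

Under the squared loss the single outlier accounts for nearly the whole objective, so minimizing it bends the fit toward the outlier; under the ℓp loss its share is far smaller, which is exactly the robustness the ARSS objective aims for.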

Nie's robust subset selection (RRSS) algorithm minimizes the smoothed objective
$$\sum_{n=1}^{N}\sqrt{\|x_n - X a_n\|_2^2 + \epsilon} \;+\; \gamma \sum_{n=1}^{N}\sqrt{\|a_n\|_2^2 + \epsilon}.$$
When ε → 0, this objective reduces to the objective (3).

Remark 1. Since $U_{nn}X^{T}X + \gamma V \in \mathbb{R}^{N\times N}$, the major computational cost of (4) lies in an N×N linear system. If solved by Cholesky factorization, it costs roughly $\tfrac{1}{3}N^{3}$ for the factorization plus $2N^{2}$ for the forward and backward substitutions, i.e. O(N³) in total. This only solves a single aₙ; solving the whole set $\{a_n\}_{n=1}^{N}$ amounts to O(N⁴) per iteration step.

Accelerated Robust Subset Selection (ARSS)

Due to the huge computational cost, Nie's method is infeasible for large N: the computational time is up to 2088 hours for a case of 13,000 samples. Besides, Nie's model (3) imposes the ℓ2-norm across features, which is prone to outliers in features. To tackle these two issues, we propose a more robust model in the ℓp (0 < p ≤ 1)-norm. Although the resulting objective is challenging to solve, a speedup algorithm is proposed that dramatically saves computational cost. For the same task of N = 13,000, our method takes 1.8 minutes, a 68,429 times acceleration over Nie's method.

Modeling. To boost the robustness against outliers in both samples and features, we formulate the discrepancy between X and XA via the ℓp (0 < p ≤ 1)-norm.

Algorithm 1 for the subproblem on A: A = ARSS_A(X, V, P, I_L, β)
4: if N > L then
5:   update A via the updating rule (15), that is
6:   $A = B\,(I_L + XB)^{-1}P$, where $B = \beta\,(XV^{-1})^{T}$.
7: end if
Output: A

Algorithm 2 for (5) or (8): A = ARSS_ALM(X, γ, p)
Input: X, γ, p
1: Initialize μ > 0, 1 < ρ < 2, ε = 10⁻¹⁰, A = I_N, Λ = 0.
2: repeat
3:   update E by the updating rule (12).
4:   update V = [V_nn] ∈ ℝ^{N×N}.
5:   P = X − E − Λ/γ; I_L is an L×L identity matrix.
6:   A = ARSS_A(X, V, P, I_L, β) via Algorithm 1.
7:   update Λ by the updating rule (9); μ ← ρμ.
8: until convergence
Output: A

The solver to update A is given in Algorithm 1, and the overall solver for our model (5) is summarized in Algorithm 2 (a schematic sketch of this loop is given after Corollary 5 below). According to Theorem 2 and Corollary 3, the solver for our model (13) is greatly simplified, as the feature length is generally much smaller than the data size, i.e. L ≪ N. Similarly, Nie's method can be greatly accelerated by Theorem 4, obtaining a 500+ times speedup, as shown in Fig. 2 and Table 3.

Theorem 4. Nie's N×N solver (20) (Nie et al. 2013) is equivalent to the following L×L linear system (21):
$$a_n = U_{nn}\left(U_{nn}X^{T}X + \gamma V\right)^{-1}X^{T}x_n \quad (20)$$
$$\phantom{a_n} = U_{nn}\left(XV^{-1}\right)^{T}\left(U_{nn}X(XV^{-1})^{T} + \gamma I_L\right)^{-1}x_n \quad (21)$$
for all n ∈ {1, 2, …, N}, where I_L is an L×L identity matrix.

Proof. Since V is positive diagonal² and hence invertible, we have
$$\left(U_{nn}X^{T}X + \gamma V\right)V^{-1}X^{T}\left(U_{nn}XV^{-1}X^{T} + \gamma I_L\right)^{-1} = X^{T}\left(U_{nn}XV^{-1}X^{T} + \gamma I_L\right)\left(U_{nn}XV^{-1}X^{T} + \gamma I_L\right)^{-1} = X^{T},$$
so $\left(U_{nn}X^{T}X + \gamma V\right)^{-1}X^{T} = (XV^{-1})^{T}\left(U_{nn}X(XV^{-1})^{T} + \gamma I_L\right)^{-1}$. Substituting this into (20) yields (21). The two derivations are equivalent; their results are equal.

² V ∈ ℝ^{N×N} is a positive diagonal matrix whose n-th diagonal entry is $V_{nn} = 1/\sqrt{\|a_n\|_2^2 + \epsilon} > 0$, where ε is a small value to avoid singular failures (Nie et al. 2013; Zhu et al. 2014).

Figure 2: Speed vs. increasing N on (a) Letter, (b) MNIST and (c) Waveform. Compared with the authors' original solvers TED and RRSSNie, our methods ARSS and RRSSour dramatically reduce the computational time; the larger the data size, the larger the gap between the methods. Note that the selection time is not sensitive to the number of selected samples K. (best viewed in color)

Table 2: Statistics of ten benchmark datasets.

Corollary 5. Since the feature length is generally much smaller than the data size, i.e. L ≪ N, our accelerated solver (21) for Nie's model (3) is much faster than the authors' original solver (20).
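To make the loop structure of Algorithm 2 (with the A-update of Algorithm 1 as its inner step) concrete, here is a minimal Python sketch. It is schematic only: the paper's updating rules (9), (12) and (15) are not reproduced above in full, so this sketch substitutes plain ℓ1 soft-thresholding for the E-update (the p = 1 special case), assumes β = μ/γ and P = X − E − Λ/μ, and uses a simple smoothed row-norm reweighting for V. None of these choices should be read as the authors' exact rules.

```python
import numpy as np

def arss_alm_sketch(X, gamma=0.1, mu=1.0, rho=1.5, n_iter=50, eps=1e-10):
    """Schematic ALM loop in the spirit of Algorithm 2 (not the paper's exact rules)."""
    L, N = X.shape
    A = np.eye(N)                       # A = I_N, as in the initialization of Algorithm 2
    Lam = np.zeros((L, N))              # Lagrange multiplier for the constraint XA - X + E = 0
    I_L = np.eye(L)
    for _ in range(n_iter):
        # E-step: stand-in for updating rule (12); soft-thresholding is the p = 1 case
        R = X - X @ A - Lam / mu
        E = np.sign(R) * np.maximum(np.abs(R) - 1.0 / mu, 0.0)
        # reweighting matrix V (one common smoothed choice; the paper's rule may differ)
        v = 1.0 / np.sqrt(np.sum(A * A, axis=1) + eps)
        # A-step: the L x L solve of step 6 in Algorithm 1, assuming beta = mu / gamma
        P = X - E - Lam / mu            # assumed form of P
        beta = mu / gamma
        B = beta * (X / v).T            # B = beta * (X V^{-1})^T, exploiting that V is diagonal
        A = B @ np.linalg.solve(I_L + X @ B, P)
        # multiplier and penalty updates (rule (9), schematically); mu <- rho * mu
        Lam = Lam + mu * (X @ A - X + E)
        mu = rho * mu
    return A

X = np.random.default_rng(1).standard_normal((20, 200))
A = arss_alm_sketch(X)
scores = np.linalg.norm(A, axis=1)      # rank candidates by row norms of A, as is common
print(np.argsort(-scores)[:5])          # indices of the 5 highest-scoring samples
```

The key point the sketch preserves is step 6 of Algorithm 1: A is obtained through an L×L solve, so the per-iteration cost scales with the feature length L rather than with the data size N.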
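The identity behind Theorem 4 (a push-through/Woodbury-type identity) is easy to check numerically. The sketch below uses random data and arbitrary positive weights, not the paper's datasets, and solves one coefficient vector aₙ both through the N×N system (20) and through the L×L system (21):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 50, 2000                         # feature length L is much smaller than data size N
X = rng.standard_normal((L, N))
v = rng.uniform(0.5, 2.0, size=N)       # diagonal of V (positive weights)
gamma, U_nn = 0.1, 1.3                  # regularizer and the residual weight of one sample
x_n = X[:, 0]

# (20): original N x N solve, O(N^3) per sample
V = np.diag(v)
a_slow = U_nn * np.linalg.solve(U_nn * X.T @ X + gamma * V, X.T @ x_n)

# (21): equivalent L x L solve; X / v realizes X @ V^{-1} since V is diagonal
XVinv = X / v
a_fast = U_nn * XVinv.T @ np.linalg.solve(U_nn * X @ XVinv.T + gamma * np.eye(L), x_n)

print(np.allclose(a_slow, a_fast))      # True: both routes give the same coefficient vector
```

For L ≪ N the second route is far cheaper: forming XV⁻¹X^T costs O(NL²) once, each L×L solve costs O(L³), and mapping back through (XV⁻¹)^T costs O(NL) per sample, which is where the reduction from O(N⁴) to O(N²L + NL³) per iteration comes from.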
Theoretically, we reduce the computational complexity from O(N⁴) to O(N²L + NL³), while maintaining the same solution. That is, like Nie's solver (20), our speedup solver (21) can reach the global optimum. Extensive empirical results verify the huge acceleration.

Experiments

Experimental Settings

All experiments are conducted on a server with a 64-core Intel Xeon E7-4820 @ 2.00 GHz, 18 MB cache and 0.986 TB RAM, using Matlab 2012. Brief descriptions of the ten benchmark datasets are summarized in Table 2, where Total(N) denotes the total number of samples in each dataset. Due to their high computational complexity, the other methods can only handle small datasets (while our method can handle the total set). Thus, we randomly choose a candidate set from the total set to reduce the sample size (cf. the Total(N) and candid.(N) columns in Table 2); the remaining samples, outside the candidate set, are used for testing. To simulate varying sample quality, ten percent of the candidate samples from each class are randomly selected and corrupted with one of three kinds of noise: Gaussian, Laplace, or salt & pepper. In short, all experimental settings are identical and fair for all methods.

Speed Comparisons

There are two parts to the speed comparison: first, how speed varies with increasing N is illustrated in Fig. 2; then the specific running times are summarized in Table 3. Note that TED and RRSSNie denote the authors' original solvers (run via their released code), RRSSour is our accelerated solver for Nie's model via Theorem 4, and ARSS is the proposed method.

Speed vs. increasing N. To verify the great superiority of our method over the state-of-the-art methods in speed, three experiments are conducted. The results are illustrated in Fig. 2, whose three sub-figures show the speed of the four methods on the Letter, MNIST and Waveform benchmarks respectively. As we can see, the selection time of both TED (Yu, Bi, and Tresp 2006) and RRSSNie (Nie et al. 2013) increases dramatically as N increases. Not surprisingly, RRSSNie becomes extremely time-consuming as N grows; the order of its curve looks higher than quadratic. Indeed, the theoretical complexity of RRSSNie is as high as O(N⁴), as analyzed in Remark 1. Compared with TED and RRSSNie, the curve of ARSS is strikingly lower and highly stable against increasing N; there is almost no rise in selection time as N grows. This is owing to the speedup techniques of ALM and equivalent derivations, via which we reduce the computational cost from O(N⁴) to O(N²L), as analyzed in Theorem 2 and Corollary 3. Moreover, with the help of Theorem 4, RRSSour is the second fastest algorithm, significantly accelerated compared with the authors' original algorithm RRSSNie.

Speed with fixed N. The speed of the four algorithms is summarized in Table 3a, where each row shows the results on one dataset and the last row the average. Four conclusions can be drawn from Table 3a. First, ARSS is the fastest algorithm, and RRSSour is the second fastest. Second, with the help of Theorem 4, RRSSour is much faster than RRSSNie, obtaining a 559 times acceleration on average. Third, ARSS is dramatically faster than RRSSNie and TED; the results in Table 3a verify an average acceleration of 23,275 times over RRSSNie and 281 times over TED.
This means, for example, that if it takes RRSSNie 100 years to do a subset selection task, it takes our method only 1.6 days to address the same problem. Finally, we apply ARSS to the whole sample set of each dataset. The results are displayed in the 6th column of Table 3, showing its capability to process very large datasets.

Table 3: Performance of TED, RRSS and ARSS: (left, a) speed in seconds; (right, b) prediction accuracies. In terms of speed, with the help of Theorem 4, RRSSour is on average 559+ times faster than the authors' original algorithm RRSSNie; ARSS achieves a striking 23,275+ times acceleration compared with RRSSNie. Due to the more robust ℓp-norm loss, the prediction accuracy of ARSS is highly encouraging. ARSS(N) denotes the task of selecting samples from the whole dataset (with the sample counts shown in the 2nd column of Table 2), while TED to ARSS denote the problem of dealing with the candidate sample sets (with the sample counts shown in the 3rd column of Table 2).

Prediction Accuracy

Accuracy comparison. We conduct experiments on ten benchmark datasets. For each dataset, the top 200 representative samples are selected for training. The prediction accuracies are reported in Table 3b, including the results of two popular classifiers. Three observations can be drawn from this table. First, Linear SVM generally outperforms KNN. Second, our method generally performs the best; in a few cases, it achieves results comparable with the best performance. Third, compared with TED, both RRSS and ARSS achieve an appreciable advantage. These observations are further illustrated in the last row of Table 3b. The results demonstrate that the ℓp loss in our model is well suited to selecting exemplars from sample sets of varying quality.

Prediction accuracies vs. increasing K. To give a more detailed comparison, Fig. 3 shows the prediction accuracies versus growing K (the number of selected samples). There are two rows and four columns of sub-figures: the top row shows the results of KNN, and the bottom one the results of SVM; each column gives the result on one dataset.

Figure 3: Accuracies vs. increasing K (the number of selected samples). There are two rows and four columns of sub-figures: the top row shows the prediction accuracies of KNN, and the bottom shows the results of Linear SVM; each column shows the performance on one dataset, namely Diabetes, Vehicle, Coil20 and Waveform respectively. Generally, ARSS (ours) is among the best. (best viewed in color)

As we can see, the prediction accuracies generally increase as K increases. This is consistent with the common view that more training data boosts prediction accuracy. In each sub-figure, ARSS is generally among the best, which implies that our robust objective (5) via the ℓp-norm is well suited to selecting subsets from data of varying quality.

Conclusion

To deal with tremendous data of varying quality, we propose an accelerated robust subset selection (ARSS) method. The ℓp-norm is exploited to enhance the robustness against both outlier samples and outlier features. Although the resulting objective is complex to solve, we propose a highly efficient solver via two techniques: ALM and equivalent derivations. Via them, we greatly reduce the computational complexity from O(N⁴) to O(N²L), where the feature length L is much smaller than the data size N, i.e. L ≪ N. Extensive results on ten benchmark datasets verify that our method not only runs 10,000+ times faster than the most related method, but also outperforms state-of-the-art methods. Moreover, we propose an accelerated solver that greatly speeds up Nie's method, theoretically reducing its computational complexity from O(N⁴) to O(N²L + NL³). Empirically, our accelerated solver achieves identical results with a 500+ times acceleration compared with the authors' original solver.

Limitation.
Our efficient algorithm builds on the observation that the number of samples is generally larger than the feature length, i.e. N > L. For the case of N ≤ L, the acceleration will be negligible.

Acknowledgements

The authors would like to thank the editor and the reviewers for their valuable suggestions. This work is supported by projects of the National Natural Science Foundation of China (Grant No. 61272331, 91338202, 61305049 and 61203277).

References

Bien, J., and Tibshirani, R. 2011. Prototype selection for interpretable classification. Annals of Applied Statistics 5(4):2403–2424.
Boutsidis, C.; Mahoney, M. W.; and Drineas, P. 2009. An improved approximation algorithm for the column subset selection problem. In SODA, 968–977.
Chan, T. F. 1987. Rank revealing QR factorizations. Linear Algebra and its Applications 88–89:67–82.
Elhamifar, E.; Sapiro, G.; and Vidal, R. 2012a. See all by looking at a few: Sparse modeling for finding representative objects. In IEEE CVPR, 1600–1607.
Elhamifar, E.; Sapiro, G.; and Vidal, R. 2012b. Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. In NIPS, 19–27.
Frey, B. J., and Dueck, D. 2007. Clustering by passing messages between data points. Science 315(5814):972–976.
Garcia, S.; Derrac, J.; Cano, J. R.; and Herrera, F. 2012. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3):417–435.
Jenatton, R.; Audibert, J.-Y.; and Bach, F. 2011. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research (JMLR) 12:2777–2824.
Levin, A.; Lischinski, D.; and Weiss, Y. 2008. A closed-form solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell. 30(2):228–242.
Li, C. 2011. Compressive Sensing for 3D Data Processing Tasks: Applications, Models and Algorithms. Ph.D. Dissertation, Houston, TX, USA. AAI3524544.
Liu, G.; Lin, Z.; and Yu, Y. 2010. Robust subspace segmentation by low-rank representation. In ICML, 663–670.
Meng, G.; Wang, Y.; Duan, J.; Xiang, S.; and Pan, C. 2013. Efficient image dehazing with boundary constraint and contextual regularization. In ICCV, 617–624.
Nie, F.; Wang, H.; Cai, X.; Huang, H.; and Ding, C. 2012. Robust matrix completion via joint Schatten p-norm and ℓp-norm minimization. In IEEE ICDM, 566–574.
Nie, F.; Wang, H.; Huang, H.; and Ding, C. H. Q. 2013. Early active learning via robust representation and structured sparsity. In IJCAI, 1572–1578.
Nie, F.; Huang, Y.; Wang, X.; and Huang, H. 2014. New primal SVM solver with linear computational cost for big data classifications. In ICML.
Nocedal, J., and Wright, S. J. 2006. Numerical Optimization. New York: Springer, 2nd edition.
Wang, H.; Nie, F.; and Huang, H. 2014. Robust distance metric learning via simultaneous ℓ1-norm minimization and maximization. In ICML, 1836–1844.
Wang, J.; Kumar, S.; and Chang, S.-F. 2012. Semi-supervised hashing for large-scale search. IEEE Trans. Pattern Anal. Mach. Intell. 34.
Wright, J.; Ganesh, A.; Rao, S.; Peng, Y.; and Ma, Y. 2009. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In NIPS, 2080–2088.
Yu, K.; Zhu, S.; Xu, W.; and Gong, Y. 2008. Non-greedy active learning for text categorization using convex transductive experimental design. In SIGIR, 1081–1088.
Yu, K.; Bi, J.; and Tresp, V. 2006. Active learning via transductive experimental design. In ICML, 1081–1088.
Zhu, F.; Wang, Y.; Fan, B.; Meng, G.; and Pan, C. 2014. Effective spectral unmixing via robust representation and learning-based sparsity. arXiv:1409.0685.
Zuo, W.; Meng, D.; Zhang, L.; Feng, X.; and Zhang, D. 2013. A generalized iterated shrinkage algorithm for non-convex sparse coding. In IEEE ICCV, 217–224.