# Noise-Robust Semi-Supervised Learning by Large-Scale Sparse Coding

Zhiwu Lu1, Xin Gao2, Liwei Wang3, Ji-Rong Wen1, and Songfang Huang4

1 School of Information, Renmin University of China, Beijing 100872, China
2 Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Jeddah 23955, Saudi Arabia
3 Key Laboratory of Machine Perception (MOE), School of EECS, Peking University, Beijing 100871, China
4 IBM China Research Lab, Beijing, China

{zhiwu.lu, xin4gao, wangliwei.pku, jirong.wen}@gmail.com, huangsf@cn.ibm.com

## Abstract

This paper presents a large-scale sparse coding algorithm to deal with the challenging problem of noise-robust semi-supervised learning over very large data with only a few noisy initial labels. By giving an L1-norm formulation of Laplacian regularization defined directly upon the manifold structure of the data, we transform noise-robust semi-supervised learning into a generalized sparse coding problem so that noise reduction can be imposed upon the noisy initial labels. Furthermore, to keep noise-robust semi-supervised learning scalable to very large data, we make use of both nonlinear approximation and dimension reduction techniques to solve this generalized sparse coding problem in linear time and space complexity. Finally, we evaluate the proposed algorithm on the challenging task of large-scale semi-supervised image classification with only a few noisy initial labels. The experimental results on several benchmark image datasets show the promising performance of the proposed algorithm.

## Introduction

Semi-supervised learning has been widely applied to many challenging image analysis tasks (Lu, Zhao, and Cai 2006; Xu and Yan 2009; Lu and Ip 2010; Fergus, Weiss, and Torralba 2010; Tang et al. 2009; Liu et al. 2009) such as image representation, image classification, and image annotation. In these tasks, the manual labeling of training data is often tedious, subjective, and expensive, while access to unlabeled data is much easier. By exploiting the large amount of unlabeled data under reasonable assumptions, semi-supervised learning (Blum and Mitchell 1998; Zhu, Ghahramani, and Lafferty 2003; Zhou et al. 2004) has been shown to reduce the need for expensive labeled data and to achieve promising results on different challenging image analysis tasks.

Among the various semi-supervised learning methods, one influential line of work is graph-based semi-supervised learning (Zhu, Ghahramani, and Lafferty 2003; Zhou et al. 2004), which models the entire dataset as a graph. The basic idea behind this type of semi-supervised learning is label propagation over the graph under the cluster consistency assumption (Zhou et al. 2004). Since the graph is at the heart of graph-based semi-supervised learning, graph construction has been studied extensively (Wang and Zhang 2008; Yan and Wang 2009; Cheng et al. 2010; Zhuang et al. 2012; Sun, Hussain, and Shawe-Taylor 2014). However, these graph construction methods are not developed directly for noise reduction, and the corresponding semi-supervised learning still suffers from significant performance degradation due to the inaccurate labeling of data points commonly encountered in different image analysis tasks.
Moreover, the labeling of images may be contributed by a community (e.g., Flickr), in which case we can only obtain noisy labels for the images.

In this paper, we focus on developing a novel noise-robust semi-supervised learning algorithm to deal with the challenging problem of semi-supervised learning with noisy initial labels. As summarized in (Wang and Zhang 2008), graph-based semi-supervised learning can be formulated as a quadratic optimization problem based on Laplacian regularization (Zhu, Ghahramani, and Lafferty 2003; Zhou et al. 2004; Lu and Peng 2013a). Considering the success of L1-norm optimization for noise reduction (Elad and Aharon 2006; Mairal, Elad, and Sapiro 2008; Wright et al. 2009), if we give a new L1-norm formulation of Laplacian regularization, we can obtain noise-robust L1-norm semi-supervised learning. Fortunately, starting from the eigenvalue decomposition of the normalized Laplacian matrix L, we can represent L in a symmetric decomposition form and then define L1-norm Laplacian regularization. Since all the eigenvectors of L are used in this symmetric decomposition, our L1-norm Laplacian regularization is defined directly upon the manifold structure of the data. By limiting the solution of semi-supervised learning to the space spanned by the eigenvectors of L, our noise-robust semi-supervised learning is formulated as a generalized sparse coding problem.

To keep the algorithm scalable to very large data, we utilize both nonlinear approximation and dimension reduction techniques to solve this generalized sparse coding problem. More specifically, we first construct a large graph over the data based only on a limited number of cluster centers (obtained by k-means clustering). Due to the special definition of this graph, we can find the eigenvectors of L (to be used in the generalized sparse coding problem) in linear time complexity. For further efficiency, we choose to work with only a small subset of eigenvectors when solving the generalized sparse coding problem. By combining nonlinear approximation and dimension reduction, we develop a large-scale sparse coding algorithm of linear time and space complexity. The proposed algorithm is then applied to the challenging task of semi-supervised image classification over very large data with only a few noisy initial labels. Although there exist other large-scale semi-supervised learning algorithms (Zhang, Kwok, and Parvin 2009; Liu, He, and Chang 2010; Fergus, Weiss, and Torralba 2010) in the literature, they were not originally developed to deal with noisy initial labels.

To emphasize the main contributions of this paper, we summarize the distinct advantages of our new semi-supervised learning algorithm as follows:

- To our knowledge, this is the first work to deal with the challenging problem of semi-supervised learning over very large data with only a few noisy initial labels.
- Our semi-supervised learning algorithm, developed based on large-scale sparse coding, has been shown to achieve promising results in semi-supervised image classification over very large data with only a few noisy initial labels.
- Our new L1-norm Laplacian regularization can be similarly applied to many other problems in machine learning and pattern recognition, considering the wide use of Laplacian regularization in the literature.

The remainder of this paper is organized as follows.
In Section 2, we develop a large-scale sparse coding (LSSC) algorithm for semi-supervised learning over very large data with only a few noisy initial labels. In Section 3, the proposed LSSC algorithm is applied to large-scale noise-robust image classification. Sections 4 and 5 present our experimental results and conclusions, respectively.

## Noise-Robust Semi-Supervised Learning

In this section, we first give the problem formulation for noise-robust semi-supervised learning by defining a new L1-norm Laplacian regularization. Moreover, we extend our noise-robust semi-supervised learning to large-scale problems by exploiting nonlinear approximation and dimension reduction techniques. Finally, we develop a large-scale sparse coding algorithm to solve the challenging problem of noise-robust semi-supervised learning over very large data.

### Problem Formulation

We first introduce the semi-supervised learning problem as follows. Here, we only consider the two-class problem; the multi-class problem will be discussed in Section 3. Given a dataset $X = \{x_1, ..., x_l, x_{l+1}, ..., x_n\}$ and a label set $\{1, -1\}$, the first $l$ data points $x_i$ ($i \le l$) are labeled as $y_i \in \{1, -1\}$, and the remaining data points $x_u$ ($l+1 \le u \le n$) are unlabeled with $y_u = 0$. The goal of semi-supervised learning is to predict the labels of the unlabeled data points, i.e., to find a vector $f = [f_1, ..., f_n]^T$ corresponding to a classification on $X$ by labeling each $x_i$ with the label $\mathrm{sign}(f_i)$, where $\mathrm{sign}(\cdot)$ denotes the sign function. Let $y = [y_1, ..., y_n]^T$; we can observe that $y$ is consistent with the initial labels according to the decision rule.

We further model the whole dataset $X$ as a graph $G = \{V, W\}$ with vertex set $V = X$ and weight matrix $W = [w_{ij}]_{n \times n}$, where $w_{ij}$ denotes the similarity between data points $x_i$ and $x_j$. It should be noted that the weight matrix $W$ is usually assumed to be nonnegative and symmetric. For example, we can define the weight matrix $W$ as

$$w_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)), \quad (1)$$

where the variance $\sigma$ is a free parameter that can be determined empirically. In fact, to eliminate the need to tune this parameter, we can adopt one of the many graph construction methods (Wang and Zhang 2008; Yan and Wang 2009; Cheng et al. 2010; Zhuang et al. 2012; Sun, Hussain, and Shawe-Taylor 2014) that have been developed in the literature. Derived from the weight matrix $W$, the normalized Laplacian matrix $L$ of the graph $G$ can be computed by

$$L = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}, \quad (2)$$

where $I$ is an $n \times n$ identity matrix, and $D$ is an $n \times n$ diagonal matrix with its $i$-th diagonal element equal to the sum of the $i$-th row of $W$ (i.e., $\sum_j w_{ij}$).

Since Laplacian regularization plays an important role in graph-based semi-supervised learning, we then provide a symmetric decomposition of the normalized Laplacian matrix $L$. More concretely, as a nonnegative definite matrix, $L$ can be decomposed into

$$L = V \Sigma V^T, \quad (3)$$

where $V$ is an $n \times n$ orthonormal matrix with each column being an eigenvector of $L$, and $\Sigma$ is an $n \times n$ diagonal matrix with its diagonal element $\Sigma_{ii}$ being an eigenvalue of $L$ (sorted as $0 \le \Sigma_{11} \le \cdots \le \Sigma_{nn}$). Derived from the above eigenvalue decomposition, we can write $L$ in a symmetric decomposition form:

$$L = (\Sigma^{\frac{1}{2}} V^T)^T \Sigma^{\frac{1}{2}} V^T = B^T B, \quad (4)$$

where $B = \Sigma^{\frac{1}{2}} V^T$. Since $B$ is computed with all the eigenvectors of $L$, we can regard $B$ as being explicitly defined based upon the manifold structure of the data. We can directly utilize $B = \Sigma^{\frac{1}{2}} V^T$ to define a new L1-norm smoothness measure, instead of the traditional smoothness measure used as Laplacian regularization for graph-based semi-supervised learning.
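For concreteness, the small NumPy sketch below computes the quantities defined above: $W$ from Eq. (1), the normalized Laplacian $L$ from Eq. (2), and the factor $B = \Sigma^{\frac{1}{2}} V^T$ from Eqs. (3)-(4). The function names are ours and purely illustrative; the sketch uses dense matrices and a full eigendecomposition, so it is only meant for small data, whereas the paper's large-scale approximations are introduced later.

```python
import numpy as np

def gaussian_weights(X, sigma=1.0):
    """Weight matrix W from Eq. (1): w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1, keepdims=True)
                + np.sum(X**2, axis=1) - 2.0 * X @ X.T)
    W = np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))
    # Zeroing the diagonal (no self-loops) is a common convention;
    # Eq. (1) alone would give w_ii = 1.
    np.fill_diagonal(W, 0.0)
    return W

def symmetric_decomposition(W):
    """Normalized Laplacian L of Eq. (2) and the factor B with L = B^T B (Eq. 4)."""
    d = W.sum(axis=1)                          # degree of each vertex
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}; assumes no isolated vertices
    L = np.eye(W.shape[0]) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigendecomposition L = V Sigma V^T (Eq. 3); eigh sorts eigenvalues ascending.
    evals, V = np.linalg.eigh(L)
    evals = np.clip(evals, 0.0, None)          # guard against tiny negative round-off
    B = np.diag(np.sqrt(evals)) @ V.T          # B = Sigma^{1/2} V^T
    return L, V, evals, B
```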
In spectral graph theory, the smoothness of a vector $f \in \mathbb{R}^n$ is measured by $\Omega(f) = f^T L f$. Different from this traditional smoothness measure, in this paper the L1-norm smoothness of a vector $f \in \mathbb{R}^n$ is measured by $\Omega(f) = \|Bf\|_1$, just as in our recent work (Lu and Peng 2012; 2013b; Lu and Wang 2015). Considering the success of L1-norm optimization for noise reduction (Elad and Aharon 2006; Mairal, Elad, and Sapiro 2008; Wright et al. 2009), we then define the objective function of our noise-robust semi-supervised learning as follows:

$$Q(f) = \frac{1}{2}\|f - y\|_2^2 + \lambda \|Bf\|_1. \quad (5)$$

The first term of $Q(f)$ is the fitting constraint, while the second term is the L1-norm smoothness constraint used as Laplacian regularization. Here, the fitting constraint is not formulated as an L1-norm term. The reason is that, otherwise, most elements of $f$ would tend to zero (i.e., become sparse) when solving $\min_f \|f - y\|_1 + \lambda \|Bf\|_1$, given that $y$ has very few nonzero elements (i.e., very few initially labeled data points are usually provided for semi-supervised learning). In other words, the labels would hardly be propagated across the entire dataset, which conflicts with the original goal of semi-supervised learning. Hence, the fitting constraint of $Q(f)$ remains an L2-norm term.

It is worth noting that our L1-norm formulation of Laplacian regularization plays an important role in explaining noise-robust semi-supervised learning in the framework of sparse coding. Concretely, by limiting the solution $f$ to the space spanned by the eigenvectors $V$ (i.e., $f = V\alpha$), we can readily formulate our noise-robust semi-supervised learning as a sparse reconstruction problem (see the next subsection, and the small sketch at the end of this section). In fact, the sparsity can be considered to be induced in the compressed domain for our noise-robust semi-supervised learning, since sparse coding is regarded as learning a compressible model (Zhang, Schneider, and Dubrawski 2010) in the literature. Furthermore, we can also readily apply dimension reduction to our noise-robust semi-supervised learning by working with only a small subset of eigenvectors (i.e., only some columns of $V$ are used), which is especially suitable for image analysis tasks where the datasets are commonly very large. Although there exist other L1-norm generalizations of Laplacian regularization in (Chen et al. 2011; Petry, Flexeder, and Tutz 2011; Zhou and Schölkopf 2005; Zhou, Lu, and Peng 2013), which approximately take the form of $\sum_i$
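To make the sparse-coding view referred to above concrete, the sketch below minimizes Eq. (5) after the substitution $f = V\alpha$, for the two-class case. It relies only on the orthonormality of $V$: since $Bf = \Sigma^{\frac{1}{2}}\alpha$ and $\|V\alpha - y\|_2^2 = \|\alpha - V^T y\|_2^2 + \text{const}$, the objective separates per coordinate and each $\alpha_i$ is obtained by soft-thresholding. This is our own illustrative derivation under those assumptions, not the paper's full LSSC solver (which adds the nonlinear approximation for large graphs); the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def solve_q_two_class(V, evals, y, lam, k=None):
    """Minimize Q(f) = 0.5*||f - y||_2^2 + lam*||B f||_1 over f = V @ alpha.

    With orthonormal columns in V and B f = Sigma^{1/2} alpha, the problem
    separates, giving alpha_i = soft_threshold((V^T y)_i, lam * sqrt(Sigma_ii)).
    Passing k keeps only the k smallest-eigenvalue eigenvectors, i.e. the
    dimension-reduction step discussed above (eigenvalues assumed ascending).
    """
    if k is not None:
        V, evals = V[:, :k], evals[:k]
    y_tilde = V.T @ y                                   # project noisy labels onto the eigenbasis
    alpha = soft_threshold(y_tilde, lam * np.sqrt(evals))
    f = V @ alpha                                       # map back to a score per data point
    return np.sign(f), f                                # hard labels and soft scores
```

For instance, with `V` and `evals` from the earlier sketch and a label vector `y` holding a few +1/-1 entries and zeros elsewhere, `labels, scores = solve_q_two_class(V, evals, y, lam=0.1, k=100)` would return a predicted label for every data point.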