Nonlinear Feature Extraction with Max-Margin Data Shifting

Jianqiao Wangni, Ning Chen
MOE Key Lab of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNList, Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
zjnqha@gmail.com, ningchen@tsinghua.edu.cn

Abstract

Feature extraction is an important task in machine learning. In this paper, we present a simple and efficient method, named max-margin data shifting (MMDS), to process the data before feature extraction. By relying on a large-margin classifier, MMDS helps enhance the discriminative ability of subsequent feature extractors. The kernel trick can be applied to extract nonlinear features from input data. We further analyze the example of principal component analysis (PCA) in detail. Empirical results on multiple linear and nonlinear models demonstrate that MMDS can efficiently improve the performance of unsupervised extractors.

Introduction

Feature extractors are important in machine learning and can heavily affect final performance. They transform low-level input data, such as raw pixels in images and bag-of-words representations of documents, into high-level features such as typical visual patterns (Bengio, Courville, and Vincent 2013) and topics (Blei, Ng, and Jordan 2003). Supervised feature extraction methods have attracted much attention because they are able to learn features suited to specific tasks. For example, supervised LDA (Mcauliffe and Blei 2008; Zhu, Ahmed, and Xing 2012) jointly models words in documents and their responses (e.g., movie ratings); supervised dictionary learning (Mairal et al. 2009) seeks an overcomplete basis for signals that belong to different classes.

In general, training data are valuable and limited, and therefore cannot cover the rich variations of real inputs. For example, each node of a sensor network may fail mechanically with some probability, and visual objects undergo various translations and rotations. Training a good extractor therefore requires a large number of observations, and how to accomplish this with only a relatively small dataset is an important and challenging goal.

One useful technique for improving the robustness of extractors is corruption, which artificially generates more data by adding noise (e.g., Gaussian noise or blankout noise) to the original data. This noising scheme has proven effective at introducing adaptive regularization (Bishop 1995), helping auto-encoders (Vincent et al. 2008) learn semantic features, and improving the generalization ability of deep networks (Hinton et al. 2012). The explicit corruption method transforms each data instance $x_n \in \mathbb{R}^D$ ($n = 1, \dots, N$) into $S$ versions $\tilde{x}_{ns}$ through some pre-defined corrupting model $p(\tilde{x} \mid x_n)$, and minimizes the average loss $\frac{1}{NS}\sum_n \sum_s L(\tilde{x}_{ns}, y_n)$ over the corrupted dataset. Though simple, the disadvantage of explicit corruption is also obvious: corrupting every observation into multiple copies dramatically increases the training set size and thereby the training time. Recent work on marginalized corrupted features (MCF) (Maaten et al. 2013) provides an implicit corruption scheme, which avoids enumerating multiple copies and instead minimizes the expected loss $\frac{1}{N}\sum_n \mathbb{E}_p[L(\tilde{x}_n, y_n)]$ under the corrupting model. MCF has shown promise in learning auto-encoders
(Chen et al. 2014) and link predictors (Chen et al. 2015).

In this paper, we present max-margin data shifting (MMDS), a simple and efficient method that processes the input data before extracting features. MMDS relies on a large-margin classifier learned on the original data. By building a subsequent feature extractor (e.g., PCA), we are able to learn features that are suitable for classification tasks. MMDS can be approximately derived as the mean of a supervised corrupting model, which considers both a standard noising model (e.g., Gaussian noise) on the input data and the discriminative ability of a potential corruption according to a pre-learned large-margin classifier. The latter factor encodes our intuition that some corrupted data points may cross the decision boundary (if given) and become difficult cases for classification; such corrupted data points should be discouraged. Our empirical results on various datasets demonstrate the effectiveness of MMDS in the context of various feature extractors.

Supervised Data Corruption

Consider a fully labeled dataset that includes $N$ data points $x_n \in \mathbb{R}^D$ and labels $y_n \in \mathcal{Y} = \{1, \dots, H\}$, where $H$ is the number of categories. We aim to extract a high-level representation, or latent features, $z_n \in \mathbb{R}^K$ from each $x_n$. Instead of dealing with the original data, we consider augmented data $\tilde{x}_{ns}$, which are sampled from a supervised corrupting distribution $p_n(\tilde{x} \mid x_n, y_n, \eta)$ parameterized by $\eta$. Assuming each data point is corrupted $S$ times, the fitness of each $z_n$ can be measured by the loss function $L(x_n, z_n, \Theta) = \frac{1}{S}\sum_s L(\tilde{x}_{ns}, z_n, \Theta)$ of a feature extractor whose parameters are denoted by $\Theta$. Different extractor models use different loss functions. For example, the loss $L(\tilde{x}_{ns}, z_n, \Theta)$ for Non-negative Matrix Factorization (NMF) (Lee and Seung 2001), K-means (KM), and Principal Component Analysis (PCA) (Collins, Dasgupta, and Schapire 2001) is defined as

$$\mathrm{NMF}:\ \|\tilde{x}_{ns} - W^\top z_n\|^2,\quad z_n \in \mathbb{R}_+^K,\ W \in \mathbb{R}_+^{K\times D},$$
$$\mathrm{KM}:\ \|\tilde{x}_{ns} - W^\top z_n\|^2,\quad z_n \in \{e_k\}_{k=1}^K,\ W \in \mathbb{R}^{K\times D},$$
$$\mathrm{PCA}:\ \frac{\tilde{x}_{ns}^\top \tilde{x}_{ns}}{2\sigma^2} - \frac{z_n^\top W \tilde{x}_{ns}}{\sigma^2} + A(z_n^\top W),\quad W \in \mathcal{W}, \tag{1}$$

where $\mathcal{W} = \{W \in \mathbb{R}^{K\times D} : WW^\top = I\}$, $e_k$ is the vector with 1 in the $k$-th dimension and 0 elsewhere, $\sigma$ is a noise parameter, and $A(z_n^\top W) = \log \int \exp(z_n^\top W x/\sigma^2 - x^\top x/2\sigma^2)\,dx$ is the log-partition function. Our optimization objective is $F(\Theta) = \frac{1}{NS}\sum_n \sum_s L(\tilde{x}_{ns}, z_n, \Theta)$. Letting $S \to \infty$, we have

$$F(\Theta) = \frac{1}{N}\sum_n \mathbb{E}_{p(\tilde{x}\mid x_n, y_n, \eta)}[L(\tilde{x}, z_n, \Theta)]. \tag{2}$$

A standard corrupting model treats all the corrupted versions equally, as indicated by the uniform weight $\frac{1}{S}$ in the average loss. Here, we suggest distinguishing two types of corrupted data points. As illustrated in Figure 1, each of the two data points in different classes (marked as red circles) is corrupted many times, as indicated by the squares and triangles around each data point. Suppose there is a linear decision boundary that well separates the two red points. Then, some of the corrupted data points may cross the decision boundary, or lie near the boundary and get mixed with the data of a different class. Such points may not be good for extracting features that are suitable for classification, as they make it harder to find a good separating hyperplane in the original space. Therefore, we propose a dropping step to remove such corrupted points. The above corrupt-and-drop design can be formally described by a supervised corrupting distribution $p(\tilde{x} \mid x_n, y_n, \eta)$, which considers both the standard noise (e.g., Gaussian noise) on the input data $x_n$ and the discriminative ability of each corrupted point; an explicit sketch of this corrupt-and-drop idea is given below.
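To make the corrupt-and-drop scheme concrete, the following is a minimal numpy sketch of its explicit version, assuming a pre-trained binary linear classifier $(w, b)$ and a hand-picked margin threshold (the data, names, and constants are hypothetical). It simply averages the corrupted copies that survive the dropping step, producing the shifted mean $\mathbb{E}_{p_n}[\tilde{x}]$ that MMDS later approximates in closed form.

```python
# A minimal numpy sketch of the explicit corrupt-and-drop procedure that the
# supervised corrupting distribution formalizes. The binary classifier (w, b),
# the threshold `margin_min`, and all data are illustrative assumptions,
# not part of the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

def corrupt_and_drop_mean(x, y, w, b, sigma2=0.5, S=2000, margin_min=1.0):
    """Sample S Gaussian corruptions of x, drop those whose margin under the
    pre-trained classifier (w, b) falls below margin_min, and return the mean
    of the survivors (an empirical stand-in for E_{p_n}[x_tilde])."""
    noise = rng.normal(scale=np.sqrt(sigma2), size=(S, x.shape[0]))
    corrupted = x[None, :] + noise
    margins = y * (corrupted @ w - b)          # signed margin of each corrupted copy
    kept = corrupted[margins >= margin_min]    # drop boundary-crossing / near-boundary copies
    return kept.mean(axis=0) if len(kept) else x

# Toy example: two points on either side of the separator w = (1, 0), b = 0.
w, b = np.array([1.0, 0.0]), 0.0
x_pos, x_neg = np.array([1.0, 0.5]), np.array([-1.0, -0.5])
print(corrupt_and_drop_mean(x_pos, +1, w, b))  # mean shifted away from the boundary
print(corrupt_and_drop_mean(x_neg, -1, w, b))
```

As in Figure 1, the surviving copies have a mean that is pushed away from the decision boundary, perpendicular to it.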
Here, we measure the discriminative ability via the data margin. Let $\eta_y$ be the classifier weights associated with class $y$. Following (Crammer and Singer 2002), we can learn a large-margin classifier by solving the following problem:

$$\min_{\eta}\ \frac{1}{2}\|\eta\|^2 + C\sum_n \Big[\max_y \big(\Delta\ell_n(y) + \eta_y^\top x_n\big) - \eta_{y_n}^\top x_n\Big], \tag{3}$$

where $\Delta\ell_n(y) = \ell \cdot \mathbb{I}(y \neq y_n)$ measures the cost of making a wrong prediction, and for simplicity we have omitted the offset parameters. As we need to improve the separability of the training set, besides dropping the samples that cross the decision boundary, we also drop the ones that are within a certain distance from the boundary (marked as crosses in Figure 1).

[Figure 1: The original data and corrupted data. (Best viewed in color)]

To improve robustness, we further relax the deterministic dropping to a soft assignment. That is, we put on each corrupted point a weight that decreases exponentially w.r.t. the hinge loss (i.e., the gap between its margin and the threshold):

$$\phi(y_n \mid \tilde{x}, \eta) = \exp\Big\{-C \max_y \big(\Delta\ell_n(y) + \eta_y^\top \tilde{x} - \eta_{y_n}^\top \tilde{x}\big)\Big\}, \tag{4}$$

whose negative logarithm is the multiclass hinge loss, which considers the margin favored by the true label over any alternative label; $C$ is a constant. For real-valued data, we further consider the Gaussian noising model, and define our supervised corrupting distribution $p$ as

$$p(\tilde{x} \mid x_n, y_n, \eta) \propto \mathcal{N}(\tilde{x} \mid x_n, \sigma^2 I)\,\phi(y_n \mid \tilde{x}, \eta), \tag{5}$$

which combines standard data noising with the discriminative ability. In the sequel, we denote $p_n(\tilde{x}) = p(\tilde{x} \mid x_n, y_n, \eta)$. We derive an approximate objective for the intractable expectation in Eq. (2):

$$F(\Theta) \approx \frac{1}{N}\sum_n L(\mathbb{E}_{p_n}[\tilde{x}], z_n, \Theta), \tag{6}$$

where the approximation comes from moving the expectation over $p_n$ from outside to inside the loss function. The objective only depends on the data expectation and is therefore efficient, without explicitly considering a large number of corrupted samples. Figure 1 gives a geometric view: as a proportion of the corrupted data are deleted according to the cutting planes (i.e., the two dotted lines), their means $\mathbb{E}_{p_n}[\tilde{x}]$ (blue circles) are shifted away from the original data $x_n$ (red circles). Moreover, the shifting direction is perpendicular to the cutting planes due to geometric symmetry.

To make the above procedure practical, we still need to approximate the expectation $\mathbb{E}_{p_n}[\tilde{x}]$. Below, we present a simple procedure that works well in practice. First, we can show that the corrupting distribution $p_n(\tilde{x})$ can be obtained by minimizing the following objective:

$$\mathrm{KL}\big(q_n(\tilde{x})\,\|\,q_n^0(\tilde{x})\big) + C\,\mathbb{E}_{q_n}\Big[\max_y \big(\Delta\ell_n(y) + \eta_y^\top \tilde{x} - \eta_{y_n}^\top \tilde{x}\big)\Big],$$

where $q_n^0(\tilde{x}) = \mathcal{N}(\tilde{x} \mid x_n, \sigma^2 I)$ and $\mathrm{KL}$ is the Kullback-Leibler divergence. Then, we move the expectation inside the max function and deal with the following objective (denoted by $L(q_n(\tilde{x}))$):

$$\mathrm{KL}\big(q_n(\tilde{x})\,\|\,q_n^0(\tilde{x})\big) + C \max_y \big(\Delta\ell_n(y) - \mathbb{E}_{q_n}[\eta_{y_n}^\top \tilde{x} - \eta_y^\top \tilde{x}]\big),$$

which is in fact a lower bound due to Jensen's inequality and the convexity of the max function. Consider the following optimization problem:

$$\min_{\{q_n(\tilde{x})\},\,\eta}\ \sum_n L(q_n(\tilde{x})) + \frac{1}{2}\|\eta\|^2,$$

which can be written in a constrained form:

$$\min_{\{q_n(\tilde{x})\},\,\eta,\,\xi}\ \sum_n \mathrm{KL}\big(q_n(\tilde{x})\,\|\,q_n^0(\tilde{x})\big) + \frac{1}{2}\|\eta\|^2 + C\sum_{n,y}\xi_n^y$$
$$\mathrm{s.t.}\ \forall n,\ \forall y \in \mathcal{Y}:\ \mathbb{E}_{q_n}[\eta_{y_n}^\top \tilde{x} - \eta_y^\top \tilde{x}] \geq \Delta\ell_n(y) - \xi_n^y,$$

with the normalization constraint $\int q_n(\tilde{x})\,d\tilde{x} = 1$. The Lagrangian function is

$$\mathcal{L}(\{q_n(\tilde{x})\}, \eta, \xi, \alpha, \lambda) = \sum_n \mathrm{KL}\big(q_n(\tilde{x})\,\|\,q_n^0(\tilde{x})\big) + \frac{1}{2}\|\eta\|^2 + C\sum_{n,y}\xi_n^y + \lambda\Big(\int q_n(\tilde{x})\,d\tilde{x} - 1\Big) - \sum_{n,y}\alpha_n^y\Big(\mathbb{E}_{q_n}[\eta_{y_n}^\top \tilde{x} - \eta_y^\top \tilde{x}] - \Delta\ell_n(y) + \xi_n^y\Big).$$

Though finding a rigorous solution requires iterating over $q(\tilde{x})$ and $\eta$, we consider a simple procedure without iteration and derive a simple and efficient operator. Specifically, we start by setting $q_n(\tilde{x})$ to the prior $q_n^0(\tilde{x})$ and solve for the classifier weights $\eta$.
For this step, we get the solution $\eta_y = \sum_n \alpha_n^y x_n$, where the dual parameters $\{\alpha\}$ are obtained by optimizing the dual form of a multiclass SVM (Crammer and Singer 2002). This step can be efficiently done by an existing solver. Then, setting the derivative of the objective w.r.t. $\{q_n(\tilde{x})\}_{n=1}^N$ to zero while fixing $\alpha$ and $\eta$, we get

$$q_n(\tilde{x}) = \mathcal{N}\Big(\tilde{x}\ \Big|\ x_n + \sigma^2 \sum_y \alpha_n^y[\eta_{y_n} - \eta_y],\ \sigma^2 I\Big). \tag{8}$$

Since we use $q_n(\tilde{x})$ to approximate $p_n(\tilde{x})$, we obtain a rule for calculating the expectation: $\mathbb{E}_{p_n}[\tilde{x}] \approx \mathbb{E}_{q_n}[\tilde{x}]$.

Max-Margin Data Shifting

With the above approximation, we get a simple and efficient rule to shift the input data:

$$\mathrm{(MMDS)}:\quad T[x_n] = x_n + \sigma^2 \sum_y \alpha_n^y[\eta_{y_n} - \eta_y], \tag{9}$$

where $\alpha$ and $\eta$ are the dual and primal solutions of the SVM in Eq. (3). We refer to the operation $T$ as max-margin data shifting (MMDS). This operation matches our intuition: shifting along the directions $\eta_{y_n} - \eta_y$, weighted by the SVM dual coefficients, makes the MMDS data more likely to be predicted as their true labels, in contrast with all other labels. We then train an unsupervised extractor with the MMDS data by optimizing $\sum_n L(T[x_n], z_n, \Theta)$. Note that we only use the MMDS data for learning the feature extractor. The procedure for training an extractor on the MMDS data is summarized as

Training Data $\{x\}$ -> Shift $T[x]$ -> Train Extractor.

After learning the feature extractor, we use it to extract features from the original data and learn a classifier. The procedure for training a classifier is summarized as

Training Data $\{x, y\}$ -> Extractor -> $\{z, y\}$ -> Train Classifier.

Finally, for testing data, we extract the latent features $z$ and apply the classifier to make the prediction.

The above shifting process can be naturally extended to nonlinear mappings of the input data, e.g., via a kernel mapping into a Reproducing Kernel Hilbert Space (RKHS). Random Fourier Features (RFF) (Rahimi and Recht 2009) provide a simple and scalable method to embed the original data into the RKHS defined by a shift-invariant kernel such as the radial basis function (RBF) kernel. (Lopez-Paz et al. 2014) proposed Randomized Kernel PCA, which uses PCA to fit the RFF embedding. Suppose a kernel satisfies $K(x, x') = K(x - x', 0)$, and let $p_f(\omega)$ be the distribution that normalizes the Fourier transform of this kernel, $p_f(\omega) \propto \int \exp(-i\omega^\top x) K(x, 0)\,dx$, with $\int p_f(\omega)\,d\omega = 1$. The $T$-dimensional RFF embeddings are generated as

$$F[x] = [\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_T^\top x + b_T)]^\top, \tag{10}$$

where $\omega_t \sim p_f(\omega)$ and $b_t \sim \mathrm{Uniform}(-\pi, \pi)$. We can apply any extractor to fit the MMDS-processed versions of the RFF embeddings, in order to learn discriminative nonlinear features. The training procedure of kernelized extractors is

$\{x\}$ -> RFF $F[x]$ -> MMDS $T[F[x]]$ -> Train Extractor. (11)

Notice that our framework easily generalizes to semi-supervised learning: we simply shift the labeled data while keeping the unlabeled data unchanged. A sketch of this shift-then-fit pipeline is given below.
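As an illustration of the pipeline in Eq. (11), here is a small Python sketch built on numpy and scikit-learn. It uses the binary, $\sigma^2$-scaled special case of the MMDS rule, $T[x_n] = x_n + \sigma^2\alpha_n y_n \eta$ (derived in the analysis below), because scikit-learn's SVC exposes the dual coefficients $\alpha_n y_n$ directly; the RFF bandwidth, the toy data, and all constants are assumptions for illustration, not the authors' implementation (which uses a stochastic Frank-Wolfe multiclass SVM and LibLinear).

```python
# A minimal sketch of the shift-then-fit pipeline in Eq. (11), using the
# binary special case of the MMDS rule. Toy data and all hyperparameters
# are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def rff_embed(X, T=500, gamma=1.0):
    """Random Fourier Features for the RBF kernel exp(-gamma * ||x - x'||^2)."""
    D = X.shape[1]
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, T))  # omega ~ p_f(omega)
    b = rng.uniform(-np.pi, np.pi, size=T)
    return np.cos(X @ omega + b), (omega, b)

def mmds_shift(F, y, sigma2=1.0, C=1.0):
    """Binary MMDS: shift each support vector by sigma^2 * (alpha_n * y_n) * eta."""
    svm = SVC(kernel="linear", C=C).fit(F, y)
    eta = svm.coef_[0]                      # primal solution eta
    shift = np.zeros_like(F)
    # dual_coef_[0] holds alpha_n * y_n for the support vectors only
    shift[svm.support_] = sigma2 * svm.dual_coef_[0][:, None] * eta[None, :]
    return F + shift

# Toy data: two Gaussian blobs in 10 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (100, 10)), rng.normal(1.5, 1.0, (100, 10))])
y = np.array([-1] * 100 + [+1] * 100)

F, rff_params = rff_embed(X, T=300, gamma=0.1)       # {x} -> RFF
F_shifted = mmds_shift(F, y, sigma2=1.0)             # RFF -> MMDS
extractor = PCA(n_components=5).fit(F_shifted)       # MMDS -> train extractor

# Features for the classifier are extracted from the *unshifted* embedding.
Z = extractor.transform(F)
clf = LinearSVC(C=1.0, max_iter=10000).fit(Z, y)
print("training accuracy:", clf.score(Z, y))
```

As in the paper's procedure, the extractor is fitted on the shifted embedding, while the features fed to the classifier are extracted from the original (unshifted) embedding.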
Application in Principal Component Analysis

Now we use principal component analysis (PCA) to explain the effect of MMDS in detail. Vanilla PCA seeks the dominant components underlying the data. To take labels into account, which provide side information with the potential to select components suited to specific tasks, various extensions have been developed. For example, Supervised PCA (Yu et al. 2006; Rish et al. 2008; Du et al. 2015) combines probabilistic PCA with a regression task under the assumption that both data and side information are generated from a common latent space through linear mappings. The SVDM (Pereira and Gordon 2006) seeks a low-dimensional linear embedding of the training data using singular value decomposition (SVD). (Rish et al. 2008) tackled the above problem using a closed-form update rule with provable convergence. Though these methods have proven effective, they handle label information in a regression manner. Furthermore, most of them rely on EM solvers, which can get trapped in local minima and suffer from time complexity issues.

In this section, we first define a max-margin PCA (MMPCA). We then show that MMPCA can be approximately solved by learning an unsupervised PCA on the MMDS data; we refer to this method as MMDS-PCA.

A PCA model assumes that zero-centered data are generated from features through a linear mapping, $x = W^\top z + \epsilon$, where $z_n \sim \mathcal{N}(0, I)$ and $\epsilon$ is Gaussian noise. Following the notation of (Guo 2009), we use the formulation of Exponential Family PCA (EFPCA) for simplicity. We also use capital letters to represent matrices of stacked variables; for instance, $Z \in \mathbb{R}^{K\times N}$ is the matrix whose $n$-th column is $z_n$. Our objective function on the original data matrix $X$ without corruption follows Eq. (1):

$$L(X, Z, W) \propto A(Z^\top W) - \mathrm{tr}(Z^\top W X) + \frac{1}{2}\mathrm{tr}(Z^\top Z),$$

where $W \in \mathcal{W}$, $A(Z^\top W) = \sum_n A(z_n^\top W)$, and we omit the term that only depends on $X$. This model is difficult to optimize because the orthogonality constraint $\mathcal{W}$ is non-convex, and the objective is not jointly convex in $Z$ and $W$, although it is convex in $Z$ for fixed $W$. Denoting by $A^*$ the Fenchel conjugate of $A$, we have $A(z_n^\top W) = \max_{U_n}\ z_n^\top W U_n - A^*(U_n)$. Notice that $A^*(U_n)$ is convex. Denoting $A^*(U) = \sum_n A^*(U_n)$ and setting $\partial L/\partial z_n = 0$, we get $z_n = W(x_n - U_n)$. Substituting this back, the dual form of PCA is

$$L(X, M, U) \propto -A^*(U) - \frac{1}{2}\mathrm{tr}\big((U - X)(U - X)^\top M\big),$$

where $M \in \mathcal{M} \equiv \{W^\top W : W \in \mathcal{W}\}$. (Guo 2009) suggested that the domain $\mathcal{M}$ can be relaxed to $\bar{\mathcal{M}} = \{M : I \succeq M \succeq 0,\ \mathrm{tr}(M) = K\}$. After this step the objective satisfies the strong minimax property (Borwein and Lewis 2010), which implies that the min and max operators are exchangeable in coordinate-descent optimization. We then get the closed-form solution $W = \mathcal{Q}_K\big((X - U)(X - U)^\top\big)$, where the operator $\mathcal{Q}_K(A)$ stacks the top-$K$ eigenvectors of the matrix $A$. The solution $W$ is guaranteed to be orthogonal and thus satisfies our premise, which indicates that the relaxation from $\mathcal{M}$ to $\bar{\mathcal{M}}$ is tight.

A supervised version of PCA is max-margin PCA (MMPCA), which uses an SVM to fit the latent variables $z$ while extracting components:

$$\min_{W \in \mathcal{W}}\ \min_Z\ \min_\nu\ \frac{1}{2}\|\nu\|^2 + L(X, Z, W) + C\sum_n \xi_n$$
$$\mathrm{s.t.}\ \forall n:\ y_n(\nu^\top z_n - b) \geq 1 - \xi_n,\quad \xi_n \geq 0, \tag{12}$$

where we restrict ourselves to binary classification (i.e., $y_n \in \{+1, -1\}$) for simplicity; the extension to the multi-class case can be done as in the MMDS formulation. We define the Lagrangian function of the soft-margin SVM, $F(\alpha, \nu, Z)$, as

$$F(\alpha, \nu, Z) = \frac{1}{2}\|\nu\|^2 - \sum_n \alpha_n\big[y_n(\nu^\top z_n - b) - 1 + \xi_n\big] + C\sum_n \xi_n, \tag{13}$$

where $\forall n,\ \alpha_n \in [0, C]$ and $C$ is a positive constant. MMPCA is then reformulated as

$$\min_{W \in \mathcal{W}}\ \min_Z\ \max_\alpha\ \min_\nu\ F(\alpha, \nu, Z) + L(X, Z, W). \tag{14}$$

Notice that the min and max operators in $F(\alpha, \nu, Z)$ are exchangeable due to linearity. Setting the derivative w.r.t. $Z$ to zero, we get $z_n = W(x_n - U_n) + \alpha_n y_n \nu$. Now consider a closely related SVM model whose Lagrangian function is

$$F(\alpha, \eta, X) = \frac{1}{2}\|\eta\|^2 - \sum_n \alpha_n\big[y_n(\eta^\top x_n - b) - 1 + \xi_n\big] + C\sum_n \xi_n,$$

where $\forall n,\ \alpha_n \in [0, C]$. This objective function is independent of the features. For the further derivation, we make the assumption that the equality $z_n = W x_n$ holds, which means that the learned features approximate the projection of the data by $W$; this approximation becomes tighter as $\mathrm{Rank}(X)$ approaches $K$.
Recalling that $WW^\top = I$, the dual forms of the two SVMs share an equal term, $\|\sum_n \alpha_n y_n x_n\|^2 = \|\sum_n \alpha_n y_n z_n\|^2$, so the two dual solutions are actually the same. The primal solutions of the two SVMs are related through $W$: $W\eta = \sum_n \alpha_n y_n W x_n = \sum_n \alpha_n y_n z_n = \nu$. Recalling that $z_n = W(x_n - U_n) + \alpha_n y_n \nu$, we get $z_n = W(T[x_n] - U_n)$, where $T[x_n] = x_n + \alpha_n y_n \eta$ is the binary-labeled, $\sigma^2 = 1$ case of MMDS. According to the previous argument, we have $\max_\alpha \min_\nu F(\alpha, \nu, Z) = \max_\alpha \min_\eta F(\alpha, \eta, X)$ under the premise $z_n = W x_n$. Substituting this back, we get the dual form of the objective:

$$D[1]:\quad \min_{M \in \bar{\mathcal{M}}}\ \max_{U, \alpha}\ \min_\eta\ F(\alpha, \eta, X) + L(T[X], M, U).$$

Since the SVM part $F(\alpha, \eta, X)$ does not affect the optimum of the other parameters, $M$ and $W$ are obtained in the same manner as before. For MMDS-PCA, we have the following learning problem:

$$D[2]:\quad \min_{M \in \bar{\mathcal{M}}}\ \max_U\ L(T[X], M, U). \tag{15}$$

Comparing with $D[1]$, the only modification that affects $Z$ and $W$ is changing $x_n$ to $T[x_n]$, which implies that MMPCA is approximated by MMDS-PCA, where the approximation precision depends on the low-rank property of $X$ and the scale of the data shifting.

Analysis on Principal Components

Based on the above analysis of the relation between MMPCA and MMDS-PCA, it is clear that data shifting can enlarge the feature margin. We now analyze how the principal components are affected. Suppose the input data are zero-centered after preprocessing; their covariance matrix is decomposed as $S(X) = \frac{1}{N}\sum_n x_n x_n^\top = V\Omega V^\top$, where $V^\top V = I$ and $\Omega = \mathrm{diag}[l_1, \dots, l_D]$ arranges the eigenvalues in descending order. Recall that in the binary case we have $\eta = \sum_n \alpha_n y_n x_n$. The covariance matrix of the MMDS data is

$$S(T[X]) = V\Omega V^\top + \alpha_0\,\eta\eta^\top/N,\quad \alpha_0 = \sum_n \alpha_n^2 + 2.$$

This equation shows that MMDS-PCA amounts to adding the SVM coefficient vector as an additional (scaled) training point. Denoting $\gamma = V^\top\eta \in \mathbb{R}^D$, we have $\eta = VV^\top\eta = V\gamma$, so $\eta\eta^\top = (\sum_d \gamma_d V_d)(\sum_d \gamma_d V_d)^\top = \sum_{j,k}\gamma_j\gamma_k V_j V_k^\top$. By the properties of the normal distribution, $\eta \sim \mathcal{N}(0, \sum_n \alpha_n^2\,\Phi)$, where $\Phi$ is the covariance matrix of the data distribution. $S(X)$ can be viewed as a sampling approximation of $\Phi$, and (Zhang et al. 2012) proved a tail bound on sums of $N$ random matrices: $\forall \epsilon \geq 0$,

$$\Pr\big\{\|S(X) - \Phi\|_2 \geq Q(N, \zeta, \epsilon)\,\|\Phi\|_2\big\} \leq 2De^{-\epsilon}, \tag{16}$$

where $Q(N, \zeta, \epsilon) = \frac{2\epsilon\zeta + 1}{N} + \sqrt{\frac{2\epsilon\zeta}{N}}$ and $\zeta = \mathrm{tr}(\Phi)/\|\Phi\|_2$. We denote by $\bar{\alpha}_0 = \mathbb{E}[\sum_n \alpha_n^2]$ the corresponding expectation w.r.t. the data distribution. Given a sufficient amount of data, $\gamma_j\gamma_k$ ($j \neq k$) approaches zero:

$$\mathbb{E}[\gamma_j\gamma_k] \propto \bar{\alpha}_0\,\mathbb{E}[V_j^\top S(X) V_k] = \bar{\alpha}_0\,\mathbb{E}[V_j^\top V_k]\,l_k = 0,$$

where the last step follows from the definition of $V$. We can therefore approximate an eigen-decomposition by $\eta\eta^\top = V\gamma\gamma^\top V^\top \approx \sum_d \gamma_d^2 V_d V_d^\top = V\,\mathrm{diag}[\gamma_1^2, \dots, \gamma_D^2]\,V^\top$. Now we have

$$S(T[X]) = V\Omega V^\top + \alpha_0\,\eta\eta^\top/N \approx V\tilde{\Omega}V^\top, \tag{17}$$

where $\tilde{\Omega} = \mathrm{diag}[l_1 + \alpha_0\gamma_1^2/N, \dots, l_D + \alpha_0\gamma_D^2/N]$. Since $\|V_d\|_2 = 1$ for every $d$, we have $\gamma_d = \eta^\top V_d = \rho(\eta, V_d)\,\|\eta\|$, where $\rho(\cdot, \cdot)$ is the correlation coefficient. This means that $\gamma_d$ measures the linear correlation between the eigenvector $V_d$ and the SVM coefficients $\eta$. Recall that PCA chooses principal components according to eigenvalues. As data shifting changes the ordering of the eigenvalues of the data covariance, the components that are more correlated with $\eta$ have a higher chance of being selected as principal ones. The projection is thus influenced, especially when we reduce the dimension heavily. This result accords with our intuition, because the multiclass SVM can be viewed as a discriminative projection onto a $|\mathcal{Y}|$-dimensional subspace. To conclude, the SVM supervises PCA to strike a balance between data reconstruction and linear separability. A small numerical sketch of this eigenvalue analysis is given below.
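The rank-one covariance update above can be checked numerically. The following toy sketch (numpy and scikit-learn; the synthetic data and all constants are assumptions for illustration) verifies $S(T[X]) = S(X) + \alpha_0\,\eta\eta^\top/N$ for the binary, $\sigma^2 = 1$ shift and prints how the eigenvalue spectrum changes.

```python
# A small numpy check of the covariance identity S(T[X]) = S(X) + alpha_0 * eta eta^T / N
# (binary case, sigma^2 = 1). Data, solver choice, and constants are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two classes separated along the *third* axis, which plain PCA would rank low.
N, D = 200, 5
X = rng.normal(size=(N, D)) * np.array([3.0, 2.0, 0.5, 0.3, 0.2])
X = X - X.mean(axis=0)                              # zero-centered, as assumed in the analysis
y = np.where(X[:, 2] > 0, 1, -1)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
eta = svm.coef_[0]                                  # eta = sum_n alpha_n y_n x_n
alpha_y = np.zeros(N)
alpha_y[svm.support_] = svm.dual_coef_[0]           # alpha_n * y_n for support vectors

T_X = X + alpha_y[:, None] * eta[None, :]           # T[x_n] = x_n + alpha_n y_n eta

S_orig = X.T @ X / N                                # second-moment form used in the analysis
S_shift = T_X.T @ T_X / N
alpha0 = np.sum(alpha_y ** 2) + 2.0
S_pred = S_orig + alpha0 * np.outer(eta, eta) / N   # rank-one update predicted by the analysis
print("max |S(T[X]) - (S(X) + a0*eta*eta^T/N)| =", np.abs(S_shift - S_pred).max())

# Eigenvalues before and after shifting: growth is concentrated along directions correlated with eta.
w_orig, V_orig = np.linalg.eigh(S_orig)
w_shift, _ = np.linalg.eigh(S_shift)
corr = np.abs(V_orig.T @ eta) / np.linalg.norm(eta) # |rho(eta, V_d)|
print("eigenvalues (original):", np.round(w_orig[::-1], 3))
print("eigenvalues (shifted): ", np.round(w_shift[::-1], 3))
print("|corr with eta| per original eigenvector:", np.round(corr[::-1], 3))
```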
Computing the covariance matrix takes O(ND^2) time and the eigen-decomposition step takes O(D^3), with a stochastic SVM solver taking O(N). When the data size becomes relatively large, the total complexity is approximately O(ND^2). In contrast, all of the previous methods are iterative algorithms, with per-iteration complexity O(NDK^4) for BMMPCA (Du et al. 2015), O(NDK) (primal form) or O(N^2) (dual form) for SPPCA (Yu et al. 2006), and O(N^2) for SEPCA (Guo 2009), which are generally slower than our method.

Experiments

We implement the multiclass SVM of our data shifting phase using the stochastic mini-batch Frank-Wolfe algorithm (Lacoste-Julien et al. 2014) with a fixed maximum number of iterations. For the final classification phase, we use the LibLinear package (Fan et al. 2008) to learn multiclass SVMs. To show the effect of the data shifting operation, Figure 2 visualizes a 2-dimensional embedding of the Digits-HOG data in 4 classes (introduced later). The two panels show the projections produced by 2-dimensional PCA and MMDS-PCA, respectively. We can see that the features extracted from the MMDS data are more separable.

Real-World Datasets with PCA

We tested MMDS-PCA with eigen-decomposition solvers on multiple real-world datasets covering various tasks, including face recognition, tumor diagnosis, and video retrieval. The Yale dataset contains 165 images of 15 individuals. The Yale B dataset (the extended Yale Face Database B) includes 38 individuals with about 64 near-frontal face images each. The ORL dataset contains 10 images with varying lighting and facial details for each of 40 distinct subjects. All the faces are manually aligned, cropped, and resized to 32x32 pixels. The 11 Tumor dataset contains 174 gene samples of 11 different classes, each represented as a 12,522-dimensional vector. The TRECVID2003 dataset contains 1078 video shots of 5 categories, represented as 165-dimensional HSV color histograms, which we evenly split into training and testing sets. The number of training samples per class is 2 for ORL, 3 for Yale, 11 for 11 Tumor, and 5 for Yale B.

[Figure 2: 2D embedding of the digit images processed by (left) PCA and (right) MMDS-PCA.]

Table 1: Classification accuracy (%) on various datasets.

             Yale   Yale B   ORL    11 Tumor
  PCA        55.8   12.9     51.6   67.6
  LDA        37.1   15.7     28.6   28.6
  SPPCA      51.6    9.8     61.7   63.0
  SDR-GLM    58.8   19.0     -      63.5
  SEPCA      64.4   20.5     -      88.9
  MMDS-PCA   70.1   33.3     73.5   73.3

We compare with a wide range of state-of-the-art supervised feature extractors, including SPPCA (Yu et al. 2006), SEPCA (Guo 2009), supervised dimensionality reduction with generalized linear models (SDR-GLM) (Rish et al. 2008), the large-margin Harmonium (MMH) (Chen et al. 2012), and the infinite latent SVM (iLSVM) (Zhu, Chen, and Xing 2014), as well as two baseline methods, PCA and linear discriminant analysis (LDA). For all models, the data are projected into a 10-dimensional space (K = 10). We use cross-validation with 5 folds (or the size of the smallest category, if smaller) to find proper parameters. Table 1 and Table 2 present the accuracy of the different models on the various datasets. Our easily implemented MMDS-PCA outperforms most state-of-the-art supervised models on many datasets, except on 11 Tumor, which exponential family PCA (Guo 2009) is particularly well suited to fit.

Generalization Ability and Parameter Sensitivity

We randomly choose 500 samples from the MNIST dataset and use 10,000 samples for testing. The Digits dataset is included in OpenCV.
We extract 64-dimensional HOG features (Dalal and Triggs 2005) from 5,000 digit images and evenly split them into train/test sets. The Letters dataset (Ben, Carlos, and Daphne 2004) contains 26 kinds of Latin characters; we use 5,375 samples for training and the other 46,777 for testing. For these data, we compute RFF embeddings with an RBF kernel $K(x, x') = \exp(-\|x - x'\|^2/\sigma^2)$, where $\sigma^2$ is determined during the experiments; the RFF dimension is set to 500 for Digits and 1,000 for the other datasets.

Table 2: Classification accuracy (%) on the TRECVID dataset.

             EFH    MMH    iLSVM   MMDS-PCA
  TRECVID    56.5   56.6   56.3    59.74

Table 3: Recognition accuracy (%) of different dimension reduction methods using the RFF embeddings of various datasets and their MMDS versions.

  Data\Model     ICA    SPCA   KM     FA     NMF
  MNIST          57.8   30.3   48.1   64.9   38.3
  MNIST-MMDS     63.7   48.8   61.8   74.4   47.6
  Digits         62.8   63.5   63.2   70.0   28.4
  Digits-MMDS    81.6   75.5   71.2   86.1   72.1
  Letter         31.6   26.9   17.9   38.5   31.0
  Letter-MMDS    39.7   38.2   46.3   52.8   38.2

We test the generalization ability of MMDS using multiple feature extractors, including Independent Component Analysis (ICA), Sparse Principal Component Analysis (SPCA) (Zou, Hastie, and Tibshirani 2006), K-means (KM), Factor Analysis (FA), and Non-negative Matrix Factorization (NMF) (Lee and Seung 2001). We kernelize them using RFF embeddings under the previously mentioned settings, and extract 5-dimensional features with all of these models. Table 3 presents the accuracy using either the original data or the data processed by MMDS. Even though it is not easy to develop supervised versions of these models, our results demonstrate that they consistently benefit from MMDS, and most of the improvements are significant.

To study parameter sensitivity, we tested handwritten symbol recognition using PCA, an autoencoder (AE), and Kernel PCA (KPCA) (Lopez-Paz et al. 2014) on the MNIST, Digits, and Letters datasets. We use PCA with a standard eigen-decomposition solver. We implemented the autoencoder, which optimizes

$$\frac{1}{N}\sum_n \big\|x_n - \phi\big(W^\top\phi(Wx_n + b_1) + b_2\big)\big\|^2 \tag{18}$$

w.r.t. $W$, $b_1$, $b_2$, where $\phi$ is the sigmoid function; a minimal sketch of this objective is given at the end of this section. We use an L-BFGS solver with a fixed maximum number of iterations and normalize the data to (0.1, 0.9) before feeding it to the autoencoder. We implemented KPCA using RFF embeddings in the previous setting. We then use the same classifier setting for every model pair trained on either the original data or the shifted data.

[Figure 3: Sensitivity of MM-models: X-axis, shifting scale $\sigma^2$; red, classification accuracy; blue, reconstruction error. All statistics are normalized for visualization.]

Figure 3 presents the performance of the various extractors under different shifting scales (i.e., $\sigma^2$ in MMDS), where the accuracy and reconstruction errors are normalized. The trend shows that as the shifting scale grows, the reconstruction error steadily increases, while the accuracy first increases and then decreases. These results suggest that the discriminative MMDS may reduce the fit to the observed data: a suitable shifting scale is helpful for extracting discriminative features, but an arbitrarily large shift does not fit the input data well.

Finally, Table 4 presents the performance under different settings of the feature dimension (K). We can see that the improvement brought by MMDS over the original data is substantial. The selection of discriminative components is especially important in low-dimensional settings, where the information is more compressed.
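For reference, here is a minimal numpy/scipy sketch of the tied-weight autoencoder objective in Eq. (18), trained with L-BFGS as described above; the toy data, initialization, and iteration budget are illustrative assumptions rather than the exact experimental configuration.

```python
# A minimal sketch of the tied-weight autoencoder objective in Eq. (18),
# trained with L-BFGS. Data, dimensions, and hyperparameters are assumptions.
import numpy as np
from scipy.optimize import minimize

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_ae_objective(X, K):
    """Return f(params) -> (loss, grad) for (1/N) sum_n ||x_n - phi(W^T phi(W x_n + b1) + b2)||^2."""
    N, D = X.shape

    def unpack(params):
        W = params[:K * D].reshape(K, D)
        b1 = params[K * D:K * D + K]
        b2 = params[K * D + K:]
        return W, b1, b2

    def objective(params):
        W, b1, b2 = unpack(params)
        H = sigmoid(X @ W.T + b1)              # hidden activations, (N, K)
        R = sigmoid(H @ W + b2)                # reconstructions, (N, D)
        err = R - X
        loss = np.sum(err ** 2) / N
        # Backpropagation through the tied weights W.
        dB = (2.0 / N) * err * R * (1 - R)     # gradient at decoder pre-activation, (N, D)
        dA = (dB @ W.T) * H * (1 - H)          # gradient at encoder pre-activation, (N, K)
        gW = H.T @ dB + dA.T @ X               # decoder + encoder contributions, (K, D)
        grad = np.concatenate([gW.ravel(), dA.sum(axis=0), dB.sum(axis=0)])
        return loss, grad

    return objective, unpack

# Toy usage: random data rescaled into (0.1, 0.9), mirroring the normalization above.
rng = np.random.default_rng(0)
X = rng.random((200, 64)) * 0.8 + 0.1
K = 5
obj, unpack = make_ae_objective(X, K)
x0 = 0.01 * rng.standard_normal(K * X.shape[1] + K + X.shape[1])
res = minimize(obj, x0, jac=True, method="L-BFGS-B", options={"maxiter": 200})
W, b1, b2 = unpack(res.x)
Z = sigmoid(X @ W.T + b1)                      # extracted K-dimensional features
print("final reconstruction loss:", res.fun)
```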
Table 4: Accuracy (%) w.r.t. changing feature dimensions. In each column, the left number is the accuracy obtained by fitting the original data, and the right number is the accuracy obtained with the MMDS data.

  Data (K)\Model   PCA          AE           KPCA
  MNIST (5)        64.4 / 69.7  43.8 / 49.7  64.4 / 73.7
  MNIST (10)       75.3 / 79.2  58.1 / 76.3  74.7 / 81.7
  MNIST (20)       81.5 / 83.0  77.7 / 81.6  81.0 / 83.1
  Digits (5)       58.3 / 79.4  73.5 / 79.0  70.8 / 83.8
  Digits (10)      79.6 / 89.1  87.4 / 93.0  85.8 / 94.0
  Digits (20)      86.8 / 90.2  92.7 / 93.6  91.9 / 94.3
  Letters (5)      35.1 / 51.1  35.4 / 42.2  38.4 / 53.3
  Letters (10)     53.6 / 62.9  50.7 / 58.8  53.4 / 65.0
  Letters (20)     65.5 / 69.7  61.2 / 65.8  62.9 / 71.9

Conclusions

We present a simple max-margin data shifting (MMDS) operator for learning discriminative features with unsupervised extractors. MMDS can be approximately derived as the mean of a supervised corrupting model that considers both standard data noise and the discriminative ability of a potential corruption according to a pre-trained large-margin classifier. Our experiments on various datasets show that MMDS helps a wide family of models learn features that are suitable for classification tasks.

Acknowledgments

The work was supported by the National Basic Research Program (973 Program) of China (Nos. 2013CB329403, 2012CB316301), National NSF of China (Nos. 61305066, 61322308, 61332007), the TNList Big Data Initiative, and the Tsinghua Initiative Scientific Research Program (Nos. 20121088071, 20141080934).

References

Ben, T.; Carlos, G.; and Daphne, K. 2004. Max-margin Markov networks. Advances in Neural Information Processing Systems 16:25.
Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798-1828.
Bishop, C. M. 1995. Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1):108-116.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993-1022.
Borwein, J. M., and Lewis, A. S. 2010. Convex Analysis and Nonlinear Optimization: Theory and Examples, volume 3. Springer Science & Business Media.
Chen, N.; Zhu, J.; Sun, F.; and Xing, E. P. 2012. Large-margin predictive latent subspace learning for multi-view data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Chen, M.; Weinberger, K. Q.; Sha, F.; and Bengio, Y. 2014. Marginalized denoising auto-encoders for nonlinear representations. In ICML. ACM.
Chen, Z.; Chen, M.; Weinberger, K. Q.; and Zhang, W. 2015. Marginalized denoising for link prediction and multi-label learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Collins, M.; Dasgupta, S.; and Schapire, R. E. 2001. A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems, 617-624.
Crammer, K., and Singer, Y. 2002. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research 2:265-292.
Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In CVPR 2005. IEEE.
Du, C.; Zhe, S.; Zhuang, F.; Qi, Y.; He, Q.; and Shi, Z. 2015. Bayesian maximum margin principal component analysis. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research 9:1871-1874.
Guo, Y. 2009. Supervised exponential family principal component analysis via convex optimization. In Advances in Neural Information Processing Systems, 569-576.
Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
Lacoste-Julien, S.; Jaggi, M.; Schmidt, M.; and Pletscher, P. 2014. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML. ACM.
Lee, D. D., and Seung, H. S. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 556-562.
Lopez-Paz, D.; Sra, S.; Ghahramani, Z.; and Schölkopf, B. 2014. Randomized nonlinear component analysis. In ICML. ACM.
Maaten, L.; Chen, M.; Tyree, S.; and Weinberger, K. 2013. Learning with marginalized corrupted features. In ICML. ACM.
Mairal, J.; Ponce, J.; Sapiro, G.; Zisserman, A.; and Bach, F. R. 2009. Supervised dictionary learning. In Advances in Neural Information Processing Systems, 1033-1040.
Mcauliffe, J. D., and Blei, D. M. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, 121-128.
Pereira, F., and Gordon, G. 2006. The support vector decomposition machine. In ICML. ACM.
Rahimi, A., and Recht, B. 2009. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
Rish, I.; Grabarnik, G.; Cecchi, G.; Pereira, F.; and Gordon, G. J. 2008. Closed-form supervised dimensionality reduction with generalized linear models. In ICML. ACM.
Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A. 2008. Extracting and composing robust features with denoising autoencoders. In ICML. ACM.
Yu, S.; Yu, K.; Tresp, V.; Kriegel, H.-P.; and Wu, M. 2006. Supervised probabilistic principal component analysis. In ACM SIGKDD, 464-473.
Zhang, L.; Mahdavi, M.; Jin, R.; Yang, T.; and Zhu, S. 2012. Recovering the optimal solution by dual random projection. arXiv preprint arXiv:1211.3046.
Zhu, J.; Ahmed, A.; and Xing, E. P. 2012. MedLDA: Maximum margin supervised topic models. The Journal of Machine Learning Research 13(1):2237-2278.
Zhu, J.; Chen, N.; and Xing, E. P. 2014. Bayesian inference with posterior regularization and applications to infinite latent SVMs. The Journal of Machine Learning Research 15(1):1799-1847.
Zou, H.; Hastie, T.; and Tibshirani, R. 2006. Sparse principal component analysis. Journal of Computational and Graphical Statistics 15(2):265-286.