Discriminative Vanishing Component Analysis

Chenping Hou (1), Feiping Nie (2), Dacheng Tao (3)
(1) College of Science, National University of Defense Technology
(2) Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University
(3) Center for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney
hcpnudt@hotmail.com, feipingnie@gmail.com, dacheng.tao@uts.edu.au

Abstract

Vanishing Component Analysis (VCA) is a recently proposed and prominent work in machine learning. It narrows the gap between a tool from computational algebra, the vanishing ideal, and its application to classification problems. In this paper, we analyze VCA from the kernel view, another important research direction in machine learning. Under a very weak assumption, we provide a different point of view on VCA and make the kernel trick applicable to it. We demonstrate that the projection matrix derived by VCA lies in the same space as that of Kernel Principal Component Analysis (KPCA) with a polynomial kernel, and that the two groups of projections can be expressed in terms of each other by linear transformations. Furthermore, we prove that KPCA and VCA have identical discriminative power, provided that the ratio trace criterion is employed as the measurement. We also show that the kernel formed by the inner products of VCA's projections can be expressed linearly in terms of KPCA's kernel. Based on this analysis, we propose a novel Discriminative Vanishing Component Analysis (DVCA) approach. Experimental results are provided for demonstration.

Introduction

Feature extraction is one of the most important topics in machine learning. Primarily, it addresses the problem of finding the most relevant and informative set of features. It has been commonly recognized that the effectiveness of feature extraction has a great influence on subsequent procedures such as classification and clustering (Guyon and Elisseeff 2003). Because of its importance, feature extraction still attracts a great deal of research effort, although it is a traditional topic with a large body of literature.

In the literature, there are mainly two distinct ways to perform feature extraction. The first kind of approach operates in the original data space, using statistical characteristics or similarity measurements of the original data. Typical statistics-based methods include ReliefF (Robnik-Sikonja and Kononenko 2003), Fisher Score (Koller and Sahami 1996) and Laplacian Score (He, Cai, and Niyogi 2006). Similarity-based methods include linear methods, e.g., Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) (Bishop 2006), as well as nonlinear dimensionality reduction approaches (Van der Maaten, Postma, and den Herik 2009). The second way is to transform the original data into another space first. For example, kernel methods characterize the original data in a Reproducing Kernel Hilbert Space (RKHS). By the kernel trick, they need only the kernel matrix, without knowing the explicit mapping function from the original data space to the RKHS (Schölkopf and Smola 2001). Due to their effectiveness, kernel methods are regarded as one of the most important tools for data mining.
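As a small numerical illustration of the kernel trick just mentioned (our own check, not part of the original text): for the degree-2 polynomial kernel used later in this paper, the kernel value $(1 + x^T y)^2$ equals the inner product of explicit scaled-monomial feature maps, so an algorithm never needs $\phi(\cdot)$ itself.

```python
import numpy as np

# Tiny check of the kernel trick for the degree-2 polynomial kernel:
# (1 + x.y)^2 equals the inner product of the explicit scaled-monomial map.
def phi_deg2(x):                       # x in R^2 -> R^6
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, y = np.array([0.3, -1.0]), np.array([2.0, 0.5])
print((1.0 + x @ y) ** 2)              # kernel value, 1.21
print(phi_deg2(x) @ phi_deg2(y))       # identical, via the explicit feature map
```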
Compared with the first kind of approach, the second type of method is more flexible in characterizing the original data and has attracted considerable research interest. Recently, Livni et al. proposed a novel approach named Vanishing Component Analysis (VCA) (Livni et al. 2013), which belongs to this second category. VCA uses the concept of the vanishing ideal from computational algebra for data transformation and feature extraction, solving the feature extraction problem in a distinctive way. It has a sound theoretical foundation and great practical value. It not only provides a new perspective on feature extraction in machine learning, but also contributes to the mathematical study of vanishing ideals.

In this paper, we analyze this prominent work from the kernel view. Under a weak assumption, we prove that the projection matrix derived by VCA lies in the same space as that of Kernel Principal Component Analysis (KPCA) (Schölkopf, Smola, and Müller 1998) with a polynomial kernel applied to each category. The explicit linear transformation matrices between the two groups of transformed data points are derived. Furthermore, we demonstrate that the different mapping results of KPCA and VCA have equal discriminative power when the ratio trace criterion of traditional LDA is adopted as the measurement. The linear relations between the kernels of KPCA and VCA are also revealed. Based on this analysis, we propose a new Discriminative Vanishing Component Analysis (DVCA) approach.

The main contributions of this paper are: (1) We analyze VCA from the kernel view. The results reveal the relationship between two prominent lines of research, kernel methods and VCA, provide a new perspective on VCA, and deepen the understanding of the relations between these approaches. (2) We propose an improved approach, DVCA. Compared with VCA, it takes the discriminative information among different classes into consideration, which VCA does not. Experimental results are provided to show its advantages.

VCA Revisited

VCA is a supervised approach that provides a new perspective on feature extraction and tightens the relationship between the concept of the vanishing ideal in algebra and its application in machine learning. Let us first introduce some notation and basic definitions. Assume that $X_i = [x_1^{(i)}, x_2^{(i)}, \ldots, x_{n_i}^{(i)}]$ represents the data points in the $i$th category and $X = [X_1, X_2, \ldots, X_c]$ contains all data points in the original $D$-dimensional space. Here, $n_i$ is the number of points in class $i$, $c$ is the number of classes, and $n = \sum_{i=1}^{c} n_i$ is the total number of training points.

The set of all polynomials $f(x)$ that satisfy $f(x) = 0$ for every $x \in S$, where $S$ is a data set, is known as the vanishing ideal of $S$ and is denoted by $I(S)$. If a set of polynomials can generate $I(S)$ through the common operations between them, this set is called a set of generators of $I(S)$, and the elements of such a finite set of generators are named vanishing components. The basic idea of VCA is to find a finite set of generators, or vanishing components, for the data points in each category. It then combines the vanishing components of all categories and uses this combination as the transformation function for all data points.
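As a toy illustration of this idea (our own example, not taken from Livni et al. 2013): for a class whose points lie on the unit circle, the polynomial $x_1^2 + x_2^2 - 1$ vanishes on that class, and it can be recovered from the null space of a monomial evaluation matrix. The sketch below uses a plain SVD; the actual VCA procedure of Livni et al. constructs the generators degree by degree.

```python
import numpy as np

# Toy sketch (not the actual VCA procedure): recover a degree-<=2 polynomial
# that vanishes on one class of 2-D points, here points on the unit circle,
# from the null space of the evaluation matrix of all monomials of degree <= 2.
theta = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
X_i = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # class i

def monomials_deg2(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], axis=1)

A = monomials_deg2(X_i)                      # n_i x 6 evaluation matrix
_, s, Vt = np.linalg.svd(A)
coeffs = Vt[s < 1e-8]                        # coefficients of vanishing polynomials

print(np.round(coeffs, 3))                   # proportional to x1^2 + x2^2 - 1
print(np.abs(monomials_deg2(2 * X_i) @ coeffs.T).max())  # nonzero off the circle
```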
Intuitively, if a function is a vanishing component of the vanishing ideal of $X_i$, then it attains zero values on the points belonging to the $i$th class, but is not necessarily zero on data points of other categories. In this way, VCA extracts discriminative information from the data, and consequently this kind of representation is convenient for the subsequent classification task. In (Livni et al. 2013), the authors provide a fast and effective way to fulfill this goal, which can also be regarded as notable progress in the mathematical field of vanishing ideal computation. See that paper for more details about VCA.

VCA vs KPCA

Since VCA focuses on finding polynomials with the vanishing property, we directly investigate polynomial kernels. Recalling results in kernel theory, the polynomial kernel element computed by $(1 + x^T y)^d$ can be formulated as the inner product of two vectors $\phi(x)$ and $\phi(y)$, where $\phi(\cdot)$ is the mapping function corresponding to this kernel. If we use the polynomial kernel with degree $d$, previous results show that $\phi(\cdot)$ can be derived explicitly: its components are scaled versions of all monomials of degree no larger than $d$ (Schölkopf and Smola 2001). Since we only deal with linear transformations, we use the (unscaled) monomials of degree no larger than $d$ for simplicity in the following derivation; this does not affect the theoretical results.

KPCA with a polynomial kernel can be regarded as performing traditional PCA on the mapped data. The kernel trick is an elegant way to avoid explicit manipulation in the high-dimensional space: using it, we only need the kernel and it is not necessary to know the exact mapping function $\phi(\cdot)$. In the following, unless otherwise specified, the kernel that KPCA uses is polynomial.

Based on Proposition 3.1, Theorem 3.2, and the procedure of VCA in (Livni et al. 2013), for the $i$th class VCA aims to find the generators of the null space of $\Phi_i$, where $\Phi_i = [\phi(x_1^{(i)}), \phi(x_2^{(i)}), \ldots, \phi(x_{n_i}^{(i)})]$ consists of the mapped data points of the $i$th category under the above $\phi(\cdot)$. Denote by $S_i$ the matrix formed by vectors that span the null space of $\Phi_i$. Then $S_i$ is the $i$th group of projecting vectors of VCA, satisfying $S_i^T \Phi_i = 0$.

Before analyzing VCA in the kernel view, we would like to explain the intuition of our work. Theorem 3.2 in (Livni et al. 2013) argues that the kernel trick cannot be used to find the vanishing components. The reason is that the authors only use the kernel feature maps of the training points on which the polynomials should vanish. To overcome this problem, we construct vanishing components from linear combinations of kernel feature maps of all training points across classes.

To analyze this problem, we first reveal the intrinsic relationship between KPCA and VCA, and in particular between the linear spaces in which they live. Denote $\Phi = [\Phi_1, \Phi_2, \ldots, \Phi_c]$. Commonly, the dimensionality of the mapped data is very high, so it is reasonable to assume that the columns of $\Phi$ are linearly independent. Besides, we can neglect the projections derived by $S_i$ that lie in the null space of all data points, because they map every data point to zero, and eliminating these all-zero dimensions makes no difference.
For this reason, without loss of generality, we assume that $\mathrm{span}(S_i) \subseteq \mathrm{span}(\Phi)$, where $\mathrm{span}(\cdot)$ denotes the space spanned by the columns of the corresponding matrix. Under these weak assumptions, the space spanned by the projection matrix of VCA is the same as the space spanned by all mapped data points.

Theorem 1. Assume that the columns of $\Phi$ are linearly independent. Then $\mathrm{span}([S_1, S_2, \ldots, S_c]) = \mathrm{span}(\Phi)$.

Proof. For any $i \neq j$, we prove that $\mathrm{span}([S_i, S_j]) = \mathrm{span}(\Phi)$. On one hand, $\mathrm{span}(S_i) \subseteq \mathrm{span}(\Phi)$ for $i = 1, 2, \ldots, c$, hence $\mathrm{span}([S_i, S_j]) \subseteq \mathrm{span}(\Phi)$. On the other hand, any vector $\xi \in \mathrm{span}(\Phi_i)$ is not orthogonal to $\mathrm{span}(S_j)$; otherwise $\xi \in \mathrm{span}(\Phi_j)$, i.e., there would be a vector lying in both $\mathrm{span}(\Phi_i)$ and $\mathrm{span}(\Phi_j)$, which contradicts the assumption that the columns of $\Phi$ are linearly independent. Moreover, if $\mathrm{span}([S_i, S_j]) \neq \mathrm{span}(\Phi)$, there is a vector $\xi \in \mathrm{span}(\Phi)$ that is orthogonal to $\mathrm{span}([S_i, S_j])$. This implies that $\xi \in \mathrm{span}(\Phi_i)$ and $\xi$ is orthogonal to $\mathrm{span}(S_j)$, which contradicts the statement above.

We now show that the projections derived by KPCA and VCA are related by a linear transformation. Based on the above theorem, $[S_1, S_2, \ldots, S_c]$ can be written as

$$[S_1, S_2, \ldots, S_c] = \Phi G, \qquad (1)$$

where $G$ is the matrix of coefficients. It has full row rank since $\mathrm{rank}([S_1, S_2, \ldots, S_c]) = \mathrm{rank}(\Phi)$. Assume that the SVD of $\Phi$ is $\Phi = U \Sigma V^T$ and let $B = G^T V \Sigma$, where we keep only the nonzero singular values and the corresponding singular vectors; this is referred to as the thin SVD in the following. Denote the projection results derived by KPCA and VCA as $P$ and $\tilde{P}$ respectively. Then $P = U^T \Phi$ and $\tilde{P} = [S_1, S_2, \ldots, S_c]^T \Phi$. Their relationship is given in Theorem 2.

Theorem 2. $\tilde{P} = B P$ and $P = (B^T B)^{-1} B^T \tilde{P}$.

Proof. Note that

$$\tilde{P} = [S_1, S_2, \ldots, S_c]^T \Phi = (\Phi G)^T \Phi = (U \Sigma V^T G)^T \Phi = G^T V \Sigma U^T \Phi = B P. \qquad (2)$$

Recalling the definition $B = G^T V \Sigma$ and the fact that $G$ has full row rank, $B$ has full column rank; in other words, $B^T B$ is invertible. Then

$$\tilde{P} = B P \;\Rightarrow\; B^T \tilde{P} = B^T B P \;\Rightarrow\; P = (B^T B)^{-1} B^T \tilde{P}. \qquad (3)$$

From Theorem 1 we know that these two projections share the same linear space, so it is not surprising that they can be expressed in terms of each other by linear transformations. Intuitively, they should have similar discriminative power. The following results show that the discriminative power of the two projections is in fact equal, provided that the ratio trace criterion of LDA is employed as the measurement. We use the ratio trace criterion because it is one of the most popular metrics in supervised learning. Before going into the details, we provide some lemmas.

Lemma 1. For any matrix $A$, if $U^T U = I$, then $(U A U^T)^{+} = U A^{+} U^T$, where $(\cdot)^{+}$ denotes the Moore-Penrose inverse of a matrix.

Proof. Assume that the thin SVD of $A$ is $A = U_A \Sigma_A V_A^T$. Then

$$(U A U^T)^{+} = (U U_A \Sigma_A V_A^T U^T)^{+} = \big((U U_A) \Sigma_A (U V_A)^T\big)^{+} = (U V_A) \Sigma_A^{-1} (U U_A)^T = U (V_A \Sigma_A^{-1} U_A^T) U^T = U A^{+} U^T. \qquad (4)$$

Lemma 2. Assume that $A$ is an invertible matrix and $B$ is a matrix with full column rank. Then $(B A B^T)^{+} = (B^T)^{+} A^{-1} B^{+}$.

Proof. Assume that the thin SVD of $B$ is $B = U_B \Sigma_B V_B^T$. Then

$$(B A B^T)^{+} = (U_B \Sigma_B V_B^T A V_B \Sigma_B U_B^T)^{+} = U_B (\Sigma_B V_B^T A V_B \Sigma_B)^{+} U_B^T = U_B (\Sigma_B V_B^T A V_B \Sigma_B)^{-1} U_B^T = U_B \Sigma_B^{-1} V_B^T A^{-1} V_B \Sigma_B^{-1} U_B^T = (B^T)^{+} A^{-1} B^{+}. \qquad (5)$$

In the above derivation, the second equality holds by Lemma 1, and the third holds since $B$ has full column rank.

Lemma 3. If $B$ is a matrix with full column rank, then $B^T (B^T)^{+} = B^{+} B = I$.

Proof. Assume that the thin SVD of $B$ is $B = U_B \Sigma_B V_B^T$.
Then

$$B^T (B^T)^{+} = V_B \Sigma_B U_B^T U_B \Sigma_B^{-1} V_B^T = V_B V_B^T = I, \qquad B^{+} B = V_B \Sigma_B^{-1} U_B^T U_B \Sigma_B V_B^T = V_B V_B^T = I. \qquad (6)$$

Lemma 4. Assume that $A$ is an invertible matrix and $B$ is a matrix with full column rank. For any matrix $C$, we have $\mathrm{Tr}\big((B A B^T)^{+} B C B^T\big) = \mathrm{Tr}(A^{-1} C)$, where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.

Proof. Recalling Lemma 2 and Lemma 3, we have

$$\mathrm{Tr}\big((B A B^T)^{+} B C B^T\big) = \mathrm{Tr}\big((B^T)^{+} A^{-1} B^{+} B C B^T\big) = \mathrm{Tr}\big(A^{-1} B^{+} B C B^T (B^T)^{+}\big) = \mathrm{Tr}(A^{-1} C). \qquad (7)$$

Based on the above four lemmas, we state our main result on the relationship between KPCA and VCA in terms of discriminative power.

Theorem 3. Recall that the projections derived by KPCA and VCA are $P$ and $\tilde{P}$ respectively. If we adopt the same ratio trace criterion as in traditional LDA for measurement, then $P$ and $\tilde{P}$ have the same discriminative power. In other words,

$$\mathrm{Tr}\big((\tilde{P} L_w \tilde{P}^T)^{+} (\tilde{P} L_b \tilde{P}^T)\big) = \mathrm{Tr}\big((P L_w P^T)^{-1} (P L_b P^T)\big), \qquad (8)$$

where $L_w$ and $L_b$ are the matrices associated with the within-class and between-class scatters, defined as $L_w = I - H (H^T H)^{-1} H^T$ and $L_b = H (H^T H)^{-1} H^T - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$. Here $H \in \{0, 1\}^{n \times c}$ is an indicator matrix, i.e., $H_{ij} = 1$ if $x_i$ belongs to the $j$th class and $H_{ij} = 0$ otherwise.

Proof. Recalling Theorem 2, $\tilde{P} = B P$ and $B$ has full column rank. Based on Lemma 4,

$$\mathrm{Tr}\big((\tilde{P} L_w \tilde{P}^T)^{+} (\tilde{P} L_b \tilde{P}^T)\big) = \mathrm{Tr}\big((B P L_w P^T B^T)^{+} (B P L_b P^T B^T)\big) = \mathrm{Tr}\big((P L_w P^T)^{-1} (P L_b P^T)\big). \qquad (9)$$

The second equality holds since $P L_w P^T$ is an invertible matrix, as in traditional LDA.

Finally, we reveal the relationship between the kernels of KPCA and VCA. Interestingly, the kernel formed by the inner products of VCA's projections can be expressed linearly in terms of that of KPCA.

Theorem 4. Let the projections derived by KPCA and VCA be $P$ and $\tilde{P}$ respectively, with corresponding kernels $K = P^T P$ and $\tilde{K} = \tilde{P}^T \tilde{P}$. Then $\tilde{K} = K Q = R K$, where $Q = G G^T \Phi^T \Phi$ and $R = \Phi^T \Phi G G^T$.

Proof. Since $\Phi = U \Sigma V^T$ and $P = U^T \Phi$, we have

$$K = P^T P = \Phi^T U U^T \Phi = V \Sigma U^T U U^T U \Sigma V^T = V \Sigma U^T U \Sigma V^T = \Phi^T \Phi. \qquad (10)$$

Since $[S_1, S_2, \ldots, S_c] = \Phi G$,

$$\tilde{K} = \tilde{P}^T \tilde{P} = \big([S_1, \ldots, S_c]^T \Phi\big)^T \big([S_1, \ldots, S_c]^T \Phi\big) = \big((\Phi G)^T \Phi\big)^T (\Phi G)^T \Phi = \Phi^T \Phi G G^T \Phi^T \Phi = K Q = R K.$$

In summary, we have provided theoretical results that reveal the intrinsic relationship between KPCA and VCA from several aspects.

Why the Kernel Can Help

We now explain why the kernel can help. In VCA, the authors proved the following result.

Theorem 5. (Livni et al. 2013) Let $K_i$ be the reproducing kernel and let $f \in \mathrm{span}\big(K_i(\cdot, x_j^{(i)})\big)$ be a function that vanishes on all $x_j^{(i)}$ for $j = 1, 2, \ldots, n_i$. Then $f$ is the zero function.

In this theorem, the function is assumed to have the form $f \in \mathrm{span}\big(K_i(\cdot, x_j^{(i)})\big)$; it involves only the points of the $i$th class. Different from this setting, we assume that the function satisfies $f \in \mathrm{span}\big(K(\cdot, x_j)\big)$, formulated on all training points. In other words, Livni et al. only use the kernel feature maps of the training points on which the polynomials should vanish, while we construct vanishing components from linear combinations of kernel feature maps of all training points across classes, which is the common setting in most kernel learning algorithms (Schölkopf and Smola 2001). Thus, our results do not conflict with Theorem 5 of Livni et al., but provide a different point of view on VCA and make the kernel trick applicable to it. Indeed, if we constrain the function to involve only the $i$th kernel $K_i$, it is not surprising that a function $f$ with the vanishing property is the zero function, since it lies in the null space of $K_i$.
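To make the preceding paragraph concrete, here is a minimal numpy sketch (our own illustration, with arbitrary synthetic data, kernel degree and class sizes) of building functions $f(\cdot) = \sum_t g_t K(\cdot, x_t)$ from the kernel feature maps of all training points so that they vanish on one class; the normalization constraint $G_i^T K G_i = I$ is omitted for brevity. The next section formalizes this construction.

```python
import numpy as np

# Sketch: vanishing functions built from kernel feature maps of ALL training
# points, f(.) = sum_t g_t k(., x_t), with k the polynomial kernel.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (8, 5)),      # class 0
               rng.normal(3.0, 1.0, (8, 5))])     # class 1
labels = np.array([0] * 8 + [1] * 8)
d = 3                                             # feature dim C(8,3)=56 >= n=16, so the
K = (1.0 + X @ X.T) ** d                          # columns of Phi are (generically) independent

i = 0
Ki = K[:, labels == i]                            # K_i = Phi^T Phi_i  (n x n_i)
# f vanishes on class i  <=>  K_i^T g = 0, i.e. g lies in the null space of K_i^T.
rank = np.linalg.matrix_rank(Ki.T)
_, _, Vt = np.linalg.svd(Ki.T)
G = Vt[rank:].T                                   # columns span that null space

f_vals = K @ G                                    # each column: one f on all points
print(np.abs(f_vals[labels == i]).max())          # ~0: vanishes on class i
print(np.abs(f_vals[labels != i]).max())          # large: non-vanishing elsewhere
```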
Discriminative Vanishing Component Analysis

In this section, we first reformulate VCA in kernel form based on the above analysis. Then, by adding a discriminative objective function, we present our Discriminative Vanishing Component Analysis (DVCA) approach.

Kernel Formulation of VCA

Based on the above explanations and the results in Theorem 1, we can reformulate VCA as finding $\{S_i\}$ that satisfy

$$S_i^T \Phi_i = 0, \quad S_i^T S_i = I, \quad S_i = \Phi G_i, \qquad (11)$$

where $G_i$ consists of the columns of $G$ that form $S_i$ in Eq. (1). Recall that Hilbert's basis theorem (Cox, Little, and O'Shea 2007) guarantees the existence of the vanishing components. Considering the results in Theorem 1, we can approximate the problem in Eq. (11) as

$$\min_{S_1, S_2, \ldots, S_c} \; \sum_{i=1}^{c} \sum_{j=1}^{n_i} \big\| S_i^T \phi(x_j^{(i)}) \big\|_2^2 \quad \text{s.t.}\;\; S_i = \Phi G_i,\; S_i^T S_i = I,\; i = 1, 2, \ldots, c, \qquad (12)$$

where $\|\cdot\|_2$ denotes the 2-norm of a vector. Commonly, the objective function can attain zero, since the dimension of $\phi(x_j^{(i)})$ is usually much larger than the number of points in the $i$th category. This problem can be separated into $c$ sub-problems. For the $i$th sub-problem, the objective function can be represented by the kernel $K = \Phi^T \Phi$ defined in Theorem 4:

$$\sum_{j=1}^{n_i} \big\| S_i^T \phi(x_j^{(i)}) \big\|_2^2 = \sum_{j=1}^{n_i} \big\| G_i^T \Phi^T \phi(x_j^{(i)}) \big\|_2^2 = \mathrm{Tr}\big(G_i^T \Phi^T \Phi_i \Phi_i^T \Phi G_i\big) = \mathrm{Tr}\big(G_i^T K_i K_i^T G_i\big), \qquad (13)$$

where $K_i$ is the part of $K$ formed by the inner products between the data points in $\Phi$ and $\Phi_i$, i.e., $K_i = \Phi^T \Phi_i$. Correspondingly, the constraints become

$$S_i^T S_i = I \;\Leftrightarrow\; G_i^T \Phi^T \Phi G_i = I \;\Leftrightarrow\; G_i^T K G_i = I. \qquad (14)$$

Discriminative Vanishing Component Analysis

In essence, VCA tries to find $S_i$ satisfying $S_i^T \Phi_i = 0$. In other words, it computes a projection matrix $S_i$ that maps the data points of the $i$th class to zero, but it does not consider the projections of points in other categories. Apparently, mapping the data points of the $i$th category close to zero while simultaneously mapping the points of the other categories as far from zero as possible will benefit the subsequent classification. In the literature, there are many approaches that consider the dissimilarities between samples from different categories, such as Linear Discriminant Analysis (LDA) and its kernel extension (Mika et al. 1999), and Biased Discriminative Analysis (BDA) (Zhou and Huang 2001). In this paper, we adopt the idea of BDA. By requiring that the mapped data points of different categories be as far from each other as possible, we arrive at our Discriminative Vanishing Component Analysis (DVCA) approach.

Mathematically, for the $i$th category, the discriminative requirement corresponds to the objective function

$$\sum_{j \neq i} \sum_{t=1}^{n_j} \Big\| S_i^T \Big( \phi(x_t^{(j)}) - \frac{1}{n_i} \sum_{s=1}^{n_i} \phi(x_s^{(i)}) \Big) \Big\|_2^2, \qquad (15)$$

which should be as large as possible: the mapped data from other categories should be far from the center of $\{\phi(x_s^{(i)})\}_{s=1}^{n_i}$. Denote by $e_i = [1, 1, \ldots, 1]^T \in \mathbb{R}^{n_i}$ the all-ones column vector of size $n_i$. Since $S_i = \Phi G_i$, the above function can be rewritten as

$$\sum_{j \neq i} \sum_{t=1}^{n_j} \mathrm{Tr}\Big( G_i^T \Phi^T \Big( \phi(x_t^{(j)}) - \frac{1}{n_i} \Phi_i e_i \Big) \Big( \phi(x_t^{(j)}) - \frac{1}{n_i} \Phi_i e_i \Big)^T \Phi G_i \Big). \qquad (16)$$

Denote by $\bar{\Phi}_i$ the matrix formed by the points that do not belong to the $i$th category, by $\bar{n}_i$ the number of its columns, and by $f_i = [1, 1, \ldots, 1]^T \in \mathbb{R}^{\bar{n}_i}$ the all-ones column vector of size $\bar{n}_i$. Then the above objective function can be reformulated as follows.
$$\mathrm{Tr}\big(G_i^T \Phi^T \bar{\Phi}_i \bar{\Phi}_i^T \Phi G_i\big) + \frac{\bar{n}_i}{n_i^2}\, \mathrm{Tr}\big(G_i^T \Phi^T \Phi_i e_i e_i^T \Phi_i^T \Phi G_i\big) - \frac{2}{n_i}\, \mathrm{Tr}\big(G_i^T \Phi^T \Phi_i e_i f_i^T \bar{\Phi}_i^T \Phi G_i\big). \qquad (17)$$

Likewise, denote by $\bar{K}_i$ the part of $K$ formed by the inner products between the data points in $\Phi$ and $\bar{\Phi}_i$, i.e., $\bar{K}_i = \Phi^T \bar{\Phi}_i$. The formulation in Eq. (17) becomes

$$\mathrm{Tr}\Big(G_i^T \Big[ \bar{K}_i \bar{K}_i^T + \frac{\bar{n}_i}{n_i^2} K_i e_i e_i^T K_i^T - \frac{2}{n_i} K_i e_i f_i^T \bar{K}_i^T \Big] G_i\Big). \qquad (18)$$

By combining the objective functions in Eq. (13) and Eq. (18), and using the constraints in Eq. (14), we obtain the formulation of DVCA:

$$\min_{G_1, \ldots, G_c} \; \sum_{i=1}^{c} \Big\{ \mathrm{Tr}\big(G_i^T K_i K_i^T G_i\big) - \lambda\, \mathrm{Tr}\Big(G_i^T \Big[ \bar{K}_i \bar{K}_i^T + \frac{\bar{n}_i}{n_i^2} K_i e_i e_i^T K_i^T - \frac{2}{n_i} K_i e_i f_i^T \bar{K}_i^T \Big] G_i\Big) \Big\} \quad \text{s.t.}\;\; G_i^T K G_i = I, \qquad (19)$$

where $\lambda$ is a parameter that balances the two factors, i.e., the vanishing property and the discriminative requirement.

The optimization problem of DVCA in Eq. (19) can be divided into $c$ sub-problems; in the following derivation we therefore focus on the $i$th sub-problem. Denote the eigen-decomposition of $K$ as $K = \Gamma \Lambda \Gamma^T$, where $\Lambda$ consists of all the nonzero eigenvalues, and denote by $\bar{\Gamma}$ the matrix formed by vectors that span the null space of $K$. $G_i$ can be expressed as $G_i = \Gamma W_i + \bar{\Gamma} F_i$. For convenience, denote $M = \bar{K}_i \bar{K}_i^T + \frac{\bar{n}_i}{n_i^2} K_i e_i e_i^T K_i^T - \frac{2}{n_i} K_i e_i f_i^T \bar{K}_i^T$. Then the objective function and constraint in Eq. (19) become

$$\mathrm{Tr}\big(G_i^T (K_i K_i^T - \lambda M) G_i\big) = \mathrm{Tr}\big((\Gamma W_i + \bar{\Gamma} F_i)^T (K_i K_i^T - \lambda M)(\Gamma W_i + \bar{\Gamma} F_i)\big) = \mathrm{Tr}\big(W_i^T \Gamma^T (K_i K_i^T - \lambda M) \Gamma W_i\big), \qquad (20)$$

$$G_i^T K G_i = I \;\Leftrightarrow\; (\Gamma W_i + \bar{\Gamma} F_i)^T K (\Gamma W_i + \bar{\Gamma} F_i) = I \;\Leftrightarrow\; W_i^T \Gamma^T K \Gamma W_i = I \;\Leftrightarrow\; W_i^T \Lambda W_i = I. \qquad (21)$$

The above two equations hold since $\bar{\Gamma}$ lies in the null space of $K$, and $K_i$ and $\bar{K}_i$ are parts of $K$. Furthermore, $\Lambda$ is invertible since it contains only the nonzero eigenvalues. Denote $\tilde{W}_i = \Lambda^{1/2} W_i$. Combining the objective in Eq. (20) and the constraint in Eq. (21), DVCA can be reformulated as

$$\min_{\tilde{W}_i} \; \mathrm{Tr}\big(\tilde{W}_i^T \Lambda^{-1/2} \Gamma^T (K_i K_i^T - \lambda M) \Gamma \Lambda^{-1/2} \tilde{W}_i\big) \quad \text{s.t.}\;\; \tilde{W}_i^T \tilde{W}_i = I. \qquad (22)$$

The optimal solution to this problem can be derived by eigen-decomposition. After obtaining the optimal $\tilde{W}_i$, we compute $W_i = \Lambda^{-1/2} \tilde{W}_i$ and $G_i = \Gamma W_i$. We drop $\bar{\Gamma} F_i$ since it lies in the null space of $K$ and contributes no discriminative information. Another issue is how many eigenvectors should be selected to form $\tilde{W}_i$; recalling the intuition of VCA, in our algorithm we keep only the eigenvectors corresponding to the negative and zero eigenvalues.

Finally, as in VCA, the $i$th group of projections of the training and testing points can also be computed by the kernel trick. The $i$th projection of any training point $x_j^{(t)}$ can be calculated by

$$S_i^T \phi(x_j^{(t)}) = G_i^T \Phi^T \phi(x_j^{(t)}) = G_i^T K_j^{(t)}, \qquad (23)$$

where $K_j^{(t)}$ is the column of $K$ corresponding to $\phi(x_j^{(t)})$. Similarly, for any testing point $y$, its $i$th projection can be computed by

$$G_i^T \Phi^T \phi(y) = G_i^T K_y, \qquad (24)$$

where $K_y$ is an $n$-dimensional column vector whose elements are $(1 + x^T y)^d$ for each training point $x$. After calculating all the projections, we simply concatenate these $c$ groups of projections to form the new representation of $y$, as in VCA. The procedure of DVCA is listed in Algorithm 1.

Algorithm 1: Discriminative Vanishing Component Analysis (DVCA)
Run:
  Calculate the polynomial kernel matrix $K$ from the training points $x_j^{(i)}$.
  Calculate the polynomial kernel $K_y$ between $\{x_j^{(i)}\}$ and $y$ if a testing point is available.
  for $i = 1 : c$ do
    Compute $\tilde{W}_i$ by solving Eq. (22).
    Compute $W_i = \Lambda^{-1/2} \tilde{W}_i$ and $G_i = \Gamma W_i$.
    Compute the $i$th group of projections for the training points and the testing point (if available) using the kernel trick, as in Eq. (23) and Eq. (24).
  end for
  Formulate the new representations from all the projections, as in VCA.
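Under our reading of the derivation above (with the $\lambda$ term subtracted in Eq. (19), which is what makes the negative eigenvalues in the selection rule meaningful), the per-class sub-problem can be sketched in a few lines of numpy. This is our own illustration, not the authors' implementation; $M$ is written in its symmetrized form, which leaves the trace objective unchanged.

```python
import numpy as np

def dvca_class_projection(K, labels, i, lam=1.0, tol=1e-8):
    """Sketch of the per-class DVCA sub-problem, Eq. (19)-(22)."""
    in_i = (labels == i)
    Ki, Kbar = K[:, in_i], K[:, ~in_i]          # K_i = Phi^T Phi_i, Kbar_i = Phi^T Phibar_i
    ni, nbar = Ki.shape[1], Kbar.shape[1]
    u, v = Ki @ np.ones(ni), Kbar @ np.ones(nbar)

    # Discriminative matrix M of Eq. (18), symmetrized: the cross term
    # -(2/n_i) K_i e f^T Kbar_i^T has the same trace as (uv^T + vu^T)/n_i.
    M = Kbar @ Kbar.T + (nbar / ni**2) * np.outer(u, u) - (np.outer(u, v) + np.outer(v, u)) / ni

    # Thin eigen-decomposition K = Gamma Lambda Gamma^T (nonzero eigenvalues only).
    w, V = np.linalg.eigh(K)
    keep = w > tol * w.max()
    Lam, Gamma = w[keep], V[:, keep]

    # Whitened problem of Eq. (22); keep eigenvectors with non-positive eigenvalues.
    GL = Gamma / np.sqrt(Lam)                   # Gamma Lambda^{-1/2}
    A = GL.T @ (Ki @ Ki.T - lam * M) @ GL
    A = (A + A.T) / 2                           # guard against round-off asymmetry
    evals, W = np.linalg.eigh(A)
    Wt = W[:, evals <= tol * np.abs(evals).max()]
    return GL @ Wt                              # G_i = Gamma Lambda^{-1/2} W~_i

# Usage: K = (1 + X @ X.T) ** d on all training points; training projections are
# Gi.T @ K and a test point's projection is Gi.T @ K_y, as in Eq. (23)-(24).
```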
Computational Complexity Comparison

In the training process, the most time-consuming step of VCA is the decomposition of a matrix $A \in \mathbb{R}^{n_i \times l}$, where $l$ is the number of candidate functions; this number is no larger than $|F|^2 \min\{|F|, s\}$, where $|F| \leq n_i$ and $s$ is the total number of monomials of degree less than $d$. Correspondingly, the most time-consuming step of DVCA is the eigen-decomposition of an $n \times n$ matrix. Since eigen-decomposition and SVD have similar computational complexity, and different implementations may have different time costs, it is difficult to compare the time efficiency of the two methods directly. In the testing process, VCA only involves evaluating the testing points on the generators, so its computational time is linear in the number of generators. With DVCA, we need to compute the kernel between the testing points and the training points, which commonly takes more time than VCA, especially when the data set is large.

Experimental Results

We provide two groups of experimental results for illustration. VCA is a supervised method, so the first group contains classification results; the second concerns computational time. As the emphases of this paper are the kernel view of VCA and the improvement of VCA, we only compare DVCA, KPCA and VCA. VCA is implemented with the code provided by the authors (http://www.cs.huji.ac.il/~rlivni73/). There are six data sets collected from different real applications, including DNA data, images and voice data: DNA, ORL, VOWEL, VEHICLE, COIL20 and ISOLET5. All data sets are downloaded from open sources (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and http://www.cad.zju.edu.cn/home/dengcai/Data/data.html). Their dimensionality ranges from about 10 to 1000, and all data are scaled as suggested by the providers. As with VCA, we use KPCA and DVCA for feature extraction and use their projections for classification. The linear classifier is trained with the LibSVM software (Chang and Lin 2011). The common parameter $d$ is tuned by 5-fold cross validation; in DVCA, $\lambda$ is also determined by cross validation.

Classification Accuracy

The original data are randomly split into training and testing samples: we select a fixed number of points for training and use the rest for testing. In VCA, KPCA and DVCA, we use the training samples to derive the transformation; a linear SVM is then trained and tested on the transformed data. Over 100 independent runs, the mean classification accuracies with different numbers of training samples are shown in Figure 1.

Figure 1: Classification results of the different methods (KPCA+SVM, VCA+SVM, DVCA+SVM) on six data sets with different numbers of training points. (a) DNA (b) ORL (c) VOWEL (d) VEHICLE (e) COIL20 (f) ISOLET5.

From Figure 1, we make the following observations. (1) DVCA performs better than VCA in almost all cases. This is mainly because we take the dissimilarities between data points of different classes into account by adding the objective function of BDA shown in Eq. (17). (2) As the number of training points increases, the classification accuracy also increases, as is often the case. (3) Although the projections of KPCA and VCA are related by a linear transformation, their classification results differ. The reason may be that we evaluate performance by the classification accuracy of a linear SVM, and a linear SVM does not produce identical results when its inputs differ only by a linear transformation.
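For reference, the KPCA+SVM branch of this protocol can be sketched with scikit-learn as follows; this is our own paraphrase on synthetic data, and the estimator names and parameter values (KernelPCA, LinearSVC, degree, train size) are illustrative stand-ins rather than the authors' exact Matlab/LibSVM setup.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Fit the transformation on the training split only, then train and test a
# linear SVM on the transformed data; X, y stand in for a scaled benchmark set.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=100, random_state=0)

kpca = KernelPCA(kernel="poly", degree=2, gamma=1.0, coef0=1.0)  # (1 + x^T y)^2
Z_tr = kpca.fit_transform(X_tr)            # transformation derived from training data
Z_te = kpca.transform(X_te)

clf = LinearSVC(max_iter=10000).fit(Z_tr, y_tr)
print(accuracy_score(y_te, clf.predict(Z_te)))
```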
Computational Time Comparison

We tested the algorithms with a naive Matlab implementation on a workstation with 12 processors (3.33 GHz each) and 47.2 GB of memory. We separate the whole procedure into training and testing stages. The training stage includes computing the transformation and learning the linear SVM model; the testing stage includes projecting the testing points and classifying them with the linear SVM. We selected the three representative data sets with the largest sample size, dimensionality and number of classes, and report their time consumption for different numbers of training samples. The training and testing times are shown in Table 1.

From the results in Table 1, we make the following observations. (1) The computational time of the different methods is dominated by different factors; for example, when the number of classes is large, DVCA consumes more time. (2) The training time of VCA is largely determined by the number of training samples; for instance, on the DNA data, when the number of training points is large, the training time of VCA increases drastically. (3) The testing time of DVCA is comparable to that of VCA, especially when the number of training points is small.

Table 1: Computational time of training and testing the different methods on DNA, ORL and VEHICLE. For brevity, VCA refers to VCA for feature extraction followed by a linear SVM for classification, and similarly for DVCA.

| Data set | No. of training points | Training time (VCA) | Training time (DVCA) | Testing time (VCA) | Testing time (DVCA) |
|---|---|---|---|---|---|
| DNA | 300 | 52.1712 | 0.5531 | 1.0729 | 0.8531 |
| DNA | 600 | 61.0889 | 2.2633 | 2.9889 | 2.6550 |
| DNA | 900 | 82.6519 | 6.3136 | 3.6290 | 6.1642 |
| DNA | 1200 | 219.394 | 13.948 | 4.5547 | 9.3746 |
| DNA | 1600 | 115.349 | 19.788 | 4.5775 | 10.942 |
| ORL | 80 | 0.7246 | 0.2731 | 0.1404 | 0.1636 |
| ORL | 160 | 0.9686 | 1.7563 | 0.1248 | 0.6469 |
| ORL | 240 | 1.1515 | 4.3979 | 0.1088 | 1.0721 |
| ORL | 320 | 1.8888 | 9.0895 | 0.0782 | 1.3240 |
| VEHICLE | 40 | 0.0074 | 0.0108 | 0.0082 | 0.0078 |
| VEHICLE | 220 | 0.3042 | 0.2534 | 0.0644 | 0.0881 |
| VEHICLE | 400 | 2.5052 | 0.5893 | 0.1098 | 0.1261 |
| VEHICLE | 580 | 3.4999 | 1.1103 | 0.1108 | 0.1367 |
| VEHICLE | 760 | 7.3911 | 1.6921 | 0.0538 | 0.0636 |

Discussion and Further Work

The insight into VCA has tightened the relationship between results in algebra and their applications in machine learning. There are also related works concerning the computation of the vanishing ideal of a data set, such as the Approximately Vanishing Ideal (AVI) algorithm (Heldt et al. 2009); nevertheless, it is limited to functions with a small number of monomials. Kernel methods have attracted plenty of research interest and the results are fruitful. Our paper aims to deepen the understanding of VCA from the well-known kernel view.
It facilitates the understanding and use of VCA from both theoretical and practical aspects. Some further work remains. In theory, we have eliminated the null space of $\Phi$, and it is still interesting to investigate its influence: although the null space of the training samples has no influence on training the classifier, it does affect the testing process. In practice, although we have proposed DVCA to improve VCA, other problems remain to be studied, such as how to reduce the computational complexity and how to determine the parameter.

Acknowledgments

This work was supported by NSF of China (No. 61005003 and 60975038) and Australian Research Council Projects DP140102164 and FT-130101457.

References

Bishop, C. M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3):1-27.

Cox, D. A.; Little, J.; and O'Shea, D. 2007. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra (Undergraduate Texts in Mathematics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Guyon, I., and Elisseeff, A. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3:1157-1182.

He, X.; Cai, D.; and Niyogi, P. 2006. Laplacian score for feature selection. In NIPS 18, 507-514.

Heldt, D.; Kreuzer, M.; Pokutta, S.; and Poulisse, H. 2009. Approximate computation of zero-dimensional polynomial ideals. Journal of Symbolic Computation 44(11):1566-1591.

Koller, D., and Sahami, M. 1996. Toward optimal feature selection. In ICML, 284-292.

Livni, R.; Lehavi, D.; Schein, S.; Nachliely, H.; Shalev-Shwartz, S.; and Globerson, A. 2013. Vanishing component analysis. In ICML, 597-605.

Mika, S.; Rätsch, G.; Weston, J.; Schölkopf, B.; and Müller, K.-R. 1999. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the IEEE Signal Processing Society Workshop, 41-48.

Robnik-Sikonja, M., and Kononenko, I. 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53(1-2):23-69.

Schölkopf, B., and Smola, A. J. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press.

Schölkopf, B.; Smola, A.; and Müller, K.-R. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5):1299-1319.

Van der Maaten, L.; Postma, E.; and den Herik, H. V. 2009. Dimensionality reduction: A comparative review. Technical Report TiCC-TR 2009-005, Tilburg University.

Zhou, X. S., and Huang, T. S. 2001. Small sample learning during multimedia retrieval using BiasMap. In CVPR, 11-17.