# Pairwise-Covariance Linear Discriminant Analysis

Deguang Kong and Chris Ding
Department of Computer Science & Engineering, University of Texas, Arlington, 500 UTA Blvd, TX 76010
doogkong@gmail.com; chqding@uta.edu

## Abstract

In machine learning, linear discriminant analysis (LDA) is a popular dimension-reduction method. In this paper, we first provide a new, information-theoretic perspective on LDA. From this perspective, we propose a new formulation of LDA that uses the pairwise averaged class covariance instead of the globally averaged class covariance used in standard LDA. This pairwise (averaged) covariance describes the data distribution more accurately. The new perspective also provides a natural way to properly weight the different pairwise distances, emphasizing pairs of classes with small distances; this leads to the proposed pairwise-covariance, properly weighted LDA (pc-LDA). A kernel version of pc-LDA is presented to handle nonlinear projections. Efficient algorithms are presented to compute the proposed models.

## Introduction

In the big data era, large amounts of high-dimensional data (e.g., DNA microarrays, social blogs, image scenes) are available for analysis in many applications. Linear Discriminant Analysis (LDA) (Hastie, Tibshirani, and Friedman 2001) is one of the most popular dimension-reduction methods and has shown state-of-the-art performance. The key idea of LDA is to find an optimal linear transformation that projects the data into a low-dimensional space in which the data achieve maximum inter-class separability. The optimal solution to LDA is generally obtained by solving an eigenvalue problem.
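As a reference point for the pairwise formulation that follows, the standard LDA eigenvalue problem can be sketched as below. This is a minimal illustration, not the paper's pc-LDA: it forms the usual between-class scatter $S_b$ and within-class scatter $S_w$ and solves $S_b \mathbf{g} = \lambda S_w \mathbf{g}$; the function name and the small regularization term are choices made for this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y):
    """Standard LDA sketch: the top K-1 generalized eigenvectors of
    Sb g = lambda * Sw g give the projection G."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    p = X.shape[1]
    Sb = np.zeros((p, p))  # between-class scatter
    Sw = np.zeros((p, p))  # within-class scatter
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sb += len(Xk) * np.outer(mk - mu, mk - mu)
        Sw += (Xk - mk).T @ (Xk - mk)
    # Generalized symmetric eigenproblem; eigh returns ascending eigenvalues.
    # A small ridge keeps Sw positive definite (an implementation choice).
    w, V = eigh(Sb, Sw + 1e-8 * np.eye(p))
    G = V[:, ::-1][:, :len(classes) - 1]  # top K-1 discriminant directions
    return G
```

Note that averaging all classes into a single $S_w$ (and all pairs into $S_b$) is exactly the step the pairwise-covariance formulation below revisits.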
Despite the popularity and effectiveness of LDA, the standard model does not emphasize the individual pairwise class distances: it simply averages the metrics computed over the different class pairs (i.e., in computing the between-class scatter matrix Sb and the within-class scatter matrix Sw). Thus, some pairwise class distances are depressed, especially for those pairs whose original class distances are relatively large. To overcome this issue, in this paper we present a new formulation for pairwise linear discriminant analysis. To obtain a discriminant projection, the proposed method considers all pairwise between-class and within-class distances; we call it pairwise-covariance LDA (pc-LDA). The pc-LDA problem is then cast as an optimization problem that maximizes the class separability computed from the pairwise distances. An efficient algorithm is proposed to solve the resulting problem, and experimental results demonstrate the good performance of the proposed method.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## A new perspective of LDA

Standard linear discriminant analysis (LDA) seeks a projection $G = (\mathbf{g}_1, \ldots, \mathbf{g}_{K-1}) \in \mathbb{R}^{p \times (K-1)}$
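The core object introduced above, the pairwise averaged class covariance, can be sketched as follows. This is an illustration of the idea only, assuming the pairwise covariance for classes $k$ and $l$ is $S_{kl} = (\Sigma_k + \Sigma_l)/2$; the function name, the Mahalanobis-style separation score, and any weighting of pairs are assumptions, since the paper's exact objective is not given in this excerpt.

```python
import numpy as np
from itertools import combinations

def pairwise_class_stats(X, y):
    """For each class pair (k, l), form the pairwise averaged covariance
    S_kl = (Sigma_k + Sigma_l) / 2, used in place of a single globally
    averaged within-class covariance, and a per-pair separation score."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    covs = {k: np.cov(X[y == k], rowvar=False) for k in classes}
    stats = {}
    for k, l in combinations(classes, 2):
        S_kl = 0.5 * (covs[k] + covs[l])  # pairwise averaged covariance
        d_kl = means[k] - means[l]        # pairwise mean difference
        # Mahalanobis-style pairwise separation under S_kl (illustrative)
        sep = float(d_kl @ np.linalg.solve(S_kl, d_kl))
        stats[(k, l)] = (S_kl, sep)
    return stats
```

Pairs with a small separation score are exactly those a pairwise weighting scheme would emphasize, whereas a single averaged covariance treats all pairs uniformly.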