Making Fisher Discriminant Analysis Scalable

Bojun Tu (TUBOJUN@GMAIL.COM), Department of Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai, China
Zhihua Zhang (ZHIHUA@SJTU.EDU.CN), Department of Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai, China
Shusen Wang (WSS@ZJU.EDU.CN), College of Computer Science & Technology, Zhejiang University, Hangzhou, China
Hui Qian (QIANHUI@ZJU.EDU.CN), College of Computer Science & Technology, Zhejiang University, Hangzhou, China

Abstract

The Fisher linear discriminant analysis (LDA) is a classical method for joint classification and dimension reduction. A major limitation of the conventional LDA is the so-called singularity issue. Many LDA variants, especially two-stage methods such as PCA+LDA and LDA/QR, have been proposed to address this issue. In a two-stage method, an intermediate dimension-reduction stage is applied before the actual LDA method is run. These two-stage methods are scalable because they solve an approximate alternative to the exact LDA problem. However, there has been no theoretical analysis of how well they approximate the conventional LDA problem. In this paper we present a theoretical analysis of the approximation error of a two-stage algorithm. Accordingly, we develop a new two-stage algorithm. Furthermore, we resort to a random projection approach to make our algorithm scalable, and we provide an implementation on distributed systems to handle large-scale problems. Our algorithm takes LDA/QR as a special case, and outperforms PCA+LDA while having similar scalability. We also generalize our algorithm to kernel discriminant analysis, a nonlinear version of the classical LDA. Extensive experiments show that our algorithms outperform PCA+LDA while having similar scalability.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

The Fisher linear discriminant analysis (LDA) has received wide applications in multivariate analysis and machine learning, such as face recognition (Belhumeur et al., 1997; Martínez & Kak, 2001), text classification, microarray data classification, etc. The conventional LDA problem attempts to find an optimal linear transformation by minimizing the within-class distance and maximizing the between-class distance simultaneously. It is well known that this optimization problem can be solved by applying eigenvalue decomposition to the scatter matrices. However, this requires the total scatter matrix to be nonsingular, which is usually not the case in real-world applications. For example, microarray datasets typically lie in the large-p-small-n regime. To address this singularity issue, the pseudoinverse and regularization methods have been widely employed (Hastie et al., 2009; Zhang et al., 2010). Recently, two-stage approximate approaches, such as the PCA+LDA algorithm (Belhumeur et al., 1997) and the LDA/QR algorithm (Ye & Li, 2005), have also been proposed. Typically, a two-stage approach applies a cheap intermediate dimension-reduction method before the conventional LDA algorithm is performed. This class of algorithms is more scalable than the exact algorithm. To the best of our knowledge, however, there is no theoretical analysis of how well these algorithms approximate the exact LDA problem.
In this paper we first give a theoretical analysis of how well a two-stage algorithm approximates the exact LDA in the sense of maximizing the LDA objective function. This analysis motivates us to devise a new two-stage LDA algorithm. Our algorithm outperforms PCA+LDA (Belhumeur et al., 1997) while both have similar scalability. To make our algorithm more scalable, we introduce a random-projection-based SVD (RSVD) method (Mahoney, 2011; Halko et al., 2011). Furthermore, we provide an implementation of our algorithm on distributed systems to handle large-scale problems. Additionally, we show that our algorithm naturally generalizes to the kernel discriminant analysis (KDA) problem (Mika et al., 1999; Baudat & Anouar, 2000; Park & Park, 2005), which is a nonlinear generalization of the conventional LDA.

The remainder of the paper is organized as follows. In Section 2, we review the classical LDA and two two-stage algorithms for efficiently solving LDA, the PCA+LDA algorithm and the LDA/QR algorithm. In Section 3, we present our algorithm and some theoretical analysis. We conduct empirical analysis and comparison in Section 4, and conclude our work in Section 5.

2. The Fisher Discriminant Analysis

In this section we briefly review the conventional Fisher LDA and KDA problems, as well as efficient algorithms for solving them.

2.1. Linear Discriminant Analysis

We are given a data matrix X = [x_1, . . . , x_n]^T ∈ R^{n×p}, where x_i is a p-dimensional input instance. Suppose the input instances are partitioned into m classes, such that X^T = [X_1^T, X_2^T, . . . , X_m^T], where X_i ∈ R^{n_i×p} contains the n_i instances from the i-th class and Σ_{i=1}^m n_i = n. The conventional LDA finds an optimal linear transformation G ∈ R^{p×q} that preserves the class structure in a low-dimensional space as well as in the original space. That is, G maps each x_i of X in the p-dimensional space to a vector y_i in the q-dimensional space. The within-class, between-class, and total scatter matrices are defined by

S_w = \frac{1}{n} \sum_{i=1}^{m} \sum_{x \in X_i} (x - c_i)(x - c_i)^T, \quad
S_b = \frac{1}{n} \sum_{i=1}^{m} n_i (c_i - c)(c_i - c)^T, \quad
S_t = \frac{1}{n} \sum_{i=1}^{n} (x_i - c)(x_i - c)^T,

where c_i = \frac{1}{n_i} \sum_{x \in X_i} x is the mean of the i-th class and c = \frac{1}{n} \sum_{x \in X} x is the mean of the whole data set. In the low-dimensional space defined by the linear transformation G, the between-class and total scatter matrices become S_b^G = G^T S_b G and S_t^G = G^T S_t G, respectively.

We can simplify the formulation of the scatter matrices through the precursors H_w, H_b, and H_t, defined as

H_b = \frac{1}{\sqrt{n}} \big[ \sqrt{n_1}(c_1 - c), \ldots, \sqrt{n_m}(c_m - c) \big], \quad
H_w = \frac{1}{\sqrt{n}} \big[ X_1^T - c_1 e_1^T, \ldots, X_m^T - c_m e_m^T \big],   (1)

H_t = \frac{1}{\sqrt{n}} \big( X^T - c e^T \big),   (2)

where e_i = [1, . . . , 1]^T ∈ R^{n_i} and e = [1, . . . , 1]^T ∈ R^n. Hence, it can be shown that the scatter matrices can be rewritten as S_b = H_b H_b^T and S_t = H_t H_t^T.

An optimal transformation G^* can be obtained by solving the following optimization problem:

\max_{G} \ \Big\{ J(G) \triangleq \operatorname{trace}\big( (S_t^G)^{-1} S_b^G \big) \Big\}.

The solution to this optimization problem can be obtained through the eigen-decomposition of the matrix S_t^{-1} S_b whenever S_t is nonsingular (Fukunaga, 1990). When S_t is singular, we can instead use the eigen-decomposition of S_t^† S_b, where S_t^† is the Moore-Penrose inverse of S_t. Recently, an SVD-based algorithm was proposed to solve this eigen-problem (Ye, 2005); it is described in Algorithm 1. Let G^* be obtained from Algorithm 1. Then G^* consists of the top q eigenvectors of S_t^† S_b, where q = rank(H_b), which equals m - 1 in most cases. We also have (G^*)^T S_t G^* = I_q and (G^*)^T S_b G^* = Σ_q^2. Hence the optimal value of the LDA objective function is trace(Σ_q^2).
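As a concrete illustration of the precursors in (1) and (2), the following NumPy sketch forms H_b and H_t from a dense data matrix X and integer class labels y. It assumes the 1/sqrt(n) scaling written above and is our own sketch, not the authors' code; Algorithm 1 below operates on these two matrices.

```python
import numpy as np

def lda_precursors(X, y):
    """Form the precursors H_b (p x m) and H_t (p x n) of equations (1)-(2).

    X is an n x p data matrix and y holds integer class labels. A sketch
    assuming the 1/sqrt(n) scaling, not the authors' implementation."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    c = X.mean(axis=0)                        # global mean c (length p)
    Hb = np.empty((p, len(classes)))
    for j, cls in enumerate(classes):
        Xi = X[y == cls]                      # instances of the j-th class
        Hb[:, j] = np.sqrt(Xi.shape[0]) * (Xi.mean(axis=0) - c)
    Hb /= np.sqrt(n)                          # H_b = (1/sqrt(n)) [sqrt(n_i)(c_i - c)]
    Ht = (X - c).T / np.sqrt(n)               # H_t = (1/sqrt(n)) (X^T - c e^T)
    return Hb, Ht
```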
Algorithm 1 The SVD-based LDA algorithm
1: calculate H_t and H_b by (1) and (2)
2: compute the reduced SVD of H_t as H_t = U_t Σ_t V_t^T
3: B ← Σ_t^{-1} U_t^T H_b
4: compute the reduced SVD of B as B = P_q Σ_q Q_q^T
5: return G^* ← U_t Σ_t^{-1} P_q

2.2. Kernel Discriminant Analysis

To apply LDA to nonlinear data, many KDA algorithms have been devised by using the so-called kernel trick (Mika et al., 1999; Baudat & Anouar, 2000; Park & Park, 2005). The kernel method first maps the original data into a high-dimensional space H by a nonlinear transformation Φ : R^p → H. Typically, Φ is only implicitly available and we only know a kernel function κ : R^p × R^p → R such that κ(x_1, x_2) = Φ(x_1)^T Φ(x_2). Let K_b = [b_{ij}]_{1≤i≤n, 1≤j≤m} where

b_{ij} = \sqrt{\frac{n_j}{n}} \Big( \frac{1}{n_j} \sum_{x_k \in X_j} \kappa(x_i, x_k) - \frac{1}{n} \sum_{k=1}^{n} \kappa(x_i, x_k) \Big),   (3)

and K_t = [t_{ij}]_{1≤i≤n, 1≤j≤n} where

t_{ij} = \frac{1}{\sqrt{n}} \Big( \kappa(x_i, x_j) - \frac{1}{n} \sum_{k=1}^{n} \kappa(x_i, x_k) \Big).   (4)

Then the objective function of KDA becomes

J(G) = \operatorname{trace}\big( (A^T K_t K_t^T A)^{-1} (A^T K_b K_b^T A) \big),

where A = [a_{ij}]_{1≤i≤n, 1≤j≤q} satisfies [G(x)]_j = Σ_{i=1}^n a_{ij} κ(x, x_i) for any x ∈ R^p (Park & Park, 2005). This objective function has the same form as the LDA objective function; it simply replaces H_b with K_b and H_t with K_t. Therefore, Algorithm 1 can be applied directly to KDA. We omit the details.

2.3. Two-stage methods

There exist two important two-stage methods for the conventional LDA, namely PCA+LDA (Belhumeur et al., 1997; Yang & Yang, 2003) and LDA/QR (Ye & Li, 2005; Ye, 2005), also known as OCM+LDA (Park et al., 2003). These algorithms first compute a column-orthogonal transformation matrix Z in the first stage and then run the classical LDA algorithm on the reduced space obtained via Z.

The PCA+LDA algorithm first computes Z_PCA to maximize trace(Z^T S_t Z), which is equivalent to computing the top r left singular vectors of H_t. LDA maximizes the trace of the ratio of the between-class scatter matrix to the total scatter matrix, whereas PCA maximizes the trace of the total scatter matrix alone, which is not fully compatible with the LDA objective. Hence some information that is important for discrimination might be lost in the PCA stage.

LDA/QR first computes a column-orthogonal matrix Z_OCM by maximizing trace(Z^T S_b Z). This objective is more compatible with the LDA objective. Ye & Li (2005) showed that if the QR decomposition with pivoting (Golub & Van Loan, 1996) of H_b is H_b = QRΠ, then we can take Z_OCM = Q. However, the LDA/QR algorithm reduces the dimension to q, which is too small for many problems, so a lot of information in the original data matrix might be lost.
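Before moving on to our method, here is a minimal NumPy sketch of the exact SVD-based LDA routine of Algorithm 1 that the two-stage methods above approximate. The function name, the rank tolerance, and the reuse of the lda_precursors helper from the previous sketch are our own choices, not part of the paper.

```python
import numpy as np

def svd_lda(Hb, Ht, tol=1e-10):
    """Sketch of Algorithm 1: exact LDA from the precursors H_b and H_t."""
    # Step 2: reduced SVD of H_t, dropping numerically zero singular values.
    Ut, st, _ = np.linalg.svd(Ht, full_matrices=False)
    keep = st > tol * st[0]
    Ut, st = Ut[:, keep], st[keep]
    # Step 3: B = Sigma_t^{-1} U_t^T H_b.
    B = (Ut.T @ Hb) / st[:, None]
    # Step 4: reduced SVD of B; the leading q = rank(H_b) directions are kept.
    Pq, sig, _ = np.linalg.svd(B, full_matrices=False)
    q = int(np.sum(sig > tol))
    # Step 5: G* = U_t Sigma_t^{-1} P_q, so that (G*)^T S_t G* = I_q.
    return (Ut / st[None, :]) @ Pq[:, :q]

# Usage: G_star = svd_lda(*lda_precursors(X, y)); Y = X @ G_star projects the data.
```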
3. Methodology

In this section we first give a theoretical bound on how well a two-stage LDA algorithm approximates the exact LDA algorithm when the transformation matrix of the first stage is given. We then propose a fast two-stage LDA algorithm based on this theoretical analysis. We also generalize our algorithm to the KDA problem.

3.1. Theoretical Analysis

In the first stage of a two-stage algorithm, we introduce a linear transformation Z ∈ R^{p×r} to reduce the dimension to r. We assume Z to be column-orthonormal, that is, Z^T Z = I_r. In the second stage we apply the conventional LDA algorithm to the reduced total scatter matrix Z^T S_t Z and the reduced between-class scatter matrix Z^T S_b Z to obtain a linear transformation G̃. The final result is then Ĝ_Z = Z G̃. Intuitively, a good Z should keep as much information as possible in both S_t and S_b. Fortunately, rank(S_b) = q ≤ m - 1 is small under the assumption that the number of classes m is small. Therefore, if we let R(H_b) ⊆ R(Z), all of the information in S_b is kept after the linear transformation Z. Here R(·) denotes the range (column space) of a matrix. The following theorem shows how well we can approximate the solution of the exact LDA using such a Z.

Theorem 1. If Z ∈ R^{p×r} satisfies R(H_b) ⊆ R(Z) and Z^T Z = I_r, then

J(\hat{G}_Z) \;\ge\; \frac{1}{\|(G^*)^T Z Z^T H_t\|_2^2}\, J(G^*) \;\ge\; \frac{1}{\|H_t^{\dagger} Z Z^T H_t\|_2^2}\, J(G^*).

Here ‖·‖_2 is the spectral norm of a matrix, and G^* is the solution returned by Algorithm 1.

3.2. The SVD-QR-LDA Algorithm

To obtain a more accurate two-stage LDA algorithm, we can maximize the right-hand side of the inequality in Theorem 1. The optimization problem is

\min_{Z} \ \|H_t^{\dagger} Z Z^T H_t\|_2 \quad \text{s.t.} \quad Z^T Z = I_r, \ \ R(H_b) \subseteq R(Z).

Unfortunately, it is hard to solve this minimization problem exactly, so we use some heuristics to solve it approximately. First, we observe that without the constraint R(H_b) ⊆ R(Z), there is a closed-form solution to this minimization problem. Let U ∈ R^{p×s} contain the left singular vectors of H_t, where s = rank(H_t). For any index set I ⊆ {1, . . . , s}, let U_I be the columns of U indexed by I. By direct calculation, we see that for any I, ‖H_t^† U_I U_I^T H_t‖_2 = 1. Because we always have ‖H_t^† Z Z^T H_t‖_2 ≥ 1 for any feasible column-orthonormal matrix Z, Z = U_I is the desired solution. Second, we simply add some columns to U_I to form Z_I such that all conditions in Theorem 1 are satisfied. We first compute the QR decomposition with pivoting (I - U_I U_I^T) H_b = QRΠ. Then Z_I = [U_I, Q] is the required result. We can easily see that Z_I^T Z_I = I and R(H_b) ⊆ R(Z_I). Moreover, Z_I admits the following property:

Lemma 1. The matrix Z_I described above satisfies

\|H_t^{\dagger} Z_I Z_I^T H_t\|_2 \;\le\; 1 + \frac{\max_{1 \le i \le s,\, i \notin I} \sigma_i(H_t)}{\sigma_s(H_t)},

where σ_j(H_t) is the j-th singular value of H_t in decreasing order.

According to Lemma 1, we choose U_I to be the left singular vectors associated with the largest r - q singular values of H_t, so as to minimize max_{1≤i≤s, i∉I} σ_i(H_t). The pseudocode of our algorithm is given in Algorithm 2.

Algorithm 2 SVD-QR-LDA algorithm
Input: the target dimension r of the first stage
1: calculate H_t and H_b by (1) and (2), q ← rank(H_b)
2: Z_1 ← the top r - q left singular vectors of H_t
3: compute the QR decomposition with pivoting of H_b - Z_1 Z_1^T H_b as QRΠ = H_b - Z_1 Z_1^T H_b
4: Z_2 ← first q columns of Q, Z ← [Z_1, Z_2]
5: H_b ← Z^T H_b, H_t ← Z^T H_t
6: G ← result of the conventional LDA algorithm on H_b and H_t
7: return Ĝ ← Z G

When r = q, Z_1 in our algorithm disappears and our algorithm becomes identical to LDA/QR. Therefore, our algorithm can be regarded as a generalization of LDA/QR with a larger intermediate dimension r, and it gives a more accurate result than LDA/QR when r is much larger than q. Note that Z_PCA in PCA+LDA does not satisfy the conditions in Theorem 1, so we cannot compare our algorithm with PCA+LDA theoretically. However, we will see in Section 4 that the objective value attained by our algorithm is usually larger than that of PCA+LDA for the same intermediate dimension r.
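A single-machine sketch of Algorithm 2 follows. It reuses the svd_lda helper sketched in Section 2 together with SciPy's pivoted QR, and is meant only to make the two stages concrete; it uses a dense exact SVD, whereas Section 3.3 below replaces Step 2 by a randomized SVD.

```python
import numpy as np
from scipy.linalg import qr

def svd_qr_lda(Hb, Ht, r):
    """Sketch of Algorithm 2 (SVD-QR-LDA) with intermediate dimension r."""
    q = np.linalg.matrix_rank(Hb)
    # Step 2: top r - q left singular vectors of H_t.
    U, _, _ = np.linalg.svd(Ht, full_matrices=False)
    Z1 = U[:, :r - q]
    # Step 3: QR decomposition with pivoting of H_b - Z_1 Z_1^T H_b.
    Q, _, _ = qr(Hb - Z1 @ (Z1.T @ Hb), mode='economic', pivoting=True)
    # Step 4: Z = [Z_1, Z_2] with Z_2 the first q columns of Q.
    Z = np.hstack([Z1, Q[:, :q]])
    # Steps 5-7: conventional LDA (Algorithm 1 sketch) on the reduced precursors.
    G_tilde = svd_lda(Z.T @ Hb, Z.T @ Ht)
    return Z @ G_tilde
```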
3.3. The Randomized SVD Algorithm

The time complexity of our algorithm is T_PCA(r) + O(nrm), where T_PCA(r) is the time complexity of calculating the top r singular vectors. If we use a classical algorithm for the SVD, the time complexity is O(np min{n, p}), which is too slow for large-scale matrices. To make our algorithm scalable, we use a random-projection-based algorithm (Halko et al., 2011) to calculate the SVD. Its time complexity is O(npr), which is much smaller than O(np min{n, p}) when r is small.

Although the Krylov subspace method has the same time complexity O(npr), Halko et al. (2011) showed that random-projection-based algorithms are more scalable. When the data matrix does not fit in memory, randomized algorithms require only a constant number of passes over the data, while the subspace method requires at least O(r) passes. In our algorithm, we want r to be as large as possible to make the result more accurate, so the number of passes that the Krylov subspace method needs is too large. Randomized algorithms are also more robust and are easily parallelized. Moreover, we will show empirically in Section 4 that the loss in accuracy incurred by using the randomized algorithm is very small.

We describe this randomized approach in Algorithm 3, and refer to Algorithm 2 based on the randomized approach as RSVD-QR-LDA for short. The time complexity of RSVD-QR-LDA is O(npr), which is similar to that of PCA+LDA using the same SVD algorithm. Since the data matrix in question possibly has singular values that decay slowly, we need some power iteration steps. Setting the number of power iterations q_SVD to 1 usually suffices in practice.

Algorithm 3 Randomized SVD algorithm
Input: a p × n matrix A, a target dimension k_SVD, an exponent q_SVD, and an oversampling parameter p_SVD
1: Ω ← an n × (k_SVD + p_SVD) Gaussian test matrix
2: Y ← (A A^T)^{q_SVD} A Ω
3: compute the QR decomposition of Y as Y = QR
4: B ← Q^T A
5: compute the SVD of B as B = U Σ V^T
6: U_{k_SVD} ← first k_SVD columns of Q U, Σ_{k_SVD} ← leading k_SVD × k_SVD block of Σ, V_{k_SVD} ← first k_SVD columns of V
7: return U_{k_SVD}, Σ_{k_SVD}, V_{k_SVD}

3.4. Implementation on Distributed Systems

For large problems with millions of training instances or features, it may be impractical to run our algorithm on a single machine. Therefore, we implement our algorithm on distributed systems. Our implementation requires the scale of r to be moderate, such that a matrix of size r × r can fit in the memory of a single machine.

Our algorithm employs the randomized SVD algorithm described in Section 3.3 for performing the SVD. Section 3.3 already discussed the scalability issues of the randomized SVD algorithm and the Krylov subspace method. Furthermore, compared with the randomized SVD algorithm, the Krylov subspace method requires many more iterations, and each iteration depends on the results of the previous one. Thus, it needs data transfer and synchronization in each iteration. On distributed systems, the cost of data transfer and synchronization is large, so the Krylov subspace method can be much slower than the randomized method.

There are two operations in our algorithm that need to be implemented on distributed systems: multiplication of large matrices and the QR decomposition of tall-and-skinny matrices. Here a tall-and-skinny matrix is a matrix with many rows and few columns. Large matrix multiplication can be parallelized naturally. To implement the QR decomposition of a thin matrix on distributed systems, many methods have been proposed (Cosnard et al., 1986; Halko, 2012). We use the Cholesky-decomposition-based algorithm for the QR decomposition to keep our implementation simple. For a tall-and-skinny matrix Y, we compute the Cholesky decomposition of Y^T Y as Y^T Y = R^T R, where R is an upper triangular matrix. Then the QR decomposition of Y is Y = QR with Q = Y R^{-1}.
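As a small illustration of the Cholesky-based QR just described, the following sketch computes the factorization of a tall-and-skinny matrix on a single machine; on a cluster, the Gram matrix Y^T Y would be accumulated block-wise across workers before the same small Cholesky step. The helper name is our own.

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_qr(Y):
    """QR of a tall-and-skinny Y via the Cholesky factor of Y^T Y (a sketch)."""
    G = Y.T @ Y                                    # small r x r Gram matrix
    R = np.linalg.cholesky(G).T                    # Y^T Y = R^T R, R upper triangular
    Q = solve_triangular(R.T, Y.T, lower=True).T   # Q = Y R^{-1} without forming R^{-1}
    return Q, R

# Usage: Q, R = cholesky_qr(Y); np.allclose(Q @ R, Y) should hold for well-conditioned Y.
```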
3.5. Generalization to KDA

Algorithm 2 can be easily generalized to solve the KDA problem, because the objective function of KDA has a similar form to that of LDA. We present the detailed procedure in Algorithm 4.

Algorithm 4 RSVD-QR-KDA algorithm
Input: the target dimension r of the first stage
1: calculate K_t and K_b by (3) and (4), q ← rank(K_b)
2: Z_1 ← the top r - q left singular vectors of K_t
3: compute the QR decomposition with pivoting of K_b - Z_1 Z_1^T K_b as QRΠ = K_b - Z_1 Z_1^T K_b
4: Z_2 ← first q columns of Q, Z ← [Z_1, Z_2]
5: K_b ← Z^T K_b, K_t ← Z^T K_t
6: A ← result of the conventional KDA algorithm on K_b and K_t
7: return Â ← Z A

We can also use the randomized SVD algorithm in Step 2 of Algorithm 4. The time complexity of the resulting algorithm is T_k + O(n^2 r), where T_k is the time for calculating the kernel matrix K. Thus, our algorithm and the PCA+LDA algorithm again have similar computational complexity.

4. Experiments

In this section we perform an empirical analysis of our proposed algorithms on two face datasets, Yale B&E and CMU PIE, two middle-sized document datasets, News20 (Lang, 1995) and RCV1 (Lewis et al., 2004), and a large dataset, Amazon7 (Dredze et al., 2008; Blondel et al., 2013). The sizes of the five datasets are given in Table 1.

The Yale B&E dataset consists of the Yale Face Database B (Georghiades et al., 2001) and the extended Yale Face Database B (Lee et al., 2005). We collect a subset of 2414 face images of 38 subjects, and crop and resize each image to 32×32. For the CMU PIE dataset, we use a subset containing 11554 images of 68 subjects, and the images are cropped and resized to 64×64. For these two datasets, we randomly pick 70% of the data for training and the remaining for test. We repeat this procedure 5 times and report the averages of the objective function, classification accuracy and running time over these 5 repeats. For News20, we use the first 80% of the original data for training and the remaining 20% for test. For RCV1, we follow the preprocessing of Bekkerman & Scholz (2008): we map all categories to the second level and delete all data with multiple labels. We also follow the setting in Bekkerman & Scholz (2008) for the training and test data. The Amazon7 dataset contains 1,362,109 reviews of Amazon products. We randomly pick 80% of the dataset for training, and the remaining for test. We follow the preprocessing of Blondel et al. (2013).

Table 1. Summary of the datasets, where n is the number of training data, p is the number of input features, and m is the number of classes.

| Data set | n | test size | p | m |
|---|---|---|---|---|
| Yale B&E | 1,689 | 725 | 1,024 | 38 |
| PIE | 8,087 | 3,467 | 4,096 | 68 |
| News20 | 15,935 | 3,993 | 62,061 | 20 |
| RCV1 | 15,564 | 518,571 | 47,236 | 53 |
| Amazon7 | 1,089,687 | 272,422 | 262,144 | 7 |

4.1. Experiments with LDA Algorithms

We compare our RSVD-QR-LDA algorithm with the PCA-LDA algorithm. For the two small face datasets, Yale B&E and CMU PIE, we also compare with the conventional LDA solved by Algorithm 1, as well as with DSVD-QR-LDA, i.e., Algorithm 2 where the SVD in Step 2 is computed by the classical SVD algorithm. After the LDA algorithm, we perform k-NN classification on the reduced space, where k is selected via 10-fold cross-validation. The SVD step in both RSVD-QR-LDA and PCA-LDA is computed by Algorithm 3 with q_SVD = 1 and p_SVD = 0.1 k_SVD. All the algorithms are implemented in Python 2.7 on a PC with an Intel Xeon X5675 3.07 GHz CPU and 12 GB memory.
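The classification protocol just described can be sketched as follows. The candidate grid for k is our own assumption (the paper only states that k is chosen by 10-fold cross-validation), and scikit-learn is used purely for illustration.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(G_hat, X_train, y_train, X_test, y_test):
    """Project with the learned transformation G_hat (p x q), then run k-NN
    with k selected by 10-fold cross-validation on the training split."""
    Z_train, Z_test = X_train @ G_hat, X_test @ G_hat
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 3, 5, 7, 9, 11]},  # assumed grid
                          cv=10)
    search.fit(Z_train, y_train)
    return search.score(Z_test, y_test)
```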
In Figures 1, 2, 3 and 4, we report the average objective function value of LDA, classification accuracy, and running time of the LDA algorithms, with respect to different r. For RSVD-QR-LDA and PCA-LDA, we run each algorithm 10 times for each training-test splitting, and report the standard deviation of both the objective value and the classification accuracy; this standard deviation is taken with respect to different runs on the same splitting. For Yale B&E and CMU PIE, we let r range from m - 1 to about 1000, while for News20 and RCV1, we let r range from m - 1 to about 2000. When r = m - 1, our algorithm is identical to LDA/QR, so we thereby also compare our algorithm with LDA/QR.

We summarize the experimental results in Table 2, in which we report the best accuracy, the value of r for which the best accuracy is achieved, and the running time. Here the best r value is the hand-picked best value on the test data, and the standard deviation in Table 2 is taken with respect to different train/test splittings.

From these results, we can see that for both PCA-LDA and SVD-QR-LDA, the value of the objective function increases as r increases. For the same r, the objective value of SVD-QR-LDA is larger than that of PCA-LDA. With regard to classification accuracy, we find that on News20 and RCV1, SVD-QR-LDA outperforms PCA-LDA, while on Yale B&E and CMU PIE both our algorithm and PCA-LDA suffer from overfitting. However, our algorithm achieves its best accuracy at a smaller r, which makes it more efficient. Additionally, on Yale B&E and CMU PIE, both our algorithm and PCA-LDA outperform the exact LDA algorithm. The running time of RSVD-QR-LDA is nearly the same as that of PCA-LDA for the same r, and both are much faster than the conventional LDA, especially on large-scale datasets. We can also see that RSVD-QR-LDA performs nearly the same as DSVD-QR-LDA and that the standard deviations are small, which implies that the accuracy loss incurred by using the randomized algorithm is very small.

4.2. Experiments on the Distributed System

We test our distributed implementation on a Hadoop (Borthakur, 2007) cluster on Amazon Elastic MapReduce (EMR). We use an EMR cluster consisting of 8 m1.xlarge instances; each m1.xlarge instance has an Intel Xeon Family quad-core CPU and 15 GB memory. The version of Hadoop we use is 1.0.3. In Figure 5, we report the objective function of LDA, classification accuracy, and running time of the LDA algorithms with respect to different r. In Table 3, we report the best accuracy, the value of r for which the best accuracy is achieved, and the running time. From these results, we can see that our algorithm outperforms PCA-LDA in both objective value and classification accuracy, and the running times of the two algorithms are nearly the same.

4.3. Experiments with KDA Algorithms

For the KDA algorithms, we only conduct comparisons on the two face datasets, Yale B&E and CMU PIE. We use the RBF kernel κ(x_1, x_2) = exp(-‖x_1 - x_2‖^2 / θ) in our experiments, where θ is set to the mean Euclidean distance among the training data points. We report the objective function, classification accuracy and running time of the algorithms in Figures 6 and 7; the running time does not include the time for computing the kernel matrix. We also summarize the results in Table 4, including the best accuracy, the value of r at which the best accuracy is achieved, and the running time.
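For reference, the kernel construction used in these KDA experiments can be written in a few lines. The helper below is a sketch of the stated setup (RBF kernel with θ equal to the mean pairwise Euclidean distance of the training points), not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_matrix(X):
    """Kernel matrix K with K[i, j] = exp(-||x_i - x_j||^2 / theta), where theta is
    the mean Euclidean distance among the training points (a sketch)."""
    D = squareform(pdist(X))                         # pairwise Euclidean distances
    theta = D[np.triu_indices_from(D, k=1)].mean()   # mean over distinct pairs
    return np.exp(-D ** 2 / theta)
```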
It is seen from the results that our algorithm outperforms PCA+KDA in both the optimal value of the objective function and classification accuracy. The running times of the two algorithms are nearly the same, and both are much faster than the conventional KDA on the larger datasets.

5. Conclusion

In this paper we have proposed a novel two-stage algorithm for approximately solving the LDA problem, based on our theoretical analysis of the approximation error of two-stage methods. Our algorithm includes LDA/QR as a special case. We have made our algorithm scalable by using a randomized SVD algorithm and have generalized it to the kernel LDA problem. Furthermore, we have devised an implementation of our algorithm on distributed systems. We have conducted empirical analysis on several different datasets. The experimental results show that our algorithm outperforms PCA+LDA while they have the same scalability.

Table 2. The best mean classification accuracy (acc) with its standard deviation (std), the best r value for which the best accuracy is achieved, and the running time of the algorithms (in seconds). Here N/A means that LDA is not applicable to this dataset.

| Dataset | RSVD-QR-LDA acc (±std) | best r | time (±std) | PCA-LDA acc (±std) | best r | time (±std) | LDA acc (±std) | time (±std) |
|---|---|---|---|---|---|---|---|---|
| Yale B&E | 95.5 (±0.6) | 429 | 0.85 (±0.03) | 95.4 (±0.6) | 723 | 1.38 (±0.04) | 93.7 (±1.1) | 0.74 (±0.03) |
| CMU PIE | 96.6 (±0.4) | 167 | 6.3 (±0.3) | 96.3 (±0.4) | 267 | 4.3 (±0.7) | 93.0 (±0.4) | 69.3 (±18.2) |
| News20 | 85.8 | 2052 | 184 | 83.1 | 2052 | 180 | N/A | N/A |
| RCV1 | 84.0 | 302 | 14 | 82.3 | 2052 | 135 | N/A | N/A |

Figure 1. Objective function, classification accuracy and running time of two algorithms on the News20 dataset with respect to the intermediate dimension r. We also draw the standard deviations of the objective function and classification accuracy in the figure; the standard deviations are taken with respect to different runs on the same training-test splitting.

Figure 2. Objective function, classification accuracy and running time of two algorithms on the RCV1 dataset.
Figure 3. Objective function, classification accuracy and running time of four algorithms on the Yale B&E dataset.

Table 3. The best mean classification accuracy (acc), the best r value, and the running time of the algorithms (in minutes).

| Dataset | RSVD-QR-LDA acc | best r | time | PCA-LDA acc | best r | time |
|---|---|---|---|---|---|---|
| Amazon7 | 92.2 | 500 | 82.4 | 90.6 | 500 | 73.3 |

Figure 4. Objective function, classification accuracy and running time of four algorithms on the CMU PIE dataset.

Figure 5. Objective function, classification accuracy and running time of the two algorithms on the Amazon7 dataset.

Table 4. The best mean classification accuracy (acc) with its standard deviation (std), the best r value, and the running time of the algorithms (in seconds).

| Dataset | RSVD-QR-KDA acc (±std) | best r | time (±std) | PCA-KDA acc (±std) | best r | time (±std) | KDA acc (±std) | time (±std) |
|---|---|---|---|---|---|---|---|---|
| Yale B&E | 95.7 (±0.6) | 821 | 2.33 (±0.06) | 95.7 (±0.6) | 1017 | 2.75 (±0.09) | 95.7 (±0.6) | 1.89 (±0.30) |
| CMU PIE | 98.2 (±0.3) | 1067 | 28.9 (±0.2) | 98.1 (±0.1) | 1067 | 29.3 (±0.3) | 98.3 (±0.3) | 162.5 (±0.6) |

Figure 6. Objective function, classification accuracy and running time of the KDA algorithms on the Yale B&E dataset.

Figure 7. Objective function, classification accuracy and running time of the KDA algorithms on the CMU PIE dataset.
References

Baudat, G. and Anouar, F. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385-2404, 2000.

Bekkerman, R. and Scholz, M. Data weaving: scaling up the state-of-the-art in data clustering. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), pp. 1083-1092, New York, NY, USA, 2008. ACM.

Belhumeur, P. N., Hespanha, J., and Kriegman, D. J. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.

Blondel, M., Seki, K., and Uehara, K. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning, pp. 1-22, 2013.

Borthakur, D. The Hadoop distributed file system: Architecture and design, 2007.

Cosnard, M., Muller, J.-M., and Robert, Y. Parallel QR decomposition of a rectangular matrix. Numerische Mathematik, 48(2):239-249, 1986.

Dredze, M., Crammer, K., and Pereira, F. Confidence-weighted linear classification. In Proceedings of the 25th International Conference on Machine Learning, pp. 264-271. ACM, 2008.

Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:643-660, 2001.

Golub, G. H. and Van Loan, C. F. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.

Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Halko, N. P. Randomized methods for computing low-rank approximations of matrices. PhD thesis, University of Colorado, 2012.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction. Springer, 2009.

Lang, K. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 331-339. Morgan Kaufmann, 1995.

Lee, K.-C., Ho, J., and Kriegman, D. J. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684-698, 2005.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.

Martínez, A. M. and Kak, A. C. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228-233, 2001.

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müller, K.-R. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41-48, 1999.

Park, C. H. and Park, H. Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 27(1):87-102, 2005.

Park, H., Jeon, M., and Rosen, J. B. Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 43:427-448, 2003.

Yang, J. and Yang, J.-Y. Why can LDA be performed in PCA transformed space? Pattern Recognition, 36(2):563-566, 2003.

Ye, J. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483-502, 2005.

Ye, J. and Li, Q. A two-stage linear discriminant analysis via QR-decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):929-941, 2005.

Zhang, Z., Dai, G., Xu, C., and Jordan, M. I. Regularized discriminant analysis, ridge regression and beyond. Journal of Machine Learning Research, 11:2199-2228, 2010.