Making Fisher Discriminant Analysis Scalable

Bojun Tu (TUBOJUN@GMAIL.COM), Department of Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai, China
Zhihua Zhang (ZHIHUA@SJTU.EDU.CN), Department of Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai, China
Shusen Wang (WSS@ZJU.EDU.CN), College of Computer Science & Technology, Zhejiang University, Hangzhou, China
Hui Qian (QIANHUI@ZJU.EDU.CN), College of Computer Science & Technology, Zhejiang University, Hangzhou, China

Abstract

The Fisher linear discriminant analysis (LDA) is a classical method for joint classification and dimension reduction. A major limitation of the conventional LDA is the so-called singularity issue. Many LDA variants, especially two-stage methods such as PCA+LDA and LDA/QR, have been proposed to address this issue. In a two-stage method, an intermediate dimension-reduction stage is applied before the actual LDA method is run. These two-stage methods are scalable because they solve an approximate alternative to the exact LDA problem. However, there has been no theoretical analysis of how well they approximate the conventional LDA problem. In this paper we present a theoretical analysis of the approximation error of a two-stage algorithm. Accordingly, we develop a new two-stage algorithm. Furthermore, we resort to a random projection approach to make our algorithm scalable, and we provide an implementation on distributed systems to handle large-scale problems. Our algorithm takes LDA/QR as a special case, and outperforms PCA+LDA while having similar scalability. We also generalize our algorithm to kernel discriminant analysis, a nonlinear version of the classical LDA. Extensive experiments show that our algorithms outperform PCA+LDA while having similar scalability.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

The Fisher linear discriminant analysis (LDA) has received wide applications in multivariate analysis and machine learning, such as face recognition (Belhumeur et al., 1997; Martínez & Kak, 2001), text classification, microarray data classification, etc. The conventional LDA problem attempts to find an optimal linear transformation by minimizing the within-class distance and maximizing the between-class distance simultaneously. It is well known that this optimization problem can be solved by applying eigenvalue decomposition to the scatter matrices. However, this requires the total scatter matrix to be nonsingular, which is usually not the case in real-world applications. For example, microarray datasets typically lie in the large-p-small-n regime. To address this singularity issue, the pseudoinverse and regularization methods have been widely employed (Hastie et al., 2009; Zhang et al., 2010). Recently, two-stage approximate approaches, such as the PCA+LDA algorithm (Belhumeur et al., 1997) and the LDA/QR algorithm (Ye & Li, 2005), have also been proposed. Typically, a two-stage approach applies a cheap intermediate dimension-reduction method before the conventional LDA algorithm is performed. This class of algorithms is more scalable than the exact algorithm. To the best of our knowledge, however, there is no theoretical analysis of how well these algorithms approximate the exact LDA problem.
In this paper we first give a theoretical analysis of how well a two-stage algorithm approximates the exact LDA in the sense of maximizing the LDA objective function. This analysis motivates us to devise a new two-stage LDA algorithm. Our algorithm outperforms PCA+LDA (Belhumeur et al., 1997) while both have similar scalability. To make our algorithm more scalable, we introduce a random-projection-based SVD (RSVD) method (Mahoney, 2011; Halko et al., 2011). Furthermore, we provide an implementation of our algorithm on distributed systems to handle large-scale problems. Additionally, we show that our algorithm naturally generalizes to the kernel discriminant analysis (KDA) problem (Mika et al., 1999; Baudat & Anouar, 2000; Park & Park, 2005), which is a nonlinear generalization of the conventional LDA.

The remainder of the paper is organized as follows. In Section 2, we review the classical LDA and two two-stage algorithms for efficiently solving LDA, the PCA+LDA algorithm and the LDA/QR algorithm. In Section 3, we present our algorithm and some theoretical analysis. We conduct empirical analysis and comparison in Section 4, and conclude our work in Section 5.

2. The Fisher Discriminant Analysis

In this section we briefly review the conventional Fisher LDA and KDA problems, as well as efficient algorithms for solving them.

2.1. Linear Discriminant Analysis

We are given a data matrix X = [x_1, . . . , x_n]^T ∈ R^{n×p}, where x_i is a p-dimensional input instance. Suppose the input instances are partitioned into m classes, such that X^T = [X_1^T, X_2^T, . . . , X_m^T], where X_i ∈ R^{n_i×p} contains the n_i instances from the i-th class and Σ_{i=1}^m n_i = n. The conventional LDA finds an optimal linear transformation G ∈ R^{p×q} that preserves the class structure in a low-dimensional space as well as in the original space. That is, G maps each x_i of X in the p-dimensional space to a vector y_i in the q-dimensional space. The within-class, between-class, and total scatter matrices are defined by

S_w = \frac{1}{n} \sum_{i=1}^{m} \sum_{x \in X_i} (x - c_i)(x - c_i)^T, \quad
S_b = \frac{1}{n} \sum_{i=1}^{m} n_i (c_i - c)(c_i - c)^T, \quad
S_t = \frac{1}{n} \sum_{i=1}^{n} (x_i - c)(x_i - c)^T,

where c_i = \frac{1}{n_i} \sum_{x \in X_i} x is the mean of the i-th class and c = \frac{1}{n} \sum_{x \in X} x is the mean of the whole data set. In the low-dimensional space defined by the linear transformation G, the between-class and total scatter matrices become S_b^G = G^T S_b G and S_t^G = G^T S_t G, respectively.

We can simplify the formulation of the scatter matrices through the precursors H_w, H_b, and H_t, defined as

H_b = \frac{1}{\sqrt{n}} \big[ \sqrt{n_1}(c_1 - c), \ldots, \sqrt{n_m}(c_m - c) \big], \quad
H_w = \frac{1}{\sqrt{n}} \big[ X_1^T - c_1 e_1^T, \ldots, X_m^T - c_m e_m^T \big],   (1)

H_t = \frac{1}{\sqrt{n}} \big( X^T - c e^T \big),   (2)

where e_i = [1, . . . , 1]^T ∈ R^{n_i} and e = [1, . . . , 1]^T ∈ R^n. Hence, it can be shown that the scatter matrices can be rewritten as S_b = H_b H_b^T and S_t = H_t H_t^T.

An optimal transformation G^* can be obtained by solving the following optimization problem:

\max_{G} \ \Big\{ J(G) \triangleq \operatorname{trace}\big( (S_t^G)^{-1} S_b^G \big) \Big\}.

The solution to this optimization problem can be obtained through the eigen-decomposition of the matrix S_t^{-1} S_b whenever S_t is nonsingular (Fukunaga, 1990). When S_t is singular, we can instead use the eigen-decomposition of S_t^† S_b, where S_t^† is the Moore-Penrose inverse of S_t. Recently, an SVD-based algorithm was proposed to solve this eigen-problem (Ye, 2005); it is described in Algorithm 1. Let G^* be obtained from Algorithm 1. Then G^* consists of the top q eigenvectors of S_t^† S_b, where q = rank(H_b), which equals m - 1 in most cases. We also have (G^*)^T S_t G^* = I_q and (G^*)^T S_b G^* = Σ_q^2. Hence the optimal value of the LDA objective function is trace(Σ_q^2).
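As a concrete illustration of the precursors in (1) and (2), the following NumPy sketch forms H_b and H_t from a dense data matrix X and integer class labels y. It assumes the 1/sqrt(n) scaling written above and is our own sketch, not the authors' code; Algorithm 1 below operates on these two matrices.

```python
import numpy as np

def lda_precursors(X, y):
    """Form the precursors H_b (p x m) and H_t (p x n) of equations (1)-(2).

    X is an n x p data matrix and y holds integer class labels. A sketch
    assuming the 1/sqrt(n) scaling, not the authors' implementation."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    c = X.mean(axis=0)                        # global mean c (length p)
    Hb = np.empty((p, len(classes)))
    for j, cls in enumerate(classes):
        Xi = X[y == cls]                      # instances of the j-th class
        Hb[:, j] = np.sqrt(Xi.shape[0]) * (Xi.mean(axis=0) - c)
    Hb /= np.sqrt(n)                          # H_b = (1/sqrt(n)) [sqrt(n_i)(c_i - c)]
    Ht = (X - c).T / np.sqrt(n)               # H_t = (1/sqrt(n)) (X^T - c e^T)
    return Hb, Ht
```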
Algorithm 1 The SVD-based LDA algorithm
1: calculate H_t and H_b by (1) and (2)
2: compute the reduced SVD of H_t as H_t = U_t Σ_t V_t^T
3: B ← Σ_t^{-1} U_t^T H_b
4: compute the reduced SVD of B as B = P_q Σ_q Q_q^T
5: return G^* ← U_t Σ_t^{-1} P_q

2.2. Kernel Discriminant Analysis

To apply LDA to nonlinear data, many KDA algorithms have been devised by using the so-called kernel trick (Mika et al., 1999; Baudat & Anouar, 2000; Park & Park, 2005). The kernel method first maps the original data into a high-dimensional space H by a nonlinear transformation Φ : R^p → H. Typically, Φ is only implicitly available and we only know a kernel function κ : R^p × R^p → R such that κ(x_1, x_2) = Φ(x_1)^T Φ(x_2). Let K_b = [b_{ij}]_{1≤i≤n, 1≤j≤m} where

b_{ij} = \sqrt{\frac{n_j}{n}} \Big( \frac{1}{n_j} \sum_{x_k \in X_j} \kappa(x_i, x_k) - \frac{1}{n} \sum_{k=1}^{n} \kappa(x_i, x_k) \Big),   (3)

and K_t = [t_{ij}]_{1≤i≤n, 1≤j≤n} where

t_{ij} = \frac{1}{\sqrt{n}} \Big( \kappa(x_i, x_j) - \frac{1}{n} \sum_{k=1}^{n} \kappa(x_i, x_k) \Big).   (4)

Then the objective function of KDA becomes

J(G) = \operatorname{trace}\big( (A^T K_t K_t^T A)^{-1} (A^T K_b K_b^T A) \big),

where A = [a_{ij}]_{1≤i≤n, 1≤j≤q} satisfies [G(x)]_j = Σ_{i=1}^n a_{ij} κ(x, x_i) for any x ∈ R^p (Park & Park, 2005). This objective function has the same form as the LDA objective function; it simply replaces H_b with K_b and H_t with K_t. Therefore, Algorithm 1 can be applied directly to KDA. We omit the details.

2.3. Two-stage methods

There exist two important two-stage methods for the conventional LDA, namely PCA+LDA (Belhumeur et al., 1997; Yang & Yang, 2003) and LDA/QR (Ye & Li, 2005; Ye, 2005), also known as OCM+LDA (Park et al., 2003). These algorithms first compute a column-orthogonal transformation matrix Z in the first stage and then run the classical LDA algorithm on the reduced space obtained via Z.

The PCA+LDA algorithm first computes Z_PCA to maximize trace(Z^T S_t Z), which is equivalent to computing the top r left singular vectors of H_t. LDA maximizes the trace of the ratio of the between-class scatter matrix to the total scatter matrix, whereas PCA maximizes the trace of the total scatter matrix alone, which is not fully compatible with the LDA objective. Hence some information that is important for discrimination might be lost in the PCA stage.

LDA/QR first computes a column-orthogonal matrix Z_OCM by maximizing trace(Z^T S_b Z). This objective is more compatible with the LDA objective. Ye & Li (2005) showed that if the QR decomposition with pivoting (Golub & Van Loan, 1996) of H_b is H_b = QRΠ, then we can take Z_OCM = Q. However, the LDA/QR algorithm reduces the dimension to q, which is too small for many problems, so a lot of information in the original data matrix might be lost.
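Before moving on to our method, here is a minimal NumPy sketch of the exact SVD-based LDA routine of Algorithm 1 that the two-stage methods above approximate. The function name, the rank tolerance, and the reuse of the lda_precursors helper from the previous sketch are our own choices, not part of the paper.

```python
import numpy as np

def svd_lda(Hb, Ht, tol=1e-10):
    """Sketch of Algorithm 1: exact LDA from the precursors H_b and H_t."""
    # Step 2: reduced SVD of H_t, dropping numerically zero singular values.
    Ut, st, _ = np.linalg.svd(Ht, full_matrices=False)
    keep = st > tol * st[0]
    Ut, st = Ut[:, keep], st[keep]
    # Step 3: B = Sigma_t^{-1} U_t^T H_b.
    B = (Ut.T @ Hb) / st[:, None]
    # Step 4: reduced SVD of B; the leading q = rank(H_b) directions are kept.
    Pq, sig, _ = np.linalg.svd(B, full_matrices=False)
    q = int(np.sum(sig > tol))
    # Step 5: G* = U_t Sigma_t^{-1} P_q, so that (G*)^T S_t G* = I_q.
    return (Ut / st[None, :]) @ Pq[:, :q]

# Usage: G_star = svd_lda(*lda_precursors(X, y)); Y = X @ G_star projects the data.
```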
3. Methodology

In this section we first give a theoretical bound on how well a two-stage LDA algorithm approximates the exact LDA algorithm when the transformation matrix of the first stage is given. We then propose a fast two-stage LDA algorithm based on this theoretical analysis. We also generalize our algorithm to the KDA problem.

3.1. Theoretical Analysis

In the first stage of a two-stage algorithm, we introduce a linear transformation Z ∈ R^{p×r} to reduce the dimension to r. We assume Z to be column-orthonormal, that is, Z^T Z = I_r. In the second stage we apply the conventional LDA algorithm to the reduced total scatter matrix Z^T S_t Z and the reduced between-class scatter matrix Z^T S_b Z to obtain a linear transformation G̃. The final result is then Ĝ_Z = Z G̃. Intuitively, a good Z should keep as much information as possible in both S_t and S_b. Fortunately, rank(S_b) = q ≤ m - 1 is small under the assumption that the number of classes m is small. Therefore, if we let R(H_b) ⊆ R(Z), all of the information in S_b is kept after the linear transformation Z. Here R(·) denotes the range (column space) of a matrix. The following theorem shows how well we can approximate the solution of the exact LDA using such a Z.

Theorem 1. If Z ∈ R^{p×r} satisfies R(H_b) ⊆ R(Z) and Z^T Z = I_r, then

J(\hat{G}_Z) \;\ge\; \frac{1}{\|(G^*)^T Z Z^T H_t\|_2^2}\, J(G^*) \;\ge\; \frac{1}{\|H_t^{\dagger} Z Z^T H_t\|_2^2}\, J(G^*).

Here ‖·‖_2 is the spectral norm of a matrix, and G^* is the solution returned by Algorithm 1.

3.2. The SVD-QR-LDA Algorithm

To obtain a more accurate two-stage LDA algorithm, we can maximize the right-hand side of the inequality in Theorem 1. The optimization problem is

\min_{Z} \ \|H_t^{\dagger} Z Z^T H_t\|_2 \quad \text{s.t.} \quad Z^T Z = I_r, \ \ R(H_b) \subseteq R(Z).

Unfortunately, it is hard to solve this minimization problem exactly, so we use some heuristics to solve it approximately. First, we observe that without the constraint R(H_b) ⊆ R(Z), there is a closed-form solution to this minimization problem. Let U ∈ R^{p×s} contain the left singular vectors of H_t, where s = rank(H_t). For any index set I ⊆ {1, . . . , s}, let U_I be the columns of U indexed by I. By direct calculation, we see that for any I, ‖H_t^† U_I U_I^T H_t‖_2 = 1. Because we always have ‖H_t^† Z Z^T H_t‖_2 ≥ 1 for any feasible column-orthonormal matrix Z, Z = U_I is the desired solution. Second, we simply add some columns to U_I to form Z_I such that all conditions in Theorem 1 are satisfied. We first compute the QR decomposition with pivoting (I - U_I U_I^T) H_b = QRΠ. Then Z_I = [U_I, Q] is the required result. We can easily see that Z_I^T Z_I = I and R(H_b) ⊆ R(Z_I). Moreover, Z_I admits the following property:

Lemma 1. The matrix Z_I described above satisfies

\|H_t^{\dagger} Z_I Z_I^T H_t\|_2 \;\le\; 1 + \frac{\max_{1 \le i \le s,\, i \notin I} \sigma_i(H_t)}{\sigma_s(H_t)},

where σ_j(H_t) is the j-th singular value of H_t in decreasing order.

According to Lemma 1, we choose U_I to be the left singular vectors associated with the largest r - q singular values of H_t, so as to minimize max_{1≤i≤s, i∉I} σ_i(H_t). The pseudocode of our algorithm is given in Algorithm 2.

Algorithm 2 SVD-QR-LDA algorithm
Input: the target dimension r of the first stage
1: calculate H_t and H_b by (1) and (2), q ← rank(H_b)
2: Z_1 ← the top r - q left singular vectors of H_t
3: compute the QR decomposition with pivoting of H_b - Z_1 Z_1^T H_b as QRΠ = H_b - Z_1 Z_1^T H_b
4: Z_2 ← first q columns of Q, Z ← [Z_1, Z_2]
5: H_b ← Z^T H_b, H_t ← Z^T H_t
6: G ← result of the conventional LDA algorithm on H_b and H_t
7: return Ĝ ← Z G

When r = q, Z_1 in our algorithm disappears and our algorithm becomes identical to LDA/QR. Therefore, our algorithm can be regarded as a generalization of LDA/QR with a larger intermediate dimension r, and it gives a more accurate result than LDA/QR when r is much larger than q. Note that Z_PCA in PCA+LDA does not satisfy the conditions in Theorem 1, so we cannot compare our algorithm with PCA+LDA theoretically. However, we will see in Section 4 that the objective value attained by our algorithm is usually larger than that of PCA+LDA for the same intermediate dimension r.
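A single-machine sketch of Algorithm 2 follows. It reuses the svd_lda helper sketched in Section 2 together with SciPy's pivoted QR, and is meant only to make the two stages concrete; it uses a dense exact SVD, whereas Section 3.3 below replaces Step 2 by a randomized SVD.

```python
import numpy as np
from scipy.linalg import qr

def svd_qr_lda(Hb, Ht, r):
    """Sketch of Algorithm 2 (SVD-QR-LDA) with intermediate dimension r."""
    q = np.linalg.matrix_rank(Hb)
    # Step 2: top r - q left singular vectors of H_t.
    U, _, _ = np.linalg.svd(Ht, full_matrices=False)
    Z1 = U[:, :r - q]
    # Step 3: QR decomposition with pivoting of H_b - Z_1 Z_1^T H_b.
    Q, _, _ = qr(Hb - Z1 @ (Z1.T @ Hb), mode='economic', pivoting=True)
    # Step 4: Z = [Z_1, Z_2] with Z_2 the first q columns of Q.
    Z = np.hstack([Z1, Q[:, :q]])
    # Steps 5-7: conventional LDA (Algorithm 1 sketch) on the reduced precursors.
    G_tilde = svd_lda(Z.T @ Hb, Z.T @ Ht)
    return Z @ G_tilde
```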
3.3. The Randomized SVD Algorithm

The time complexity of our algorithm is T_PCA(r) + O(nrm), where T_PCA(r) is the time complexity of calculating the top r singular vectors. If we use a classical algorithm for the SVD, the time complexity is O(np min{n, p}), which is too slow for large-scale matrices. To make our algorithm scalable, we use a random-projection-based algorithm (Halko et al., 2011) to calculate the SVD. Its time complexity is O(npr), which is much smaller than O(np min{n, p}) when r is small.

Although the Krylov subspace method has the same time complexity O(npr), Halko et al. (2011) showed that random-projection-based algorithms are more scalable. When the data matrix does not fit in memory, randomized algorithms require only a constant number of passes over the data, while the subspace method requires at least O(r) passes. In our algorithm, we want r to be as large as possible to make the result more accurate, so the number of passes that the Krylov subspace method needs is too large. Randomized algorithms are also more robust and are easily parallelized. Moreover, we will show empirically in Section 4 that the loss in accuracy incurred by using the randomized algorithm is very small.

We describe this randomized approach in Algorithm 3, and refer to Algorithm 2 based on the randomized approach as RSVD-QR-LDA for short. The time complexity of RSVD-QR-LDA is O(npr), which is similar to that of PCA+LDA using the same SVD algorithm. Since the data matrix in question possibly has singular values that decay slowly, we need some power iteration steps. Setting the number of power iterations q_SVD to 1 usually suffices in practice.

Algorithm 3 Randomized SVD algorithm
Input: a p × n matrix A, a target dimension k_SVD, an exponent q_SVD, and an oversampling parameter p_SVD
1: Ω ← an n × (k_SVD + p_SVD) Gaussian test matrix
2: Y ← (A A^T)^{q_SVD} A Ω
3: compute the QR decomposition of Y as Y = QR
4: B ← Q^T A
5: compute the SVD of B as B = U Σ V^T
6: U_{k_SVD} ← first k_SVD columns of Q U, Σ_{k_SVD} ← leading k_SVD × k_SVD block of Σ, V_{k_SVD} ← first k_SVD columns of V
7: return U_{k_SVD}, Σ_{k_SVD}, V_{k_SVD}

3.4. Implementation on Distributed Systems

For large problems with millions of training instances or features, it may be impractical to run our algorithm on a single machine. Therefore, we implement our algorithm on distributed systems. Our implementation requires the scale of r to be moderate, such that a matrix of size r × r can fit in the memory of a single machine.

Our algorithm employs the randomized SVD algorithm described in Section 3.3 for performing the SVD. Section 3.3 already discussed the scalability issues of the randomized SVD algorithm and the Krylov subspace method. Furthermore, compared with the randomized SVD algorithm, the Krylov subspace method requires many more iterations, and each iteration depends on the results of the previous one. Thus, it needs data transfer and synchronization in each iteration. On distributed systems, the cost of data transfer and synchronization is large, so the Krylov subspace method can be much slower than the randomized method.

There are two operations in our algorithm that need to be implemented on distributed systems: multiplication of large matrices and the QR decomposition of tall-and-skinny matrices. Here a tall-and-skinny matrix is a matrix with many rows and few columns. Large matrix multiplication can be parallelized naturally. To implement the QR decomposition of a thin matrix on distributed systems, many methods have been proposed (Cosnard et al., 1986; Halko, 2012). We use the Cholesky-decomposition-based algorithm for the QR decomposition to keep our implementation simple. For a tall-and-skinny matrix Y, we compute the Cholesky decomposition of Y^T Y as Y^T Y = R^T R, where R is an upper triangular matrix. Then the QR decomposition of Y is Y = QR with Q = Y R^{-1}.
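As a small illustration of the Cholesky-based QR just described, the following sketch computes the factorization of a tall-and-skinny matrix on a single machine; on a cluster, the Gram matrix Y^T Y would be accumulated block-wise across workers before the same small Cholesky step. The helper name is our own.

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_qr(Y):
    """QR of a tall-and-skinny Y via the Cholesky factor of Y^T Y (a sketch)."""
    G = Y.T @ Y                                    # small r x r Gram matrix
    R = np.linalg.cholesky(G).T                    # Y^T Y = R^T R, R upper triangular
    Q = solve_triangular(R.T, Y.T, lower=True).T   # Q = Y R^{-1} without forming R^{-1}
    return Q, R

# Usage: Q, R = cholesky_qr(Y); np.allclose(Q @ R, Y) should hold for well-conditioned Y.
```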
3.5. Generalization to KDA

Algorithm 2 can be easily generalized to solve the KDA problem, because the objective function of KDA has a similar form to that of LDA. We present the detailed procedure in Algorithm 4.

Algorithm 4 RSVD-QR-KDA algorithm
Input: the target dimension r of the first stage
1: calculate K_t and K_b by (3) and (4), q ← rank(K_b)
2: Z_1 ← the top r - q left singular vectors of K_t
3: compute the QR decomposition with pivoting of K_b - Z_1 Z_1^T K_b as QRΠ = K_b - Z_1 Z_1^T K_b
4: Z_2 ← first q columns of Q, Z ← [Z_1, Z_2]
5: K_b ← Z^T K_b, K_t ← Z^T K_t
6: A ← result of the conventional KDA algorithm on K_b and K_t
7: return Â ← Z A

We can also use the randomized SVD algorithm in Step 2 of Algorithm 4. The time complexity of the resulting algorithm is T_k + O(n^2 r), where T_k is the time for calculating the kernel matrix K. Thus, our algorithm and the PCA+LDA algorithm again have similar computational complexity.

4. Experiments

In this section we perform an empirical analysis of our proposed algorithms on two face datasets, Yale B&E and CMU PIE, two middle-sized document datasets, News20 (Lang, 1995) and RCV1 (Lewis et al., 2004), and a large dataset, Amazon7 (Dredze et al., 2008; Blondel et al., 2013). The sizes of the five datasets are given in Table 1.

The Yale B&E dataset consists of the Yale Face Database B (Georghiades et al., 2001) and the extended Yale Face Database B (Lee et al., 2005). We collect a subset of 2414 face images of 38 subjects, and crop and resize each image to 32×32. For the CMU PIE dataset, we use a subset containing 11554 images of 68 subjects, and the images are cropped and resized to 64×64. For these two datasets, we randomly pick 70% of the data for training and the remaining for test. We repeat this procedure 5 times and report the averages of the objective function, classification accuracy and running time over these 5 repeats. For News20, we use the first 80% of the original data for training and the remaining 20% for test. For RCV1, we follow the preprocessing of Bekkerman & Scholz (2008): we map all categories to the second level and delete all data with multiple labels. We also follow the setting in Bekkerman & Scholz (2008) for the training and test data. The Amazon7 dataset contains 1,362,109 reviews of Amazon products. We randomly pick 80% of the dataset for training, and the remaining for test. We follow the preprocessing of Blondel et al. (2013).

Table 1. Summary of the datasets, where n is the number of training data, p is the number of input features, and m is the number of classes.

| Data set | n | test size | p | m |
|---|---|---|---|---|
| Yale B&E | 1,689 | 725 | 1,024 | 38 |
| PIE | 8,087 | 3,467 | 4,096 | 68 |
| News20 | 15,935 | 3,993 | 62,061 | 20 |
| RCV1 | 15,564 | 518,571 | 47,236 | 53 |
| Amazon7 | 1,089,687 | 272,422 | 262,144 | 7 |

4.1. Experiments with LDA Algorithms

We compare our RSVD-QR-LDA algorithm with the PCA-LDA algorithm. For the two small face datasets, Yale B&E and CMU PIE, we also compare with the conventional LDA solved by Algorithm 1, as well as with DSVD-QR-LDA, i.e., Algorithm 2 where the SVD in Step 2 is computed by the classical SVD algorithm. After the LDA algorithm, we perform k-NN classification on the reduced space, where k is selected via 10-fold cross-validation. The SVD step in both RSVD-QR-LDA and PCA-LDA is computed by Algorithm 3 with q_SVD = 1 and p_SVD = 0.1 k_SVD. All the algorithms are implemented in Python 2.7 on a PC with an Intel Xeon X5675 3.07 GHz CPU and 12 GB memory.
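The classification protocol just described can be sketched as follows. The candidate grid for k is our own assumption (the paper only states that k is chosen by 10-fold cross-validation), and scikit-learn is used purely for illustration.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(G_hat, X_train, y_train, X_test, y_test):
    """Project with the learned transformation G_hat (p x q), then run k-NN
    with k selected by 10-fold cross-validation on the training split."""
    Z_train, Z_test = X_train @ G_hat, X_test @ G_hat
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 3, 5, 7, 9, 11]},  # assumed grid
                          cv=10)
    search.fit(Z_train, y_train)
    return search.score(Z_test, y_test)
```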
In Figures 1, 2, 3 and 4, we report the average objective function value of LDA, classification accuracy, and running time of the LDA algorithms, with respect to different r. For RSVD-QR-LDA and PCA-LDA, we run each algorithm 10 times for each training-test splitting, and report the standard deviation of both the objective value and the classification accuracy; this standard deviation is taken with respect to different runs on the same splitting. For Yale B&E and CMU PIE, we let r range from m - 1 to about 1000, while for News20 and RCV1, we let r range from m - 1 to about 2000. When r = m - 1, our algorithm is identical to LDA/QR, so we thereby also compare our algorithm with LDA/QR.

We summarize the experimental results in Table 2, in which we report the best accuracy, the value of r for which the best accuracy is achieved, and the running time. Here the best r value is the hand-picked best value on the test data, and the standard deviation in Table 2 is taken with respect to different train/test splittings.

From these results, we can see that for both PCA-LDA and SVD-QR-LDA, the value of the objective function increases as r increases. For the same r, the objective value of SVD-QR-LDA is larger than that of PCA-LDA. With regard to classification accuracy, we find that on News20 and RCV1, SVD-QR-LDA outperforms PCA-LDA, while on Yale B&E and CMU PIE both our algorithm and PCA-LDA suffer from overfitting. However, our algorithm achieves its best accuracy at a smaller r, which makes it more efficient. Additionally, on Yale B&E and CMU PIE, both our algorithm and PCA-LDA outperform the exact LDA algorithm. The running time of RSVD-QR-LDA is nearly the same as that of PCA-LDA for the same r, and both are much faster than the conventional LDA, especially on large-scale datasets. We can also see that RSVD-QR-LDA performs nearly the same as DSVD-QR-LDA and that the standard deviations are small, which implies that the accuracy loss incurred by using the randomized algorithm is very small.

4.2. Experiments on the Distributed System

We test our distributed implementation on a Hadoop (Borthakur, 2007) cluster on Amazon Elastic MapReduce (EMR). We use an EMR cluster consisting of 8 m1.xlarge instances; each m1.xlarge instance has an Intel Xeon Family quad-core CPU and 15 GB memory. The version of Hadoop we use is 1.0.3. In Figure 5, we report the objective function of LDA, classification accuracy, and running time of the LDA algorithms with respect to different r. In Table 3, we report the best accuracy, the value of r for which the best accuracy is achieved, and the running time. From these results, we can see that our algorithm outperforms PCA-LDA in both objective value and classification accuracy, and the running times of the two algorithms are nearly the same.

4.3. Experiments with KDA Algorithms

For the KDA algorithms, we only conduct comparisons on the two face datasets, Yale B&E and CMU PIE. We use the RBF kernel κ(x_1, x_2) = exp(-‖x_1 - x_2‖^2 / θ) in our experiments, where θ is set to the mean Euclidean distance among the training data points. We report the objective function, classification accuracy and running time of the algorithms in Figures 6 and 7; the running time does not include the time for computing the kernel matrix. We also summarize the results in Table 4, including the best accuracy, the value of r at which the best accuracy is achieved, and the running time.
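For reference, the kernel construction used in these KDA experiments can be written in a few lines. The helper below is a sketch of the stated setup (RBF kernel with θ equal to the mean pairwise Euclidean distance of the training points), not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_kernel_matrix(X):
    """Kernel matrix K with K[i, j] = exp(-||x_i - x_j||^2 / theta), where theta is
    the mean Euclidean distance among the training points (a sketch)."""
    D = squareform(pdist(X))                         # pairwise Euclidean distances
    theta = D[np.triu_indices_from(D, k=1)].mean()   # mean over distinct pairs
    return np.exp(-D ** 2 / theta)
```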
It is seen from the results that our algorithm outperforms PCA+KDA in both the optimal value of the objective function and classification accuracy. The running times of the two algorithms are nearly the same, and both are much faster than the conventional KDA on the larger datasets.

5. Conclusion

In this paper we have proposed a novel two-stage algorithm for approximately solving the LDA problem, based on our theoretical analysis of the approximation error of two-stage methods. Our algorithm includes LDA/QR as a special case. We have made our algorithm scalable by using a randomized SVD algorithm and have generalized it to the kernel LDA problem. Furthermore, we have devised an implementation of our algorithm on distributed systems. We have conducted empirical analysis on several different datasets. The experimental results show that our algorithm outperforms PCA+LDA while they have the same scalability.

Table 2. The best mean classification accuracy (acc) with its standard deviation (std), the best r value for which the best accuracy is achieved, and the running time of the algorithms (in seconds). Here N/A means that LDA is not applicable to this dataset.

| Dataset | RSVD-QR-LDA acc (±std) | best r | time (±std) | PCA-LDA acc (±std) | best r | time (±std) | LDA acc (±std) | time (±std) |
|---|---|---|---|---|---|---|---|---|
| Yale B&E | 95.5 (±0.6) | 429 | 0.85 (±0.03) | 95.4 (±0.6) | 723 | 1.38 (±0.04) | 93.7 (±1.1) | 0.74 (±0.03) |
| CMU PIE | 96.6 (±0.4) | 167 | 6.3 (±0.3) | 96.3 (±0.4) | 267 | 4.3 (±0.7) | 93.0 (±0.4) | 69.3 (±18.2) |
| News20 | 85.8 | 2052 | 184 | 83.1 | 2052 | 180 | N/A | N/A |
| RCV1 | 84.0 | 302 | 14 | 82.3 | 2052 | 135 | N/A | N/A |

Figure 1. Objective function, classification accuracy and running time of two algorithms on the News20 dataset with respect to the intermediate dimension r. We also draw the standard deviations of the objective function and classification accuracy in the figure; the standard deviations are taken with respect to different runs on the same training-test splitting.

Figure 2. Objective function, classification accuracy and running time of two algorithms on the RCV1 dataset.
Figure 3. Objective function, classification accuracy and running time of four algorithms on the Yale B&E dataset.

Table 3. The best mean classification accuracy (acc), the best r value, and the running time of the algorithms (in minutes).

| Dataset | RSVD-QR-LDA acc | best r | time | PCA-LDA acc | best r | time |
|---|---|---|---|---|---|---|
| Amazon7 | 92.2 | 500 | 82.4 | 90.6 | 500 | 73.3 |

Figure 4. Objective function, classification accuracy and running time of four algorithms on the CMU PIE dataset.

Figure 5. Objective function, classification accuracy and running time of the two algorithms on the Amazon7 dataset.

Table 4. The best mean classification accuracy (acc) with its standard deviation (std), the best r value, and the running time of the algorithms (in seconds).

| Dataset | RSVD-QR-KDA acc (±std) | best r | time (±std) | PCA-KDA acc (±std) | best r | time (±std) | KDA acc (±std) | time (±std) |
|---|---|---|---|---|---|---|---|---|
| Yale B&E | 95.7 (±0.6) | 821 | 2.33 (±0.06) | 95.7 (±0.6) | 1017 | 2.75 (±0.09) | 95.7 (±0.6) | 1.89 (±0.30) |
| CMU PIE | 98.2 (±0.3) | 1067 | 28.9 (±0.2) | 98.1 (±0.1) | 1067 | 29.3 (±0.3) | 98.3 (±0.3) | 162.5 (±0.6) |

Figure 6. Objective function, classification accuracy and running time of the KDA algorithms on the Yale B&E dataset.

Figure 7. Objective function, classification accuracy and running time of the KDA algorithms on the CMU PIE dataset.
References

Baudat, G. and Anouar, F. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385-2404, 2000.

Bekkerman, R. and Scholz, M. Data weaving: scaling up the state-of-the-art in data clustering. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08), pp. 1083-1092, New York, NY, USA, 2008. ACM.

Belhumeur, P. N., Hespanha, J., and Kriegman, D. J. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.

Blondel, M., Seki, K., and Uehara, K. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning, pp. 1-22, 2013.

Borthakur, D. The Hadoop distributed file system: Architecture and design, 2007.

Cosnard, M., Muller, J.-M., and Robert, Y. Parallel QR decomposition of a rectangular matrix. Numerische Mathematik, 48(2):239-249, 1986.

Dredze, M., Crammer, K., and Pereira, F. Confidence-weighted linear classification. In Proceedings of the 25th International Conference on Machine Learning, pp. 264-271. ACM, 2008.

Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:643-660, 2001.

Golub, G. H. and Van Loan, C. F. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.

Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

Halko, N. P. Randomized methods for computing low-rank approximations of matrices. PhD thesis, University of Colorado, 2012.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction. Springer, 2009.

Lang, K. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 331-339. Morgan Kaufmann, 1995.

Lee, K.-C., Ho, J., and Kriegman, D. J. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684-698, 2005.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.

Martínez, A. M. and Kak, A. C. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228-233, 2001.

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müller, K.-R. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41-48, 1999.

Park, C. H. and Park, H. Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 27(1):87-102, 2005.

Park, H., Jeon, M., and Rosen, J. B. Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 43:427-448, 2003.

Yang, J. and Yang, J.-Y. Why can LDA be performed in PCA transformed space? Pattern Recognition, 36(2):563-566, 2003.

Ye, J. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483-502, 2005.

Ye, J. and Li, Q. A two-stage linear discriminant analysis via QR-decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):929-941, 2005.

Zhang, Z., Dai, G., Xu, C., and Jordan, M. I. Regularized discriminant analysis, ridge regression and beyond. Journal of Machine Learning Research, 11:2199-2228, 2010.