# Randomized Nonlinear Component Analysis

David Lopez-Paz (DLOPEZ@TUE.MPG.DE), Max-Planck-Institute for Intelligent Systems, University of Cambridge
Suvrit Sra (SUVRIT@TUE.MPG.DE), Max-Planck-Institute for Intelligent Systems, Carnegie Mellon University
Alexander J. Smola (ALEX@SMOLA.ORG), Carnegie Mellon University, Google Research
Zoubin Ghahramani (ZOUBIN@ENG.CAM.AC.UK), University of Cambridge
Bernhard Schölkopf (BS@TUE.MPG.DE), Max-Planck-Institute for Intelligent Systems

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Classical methods such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques are only able to reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, these are computationally prohibitive at large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving drastic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. A simple R implementation of the presented algorithms is provided.

## 1. Introduction

Principal Component Analysis (Pearson, 1901) and Canonical Correlation Analysis (Hotelling, 1936) are two of the most popular multivariate analysis methods. They have played a crucial role in a vast array of applications since their conception a century ago.

Principal Component Analysis (PCA) rotates a collection of correlated variables into their uncorrelated principal components (also known as factors or latent variables). Principal components owe their name to the following property: the first principal component captures the maximum amount of variance in the data; successive components account for the maximum amount of remaining variance in dimensions orthogonal to the preceding ones. PCA is commonly used for dimensionality reduction, assuming that core properties of a high-dimensional sample are largely captured by a small number of principal components.

Canonical Correlation Analysis (CCA) computes linear transformations of a pair of random variables such that their projections are maximally correlated. Analogous to principal components, the projections of the pair of random variables are mutually orthogonal and ordered by their amount of explained cross-correlation. CCA is widely used to learn from multiple modalities of data (Kakade & Foster, 2007), an ability particularly useful when some of the modalities are only available at training time but keeping information about them at testing time is beneficial (Chaudhuri et al., 2009; Vapnik & Vashist, 2009).

The applications of PCA and CCA are ubiquitous. Some examples are feature extraction, time-series prediction, finance, medicine, meteorology, chemometrics, biology, neurology, natural language processing, speech recognition, computer vision or multimodal signal processing (Jolliffe, 2002).
Despite their success, an impediment of PCA and CCA for modern data analysis is that both reveal only linear relationships between the variables under study. To overcome this limitation, several nonlinear extensions have been proposed for both PCA and CCA. For PCA, these include Kernel Principal Component Analysis or KPCA (Schölkopf et al., 1999) and Autoencoder Neural Networks (Baldi & Hornik, 1989; Hinton & Salakhutdinov, 2006). For CCA, common extensions are Kernel Canonical Correlation Analysis or KCCA (Lai & Fyfe, 2000; Bach & Jordan, 2002) and Deep Canonical Correlation Analysis (Andrew et al., 2013). However, these solutions tend to have rather high computational complexity (often cubic in the sample size), are difficult to parallelize, or are not accompanied by theoretical guarantees.

In a separate strand of recent research, randomized strategies have been introduced to construct features that can help reveal nonlinear patterns in data when used in conjunction with linear algorithms (Rahimi & Recht, 2008; Le et al., 2013). For basic tasks such as regression or classification, using nonlinear random features incurs little or no loss in performance compared with exact kernel methods, while achieving drastic savings in computational complexity (from cubic to linear in the sample size). Furthermore, random features are amenable to simple implementation and clean theoretical analysis.

The main contribution of this paper is to lay the foundations for nonlinear, randomized variants of PCA and CCA. Therefore, we dedicate attention to studying the spectral properties of low-rank kernel matrices constructed as sums of random feature dot-products. Our analysis relies on the recently developed matrix Bernstein inequality (Mackey et al., 2014). With little additional effort, our analysis extends to other popular multivariate analysis tools such as linear discriminant analysis, spectral clustering and the randomized dependence coefficient. We demonstrate the effectiveness of the proposed randomized methods by experimenting with several real-world datasets and comparing against the state-of-the-art Deep Canonical Correlation Analysis (Andrew et al., 2013). As a novel application of the presented methods, we derive an algorithm to learn using privileged information (Vapnik & Vashist, 2009) and a scalable strategy to train nonlinear autoencoder neural networks. Additional numerical simulations are provided to validate the tightness of the concentration bounds derived in our theoretical analysis. Lastly, the presented methods are very simple to implement; we provide R source code at: http://lopezpaz.org/code/rca.r

### 1.1. Related Work

There has been a recent stream of research in kernel approximations via randomized feature maps since the seminal work of Rahimi & Recht (2008). For instance, their extensions to dot-product kernels (Kar & Karnick, 2012) and polynomial kernels (Hamid et al., 2014); the development of advanced sampling techniques using Quasi-Monte-Carlo methods (Yang et al., 2014); or their accelerated computation via fast Walsh-Hadamard transforms (Le et al., 2013).

The use of randomized techniques for kernelized component analysis methods dates back to Achlioptas et al. (2002), where three kernel sub-sampling strategies were suggested to speed up KPCA. On the other hand, Avron et al. (2013) made use of randomized Walsh-Hadamard transforms to adapt linear CCA to large-scale datasets.
The use of nonlinear random features is more scarce and has only appeared twice in the previous literature. First, Lopez-Paz et al. (2013) defined the dependence statistic RDC as the largest canonical correlation between two sets of copula random projections. Second, McWilliams et al. (2013) used the Nyström method to define a randomized feature map and performed CCA to achieve state-of-the-art semi-supervised learning.

## 2. Random Nonlinear Features

We start our presentation by recalling a few key aspects of nonlinear random features.

Consider the class $\mathcal{F}_p$ of functions whose weights decay faster than some sampling distribution $p$; formally:

$$\mathcal{F}_p := \left\{ f(x) = \int_{\mathbb{R}^d} \alpha(w)\,\phi(x^T w)\,\mathrm{d}w \;:\; |\alpha(w)| \leq C p(w) \right\}. \qquad (1)$$

Here, $\alpha : \mathbb{R}^d \to \mathbb{R}$ is a nonlinear map of weights, while $\phi : \mathbb{R} \to \mathbb{R}$ is a nonlinear map that satisfies $|\phi(z)| \leq 1$; $x, w$ are vectors in $\mathbb{R}^d$, $p(w)$ is a probability density of the parameter vectors $w$, and $C$ is a regularizing constant. More simply, we may consider the finite version of $f$:

$$f_m(x) := \sum_{i=1}^m \alpha_i\,\phi(w_i^T x). \qquad (2)$$

Kernel machines, Gaussian processes, AdaBoost, and neural networks are models that fit within this function class.

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$ be a finite sample of input-output pairs drawn from a distribution $Q(X, Y)$. We seek to approximate a function $f$ in the class $\mathcal{F}_p$ by minimizing the empirical risk (over dataset $\mathcal{D}$)

$$R_{\mathrm{emp}}(f_m) := \frac{1}{n}\sum_{i=1}^n c(f_m(x_i), y_i), \qquad (3)$$

for a suitable loss function $c(\hat{y}, y)$ that penalizes departure of $f_m(x)$ from the true label $y$; for us, the least-squares loss $c(\hat{y}, y) = (\hat{y} - y)^2$ will be most convenient.

Jointly optimizing (3) over $(\alpha, w_1, \ldots, w_m)$ used in defining $f_m$ is a daunting task given the nonlinear nature of $\phi$. The key insight of Rahimi & Recht (2008) is that we can instead randomly sample the parameters $w_i \in \mathbb{R}^d$ from a data-independent distribution $p(w)$ and construct an $m$-dimensional randomized feature map $z(X)$ for the input data $X \in \mathbb{R}^{n \times d}$ that obeys the following structure:

$$
\begin{aligned}
w_1, \ldots, w_m &\sim p(w),\\
z_i &:= [\cos(w_i^T x_1 + b_i), \ldots, \cos(w_i^T x_n + b_i)] \in \mathbb{R}^n,\\
z(X) &:= [z_1 \cdots z_m] \in \mathbb{R}^{n \times m}.
\end{aligned} \qquad (4)
$$

Using the (precomputed) nonlinear random features $z(X)$ ultimately transforms the nonconvex optimization of (3) into a least-squares problem of the form

$$\min_{\alpha \in \mathbb{R}^m} \; \|y - z(X)\alpha\|_2^2, \quad \text{s.t.} \;\; \|\alpha\| \leq C. \qquad (5)$$

This form remarkably simplifies computation (in practice, we solve a regularized version of it), while incurring only a bounded error. Theorem 1 formalizes this claim.

Theorem 1 (Rahimi & Recht, 2008). Let $\mathcal{F}_p$ be defined as in (1). Draw $\mathcal{D} \sim Q(X, Y)$. Construct $z(\cdot)$ as in (4). Let $c : \mathbb{R}^2 \to \mathbb{R}_+$ be a loss function that is $L$-Lipschitz in its first argument. Then, for any $\delta > 0$, with probability at least $1 - 2\delta$ there exists some $\alpha = (\alpha_1, \ldots, \alpha_m)$ such that

$$\mathbb{E}_Q\left[c\big(z(x)\alpha, y\big)\right] - \min_{f \in \mathcal{F}_p} \mathbb{E}_Q\left[c(f(x), y)\right] \leq O\!\left(\left(\frac{LC}{\sqrt{n}} + \frac{LC}{\sqrt{m}}\right)\sqrt{\log\frac{1}{\delta}}\right).$$

Solving (5) takes $O(ndm + m^2 n)$ operations, while testing $t$ points on the fitted model takes $O(tdm)$ operations. Recent techniques that use subsampled Hadamard randomized transforms (Le et al., 2013) allow faster computation of the random features, yielding $O(nm\log(d) + m^2 n)$ operations to solve (5) and $O(tm\log(d))$ to test $t$ new points. It is of special interest that randomized algorithms are in many cases more robust than their deterministic analogues (Mahoney, 2011) because of the implicit regularization induced by randomness.
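To make (4) and (5) concrete, the following minimal R sketch builds the random Fourier feature map and fits the weights $\alpha$ with a ridge-regularized least-squares solve, a simple proxy for the norm-constrained program (5) since in practice we solve a regularized version anyway. This is an illustrative sketch rather than the released rca.r implementation; the function names and the $\sqrt{2}$ feature scaling (which makes $\frac{1}{m}\langle z(x), z(y)\rangle$ an unbiased estimate of the Gaussian kernel, cf. Rahimi & Recht, 2008) are choices made for this example.

```r
# Random Fourier features for the Gaussian kernel k(x, y) = exp(-s ||x - y||^2),
# drawn as w_i ~ N(0, 2s I), b_i ~ U(0, 2*pi), as in (4).
random_features <- function(X, m, s) {
  n <- nrow(X); d <- ncol(X)
  W <- matrix(rnorm(d * m, sd = sqrt(2 * s)), d, m)            # frequencies w_1, ..., w_m
  b <- runif(m, 0, 2 * pi)                                     # phases b_1, ..., b_m
  Z <- sqrt(2) * cos(X %*% W + matrix(b, n, m, byrow = TRUE))  # n x m feature matrix z(X)
  list(Z = Z, W = W, b = b)                                    # keep W, b to featurize test points
}

# Ridge-regularized least squares on the random features (a proxy for (5)).
fit_random_kitchen_sinks <- function(Z, y, lambda = 1e-6) {
  solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y))
}

# Example usage on toy data:
# X <- matrix(rnorm(500 * 5), 500, 5); y <- sin(X[, 1]) + 0.1 * rnorm(500)
# feats <- random_features(X, m = 200, s = 1)
# alpha <- fit_random_kitchen_sinks(feats$Z, y)
# yhat  <- feats$Z %*% alpha
```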
### 2.1. Random Features, Nyström and Kernel Matrices

Bochner's theorem helps connect shift-invariant kernels (Schölkopf & Smola, 2002) and random nonlinear features. Let $k(x, y)$ be a real-valued, normalized ($k(x, y) \leq 1$), shift-invariant kernel on $\mathbb{R}^d \times \mathbb{R}^d$. Then,

$$k(x, y) = \int_{\mathbb{R}^d} p(w)\, e^{-j w^T (x - y)}\, \mathrm{d}w \approx \sum_{i=1}^m \frac{1}{m}\, e^{-j w_i^T x}\, e^{j w_i^T y} \approx \sum_{i=1}^m \frac{1}{m} \cos(w_i^T x + b_i)\cos(w_i^T y + b_i) = \left\langle \tfrac{1}{\sqrt{m}} z(x), \tfrac{1}{\sqrt{m}} z(y) \right\rangle,$$

where $p(w)$ is set to be the inverse Fourier transform of $k$ and $b_i \sim U(0, 2\pi)$ (Rahimi & Recht, 2008); e.g., the Gaussian kernel $k(x, y) = \exp(-s\|x - y\|_2^2)$ can be approximated using $w_i \sim \mathcal{N}(0, 2sI)$.

Let $K \in \mathbb{R}^{n \times n}$ be the kernel matrix of some data $X \in \mathbb{R}^{n \times d}$, i.e., $K_{ij} = k(x_i, x_j)$. When approximating the kernel $k$ using $m$ random Fourier features, we may as well approximate the kernel matrix $K \approx \hat{K}$, where

$$\hat{K} := \frac{1}{m}\, z(X) z(X)^T = \frac{1}{m}\sum_{i=1}^m z_i z_i^T = \frac{1}{m}\sum_{i=1}^m \hat{K}^{(i)}, \qquad \hat{K}^{(i)} := z_i z_i^T. \qquad (6)$$

The focus of this paper is on building scalable kernel component analysis methods which not only exploit these approximations but are also accompanied by theoretical guarantees. Importantly, our analysis extends straightforwardly to features constructed using the Nyström method (Williams & Seeger, 2001) when its basis functions are bounded and sampled at random. Recent theoretical and empirical evidence suggests the superiority of the Nyström method when compared to the aforementioned Fourier features (Yang et al., 2012). However, we did not experience large differences (Section 5), while random Fourier features are faster to compute and do not need to be stored at test time (Le et al., 2013).

## 3. Principal Component Analysis (PCA)

Principal Component Analysis or PCA (Pearson, 1901; Jolliffe, 2002) is the orthogonal transformation of a set of $n$ observations of $d$ correlated variables $X \in \mathbb{R}^{n \times d}$ into a set of $n$ observations of $d$ uncorrelated principal components. For a centered data matrix (zero-mean columns) $X$, PCA requires computing the (full) singular value decomposition $X = U\Sigma F^T$, where $\Sigma$ is a diagonal matrix containing the singular values of $X$ in decreasing order. The principal components are computed via the linear transformation $XF$.

Nonlinear variants of PCA are also known; notably, Kernel PCA (Schölkopf et al., 1999) uses the kernel trick to embed data into a high-dimensional Reproducing Kernel Hilbert Space, where regular PCA is performed. Computation of the principal components reduces to an eigenvalue problem, which takes $O(n^3)$ operations. Autoencoders (Hinton & Salakhutdinov, 2006) are artificial neural networks configured to learn their own input; they are trained to learn compressed representations of data. The transformation computed by a linear autoencoder with a bottleneck of size $r < d$ is the projection onto the subspace spanned by the first $r$ principal components of the training data (Baldi & Hornik, 1989).

### 3.1. Randomized Nonlinear PCA (RPCA)

We propose RPCA, a nonlinear randomized variant of PCA. We may view RPCA as a low-rank approximation of KPCA when the latter is equipped with a shift-invariant kernel. RPCA may thus be understood as linear PCA on a randomized nonlinear mapping of the data. Schematically,

$$\{F, z(\cdot)\} =: \mathrm{RPCA}(X) := \mathrm{PCA}(z(X)) \approx \mathrm{KPCA}(X),$$

where $F \in \mathbb{R}^{m \times m}$ are the principal component loading vectors and $z : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times m}$ is a random feature map generated as in (4) (typically, $m \ll n$). The computational complexity is $O(n^3)$ for KPCA, $O(d^2 n)$ for PCA and $O(m^2 n)$ for RPCA. PCA and RPCA are both linear in the sample size $n$. When using nonlinear features as in (4), PCA loadings are no longer linear transformations but approximations of nonlinear functions belonging to the function class $\mathcal{F}_p$ described in Section 2.
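A minimal R sketch of this scheme, reusing the random_features() helper sketched in Section 2 (our illustrative code, not the released rca.r): linear PCA is simply run on the nonlinear random features.

```r
# RPCA sketch: ordinary PCA on the n x m random feature matrix z(X).
rpca <- function(X, m, s, num_components = 2) {
  Z  <- random_features(X, m, s)$Z
  pc <- prcomp(Z, center = TRUE, scale. = FALSE)    # PCA of z(X)
  list(scores   = pc$x[, 1:num_components],         # nonlinear principal components
       loadings = pc$rotation[, 1:num_components])  # columns of F
}
```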
As a consequence of Bochner's theorem (Section 2.1), the RPCA kernel matrix will approximate that of KPCA as the number of random features $m$ tends to infinity. This is because random feature dot-products converge uniformly to the exact kernel evaluations in expectation (Rahimi & Recht, 2008). Since the solution of KPCA is the eigensystem of the kernel matrix $K$ for the data matrix $X \in \mathbb{R}^{n \times d}$, one may study how fast the approximation $\hat{K}$ made in (6) converges in operator (or spectral) norm to $K$ as $m$ grows. To analyze this convergence we appeal to the recently proposed matrix Bernstein inequality. In the theorem below and henceforth, $\|X\|$ denotes the operator norm of a matrix $X$.

Theorem 2 (Matrix Bernstein, Mackey et al., 2014). Let $Z_1, \ldots, Z_m$ be independent $d \times d$ random matrices. Assume that $\mathbb{E}[Z_i] = 0$ and that $\|Z_i\| \leq R$. Define the variance parameter

$$\sigma^2 := \max\left\{ \Big\|\sum\nolimits_i \mathbb{E}[Z_i^T Z_i]\Big\|,\; \Big\|\sum\nolimits_i \mathbb{E}[Z_i Z_i^T]\Big\| \right\}.$$

Then, for all $t \geq 0$,

$$\mathbb{P}\left[\Big\|\sum\nolimits_i Z_i\Big\| \geq t\right] \leq 2d \exp\!\left(\frac{-t^2}{3\sigma^2 + 2Rt}\right).$$

Furthermore,

$$\mathbb{E}\left\|\sum\nolimits_i Z_i\right\| \leq \sqrt{3\sigma^2 \log(2d)} + R\log(2d).$$

The convergence rate of RPCA to its exact kernel counterpart KPCA is expressed by the following theorem, which actually invokes the Hermitian matrix version of Theorem 2, and hence depends on $d$ instead of $2d$ and uses matrix squares when defining the variance $\sigma^2$.

Theorem 3. Assume access to the data $X \in \mathbb{R}^{n \times d}$ and a shift-invariant, even kernel $k$. Construct the kernel matrix $K_{ij} := k(x_i, x_j)$ and its approximation $\hat{K}$ using $m$ random features as per (6). Then,

$$\mathbb{E}\,\|\hat{K} - K\| \leq \sqrt{\frac{3n^2\log n}{m}} + \frac{2n\log n}{m}. \qquad (7)$$

Proof. We follow a derivation similar to Tropp (2012). Denote by

$$\hat{K} = \frac{1}{m}\sum_{i=1}^m \hat{K}^{(i)} = \frac{1}{m}\sum_{i=1}^m z_i z_i^T$$

the $n \times n$ sample kernel matrix, and by $K$ its population counterpart, such that $\mathbb{E}[\hat{K}] = K$. Note that $\hat{K}$ is a sum of $m$ independent matrices, since our random features are sampled i.i.d. and the matrix $X$ is defined to be constant. Consider the individual error matrices

$$E = \hat{K} - K = \sum_{i=1}^m E_i, \qquad E_i = \frac{1}{m}\big(\hat{K}^{(i)} - K\big),$$

each of which satisfies $\mathbb{E}[E_i] = 0$. Since we are using bounded features (see $z(x)$ in (4)), it follows that there exists a constant $B$ such that $\|z\|^2 \leq B$. Thus, we see that

$$\|E_i\| = \frac{1}{m}\big\| z_i z_i^T - \mathbb{E}[z z^T]\big\| \leq \frac{1}{m}\big(\|z_i\|^2 + \mathbb{E}[\|z\|^2]\big) \leq \frac{2B}{m} =: R,$$

because of the triangle inequality on the norm and Jensen's inequality on the expected value. To bound the variance of $E$, bound first the variance of each of its summands $E_i$ (noting that $\mathbb{E}[z_i z_i^T] = K$):

$$\mathbb{E}[E_i^2] = \frac{1}{m^2}\,\mathbb{E}\big[(z_i z_i^T - K)^2\big] = \frac{1}{m^2}\,\mathbb{E}\big[\|z_i\|^2 z_i z_i^T - z_i z_i^T K - K z_i z_i^T + K^2\big] \preceq \frac{1}{m^2}\big(BK - 2K^2 + K^2\big) \preceq \frac{BK}{m^2}.$$

Next, taking all summands $E_i$ together we obtain

$$\Big\|\sum_{i=1}^m \mathbb{E}[E_i^2]\Big\| \leq \sum_{i=1}^m \big\|\mathbb{E}[E_i^2]\big\| \leq \frac{B\|K\|}{m} =: \sigma^2,$$

where the first inequality is the triangle inequality and the second uses the bound on $\mathbb{E}[E_i^2]$ above. We can now invoke the matrix Bernstein inequality (Theorem 2) on $E = \sum_i E_i$ to obtain the bound

$$\mathbb{E}\,\|\hat{K} - K\| \leq \sqrt{\frac{3B\|K\|\log n}{m}} + \frac{2B\log n}{m}.$$

Observe that random features and kernel evaluations are upper-bounded by 1; thus both $B$ and $\|K\|$ are upper-bounded by $n$, yielding the bound (7).

To obtain a characterization in relative error, we can divide both sides of (7) by $\|K\|$. This results in a bound that depends on $n$ only logarithmically (since $\|K\| = O(n)$). Moreover, additional information may be extracted from the tail-probability version of Theorem 2. Please refer to Section 5.1 for additional discussion on this aspect.

Before closing this section, we mention a byproduct of our above analysis.

Extension to Spectral Clustering. Spectral clustering (Luxburg, 2007) uses the spectrum of $K$ to perform dimensionality reduction before applying k-means. Therefore, the analysis of RPCA may be easily extended to obtain a randomized and nonlinear variant of spectral clustering.
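As an illustration of this extension (our sketch, not part of the paper's released code): one can embed the data with the top eigenvectors of $\hat{K} = \frac{1}{m} z(X)z(X)^T$, which are the left singular vectors of $z(X)/\sqrt{m}$, and then run k-means; this avoids ever forming the $n \times n$ matrix.

```r
# Randomized spectral clustering sketch: k-means on the top-k eigenvectors of K_hat,
# computed from the thin SVD of the random feature matrix.
randomized_spectral_clustering <- function(X, k, m, s) {
  Z  <- random_features(X, m, s)$Z / sqrt(m)   # so that K_hat = Z %*% t(Z)
  sv <- svd(Z, nu = k, nv = 0)                 # left singular vectors = eigenvectors of K_hat
  kmeans(sv$u, centers = k, nstart = 10)$cluster
}
```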
## 4. Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis or CCA (Hotelling, 1936) measures the correlation between two multidimensional random variables. Specifically, given two samples $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$, CCA computes a pair of canonical bases $F \in \mathbb{R}^{p \times r}$ and $G \in \mathbb{R}^{q \times r}$ such that

$$\|\mathrm{corr}(XF, YG) - I\|_F \;\text{is minimized}, \qquad \mathrm{corr}(XF, XF) = I, \qquad \mathrm{corr}(YG, YG) = I,$$

where $I$ stands for the identity matrix. The correlations between the canonical variables $XF$ and $YG$ are referred to as canonical correlations, and up to $r = \min(\mathrm{rank}(X), \mathrm{rank}(Y))$ of them can be computed. The canonical correlations $\rho_1^2, \ldots, \rho_r^2$ and basis vectors $f_1, \ldots, f_r \in \mathbb{R}^p$ and $g_1, \ldots, g_r \in \mathbb{R}^q$ form the eigensystem of the generalized eigenvalue problem (Bie et al., 2005):

$$
\begin{pmatrix} 0 & C_{XY} \\ C_{YX} & 0 \end{pmatrix}
\begin{pmatrix} f \\ g \end{pmatrix}
= \rho^2
\begin{pmatrix} C_{XX} + \gamma_x I & 0 \\ 0 & C_{YY} + \gamma_y I \end{pmatrix}
\begin{pmatrix} f \\ g \end{pmatrix},
$$

where $C_{XY}$ is the covariance $\mathrm{cov}(X, Y)$, while the diagonal terms $\gamma I$ act as regularization.

In other words, CCA processes two different views of the same data (e.g., speech audio signals and paired speaker video frames) and returns their maximally correlated linear transformations. This is particularly useful when the two views are available at training time, but only one of them is available at test time (Kakade & Foster, 2007; Chaudhuri et al., 2009; Vapnik & Vashist, 2009).

Several nonlinear extensions of CCA have been proposed. Kernel Canonical Correlation Analysis or KCCA (Lai & Fyfe, 2000; Bach & Jordan, 2002) uses the kernel trick to derive a nonparametric, nonlinear regularized CCA algorithm; its exact computation takes time $O(n^3)$. Deep Canonical Correlation Analysis or DCCA (Andrew et al., 2013) feeds the pair of input variables through a deep neural network; transformation weights are learnt using gradient descent to maximize the correlation of the output mappings.

### 4.1. Randomized Nonlinear CCA (RCCA)

We now propose RCCA, a nonlinear and randomized variant of CCA. As will be shown, RCCA is a low-rank approximation of KCCA when the latter is equipped with a pair of shift-invariant kernels. RCCA can be understood as linear CCA performed on a pair of randomized nonlinear mappings (see (4)), $z_x : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times m_x}$ and $z_y : \mathbb{R}^{n \times q} \to \mathbb{R}^{n \times m_y}$, of the data $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$. Schematically,

$$\mathrm{RCCA}(X, Y) := \mathrm{CCA}(z_x(X), z_y(Y)) \approx \mathrm{KCCA}(X, Y).$$

The computational complexity is $O(n^3)$ for KCCA, $O((p^2 + q^2)n)$ for CCA and $O((m_x^2 + m_y^2)n)$ for RCCA. CCA and RCCA are both linear in the sample size $n$. When performing RCCA, the basis vectors $f_1, \ldots, f_r \in \mathbb{R}^p$ and $g_1, \ldots, g_r \in \mathbb{R}^q$ become the basis functions $f_1, \ldots, f_r : \mathbb{R}^p \to \mathbb{R}$ and $g_1, \ldots, g_r : \mathbb{R}^q \to \mathbb{R}$, which approximate functions in the class $\mathcal{F}_p$ defined in Section 2.

As with PCA, we are interested in characterizing the convergence rate of RCCA to its exact kernel counterpart KCCA as $m_x$ and $m_y$ grow. The solution of KCCA is the eigensystem of the matrix $R^{-1}L$, where

$$R^{-1} := \begin{pmatrix} (K_x + \gamma_x I)^{-1} & 0 \\ 0 & (K_y + \gamma_y I)^{-1} \end{pmatrix}, \qquad (8)$$

$$L := \begin{pmatrix} 0 & K_y \\ K_x & 0 \end{pmatrix}, \qquad (9)$$

and $(\gamma_x, \gamma_y)$ are positive regularizers, mandatory to avoid spurious correlations equal to 1 (Bach & Jordan, 2002). Theorem 4 characterizes the convergence rate of RCCA to KCCA. Let $\hat{R}^{-1}$ and $\hat{L}$ be the approximations to (8) and (9) obtained by using $m$ random features; that is,

$$\hat{R}^{-1} := \begin{pmatrix} (\hat{K}_x + \gamma_x I)^{-1} & 0 \\ 0 & (\hat{K}_y + \gamma_y I)^{-1} \end{pmatrix}, \qquad (10)$$

$$\hat{L} := \begin{pmatrix} 0 & \hat{K}_y \\ \hat{K}_x & 0 \end{pmatrix}. \qquad (11)$$

Theorem 4. Assume access to the datasets $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^{n \times q}$ and shift-invariant kernels $k_x$, $k_y$. Define the kernel matrices $(K_x)_{ij} := k_x(x_i, x_j)$, $(K_y)_{ij} := k_y(y_i, y_j)$ and their approximations $\hat{K}_x$, $\hat{K}_y$ using $m_x$, $m_y$ random features as in (4), respectively. Let $L$, $R$, $\hat{L}$, $\hat{R}$ be as defined in (8)-(11), where $\gamma_x, \gamma_y > 0$ are regularization parameters. Furthermore, define $\gamma := \min(\gamma_x, \gamma_y)$ and $m := \min(m_x, m_y)$. Then,

$$\mathbb{E}\,\big\|\hat{R}^{-1}\hat{L} - R^{-1}L\big\| \leq \frac{1}{\gamma}\left(\sqrt{\frac{3n^2\log 2n}{m}} + \frac{2n\log 2n}{m}\right). \qquad (12)$$
Proof. As the matrices are block-diagonal, we have

$$\mathbb{E}\,\big\|\hat{R}^{-1}\hat{L} - R^{-1}L\big\| \leq \max\Big( \mathbb{E}\,\big\|(\hat{K}_x + \gamma_x I)^{-1}\hat{K}_y - (K_x + \gamma_x I)^{-1}K_y\big\|,\; \mathbb{E}\,\big\|(\hat{K}_y + \gamma_y I)^{-1}\hat{K}_x - (K_y + \gamma_y I)^{-1}K_x\big\| \Big).$$

We analyze the first term of the maximum; the latter can be analyzed analogously. Let $\hat{A} := (\hat{K}_x + \gamma_x I)^{-1}$ and $A := (K_x + \gamma_x I)^{-1}$. Define the individual error terms

$$E_i = \frac{1}{m_y}\big(\hat{A}\hat{K}_y^{(i)} - AK_y\big), \qquad E = \sum_{i=1}^{m_y} E_i.$$

Recall that the $m_x + m_y$ random features are sampled i.i.d. and that the data matrices $X, Y$ are constant. Therefore, the random matrices $\hat{K}_x^{(1)}, \ldots, \hat{K}_x^{(m_x)}, \hat{K}_y^{(1)}, \ldots, \hat{K}_y^{(m_y)}$ are independent random variables. Hence, their expectations factorize:

$$\mathbb{E}[E_i] = \frac{1}{m_y}\big(\mathbb{E}[\hat{A}]K_y - AK_y\big),$$

where we used $\mathbb{E}[\hat{K}_y^{(i)}] = K_y$. The deviation of the individual error matrices from their expectations is

$$Z_i := E_i - \mathbb{E}[E_i] = \frac{1}{m_y}\big(\hat{A}\hat{K}_y^{(i)} - \mathbb{E}[\hat{A}]K_y\big),$$

and the norm of this deviation is bounded as

$$\|Z_i\| = \frac{1}{m_y}\big\|\hat{A}\hat{K}_y^{(i)} - \mathbb{E}[\hat{A}]K_y\big\| \leq \frac{2B}{m_y\gamma_x} =: R.$$

The inequality follows by applying Hölder twice after using the triangle inequality. We now turn to the issue of computing the variance, which is defined as

$$\sigma^2 := \max\Big\{ \Big\|\sum_{i=1}^{m_y} \mathbb{E}[Z_i Z_i^T]\Big\|,\; \Big\|\sum_{i=1}^{m_y} \mathbb{E}[Z_i^T Z_i]\Big\| \Big\}.$$

Consider first the second argument of the maximum above, for which we expand an individual term in the summand:

$$Z_i^T Z_i = \frac{1}{m_y^2}\Big( \hat{K}_y^{(i)}\hat{A}^2\hat{K}_y^{(i)} + K_y\mathbb{E}[\hat{A}]^2 K_y - \hat{K}_y^{(i)}\hat{A}\,\mathbb{E}[\hat{A}]K_y - K_y\mathbb{E}[\hat{A}]\hat{A}\hat{K}_y^{(i)} \Big).$$

Taking expectations we see that

$$\mathbb{E}[Z_i^T Z_i] = \frac{1}{m_y^2}\Big( \mathbb{E}\big[\hat{K}_y^{(i)}\hat{A}^2\hat{K}_y^{(i)}\big] - K_y\mathbb{E}[\hat{A}]^2 K_y \Big) \preceq \frac{1}{m_y^2}\,\mathbb{E}\big[\hat{K}_y^{(i)}\hat{A}^2\hat{K}_y^{(i)}\big],$$

where the inequality follows as $K_y\mathbb{E}[\hat{A}]^2 K_y \succeq 0$. Taking norms and invoking Jensen's inequality we then obtain

$$\big\|\mathbb{E}[Z_i^T Z_i]\big\| \leq \frac{B\|K_y\|}{m_y^2\gamma_x^2}.$$

A similar argument shows that

$$\mathbb{E}[Z_i Z_i^T] \preceq \frac{1}{m_y^2}\,\mathbb{E}\big[\hat{A}(\hat{K}_y^{(i)})^2\hat{A}\big], \qquad \big\|\mathbb{E}[Z_i Z_i^T]\big\| \leq \frac{B\|K_y\|}{m_y^2\gamma_x^2}.$$

An invocation of Jensen on the definition of $\sigma^2$ along with the two bounds above yields the worst-case estimate

$$\sigma^2 \leq \frac{B\|K_y\|}{m_y\gamma_x^2}.$$

We may now appeal to the matrix Bernstein inequality (Theorem 2) to obtain the bound

$$\mathbb{E}\,\big\|(\hat{K}_x + \gamma_x I)^{-1}\hat{K}_y - (K_x + \gamma_x I)^{-1}K_y\big\| \leq \frac{1}{\gamma_x}\left(\sqrt{\frac{3n^2\log 2n}{m_y}} + \frac{2n\log 2n}{m_y}\right).$$

The result follows by analogously bounding $\mathbb{E}\,\|(\hat{K}_y + \gamma_y I)^{-1}\hat{K}_x - (K_y + \gamma_y I)^{-1}K_x\|$ and taking maxima.

Before concluding this section, we briefly comment on two easy extensions of our above result.

Extension to Linear Discriminant Analysis. Linear Discriminant Analysis (LDA) seeks a linear combination of the features of the data $X \in \mathbb{R}^{n \times d}$ such that the samples become maximally separable with respect to a paired labeling $y$ with $y_i \in \{1, \ldots, c\}$. LDA can be solved by $\mathrm{CCA}(X, T)$, where $T_{ij} = \mathbb{I}\{y_i = j\}$ (Bie et al., 2005). Therefore, an analysis similar to that of RCCA could be used to obtain a randomized nonlinear variant of LDA.

Extension to RDC. The Randomized Dependence Coefficient or RDC (Lopez-Paz et al., 2013) is defined as the largest canonical correlation of RCCA when performed on the copula transformation of the data matrices $X$ and $Y$. Our analysis applies to the further understanding of RDC.

## 5. Experiments

We investigate the performance of RCCA in multiple experiments with real-world data against state-of-the-art algorithms. Section 5.3 provides a novel algorithm based on RCCA to perform learning using privileged information (Vapnik & Vashist, 2009). Section 5.4 introduces the use of RPCA as a tool to train autoencoders in a scalable manner.

We set our random (Fourier) features to approximate the Gaussian kernel, as described in the second paragraph of Section 2.1. We also compare to the Nyström method, set to construct an $m$-dimensional feature space formed by the evaluations of the Gaussian kernel on $m$ random points from the training set (Yang et al., 2012). Gaussian kernel widths $\{s_x, s_y\}$ are set using the median heuristic.
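For reference, a minimal R sketch of the RCCA procedure used in these experiments (our illustration, not the released rca.r): median_heuristic() implements one common form of the heuristic, and explicit regularization is omitted because base R's cancor() does not expose it.

```r
# Median heuristic for the Gaussian kernel width s in exp(-s ||x - y||^2),
# estimated from (a subsample of) the pairwise distances.
median_heuristic <- function(X, max_points = 1000) {
  idx <- sample(nrow(X), min(nrow(X), max_points))
  med <- median(as.numeric(dist(X[idx, , drop = FALSE])))
  1 / (2 * med^2)
}

# RCCA sketch: linear CCA on the two random feature matrices.
rcca <- function(X, Y, mx, my) {
  Zx <- random_features(X, mx, median_heuristic(X))$Z
  Zy <- random_features(Y, my, median_heuristic(Y))$Z
  stats::cancor(Zx, Zy)   # $cor: canonical correlations, $xcoef/$ycoef: bases
}
```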
### 5.1. Empirical Validation of Bernstein Inequalities

We first turn to the issue of empirically validating the bounds obtained in Theorems 3 and 4. To do so, we perform simulations in which we separately vary the values of the sample size $n$, the number of random projections $m$, and the regularization parameter $\gamma$. We use synthetic data matrices $X \in \mathbb{R}^{n \times 10}$ and $Y \in \mathbb{R}^{n \times 10}$ formed by i.i.d. normal entries. When not varying, the parameters are fixed to $n = 1000$, $m = 1000$ and $\gamma = 10^{-3}$.

Figure 1. Error norms as a function of a varying parameter, depicted in the x-axis. Left: RPCA. Right: RCCA ($\|\hat{R}^{-1}\hat{L} - R^{-1}L\|$).

Figure 1 depicts the value of the norms from equations (7) and (12) as the parameters $\{n, m, \gamma\}$ vary, averaged over a total of 1000 random samples $\{X_i, Y_i\}_{i=1}^{1000}$. The simulations agree with the presented theoretical analysis: the sample size $n$ and regularization parameter $\gamma$ exhibit a linear effect, while increasing the number of random features $m$ induces an $O(m^{-1/2})$ reduction in error (the closest function $O(m^{-1/2})$ is overlaid in red for comparison).
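A self-contained version of the RPCA part of this simulation can be written in a few lines of R (our illustration, with smaller, hypothetical settings than those reported above), reusing the random_features() helper from Section 2.

```r
# Monte Carlo check of Theorem 3: the operator-norm error ||K_hat - K|| should
# shrink roughly as O(m^(-1/2)) as the number of random features m grows.
set.seed(0)
n <- 500; d <- 10; s <- 0.1
X <- matrix(rnorm(n * d), n, d)
K <- exp(-s * as.matrix(dist(X))^2)            # exact Gaussian kernel matrix

errors <- sapply(c(250, 500, 1000, 2000), function(m) {
  Z <- random_features(X, m, s)$Z
  norm(tcrossprod(Z) / m - K, type = "2")      # spectral-norm error ||K_hat - K||
})
print(errors)
```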
### 5.2. Canonical Correlation Analysis

We compare three variants of CCA on the task of learning correlated features from two modalities of the same data: linear CCA, the state-of-the-art Deep CCA (Andrew et al., 2013), and the proposed (Fourier- and Nyström-based) RCCA. We were unable to run exact KCCA on the proposed datasets due to its cubic complexity; other low-rank approximations such as the one of Arora & Livescu (2012) were shown inferior to DCCA, and are hence omitted from our analysis.

We replicate the two experiments presented in Andrew et al. (2013). The task is to measure performance as the accumulated correlation between the canonical variables associated with the largest training canonical correlations on some unseen test data. The participating datasets are MNIST and XRMB, which are introduced in the following.

MNIST Handwritten Digits. Learn correlated representations between the left and right halves of the MNIST images (LeCun & Cortes, 1998). Each image has a width and height of 28 pixels; therefore, each of the two views of CCA consists of 392 features. 54000 random samples are used for training, 10000 for testing and 6000 to cross-validate the parameters of (D)CCA.

X-Ray Microbeam Speech Data. Learn correlated representations of simultaneous acoustic and articulatory speech measurements (Westbury, 1994). The articulatory measurements describe the position of the speaker's lips, tongue and jaws for seven consecutive frames, yielding a 112-dimensional vector at each point in time; the acoustic measurements are the MFCCs for the same frames, producing a 273-dimensional vector for each point in time. 30000 random samples are used for training, 10000 for testing and 10000 to cross-validate the parameters of (D)CCA.

RCCA on MNIST (50 largest canonical correlations):

| m_x, m_y | Fourier corr. | Fourier minutes | Nyström corr. | Nyström minutes |
|---|---|---|---|---|
| 1000 | 36.31 | 5.55 | 41.68 | 5.29 |
| 2000 | 39.56 | 19.45 | 43.15 | 18.57 |
| 3000 | 40.95 | 41.98 | 43.76 | 41.25 |
| 4000 | 41.65 | 73.80 | 44.12 | 75.00 |
| 5000 | 41.89 | 112.80 | 44.36 | 115.20 |
| 6000 | 42.06 | 153.48 | 44.49 | 156.07 |

RCCA on XRMB (112 largest canonical correlations):

| m_x, m_y | Fourier corr. | Fourier minutes | Nyström corr. | Nyström minutes |
|---|---|---|---|---|
| 1000 | 68.79 | 2.95 | 81.82 | 3.07 |
| 2000 | 82.62 | 11.45 | 93.21 | 12.05 |
| 3000 | 89.35 | 26.31 | 98.04 | 26.07 |
| 4000 | 93.69 | 48.89 | 100.97 | 50.07 |
| 5000 | 96.49 | 79.20 | 103.03 | 81.6 |
| 6000 | 98.61 | 120.00 | 104.47 | 119.4 |

Linear CCA and DCCA:

| | linear CCA corr. | linear CCA minutes | DCCA corr. | DCCA minutes |
|---|---|---|---|---|
| MNIST | 28.0 | 0.57 | 39.7 | 787.38 |
| XRMB | 16.9 | 0.11 | 92.9 | 4338.32 |

Table 1. Sum of largest test canonical correlations and running times achieved by all CCA variants on the MNIST and XRMB datasets.

Summary of Results. Table 1 shows the sum of the largest canonical correlations (corr.) obtained by each CCA variant and their running times (minutes, single 1.8GHz core) on the MNIST and XRMB test sets. Given enough random projections ($m = m_x = m_y$), RCCA is able to explain the largest amount of test correlation while running drastically faster than DCCA (running times for DCCA correspond to a single cross-validation iteration of its ten hyper-parameters; DCCA has 2 layers for MNIST and 8 layers for XRMB). Moreover, when using random features, (i) the number of weights required to be stored at test time for RCCA is up to two orders of magnitude lower than for DCCA and (ii) the use of Fastfood multiplications (Le et al., 2013) allows much faster model evaluation.

Parameter Selection. No parameters were tuned for RCCA: the kernel widths were heuristically set and CCA regularization is implicitly provided by the use of randomness (thus set to $10^{-8}$). The number of random features $m$ can be set to the maximum value that fits within the available (training or test time) computational budget. In contrast, the previous state-of-the-art DCCA has ten parameters (two autoencoder parameters for pretraining, number of hidden layers, number of hidden units, and CCA regularizers for each view), which were cross-validated using the grids described in Andrew et al. (2013). Cross-validating the RCCA parameters did not significantly improve performance. If desired, further speed improvements for RCCA could be achieved by distributing the computation of covariance matrices over several CPUs or GPUs, and by making use of truncated SVD routines (Baglama & Reichel, 2006).

### 5.3. Learning Using Privileged Information

In Vapnik's Learning Using Privileged Information (LUPI) paradigm (Vapnik & Vashist, 2009), the learner has access to a set of privileged features or information $X^\star$, available exclusively at training time. These features are understood as helpful high-level teacher explanations about each of the training samples. The challenge is to build algorithms able to extract information from these privileged features at training time in order to build a better classifier at test time. We propose to use RCCA to construct a highly correlated subspace between the regular features $X$ and the privileged features $X^\star$, accessible at test time through a nonlinear transformation of $X$.

We experiment with the Animals-with-Attributes dataset (Lampert et al., 2009). In this dataset, the regular features $X$ are the SURF descriptors of 30000 pictures of 35 different animals; the privileged features $X^\star$ are 85 high-level binary attributes associated with each picture (such as eats-fish or can-fly). To extract information from $X^\star$ at training time, we build a feature space formed by the concatenation of the 85 five-dimensional top canonical variables $z_x(X)F^{(i)}_{1:5}$ associated with each $\mathrm{RCCA}(X, [X^{\star(i)}, y])$, $i \in \{1, \ldots, 85\}$. The vector $y$ denotes the training labels.
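A sketch of this construction in R (our reading of the procedure, with illustrative feature counts; it reuses the random_features() and median_heuristic() helpers sketched above):

```r
# Semi-privileged features: for each of the 85 attributes, run CCA between the
# random features of X and those of [X_star_i, y], keep the top 5 canonical
# variables of the X view, and concatenate the 85 * 5 resulting columns.
semi_privileged_features <- function(X, X_star, y, mx = 300, my = 50) {
  Zx <- random_features(X, mx, median_heuristic(X))$Z
  blocks <- lapply(seq_len(ncol(X_star)), function(i) {
    V  <- cbind(X_star[, i], y)                           # one privileged attribute plus the labels
    Zy <- random_features(V, my, median_heuristic(V))$Z
    cc <- stats::cancor(Zx, Zy)
    scale(Zx, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1:5]
  })
  do.call(cbind, blocks)                                  # n x (85 * 5) training features
}
```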
We perform 14 random training/test partitions of 1000 samples each. Each partition groups a random subset of 10 animals as class "0" and a second random subset of 10 animals as class "1". Hence, each experiment is a different, challenging binary classification problem. Figure 2 shows the test classification accuracy of a linear SVM when using as features the images' SURF descriptors or the RCCA semi-privileged features. As a side note, directly using the high-level attributes yields 100% accuracy. The cost parameter of the linear SVM is cross-validated on the grid $[10^{-4}, \ldots, 10^4]$.

Figure 2. Classification accuracy on the LUPI experiments.

We observe an average improvement of 14% in classification accuracy when using the RCCA basis instead of the image features alone. The results are statistically significant with respect to a paired Wilcoxon test at the 95% confidence level. The SVM+ algorithm (Vapnik & Vashist, 2009) did not improve on the regular SVM using SURF descriptors.

### 5.4. Randomized Autoencoders

RPCA can be used for the scalable training of nonlinear autoencoders. The process involves (i) mapping the observed data $Y \in \mathbb{R}^{D \times n}$ into the latent factors $X \in \mathbb{R}^{d \times n}$ using the top $d$ nonlinear principal components from RPCA and (ii) reconstructing $Y$ from $X$ using $D$ nonlinear regressors.

Figure 3. Autoencoder reconstructions of unseen test images for the MNIST (top) and CIFAR-10 (bottom) datasets.

Figure 3 shows the reconstruction of unseen MNIST and CIFAR-10 images after being compressed with RPCA. The number of random projections was set to $m = 2000$. The number of latent dimensions was set to $d = 20$ for MNIST, and $d = 40$ (first row) or $d = 100$ (second row) for CIFAR-10. Training took under 200 seconds for each full dataset.
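A rough R sketch of this two-step procedure (our illustration with hypothetical decoder choices, using the helpers sketched above and row-wise data $Y \in \mathbb{R}^{n \times D}$ rather than the column-wise convention in the text): compress with RPCA, then reconstruct every output dimension with ridge regression on random features of the latent codes.

```r
# Randomized autoencoder sketch: (i) encode with the top d RPCA components,
# (ii) decode with D nonlinear (random-feature ridge) regressors fit jointly.
randomized_autoencoder <- function(Y, d = 20, m = 2000, lambda = 1e-6) {
  enc <- rpca(Y, m = m, s = median_heuristic(Y), num_components = d)
  X   <- enc$scores                                     # n x d latent codes
  Zx  <- random_features(X, m, median_heuristic(X))$Z   # nonlinear features of the codes
  A   <- solve(crossprod(Zx) + lambda * diag(ncol(Zx)), crossprod(Zx, Y))
  list(codes = X, reconstruction = Zx %*% A)            # one ridge regressor per column of Y
}
```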
## Acknowledgements

We thank the anonymous reviewers for their numerous comments, and Yarin Gal, Mark van der Wilk and Maxim Rabinovich for fruitful discussions. Lopez-Paz is supported by Obra Social "la Caixa".

## References

Achlioptas, D., McSherry, F., and Schölkopf, B. Sampling techniques for kernel methods. NIPS, 2002.

Andrew, G., Arora, R., Livescu, K., and Bilmes, J. Deep canonical correlation analysis. ICML, 2013.

Arora, R. and Livescu, K. Kernel CCA for multi-view learning of acoustic features using articulatory measurements. MLSLP, 2012.

Avron, H., Boutsidis, C., Toledo, S., and Zouzias, A. Efficient dimensionality reduction for canonical correlation analysis. ICML, 2013.

Bach, F. R. and Jordan, M. I. Kernel independent component analysis. JMLR, 2002.

Baglama, J. and Reichel, L. Restarted block Lanczos bidiagonalization methods. Numerical Algorithms, 2006.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.

Bie, T. De, Cristianini, N., and Rosipal, R. Eigenproblems in pattern recognition. Handbook of Geometric Computing, 2005.

Chaudhuri, K., Kakade, S. M., Livescu, K., and Sridharan, K. Multi-view clustering via canonical correlation analysis. ICML, 2009.

Hamid, R., Xiao, Y., Gittens, A., and DeCoste, D. Compact random feature maps. ICML, 2014.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.

Hotelling, H. Relations between two sets of variates. Biometrika, 1936.

Jolliffe, I. T. Principal Component Analysis. Springer, 2002.

Kakade, S. M. and Foster, D. P. Multi-view regression via canonical correlation analysis. COLT, 2007.

Kar, P. and Karnick, H. Random feature maps for dot product kernels. arXiv:1201.6530, 2012.

Lai, P. and Fyfe, C. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 2000.

Lampert, C. H., Nickisch, H., and Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. CVPR, 2009.

Le, Q., Sarlos, T., and Smola, A. Fastfood: Approximating kernel expansions in loglinear time. ICML, 2013.

LeCun, Y. and Cortes, C. The MNIST database of handwritten digits. 1998.

Lopez-Paz, D., Hennig, P., and Schölkopf, B. The randomized dependence coefficient. NIPS, 2013.

Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 2007.

Mackey, L., Jordan, M. I., Chen, R. Y., Farrell, B., and Tropp, J. A. Matrix concentration inequalities via the method of exchangeable pairs. Annals of Probability, 2014.

Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 2011.

McWilliams, B., Balduzzi, D., and Buhmann, J. Correlated random features for fast semi-supervised learning. NIPS, 2013.

Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 1901.

Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS, 2008.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

Schölkopf, B., Smola, A., and Müller, K. R. Kernel principal component analysis. Advances in Kernel Methods - Support Vector Learning, 1999.

Tropp, J. A. User-Friendly Tools for Random Matrices: An Introduction. NIPS Tutorials, 2012.

Vapnik, V. and Vashist, A. A new learning paradigm: Learning using privileged information. Neural Networks, 2009.

Westbury, J. R. X-Ray Microbeam Speech Production Database User's Handbook, version 1.0. 1994.

Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. NIPS, 2001.

Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. W. Quasi-Monte Carlo feature maps for shift-invariant kernels. ICML, 2014.

Yang, T., Li, Y., Mahdavi, M., Jin, R., and Zhou, Z. Nyström method vs random Fourier features: A theoretical and empirical comparison. NIPS, 2012.