# Streaming Kernel PCA with $\tilde{O}(\sqrt{n})$ Random Features

Enayat Ullah (enayat@jhu.edu), Poorya Mianjy (mianjy@jhu.edu), Teodor V. Marinov (tmarino2@jhu.edu), Raman Arora (arora@cs.jhu.edu)
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21204
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Abstract

We study the statistical and computational aspects of kernel principal component analysis using random Fourier features and show that, under mild assumptions, $O(\sqrt{n}\log n)$ features suffice to achieve $O(1/\epsilon^2)$ sample complexity. Furthermore, we give a memory-efficient streaming algorithm based on classical Oja's algorithm that achieves this rate.

1 Introduction

Kernel methods represent an important class of machine learning algorithms that simultaneously enjoy strong theoretical guarantees and good empirical performance. However, they are notoriously hard to scale to large datasets due to their space and runtime complexity (typically $O(n^2)$ and $O(n^3)$, respectively, for most problems) [Smola and Schölkopf, 1998]. There have been many efforts to overcome these computational challenges, including the Nyström method [Williams and Seeger, 2001], incomplete Cholesky factorization [Fine and Scheinberg, 2001], random Fourier features (RFF) [Rahimi and Recht, 2007], and randomized sketching [Yang et al., 2015]. In this paper, we focus on random Fourier features due to their broad applicability to a large class of kernel problems.

In a seminal paper, Rahimi and Recht [2007] appealed to Bochner's theorem to argue that any shift-invariant kernel can be approximated as $k(x,y) \approx \langle z(x), z(y)\rangle$, where the random Fourier feature map $z : \mathbb{R}^d \to \mathbb{R}^m$ is obtained by sampling from the inverse Fourier transform of the kernel function. This allows one to invoke fast linear techniques to solve the resulting linear problem in $\mathbb{R}^m$. However, subsequent work analyzing kernel methods based on RFF for learning problems suggests that to achieve the same asymptotic rates on the excess risk (as obtained using the true kernel), one requires $m = \Omega(n)$ random features [Rahimi and Recht, 2009], which defeats the purpose of using random features from a computational perspective and fails to explain their empirical success.

Last year at NIPS, while Rahimi and Recht won the test-of-time award for their work on RFF [Rahimi and Recht, 2007], Rudi and Rosasco [2017] showed for the first time that, at least for the kernel ridge regression problem, under some mild distributional assumptions and for an appropriately chosen regularization parameter, one can achieve minimax-optimal statistical rates using only $m = O(\sqrt{n}\log n)$ random features. It is then natural to ask if the same holds for other kernel problems.

In this paper, we focus on Kernel Principal Component Analysis (KPCA) [Schölkopf et al., 1998], a popular technique for unsupervised nonlinear representation learning. We argue that scalability is an even bigger issue in the unsupervised setting, since big data is largely unlabeled. Furthermore, when extending results from supervised learning to unsupervised learning, we have to deal with additional challenges stemming from the non-convexity of the KPCA problem. We pose KPCA as a stochastic optimization problem and investigate the tradeoff between the number of statistical samples and the number of random features needed to guarantee $\epsilon$-suboptimality on the population objective (i.e., a small generalization error).

KPCA entails computing the top-$k$ principal components of the data mapped into a Reproducing Kernel Hilbert Space (RKHS) induced by a positive definite kernel [Aronszajn, 1950].
In Schölkopf et al. [1998], the authors showed that, given a sample of $n$ i.i.d. draws from the underlying distribution, the infinite-dimensional problem (over the RKHS) can be reduced to a finite-dimensional problem (in $\mathbb{R}^n$) using the kernel trick. In particular, the solution entails computing the top-$k$ eigenvectors of the kernel matrix computed on the given sample. Statistical consistency of this approach was established in Shawe-Taylor et al. [2005] and further improved in Blanchard et al. [2007].

| Algorithm | Reference | Sample complexity | Per-iteration cost | Memory |
| --- | --- | --- | --- | --- |
| ERM | Shawe-Taylor et al. [2005] | $O(1/\epsilon^2)$ | $O(k/\epsilon^4)$ | $O(1/\epsilon^4)$ |
| ERM | Blanchard et al. [2007] | $O(1/\epsilon)$ | $O(k/\epsilon^2)$ | $O(1/\epsilon^2)$ |
| RF-DSG | Xie et al. [2015] | $O(1/\epsilon^2)$ | $O(k/\epsilon^2)$ | $O(k/\epsilon^2)$ |
| RF-ERM | Lopez-Paz et al. [2014] | $O(1/\epsilon^2)$ | $O(k/\epsilon^4)$ | $O(1/\epsilon^4)$ |
| RF-ERM | Corollary 4.4 | $O(1/\epsilon^2)$ | $O(k/\epsilon^3)$ | $O(1/\epsilon^3)$ |
| RF-Oja | Corollary 4.4 | $O(1/\epsilon^2)$ | $O(k/\epsilon)$ | $O(k/\epsilon)$ |

Table 1: Comparison of different approaches to KPCA in terms of sample complexity, per-iteration computational cost, and space complexity. (*): Optimistic rates realized under (potentially different) higher-order distributional assumptions (see Corollary 4.3 and Blanchard et al. [2007]).

However, computational aspects of KPCA are less well understood. Note that the eigendecomposition of the kernel matrix alone requires $O(kn^2)$ computation, which can be prohibitive for large datasets. Several recent works have attempted to accelerate KPCA using random features. In Lopez-Paz et al. [2014], the authors show that the kernel matrix computed using random features converges to the true kernel matrix in operator norm at a rate of $O(n\sqrt{(\log n)/m})$. In Ghashami et al. [2016], the authors extended this guarantee to a streaming setting by running the Frequent Directions algorithm [Liberty, 2013] on random features. In a related line of work, Xie et al. [2015] propose a stochastic optimization algorithm based on doubly stochastic gradients, with $1/n$ convergence in the sense of the angle between subspaces. However, all these results require $m = \Omega(n)$ random features to guarantee an $O(1/\sqrt{n})$ generalization bound.

More recently, Sriperumbudur and Sterge [2017] studied the statistical consistency of ERM with random Fourier features. They showed that the top-$k$ eigenspace of the empirical covariance matrix in the random feature space converges to that of the population covariance operator in the RKHS, when lifted to the space of square integrable functions, at a rate of $O(1/\sqrt{m} + 1/\sqrt{n})$ (while our paper was under review, Sriperumbudur and Sterge [2017], which initially focused on statistical consistency of kernel PCA with random features, was replaced by Sriperumbudur and Sterge [2018], with a new title and a focus on computational and statistical tradeoffs of KPCA, much like our paper). This result suggests that statistical and computational efficiency cannot be achieved at the same time without making further assumptions. In this paper, we assume a spectral decay on the distribution of the data in the feature space and show that we can simultaneously guarantee statistical and computational efficiency for KPCA using random features.

Our main contributions are as follows.

1. We study kernel PCA as a stochastic optimization problem and show that, under mild distributional assumptions, for a wide range of kernels, the empirical risk minimizer (ERM) in the random feature space converges in objective as $O(1/\sqrt{n})$ whenever $m = \Omega(k\sqrt{n}\log n)$, with an overall runtime of $O(kn^{3/2}\log n)$.

2. We propose a stochastic approximation algorithm based on classical Oja's updates on random features, which enjoys the same statistical guarantees as the ERM above but with better runtime and space requirements.
3. We overcome a key challenge associated with kernel PCA using random features, namely ensuring that the output of the algorithm corresponds to a projection operator in the (potentially infinite-dimensional) RKHS. We establish that the output of the proposed algorithms converges to a projection operator.

4. In order to better understand the computational benefits of using random features, we also consider the KPCA problem in a streaming setting, where at each iteration the algorithm is provided with a fresh sample drawn i.i.d. from the underlying distribution and is required to output a solution based on the samples observed so far. In such a setting, comparison with other algorithmic approaches suggests that Oja's algorithm on random Fourier features (see RF-Oja in Table 1) enjoys the best overall runtime as well as superior space complexity.

5. We contribute novel analytical tools that should be broadly useful when designing algorithms for kernel methods based on random features. We provide crucial and novel insights that exploit connections between covariance operators in the RKHS and in the space of square integrable functions with respect to the data distribution. This connection allows us to view the kernel approximation using random features as an estimation problem in the space of square integrable functions, where we appeal to recent results on local Rademacher complexity [Massart, 2000, Bartlett et al., 2002, Blanchard et al., 2007] to yield faster rates.

6. Finally, we provide empirical results on a real dataset to support our theoretical results.

The rest of the paper is organized as follows. In Section 2, we give the problem setup. In Section 3, we provide mathematical preliminaries and introduce the key notation. The main algorithm and results are given in Section 4, and the empirical results are discussed in Section 5.

2 Problem setup

Given a random vector $x \in \mathbb{R}^d$ with underlying distribution $\rho$, principal component analysis (PCA) can be formulated as the following stochastic optimization problem [Arora et al., 2012, 2013]:

\[ \text{maximize } \mathbb{E}_{x\sim\rho}\,\langle P, xx^\top\rangle \quad \text{s.t. } P \in \mathcal{P}_k, \tag{1} \]

where $\mathcal{P}_k$ is the set of $d\times d$ rank-$k$ orthogonal projection matrices. Essentially, PCA seeks a $k$-dimensional subspace of $\mathbb{R}^d$ that captures maximal variation with respect to the underlying distribution. It is well understood that the solution to the problem above is given by the projection matrix corresponding to the subspace spanned by the top-$k$ eigenvectors of the covariance matrix $\mathbb{E}[xx^\top]$.

In most real-world applications, however, the data does not have a linear structure. In other words, the underlying distribution may not be well represented by any low-rank subspace of the ambient space. In such settings, the representations learned using PCA may not be very informative. This motivates the need for nonlinear dimensionality reduction methods. For example, in kernel PCA [Schölkopf et al., 1998], a canonical approach to manifold learning, a nonlinear feature map lifts the data into a higher (potentially infinite) dimensional Reproducing Kernel Hilbert Space (RKHS), where a low-rank subspace corresponds to a (nonlinear) low-dimensional manifold in the ambient space.
Hence, solving the PCA problem in an RKHS can better capture the complicated nonlinear structure in the data. Formally, given a kernel function $k(\cdot,\cdot): \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$, KPCA can be formulated as the following stochastic optimization problem:

\[ \text{maximize } \mathbb{E}_{x\sim\rho}\,\langle P,\; k(x,\cdot)\otimes_{\mathcal{H}} k(x,\cdot)\rangle \quad \text{s.t. } P \in \mathcal{P}^k_{HS(\mathcal{H})}, \tag{2} \]

where $\mathcal{P}^k_{HS(\mathcal{H})}$ is the set of all orthogonal projection operators onto a $k$-dimensional subspace of the RKHS. The solution to the above problem is given by $P^k_C$, the projection operator corresponding to the top-$k$ eigenfunctions of the covariance operator $C := \mathbb{E}_{x\sim\rho}[k(x,\cdot)\otimes_{\mathcal{H}} k(x,\cdot)]$. The primary goal of any KPCA algorithm is then to guarantee generalization, i.e., to provide a solution $\widehat{P} \in \mathcal{P}^k_{HS(\mathcal{H})}$ with small excess risk:

\[ \mathcal{E}(\widehat{P}) := \mathbb{E}_{x\sim\rho}\,\langle P^k_C,\; k(x,\cdot)\otimes_{\mathcal{H}} k(x,\cdot)\rangle \;-\; \mathbb{E}_{x\sim\rho}\,\langle \widehat{P},\; k(x,\cdot)\otimes_{\mathcal{H}} k(x,\cdot)\rangle. \tag{3} \]

Given access to i.i.d. samples $\{x_i\}_{i=1}^n \sim \rho$, one approach to solving Problem (2) is Empirical Risk Minimization (ERM), which amounts to finding the top-$k$ eigenfunctions of the empirical covariance operator $\widehat{C} := \frac{1}{n}\sum_{i=1}^n k(x_i,\cdot)\otimes k(x_i,\cdot)$. Using the kernel trick, Schölkopf et al. [1998] showed that this problem is equivalent to finding the top-$k$ eigenvectors of the kernel matrix associated with the samples. Alternatively, when approximating the kernel map with random features, Problem (2) reduces to the PCA problem (given in Equation (1)) in the random feature space. Here, we discuss two natural approaches to solving this problem: first, the ERM in the random feature space (called RF-ERM), which is given by the top-$k$ eigenvectors of the empirical covariance matrix of the data in the feature space; and second, the classical Oja's algorithm (called RF-Oja) [Oja, 1982]. Note that while the output of ERM is guaranteed to induce a projection operator in the RKHS of $k(\cdot,\cdot)$, this may not be the case when using RFF (equivalently, when working in the RKHS associated with the approximate kernel map). Therefore, a key technical challenge when designing KPCA algorithms based on RFF is to ensure that the output is close to the set of projection operators in the true RKHS induced by $k(\cdot,\cdot)$, i.e., that $d(\widehat{P}, \mathcal{P}^k_{HS(\mathcal{H})})$ is small.

3 Mathematical Preliminaries and Notation

In this section, we review the basic concepts we need from functional analysis [Reed and Simon, 1972]. We begin with the simple observation that, given an underlying distribution on the data and a fixed kernel map, a distribution is induced on the feature map. We work with this distribution implicitly by considering measurable Hilbert spaces. We denote matrices and Hilbert-Schmidt operators by capital Roman letters ($D$), vectors by lower-case Roman letters ($v$), and scalars by lower-case letters ($a$). We denote operators over the space of Hilbert-Schmidt operators by capital Fraktur letters ($\mathfrak{A}$).

Hilbert space notation and operator norm. Let $\mathcal{H}$ and $\widetilde{\mathcal{H}}$ be two separable Hilbert spaces over fields $F$ and $\widetilde{F}$ with measures $\mu$ and $\widetilde{\mu}$, respectively. Let $\{e_i\}_{i\ge1}$ and $\{\widetilde{e}_i\}_{i\ge1}$ denote fixed orthonormal bases for $\mathcal{H}$ and $\widetilde{\mathcal{H}}$, respectively. The inner product between two elements $h_1, h_2 \in \mathcal{H}$ is denoted by $\langle h_1, h_2\rangle_{\mathcal{H}}$, or $\langle h_1, h_2\rangle_{\mu}$. Similarly, we denote the norm of an element $h\in\mathcal{H}$ by $\|h\|_{\mathcal{H}}$, or $\|h\|_\mu$. For $h_1, h_2 \in \mathcal{H}$, the outer product, denoted $h_1\otimes_{\mathcal{H}} h_2$ or $h_1\otimes_\mu h_2$, is the linear operator on $\mathcal{H}$ that maps any $h_3\in\mathcal{H}$ to $(h_1\otimes_{\mathcal{H}} h_2)h_3 = \langle h_2, h_3\rangle_{\mathcal{H}}\, h_1$. For a linear operator $D:\mathcal{H}\to\widetilde{\mathcal{H}}$, the operator norm of $D$ is defined as $\|D\|_2 := \sup\{\|Dh\|_{\widetilde{\mathcal{H}}} : h\in\mathcal{H},\ \|h\|_{\mathcal{H}}\le 1\}$.

Adjoint, Hilbert-Schmidt, and trace-class operators. The adjoint of a linear operator $D:\mathcal{H}\to\widetilde{\mathcal{H}}$ is the linear operator $D^*:\widetilde{\mathcal{H}}\to\mathcal{H}$ such that $\langle Dh, \widetilde{h}\rangle_{\widetilde{\mathcal{H}}} = \langle h, D^*\widetilde{h}\rangle_{\mathcal{H}}$ for all $h\in\mathcal{H}$ and $\widetilde{h}\in\widetilde{\mathcal{H}}$.
A linear operator $D:\mathcal{H}\to\mathcal{H}$ is self-adjoint if $D^* = D$. A linear operator $D:\mathcal{H}\to\widetilde{\mathcal{H}}$ is compact if the image of any bounded subset of $\mathcal{H}$ is a relatively compact subset of $\widetilde{\mathcal{H}}$. A linear operator $D:\mathcal{H}\to\widetilde{\mathcal{H}}$ is a Hilbert-Schmidt operator if $\sum_{i\ge1}\|De_i\|^2_{\widetilde{\mathcal{H}}} = \sum_{i,j\ge1}\langle De_i, \widetilde{e}_j\rangle^2_{\widetilde{\mathcal{H}}} < \infty$. The Hilbert-Schmidt norm of $D$, denoted $\|D\|_{HS(\mathcal{H})}$ or $\|D\|_{HS(\mu)}$, is defined as $(\sum_{i\ge1}\|De_i\|^2_{\widetilde{\mathcal{H}}})^{1/2}$. The space of all Hilbert-Schmidt operators on $\mathcal{H}$ is denoted $HS(\mathcal{H})$. A compact operator $D:\mathcal{H}\to\mathcal{H}$ is trace-class if $\|D\|_{L^1(\mathcal{H})} := \sum_{i\ge1}\langle (DD^*)^{1/2}e_i, e_i\rangle_{\mathcal{H}} < \infty$, where $\|D\|_{L^1(\mathcal{H})}$ denotes the nuclear norm of $D$.

For a vector space $\mathcal{X}$, $L^2(\mathcal{X},\rho)$ denotes the space of square integrable functions with respect to the measure $\rho$, i.e., $L^2(\mathcal{X},\rho) = \{f:\mathcal{X}\to\mathbb{R} : \int_{\mathcal{X}} f(x)^2\, d\rho(x) < \infty\}$. $L^2(\mathcal{X},\rho)$ is a Hilbert space with inner product $\langle f,g\rangle_\rho := \int_{\mathcal{X}} f(x)g(x)\,d\rho(x)$ for $f,g\in L^2(\mathcal{X},\rho)$. The induced norm on $L^2(\mathcal{X},\rho)$ is denoted $\|f\|_\rho := \langle f,f\rangle_\rho^{1/2}$ for $f\in L^2(\mathcal{X},\rho)$.

Projection operators and spectral decomposition. Given a vector space $\mathcal{X}$, let $\mathcal{P}^k_{\mathcal{X}}$ denote the set of rank-$k$ projection operators on $\mathcal{X}$. For a Hilbert-Schmidt operator $D$ over a separable Hilbert space $\mathcal{H}$, let $\lambda_i(D)$ denote its $i$th largest eigenvalue. The projection operator associated with the first $k$ eigenfunctions of $D$ is denoted $P^k_D$; given the spectral decomposition $D = \sum_{i=1}^\infty \mu_i\,\psi_i\otimes\psi_i$, we have $P^k_D = \sum_{i=1}^k \psi_i\otimes\psi_i$. For a finite-dimensional vector $v$, $\|v\|_p$ denotes its $\ell_p$-norm. For operators $D$ over finite-dimensional spaces, $\|D\|_2$ and $\|D\|_F$ denote the spectral and Frobenius norms of $D$, respectively. For a metric space $(\mathcal{Y}, d)$ and a closed subset $S\subseteq\mathcal{Y}$, we denote the distance from $q\in\mathcal{Y}$ to $S$ by $d(q, S) = \min_{s\in S} d(q,s)$. In a Hilbert space, $d$ is the metric induced by the respective norm. $[n]$ denotes the set of natural numbers from 1 to $n$.

Mercer kernels and random feature maps. Let $\mathcal{X}\subseteq\mathbb{R}^d$ be a compact (data) domain and let $\rho$ be a distribution on $\mathcal{X}$. We are given $n$ independent and identically distributed samples from $\rho$, $\{x_i\}_{i=1}^n\sim\rho^n$. Let $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a Mercer kernel with the integral representation $k(x,y) = \int_\Omega z(x,\omega)z(y,\omega)\,d\pi(\omega)$. Here, $(\Omega,\pi)$ is the probability space induced by the Mercer kernel. Let $z_\omega(\cdot) := z(\cdot,\omega)$. We know that $z_\omega(\cdot)\in L^2(\mathcal{X},\rho)$ almost surely with respect to $\pi$. We draw i.i.d. samples $\omega_i\sim\pi$, for $i=1,\dots,m$, to approximate the kernel function. Let $z(\cdot)$ denote the random feature map, i.e., $z:\mathbb{R}^d\to\mathbb{R}^m$ with $z(x) = \frac{1}{\sqrt{m}}(z_{\omega_1}(x), z_{\omega_2}(x),\dots, z_{\omega_m}(x))$. Let $\mathcal{F}\subseteq\mathbb{R}^m$ be the linear subspace spanned by the range of $z$, with the inner product inherited from $\mathbb{R}^m$. The approximate kernel map is denoted $k_m(\cdot,\cdot)$, where $k_m(x,y) = \langle z(x), z(y)\rangle_{\mathcal{F}}$. Let $\mathcal{H}$ denote the separable RKHS associated with the kernel function $k(\cdot,\cdot)$.

Assumption 3.1. The kernel function $k$ is a Mercer kernel (see Theorem A.5) with the integral representation $k(x,y) = \int_\Omega z(x,\omega)z(y,\omega)\,d\pi(\omega)$ for all $x,y\in\mathcal{X}$, where $\mathcal{H}$ is a separable RKHS of real-valued functions on $\mathcal{X}$ with a bounded positive definite kernel $k$. We also assume that there exists $\tau > 1$ such that $|z(x,\omega)|\le\tau$ for all $x\in\mathcal{X}$ and $\omega\in\Omega$. Note that the functions $z(x,\cdot)$ are continuous because $k(\cdot,\cdot)$ is continuous, and that when $\mathcal{X}$ is separable and $k(\cdot,\cdot)$ is continuous, $\mathcal{H}$ is separable.

Definition 3.2. $C:\mathcal{H}\to\mathcal{H}$ is the covariance operator of the random variables $k(x,\cdot)$ with measure $\rho$, defined as $Cf := \int_{\mathcal{X}} k(x,\cdot)f(x)\,d\rho(x)$. $C$ is compact and self-adjoint, which implies that $C$ has a spectral decomposition $C = \sum_{i=1}^\infty \widetilde{\lambda}_i\, \widetilde{\phi}_i\otimes_{\mathcal{H}}\widetilde{\phi}_i$, where the $\widetilde{\lambda}_i$ and $\widetilde{\phi}_i$ are the eigenvalues and eigenfunctions of $C$, respectively. The set of eigenfunctions $\{\widetilde{\phi}_i\}_{i=1}^\infty$ forms a unitary basis for $\mathcal{H}$.
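As a concrete instance of Assumption 3.1, consider the Gaussian (RBF) kernel used in Section 5. By Bochner's theorem, drawing $\omega\sim\mathcal{N}(0,\sigma^{-2}I_d)$ and a phase $b\sim\mathrm{Unif}[0,2\pi]$ and setting $z_\omega(x)=\sqrt{2}\cos(\omega^\top x+b)$ gives $\mathbb{E}[z_\omega(x)z_\omega(y)] = \exp(-\|x-y\|^2/2\sigma^2)$, with $|z_\omega(x)|\le\sqrt{2}\le\tau$. The following minimal Python sketch implements the resulting feature map $z:\mathbb{R}^d\to\mathbb{R}^m$; the function and variable names are ours, for illustration only.

```python
import numpy as np

def make_rbf_feature_map(d, m, sigma2, seed=0):
    """Random Fourier feature map for k(x, y) = exp(-||x - y||^2 / (2 * sigma2)).

    Returns z : (n, d) array -> (n, m) array with <z(x), z(y)> ~= k(x, y) for large m.
    """
    rng = np.random.default_rng(seed)
    # Bochner: frequencies sampled from the Gaussian spectral measure of the kernel.
    W = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(m, d))
    # Random phases let a single cosine per frequency suffice.
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)

    def z(X):
        X = np.atleast_2d(X)
        # sqrt(2/m) * cos(W x + b): the 1/sqrt(m) matches the map z of Section 3.
        return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

    return z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 10))
    z = make_rbf_feature_map(d=10, m=2000, sigma2=50.0)
    Z = z(X)
    K_approx = Z @ Z.T
    K_true = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / (2 * 50.0))
    print(np.abs(K_approx - K_true).max())  # small for large m
```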
Since we are approximating the kernel $k$ by sampling $m$ i.i.d. copies of $z_\omega$, this induces an approximation of the covariance operator $C$ (in the space $HS(\rho)$) by a sample average of the random linear operators $z_\omega\otimes_\rho z_\omega$. The tools we use to establish concentration require a sufficient spectral decay of the variance of this random operator, which we define next.

Definition 3.3. Let $\mathfrak{C}_1$ denote the random linear operator on $L^2(\mathcal{X},\rho)$ given by $\mathfrak{C}_1 = z_\omega\otimes_\rho z_\omega$. Let $\mathfrak{C}_2 = \mathfrak{C}_1\otimes_{HS(\rho)}\mathfrak{C}_1$ and define the covariance operator of $\mathfrak{C}_1$ to be $\mathfrak{C} = \mathbb{E}_\pi[\mathfrak{C}_2] - \mathbb{E}_\pi[\mathfrak{C}_1]\otimes_{HS(\rho)}\mathbb{E}_\pi[\mathfrak{C}_1]$.

We note that $\mathfrak{C}$ can also be interpreted as the fourth moment of the random variable $z_\omega$ in $L^2(\mathcal{X},\rho)$. The spectrum of $\mathfrak{C}$ plays a crucial role in our results through the following key quantity:

\[ \kappa(B_k, k, m) = \inf_{h\ge 0}\,\{\cdots\}, \qquad \text{where } B_k := \frac{\mathbb{E}_\pi\|z_\omega\|_\rho^4}{\lambda_k - \lambda_{k+1}}. \tag{4} \]

Essentially, we will see that the constant $\kappa(B_k,k,m)$ is the dominating factor when bounding the excess risk and therefore determines the rate of convergence of our algorithms. From a practical perspective, working in $HS(\rho)$ is not computationally feasible. However, our approximation to $C$ has a representation in the finite-dimensional space $\mathcal{F}$, as defined next.

Definition 3.4. $C_m:\mathcal{F}\to\mathcal{F}$ is the covariance operator in $HS(\mathcal{F})$, defined as $C_m := \mathbb{E}_\rho[z(x)\otimes_{\mathcal{F}} z(x)]$. Equivalently, for any $v\in\mathcal{F}$, $C_m v = \int_{\mathcal{X}}\langle z(x), v\rangle\, z(x)\,d\rho(x)$. $C_m$ is compact and self-adjoint, which implies that $C_m$ has a spectral decomposition $C_m = \sum_{i=1}^m \lambda_i\,\phi_i\otimes_{\mathcal{F}}\phi_i$.

As mentioned at the beginning of the section, our convergence tools work most conveniently when we can incorporate the randomness with respect to $\rho$ into the geometry of the space we study; hence the need to study $L^2(\mathcal{X},\rho)$. Since we are essentially dealing with random operators on $\mathcal{F}$, $\mathcal{H}$ and $L^2(\mathcal{X},\rho)$, it is most appropriate to also work in the respective spaces of Hilbert-Schmidt operators. Thus, we introduce the inclusion and approximation operators, which allow us to transition with ease between the aforementioned spaces.

Definition 3.5 (Inclusion operators $I$ and $\mathfrak{I}$). The inclusion operator is defined as $I:\mathcal{H}\to L^2(\mathcal{X},\rho)$, $(If) = f$, for $f\in\mathcal{H}$. Also, for an operator $D\in HS(\mathcal{H})$ with spectral decomposition $D = \sum_{i=1}^\infty \mu_i\,\psi_i\otimes\psi_i$, define

\[ \mathfrak{I}: HS(\mathcal{H})\to HS(\rho), \qquad \mathfrak{I} D := \sum_{i=1}^\infty \mu_i\, \frac{I\psi_i}{\sqrt{\langle C\psi_i,\psi_i\rangle_{\mathcal{H}}}} \otimes_\rho \frac{I\psi_i}{\sqrt{\langle C\psi_i,\psi_i\rangle_{\mathcal{H}}}}. \]

In Lemma A.8 and Lemma A.9 in the appendix, we show that the adjoint of the inclusion operator $I$ is $I^*:L^2(\mathcal{X},\rho)\to\mathcal{H}$, given by $(I^*g)(\cdot) = \int k(x,\cdot)g(x)\,d\rho(x)$, and that $C = I^*I$ and $L = II^*$.

Definition 3.6 (Approximation operators $A$ and $\mathfrak{A}$). The approximation operator is defined as $A:\mathcal{F}\to L^2(\mathcal{X},\rho)$, $(Av)(\cdot) = \langle z(\cdot), v\rangle$, for $v\in\mathcal{F}$. For a rank-$k$ operator $D\in HS(\mathcal{F})$ with spectral decomposition $D = \sum_{i=1}^\infty \mu_i\,\psi_i\otimes\psi_i$, let $\Psi$ be the matrix with the eigenvectors $\psi_i$ as columns and let $\Phi$ be the matrix with the eigenvectors of $C_m$ as columns (see Definition 3.4). Define

\[ R^* = \arg\min_{R:\, R^\top R = R R^\top = I}\ \|\Psi R - \Phi\|_F^2, \qquad \widetilde{\Psi} := \Psi R^*. \]

Let $\widetilde{\psi}_i$ be the $i$th column of $\widetilde{\Psi}$, and define

\[ \mathfrak{A}: HS(\mathcal{F})\to HS(\rho), \qquad \mathfrak{A} D := \sum_{i=1}^\infty \mu_i\, \frac{A\widetilde{\psi}_i}{\sqrt{\langle C_m\widetilde{\psi}_i,\widetilde{\psi}_i\rangle_{\mathcal{F}}}} \otimes_\rho \frac{A\widetilde{\psi}_i}{\sqrt{\langle C_m\widetilde{\psi}_i,\widetilde{\psi}_i\rangle_{\mathcal{F}}}}. \]

In Lemma A.11 and Lemma A.12, we show that the adjoint of the approximation operator is $A^*:L^2(\mathcal{X},\rho)\to\mathcal{F}$, $(A^*f)_i = \int_{\mathcal{X}} f(x)z_{\omega_i}(x)\,d\rho(x)$, and that $C_m = A^*A$ and $L_m = AA^*$. We note that the definition of the approximation operator $\mathfrak{A}$ requires knowledge of the covariance matrix $C_m$ to find the optimal rotation matrix $R^*$, but this is solely for the purpose of analysis and is not used in the algorithm in any form. The following definition enables us to bound the excess risk in $HS(\mathcal{H})$ (Section B in the appendix).

Definition 3.7 (Operator $\mathfrak{L}$). Let $P\in HS(\mathcal{F})$ and let $\mathfrak{A} P = \sum_{i=1}^k p_i\otimes_\rho p_i$ be $P$ lifted to $HS(\rho)$.
Consider the equivalence relation $p_i\sim p_j$ if $L^{1/2}p_i = L^{1/2}p_j$. Let $[p_i]$ be the equivalence class such that $L^{1/2}[p_i] = p_i$. The operator $\mathfrak{L}: HS(\mathcal{F})\to HS(\mathcal{H})$ is defined as $\mathfrak{L}\widehat{P} = \sum_{i=1}^k \bar{I}\,[p_i]\otimes_{\mathcal{H}} \bar{I}\,[p_i]$. Here $\bar{I}$ is the restriction of the operator $I$ to the quotient space $L^2(\mathcal{X},\rho)/\!\sim$. The quotient space in the definition above is with respect to the kernel of $L$, i.e., $L^2(\mathcal{X},\rho)/\!\sim\ \equiv L^2(\mathcal{X},\rho)/\ker(L)$. This quotient is benign since the optimal solution to our optimization problem lives in the range of $L$, and intuitively we can disregard any components in the kernel of $L$.

Figure 1: Maps between the data domain ($\mathcal{X}$), the space of square integrable functions on $\mathcal{X}$ ($L^2(\mathcal{X},\rho)$), the RKHS of the kernel $k(\cdot,\cdot)$, and the RKHS of the approximate feature map, as well as maps between the Hilbert-Schmidt operators on these spaces.

Finally, to conclude the section, we give a visual schematic in Figure 1 to help the reader connect the different spaces. To summarize, the key spaces of interest are the data domain $\mathcal{X}$, the RKHS $\mathcal{H}$ of the kernel map $k(\cdot,\cdot)$, and the feature space $\mathcal{F}$ obtained via the random feature approximation. The space $L^2(\mathcal{X},\rho)$ consists of functions over the data domain $\mathcal{X}$ that are square integrable with respect to the data distribution $\rho$; it allows us to embed objects from different spaces into a common space so as to compare them. Specifically, we map functions from $\mathcal{H}$ to $L^2(\mathcal{X},\rho)$ via the inclusion operator $I$, and vectors from $\mathcal{F}$ to $L^2(\mathcal{X},\rho)$ via the approximation operator $A$; $I^*$ and $A^*$ denote the adjoints of $I$ and $A$, respectively. The spaces of Hilbert-Schmidt operators on $\mathcal{H}$, $\mathcal{F}$ and $L^2(\mathcal{X},\rho)$ are denoted by $HS(\mathcal{H})$, $HS(\mathcal{F})$ and $HS(\rho)$, respectively. Analogous to $I$ and $A$, $\mathfrak{I}$ maps operators from $HS(\mathcal{H})$ to $HS(\rho)$, and $\mathfrak{A}$ maps operators from $HS(\mathcal{F})$ to $HS(\rho)$; these are essentially constructed by mapping the eigenfunctions of operators via $I$ and $A$, respectively. The above mappings thus allow us to embed operators in the common space $HS(\rho)$ and to bound the estimation and approximation errors. However, the problem of kernel PCA is formulated in $HS(\mathcal{H})$, and bounds in $HS(\rho)$ are therefore not sufficient. To this end, we establish an equivalence between kernel PCA in $HS(\mathcal{H})$ and in $HS(\rho)$. We use the map $\mathfrak{A}$ and the established equivalence to obtain $\mathfrak{L}$, which maps operators from $HS(\mathcal{F})$ to $HS(\mathcal{H})$. We encourage the reader to go through Sections A and B in the appendix for a gentler and more rigorous presentation.

4 Main Results

Recall that our primary goal is to study the generalization behaviour of algorithms solving KPCA using random features. Rather than stick to a particular algorithm, we define a class of algorithms that are suitable for the problem. We characterize this class as follows.

Definition 4.1 (Efficient Subspace Learner (ESL)). Let $\mathcal{A}$ be an algorithm which takes as input $n$ points from $\mathcal{F}$ and outputs a rank-$k$ projection matrix over $\mathcal{F}$. Let $\widehat{P}_{\mathcal{A}}$ denote the output of the algorithm $\mathcal{A}$ and let $\widehat{P}_{\mathcal{A}} = \Phi\Phi^\top$ be an eigendecomposition of $\widehat{P}_{\mathcal{A}}$. Let $\Phi_k^\perp$ be an orthogonal matrix corresponding to the orthogonal complement of the top-$k$ eigenvectors of $C_m$.
We say that algorithm $\mathcal{A}$ is an Efficient Subspace Learner if the following holds with probability at least $1-\delta$:

\[ \big\|(\Phi_k^\perp)^\top \Phi\big\|_F^2 \;\le\; \frac{q^{\rho,\pi}_{\mathcal{A}}\big(1/\delta, \log m, \log n\big)}{n}, \]

where $q^{\rho,\pi}_{\mathcal{A}}$ is a function, determined by the triple $(\mathcal{A},\rho,\pi)$, with polynomial dependence on $1/\delta$, $\log m$ and $\log n$. For notational convenience, we drop the superscripts from $q^{\rho,\pi}_{\mathcal{A}}$ and write it as $q_{\mathcal{A}}$ henceforth.

Intuitively, an ESL is an algorithm which returns a projection onto a $k$-dimensional subspace such that the angle between this subspace and the space spanned by the top-$k$ eigenvectors of $C_m$ decays at a sub-linear rate with the number of samples. Our guarantees hold for any algorithm in this class. Algorithm 1 gives a high-level view of the algorithmic routine. To discuss the associated computational aspects, we instantiate it with two specific algorithms, ERM and Oja's algorithm, and show how the result looks in terms of their algorithmic parameters. Similar results can be obtained for other ESL algorithms such as $\ell_2$-RMSG [Mianjy and Arora, 2018].

Algorithm 1: KPCA with Random Features (Meta-Algorithm)
Input: training data $X = \{x_i\}_{i=1}^n$. Output: $\widehat{P}_{\mathcal{A}}$.
1: Obtain training data $X = \{x_t\}_{t=1}^n$ in a batch or stream.
2: Sample $\omega_i \sim \pi$ i.i.d., for $i = 1$ to $m$.
3: $Z \leftarrow \mathrm{RandomFeatures}(X, \{\omega_i\}_{i=1}^m)$.
4: $\widehat{P}_{\mathcal{A}} \leftarrow \mathcal{A}(Z)$   // $\mathcal{A}$ is an Efficient Subspace Learner (Definition 4.1).

We now give the main theorem of the paper, which characterizes the excess risk of an ESL.

Theorem 4.2 (Main Theorem). Let $\mathcal{A}$ be an Efficient Subspace Learner. Let $\widehat{P}_{\mathcal{A}}$ be the output of $\mathcal{A}$ run with $m$ random features on $n \ge \frac{2\lambda_1^2\, q_{\mathcal{A}}(2/\delta,\log m,\log n)^2}{(\lambda_k-\lambda_{k+1})^2}$ points, where $\lambda_i$ is the $i$th eigenvalue of $C_m$. Then, with probability at least $1-\delta$, it holds that

(a) $\mathcal{E}(\mathfrak{L}\widehat{P}_{\mathcal{A}}) \;\le\; 24\,\kappa(B_k,k,m) + \dfrac{\log(2/\delta)+7B_k}{m} + \dfrac{q_{\mathcal{A}}(2/\delta,\log m,\log n)}{n}$,

(b) $d\big(\mathfrak{L}\widehat{P}_{\mathcal{A}},\, \mathcal{P}^k_{HS(\mathcal{H})}\big) \;\le\; \sqrt{\dfrac{q_{\mathcal{A}}(2/\delta,\log m,\log n)}{n}}$.

A few remarks are in order. First, as we forewarned the reader in Section 3, the error bound is dominated by the additive term $\kappa(B_k, k, m)$. This, in a sense, determines the hardness of the problem. As we will see, under appropriate assumptions on the data distribution in the feature space, this term can be bounded by a quantity that is $O(1/m)$. Second, the output of our algorithm, $\mathfrak{L}\widehat{P}$, need not be a projection operator in the RKHS. This is precisely why we need to bound the distance between $\mathfrak{L}\widehat{P}$ and the set of all projection operators in $HS(\mathcal{H})$, which we see is of the order $O(1/\sqrt{n})$. Third, note that the dependence on the number of random features is at worst poly-logarithmic. From part (b) of Theorem 4.2, it is easy to see that if we project $\mathfrak{L}\widehat{P}_{\mathcal{A}}$ onto the set of rank-$k$ projection operators in $HS(\mathcal{H})$, we get the same rate of convergence; this is presented as Corollary C.12 in the appendix.

Next, we characterize easy instances of the KPCA problem under which we are guaranteed a fast rate. Specifically, we show that if the spectrum of the fourth-order moment $\mathfrak{C}$ of $z_\omega$ decays exponentially, then the dominating factor $\kappa(B_k, k, m)$ is $O(1/m)$. Optimizing the number of random features with respect to the sample complexity term then gives the following result.

Corollary 4.3 (Main - Good decay). Along with the assumptions and notation of Theorem 4.2, if the spectrum of the operator $\mathfrak{C}$ has exponential decay, i.e., $\lambda_j(\mathfrak{C}) = \alpha^j$ for some $\alpha < 1$, then with $m = O(\sqrt{n}\log n)$ random features we have

\[ \mathcal{E}(\mathfrak{L}\widehat{P}_{\mathcal{A}}) \;\le\; \frac{c\,B_k}{n} + \frac{c'\,\big(k + \log(2/\delta) + 7B_k\big)}{\sqrt{n}\,\log n} + \frac{q_{\mathcal{A}}(2/\delta,\log m,\log n)}{n}, \]

where $c$ and $c'$ are universal constants.

Finally, we instantiate the above corollary with two algorithms, namely ERM and Oja's algorithm.

Corollary 4.4 (ERM and Oja). With the same assumptions and notation as in Corollary 4.3:

(a) RF-ERM is an ESL with $q_{\mathrm{ERM}}(1/\delta,\log m,\log n) = \dfrac{k\lambda_1\tau^2}{\mathrm{gap}^2}$;

(b) RF-Oja is an ESL with $q_{\mathrm{Oja}}(1/\delta,\log m,\log n) = \Theta\!\left(\dfrac{\Lambda}{\mathrm{gap}^2}\right)$, where $\Lambda = \sum_{i=1}^k\lambda_i$;

where $\mathrm{gap} := \lambda_k(C_m) - \lambda_{k+1}(C_m)$.
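To make the computational picture concrete, the following sketch instantiates Algorithm 1 with the two ESLs of Corollary 4.4: RF-ERM (a single eigendecomposition of the empirical covariance in the feature space) and RF-Oja (streaming Oja updates using $O(mk)$ memory). The $1/t$ step size and the QR-based re-orthonormalization in the Oja step are common implementation choices, not prescribed by our analysis; the demo at the bottom uses the RBF feature map of Section 3 on synthetic data, with names chosen for illustration only.

```python
import numpy as np

def rf_erm(Z, k):
    """RF-ERM: top-k eigenvectors of the empirical covariance in feature space."""
    n, m = Z.shape
    C_hat = Z.T @ Z / n                           # empirical covariance, m x m
    _, eigvecs = np.linalg.eigh(C_hat)            # eigenvalues in ascending order
    Phi = eigvecs[:, -k:]                         # top-k eigenvectors
    return Phi @ Phi.T                            # rank-k projection over F

def rf_oja(feature_stream, m, k, eta0=1.0, seed=0):
    """RF-Oja: streaming (block) Oja updates on random features; O(mk) memory."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(m, k)))  # random orthonormal initialization
    for t, z in enumerate(feature_stream, start=1):
        eta = eta0 / t                            # 1/t step size (one common schedule)
        Q = Q + eta * np.outer(z, z @ Q)          # Oja update: Q <- Q + eta * z z^T Q
        Q, _ = np.linalg.qr(Q)                    # re-orthonormalize the k columns
    return Q @ Q.T                                # rank-k projection over F

if __name__ == "__main__":
    # Synthetic demo with an RBF feature map (sigma^2 = 50, as in Section 5).
    rng = np.random.default_rng(2)
    d, m, k, sigma2 = 10, 200, 5, 50.0
    X = rng.normal(size=(2000, d))
    W = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(m, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    Z = np.sqrt(2.0 / m) * np.cos(X @ W.T + b)    # rows are z(x_t)
    P_erm, P_oja = rf_erm(Z, k), rf_oja(iter(Z), m, k)
    C_hat = Z.T @ Z / len(Z)
    # Empirical variance captured, <P, C_hat>, for the two ESLs.
    print(np.trace(P_erm @ C_hat), np.trace(P_oja @ C_hat))
```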
Since the goal of $\mathfrak{A}$ is to preserve distances between operators, we choose the unique eigendecomposition for which the distance $\sum_{i=1}^k\|U_i - \phi_i\|_2^2$ is minimized. Notice that the lifting operator $\mathfrak{A}$ depends on the eigendecomposition of $C_m$, which cannot be obtained in practice. This is not a problem, because $\mathfrak{A}$ is only used for the purpose of proving the main result and is not part of the proposed algorithms. We now state Lemma 4.6, which bounds the estimation error.

Lemma 4.6 (Estimation Error). With the same assumptions as in Theorem 4.2, the following holds with probability at least $1-\delta$:

\[ \epsilon_e \;\le\; \frac{2\,q_{\mathcal{A}}(1/\delta, \log m, \log n)}{n}. \]

5 Experiments

The goal of this section is to provide empirical evidence supporting the theoretical findings of Section 4. As motivated in Section 2, the success of an algorithm is measured in terms of its generalization ability, i.e., the variance captured by the output of the algorithm on unseen data (details on how we evaluate the objective for RF-ERM and RF-Oja are deferred to Section E due to space limitations).

Figure 2 (columns: k = 5, 10, 15; legend: ERM, Nyström, RF-Oja, RF-ERM, truth): Comparison of ERM, Nyström, RF-Oja, and RF-ERM for KPCA on the MNIST dataset, in terms of the objective value as a function of the number of iterations (top row) and of CPU runtime (bottom row).

We perform experiments on the MNIST dataset, which consists of 70K samples, partitioned into a training, a tuning, and a test set of sizes 20K, 10K, and 40K, respectively. We use a fixed kernel in all our experiments, since we are not concerned with model selection here. In particular, we choose the RBF kernel $k(x,x') = \exp(-\|x-x'\|^2/2\sigma^2)$ with bandwidth parameter $\sigma^2 = 50$. The bandwidth is chosen such that ERM converges in objective after observing a few thousand training samples. The objective of the ERM (whose kernel matrix is computed in an online fashion for computational efficiency) is used as the baseline. Furthermore, to evaluate the computational speedup gained by using random features, we compare against the Nyström method [Drineas and Mahoney, 2005] as a secondary baseline; in particular, upon receiving a new sample, we do a full Nyström approximation and ERM on the set of samples observed so far. Finally, empirical risk minimization (RF-ERM) and Oja's algorithm (RF-Oja) are run on random features to verify the theoretical results presented in Corollary 4.4.

Figure 2 shows the population objective as a function of iteration (top row) as well as of total runtime (bottom row; runtime is recorded in a controlled environment, with each run executed on an identical, unloaded compute node). Each curve represents an average over 100 runs of the corresponding algorithm on training samples drawn independently and uniformly at random from the whole dataset. The number of random features and the size of the Nyström approximation are set to 750 and 100, respectively. We note the following:

- As predicted by Corollary 4.4, for both RF-ERM and RF-Oja, $\sqrt{n}\log n \approx 750$ random features suffice to achieve the same suboptimality as ERM.

- The performance of ERM is similar to that of RF-Oja and RF-ERM in terms of overall runtime. However, due to its larger space complexity of $O(n^2)$, ERM becomes infeasible for large-scale problems; this makes a case for streaming/stochastic approximation algorithms.

- Finally, we note that the iterates of RF-ERM and RF-Oja reduce the objective as they approach the maximizer of the population objective from above. Although this might seem counter-intuitive, the outputs of RF-ERM and RF-Oja are not necessarily projection operators, and hence they can achieve a higher objective than the maximum. However, as guaranteed by Corollary 4.4, the output of both algorithms converges to a projection operator as more training samples are introduced.
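For reference, the Nyström baseline can be implemented along the following lines [Williams and Seeger, 2001, Drineas and Mahoney, 2005]: sample $s$ landmark points, build the approximate feature map $x\mapsto W^{-1/2}[k(x,\ell_1),\dots,k(x,\ell_s)]^\top$, and run PCA on the resulting features. The uniform landmark sampling and the pseudo-inverse square root in this sketch are standard choices and may differ in detail from the implementation used for Figure 2.

```python
import numpy as np

def rbf_kernel(X, Y, sigma2=50.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma2))

def nystrom_kpca(X, k, s=100, sigma2=50.0, seed=0):
    """Nystrom baseline: approximate kernel features from s landmarks,
    then top-k PCA on those features (cf. Williams and Seeger, 2001)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=s, replace=False)   # uniform landmark sampling
    L = X[idx]
    W = rbf_kernel(L, L, sigma2)                      # s x s landmark kernel matrix
    # Pseudo-inverse square root of W (drop near-zero eigenvalues for stability).
    evals, evecs = np.linalg.eigh(W)
    keep = evals > 1e-10
    W_inv_sqrt = evecs[:, keep] @ np.diag(evals[keep] ** -0.5) @ evecs[:, keep].T
    Phi = rbf_kernel(X, L, sigma2) @ W_inv_sqrt       # n x s Nystrom features
    C_hat = Phi.T @ Phi / len(X)                      # empirical covariance of features
    _, eigvecs = np.linalg.eigh(C_hat)
    V = eigvecs[:, -k:]                               # top-k directions
    return V @ V.T                                    # rank-k projection in feature space

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 20))
    P = nystrom_kpca(X, k=5, s=100)
    print(P.shape)  # (100, 100)
```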
Acknowledgements

This research was supported in part by NSF BIGDATA grant IIS-1546482.

References

Zeyuan Allen-Zhu and Yuanzhi Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. arXiv preprint arXiv:1607.07837, 2016a.

Zeyuan Allen-Zhu and Yuanzhi Li. LazySVD: Even faster SVD decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974-982, 2016b.

Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.

Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. Stochastic optimization for PCA and PLS. In Allerton Conference, pages 861-868, 2012.

Raman Arora, Andy Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. In Advances in Neural Information Processing Systems (NIPS), 2013.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In International Conference on Computational Learning Theory, pages 44-58. Springer, 2002.

Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3):259-294, 2007.

Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153-2175, 2005.

Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243-264, 2001.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

Mina Ghashami, Daniel J. Perry, and Jeff Phillips. Streaming kernel principal component analysis. In Artificial Intelligence and Statistics, pages 1365-1374, 2016.

Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581-588. ACM, 2013.

David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014.

Pascal Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des Sciences de Toulouse: Mathématiques, volume 9, pages 245-303. Université Paul Sabatier, 2000.

Poorya Mianjy and Raman Arora. Stochastic PCA with l2 and l1 regularization. In International Conference on Machine Learning, pages 3528-3536, 2018.

Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, 1982.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2007.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313-1320, 2009.

Michael Reed and Barry Simon. Methods of Modern Mathematical Physics I: Functional Analysis. Academic Press, New York, 1972.
Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218-3228, 2017.

Walter Rudin. Fourier Analysis on Groups. Courier Dover Publications, 2017.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.

Dino Sejdinovic and Arthur Gretton. What is an RKHS? 2012.

John Shawe-Taylor, Christopher K. I. Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel PCA. IEEE Transactions on Information Theory, 51(7):2510-2522, 2005.

Alex J. Smola and Bernhard Schölkopf. Learning with Kernels, volume 4. Citeseer, 1998.

Bharath Sriperumbudur and Nicholas Sterge. Statistical consistency of kernel PCA with random features. arXiv preprint arXiv:1706.06296v1, 2017.

Bharath Sriperumbudur and Nicholas Sterge. Approximate kernel PCA using random features: Computational vs. statistical trade-off. arXiv preprint arXiv:1706.06296v2, 2018.

Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1-230, 2015.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688, 2001.

Bo Xie, Yingyu Liang, and Le Song. Scale up nonlinear component analysis with doubly stochastic gradients. CoRR, abs/1504.03655, 2015. URL http://arxiv.org/abs/1504.03655.

Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv preprint arXiv:1501.06195, 2015.

Laurent Zwald and Gilles Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 1649-1656, 2006.