Journal of Machine Learning Research 23 (2022) 1-26. Submitted 9/20; Revised 4/22; Published 4/22.

A Perturbation-Based Kernel Approximation Framework

Roy Mitz (roymitz@mail.tau.ac.il) and Yoel Shkolnisky (yoelsh@tauex.tau.ac.il)
School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel

Editor: Karsten Borgwardt

Abstract: Kernel methods are powerful tools in various data analysis tasks. Yet, in many cases, their time and space complexity render them impractical for large datasets. Various kernel approximation methods have been proposed to overcome this issue, the most prominent being the Nyström method. In this paper, we derive a perturbation-based kernel approximation framework building upon results from classical perturbation theory. We provide an error analysis for this framework, and prove that it in fact generalizes the Nyström method and several of its variants. Furthermore, we show that our framework gives rise to new kernel approximation schemes that can be tuned to take advantage of the structure of the approximated kernel matrix. We support our theoretical results numerically and demonstrate the advantages of our approximation framework on both synthetic and real-world data.

Keywords: perturbation theory, kernel approximation, kernel-based non-linear dimensionality reduction, Nyström method

1. Introduction

Over the last decades, kernel methods have become widely used in various data analysis tasks. Two notable examples in machine learning are kernel SVM (Cortes and Vapnik, 1995) and kernel ridge regression (Saunders et al., 1998). Another common application of kernel methods is non-linear dimensionality reduction, where the lower-dimensional embedding of the data is derived from the eigenvectors of some data-dependent kernel matrix.
Examples of such dimensionality reduction algorithms include Laplacian eigenmaps (Belkin and Niyogi, 2003), LLE (Roweis and Saul, 2000), Isomap (Balasubramanian et al., 2002), MDS (Buja et al., 2008), spectral clustering (Shi and Malik, 2000; Ng et al., 2002), and more.

In all of the aforementioned kernel-based methods, the dimension of the kernel matrix grows linearly with the number of data points $n$. This makes kernel-related algorithms impractical for large $n$, both in terms of memory and runtime. To overcome this issue, several methods for kernel approximation were proposed. The most common form is low-rank kernel approximation, where we seek a low-rank representation of the kernel matrix. It is known that the best rank-$m$ approximation of a matrix in the spectral and Frobenius norms is given by truncating its SVD. However, since the dimension of the kernel matrix grows with the number of data points, computing the full or even a partial eigendecomposition of a large kernel matrix is impractical due to its runtime and space requirements. For example, algorithms for partial eigendecomposition such as the Lanczos algorithm and some variants of SVD require $O(n^2 m)$ floating point operations, where $n$ is the dimension of the matrix (the number of data points) and $m$ is the number of eigenvectors calculated. Randomized algorithms (Halko et al., 2011a,b) use random projections of the data to reduce the time complexity of the decomposition to $O(n^2 \log m)$, which is still impractical for large $n$. Moreover, all eigendecomposition algorithms require storing the $n \times n$ kernel matrix either in RAM or on disk.

©2022 Roy Mitz and Yoel Shkolnisky. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-984.html.
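For reference, the best rank-$m$ approximation mentioned above can be sketched in a few lines of NumPy (a minimal illustration of the Eckart–Young truncation; the helper name is ours):

```python
import numpy as np

def best_rank_m(K, m):
    """Best rank-m approximation of a symmetric matrix K (Eckart-Young),
    obtained by truncating its eigendecomposition."""
    vals, vecs = np.linalg.eigh(K)             # eigenvalues in ascending order
    idx = np.argsort(np.abs(vals))[::-1][:m]   # indices of the m largest |eigenvalues|
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
K = X @ X.T                            # PSD kernel-like matrix of rank at most 5
K5 = best_rank_m(K, 5)
print(np.linalg.norm(K - K5, 'fro'))   # essentially 0, since rank(K) <= 5
```

Of course, as the text notes, this exact truncation requires the full matrix and an $O(n^2 m)$ decomposition, which is precisely what the approximation methods below avoid.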
Since the exact calculation of the low-rank decomposition of a kernel matrix is infeasible for large datasets, several approximate methods were proposed. The most prominent low-rank kernel approximation method is the Nyström method (Williams and Seeger, 2001), which will be described in detail in the next section. Several variants of the Nyström method were proposed in the literature: the randomized SVD Nyström method (Li et al., 2014), which uses efficient randomized eigensolvers that enable it to use more samples in order to improve performance; the ensemble Nyström method (Kumar et al., 2009), which averages several Nyström approximations in order to improve accuracy (we note that this is in fact not a low-rank approximation); the spectrum-shifted Nyström method (Wang et al., 2014), which provides superior performance in cases where the spectrum of the matrix decays slowly; and the modified Nyström method (Wang and Zhang, 2013). The latter usually outperforms the classical Nyström method, but is impractical for large datasets due to its time and memory requirements. To alleviate these requirements, Wang and Zhang (2014) suggested a faster version of the modified Nyström method.

When the approximated kernel is not low-rank, the kernel approximation methods mentioned above may result in a poor approximation. More recently, works that exploit the structure of the kernel matrix were proposed. For example, the MEKA algorithm (Si et al., 2017) provides superior kernel approximation for kernels that admit a block-diagonal structure and are not necessarily low-rank.

A problem related to low-rank approximation that is relevant to this paper is updating a known eigendecomposition of a matrix following a small perturbation, without calculating the entire decomposition from scratch. Classical perturbation results (Stewart, 1990; Byron and Fuller, 2012) exist for a general symmetric perturbation, and will be described in detail in the next section.
Other related works consider perturbations that have some structure; see, for example, Bunch et al. (1978) and Mitz et al. (2019) for the case where the perturbation is of rank one, and Oh and Hu (2018) and Brand (2006) for a general low-rank perturbation. Other approaches to updating a given eigendecomposition include restarting the power method (Langville and Meyer, 2006) or the inverse iteration algorithm (Trefethen and Bau III, 1997), both of which require applying the updated matrix several times until convergence, which may be expensive if the matrix is large.

The contribution of the current paper is threefold. First, we derive eigendecomposition perturbation formulas, accompanied by error bounds, for matrices for which only part of the spectrum is known. Second, we use these perturbation formulas to derive a new framework for kernel approximation. Unlike some of the existing methods, we present explicit error bounds for the individual approximated eigenvectors rather than only for the kernel approximation. Third, we prove that the Nyström method and its variants are in fact special cases of our framework. This reveals the essence behind existing Nyström methods, allows us to analyze their accuracy, and provides the means to derive new Nyström-type approximations that take advantage of the structure of the kernel matrix.

The rest of this paper is organized as follows. In Section 2, we describe classical perturbation results, along with the Nyström method and some of its variants. In Section 3, we extend the classical perturbation formulas to the case where only part of the spectrum of the perturbed matrix is known, and derive their error bounds. In Section 4, we use these formulas to develop a perturbation-based kernel approximation framework. In Section 5, we prove that our kernel approximation framework generalizes the Nyström method.
In Section 6, we suggest several types of kernel approximations based on our framework, and prove that some of them are related to variants of the Nyström method. In Section 7, we provide numerical results to support our theory and show the advantages of our kernel approximation framework. In Section 8, we provide some concluding remarks and discuss future research.

2. Preliminaries

In this section, we describe two methods for approximating the eigendecomposition of a matrix that are relevant to our work. We first describe the Nyström method in Section 2.1, and then describe the perturbation method in Section 2.2.

2.1 The Nyström method and its variants

In this section, we describe in detail some of the Nyström-type methods discussed in the Introduction. We begin by describing the classical Nyström method, continue with some of its variants that are particularly relevant to our work, and finish by discussing some results regarding the error bounds of the Nyström method.

2.1.1 The classical Nyström method

Let $K \in \mathbb{R}^{n \times n}$ be a symmetric positive-definite matrix. We wish to find the $k$ leading eigenpairs $\{(\lambda_i, u_i)\}_{i=1}^k$ of $K$. The Nyström method (Sun et al., 2015; Williams and Seeger, 2001) finds an approximation $\{(\tilde\lambda_i, \tilde u_i)\}_{i=1}^k$ to these eigenpairs as follows. First, we select $k$ columns of $K$ at random (typically uniformly without replacement). We assume, without loss of generality, that the columns and rows of $K$ are rearranged so that the first $k$ columns of $K$ are sampled. Denote by $\bar K$ the $k \times k$ upper-left submatrix of $K$ and by $C$ the $n \times k$ matrix consisting of the first $k$ columns of $K$. Then, we calculate the $k$ eigenpairs of $\bar K$ and denote them by $\{(\lambda'_i, u'_i)\}_{i=1}^k$. Finally, the Nyström method approximates the $k$ leading eigenvectors of $K$ by

$$\tilde u_i = \sqrt{\frac{k}{n}}\, \frac{1}{\lambda'_i}\, C u'_i, \qquad i = 1, \ldots, k. \tag{1}$$

Moreover, the $k$ leading eigenvalues of $K$ are approximated by

$$\tilde\lambda_i = \frac{n}{k}\, \lambda'_i, \qquad i = 1, \ldots, k. \tag{2}$$

Finally, the Nyström approximation of $K$ is

$$\tilde K_{\mathrm{nys}} = \sum_{i=1}^{k} \tilde\lambda_i \tilde u_i \tilde u_i^{T}. \tag{3}$$
The runtime complexity of the Nyström method (formula (1)) is $O(nk^2 + k^3)$. The Nyström method requires sampling a representative subset of the data, and hence many methods for sampling this subset have been proposed; see Kumar et al. (2012), Sun et al. (2015), and the references therein. The results derived below are independent of the methodology used to obtain the subset of samples, and hence we do not discuss this issue in detail.

2.1.2 Variants of the Nyström method

One generalization of the Nyström method is the randomized SVD Nyström method (Li et al., 2014). It introduces a parameter $l \ge k$ and chooses $\bar K$ to be the $l \times l$ top-left submatrix of $K$. Then, the $k$ leading eigenpairs of $\bar K$ are calculated via some efficient randomized SVD method, followed by the use of (1) and (2) for the approximation. This form of approximation is a generalization of the Nyström method, since choosing $l = k$ recovers the Nyström method, while choosing $l = n$ yields the exact eigenvectors of $K$. Intuitively, the larger $l$ is, the better the approximation is likely to be, at the cost of a greater computational complexity. The runtime complexity of this method is $O(nk^2 + lk^2)$.

Since the Nyström approximation (3) of the kernel matrix $K$ is a low-rank approximation, it may provide poor results when $K$ is not low-rank. This might occur, for example, when the spectrum of $K$ decays slowly. A possible approach to overcome this problem is the spectrum-shifted Nyström method (Wang et al., 2014). This method essentially applies the classical Nyström method to a shifted kernel matrix, that is, to

$$K_{\mathrm{shift}} = K - \mu I, \tag{4}$$

for some $\mu \ge 0$. The updated eigenvalues (2) are then shifted back by $\mu$. If we denote by $\tilde K_{\mathrm{shift}}$ the kernel approximation obtained by the shifted Nyström method, it is shown in Wang et al. (2014) that $\|K - \tilde K_{\mathrm{shift}}\|_F \le \|K - \tilde K_{\mathrm{nys}}\|_F$.

Alternatively, the kernel $K$ may admit a block-diagonal structure. As demonstrated in Si et al.
(2017), this may happen for some kernel functions when the data consist of several clusters. In this case, the MEKA algorithm (Si et al., 2017) essentially performs a Nyström approximation on each cluster of the data. Each such approximation corresponds to a block on the diagonal of the kernel matrix, and the resulting approximation is block-diagonal.

An approach related to the MEKA algorithm for improving the Nyström approximation (3) is the ensemble Nyström method (Kumar et al., 2009). The idea behind this method is to perform $q$ independent Nyström kernel approximations on random subsets of the data, and then average them. Formally, given $q$ independent Nyström approximations $\{\tilde K_i\}_{i=1}^q$, the ensemble Nyström approximation is given by

$$\sum_{i=1}^{q} \mu_i \tilde K_i,$$

for some weights $\{\mu_i\}_{i=1}^q$. It is suggested in Kumar et al. (2009) to use $\mu_i = 1/q$ for $1 \le i \le q$. Better error bounds for this method compared to the classical Nyström method are proven in Kumar et al. (2009). The difference between the ensemble Nyström method and the MEKA algorithm is that in the former, the individual Nyström approximations are chosen at random rather than by clusters, and the resulting approximation is their average rather than their concatenation into a block-diagonal matrix. We note that in both the ensemble Nyström method and the MEKA algorithm, the resulting approximation is not low-rank, and is generally of rank greater than $k$.

2.1.3 Error bounds for the Nyström method

The error bounds of the Nyström method and its variants have been of great interest. For a comprehensive review we refer the reader to Sun et al. (2015), Wang and Zhang (2013), and Gittens and Mahoney (2013). If we denote by $K_m$ the best rank-$m$ approximation of a kernel matrix $K$, then the error bounds obtained in the literature are of the forms

$$\|K - \tilde K_{\mathrm{nys}}\| \le \alpha \, \|K - K_m\| \qquad \text{or} \qquad \|K - \tilde K_{\mathrm{nys}}\| \le \|K - K_m\| + \beta,$$

for some $\alpha, \beta \ge 0$, where $\|\cdot\|$ is usually the Frobenius norm or the 2-norm.
To the best of our knowledge, no error bounds have been obtained for the individual Nyström-approximated eigenvectors, though such bounds may be of interest when using the Nyström method as part of a dimensionality reduction algorithm.

2.2 Perturbation of eigenvalues and eigenvectors

Let $A' \in \mathbb{R}^{n \times n}$ be a real symmetric positive-definite matrix with distinct eigenvalues $\{t_i\}_{i=1}^n$ and corresponding orthonormal eigenvectors $\{v_i\}_{i=1}^n$. Assume that $t_1 > t_2 > \cdots > t_n$. Let $E \in \mathbb{R}^{n \times n}$ be a real symmetric matrix. Consider a perturbation $A$ of $A'$ given by $A = A' + E$, with the eigenpairs of $A$ denoted by $\{(s_i, w_i)\}_{i=1}^n$, so that $s_1 > s_2 > \cdots > s_n$. We wish to find an approximation $\{(\tilde s_i, \tilde w_i)\}_{i=1}^n$ to the eigenpairs of $A$. The classical perturbation solution to this problem (Stewart, 1990; Byron and Fuller, 2012) is as follows. The approximated eigenvectors of $A$ are given by

$$\tilde w_i = v_i + \sum_{\substack{k=1 \\ k \ne i}}^{n} \frac{\langle E v_i, v_k \rangle}{t_i - t_k}\, v_k + O(\|E\|_2^2), \qquad 1 \le i \le n, \tag{5}$$

where $\langle \cdot, \cdot \rangle$ is the standard dot product between vectors, and the approximated eigenvalues of $A$ are given by

$$\tilde s_i = t_i + v_i^T E v_i + O(\|E\|_2^2), \qquad 1 \le i \le n. \tag{6}$$

We note that the eigenvalue update formula (6) depends only on the eigenvalue we wish to update and its corresponding eigenvector, whereas the eigenvector update formula (5) depends on all eigenvalues and eigenvectors of $A'$. There exist perturbation results for matrices with non-simple eigenvalues (Byron and Fuller, 2012). However, since non-simple eigenvalues are highly unlikely in data-dependent matrices, we do not discuss this case and leave it for future work.

3. Truncating the perturbation formulas

In this section, we consider a variant of the problem presented in Section 2.2, in which only the $m$ leading eigenpairs $\{(t_i, v_i)\}_{i=1}^m$ of the unperturbed matrix $A'$ are known, and we wish to approximate the $m$ leading eigenpairs of $A$. To this end, we introduce a parameter $\mu \in \mathbb{R}$ whose purpose is to approximate the unknown eigenvalues $\{t_i\}_{i=m+1}^n$ of $A'$.
We denote by $V^{(m)}$ the $n \times m$ matrix consisting of the $m$ leading eigenvectors of $A'$, and define

$$r_i = \left(I - V^{(m)} V^{(m)T}\right) E v_i, \tag{7}$$

for $1 \le i \le m$. We derive two approximation formulas based on the classical perturbation formula (5). These two formulas differ in their order of approximation and in their computational complexity. The first, which we refer to as the first-order truncated perturbation formula, provides a first-order approximation to the eigenvectors of $A$, whereas the second, which we refer to as the second-order truncated perturbation formula, provides a second-order approximation. We describe these approximations in the following propositions.

Proposition 1 (First-order approximation) Let $1 \le i \le m$, and let $\mu \in \mathbb{R}$ be a parameter. Using the notation of Section 2.2, the first-order approximation to $w_i$ is given by

$$\tilde w^{(1)}_i = v_i + \sum_{\substack{k=1 \\ k \ne i}}^{m} \frac{\langle E v_i, v_k \rangle}{t_i - t_k}\, v_k + \frac{1}{t_i - \mu}\, r_i, \tag{8}$$

with an error satisfying

$$\|w_i - \tilde w^{(1)}_i\|_2 \le \frac{\sum_{k=m+1}^{n} |t_k - \mu|}{|t_i - t_{m+1}|\,|t_i - \mu|}\, \|E\|_2 + O\left(\|E\|_2^2\right). \tag{9}$$

Proposition 2 (Second-order approximation) Let $1 \le i \le m$, and let $\mu \in \mathbb{R}$ be a parameter. Using the notation of Section 2.2, the second-order approximation to $w_i$ is given by

$$\tilde w^{(2)}_i = v_i + \sum_{\substack{k=1 \\ k \ne i}}^{m} \frac{\langle E v_i, v_k \rangle}{t_i - t_k}\, v_k + \frac{1}{t_i - \mu}\, r_i - \frac{\mu}{(t_i - \mu)^2}\, r_i + \frac{1}{(t_i - \mu)^2}\, A' r_i, \tag{10}$$

with an error satisfying

$$\|w_i - \tilde w^{(2)}_i\|_2 \le \frac{\sum_{k=m+1}^{n} |t_k - \mu|^2}{|t_i - t_{m+1}|\,|t_i - \mu|^2}\, \|E\|_2 + O\left(\|E\|_2^2\right). \tag{11}$$

The proofs of Proposition 1 and Proposition 2 are given in Appendix A and Appendix B, respectively. We discuss the runtime and memory requirements of formulas (8) and (10) in Appendix C. The update formulas (8) and (10) depend on a parameter $\mu$, whose choice is discussed in Mitz et al. (2019). Concretely, if $A'$ is known to be low-rank, we use $\mu = 0$. When $A'$ is not low-rank, and especially if its spectrum is known to decay slowly, we follow Mitz et al. (2019) and suggest using

$$\mu_{\mathrm{mean}} = \frac{\operatorname{trace}(A') - \sum_{i=1}^{m} t_i}{n - m}, \tag{12}$$

which is the mean of the unknown eigenvalues of the unperturbed matrix $A'$.
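The first-order update (8), together with the choice (12) of $\mu$, can be sketched as follows (a minimal NumPy illustration under the notation above; the function names are ours):

```python
import numpy as np

def mu_mean(trace_A1, t, n):
    """Eq. (12): the mean of the unknown eigenvalues of the unperturbed
    matrix, computed from its trace and its m known leading eigenvalues t."""
    return (trace_A1 - np.sum(t)) / (n - len(t))

def first_order_update(t, V, E, mu):
    """Eq. (8): first-order approximation to the m leading eigenvectors of
    A = A' + E, given the m leading eigenpairs (t, V) of A'."""
    n, m = V.shape
    EV = E @ V                       # columns are E v_i
    R = EV - V @ (V.T @ EV)          # r_i = (I - V V^T) E v_i, eq. (7)
    W = np.empty((n, m))
    for i in range(m):
        w = V[:, i].copy()
        for k in range(m):
            if k != i:
                w += (EV[:, i] @ V[:, k]) / (t[i] - t[k]) * V[:, k]
        W[:, i] = w + R[:, i] / (t[i] - mu)
    return W

# Synthetic check: 5 distinct leading eigenvalues and a flat bulk at 0.5,
# so that mu_mean recovers the bulk value exactly (the setting of Prop. 3).
rng = np.random.default_rng(2)
n, m = 300, 5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
t_full = np.concatenate([np.array([2.0, 1.8, 1.6, 1.4, 1.2]), np.full(n - m, 0.5)])
A1 = (Q * t_full) @ Q.T                      # unperturbed matrix A'
E = rng.standard_normal((n, n))
E = E + E.T
E *= 1e-3 / np.linalg.norm(E, 2)             # small symmetric perturbation
mu = mu_mean(np.trace(A1), t_full[:m], n)    # = 0.5 up to rounding
W = first_order_update(t_full[:m], Q[:, :m], E, mu)
```

Under these conditions (Proposition 3), the approximated eigenvectors `W` agree with those of $A' + E$ up to an $O(\|E\|_2^2)$ error.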
The quantity $\mu_{\mathrm{mean}}$ can be easily computed, since the trace of $A'$ and its $m$ leading eigenvalues are known. We conclude this section by proving that under a certain assumption on $A'$, the first-order truncated perturbation formula (8) and the second-order truncated perturbation formula (10) coincide. Furthermore, in this case, the $O(\|E\|_2)$ term in the error bound of both approximations cancels out, as stated by the following proposition.

Proposition 3 Let $\delta \ge 0$ and assume that $A'$ can be written as a low-rank matrix plus a spectrum shift, that is,

$$A' = V^{(m)} T V^{(m)T} + \delta I$$

for some diagonal matrix $T \in \mathbb{R}^{m \times m}$. Then, for $\mu = \delta$, the first-order truncated perturbation formula (8) and the second-order truncated perturbation formula (10) coincide, that is, $\tilde w^{(1)}_i = \tilde w^{(2)}_i$, and the approximation errors satisfy

$$\|w_i - \tilde w^{(1)}_i\|_2 = \|w_i - \tilde w^{(2)}_i\|_2 = O\left(\|E\|_2^2\right),$$

for all $1 \le i \le m$.

The proof of Proposition 3 is given in Appendix D.

Corollary 4 If $A'$ is of rank $m$, and $\mu = 0$ in (8) and (10), then the first-order and second-order truncated perturbation formulas give rise to the same approximation. The error of this approximation is $O\left(\|E\|_2^2\right)$.

4. Perturbation-based kernel approximation framework

In this section, we derive our perturbation-based kernel approximation framework based on Proposition 1. Let $K \in \mathbb{R}^{n \times n}$ be a symmetric positive-definite matrix with $m$ distinct leading eigenpairs, denoted $\{(\lambda_i, u_i)\}_{i=1}^m$ and ordered in descending order. Let $K^s \in \mathbb{R}^{n \times n}$ be a symmetric matrix consisting of any subset of the entries of $K$, with the rest of its entries being 0, as illustrated in Figure 1.

Figure 1: Illustration of the submatrix $K^s$: (b) a possible $K^s$; (c) another possible $K^s$. Blank entries indicate 0.

Our kernel approximation framework approximates the eigenvectors of $K$ using the eigenvectors of any such $K^s$, as follows. Let $\{(\lambda^s_i, u^s_i)\}_{i=1}^m$ be the leading eigenpairs of $K^s$, ordered in descending order, and assume that the $\lambda^s_i$ are distinct. Let $U^{s(m)} \in \mathbb{R}^{n \times m}$ be the matrix whose columns are the $m$
eigenvectors of $K^s$ corresponding to the $m$ largest eigenvalues of $K^s$, and let $\mu \ge 0$ be a parameter. Let $1 \le i \le m$. By the first-order approximation in Proposition 1, the eigenvector $u_i$ is approximated by

$$\tilde u_i = u^s_i + \sum_{\substack{k=1 \\ k \ne i}}^{m} \frac{\langle (K - K^s) u^s_i,\, u^s_k \rangle}{\lambda^s_i - \lambda^s_k}\, u^s_k + \frac{1}{\lambda^s_i - \mu} \left(I_n - U^{s(m)} U^{s(m)T}\right) (K - K^s)\, u^s_i, \tag{13}$$

with an error satisfying

$$\|u_i - \tilde u_i\|_2 \le \frac{\sum_{k=m+1}^{n} |\lambda^s_k - \mu|}{|\lambda^s_i - \lambda^s_{m+1}|\,|\lambda^s_i - \mu|}\, \|K - K^s\|_2 + O\left(\|K - K^s\|_2^2\right). \tag{14}$$

Furthermore, by (6), the eigenvalue $\lambda_i$ is approximated by

$$\tilde\lambda_i = \lambda^s_i + u^{sT}_i (K - K^s)\, u^s_i, \tag{15}$$

with an error of magnitude $|\lambda_i - \tilde\lambda_i| = O(\|K - K^s\|_2^2)$. We refer to (13) and (15) as our perturbation approximation of the eigenvectors and eigenvalues, respectively. Finally, we define our perturbation kernel approximation by

$$\tilde K_{\mathrm{pert}} = \sum_{i=1}^{m} \tilde\lambda_i \tilde u_i \tilde u_i^{T}. \tag{16}$$

Note that this framework is quite general, and can be applied to any symmetric submatrix of $K$. We propose and discuss several choices for $K^s$ in Section 6. A kernel approximation framework analogous to (13) that is based on the second-order approximation in Proposition 2 can be obtained in a similar way, but is typically impractical for large $n$ due to its time and space requirements.

5. Equivalence with the Nyström method

In this section, we prove that our perturbation-based kernel approximation framework (13) is in fact a generalization of the Nyström method described in Section 2.1, by showing that the Nyström method arises from our framework for a specific choice of $K^s$. Let $K \in \mathbb{R}^{n \times n}$ be a kernel matrix, and let $m < n$. Assume, without loss of generality, that we apply the classical Nyström method, given in (1) and (2), to the $m \times m$ upper-left submatrix of $K$, and denote by $\{(\hat\lambda_i, \hat u_i)\}_{i=1}^m$ the resulting approximate eigenpairs of $K$. The following proposition states that for a specific choice of the matrix $K^s$, the perturbation approximation $\tilde K_{\mathrm{pert}}$ of (16) is exactly the classical Nyström approximation $\tilde K_{\mathrm{nys}}$ of (3).
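To make the framework of Section 4 concrete, the updates (13) and (15) can be sketched as follows (a minimal NumPy illustration of our own, with the eigendecomposition of $K^s$ computed densely for simplicity; in practice one would exploit the structure of $K^s$):

```python
import numpy as np

def perturbation_approx(K, Ks, m, mu=0.0):
    """Perturbation approximation (13) and (15): approximate the m leading
    eigenpairs of K from the m leading eigenpairs of the sparser matrix Ks."""
    n = K.shape[0]
    vals, vecs = np.linalg.eigh(Ks)
    lam_s = vals[::-1][:m]                    # m leading eigenvalues of Ks
    U = vecs[:, ::-1][:, :m]                  # corresponding eigenvectors
    EU = (K - Ks) @ U                         # columns are (K - Ks) u_i^s
    lam = lam_s + np.sum(U * EU, axis=0)      # eigenvalue update, eq. (15)
    R = EU - U @ (U.T @ EU)                   # residual outside span(U)
    W = np.empty((n, m))
    for i in range(m):
        w = U[:, i].copy()
        for k in range(m):
            if k != i:
                w += (EU[:, i] @ U[:, k]) / (lam_s[i] - lam_s[k]) * U[:, k]
        W[:, i] = w + R[:, i] / (lam_s[i] - mu)
    return lam, W

# Nystrom-like choice of Ks: keep only the top-left m x m block of K.
# For a rank-m kernel this reproduces K exactly (cf. Corollary 4).
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 3))
K = X @ X.T                                   # rank-3 PSD kernel matrix
Ks = np.zeros_like(K)
Ks[:3, :3] = K[:3, :3]
lam, W = perturbation_approx(K, Ks, 3)
K_pert = (W * lam) @ W.T                      # eq. (16)
print(np.linalg.norm(K - K_pert) / np.linalg.norm(K))
```

With this padded-block choice of $K^s$ and $\mu = 0$, the resulting kernel approximation coincides with the Nyström approximation, which is the content of Proposition 5 below.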
Proposition 5 Using the above notation, let $K^s$ be the $n \times n$ matrix whose top-left $m \times m$ submatrix is the top-left $m \times m$ submatrix of $K$, and whose remaining entries are 0. Denote by $\{(\lambda^s_i, u^s_i)\}_{i=1}^m$ the eigenpairs of $K^s$. Set $\mu$ in (13) to 0, and denote by $\{(\tilde\lambda_i, \tilde u_i)\}_{i=1}^m$ the perturbation approximation obtained from the eigenpairs $\{(\lambda^s_i, u^s_i)\}_{i=1}^m$ of $K^s$. Denote by $\bar K$ the top-left $m \times m$ submatrix of $K$, by $\{(\lambda'_i, u'_i)\}_{i=1}^m$ its eigenpairs, and by $\{(\hat\lambda_i, \hat u_i)\}_{i=1}^m$ the Nyström approximation obtained from $\{(\lambda'_i, u'_i)\}_{i=1}^m$ (see (1) and (2)). Then,

$$\hat u_i = \sqrt{\frac{m}{n}}\, \tilde u_i \qquad \text{and} \qquad \hat\lambda_i = \frac{n}{m}\, \tilde\lambda_i,$$

for all $1 \le i \le m$. In particular, $\tilde K_{\mathrm{nys}} = \tilde K_{\mathrm{pert}}$.

The proof of Proposition 5 is given in Appendix E. Formulating the Nyström method as a perturbation-based approximation using Proposition 5 enables us to provide an error bound for the Nyström method based on Propositions 1 and 2. Contrary to previous works that only provide error bounds for the approximated kernel (3) resulting from the Nyström method, our error bound is for the individual approximated eigenvectors, as stated in the following proposition.

Proposition 6 (Error bound for the Nyström method) Using the above notation, the error induced by the Nyström method satisfies

$$\|u_i - \hat u_i\|_2 = O\left(\|K - K^s\|_2^2\right), \qquad 1 \le i \le m.$$

The proof follows directly from the equivalence stated in Proposition 5 by noting that in the Nyström method setting, the requirements of Corollary 4 hold.

6. New kernel approximation schemes based on the perturbation framework

The perturbation-based kernel approximation framework derived in Section 4 is very flexible, and allows for various approximations depending on the choice of the matrix $K^s$. There are two main considerations in choosing $K^s$. First, since calculating the eigendecomposition of $K^s$ is the most expensive part of the perturbation-based kernel approximation, we wish to choose a matrix $K^s$ whose eigendecomposition is easy to compute. This makes our framework computationally attractive.
Second, we would like to take advantage of the flexibility of our framework and choose a matrix $K^s$ that captures as much of a given kernel $K$ as possible (that is, that minimizes the $\|K - K^s\|_2$ term in (14)). This allows for better approximation results compared to classical Nyström-type methods. In this section, we propose several approximation schemes corresponding to different choices of $K^s$. Furthermore, we prove that several of the Nyström method variants described in Section 2 also arise from our general approximation framework for suitable choices of $K^s$. The list below is by no means comprehensive, and users may come up with different approximation schemes that are more suitable to their problem settings.

6.1 l-block kernel approximation

We define the l-block kernel approximation by choosing the matrix $K^s$ to be the top-left $l \times l$ submatrix of $K$, padded with zeros to size $n \times n$ (see Figure 2), with $l \ge m$ being a parameter. The difference between this approach and the Nyström method is that while we still calculate $m$ eigenpairs, we do so on a larger block, which captures more of $K$. This comes at the price of a greater computational cost.

Proposition 7 Using the notation of this section, the eigenpairs calculated by the randomized SVD Nyström method (Li et al., 2014) and by the l-block kernel approximation method are equal.

The proof follows easily from the definitions and Proposition 5.

6.2 µ-shifted kernel approximation

We define the µ-shifted kernel approximation by choosing the matrix $K^s$ to be the top-left $m \times m$ submatrix of $K$ padded with zeros to size $n \times n$, similarly to the Nyström method (see Figure 2). The difference from the Nyström method lies in the parameter $\mu$ of (13). In Proposition 5, we set the parameter $\mu$ to $\mu = 0$. This is a reasonable choice when the kernel matrix $K$ is low-rank, or when its spectrum decays fast.
When that is not the case, it might be beneficial to choose a parameter $\mu$ that approximates the unknown eigenvalues of $K$. A reasonable choice for $\mu$ in such a case is $\mu_{\mathrm{mean}}$ of (12). We now prove that given a parameter $\mu \ge 0$, the spectrum-shifted Nyström method (Wang et al., 2014) with parameter $\mu$ coincides with the perturbation approximation method with the same $\mu$, as detailed in the following proposition.

Proposition 8 Using the notation of this section, the eigenpairs calculated by the spectrum-shifted Nyström method (Wang et al., 2014) and by the µ-shifted kernel approximation method are equal.

The proof of Proposition 8 is given in Appendix F.

6.3 Block-diagonal kernel approximation

We define the block-diagonal kernel approximation by choosing the matrix $K^s$ to be a block-diagonal matrix (see Figure 2). The block sizes can be arbitrary, but for simplicity of notation, we choose $k$ blocks of identical size $l \ge m$. For each block, we pad the block with zeros to obtain an $n \times n$ matrix, and calculate its $m$ leading eigenpairs. We then approximate the eigenpairs of $K$ using (13). Denote by $\{(\tilde\lambda^{(j)}_i, \tilde u^{(j)}_i)\}_{i=1}^m$ the extended eigenpairs of block $j$, and by $\tilde K_j \in \mathbb{R}^{n \times n}$ the resulting kernel approximation. To combine the $k$ approximations $\{\tilde K_j\}_{j=1}^k$ into an approximation of $K$, we set

$$\tilde K = \sum_{j=1}^{k} \tilde K_j.$$

We note that the kernel matrix approximation obtained by this method generally will not be low-rank.

Proposition 9 Using the notation of this section, the eigenpairs calculated by the ensemble Nyström method (Kumar et al., 2009) and by the block-diagonal kernel approximation method are equal.

The proof follows easily from the definitions and Proposition 5.

6.4 p-band kernel approximation

We define the p-band kernel approximation by choosing the matrix $K^s$ to be a band matrix of width $p$ (see Figure 2). This approximation may provide superior results when the kernel $K$ has most of its energy concentrated along the diagonal.
Such a kernel may arise naturally for sequential data, where adjacent entries are more similar to each other. In such a case, the p-band kernel approximation will capture most of $K$. Existing Nyström-type approximations, on the other hand, are not able to do so, since they are limited to block matrices. We demonstrate the advantage of this kernel approximation method in Section 7.2.1.

6.5 Sparse kernel approximation

We define the sparse kernel approximation by choosing the matrix $K^s$ to be some sparse submatrix of $K$, as illustrated in Figure 3. More concretely, in the sparse approximation framework, we denote by $\mathrm{nnz}(K)$ the number of non-zero entries of $K$, and define $K^s$ by choosing $q \cdot \mathrm{nnz}(K)$ entries of $K$, for some $0 < q \le 1$. While this approximation is valid for any (symmetric) subset of the entries of $K$, motivated by the $\|E\|_2$ term in the error bounds (9) and (11), we suggest choosing the $q \cdot \mathrm{nnz}(K)$ entries of $K$ that are largest in absolute value. We demonstrate the advantage of this kernel approximation method in Section 7.2.2.

7. Numerical examples

In this section, we demonstrate numerically the results obtained in the previous sections. We start by demonstrating the error bounds derived in Section 3. Then, we demonstrate the advantages of the kernel approximation methods proposed in Section 6 using both real and synthetic datasets. The MATLAB code to reproduce the graphs in this section can be found at github.com/roymitz/perturbation_kernel_approximation.

Figure 2: Illustration of the submatrix $K^s$ for each of the discussed kernel approximations: (a) $K^s$ for the l-block kernel approximation; (b) $K^s$ for the µ-shifted kernel approximation; (c) $K^s$ for the block-diagonal kernel approximation; (d) $K^s$ for the p-band kernel approximation. Blank entries indicate 0.

7.1 Perturbation error bounds

In this section, we demonstrate numerically the behavior of the error bounds (9) and (11) of Propositions 1 and 2, respectively.
In our first example, we demonstrate the predicted linear dependence of the errors $\|w_i - \tilde w^{(1)}_i\|_2$ and $\|w_i - \tilde w^{(2)}_i\|_2$ on $\|E\|_2$. We also show the quadratic dependence of these errors on $\|E\|_2$ for a proper choice of $\mu$. To that end, we generate a random symmetric matrix $A' \in \mathbb{R}^{1000 \times 1000}$ whose 10 leading eigenvalues are between 1 and 2, and whose remaining eigenvalues are exactly 0.5. We then generate a random symmetric matrix $E$, and normalize it to have a unit spectral norm. Then, for various values of $c$, we approximate the 10 leading eigenpairs of $A_c = A' + cE$ by the first and second-order approximations (8) and (10), using $\mu = 0$ and $\mu_{\mathrm{mean}}$. Denote by $v_c$ the leading eigenvector of $A_c$, and by $u^1_c$ and $u^2_c$ its approximations by (8) and (10) using $\mu = 0$, respectively. Denote by $w^1_c$ and $w^2_c$ the respective approximations using $\mu_{\mathrm{mean}}$. For each $c$, we measure the errors $\|v_c - u^1_c\|_2$, $\|v_c - u^2_c\|_2$, $\|v_c - w^1_c\|_2$, and $\|v_c - w^2_c\|_2$. In Figure 4a, we plot all the log10-errors versus $\log_{10}\|cE\|_2$. As predicted by the theory, there is a linear dependence between the error in the eigenvector approximation and the norm of the perturbation matrix when using $\mu = 0$. Furthermore, we see that when using $\mu_{\mathrm{mean}}$, both formulas coincide and give rise to the same second-order approximation, as predicted by Proposition 3.

Figure 3: Illustration of a sparse kernel approximation: (a) a sparse $K$; (b) the corresponding $K^s$. Blank entries indicate 0.

In our second example, we demonstrate the linear and quadratic dependence of the error on $\sum_{j=m+1}^{n} |\lambda_j - \mu|$, that is, on the unknown eigenvalues of the perturbed matrix. For various values of $c \in \mathbb{R}$, we generate matrices $A'_c \in \mathbb{R}^{1000 \times 1000}$ as follows. Their 10 leading eigenvalues are between 1 and 2, and are the same for all values of $c$. The rest of their eigenvalues are exactly $c$. Then, we generate a random symmetric matrix $E \in \mathbb{R}^{1000 \times 1000}$ and normalize it to have a norm of $10^{-6}$.
We choose $\|E\|_2$ to be relatively small, so that its contribution to the error does not mask the effect of $\sum_{j=m+1}^{n}|\lambda_j - \mu|$. We approximate the 10 leading eigenpairs of $A_c = A'_c + E$ by the first and second-order approximations (8) and (10) using $\mu = 0$, and measure the error in the same way as in the previous example. In Figure 4b, we plot $\log\|v_c - u^{(1)}_c\|_2$ and $\log\|v_c - u^{(2)}_c\|_2$ versus $\log|\lambda_j - \mu| = \log\lambda_j = \log c$. As predicted by Propositions 1 and 2, there is a linear dependence between the error in the eigenvector approximation and $c$ for the first-order approximation, and a quadratic dependence for the second-order formula. We note that using $\mu_{\mathrm{mean}}$ in this example would cancel out the $\sum_{j=m+1}^{n}|\lambda_j - \mu|$ term, since in that case $\mu_{\mathrm{mean}} = c$.

Figure 4: Numerical demonstration of the error terms in the approximations (8) and (10). (a) The dependence of the first and second-order formulas on the perturbation matrix norm: $\log(\mathrm{error})$ vs. $\log\|cE\|_2$. The slope of both curves corresponding to $\mu = 0$ is 1, demonstrating the linear dependence of the error terms (9) and (11) on $\|E\|_2$ for both the first and second-order approximations. On the other hand, both curves corresponding to $\mu_{\mathrm{mean}}$ coincide with slope 2, demonstrating Proposition 3. (b) The dependence of the two formulas on the unknown eigenvalues of the perturbed matrix: $\log(\mathrm{error})$ vs. $\log|\lambda_j - \mu|$. The slope of the line corresponding to the first-order approximation is 1, whereas the slope of the line corresponding to the second-order approximation is 2, demonstrating the linear and quadratic dependence of the first and second-order error terms on $\sum_{j=m+1}^{n}|\lambda_j - \mu|$, respectively.

7.2 Perturbation-based approximation for synthetic and real data

In this section, we compare the various kernel approximation methods proposed in Section 6 on both synthetic and real data. We demonstrate that the various approximation schemes perform differently, depending on the structure of the kernel matrix. As our metric for comparing the performance of the various methods we use the kernel reconstruction error: if $K_m$ is the best rank-$m$ approximation of the kernel matrix of the entire data (obtained by SVD), and $\widetilde{K}$ is its approximation, the error is $\|K_m - \widetilde{K}\|_2 / \|K_m\|_2$. We note that the other metrics we tested performed qualitatively similarly. These were the principal angle (Knyazev and Zhu, 2012) between the subspace spanned by the kernel's top eigenvectors and the subspace spanned by their approximations, and the subspace projection error $\|UU^\top - VV^\top\|_2 / \|UU^\top\|_2$, where $U$ and $V$ are the ground-truth subspace and its approximation, respectively. The results for all tested metrics are given in Appendix G. In all the numerical examples to follow, we use $\mu = 0$ for all approximation schemes. We typically witness a marginal difference between the performance of $\mu = 0$ and $\mu = \mu_{\mathrm{mean}}$ for real-world data, as such data are usually close to being low-rank. Thus, we do not include the $\mu$-shifted kernel approximation in the experiments of this section. We also set $n = 1000$ in all experiments. We choose $m$ to be the number of components that account for 90% of the energy of $K$, with a maximum value of 5. For the block-diagonal kernel approximation, we always use two blocks of the same size. Each experiment is performed as follows. For each kernel type, we generate several kernels of that type, each with a different kernel parameter (to be explained later for each example). Then, we approximate each of the kernels using each of the approximation methods, where in every case the matrix $K^s$ used in the approximation consists of 20% of the entries of the approximated kernel.
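The eigenvector experiment above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it builds a symmetric matrix with a prescribed spectrum, applies the first-order approximation with $\mu = 0$ (in the spirit of formula (8), using only the top-$m$ eigenpairs of the unperturbed matrix), and checks that shrinking $\|cE\|_2$ by a factor of 10 shrinks the eigenvector error by roughly the same factor. The function name `first_order_approx` and the smaller problem size are our own choices.

```python
import numpy as np

def first_order_approx(t, V, E, mu=0.0):
    """First-order perturbation approximation of the leading eigenvector of
    A + E from the top-m eigenpairs (t, V) of A, with the unknown
    eigenvalues replaced by the parameter mu (a sketch of formula (8))."""
    m = V.shape[1]
    v = V[:, 0]
    Ev = E @ v
    w = v.copy()
    # known part: contributions of the other known eigenvectors
    for k in range(1, m):
        w += (V[:, k] @ Ev) / (t[0] - t[k]) * V[:, k]
    # unknown part: residual r = (I - V V^T) E v, scaled by 1/(t_1 - mu)
    r = Ev - V @ (V.T @ Ev)
    w += r / (t[0] - mu)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
n, m = 300, 10
# symmetric A with 10 leading eigenvalues in [1, 2], the rest 0.5
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate([np.linspace(2.0, 1.0, m), np.full(n - m, 0.5)])
A = (Q * eigs) @ Q.T
t, V = eigs[:m], Q[:, :m]
# random symmetric E with unit spectral norm
E = rng.standard_normal((n, n)); E = E + E.T
E /= np.linalg.norm(E, 2)

errs = []
for c in [1e-3, 1e-4]:
    vc = np.linalg.eigh(A + c * E)[1][:, -1]  # true leading eigenvector
    w = first_order_approx(t, V, c * E)
    w *= np.sign(w @ vc)                      # fix the sign ambiguity
    errs.append(np.linalg.norm(vc - w))
# shrinking ||cE||_2 tenfold shrinks the error roughly tenfold (linear term)
print(errs[0] / errs[1])
```

The ratio printed is close to 10, consistent with the slope-1 lines in Figure 4a.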
For each such kernel approximation, we measure the approximation error versus the Hoyer score (Hurley and Rickard, 2009) of the kernel matrix $K$ written as a long vector. The Hoyer score of a vector $v \in \mathbb{R}^n$ is defined by
$$H(v) = \frac{\sqrt{n} - \|v\|_1 / \|v\|_2}{\sqrt{n} - 1}.$$
The Hoyer score is a number between 0 and 1, with a higher score corresponding to a sparser vector. We repeat this procedure 20 times for each approximation scheme and kernel parameter, each time with a different random subset of the data. Finally, we plot the mean approximation error for each kernel approximation scheme versus the Hoyer score of the approximated kernel $K$. We also add to the plot the standard deviation of the error of each scheme, presented as a shaded region around the mean.

7.2.1 Kernels concentrated along the diagonal

In this example, we demonstrate the performance of each of the kernel approximation schemes of Section 6 using kernel matrices whose energy is concentrated along their diagonal. We build these kernels as follows. We generate a matrix $K$ whose $(i, j)$ and $(j, i)$ entries are $|i - j|^{-\alpha} + X$, where the $X$ are i.i.d. samples from a normal distribution with mean zero and standard deviation 0.0001. Larger values of $\alpha$ correspond to a more concentrated matrix, whereas smaller values of $\alpha$ correspond to a more spread matrix. We execute the experiment described at the beginning of Section 7.2. The results are presented in Figure 5. We can see that for higher Hoyer scores, which correspond to sparser matrices that are concentrated along the diagonal, the error curves of the p-band and sparse kernel approximation schemes lie below those of the other kernel approximation schemes. We conclude that when the kernel is concentrated along its diagonal, the p-band kernel approximation and the sparse kernel approximation have superior performance, and the performance of the two is comparable.
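The Hoyer score used throughout these experiments is straightforward to compute; the following is a minimal sketch (the function name is ours):

```python
import numpy as np

def hoyer_score(v):
    """Hoyer sparsity score of a vector (Hurley and Rickard, 2009):
    1 for a 1-sparse vector, 0 for a constant vector."""
    v = np.ravel(np.asarray(v, dtype=float))
    n = v.size
    l1 = np.abs(v).sum()
    l2 = np.linalg.norm(v)
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

print(hoyer_score([1, 0, 0, 0]))  # 1-sparse vector -> 1.0
print(hoyer_score([1, 1, 1, 1]))  # constant vector -> 0.0
```

Applying it to a kernel matrix, as in the experiments, amounts to `hoyer_score(K.ravel())`.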
Figure 5: The error of the various kernel approximation schemes for a kernel concentrated along its diagonal. When the kernel is concentrated along the diagonal, and hence sparser, the p-band kernel approximation outperforms the other schemes. The performance of the sparse kernel approximation is comparable to that of the p-band kernel approximation.

7.2.2 Sparse kernels

In this example, we demonstrate the performance of each kernel approximation scheme using kernel matrices that are sparse. As our kernel matrix, we use the symmetric normalized graph Laplacian matrix, used in Laplacian eigenmaps dimensionality reduction (Belkin and Niyogi, 2003). The $(i, j)$ entry of this matrix is given by $\exp(-\|x_i - x_j\|_2^2 / \sigma)$, followed by a data-dependent normalization. The reason we use this kernel is that for small values of $\sigma$ it is essentially sparse. In this subsection, we use real-world datasets taken from the UCI Machine Learning Repository (Dua and Graff, 2017), as described in Table 1. For each dataset, we repeat the experiment described in the introduction of Section 7.2 multiple times, each time on a random subset of 1000 points from the dataset, and for several values of $\sigma$, reflecting the transition between a sparse matrix (small $\sigma$) and a dense matrix (large $\sigma$). The results of this experiment are presented in Figure 6. We can see that for higher Hoyer scores, which correspond to sparser matrices, the error curves of the sparse kernel approximation scheme lie below those of the other kernel approximation schemes. We conclude that when the kernel admits a sparse structure (typically, Hoyer score > 0.75), the sparse kernel approximation scheme has superior performance. For lower Hoyer scores, however, the kernel is usually no longer sparse, and the sparse kernel approximation scheme loses its advantage over the other kernel approximation schemes.
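The kernel used in this experiment can be formed as follows. This is a sketch under an assumed normalization convention, $K = D^{-1/2} W D^{-1/2}$ with $W_{ij} = \exp(-\|x_i - x_j\|_2^2/\sigma)$ and $D$ the diagonal degree matrix of $W$; the function name is ours. It also illustrates the sparse/dense transition driven by $\sigma$:

```python
import numpy as np

def normalized_gaussian_kernel(X, sigma):
    """Gaussian affinity W_ij = exp(-||x_i - x_j||^2 / sigma), symmetrically
    normalized as K = D^{-1/2} W D^{-1/2} (normalization convention assumed)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    W = np.exp(-np.maximum(d2, 0.0) / sigma)       # clamp tiny negative round-off
    d = W.sum(axis=1)                              # degrees
    return W / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
K_sparse = normalized_gaussian_kernel(X, sigma=0.1)    # small sigma: nearly diagonal
K_dense = normalized_gaussian_kernel(X, sigma=100.0)   # large sigma: dense
print(np.mean(K_sparse > 1e-6), np.mean(K_dense > 1e-6))
```

The fraction of non-negligible entries is small for small $\sigma$ and close to 1 for large $\sigma$, which is exactly the transition swept in the experiment.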
We notice that none of the methods performs well for dense matrices, whereas for sparse matrices, the use of the sparse extension may be the difference between a nearly meaningless result and an informative one.

Table 1: Real-world datasets used.

MNIST (dimension 784): Each sample is a grey-scale image of a handwritten digit between zero and nine.
Superconductivity (dimension 81): Each sample contains 81 features extracted from one of 21263 superconductors.
Poker (dimension 10): Each sample is a hand consisting of five playing cards drawn from a standard deck of 52 cards. Each card is described using two attributes (suit and rank).
Wine quality (dimension 11): Each sample corresponds to a variant of a Portuguese wine, where the 11 attributes are numerical characteristics of the wine such as acidity, pH, residual sugar, etc.

Figure 6: Error of the graph Laplacian approximation for various datasets and kernel approximation schemes as a function of the approximated kernel's Hoyer score. For sparse kernels (higher Hoyer score, typically > 0.75), the sparse approximation is superior.

8. Summary and future work

In this paper, we propose a kernel approximation framework that is based on perturbation theory. We prove that this framework is a generalization of the popular Nyström method and some of its variants. Furthermore, contrary to existing error bounds for the Nyström method, our framework provides error bounds for the individual eigenvectors. This is useful when the approximation is used as part of a dimensionality reduction procedure. Our kernel approximation framework is very flexible, and can thus take advantage of the structure of the kernel matrix. We demonstrate our theoretical derivations numerically for kernel matrices that are either sparse or concentrated along their diagonal.
Currently, our framework does not handle kernels with degenerate eigenvalues, nor crossovers between eigenvalues; the latter may change the order of the eigenvectors. As future work, one might consider schemes for constructing kernel matrices whose structure can take advantage of our framework. For example, the MEKA algorithm orders the data in clusters, resulting in a block-diagonal kernel.

Acknowledgments

This research was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 723991 - CRYOMATH).

Appendix A. Proof of Proposition 1

Let $1 \le i \le m$. Ignoring the $O(\|E\|_2^2)$ term, we split (5) into the known and unknown terms, resulting in
$$\left(v_i + \sum_{\substack{k=1 \\ k \ne i}}^{m}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k\right) + \sum_{k=m+1}^{n}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k. \qquad (18)$$
As the second term in (18) is unknown, we approximate it by replacing the unknown eigenvalues with a parameter $\mu$, resulting in
$$\sum_{k=m+1}^{n}\frac{\langle Ev_i, v_k\rangle}{t_i - \mu}v_k = \frac{1}{t_i - \mu}\sum_{k=m+1}^{n}\langle Ev_i, v_k\rangle v_k = \frac{1}{t_i - \mu}\left(Ev_i - V^{(m)}V^{(m)\top}Ev_i\right) = \frac{1}{t_i - \mu}r_i, \qquad (19)$$
where $r_i$ is defined in (7). Formula (8) follows by replacing the second term in (18) with the rightmost term in (19). We denote the approximation error introduced into the approximation (8) by $e_i$, that is,
$$e_i = \sum_{k=m+1}^{n}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k - \frac{1}{t_i - \mu}r_i.$$
By using the identity
$$\frac{1}{t_i - t_k} = \frac{1}{t_i - \mu} + \frac{t_k - \mu}{(t_i - t_k)(t_i - \mu)},$$
we get
$$e_i = \sum_{k=m+1}^{n}\frac{t_k - \mu}{(t_i - t_k)(t_i - \mu)}\langle Ev_i, v_k\rangle v_k.$$
By the triangle inequality and the Cauchy-Schwarz inequality we get
$$\|e_i\|_2 \le \sum_{k=m+1}^{n}\frac{|t_k - \mu|}{|t_i - t_k||t_i - \mu|}\,\big|\langle Ev_i, v_k\rangle\big| \le \frac{\|E\|_2}{|t_i - t_{m+1}||t_i - \mu|}\sum_{k=m+1}^{n}|t_k - \mu|.$$
Recalling that the original perturbation approximation (5) induces an error of $O(\|E\|_2^2)$ concludes the proof.

Appendix B. Proof of Proposition 2

Let $1 \le i \le m$. Ignoring the $O(\|E\|_2^2)$ term, we split (5) into the known and unknown terms, resulting in
$$\left(v_i + \sum_{\substack{k=1 \\ k \ne i}}^{m}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k\right) + \sum_{k=m+1}^{n}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k. \qquad (20)$$
To obtain (10), we expand the unknown (second) term in (20) by using the identity
$$\frac{1}{t_i - t_k} = \frac{1}{t_i - \mu} + \frac{t_k - \mu}{(t_i - \mu)^2} + \frac{(t_k - \mu)^2}{(t_i - t_k)(t_i - \mu)^2}. \qquad (21)$$
We note that
$$\sum_{k=m+1}^{n}\langle Ev_i, v_k\rangle (t_k - \mu)v_k = \sum_{k=m+1}^{n}\langle Ev_i, v_k\rangle t_k v_k - \mu\sum_{k=m+1}^{n}\langle Ev_i, v_k\rangle v_k \qquad (22)$$
$$= \sum_{k=m+1}^{n}\langle Ev_i, v_k\rangle A'v_k - \mu\sum_{k=m+1}^{n}\langle Ev_i, v_k\rangle v_k \qquad (23)$$
$$= A'r_i - \mu r_i, \qquad (24)$$
where $r_i$ is defined in (7). Using (24) and (19), we get that the unknown term in (20) can be written as
$$\sum_{k=m+1}^{n}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k = \frac{1}{t_i - \mu}r_i - \frac{\mu}{(t_i - \mu)^2}r_i + \frac{1}{(t_i - \mu)^2}A'r_i + \sum_{k=m+1}^{n}\frac{(t_k - \mu)^2}{(t_i - t_k)(t_i - \mu)^2}\langle Ev_i, v_k\rangle v_k. \qquad (25)$$
The second-order formula (10) now follows by discarding the last term in (25). Denoting the approximation error of equation (10) by $e_i$, we have
$$e_i = \sum_{k=m+1}^{n}\frac{\langle Ev_i, v_k\rangle}{t_i - t_k}v_k - \left(\frac{1}{t_i - \mu}r_i - \frac{\mu}{(t_i - \mu)^2}r_i + \frac{1}{(t_i - \mu)^2}A'r_i\right).$$
Using (21) and the triangle and the Cauchy-Schwarz inequalities, we obtain
$$\|e_i\|_2 = \left\|\sum_{k=m+1}^{n}\frac{(t_k - \mu)^2}{(t_i - t_k)(t_i - \mu)^2}\langle Ev_i, v_k\rangle v_k\right\|_2 \le \frac{\|E\|_2}{|t_i - t_{m+1}||t_i - \mu|^2}\sum_{k=m+1}^{n}|t_k - \mu|^2.$$
Recalling that the original perturbation approximation (5) induces an error of $O(\|E\|_2^2)$ concludes the proof.

Appendix C. Runtime and space complexity

In this section, we discuss the runtime and space complexities of formulas (8) and (10). We start with the first-order formula (8). Its space complexity is $O(mn)$, since it needs to store in memory the $m$ leading eigenvectors of $A'$. As for its runtime, the computation of $r_i$ of (7) involves the calculation of $Ev_i$, which requires $O(\mathrm{nnz}(E))$ operations. The result is then multiplied by $V^{(m)\top}$, which requires $O(mn)$ operations, and then by $V^{(m)}$, which also requires $O(mn)$ operations. Thus, the total runtime complexity for computing all $\{r_i\}_{i=1}^{m}$ is $O(m\,\mathrm{nnz}(E) + m^2 n)$ operations. The first-order formula also requires computing $O(m^2)$ terms of the form $\langle Ev_i, v_k\rangle v_k$ for $1 \le i \ne k \le m$. Each such term requires $O(\mathrm{nnz}(E) + n)$ operations, resulting in a total of $O(m^2\,\mathrm{nnz}(E) + m^2 n)$ operations for all eigenvectors. We conclude that evaluating the first-order formula requires a total of $O(m^2\,\mathrm{nnz}(E) + m^2 n)$ operations.
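The cost structure above can be seen directly in code. The following is a small sketch (ours, not the authors' implementation) that computes all the residuals $r_i = (I - V^{(m)}V^{(m)\top})Ev_i$ at once for a sparse $E$: the product $EV^{(m)}$ costs $O(m\,\mathrm{nnz}(E))$ and the projection costs $O(m^2 n)$, matching the operation counts stated above.

```python
import numpy as np
from scipy.sparse import random as sprandom

rng = np.random.default_rng(0)
n, m = 2000, 10
V, _ = np.linalg.qr(rng.standard_normal((n, m)))   # known top-m eigenvectors
E = sprandom(n, n, density=0.01, random_state=0, format="csr")
E = E + E.T                                        # symmetric sparse perturbation

EV = E @ V                 # O(m * nnz(E)) operations
R = EV - V @ (V.T @ EV)    # O(m^2 n) operations; column i of R is r_i
# by construction each r_i is orthogonal to the span of V^(m)
print(np.max(np.abs(V.T @ R)))
```

The printed value is at round-off level, confirming that $R$ lies in the unknown eigenspace, which is what the $\mu$-substitution in (19) exploits.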
The analysis of the second-order formula (10) is similar, except that it also requires storing $A'$, which takes $O(\mathrm{nnz}(A'))$ memory, and computing $A'r_i$, which requires an additional $O(m\,\mathrm{nnz}(A'))$ operations. We conclude that evaluating the second-order formula requires a total of $O(m^2\,\mathrm{nnz}(E) + m\,\mathrm{nnz}(A') + m^2 n)$ operations.

Appendix D. Proof of Proposition 3

Let $i \in \{1, \ldots, m\}$. Since $\mu = \delta$, if $A'r_i = \delta r_i$ then the last two terms in (10) cancel out and (10) reduces to the first-order formula (8). Thus, in order to prove that $w^{(1)}_i = w^{(2)}_i$, it is sufficient to prove that under the settings of the proposition $A'r_i = \delta r_i$. Indeed,
$$A'r_i = A'\left(I - V^{(m)}V^{(m)\top}\right)Ev_i = \left(V^{(m)}TV^{(m)\top} + \delta I\right)\left(I - V^{(m)}V^{(m)\top}\right)Ev_i$$
$$= \left(V^{(m)}TV^{(m)\top} - V^{(m)}TV^{(m)\top} + \delta I - \delta V^{(m)}V^{(m)\top}\right)Ev_i = \delta\left(I - V^{(m)}V^{(m)\top}\right)Ev_i = \delta r_i.$$
For the error, we note that the $n - m$ unknown eigenvalues of a matrix $A'$ of the form $A' = V^{(m)}TV^{(m)\top} + \delta I$ are exactly $\delta$, and thus, when choosing $\mu = \delta$, the first term in (9) and (11) vanishes and we are left with only the $O(\|E\|_2^2)$ term.

Appendix E. Proof of Proposition 5

Let $i \in \{1, \ldots, m\}$. By (13), for $\mu = 0$,
$$\tilde{u}_i = u^s_i + \sum_{\substack{k=1 \\ k \ne i}}^{m}\frac{\langle (K - K^s)u^s_i, u^s_k\rangle}{\lambda^s_i - \lambda^s_k}u^s_k + \frac{1}{\lambda^s_i}\left(I - U^{s(m)}U^{s(m)\top}\right)(K - K^s)u^s_i, \qquad (26)$$
and by (15),
$$\tilde{\lambda}_i = \lambda^s_i + u^{s\top}_i(K - K^s)u^s_i. \qquad (27)$$
We first prove that $\hat{u}_i = \sqrt{\frac{m}{n}}\,\tilde{u}_i$ (see (17)). We start by simplifying (26) based on the specific choice of $K^s$. We make the following observations. First, we note that for all $i = 1, \ldots, m$, the last $n - m$ entries of $u^s_i$ are 0, and the top left $m \times m$ submatrix of $K - K^s$ is 0. This implies that the first $m$ entries of $(K - K^s)u^s_i$ are 0, and so $\langle (K - K^s)u^s_i, u^s_k\rangle = 0$ for all $1 \le i, k \le m$. A direct consequence of the latter is that $U^{s(m)}U^{s(m)\top}(K - K^s)u^s_i = 0$ for all $1 \le i \le m$. Thus, (26) reduces to
$$\tilde{u}_i = u^s_i + \frac{1}{\lambda^s_i}(K - K^s)u^s_i. \qquad (28)$$
Next, we note that the first term in (28), $u^s_i$, is non-zero only on its first $m$ entries, whereas the second term, $\frac{1}{\lambda^s_i}(K - K^s)u^s_i$, is non-zero only on its last $n - m$ entries.
This means that the first $m$ entries of $\tilde{u}_i$ are exactly the first $m$ entries of $u^s_i$, and the last $n - m$ entries of $\tilde{u}_i$ are equal to the last $n - m$ entries of $\frac{1}{\lambda^s_i}(K - K^s)u^s_i$. We start by proving the equivalence between the first $m$ entries of $\tilde{u}_i$ and $\hat{u}_i$. Let $1 \le p \le m$. Denote by $A_p$ the $p$-th row of a matrix $A$, and by $C$ the $n \times m$ matrix consisting of the first $m$ columns of $K$. We note that for the Nyström method (see (1)),
$$\hat{u}_{i,p} = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda'_i}C_p u'_i = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda'_i}K'_p u'_i = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda'_i}\lambda'_i u'_{i,p} = \sqrt{\frac{m}{n}}\,u'_{i,p}.$$
Thus, the first $m$ entries of the approximated vector in the Nyström method are merely a re-scaling of the vector $u'_i$ by $\sqrt{\frac{m}{n}}$. Since $K^s$ is equal to $K'$ padded with zeros, the first $m$ entries of $u^s_i$ are exactly $u'_i$. Thus, we conclude that the first $m$ entries of $\hat{u}_i$ and $\sqrt{\frac{m}{n}}\,\tilde{u}_i$ are identical. Next, we prove the equivalence between the last $n - m$ entries of $\tilde{u}_i$ and $\hat{u}_i$. Let $m + 1 \le p \le n$. For the Nyström method we have (see (1))
$$\hat{u}_{i,p} = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda'_i}C_p u'_i, \qquad (29)$$
and for the perturbation approximation, by using (28), and since the last $n - m$ entries of $u^s_i$ are 0,
$$\tilde{u}_{i,p} = \frac{1}{\lambda^s_i}\left((K - K^s)u^s_i\right)_p = \frac{1}{\lambda^s_i}\left(K_p u^s_i - 0\right) = \frac{1}{\lambda^s_i}C_p u'_i, \qquad (30)$$
where the last equality follows since (as explained above) the first $m$ entries of $u^s_i$ are exactly $u'_i$. Finally, since $\lambda'_i = \lambda^s_i$ for $1 \le i \le m$, we conclude by (29) and (30) that $\hat{u}_{i,p} = \sqrt{\frac{m}{n}}\,\tilde{u}_{i,p}$ for $m + 1 \le p \le n$, that is, the last $n - m$ entries of $\hat{u}_i$ and $\sqrt{\frac{m}{n}}\,\tilde{u}_i$ are also identical. We now prove the equivalence of the eigenvalues. By the same arguments as above, we note that for all $1 \le i \le m$, $u^{s\top}_i(K - K^s)u^s_i = 0$. Thus, by (27) we have $\tilde{\lambda}_i = \lambda^s_i$, and consequently, using (2), $\hat{\lambda}_i = \frac{n}{m}\lambda^s_i = \frac{n}{m}\tilde{\lambda}_i$, as required.

Appendix F. Proof of Proposition 8

Denote by $K^s_{\mathrm{shift}} \in \mathbb{R}^{n \times n}$ the top left $m \times m$ submatrix of $K_{\mathrm{shift}}$ (see (4)) padded with zeros. By the equivalence of the Nyström method and the perturbation approximation proved in Proposition 5, the spectral shifted Nyström method is equivalent to the perturbation approximation of $K^s_{\mathrm{shift}}$ using $\mu = 0$.
Thus, it suffices to prove that the perturbation approximation of the eigenpairs of $K^s_{\mathrm{shift}}$ to the eigenpairs of $K_{\mathrm{shift}}$ with $\mu = 0$ equals the perturbation approximation of the eigenpairs of $K^s$ to the eigenpairs of $K$ with $\mu = \delta$. Let the top $m$ eigenpairs of $K^s$ be $\{(\lambda^s_i, u^s_i)\}_{i=1}^{m}$. It follows that the top $m$ eigenpairs of $K^s_{\mathrm{shift}}$ are $\{(\lambda^s_i - \delta, u^s_i)\}_{i=1}^{m}$. Let $1 \le i \le m$. By (28), the perturbation approximation (13) of the eigenvectors of $K^s_{\mathrm{shift}}$ to eigenvectors of $K_{\mathrm{shift}}$ with $\mu = 0$ reduces to
$$\tilde{u}_i = u^s_i + \frac{1}{(\lambda^s_i - \delta) - 0}\left(K_{\mathrm{shift}} - K^s_{\mathrm{shift}}\right)u^s_i,$$
whereas the perturbation approximation of the eigenvectors of $K^s$ to eigenvectors of $K$ with $\mu = \delta$ reduces to
$$\hat{u}_i = u^s_i + \frac{1}{\lambda^s_i - \delta}(K - K^s)u^s_i.$$
But since the last $n - m$ entries of $u^s_i$ are 0, and the entries of $K_{\mathrm{shift}} - K^s_{\mathrm{shift}}$ are equal to those of $K - K^s$ except for the last $n - m$ diagonal elements, we have that
$$\left(K_{\mathrm{shift}} - K^s_{\mathrm{shift}}\right)u^s_i = (K - K^s)u^s_i, \qquad (31)$$
and we conclude that $\tilde{u}_i = \hat{u}_i$. For the eigenvalues, the perturbation approximation of the eigenvalues of $K^s_{\mathrm{shift}}$ to the eigenvalues of $K_{\mathrm{shift}}$ (see (15)) yields
$$\tilde{\lambda}_i = (\lambda^s_i - \delta) + u^{s\top}_i\left(K_{\mathrm{shift}} - K^s_{\mathrm{shift}}\right)u^s_i.$$
We note that by (15), (31), and the eigenvector part of this proof, if $\{(\tau_i, v_i)\}_{i=1}^{m}$ are the approximated eigenpairs of a matrix $A$, then $\{(\tau_i + \delta, v_i)\}_{i=1}^{m}$ are the approximated eigenpairs of the matrix $A_{\mathrm{shift}} = A + \delta I$. Thus, in order to recover the approximation of the eigenvalues of $K$, we need to shift the approximated eigenvalues $\{\tilde{\lambda}_i\}_{i=1}^{m}$ back by $\delta$, yielding
$$\tilde{\lambda}'_i = \lambda^s_i + u^{s\top}_i\left(K_{\mathrm{shift}} - K^s_{\mathrm{shift}}\right)u^s_i.$$
On the other hand, the perturbation approximation of $K^s$ yields
$$\hat{\lambda}_i = \lambda^s_i + u^{s\top}_i(K - K^s)u^s_i.$$
By (31), we have that $\tilde{\lambda}'_i = \hat{\lambda}_i$.

Appendix G. Performance of various error metrics

In this section, we provide the results of Section 7.2.2 for the MNIST and wine datasets for three different error measures, demonstrating that the performance of our approximation scheme is qualitatively similar in various reasonable error metrics.
The results are presented in Figures 7 and 8 for the MNIST and wine datasets, respectively.

Figure 7: Approximation error for the MNIST dataset in various error metrics: (a) reconstruction error, (b) subspace projection error, (c) principal angle.

Figure 8: Approximation error for the wine dataset in various error metrics: (a) reconstruction error, (b) subspace projection error, (c) principal angle.

References

Mukund Balasubramanian, Eric L. Schwartz, Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. The Isomap algorithm and topological stability. Science, 295(5552):7-7, 2002.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

Matthew Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20-30, 2006.

Andreas Buja, Deborah F. Swayne, Michael L. Littman, Nathaniel Dean, Heike Hofmann, and Lisha Chen. Data visualization with multidimensional scaling. Journal of Computational and Graphical Statistics, 17(2):444-472, 2008.

James R. Bunch, Christopher P. Nielsen, and Danny C. Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31-48, 1978.

Frederick W. Byron and Robert W. Fuller. Mathematics of Classical and Quantum Physics. Courier Corporation, 2012.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In International Conference on Machine Learning, pages 567-575. PMLR, 2013.

Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580-2594, 2011a.
Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011b.

Niall Hurley and Scott Rickard. Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10):4723-4741, 2009.

Andrew V. Knyazev and Peizhen Zhu. Principal angles between subspaces and their tangents. arXiv preprint arXiv:1209.0523, 2012.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Ensemble Nyström method. In Advances in Neural Information Processing Systems, pages 1060-1068, 2009.

Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13(Apr):981-1006, 2012.

Amy N. Langville and Carl D. Meyer. Updating Markov chains with an eye on Google's PageRank. SIAM Journal on Matrix Analysis and Applications, 27(4):968-987, 2006.

Mu Li, Wei Bi, James T. Kwok, and Bao-Liang Lu. Large-scale Nyström kernel matrix approximation using randomized SVD. IEEE Transactions on Neural Networks and Learning Systems, 26(1):152-164, 2014.

Roy Mitz, Nir Sharon, and Yoel Shkolnisky. Symmetric rank-one updates from partial spectrum with an application to out-of-sample extension. SIAM Journal on Matrix Analysis and Applications, 40(3):973-997, 2019.

Andrew Y. Ng, Michael I. Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.

Hyung Seon Oh and Zhe Hu. Multiple-rank modification of symmetric eigenvalue problem. MethodsX, 5:103-117, 2018.

Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. 1998.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. Memory efficient kernel approximation. Journal of Machine Learning Research, 18(1):682-713, 2017.

Gilbert W. Stewart. Matrix Perturbation Theory. 1990.

Shiliang Sun, Jing Zhao, and Jiang Zhu. A review of Nyström methods for large-scale machine learning. Information Fusion, 26:36-48, 2015.

Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.

Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. Journal of Machine Learning Research, 14(1):2729-2769, 2013.

Shusen Wang and Zhihua Zhang. Efficient algorithms and error analysis for the modified Nyström method. In Artificial Intelligence and Statistics, pages 996-1004. PMLR, 2014.

Shusen Wang, Chao Zhang, Hui Qian, and Zhihua Zhang. Improving the modified Nyström method using spectral shifting. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 611-620, 2014.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682-688, 2001.